Stats
Generates correlation reports between data sets.
-
gretel_synthetics.utils.stats.calculate_correlation(df: pandas.core.frame.DataFrame, nominal_columns: List[str] = None, job_count: int = 4, opt: bool = False) → pandas.core.frame.DataFrame
Given a dataframe, calculate a matrix of the correlations between its columns. The matrix values are filled in using calculate_pearsons_r, calculate_correlation_ratio and calculate_theils_u.
- Parameters
df – The input dataframe.
nominal_columns – Columns to treat as categorical.
job_count – For parallelization of computations.
opt – “optimized.” If True, take the faster but slightly less accurate route of globally replacing missing values with 0.
- Returns
A dataframe of correlation values.
-
gretel_synthetics.utils.stats.calculate_correlation_ratio(x, y, opt)
Calculates the Correlation Ratio for categorical-continuous association. Used in constructing the correlation matrix. See http://shakedzy.xyz/dython/modules/nominal/#correlation_ratio.
- Parameters
x – first input array, categorical.
y – second input array, numeric.
opt – “optimized.” If False, drop the pairs where y (the numeric column) is null/NaN.
- Returns
float in the range of [0,1].
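As an illustration of what the correlation ratio measures, here is a minimal stdlib-only sketch. It is not the library's implementation: the function name `correlation_ratio` is hypothetical, and it always drops pairs with a missing numeric value (the `opt=False` behavior described above).

```python
import math
from collections import defaultdict


def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numeric sequence."""
    # Drop pairs whose numeric value is missing (None or NaN).
    pairs = [
        (c, v) for c, v in zip(categories, values)
        if v is not None and not math.isnan(v)
    ]
    if not pairs:
        return 0.0

    # Group the numeric values by category.
    groups = defaultdict(list)
    for c, v in pairs:
        groups[c].append(v)

    overall_mean = sum(v for _, v in pairs) / len(pairs)
    # Between-group sum of squares, weighted by group size.
    between = sum(
        len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups.values()
    )
    # Total sum of squares around the overall mean.
    total = sum((v - overall_mean) ** 2 for _, v in pairs)
    # Returning 0.0 for a constant numeric column is a choice made for this sketch.
    return math.sqrt(between / total) if total > 0 else 0.0
```

A value near 1 means the category almost fully determines the numeric value; a value near 0 means the group means are all close to the overall mean.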
-
gretel_synthetics.utils.stats.calculate_pearsons_r(x, y, opt) → Tuple[float, float]
Calculates the Pearson correlation coefficient for a pair of columns. Used in constructing the correlation matrix. See https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html.
- Parameters
x – first input array.
y – second input array.
opt – “optimized.” If False, drop the pairs where either the x or y value is null/NaN. If True, NaNs have already been replaced with 0s across the entire dataset.
- Returns
As per scipy, a tuple of Pearson’s correlation coefficient and the two-tailed p-value.
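To show how the two `opt` modes can differ, here is a small stdlib-only sketch. It is an assumption-laden illustration, not the library's code: `pearson_r` is a hypothetical name, it computes only the coefficient (no p-value), and under `opt=True` it assumes the caller has already replaced missing values with 0.

```python
import math


def _is_missing(v):
    """True for None or a float NaN."""
    return v is None or (isinstance(v, float) and math.isnan(v))


def pearson_r(x, y, opt=False):
    """Pearson correlation coefficient of two equal-length sequences."""
    if opt:
        # Assumption: missing values were already replaced with 0 upstream.
        pairs = list(zip(x, y))
    else:
        # Drop a pair when either side is missing.
        pairs = [
            (a, b) for a, b in zip(x, y)
            if not (_is_missing(a) or _is_missing(b))
        ]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")

    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(var_x * var_y)
```

With `x = [1, 2, 3, NaN]` and `y = [2, 4, 6, 8]`, dropping the incomplete pair leaves a perfectly linear relationship, while zero-filling the NaN first adds the point (0, 8) and changes the coefficient considerably. This is the accuracy trade-off the `opt` flag refers to.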
-
gretel_synthetics.utils.stats.calculate_theils_u(x, y)
Calculates Theil’s U statistic (Uncertainty coefficient) for categorical-categorical association. Used in constructing the correlation matrix. See http://shakedzy.xyz/dython/modules/nominal/#theils_u.
- Parameters
x – first input array, categorical.
y – second input array, categorical.
- Returns
float in the range of [0,1].
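The uncertainty coefficient can be written as U(x|y) = (H(x) - H(x|y)) / H(x), where H is the Shannon entropy. The stdlib-only sketch below follows that formula; the name `theils_u` and the choice to return 1.0 when x has zero entropy are assumptions made for the illustration, not a statement about the library's code.

```python
import math
from collections import Counter


def theils_u(x, y):
    """Uncertainty coefficient U(x|y): share of x's entropy explained by y."""
    n = len(x)
    x_counts = Counter(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))

    # Entropy of x.
    h_x = -sum((c / n) * math.log(c / n) for c in x_counts.values())
    if h_x == 0:
        # x is constant, so there is no uncertainty left to explain.
        return 1.0

    # Conditional entropy H(x|y) = -sum p(x, y) * log(p(x, y) / p(y)).
    h_x_given_y = -sum(
        (c / n) * math.log((c / n) / (y_counts[y_val] / n))
        for (_, y_val), c in xy_counts.items()
    )
    return (h_x - h_x_given_y) / h_x
```

Unlike Pearson's r, this measure is not symmetric: U(x|y) and U(y|x) generally differ, so both directions may need to be computed when filling a matrix.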
-
gretel_synthetics.utils.stats.compute_distribution_distance(d1: dict, d2: dict) → float
Calculates the Jensen-Shannon distance between two distributions.
- Parameters
d1 – Distribution dict. Values must be a probability vector (all values are floats in [0,1], sum of all values is 1.0).
d2 – Another distribution dict.
- Returns
The distance between the two distributions, in the range [0, 1].
- Return type
float
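For reference, the Jensen-Shannon distance is the square root of the Jensen-Shannon divergence. The sketch below is a stdlib-only illustration under two assumptions that are not taken from the library: keys present in only one dict are treated as probability 0 in the other, and logarithms are base 2, which is what bounds the result by 1. The name `js_distance` is hypothetical.

```python
import math


def js_distance(d1, d2):
    """Jensen-Shannon distance (base 2) between two distribution dicts."""
    # Align both distributions on the union of their keys.
    keys = sorted(set(d1) | set(d2), key=str)
    p = [d1.get(k, 0.0) for k in keys]
    q = [d2.get(k, 0.0) for k in keys]
    # Midpoint distribution.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute 0.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(divergence, 0.0))
```

Identical distributions give 0, and distributions with no keys in common give 1.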
-
gretel_synthetics.utils.stats.compute_pca(df: pandas.core.frame.DataFrame, n_components: int = 2) → pandas.core.frame.DataFrame
Do PCA on a dataframe. See https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html.
- Parameters
df – The dataframe to analyze for principal components.
n_components – Number of components to keep.
- Returns
Dataframe of principal components.
-
gretel_synthetics.utils.stats.count_memorized_lines(df1: pandas.core.frame.DataFrame, df2: pandas.core.frame.DataFrame) → int
Checks for overlap between training and synthesized data.
- Parameters
df1 – DataFrame of training data.
df2 – DataFrame of synthetic data.
- Returns
int, the number of overlapping elements.
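One way to picture the overlap check is as a set intersection over whole rows. The sketch below works on plain row tuples rather than DataFrames and counts each distinct shared row once; both of those choices, and the name `count_overlapping_rows`, are assumptions made for the illustration and may not match how the library counts.

```python
def count_overlapping_rows(training_rows, synthetic_rows):
    """Count distinct rows that appear in both collections."""
    # Rows must be hashable, so convert each one to a tuple.
    training_set = {tuple(row) for row in training_rows}
    synthetic_set = {tuple(row) for row in synthetic_rows}
    return len(training_set & synthetic_set)
```

For DataFrames, rows could be obtained with `df.itertuples(index=False)` before passing them in.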
-
gretel_synthetics.utils.stats.get_categorical_field_distribution(field: pandas.core.series.Series) → dict
Calculates the normalized distribution of a categorical field.
- Parameters
field – A sanitized column extracted from one of the df’s.
- Returns
keys are the unique values in the field, values are percentages (floats in [0, 100]).
- Return type
dict
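The shape of the returned dict can be sketched with `collections.Counter`. This is a minimal illustration under the description above (unique values as keys, percentages as values); the name `categorical_distribution` is hypothetical and a plain list stands in for the Series.

```python
from collections import Counter


def categorical_distribution(values):
    """Map each unique value to its share of the field, as a percentage."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        # An empty field has no distribution to report.
        return {}
    return {value: 100.0 * count / total for value, count in counts.items()}
```

For example, a field containing `"a", "a", "b", "c"` maps `"a"` to 50.0 and both `"b"` and `"c"` to 25.0.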
-
gretel_synthetics.utils.stats.get_numeric_distribution_bins(training: pandas.core.series.Series, synthetic: pandas.core.series.Series)
To calculate the distribution distance between two numeric series in the same way as for categorical fields, the data must first be binned. Both series share the same bins, which are computed from the scrubbed data.
- Parameters
training – The numeric series from the training dataframe.
synthetic – The numeric series from the synthetic dataframe.
- Returns
bin_edges, numpy array of dtype float
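The point of sharing bins is that both series are measured against the same edges, so their binned distributions are directly comparable. The sketch below computes equal-width edges over the combined, cleaned values. The equal-width strategy, the `n_bins` parameter and the name `shared_bin_edges` are all assumptions for illustration; the library may choose its edges differently.

```python
import math


def shared_bin_edges(training, synthetic, n_bins=10):
    """Equal-width bin edges covering the combined range of both series."""
    # "Scrub" the data: ignore missing values (None or NaN) in either series.
    clean = [
        v for v in list(training) + list(synthetic)
        if v is not None and not math.isnan(v)
    ]
    if not clean:
        raise ValueError("no non-missing values to bin")

    low, high = min(clean), max(clean)
    if low == high:
        # Widen a degenerate range so that the bins have nonzero width.
        low, high = low - 0.5, high + 0.5

    width = (high - low) / n_bins
    # n_bins bins need n_bins + 1 edges; use `high` itself as the last edge.
    return [low + i * width for i in range(n_bins)] + [high]
```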
-
gretel_synthetics.utils.stats.get_numeric_field_distribution(field: pandas.core.series.Series, bins) → dict
Calculates the normalized distribution of a numeric field cut into bins.
- Parameters
field – A sanitized column extracted from one of the df’s.
bins – Usually an np.ndarray from get_numeric_distribution_bins, but can be anything that can be safely passed to pandas.cut.
- Returns
keys are the unique values in the field, values are floats in [0, 1].
- Return type
dict
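To make the binning step concrete, here is a stdlib-only sketch that mimics right-closed intervals with the lowest edge included, similar in spirit to `pandas.cut(..., include_lowest=True)`. The name `numeric_distribution`, the tuple keys and the choice to normalize by the number of binned values are assumptions for the illustration.

```python
import bisect
import math


def numeric_distribution(values, edges):
    """Share of values falling in each bin (edges[i], edges[i + 1]]."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        # Skip missing values and values outside the bin range.
        if v is None or math.isnan(v) or v < edges[0] or v > edges[-1]:
            continue
        # bisect_left gives the index of the first edge >= v, so a value in
        # (edges[i], edges[i + 1]] lands in bin i. The lowest edge itself
        # would give -1, so clamp it into the first bin.
        index = max(bisect.bisect_left(edges, v) - 1, 0)
        counts[index] += 1

    total = sum(counts)
    if total == 0:
        return {}
    return {
        (edges[i], edges[i + 1]): counts[i] / total
        for i in range(len(counts))
    }
```

With edges `[0, 5, 10]`, the values `0, 1, 5` fall in the first bin and `6, 10` in the second, giving shares of 0.6 and 0.4.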
-
gretel_synthetics.utils.stats.normalize_dataset(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame
Prep a dataframe for PCA. Divide the dataframe into numeric and categorical columns, fill missing values, encode categorical columns by the frequency of each value, and standardize all values.
- Parameters
df – The dataframe to be subjected to PCA.
- Returns
The dataframe, normalized.
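Two of the steps described above, frequency encoding and standardization, can be sketched on plain lists. This is an illustration only: the function names are hypothetical, the population standard deviation is an assumed choice, and missing-value filling is left out because the fill strategy is not specified here.

```python
import statistics
from collections import Counter


def frequency_encode(column):
    """Replace each categorical value with its relative frequency."""
    counts = Counter(column)
    n = len(column)
    return [counts[value] / n for value in column]


def standardize(column):
    """Rescale a numeric column to zero mean and unit variance."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation (assumed)
    if std == 0:
        # A constant column carries no variance; map it to zeros.
        return [0.0] * len(column)
    return [(value - mean) / std for value in column]
```

A categorical column would typically be passed through `frequency_encode` first and then through `standardize`, so that every column ends up on a comparable scale before PCA.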