# Stats¶

Generates correlation reports between data sets.

`gretel_synthetics.utils.stats.``calculate_correlation`(df: pandas.core.frame.DataFrame, nominal_columns: List[str] = None, job_count: int = 4, opt: bool = False) → pandas.core.frame.DataFrame

Given a dataframe, calculate a matrix of the correlations between its fields. The calculate_pearsons_r, calculate_correlation_ratio and calculate_theils_u functions are used to fill in the matrix values.

Parameters
• df – The input dataframe.

• nominal_columns – Columns to treat as categorical.

• job_count – Number of parallel jobs used for the computations.

• opt – “optimized.” If True, take the faster (but slightly less accurate) route of globally replacing missing values with 0.

Returns

A dataframe of correlation values.
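For the numeric-numeric cells of such a matrix, Pearson's r is the measure used. A minimal sketch of how those cells could be filled (toy column names; illustrative, not the library's internals):

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "age": [23, 45, 31, 52, 36],
    "income": [40, 88, 61, 95, 70],
})

# Start from an identity matrix (every column correlates perfectly
# with itself), then fill the off-diagonal numeric-numeric cells.
corr = pd.DataFrame(np.eye(2), index=df.columns, columns=df.columns)
for a in df.columns:
    for b in df.columns:
        if a != b:
            corr.loc[a, b], _ = stats.pearsonr(df[a], df[b])
```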

`gretel_synthetics.utils.stats.``calculate_correlation_ratio`(x, y, opt)

Calculates the correlation ratio for categorical-continuous association. Used in constructing the correlation matrix. See http://shakedzy.xyz/dython/modules/nominal/#correlation_ratio.

Parameters
• x – first input array, categorical.

• y – second input array, numeric.

• opt – “optimized.” If False, drop rows where y (the numeric column) is null/NaN.

Returns

float in the range of [0,1].
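A sketch of the correlation ratio itself, following the dython formula linked above (an illustrative re-implementation, not the library's code):

```python
import numpy as np

def correlation_ratio(categories, values):
    """Correlation ratio (eta) for a categorical x / numeric y pair."""
    values = np.asarray(values, dtype=float)
    categories = np.asarray(categories)
    overall_mean = values.mean()
    # Between-group sum of squares: how far each category's mean
    # sits from the overall mean, weighted by category size.
    ss_between = sum(
        (categories == cat).sum()
        * (values[categories == cat].mean() - overall_mean) ** 2
        for cat in np.unique(categories)
    )
    ss_total = ((values - overall_mean) ** 2).sum()
    return float(np.sqrt(ss_between / ss_total)) if ss_total else 0.0
```

When y is fully determined by x the ratio is 1; when the category means coincide it is 0.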

`gretel_synthetics.utils.stats.``calculate_pearsons_r`(x, y, opt) → Tuple[float, float]

Calculate the Pearson correlation coefficient for this pair of fields in the correlation matrix. See https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html.

Parameters
• x – first input array.

• y – second input array.

• opt – “optimized.” If False, drop pairs where either the x or y value is null/NaN. If True, NaNs have already been replaced with 0s across the entire datafile.

Returns

As per scipy, a tuple of Pearson’s correlation coefficient and the two-tailed p-value.
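The opt=False path described above amounts to pairwise deletion of missing values before calling scipy; a small sketch:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, np.nan])

# opt=False route: keep only pairs where both values are present.
mask = ~(np.isnan(x) | np.isnan(y))
r, p = stats.pearsonr(x[mask], y[mask])
```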

`gretel_synthetics.utils.stats.``calculate_theils_u`(x, y)

Calculates Theil’s U statistic (uncertainty coefficient) for categorical-categorical association. Used in constructing the correlation matrix. See http://shakedzy.xyz/dython/modules/nominal/#theils_u.

Parameters
• x – first input array, categorical.

• y – second input array, categorical.

Returns

float in the range of [0,1].
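An illustrative implementation of the uncertainty coefficient U(x|y) = (H(x) − H(x|y)) / H(x), following the dython reference above (not the library's exact code):

```python
import math
from collections import Counter

def entropy(seq):
    n = len(seq)
    return -sum(c / n * math.log(c / n) for c in Counter(seq).values())

def theils_u(x, y):
    """How much does knowing y reduce uncertainty about x?"""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0
    n = len(x)
    # Conditional entropy H(x|y), weighted by P(y).
    h_x_given_y = sum(
        cnt / n * entropy([xi for xi, yi in zip(x, y) if yi == y_val])
        for y_val, cnt in Counter(y).items()
    )
    return (h_x - h_x_given_y) / h_x
```

Note that U(x|y) is asymmetric, so for categorical pairs the correlation matrix is not symmetric across the diagonal.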

`gretel_synthetics.utils.stats.``compute_distribution_distance`(d1: dict, d2: dict) → float

Calculates the Jensen Shannon distance between two distributions.

Parameters
• d1 – Distribution dict. Values must be a probability vector (all values are floats in [0,1], sum of all values is 1.0).

• d2 – Another distribution dict.

Returns

The distance between the two vectors, range in [0, 1].

Return type

float
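A sketch of the same computation with scipy, assuming the two dicts may cover different key sets and so must be aligned first (hypothetical helper name):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def distribution_distance(d1, d2):
    """Align the two dicts on a shared key set, then take the
    Jensen-Shannon distance (base 2 keeps the range in [0, 1])."""
    keys = sorted(set(d1) | set(d2))
    p = np.array([d1.get(k, 0.0) for k in keys])
    q = np.array([d2.get(k, 0.0) for k in keys])
    return float(jensenshannon(p, q, base=2.0))
```

Identical distributions give 0; completely disjoint ones give 1.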

`gretel_synthetics.utils.stats.``compute_pca`(df: pandas.core.frame.DataFrame, n_components: int = 2) → pandas.core.frame.DataFrame

Do PCA on a dataframe. See https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html.

Parameters
• df – The dataframe to analyze for principal components.

• n_components – Number of components to keep.

Returns

Dataframe of principal components.
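A minimal sketch with scikit-learn on toy data (in the real pipeline the input would first pass through normalize_dataset):

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.1, 3.9, 6.2, 8.1],
    "c": [0.5, 0.4, 0.6, 0.5],
})

# Project the data down to two principal components.
components = pd.DataFrame(PCA(n_components=2).fit_transform(df))
```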

`gretel_synthetics.utils.stats.``count_memorized_lines`(df1: pandas.core.frame.DataFrame, df2: pandas.core.frame.DataFrame) → int

Checks for overlap between training and synthesized data.

Parameters
• df1 – DataFrame of training data.

• df2 – DataFrame of synthetic data.

Returns

int, the number of overlapping lines.
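One plausible way to count such overlap is to intersect the sets of row tuples from the two frames (illustrative, not necessarily the library's approach):

```python
import pandas as pd

train = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
synth = pd.DataFrame({"a": [3, 4], "b": ["z", "w"]})

# A synthesized row counts as memorized if the identical row
# appears in the training data.
memorized = len(
    set(map(tuple, train.itertuples(index=False)))
    & set(map(tuple, synth.itertuples(index=False)))
)
```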

`gretel_synthetics.utils.stats.``get_categorical_field_distribution`(field: pandas.core.series.Series) → dict

Calculates the normalized distribution of a categorical field.

Parameters

field – A sanitized column extracted from one of the df’s.

Returns

keys are the unique values in the field, values are percentages (floats in [0, 100]).

Return type

dict
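With pandas this normalization is nearly a one-liner; note the values are percentages in [0, 100], as documented above:

```python
import pandas as pd

field = pd.Series(["cat", "dog", "cat", "cat"])

# Normalized value counts, scaled to percentages.
dist = (field.value_counts(normalize=True) * 100).to_dict()
```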

`gretel_synthetics.utils.stats.``get_numeric_distribution_bins`(training: pandas.core.series.Series, synthetic: pandas.core.series.Series)

To calculate the distribution distance between two numeric series in the same way as for categorical fields, the data must first be binned. The same bin edges, derived from the scrubbed data, are used for both series.

Parameters
• training – The numeric series from the training dataframe.

• synthetic – The numeric series from the synthetic dataframe.

Returns

bin_edges, numpy array of dtype float
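One way to derive shared edges is to pool both series before computing the bins (a sketch, not necessarily the library's method):

```python
import numpy as np
import pandas as pd

training = pd.Series([1.0, 2.0, 3.0, 4.0])
synthetic = pd.Series([2.5, 3.5, 5.0])

# Compute edges over the pooled values so both series are later
# cut against the same intervals.
bin_edges = np.histogram_bin_edges(
    pd.concat([training, synthetic]), bins=10
)
```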

`gretel_synthetics.utils.stats.``get_numeric_field_distribution`(field: pandas.core.series.Series, bins) → dict

Calculates the normalized distribution of a numeric field cut into bins.

Parameters
• field – A sanitized column extracted from one of the df’s.

• bins – Usually an np.ndarray from get_numeric_distribution_bins, but can be anything that can be safely passed to pandas.cut.

Returns

keys are the unique values in the field, values are floats in [0, 1].

Return type

dict
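A sketch of cutting a numeric field into bins and normalizing the counts, assuming edges like those produced by get_numeric_distribution_bins:

```python
import numpy as np
import pandas as pd

field = pd.Series([1.0, 1.5, 2.5, 3.5])
bins = np.array([1.0, 2.0, 3.0, 4.0])

# include_lowest keeps the minimum value inside the first interval;
# normalize turns bin counts into fractions summing to 1.
binned = pd.cut(field, bins, include_lowest=True)
dist = binned.value_counts(normalize=True).to_dict()
```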

`gretel_synthetics.utils.stats.``normalize_dataset`(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Prep a dataframe for PCA: divide it into numeric and categorical columns, fill missing values, encode each categorical column by the frequency of its values, and standardize all values.

Parameters

df – The dataframe to be subjected to PCA.

Returns

The dataframe, normalized.
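The steps above can be sketched as follows; the fill values and the z-score standardization here are illustrative choices, since the exact ones are implementation details:

```python
import pandas as pd

df = pd.DataFrame({
    "height": [1.6, 1.8, None, 1.7],
    "color": ["red", "red", "blue", None],
})

# Split by dtype, filling numeric gaps with the column mean and
# categorical gaps with a sentinel value.
num = df.select_dtypes("number")
num = num.fillna(num.mean())
cat = df.select_dtypes(exclude="number").fillna("missing")

# Frequency-encode each categorical column, then z-score everything.
encoded = num.copy()
for col in cat:
    encoded[col] = cat[col].map(cat[col].value_counts(normalize=True))
normalized = (encoded - encoded.mean()) / encoded.std()
```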