Stats

Generates correlation reports between data sets.

gretel_synthetics.utils.stats.calculate_correlation(df: DataFrame, nominal_columns: List[str] | None = None, job_count: int = 4, opt: bool = False) → DataFrame

Given a dataframe, calculate a matrix of the correlations between its columns. The calculate_pearsons_r, calculate_correlation_ratio, and calculate_theils_u functions fill in the matrix values.

Parameters:
  • df – The input dataframe.

  • nominal_columns – Columns to treat as categorical.

  • job_count – Number of parallel jobs to use for the computation.

  • opt – “optimized.” If True, take the faster (but slightly less accurate) route of globally replacing missing values with 0.

Returns:

A dataframe of correlation values.
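Which of the three statistics fills a given cell depends on the types of the two columns involved. A minimal sketch of that dispatch (the helper name is hypothetical; the actual implementation parallelizes the per-cell work across job_count processes):

```python
def pick_statistic(x_is_nominal: bool, y_is_nominal: bool) -> str:
    """Choose the association measure for one cell of the matrix."""
    if x_is_nominal and y_is_nominal:
        return "theils_u"           # categorical vs. categorical
    if not x_is_nominal and not y_is_nominal:
        return "pearsons_r"         # numeric vs. numeric
    return "correlation_ratio"      # categorical vs. numeric
```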

gretel_synthetics.utils.stats.calculate_correlation_ratio(x, y, opt)

Calculates the correlation ratio for categorical-continuous association. Used in constructing the correlation matrix. See http://shakedzy.xyz/dython/modules/nominal/#correlation_ratio.

Parameters:
  • x – first input array, categorical.

  • y – second input array, numeric.

  • opt – “optimized.” If False, drop entries where y (the numeric column) is null/NaN.

Returns:

A float in the range [0, 1].
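The correlation ratio is the square root of the ratio of between-category variance to total variance. A pure-Python sketch of the statistic (illustrative only; the library's version additionally handles missing values per the opt flag):

```python
from collections import defaultdict

def correlation_ratio(categories, values):
    """Correlation ratio (eta) for a categorical/numeric pair.
    Returns 1.0 when the category fully determines the value,
    0.0 when category means are all equal."""
    groups = defaultdict(list)
    for c, v in zip(categories, values):
        groups[c].append(v)
    n = len(values)
    grand_mean = sum(values) / n
    # between-group sum of squares: how far each category mean
    # sits from the grand mean, weighted by group size
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    return 0.0 if ss_total == 0 else (ss_between / ss_total) ** 0.5
```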

gretel_synthetics.utils.stats.calculate_pearsons_r(x, y, opt) → Tuple[float, float]

Calculate the Pearson correlation coefficient for a pair of fields when constructing the correlation matrix. See https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html.

Parameters:
  • x – first input array.

  • y – second input array.

  • opt – “optimized.” If False, drop entries where either the x or y value is null/NaN. If True, NaNs have already been replaced with 0s across the entire dataset.

Returns:

As per scipy, a tuple of the Pearson correlation coefficient and the two-tailed p-value.
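The coefficient itself is the covariance of the two arrays divided by the product of their standard deviations. A pure-Python sketch of just the coefficient (scipy.stats.pearsonr additionally returns the two-tailed p-value):

```python
import math

def pearsons_r(x, y):
    """Pearson correlation coefficient in [-1, 1] for two
    equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```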

gretel_synthetics.utils.stats.calculate_theils_u(x, y)

Calculates Theil’s U statistic (uncertainty coefficient) for categorical-categorical association. Used in constructing the correlation matrix. See http://shakedzy.xyz/dython/modules/nominal/#theils_u.

Parameters:
  • x – first input array, categorical.

  • y – second input array, categorical.

Returns:

A float in the range [0, 1].

gretel_synthetics.utils.stats.compute_distribution_distance(d1: dict, d2: dict) float

Calculates the Jensen Shannon distance between two distributions.

Parameters:
  • d1 – Distribution dict. Values must be a probability vector (all values are floats in [0,1], sum of all values is 1.0).

  • d2 – Another distribution dict.

Returns:

The distance between the two vectors, range in [0, 1].

Return type:

float

gretel_synthetics.utils.stats.compute_pca(df: DataFrame, n_components: int = 2) DataFrame

Do PCA on a dataframe. See https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html.

Parameters:
  • df – The dataframe to analyze for principal components.

  • n_components – Number of components to keep.

Returns:

Dataframe of principal components.

gretel_synthetics.utils.stats.count_memorized_lines(df1: DataFrame, df2: DataFrame) int

Checks for overlap between training and synthesized data.

Parameters:
  • df1 – DataFrame of training data.

  • df2 – DataFrame of synthetic data.

Returns:

int, the number of overlapping elements.

gretel_synthetics.utils.stats.get_categorical_field_distribution(field: Series) dict

Calculates the normalized distribution of a categorical field.

Parameters:

field – A sanitized column extracted from one of the df’s.

Returns:

keys are the unique values in the field, values are percentages (floats in [0, 100]).

Return type:

dict

gretel_synthetics.utils.stats.get_numeric_distribution_bins(training: Series, synthetic: Series)

To calculate the distribution distance between two numeric series a la categorical fields we need to bin the data. We want the same bins between both series, based on scrubbed data.

Parameters:
  • training – The numeric series from the training dataframe.

  • synthetic – The numeric series from the synthetic dataframe.

Returns:

bin_edges, numpy array of dtype float

gretel_synthetics.utils.stats.get_numeric_field_distribution(field: Series, bins) dict

Calculates the normalized distribution of a numeric field cut into bins.

Parameters:
  • field – A sanitized column extracted from one of the df’s.

  • bins – Usually an np.ndarray from get_bins, but can be anything that can be safely passed to pandas.cut.

Returns:

keys are the unique values in the field, values are floats in [0, 1].

Return type:

dict

gretel_synthetics.utils.stats.normalize_dataset(df: DataFrame) DataFrame

Prep a dataframe for PCA. Divide the dataframe into numeric and categorical, fill missing values and encode categorical columns by the frequency of each value and standardize all values.

Parameters:

df – The dataframe to be subjected to PCA.

Returns:

The dataframe, normalized.