Stats
Generates correlation reports between data sets.
- gretel_synthetics.utils.stats.calculate_correlation(df: DataFrame, nominal_columns: List[str] | None = None, job_count: int = 4, opt: bool = False) DataFrame
Given a dataframe, calculate a matrix of the correlations between the various columns. Uses calculate_pearsons_r, calculate_correlation_ratio, and calculate_theils_u to fill in the matrix values.
- Parameters:
df – The input dataframe.
nominal_columns – Columns to treat as categorical.
job_count – Number of parallel jobs to use for the computation.
opt – “optimized.” If True, take the faster (but slightly less accurate) route of globally replacing missing values with 0.
- Returns:
A dataframe of correlation values.
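The pairwise dispatch described above can be sketched as follows. This is a hypothetical `correlation_matrix` helper, not the library's implementation; only the numeric-numeric branch is filled in here (via `scipy.stats.pearsonr`), with comments marking where the categorical branches would call the other two measures:

```python
import numpy as np
import pandas as pd
from scipy import stats

def correlation_matrix(df, nominal_columns=()):
    # Build a symmetric column-by-column association matrix. Numeric-numeric
    # pairs use Pearson's r; categorical pairs would use the correlation
    # ratio or Theil's U (omitted in this sketch).
    cols = df.columns
    out = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in nominal_columns or b in nominal_columns:
                continue  # would call calculate_correlation_ratio / calculate_theils_u
            r, _p = stats.pearsonr(df[a], df[b])
            out.loc[a, b] = out.loc[b, a] = r
    return out
```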
- gretel_synthetics.utils.stats.calculate_correlation_ratio(x, y, opt)
Calculates the Correlation Ratio for categorical-continuous association. Used in constructing correlation matrix. See http://shakedzy.xyz/dython/modules/nominal/#correlation_ratio.
- Parameters:
x – first input array, categorical.
y – second input array, numeric.
opt – “optimized.” If False, drop pairs where y (the numeric column) is null/NaN.
- Returns:
float in the range of [0,1].
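The statistic itself is simple to sketch with NumPy. This is a from-scratch illustration of the formula (the square root of the between-category sum of squares over the total sum of squares), not the library's code, and the `correlation_ratio` name is hypothetical:

```python
import numpy as np

def correlation_ratio(categories, values):
    # eta = sqrt(SS_between / SS_total): the fraction of variance in the
    # numeric values explained by category membership, in [0, 1].
    categories = np.asarray(categories)
    values = np.asarray(values, dtype=float)
    overall_mean = values.mean()
    ss_between = sum(
        (categories == c).sum() * (values[categories == c].mean() - overall_mean) ** 2
        for c in np.unique(categories)
    )
    ss_total = ((values - overall_mean) ** 2).sum()
    return 0.0 if ss_total == 0 else float(np.sqrt(ss_between / ss_total))
```

When the category fully determines the value, the ratio is 1; when the per-category means all equal the overall mean, it is 0.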
- gretel_synthetics.utils.stats.calculate_pearsons_r(x, y, opt) Tuple[float, float]
Calculate the Pearson correlation coefficient for this pair of fields in our correlation matrix. See https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html.
- Parameters:
x – first input array.
y – second input array.
opt – “optimized.” If False, drop pairs where either the x or y value is null/NaN. If True, NaNs have already been replaced with 0s across the entire datafile.
- Returns:
As per scipy, a tuple of Pearson’s correlation coefficient and the two-tailed p-value.
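The non-optimized path can be illustrated with `scipy.stats.pearsonr` directly. This is a sketch of the pairwise NaN-dropping behavior described above, not the library's code, and `pearsons_r` here is a hypothetical stand-in:

```python
import numpy as np
from scipy import stats

def pearsons_r(x, y):
    # Drop any pair where either value is NaN, then compute Pearson's r
    # and the two-tailed p-value on the remaining pairs.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = ~(np.isnan(x) | np.isnan(y))
    return stats.pearsonr(x[mask], y[mask])
```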
- gretel_synthetics.utils.stats.calculate_theils_u(x, y)
Calculates Theil’s U statistic (Uncertainty coefficient) for categorical-categorical association. Used in constructing correlation matrix. See http://shakedzy.xyz/dython/modules/nominal/#theils_u.
- Parameters:
x – first input array, categorical.
y – second input array, categorical.
- Returns:
float in the range of [0,1].
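Theil's U is the reduction in the entropy of x gained by knowing y, normalized by the entropy of x. The following is a from-scratch sketch of that formula (the `theils_u` and `conditional_entropy` names are hypothetical, not the library's implementation):

```python
import math
from collections import Counter

def conditional_entropy(x, y):
    # H(X|Y): remaining uncertainty about x once y is known.
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    total = len(x)
    h = 0.0
    for (xv, yv), c in xy_counts.items():
        p_xy = c / total
        p_y = y_counts[yv] / total
        h -= p_xy * math.log(p_xy / p_y)
    return h

def theils_u(x, y):
    # U(X|Y) = (H(X) - H(X|Y)) / H(X), in [0, 1]. Asymmetric: U(X|Y) need
    # not equal U(Y|X).
    total = len(x)
    h_x = -sum((c / total) * math.log(c / total) for c in Counter(x).values())
    if h_x == 0:
        return 1.0  # x is constant, so y tells us everything about it
    return (h_x - conditional_entropy(x, y)) / h_x
```

If y fully determines x the statistic is 1; if the two are independent it is 0.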
- gretel_synthetics.utils.stats.compute_distribution_distance(d1: dict, d2: dict) float
Calculates the Jensen-Shannon distance between two distributions.
- Parameters:
d1 – Distribution dict. Values must be a probability vector (all values are floats in [0,1], sum of all values is 1.0).
d2 – Another distribution dict.
- Returns:
The distance between the two distributions, in the range [0, 1].
- Return type:
float
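A minimal sketch of the computation using `scipy.spatial.distance.jensenshannon`, not the library's code. Aligning the two dicts over the union of their keys (missing keys taken as probability 0) is an assumption, and base 2 is used so the distance is bounded by 1:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def distribution_distance(d1, d2):
    # Align both probability dicts over the union of their keys, then take
    # the Jensen-Shannon distance in base 2, which lies in [0, 1].
    keys = sorted(set(d1) | set(d2))
    p = np.array([d1.get(k, 0.0) for k in keys])
    q = np.array([d2.get(k, 0.0) for k in keys])
    return float(jensenshannon(p, q, base=2))
```

Identical distributions give 0; distributions with disjoint support give 1.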
- gretel_synthetics.utils.stats.compute_pca(df: DataFrame, n_components: int = 2) DataFrame
Performs PCA on a dataframe. See https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html.
- Parameters:
df – The dataframe to analyze for principal components.
n_components – Number of components to keep.
- Returns:
Dataframe of principal components.
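A sketch of this step with `sklearn.decomposition.PCA` (a hypothetical `pca_components` helper; the library's column naming may differ):

```python
import pandas as pd
from sklearn.decomposition import PCA

def pca_components(df, n_components=2):
    # Project the (already-normalized) dataframe onto its top principal
    # axes and return the projected coordinates as a dataframe.
    pca = PCA(n_components=n_components)
    return pd.DataFrame(
        pca.fit_transform(df),
        columns=[f"pc{i + 1}" for i in range(n_components)],
        index=df.index,
    )
```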
- gretel_synthetics.utils.stats.count_memorized_lines(df1: DataFrame, df2: DataFrame) int
Checks for overlap between training and synthesized data.
- Parameters:
df1 – DataFrame of training data.
df2 – DataFrame of synthetic data.
- Returns:
int, the number of overlapping elements.
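One way to sketch the overlap check is to compare rows as tuples. This is an assumption about the comparison; the library may instead compare raw serialized lines:

```python
import pandas as pd

def count_overlap(train, synth):
    # Count distinct synthetic rows that also appear verbatim in the
    # training data.
    train_rows = set(map(tuple, train.itertuples(index=False)))
    synth_rows = set(map(tuple, synth.itertuples(index=False)))
    return len(train_rows & synth_rows)
```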
- gretel_synthetics.utils.stats.get_categorical_field_distribution(field: Series) dict
Calculates the normalized distribution of a categorical field.
- Parameters:
field – A sanitized column extracted from one of the dataframes.
- Returns:
keys are the unique values in the field, values are percentages (floats in [0, 100]).
- Return type:
dict
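This distribution is essentially a normalized value count scaled to percentages. A one-line sketch (the `categorical_distribution` name is hypothetical):

```python
import pandas as pd

def categorical_distribution(field: pd.Series) -> dict:
    # Percent of rows taking each unique value: floats in [0, 100] that
    # sum to 100.
    return (field.value_counts(normalize=True) * 100.0).to_dict()
```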
- gretel_synthetics.utils.stats.get_numeric_distribution_bins(training: Series, synthetic: Series)
To calculate the distribution distance between two numeric series in the same way as for categorical fields, we need to bin the data. We use the same bins for both series, based on the scrubbed data.
- Parameters:
training – The numeric series from the training dataframe.
synthetic – The numeric series from the synthetic dataframe.
- Returns:
bin_edges, numpy array of dtype float
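A sketch of deriving shared edges with `numpy.histogram_bin_edges` over the combined, NaN-free data (the `shared_bin_edges` name and the `bins` parameter default are assumptions, not the library's signature):

```python
import numpy as np
import pandas as pd

def shared_bin_edges(training, synthetic, bins=10):
    # Derive one common set of bin edges from the combined, NaN-free data
    # so that both series are binned identically.
    combined = pd.concat([training, synthetic]).dropna()
    return np.histogram_bin_edges(combined, bins=bins)
```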
- gretel_synthetics.utils.stats.get_numeric_field_distribution(field: Series, bins) dict
Calculates the normalized distribution of a numeric field cut into bins.
- Parameters:
field – A sanitized column extracted from one of the dataframes.
bins – Usually an np.ndarray from get_bins, but can be anything that can be safely passed to pandas.cut.
- Returns:
keys are the bins the field was cut into, values are floats in [0, 1].
- Return type:
dict
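A sketch of the cut-and-normalize step with `pandas.cut` (the `numeric_distribution` name and the string-keyed return are assumptions; the library may key the dict differently):

```python
import numpy as np
import pandas as pd

def numeric_distribution(field, bins):
    # Cut the series into the given bins, then normalize the per-bin
    # counts so the values sum to 1.
    binned = pd.cut(field, bins=bins, include_lowest=True)
    counts = binned.value_counts(normalize=True)
    return {str(interval): float(frac) for interval, frac in counts.items()}
```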
- gretel_synthetics.utils.stats.normalize_dataset(df: DataFrame) DataFrame
Preps a dataframe for PCA: splits the columns into numeric and categorical, fills missing values, encodes categorical columns by the frequency of each value, and standardizes all values.
- Parameters:
df – The dataframe to be subjected to PCA.
- Returns:
The dataframe, normalized.
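The steps above can be sketched in plain pandas. This is an illustration under stated assumptions (median fill for numeric NaNs, population standard deviation for scaling), not the library's implementation:

```python
import numpy as np
import pandas as pd

def normalize_for_pca(df):
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Fill numeric gaps; the median is an assumption here.
            out[col] = out[col].fillna(out[col].median())
        else:
            # Frequency-encode: replace each category with its relative
            # frequency so the column becomes numeric.
            out[col] = out[col].map(out[col].value_counts(normalize=True))
    # Standardize each column to zero mean and unit variance; constant
    # columns keep a divisor of 1 to avoid division by zero.
    std = out.std(ddof=0).replace(0, 1.0)
    return (out - out.mean()) / std
```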