Header Clusters

This module clusters DataFrame headers into groups of similar columns, based on correlations between the columns.

gretel_synthetics.utils.header_clusters.cluster(df: pandas.core.frame.DataFrame, header_prefix: List[str] = None, maxsize: int = 20, method: str = 'single', numeric_cat: List[str] = None, plot=False) → List[List[str]]

Given an input dataframe, extract clusters of similar headers based on a set of heuristics.

  • df – The dataframe to cluster headers from.

  • header_prefix – List of columns to remove before cluster generation.

  • maxsize – The max number of header clusters to generate from the input dataframe.

  • method – Linkage method used to compute header cluster distances. For more information, see the SciPy documentation for scipy.cluster.hierarchy.linkage: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy-cluster-hierarchy-linkage.

  • numeric_cat – A list of fields to treat as categorical. The header clustering code automatically treats pandas “object” and “category” columns as categorical; use numeric_cat to mark additional fields that would not be identified as categorical automatically.

  • plot – Plot header list as a dendrogram.
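
The general idea behind the method parameter and the List[List[str]] return value can be illustrated with a minimal sketch. It is not the gretel_synthetics implementation, whose heuristics are not spelled out here. It assumes that similarity is the absolute Pearson correlation between numeric columns only. The helper name sketch_header_clusters, its n_clusters parameter, and the sample column names are all hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def sketch_header_clusters(df, n_clusters=2, method="single"):
    """Group numeric column names by absolute Pearson correlation.

    Illustrative only: this is NOT the gretel_synthetics implementation.
    """
    numeric = df.select_dtypes("number")
    corr = numeric.corr().abs().to_numpy()
    # Turn similarity into distance: strongly correlated columns are close.
    dist = np.clip(1.0 - corr, 0.0, None)
    np.fill_diagonal(dist, 0.0)
    # linkage expects a condensed (upper-triangle) distance vector.
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method=method)
    # Cut the tree so that at most n_clusters groups remain.
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    groups = {}
    for column, label in zip(numeric.columns, labels):
        groups.setdefault(int(label), []).append(column)
    return sorted(groups.values())


# Synthetic data (assumption): two pairs of strongly related columns.
rng = np.random.default_rng(0)
base_a = rng.normal(size=200)
base_b = rng.normal(size=200)
df = pd.DataFrame(
    {
        "height": base_a,
        "weight": base_a * 2.0 + rng.normal(scale=0.1, size=200),
        "income": base_b,
        "spend": base_b * 0.5 + rng.normal(scale=0.1, size=200),
    }
)

clusters = sketch_header_clusters(df, n_clusters=2, method="single")
print(clusters)
```

Passing a different value accepted by scipy.cluster.hierarchy.linkage, such as 'average' or 'complete', changes how the distance between two groups of headers is measured and so may change the grouping. The linkage matrix built in the sketch can also be handed to scipy.cluster.hierarchy.dendrogram to draw a tree, similar in spirit to the plot option.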