Header Clusters

gretel_synthetics.utils.header_clusters.cluster(df: DataFrame, header_prefix: List[str] | None = None, maxsize: int = 20, average_record_length_threshold: float = 0, method: str = 'single', numeric_cat: List[str] | None = None, plot: bool = False, isolate_complex_field: bool = True) → List[List[str]]

Given an input dataframe, extract clusters of similar headers based on a set of heuristics.

Parameters:
  • df – The dataframe to cluster headers from.

  • header_prefix – List of columns to remove before cluster generation.

  • maxsize – The max number of fields in a cluster.

  • average_record_length_threshold – Threshold for how long a cluster’s records can be. The default, 0, turns off the average record length (arl) logic. To use arl, use a positive value. Based on our research we recommend setting this value to 250.0.

  • method – Linkage method used to compute header cluster distances. For more information, refer to the scipy docs: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy-cluster-hierarchy-linkage

  • numeric_cat – A list of fields to define as categorical. The header clustering code will automatically define pandas “object” and “category” columns as categorical. The numeric_cat parameter may be used to define additional categorical fields that may not automatically get identified as such.

  • plot – Plot header list as a dendrogram.

  • isolate_complex_field – Enables isolation of complex fields when clustering.

Returns:

A list of lists of column names, each column name list being an identified cluster.
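
As a rough illustration of working with the return value, the sketch below summarizes a list of header clusters using only the standard library. The helper name summarize_clusters and the sample column names are hypothetical and not part of gretel_synthetics. The sample list is hand-written; a real one would come from a call such as cluster(df, maxsize=3).

```python
from typing import Dict, List


def summarize_clusters(clusters: List[List[str]], maxsize: int = 20) -> Dict[str, object]:
    """Summarize a list of header clusters (a list of lists of column names).

    Hypothetical helper for illustration only; it is not part of gretel_synthetics.
    """
    sizes = [len(names) for names in clusters]

    # Track column names that appear in more than one place.
    seen = set()
    duplicates = set()
    for names in clusters:
        for name in names:
            if name in seen:
                duplicates.add(name)
            seen.add(name)

    return {
        "n_clusters": len(clusters),
        "sizes": sizes,
        # Indices of clusters holding more fields than the maxsize passed in.
        "oversized": [i for i, n in enumerate(sizes) if n > maxsize],
        "duplicates": sorted(duplicates),
    }


# Hand-written stand-in for the return value of cluster(df, maxsize=3).
example_clusters = [["age", "income", "zip"], ["first_name", "last_name"], ["notes"]]
summary = summarize_clusters(example_clusters, maxsize=3)
print(summary)
```

A summary like this can serve as a quick sanity check before each cluster is used to select a subset of columns, for example with df[names] for each names in the returned list.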