ACTGAN
The ACTGAN sub-package contains an alternate implementation of the SDV CTGAN model. It adds automatic detection of datetime fields and optional binary encoding of discrete columns to reduce memory usage.
Please see the “ACTGAN_Demo” Notebook in the “examples” directory in the repository root.
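To make the memory argument concrete, the sketch below compares how many encoded columns a single discrete column needs under one-hot encoding versus binary encoding. It is an illustration only: the helper name `encoded_width`, the `>=` comparison against the cutoff, and the exact bit count are assumptions rather than the library's implementation. The default of 500 mirrors the `binary_encoder_cutoff` parameter documented below.

```python
def encoded_width(n_unique: int, binary_encoder_cutoff: int = 500) -> int:
    """Approximate number of encoded columns for one discrete column.

    Hypothetical helper for illustration; not part of gretel_synthetics.
    """
    if n_unique < 1:
        raise ValueError("n_unique must be at least 1")
    if n_unique >= binary_encoder_cutoff:  # assumed comparison
        # Binary encoding: enough bits to give each unique value its own code.
        return max(1, (n_unique - 1).bit_length())
    # One-hot encoding: one column per unique value.
    return n_unique


for n in (10, 499, 500, 100_000):
    print(n, encoded_width(n))
```

Under these assumptions, a column with 100,000 unique values needs 17 binary columns instead of 100,000 one-hot columns. A real encoder may reserve extra codes (for example, for unseen values), so actual widths can differ slightly.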
Wrapper around ACTGAN model.
- class gretel_synthetics.actgan.actgan_wrapper.ACTGAN(field_names: List[str] | None = None, field_types: Dict[str, dict] | None = None, field_transformers: Dict[str, BaseTransformer | str] | None = None, auto_transform_datetimes: bool = False, anonymize_fields: Dict[str, str] | None = None, primary_key: str | None = None, constraints: List[Constraint] | List[dict] | None = None, table_metadata: Metadata | dict | None = None, embedding_dim: int = 128, generator_dim: Sequence[int] = (256, 256), discriminator_dim: Sequence[int] = (256, 256), generator_lr: float = 0.0002, generator_decay: float = 1e-06, discriminator_lr: float = 0.0002, discriminator_decay: float = 1e-06, batch_size: int = 500, discriminator_steps: int = 1, binary_encoder_cutoff: int = 500, binary_encoder_nan_handler: str | None = None, cbn_sample_size: int | None = 250000, log_frequency: bool = True, verbose: bool = False, epochs: int = 300, epoch_callback: Callable[[EpochInfo], None] | None = None, pac: int = 10, cuda: bool = True, learn_rounding_scheme: bool = True, enforce_min_max_values: bool = True, conditional_vector_type: ConditionalVectorType = ConditionalVectorType.SINGLE_DISCRETE, conditional_select_mean_columns: float | None = None, conditional_select_column_prob: float | None = None, reconstruction_loss_coef: float = 1.0, force_conditioning: bool = False)
- Parameters:
field_names – List of names of the fields that need to be modeled and included in the generated output data. Any additional fields found in the data will be ignored and will not be included in the generated output. If None, all the fields found in the data are used.
field_types – Dictionary specifying the data types and subtypes of the fields that will be modeled. Field type and subtype combinations must be compatible with the SDV Metadata Schema.
field_transformers – Dictionary specifying which transformers to use for each field. Available transformers are:
FloatFormatter: Uses a FloatFormatter for numerical data.
FrequencyEncoder: Uses a FrequencyEncoder without Gaussian noise.
FrequencyEncoder_noised: Uses a FrequencyEncoder adding Gaussian noise.
OneHotEncoder: Uses a OneHotEncoder.
LabelEncoder: Uses a LabelEncoder without Gaussian noise.
LabelEncoder_noised: Uses a LabelEncoder adding Gaussian noise.
BinaryEncoder: Uses a BinaryEncoder.
UnixTimestampEncoder: Uses a UnixTimestampEncoder.
NOTE: Specifically for ACTGAN, some attributes such as auto_transform_datetimes will automatically attempt to detect field types and will automatically set the field_transformers dictionary at construction time. However, autodetection will not overwrite any concrete field_types or field_transformers values that were provided to this constructor.
auto_transform_datetimes – If set, prior to fitting, each column will be checked for being a potential “datetime” type. For each column that is discovered as a “datetime”, the field_types and field_transformers SDV metadata dicts will be automatically updated such that datetimes are transformed to Unix timestamps. NOTE: if fields are already specified in field_types or field_transformers, these fields will be skipped by the auto detector.
anonymize_fields – Dict specifying which fields to anonymize and what faker category they belong to.
primary_key – Name of the field which is the primary key of the table.
constraints – List of Constraint objects or dicts.
table_metadata – Table metadata instance or dict representation. If given alongside any other metadata-related arguments, an exception will be raised. If not given at all, it will be built using the other arguments or learned from the data.
embedding_dim – Size of the random sample passed to the Generator. Defaults to 128.
generator_dim – Size of the output samples for each one of the Residuals. A Residual Layer will be created for each one of the values provided. Defaults to (256, 256).
discriminator_dim – Size of the output samples for each one of the Discriminator Layers. A Linear Layer will be created for each one of the values provided. Defaults to (256, 256).
generator_lr – Learning rate for the generator. Defaults to 2e-4.
generator_decay – Generator weight decay for the Adam Optimizer. Defaults to 1e-6.
discriminator_lr – Learning rate for the discriminator. Defaults to 2e-4.
discriminator_decay – Discriminator weight decay for the Adam Optimizer. Defaults to 1e-6.
batch_size – Number of data samples to process in each step.
discriminator_steps – Number of discriminator updates to do for each generator update. From the WGAN paper: https://arxiv.org/abs/1701.07875. WGAN paper default is 5. Default used is 1 to match original CTGAN implementation.
binary_encoder_cutoff – For any given column, the number of unique values that should exist before switching over to binary encoding instead of OHE. This will help reduce memory consumption for datasets with a lot of unique values.
binary_encoder_nan_handler – Binary encoding currently may produce errant NaN values during reverse transformation. By default these NaNs will be left in place; however, if this value is set to “mode”, those NaNs will be replaced by a random value that is a known mode for a given column.
cbn_sample_size – Number of rows to sample from each column for identifying clusters for the cluster-based normalizer. This only applies to float columns. If set to 0, no sampling is done and all values are considered, which may be very slow. Defaults to 250_000.
log_frequency – Whether to use log frequency of categorical levels in conditional sampling. Defaults to True.
verbose – Whether to have print statements for progress results. Defaults to False.
epochs – Number of training epochs. Defaults to 300.
epoch_callback – An optional function to call after each epoch. Its only argument will be an EpochInfo instance.
pac – Number of samples to group together when applying the discriminator. Defaults to 10.
cuda – If True, use CUDA. If a str, use the indicated device. If False, do not use CUDA at all. Defaults to True.
learn_rounding_scheme – Define rounding scheme for FloatFormatter. If True, the data returned by reverse_transform will be rounded to that place. Defaults to True.
enforce_min_max_values – Specify whether or not to clip the data returned by reverse_transform of the numerical transformer, FloatFormatter, to the min and max values seen during fit. Defaults to True.
conditional_vector_type – Type of conditional vector to include in input to the generator. Influences how effective and flexible the native conditional generation is. Options include SINGLE_DISCRETE (original CTGAN setup) and ANYWAY. Default is SINGLE_DISCRETE.
conditional_select_mean_columns – Target number of columns to select for conditioning on average during training. Only used for ANYWAY conditioning. Use if typical number of columns to seed on is known. If set, conditional_select_column_prob must be None. Equivalent to setting conditional_select_column_prob to conditional_select_mean_columns / # of columns. Defaults to None.
conditional_select_column_prob – Probability to select any given column to be conditioned on during training. Only used for ANYWAY conditioning. If set, conditional_select_mean_columns must be None. Defaults to None.
reconstruction_loss_coef – Multiplier on reconstruction loss. Higher values focus the generator optimization more on accurate conditional vector generation. Defaults to 1.0.
force_conditioning – Directly set the requested conditional generation columns in generated data. Will bypass rejection sampling and be faster, but may reduce quality of the generated data and correlations between conditioned columns and other variables may be weaker. Defaults to False.
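The relationship between conditional_select_mean_columns and conditional_select_column_prob described above can be sketched as a small helper. This is a minimal illustration of the documented rule (the two settings are mutually exclusive, and the mean-columns form is equivalent to dividing by the number of columns). The function name `resolve_column_prob` and the choice of `ValueError` are assumptions, not the library's code.

```python
from typing import Optional


def resolve_column_prob(
    n_columns: int,
    conditional_select_mean_columns: Optional[float] = None,
    conditional_select_column_prob: Optional[float] = None,
) -> Optional[float]:
    """Hypothetical helper: derive the per-column selection probability."""
    if (
        conditional_select_mean_columns is not None
        and conditional_select_column_prob is not None
    ):
        # The documentation states that only one of the two may be set.
        raise ValueError(
            "Set only one of conditional_select_mean_columns and "
            "conditional_select_column_prob."
        )
    if conditional_select_mean_columns is not None:
        # Documented equivalence: mean columns / number of columns.
        return conditional_select_mean_columns / n_columns
    # May be None, in which case the library's own default would apply.
    return conditional_select_column_prob


# Expecting to seed on about 4 of 20 columns gives a probability of 0.2.
print(resolve_column_prob(20, conditional_select_mean_columns=4))
```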
- fit(*args, **kwargs)
Fit the ACTGAN model to the provided data. Prior to fitting, specific auto-detection of data types will be done if the provided data is a DataFrame.
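As a rough picture of what datetime auto-detection before fitting might involve, the sketch below checks whether every value in a column parses as a datetime, converts such columns to Unix timestamps, and skips columns the user has already configured. It is a simplified stand-in using only the standard library: the function names, the ISO 8601-only parsing, and the treatment of naive datetimes as UTC are all assumptions, and the library's actual detection heuristics may differ.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional, Set


def to_unix_timestamps(values: List[str]) -> Optional[List[float]]:
    """Return Unix timestamps if every value parses as ISO 8601, else None."""
    if not values:
        return None
    timestamps = []
    for value in values:
        try:
            parsed = datetime.fromisoformat(value)
        except (TypeError, ValueError):
            return None  # at least one value is not a datetime
        if parsed.tzinfo is None:
            # Assumption for this sketch: treat naive datetimes as UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        timestamps.append(parsed.timestamp())
    return timestamps


def detect_datetime_columns(
    columns: Dict[str, List[str]], already_specified: Set[str]
) -> Dict[str, List[float]]:
    """Convert detected datetime columns, skipping user-specified fields."""
    detected = {}
    for name, values in columns.items():
        if name in already_specified:
            continue  # user-provided settings take precedence
        converted = to_unix_timestamps(values)
        if converted is not None:
            detected[name] = converted
    return detected


columns = {
    "signup": ["2021-01-01", "2021-01-02"],
    "name": ["alice", "bob"],
    "updated": ["2021-01-01", "2021-01-02"],
}
print(detect_datetime_columns(columns, already_specified={"updated"}))
```

Only `signup` is converted here: `name` does not parse as a datetime, and `updated` is skipped because it is listed as already specified.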
- sample(*args, **kwargs)
Sample rows from this table.
- Parameters:
num_rows (int) – Number of rows to sample. This parameter is required.
randomize_samples (bool) – Whether to randomize sampling. If False, a fixed seed is used. Defaults to True.
max_tries_per_batch (int) – Number of times to retry sampling until the batch size is met. Defaults to 100.
batch_size (int or None) – The batch size to sample. Defaults to num_rows, if None.
output_file_path (str or None) – The file to periodically write sampled rows to. If None, does not write rows anywhere.
conditions – Deprecated argument. Use the sample_conditions method with sdv.sampling.Condition objects instead.
- Returns:
Sampled data.
- Return type:
pandas.DataFrame
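How num_rows, batch_size, and max_tries_per_batch interact can be pictured with a generic bounded-retry rejection-sampling loop. The sketch below is not SDV's or ACTGAN's implementation: `sample_with_retries`, `draw_batch`, and `is_valid` are hypothetical names, and raising `ValueError` when a batch stays empty is an assumption made for the example.

```python
import random
from typing import Callable, List, Optional


def sample_with_retries(
    draw_batch: Callable[[int], List[dict]],
    is_valid: Callable[[dict], bool],
    num_rows: int,
    batch_size: Optional[int] = None,
    max_tries_per_batch: int = 100,
) -> List[dict]:
    """Collect num_rows valid rows, retrying each batch a bounded number of times."""
    # As documented for sample(), batch_size falls back to num_rows when None.
    batch_size = num_rows if batch_size is None else batch_size
    sampled: List[dict] = []
    while len(sampled) < num_rows:
        target = min(batch_size, num_rows - len(sampled))
        batch: List[dict] = []
        for _ in range(max_tries_per_batch):
            candidates = draw_batch(target - len(batch))
            # Keep only rows that pass the validity check; reject the rest.
            batch.extend(row for row in candidates if is_valid(row))
            if len(batch) >= target:
                break
        if not batch:
            raise ValueError("no rows could be generated")
        sampled.extend(batch)
    return sampled[:num_rows]


rng = random.Random(0)
rows = sample_with_retries(
    draw_batch=lambda n: [{"x": rng.randint(0, 9)} for _ in range(n)],
    is_valid=lambda row: row["x"] % 2 == 0,  # accept even values only
    num_rows=7,
    batch_size=3,
)
print(len(rows))
```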
- sample_remaining_columns(*args, **kwargs)
Sample rows from this table.
- Parameters:
known_columns (pandas.DataFrame) – A pandas.DataFrame with the columns that are already known. The output is a DataFrame such that each row in the output is sampled conditionally on the corresponding row in the input.
max_tries_per_batch (int) – Number of times to retry sampling until the batch size is met. Defaults to 100.
batch_size (int) – The batch size to use per sampling call.
randomize_samples (bool) – Whether to randomize sampling. If False, a fixed seed is used. Defaults to True.
output_file_path (str or None) – The file to periodically write sampled rows to. Defaults to a temporary file, if None.
- Returns:
Sampled data.
- Return type:
pandas.DataFrame
- Raises:
ConstraintsNotMetError – If the conditions are not valid for the given constraints.
ValueError – If any of the conditions’ columns are not valid, or if no rows could be generated.
Complex data structures for ACTGAN
- class gretel_synthetics.actgan.structures.ColumnIdInfo(discrete_column_id: 'int', column_id: 'int', value_id: 'np.ndarray')
- class gretel_synthetics.actgan.structures.ColumnTransformInfo(column_name: 'str', column_type: 'ColumnType', transform: 'BaseTransformer', encodings: 'List[ColumnEncoding]')
- class gretel_synthetics.actgan.structures.ColumnType(value)
An enumeration.
- class gretel_synthetics.actgan.structures.ConditionalVectorType(value)
An enumeration.
- class gretel_synthetics.actgan.structures.EpochInfo(epoch: int, loss_g: float, loss_d: float, loss_r: float)
When creating a model such as ACTGAN, if the epoch_callback attribute is set to a callable, then after each epoch the provided callable will be called with an instance of this class as the only argument.
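The callback contract can be illustrated without training a model. In the sketch below, the `EpochInfo` dataclass is a local stand-in that mirrors the documented fields, and `fake_training_loop` is a hypothetical placeholder for the real training loop. The loss values and zero-based epoch numbering are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EpochInfo:
    """Local stand-in mirroring the documented fields of EpochInfo."""

    epoch: int
    loss_g: float
    loss_d: float
    loss_r: float


class LossHistory:
    """Callable that records the EpochInfo passed to it after each epoch."""

    def __init__(self) -> None:
        self.records: List[EpochInfo] = []

    def __call__(self, info: EpochInfo) -> None:
        self.records.append(info)
        print(f"epoch={info.epoch} loss_g={info.loss_g:.3f} loss_d={info.loss_d:.3f}")


def fake_training_loop(
    epochs: int, epoch_callback: Optional[Callable[[EpochInfo], None]] = None
) -> None:
    """Hypothetical loop that invokes the callback once per epoch."""
    for epoch in range(epochs):
        # Placeholder losses; a real model would report its own values.
        info = EpochInfo(epoch=epoch, loss_g=1.0 / (epoch + 1), loss_d=0.5, loss_r=0.0)
        if epoch_callback is not None:
            epoch_callback(info)


history = LossHistory()
fake_training_loop(3, epoch_callback=history)
```

With the real library, the same callable would be passed as the epoch_callback argument of ACTGAN, and EpochInfo would come from gretel_synthetics.actgan.structures.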