Timeseries DGAN

The Timeseries DGAN module contains a PyTorch implementation of the DoppelGANger model; see https://arxiv.org/abs/1909.13403 for a detailed description of the model.

import numpy as np
from gretel_synthetics.timeseries_dgan.dgan import DGAN
from gretel_synthetics.timeseries_dgan.config import DGANConfig

attributes = np.random.rand(10000, 3)
features = np.random.rand(10000, 20, 2)

config = DGANConfig(
    max_sequence_len=20,
    sample_len=5,
    batch_size=1000,
    epochs=10
)

model = DGAN(config)

model.train_numpy(attributes=attributes, features=features)

synthetic_attributes, synthetic_features = model.generate_numpy(1000)

class gretel_synthetics.timeseries_dgan.config.DGANConfig(max_sequence_len: int, sample_len: int, attribute_noise_dim: int = 10, feature_noise_dim: int = 10, attribute_num_layers: int = 3, attribute_num_units: int = 100, feature_num_layers: int = 1, feature_num_units: int = 100, use_attribute_discriminator: bool = True, normalization: Normalization = Normalization.ZERO_ONE, apply_feature_scaling: bool = True, apply_example_scaling: bool = True, binary_encoder_cutoff: int = 150, forget_bias: bool = False, gradient_penalty_coef: float = 10.0, attribute_gradient_penalty_coef: float = 10.0, attribute_loss_coef: float = 1.0, generator_learning_rate: float = 0.001, generator_beta1: float = 0.5, discriminator_learning_rate: float = 0.001, discriminator_beta1: float = 0.5, attribute_discriminator_learning_rate: float = 0.001, attribute_discriminator_beta1: float = 0.5, batch_size: int = 1024, epochs: int = 400, discriminator_rounds: int = 1, generator_rounds: int = 1, cuda: bool = True, mixed_precision_training: bool = False)

Config object with parameters for training a DGAN model.

Parameters:
  • max_sequence_len – length of time series sequences; variable length sequences are not supported, so all training and generated data will have the same sequence length

  • sample_len – time series steps to generate from each LSTM cell in DGAN, must be a divisor of max_sequence_len

  • attribute_noise_dim – length of the GAN noise vectors for attribute generation

  • feature_noise_dim – length of GAN noise vectors for feature generation

  • attribute_num_layers – # of layers in the GAN discriminator network

  • attribute_num_units – # of units per layer in the GAN discriminator network

  • feature_num_layers – # of LSTM layers in the GAN generator network

  • feature_num_units – # of units per layer in the GAN generator network

  • use_attribute_discriminator – use a separate discriminator only on attributes; helps DGAN match attribute distributions. Default: True

  • normalization – default normalization for continuous variables, used when metadata output is not specified during DGAN initialization

  • apply_feature_scaling – scale each continuous variable to [0,1] or [-1,1] (based on the normalization param) before training, and rescale to the original range during generation; if False, training data must already be within range and DGAN will only generate values in [0,1] or [-1,1]. Default: True

  • apply_example_scaling – compute midpoint and half-range (equivalent to min/max) for each time series variable and include these as additional attributes that are generated; this provides better support for time series with highly variable ranges, e.g., in network data, a dial-up connection has bandwidth usage in [1kb, 10kb], while a fiber connection is in [100mb, 1gb]. Default: True

  • binary_encoder_cutoff – use binary encoder (instead of one hot encoder) for any column with more than this many unique values. This helps reduce memory consumption for datasets with a lot of unique values.

  • forget_bias – initialize forget gate bias parameters to 1 in LSTM layers; when True, initialization matches the tf1 LSTMCell behavior, otherwise the default PyTorch initialization is used. Default: False

  • gradient_penalty_coef – coefficient for gradient penalty in Wasserstein loss, Default: 10.0

  • attribute_gradient_penalty_coef – coefficient for gradient penalty in Wasserstein loss for the attribute discriminator, Default: 10.0

  • attribute_loss_coef – coefficient for the attribute discriminator loss relative to the standard discriminator on attributes and features; higher values should encourage DGAN to learn attribute distributions. Default: 1.0

  • generator_learning_rate – learning rate for Adam optimizer

  • generator_beta1 – Adam param for exponential decay of 1st moment

  • discriminator_learning_rate – learning rate for Adam optimizer

  • discriminator_beta1 – Adam param for exponential decay of 1st moment

  • attribute_discriminator_learning_rate – learning rate for Adam optimizer

  • attribute_discriminator_beta1 – Adam param for exponential decay of 1st moment

  • batch_size – # of examples used in batches, for both training and generation

  • epochs – # of epochs to train model

  • discriminator_rounds – training steps for the discriminator in each batch

  • generator_rounds – training steps for the generator in each batch

  • cuda – use GPU if available

  • mixed_precision_training – enable automatic mixed precision during training to reduce memory cost, bandwidth, and time, by identifying the steps that require full precision and using 32-bit floating point only for those steps while using 16-bit floating point everywhere else.

to_dict()

Return dictionary representation of DGANConfig.

Returns:

Dictionary of member variables, usable to initialize a new config object, e.g., DGANConfig(**config.to_dict())
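A minimal round-trip sketch:

from gretel_synthetics.timeseries_dgan.config import DGANConfig

config = DGANConfig(max_sequence_len=20, sample_len=5)

# to_dict() returns plain member variables, usable to build an equivalent config.
restored = DGANConfig(**config.to_dict())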

class gretel_synthetics.timeseries_dgan.config.DfStyle(value)

Supported styles for parsing pandas DataFrames.

See train_dataframe method in dgan.py for details.

class gretel_synthetics.timeseries_dgan.config.Normalization(value)

Normalization types for continuous variables.

Determines if a sigmoid (ZERO_ONE) or tanh (MINUSONE_ONE) activation is used for the output layers in the generation network.
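For example, a sketch selecting the tanh (MINUSONE_ONE) normalization when building a config:

from gretel_synthetics.timeseries_dgan.config import DGANConfig, Normalization

# Use tanh output activations; continuous variables are scaled to [-1, 1].
config = DGANConfig(
    max_sequence_len=20,
    sample_len=5,
    normalization=Normalization.MINUSONE_ONE,
)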

class gretel_synthetics.timeseries_dgan.config.OutputType(value)

Supported variables types.

Determines internal representation of variables and output layers in generation network.
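As a sketch, variable types can be set explicitly via the feature_types/attribute_types arguments of train_numpy() (documented below), e.g., to force an integer-coded feature to be treated as discrete. This assumes the enum members OutputType.CONTINUOUS and OutputType.DISCRETE:

import numpy as np
from gretel_synthetics.timeseries_dgan.config import DGANConfig, OutputType
from gretel_synthetics.timeseries_dgan.dgan import DGAN

features = np.random.rand(100, 20, 2)
# Integer-code the second feature so we can mark it as discrete.
features[:, :, 1] = np.random.randint(0, 4, size=(100, 20))

model = DGAN(DGANConfig(max_sequence_len=20, sample_len=5, batch_size=100, epochs=1))
model.train_numpy(
    features=features,
    feature_types=[OutputType.CONTINUOUS, OutputType.DISCRETE],
)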

PyTorch implementation of DoppelGANger, from https://arxiv.org/abs/1909.13403

Based on tensorflow 1 code in https://github.com/fjxmlzn/DoppelGANger

DoppelGANger is a generative adversarial network (GAN) model for time series. It supports multi-variate time series (referred to as features) and fixed variables for each time series (attributes). The combination of attribute values and sequence of feature values is 1 example. Once trained, the model can generate novel examples that exhibit the same temporal correlations as seen in the training data. See https://arxiv.org/abs/1909.13403 for additional details on the model.

As a reference for terminology, consider open-high-low-close (OHLC) data from stock markets. Each stock is an example, with fixed attributes such as exchange, sector, and country. The features, or time series, consist of open, high, low, and closing prices for each time interval (daily). After being trained on historical data, the model can generate new hypothetical stocks with price behavior over the same time range.

Sample usage:

import numpy as np
from gretel_synthetics.timeseries_dgan.dgan import DGAN
from gretel_synthetics.timeseries_dgan.config import DGANConfig

attributes = np.random.rand(10000, 3)
features = np.random.rand(10000, 20, 2)

config = DGANConfig(
    max_sequence_len=20,
    sample_len=5,
    batch_size=1000,
    epochs=10
)

model = DGAN(config)

model.train_numpy(attributes=attributes, features=features)

synthetic_attributes, synthetic_features = model.generate_numpy(1000)
class gretel_synthetics.timeseries_dgan.dgan.DGAN(config: DGANConfig, attribute_outputs: List[Output] | None = None, feature_outputs: List[Output] | None = None)

DoppelGANger model.

Interface for training the model and generating data based on the configuration in a DGANConfig instance.

DoppelGANger uses a specific internal representation for data which is hidden from the user in the public interface. Standard usage of DGAN instances should pass continuous variables as floats in the original space (not normalized); discrete variables may be strings, integers, or floats. This is the format expected by both train_numpy() and train_dataframe(), and the generate_numpy() and generate_dataframe() functions return data in this same format. In standard usage, the detailed transformation info in attribute_outputs and feature_outputs is not needed; it is created automatically when a train* function is called with data.

If more control is needed and you want to use the normalized values and one-hot encoding directly, use the _train() and _generate() functions. transformations.py contains internal helper functions for working with the Output metadata instances and converting data to and from the internal representation. To dive even deeper into the model structure, see the torch_modules.py which contains the torch implementations of the networks used in DGAN. As internal details, transformations.py and torch_modules.py are not part of the public interface and may change at any time without notice.

__init__(config: DGANConfig, attribute_outputs: List[Output] | None = None, feature_outputs: List[Output] | None = None)

Create a DoppelGANger model.

Parameters:
  • config – DGANConfig containing model parameters

  • attribute_outputs – custom metadata for attributes, not needed for standard usage

  • feature_outputs – custom metadata for features, not needed for standard usage

generate_dataframe(n: int | None = None, attribute_noise: Tensor | None = None, feature_noise: Tensor | None = None) DataFrame

Generate synthetic data from DGAN model.

Once trained, a DGAN model can generate arbitrary amounts of synthetic data by sampling from the noise distributions. Specify either the number of records to generate, or the specific noise vectors to use.

Parameters:
  • n – number of examples to generate

  • attribute_noise – noise vectors to create synthetic data

  • feature_noise – noise vectors to create synthetic data

Returns:

pandas DataFrame in the same format used in the train_dataframe call
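For example, a sketch that assumes model is a DGAN instance already trained via train_dataframe (documented below):

# Draw 100 synthetic examples in the layout of the training DataFrame.
synthetic_df = model.generate_dataframe(100)
print(synthetic_df.head())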

generate_numpy(n: int | None = None, attribute_noise: Tensor | None = None, feature_noise: Tensor | None = None) Tuple[ndarray | None, list[numpy.ndarray]]

Generate synthetic data from DGAN model.

Once trained, a DGAN model can generate arbitrary amounts of synthetic data by sampling from the noise distributions. Specify either the number of records to generate, or the specific noise vectors to use.

Parameters:
  • n – number of examples to generate

  • attribute_noise – noise vectors to create synthetic data

  • feature_noise – noise vectors to create synthetic data

Returns:

Tuple of attributes and features as numpy arrays.
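For example, continuing the sample usage above:

synthetic_attributes, synthetic_features = model.generate_numpy(1000)

# With the sample config above: attributes have shape (1000, 3) and features
# are a list of 1000 arrays, each of shape (20, 2).
print(synthetic_attributes.shape, len(synthetic_features), synthetic_features[0].shape)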

classmethod load(file_name: str, **kwargs) DGAN

Load DGAN model instance from a file.

Parameters:
  • file_name – location to load from

  • kwargs – additional parameters passed to torch.load, for example, use map_location=torch.device("cpu") to load a model saved for GPU on a machine without cuda

Returns:

DGAN model instance

save(file_name: str, **kwargs)

Save DGAN model to a file.

Parameters:
  • file_name – location to save serialized model

  • kwargs – additional parameters passed to torch.save
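A save/load round-trip sketch; the file name is a hypothetical placeholder, and map_location is forwarded to torch.load as noted above:

import torch
from gretel_synthetics.timeseries_dgan.dgan import DGAN

# Assumes `model` is a trained DGAN instance (see sample usage above).
model.save("dgan_model.pt")

# Reload later; map_location lets a GPU-trained model load on a CPU-only machine.
restored = DGAN.load("dgan_model.pt", map_location=torch.device("cpu"))
synthetic_attributes, synthetic_features = restored.generate_numpy(100)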

train_dataframe(df: DataFrame, attribute_columns: List[str] | None = None, feature_columns: List[str] | None = None, example_id_column: str | None = None, time_column: str | None = None, discrete_columns: List[str] | None = None, df_style: DfStyle = DfStyle.WIDE, progress_callback: Callable[[ProgressInfo], None] | None = None) None

Train DGAN model on data in pandas DataFrame.

Training data can be in either “wide” or “long” format. “Wide” format uses one row per example, with 0 or more attribute columns and 1 column per time point in the time series; it is restricted to 1 feature variable. “Long” format uses one row per time point and supports multiple feature variables, relying on an additional example id column to split the data into examples and a time column to sort them (see the sketch after the parameter list).

Parameters:
  • df – DataFrame of training data

  • attribute_columns – list of column names containing attributes, if None, no attribute columns are used. Must be disjoint from the feature columns.

  • feature_columns – list of column names containing features, if None all non-attribute columns are used. Must be disjoint from attribute columns.

  • example_id_column – column name used to split “long” format data frame into multiple examples, if None, data is treated as a single example. This column name must not appear in the other column list parameters.

  • time_column – column name used to sort “long” format data frame, if None, the data frame order of rows/time points is used. This column name must not appear in the other column list parameters.

  • discrete_columns – column names (either attributes or features) to treat as discrete (use one-hot or binary encoding), any string or object columns are automatically treated as discrete

  • df_style – str enum of “wide” or “long” indicating format of the DataFrame

  • progress_callback – optional callback invoked with ProgressInfo updates during training
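A minimal sketch of the “long” format, with hypothetical column names, assuming the enum member DfStyle.LONG ("long"):

import numpy as np
import pandas as pd
from gretel_synthetics.timeseries_dgan.config import DGANConfig, DfStyle
from gretel_synthetics.timeseries_dgan.dgan import DGAN

# Hypothetical long-format frame: one row per time point, 10 examples of 20 steps.
n_examples, seq_len = 10, 20
df = pd.DataFrame(
    {
        "example_id": np.repeat(np.arange(n_examples), seq_len),
        "time": np.tile(np.arange(seq_len), n_examples),
        "sector": np.repeat(["tech", "energy"] * 5, seq_len),  # attribute
        "price": np.random.rand(n_examples * seq_len),         # feature
        "volume": np.random.rand(n_examples * seq_len),        # feature
    }
)

model = DGAN(DGANConfig(max_sequence_len=20, sample_len=5, batch_size=10, epochs=1))
model.train_dataframe(
    df,
    attribute_columns=["sector"],
    feature_columns=["price", "volume"],
    example_id_column="example_id",
    time_column="time",
    df_style=DfStyle.LONG,
)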

train_numpy(features: ndarray | list[numpy.ndarray], feature_types: List[OutputType] | None = None, attributes: ndarray | None = None, attribute_types: List[OutputType] | None = None, progress_callback: Callable[[ProgressInfo], None] | None = None) None

Train DGAN model on data in numpy arrays.

Training data is passed in 2 numpy arrays, one for attributes (2d) and one for features (3d). Features may instead be a ragged array with variable length sequences, given as a list of 2-d numpy arrays (see the sketch after the parameter list). This data should be in the original space and is not transformed. If the data is already transformed into the internal DGAN representation (continuous variables scaled to [0,1] or [-1,1] and discrete variables one-hot or binary encoded), use the internal _train() function instead of train_numpy().

In standard usage, attribute_types and feature_types may be provided on the first call to train() to setup the model structure. If not specified, the default is to assume continuous variables for floats and integers, and discrete for strings. If outputs metadata was specified when the instance was initialized or train() was previously called, then attribute_types and feature_types are not needed.

Parameters:
  • features – 3-d numpy array of time series features for the training, size is (# of training examples) X max_sequence_len X (# of features) OR list of 2-d numpy arrays with one sequence per numpy array, each numpy array should then have size seq_len X (# of features) where seq_len <= max_sequence_len

  • feature_types (Optional) – Specification of Discrete or Continuous type for each variable of the features. If None, assume continuous variables for floats and integers, and discrete for strings. Ignored if the model was already built, either by passing output params at initialization or because train_ was called previously.

  • attributes (Optional) – 2-d numpy array of attributes for the training examples, size is (# of training examples) X (# of attributes)

  • attribute_types (Optional) – Specification of Discrete or Continuous type for each variable of the attributes. If None, assume continuous variables for floats and integers, and discrete for strings. Ignored if the model was already built, either by passing output params at initialization or because train_ was called previously.

  • progress_callback – optional callback invoked with ProgressInfo updates during training
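A sketch of the ragged-array form, where each example supplies its own sequence length (at most max_sequence_len):

import numpy as np
from gretel_synthetics.timeseries_dgan.config import DGANConfig
from gretel_synthetics.timeseries_dgan.dgan import DGAN

# Variable-length sequences: a list of 2-d arrays of shape (seq_len, # of features).
features = [np.random.rand(np.random.randint(10, 21), 2) for _ in range(100)]
attributes = np.random.rand(100, 3)

model = DGAN(DGANConfig(max_sequence_len=20, sample_len=5, batch_size=10, epochs=1))
model.train_numpy(features=features, attributes=attributes)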

gretel_synthetics.timeseries_dgan.dgan.find_max_consecutive_nans(array: ndarray) int

Returns the maximum number of consecutive NaNs in an array.

Parameters:

array – 1-d numpy array of time series per example.

Returns:

The maximum number of consecutive NaNs in a time series array.

Return type:

max_cons_nan
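Illustrative behavior, as a sketch based on the description above:

import numpy as np
from gretel_synthetics.timeseries_dgan.dgan import find_max_consecutive_nans

arr = np.array([1.0, np.nan, np.nan, 3.0, np.nan, 5.0])
print(find_max_consecutive_nans(arr))  # 2: the longest NaN run has two elements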

gretel_synthetics.timeseries_dgan.dgan.nan_linear_interpolation(features: list[numpy.ndarray], continuous_features_ind: list[int])

Replaces all NaNs via linear interpolation.

Changes numpy arrays in features in place.

Parameters:
  • features – list of 2-d numpy arrays, each element is a sequence of shape (sequence_len, #features)

  • continuous_features_ind – features to apply nan interpolation to, indexes the 2nd dimension of the sequence arrays of features
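A sketch of in-place usage, interpolating NaNs in the first feature column of each sequence:

import numpy as np
from gretel_synthetics.timeseries_dgan.dgan import nan_linear_interpolation

seq = np.array([[1.0, 0.0], [np.nan, 0.0], [3.0, 0.0]])
features = [seq]

# Modifies the arrays inside `features` in place.
nan_linear_interpolation(features, continuous_features_ind=[0])
print(features[0][:, 0])  # [1. 2. 3.] after interpolation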

gretel_synthetics.timeseries_dgan.dgan.validation_check(features: list[numpy.ndarray], continuous_features_ind: list[int], invalid_examples_ratio_cutoff: float = 0.5, nans_ratio_cutoff: float = 0.1, consecutive_nans_max: int = 5, consecutive_nans_ratio_cutoff: float = 0.05) ndarray

Checks if continuous features of examples are valid.

Returns a 1-d numpy array of booleans with shape (# of examples) indicating valid examples. Examples with continuous features fall into 3 categories: good, valid (fixable), and invalid (non-fixable).

  • “Good” examples have no NaNs.

  • “Valid” examples have a low percentage of NaNs and a below-threshold number of consecutive NaNs.

  • “Invalid” examples are the rest; they are marked False in the returned array and later omitted from training. If there are too many invalid examples, training errors out.

Parameters:
  • features – list of 2-d numpy arrays, each element is a sequence of possibly varying length

  • continuous_features_ind – list of indices of continuous features to analyze, indexes the 2nd dimension of the sequence arrays in features

  • invalid_examples_ratio_cutoff – Error out if the invalid examples ratio in the dataset is higher than this value.

  • nans_ratio_cutoff – If the percentage of nans for any continuous feature in an example is greater than this value, the example is invalid.

  • consecutive_nans_max – If the maximum number of consecutive nans in a continuous feature is greater than this number, then that example is invalid.

  • consecutive_nans_ratio_cutoff – If the maximum number of consecutive nans in a continuous feature is greater than this ratio times the length of the example (number samples), then the example is invalid.

Returns:

1-d numpy array of booleans indicating valid examples with shape (#examples).

Return type:

valid_examples
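A sketch of filtering training sequences with the returned mask:

import numpy as np
from gretel_synthetics.timeseries_dgan.dgan import validation_check

features = [np.random.rand(20, 2) for _ in range(100)]
valid = validation_check(features, continuous_features_ind=[0, 1])

# Keep only the examples flagged as valid.
features = [seq for seq, ok in zip(features, valid) if ok]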