Timeseries DGAN

The Timeseries DGAN module contains a PyTorch implementation of the DoppelGANger model, see https://arxiv.org/abs/1909.13403 for a detailed description of the model.

import numpy as np
from gretel_synthetics.timeseries_dgan.dgan import DGAN
from gretel_synthetics.timeseries_dgan.config import DGANConfig

attributes = np.random.rand(10000, 3)
features = np.random.rand(10000, 20, 2)

config = DGANConfig(
    max_sequence_len=20,
    sample_len=5,
    batch_size=1000,
    epochs=10
)
model = DGAN(config)

# Keyword arguments avoid depending on the positional order of train_numpy().
model.train_numpy(features=features, attributes=attributes)

synthetic_attributes, synthetic_features = model.generate_numpy(1000)

class gretel_synthetics.timeseries_dgan.config.DGANConfig(max_sequence_len: int, sample_len: int, attribute_noise_dim: int = 10, feature_noise_dim: int = 10, attribute_num_layers: int = 3, attribute_num_units: int = 100, feature_num_layers: int = 1, feature_num_units: int = 100, use_attribute_discriminator: bool = True, normalization: gretel_synthetics.timeseries_dgan.config.Normalization = <Normalization.ZERO_ONE: 0>, apply_feature_scaling: bool = True, apply_example_scaling: bool = True, forget_bias: bool = False, gradient_penalty_coef: float = 10.0, attribute_gradient_penalty_coef: float = 10.0, attribute_loss_coef: float = 1.0, generator_learning_rate: float = 0.001, generator_beta1: float = 0.5, discriminator_learning_rate: float = 0.001, discriminator_beta1: float = 0.5, attribute_discriminator_learning_rate: float = 0.001, attribute_discriminator_beta1: float = 0.5, batch_size: int = 1024, epochs: int = 400, discriminator_rounds: int = 1, generator_rounds: int = 1, cuda: bool = True, mixed_precision_training: bool = False)

Config object with parameters for training a DGAN model.

Parameters
  • max_sequence_len – length of time series sequences; variable-length sequences are not supported, so all training and generated data will have the same sequence length

  • sample_len – time series steps to generate from each LSTM cell in DGAN, must be a divisor of max_sequence_len

  • attribute_noise_dim – length of the GAN noise vectors for attribute generation

  • feature_noise_dim – length of GAN noise vectors for feature generation

  • attribute_num_layers – # of layers in the GAN discriminator network

  • attribute_num_units – # of units per layer in the GAN discriminator network

  • feature_num_layers – # of LSTM layers in the GAN generator network

  • feature_num_units – # of units per layer in the GAN generator network

  • use_attribute_discriminator – use a separate discriminator only on attributes, helps DGAN match attribute distributions, Default: True

  • normalization – default normalization for continuous variables, used when metadata output is not specified during DGAN initialization

  • apply_feature_scaling – scale each continuous variable to [0,1] or [-1,1] (based on normalization param) before training and rescale to original range during generation, if False then training data must be within range and DGAN will only generate values in [0,1] or [-1,1], Default: True

  • apply_example_scaling – compute midpoint and halfrange (equivalent to min/max) for each time series variable and include these as additional attributes that are generated, this provides better support for time series with highly variable ranges, e.g., in network data, a dial-up connection has bandwidth usage in [1kb, 10kb], while a fiber connection is in [100mb, 1gb], Default: True

  • forget_bias – initialize forget gate bias parameters to 1 in LSTM layers, when True initialization matches tf1 LSTMCell behavior, otherwise default PyTorch initialization is used, Default: False

  • gradient_penalty_coef – coefficient for gradient penalty in Wasserstein loss, Default: 10.0

  • attribute_gradient_penalty_coef – coefficient for gradient penalty in Wasserstein loss for the attribute discriminator, Default: 10.0

  • attribute_loss_coef – coefficient for attribute discriminator loss in comparison to the standard discriminator on attributes and features, higher values should encourage DGAN to learn attribute distributions, Default: 1.0

  • generator_learning_rate – learning rate for Adam optimizer

  • generator_beta1 – Adam param for exponential decay of 1st moment

  • discriminator_learning_rate – learning rate for Adam optimizer

  • discriminator_beta1 – Adam param for exponential decay of 1st moment

  • attribute_discriminator_learning_rate – learning rate for Adam optimizer

  • attribute_discriminator_beta1 – Adam param for exponential decay of 1st moment

  • batch_size – # of examples used in batches, for both training and generation

  • epochs – # of epochs to train model

  • discriminator_rounds – training steps for the discriminator in each batch

  • generator_rounds – training steps for the generator in each batch

  • cuda – use GPU if available

  • mixed_precision_training – enable automatic mixed precision during training to reduce memory costs, bandwidth, and time by identifying the steps that require full precision and using 32-bit floating point for only those steps while using 16-bit floating point everywhere else
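
The per-example scaling described for apply_example_scaling and normalization can be sketched in plain Python. This is only an illustration, not the library's implementation: the helper names scale_example and unscale_example are hypothetical, and the exact formulas are an assumption based on the midpoint and half-range description above.

```python
def scale_example(series, zero_one=True):
    """Scale one time series by its own midpoint and half-range.

    Hypothetical helper, not part of gretel_synthetics. Returns the scaled
    values plus the (midpoint, half_range) pair that would be kept as
    additional attributes so the original range can be restored.
    """
    lo, hi = min(series), max(series)
    midpoint = (lo + hi) / 2.0
    half_range = (hi - lo) / 2.0
    if half_range == 0.0:
        # Constant series: map every value to the center of the range.
        scaled = [0.0 for _ in series]
    else:
        scaled = [(x - midpoint) / half_range for x in series]  # in [-1, 1]
    if zero_one:
        scaled = [(s + 1.0) / 2.0 for s in scaled]  # shift to [0, 1]
    return scaled, midpoint, half_range


def unscale_example(scaled, midpoint, half_range, zero_one=True):
    """Invert scale_example using the stored midpoint and half-range."""
    if zero_one:
        scaled = [s * 2.0 - 1.0 for s in scaled]
    return [s * half_range + midpoint for s in scaled]


# Two series with very different ranges (both in kb).
dialup = [1.0, 4.0, 10.0]
fiber = [100_000.0, 550_000.0, 1_000_000.0]

dialup_scaled, dialup_mid, dialup_half = scale_example(dialup)
fiber_scaled, fiber_mid, fiber_half = scale_example(fiber)
dialup_restored = unscale_example(dialup_scaled, dialup_mid, dialup_half)
```

After scaling, both series occupy the same [0, 1] range, and the stored midpoint and half-range carry the information about the original magnitude.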

to_dict()

Return dictionary representation of DGANConfig.

Returns

Dictionary of member variables, usable to initialize a new config object, e.g., DGANConfig(**config.to_dict())
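
The round trip through to_dict() can be illustrated with a small stand-in class. MiniConfig below is hypothetical and carries only a few of the fields listed above; it is meant to show the pattern, not how DGANConfig is implemented.

```python
from dataclasses import dataclass, asdict


@dataclass
class MiniConfig:
    """Hypothetical stand-in for DGANConfig with only a few fields."""

    max_sequence_len: int
    sample_len: int
    batch_size: int = 1024
    epochs: int = 400

    def to_dict(self):
        # Dictionary of member variables, keyed by constructor argument name.
        return asdict(self)


config = MiniConfig(max_sequence_len=20, sample_len=5, epochs=10)

# Round trip: the dictionary unpacks straight back into the constructor.
config_copy = MiniConfig(**config.to_dict())

# The dictionary is also a convenient base for a modified configuration.
longer_run = MiniConfig(**{**config.to_dict(), "epochs": 100})
```

The same pattern should apply to a real config object, for example when storing the parameters alongside experiment results and recreating the config later.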

class gretel_synthetics.timeseries_dgan.config.DfStyle

Supported styles for parsing pandas DataFrames.

See train_dataframe method in dgan.py for details.

class gretel_synthetics.timeseries_dgan.config.Normalization

Normalization types for continuous variables.

Determines if a sigmoid (ZERO_ONE) or tanh (MINUSONE_ONE) activation is used for the output layers in the generation network.
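
As a rough illustration of why the two normalization types correspond to different value ranges, the following sketch applies both activations to a few arbitrary pre-activation values. It uses only the standard library and does not reproduce the actual network code.

```python
import math


def sigmoid(x):
    """Logistic function, with outputs strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


raw_outputs = [-6.0, -1.0, 0.0, 1.0, 6.0]  # arbitrary pre-activation values

zero_one = [sigmoid(x) for x in raw_outputs]        # range used by ZERO_ONE
minusone_one = [math.tanh(x) for x in raw_outputs]  # range used by MINUSONE_ONE
```

Because the output layer can only produce values inside the chosen range, continuous training data has to be scaled into that same range, either by the library or beforehand.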

class gretel_synthetics.timeseries_dgan.config.OutputType

Supported variable types.

Determines internal representation of variables and output layers in generation network.

PyTorch implementation of DoppelGANger, from https://arxiv.org/abs/1909.13403

Based on tensorflow 1 code in https://github.com/fjxmlzn/DoppelGANger

DoppelGANger is a generative adversarial network (GAN) model for time series. It supports multi-variate time series (referred to as features) and fixed variables for each time series (attributes). The combination of attribute values and sequence of feature values is 1 example. Once trained, the model can generate novel examples that exhibit the same temporal correlations as seen in the training data. See https://arxiv.org/abs/1909.13403 for additional details on the model.

As a reference for terminology, consider open-high-low-close (OHLC) data from stock markets. Each stock is an example, with fixed attributes such as exchange, sector, and country. The features, or time series, consist of open, high, low, and closing prices for each time interval (daily). After being trained on historical data, the model can generate more hypothetical stocks and price behavior on the training time range.

Sample usage:

import numpy as np
from gretel_synthetics.timeseries_dgan.dgan import DGAN
from gretel_synthetics.timeseries_dgan.config import DGANConfig

attributes = np.random.rand(10000, 3)
features = np.random.rand(10000, 20, 2)

config = DGANConfig(
    max_sequence_len=20,
    sample_len=5,
    batch_size=1000,
    epochs=10
)

model = DGAN(config)

model.train_numpy(features=features, attributes=attributes)

synthetic_attributes, synthetic_features = model.generate_numpy(1000)
class gretel_synthetics.timeseries_dgan.dgan.DGAN(config: gretel_synthetics.timeseries_dgan.config.DGANConfig, attribute_outputs: Optional[List[gretel_synthetics.timeseries_dgan.transformations.Output]] = None, feature_outputs: Optional[List[gretel_synthetics.timeseries_dgan.transformations.Output]] = None)

DoppelGANger model.

Interface for training the model and generating data based on the configuration in a DGANConfig instance.

DoppelGANger uses a specific internal representation for data which is hidden from the user in the public interface. Continuous variables should be in the original space and discrete variables represented as [0.0, 1.0, 2.0, …] when using the train_numpy() and train_dataframe() functions. The generate_numpy() and generate_dataframe() functions will return data in this original space. In standard usage, the detailed transformation info in attribute_outputs and feature_outputs is not needed; it will be created automatically when a train* function is called with data.
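
When a discrete variable starts out as text labels, it first has to be mapped to the [0.0, 1.0, 2.0, …] representation described above. A minimal sketch of such a mapping follows; encode_discrete and decode_discrete are hypothetical helpers, not functions provided by gretel_synthetics.

```python
def encode_discrete(values):
    """Map arbitrary category labels to 0-indexed floats.

    Hypothetical helper, not part of gretel_synthetics. Returns the encoded
    values and the sorted list of categories needed to decode them again.
    """
    categories = sorted(set(values))
    index = {label: float(i) for i, label in enumerate(categories)}
    return [index[v] for v in values], categories


def decode_discrete(encoded, categories):
    """Map 0-indexed floats (for example, generated data) back to labels."""
    return [categories[int(code)] for code in encoded]


sectors = ["tech", "energy", "tech", "finance"]
encoded, categories = encode_discrete(sectors)
decoded = decode_discrete(encoded, categories)
```

Keeping the categories list around makes it possible to translate generated values back into the original labels.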

If more control is needed and you want to use the normalized values and one-hot encoding directly, use the _train() and _generate() functions. transformations.py contains internal helper functions for working with the Output metadata instances and converting data to and from the internal representation. To dive even deeper into the model structure, see torch_modules.py, which contains the torch implementations of the networks used in DGAN. As internal details, transformations.py and torch_modules.py are not part of the public interface and may change at any time without notice.

__init__(config: gretel_synthetics.timeseries_dgan.config.DGANConfig, attribute_outputs: Optional[List[gretel_synthetics.timeseries_dgan.transformations.Output]] = None, feature_outputs: Optional[List[gretel_synthetics.timeseries_dgan.transformations.Output]] = None)

Create a DoppelGANger model.

Parameters
  • config – DGANConfig containing model parameters

  • attribute_outputs – custom metadata for attributes, not needed for standard usage

  • feature_outputs – custom metadata for features, not needed for standard usage

generate_dataframe(n: Optional[int] = None, attribute_noise: Optional[torch.Tensor] = None, feature_noise: Optional[torch.Tensor] = None) → pandas.core.frame.DataFrame

Generate synthetic data from DGAN model.

Once trained, a DGAN model can generate arbitrary amounts of synthetic data by sampling from the noise distributions. Specify either the number of records to generate, or the specific noise vectors to use.

Parameters
  • n – number of examples to generate

  • attribute_noise – noise vectors to create synthetic data

  • feature_noise – noise vectors to create synthetic data

Returns

pandas DataFrame in same format used in ‘train_dataframe’ call

generate_numpy(n: Optional[int] = None, attribute_noise: Optional[torch.Tensor] = None, feature_noise: Optional[torch.Tensor] = None) → Tuple[Optional[numpy.ndarray], numpy.ndarray]

Generate synthetic data from DGAN model.

Once trained, a DGAN model can generate arbitrary amounts of synthetic data by sampling from the noise distributions. Specify either the number of records to generate, or the specific noise vectors to use.

Parameters
  • n – number of examples to generate

  • attribute_noise – noise vectors to create synthetic data

  • feature_noise – noise vectors to create synthetic data

Returns

Tuple of attributes and features as numpy arrays.

classmethod load(file_name: str, **kwargs) → gretel_synthetics.timeseries_dgan.dgan.DGAN

Load DGAN model instance from a file.

Parameters
  • file_name – location to load from

  • kwargs – additional parameters passed to torch.load, for example, use map_location=torch.device("cpu") to load a model saved for GPU on a machine without CUDA

Returns

DGAN model instance

save(file_name: str, **kwargs)

Save DGAN model to a file.

Parameters
  • file_name – location to save serialized model

  • kwargs – additional parameters passed to torch.save

train_dataframe(df: pandas.core.frame.DataFrame, attribute_columns: Optional[List[str]] = None, feature_columns: Optional[List[str]] = None, example_id_column: Optional[str] = None, time_column: Optional[str] = None, discrete_columns: Optional[List[str]] = None, df_style: gretel_synthetics.timeseries_dgan.config.DfStyle = <DfStyle.WIDE: 'wide'>)

Train DGAN model on data in pandas DataFrame.

Training data can be in either “wide” or “long” format. “Wide” format uses one row for each example with 0 or more attribute columns and 1 column per time point in the time series. “Wide” format is restricted to 1 feature variable. “Long” format uses one row per time point, supports multiple feature variables, and uses an additional example id column to split the data into examples and a time column to sort it.
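
The relationship between the two layouts can be sketched without pandas. The example below pivots a few long-format rows into wide-format rows for a single feature; the column names (stock, day, close) and the helper long_to_wide are made up for illustration and are not part of the library.

```python
from collections import defaultdict

# "Long" format: one row per time point, with an example id and a time column.
long_rows = [
    {"stock": "AAA", "day": 2, "close": 11.0},
    {"stock": "AAA", "day": 1, "close": 10.0},
    {"stock": "BBB", "day": 1, "close": 50.0},
    {"stock": "AAA", "day": 3, "close": 12.0},
    {"stock": "BBB", "day": 3, "close": 48.0},
    {"stock": "BBB", "day": 2, "close": 49.0},
]


def long_to_wide(rows, example_id_column, time_column, feature_column):
    """Pivot long-format rows into one row per example (single feature only)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[example_id_column]].append(row)

    wide = []
    for example_id in sorted(grouped):
        # Sort each example's rows by the time column before laying them out.
        ordered = sorted(grouped[example_id], key=lambda r: r[time_column])
        wide.append([r[feature_column] for r in ordered])
    return wide


# "Wide" format: one row per example, one column per time point.
wide_rows = long_to_wide(long_rows, "stock", "day", "close")
```

Here the example id plays the role of example_id_column and the day plays the role of time_column; with more than one feature variable, only the long layout can hold the data.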

Parameters
  • df – DataFrame of training data

  • attribute_columns – list of column names containing attributes, if None, no attribute columns are used, Default: None

  • feature_columns – list of column names containing features, if None all non-attribute columns are used, Default: None

  • example_id_column – column name used to split “long” format data frame into multiple examples, if None, data is treated as a single example

  • time_column – column name used to sort “long” format data frame, if None, data frame order of rows/time points is used

  • discrete_columns – column names (either attributes or features) to use discrete, one-hot encoding, discrete values must be integers in [0,1,2,3…]

  • df_style – str enum of “wide” or “long” indicating format of the DataFrame

train_numpy(features: numpy.ndarray, feature_types: Optional[List[gretel_synthetics.timeseries_dgan.config.OutputType]] = None, attributes: Optional[numpy.ndarray] = None, attribute_types: Optional[List[gretel_synthetics.timeseries_dgan.config.OutputType]] = None)

Train DGAN model on data in numpy arrays.

Training data is passed in 2 numpy arrays, one for attributes (2d) and one for features (3d). This data should be in the original space, that is, not yet transformed. If the data is already transformed into the internal DGAN representation (continuous variables scaled to [0,1] or [-1,1] and discrete variables one-hot encoded), use the internal _train() function instead of train_numpy(), or specify apply_feature_scaling=False in the DGANConfig.

In standard usage, attribute_types and feature_types should be provided on the first call to train_numpy() to correctly set up the model structure. If not specified, the default is to assume continuous variables. If outputs metadata was specified when the instance was initialized or train_numpy() was previously called, then attribute_types and feature_types are not needed.

Parameters
  • features – 3-d numpy array of time series features for the training, size is (# of training examples) X max_sequence_len X (# of features)

  • feature_types (Optional) – Specification of Discrete or Continuous type for each variable of the features. Discrete features should be 0-indexed (not one-hot encoded). If None, assume all features are continuous. Ignored if the model was already built, either by passing output params at initialization or because a train_* function was called previously.

  • attributes (Optional) – 2-d numpy array of attributes for the training examples, size is (# of training examples) X (# of attributes)

  • attribute_types (Optional) – Specification of Discrete or Continuous type for each variable of the attributes. Discrete attributes should be 0-indexed (not one-hot encoded). If None, assume all attributes are continuous. Ignored if the model was already built, either by passing output params at initialization or because a train_* function was called previously.
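
To make the distinction between 0-indexed and one-hot encoded values concrete, the sketch below expands a list of category codes into one-hot vectors. The one_hot helper is hypothetical and only illustrates the two representations; it is not the library's transformation code.

```python
def one_hot(codes, num_categories):
    """Expand 0-indexed category codes into one-hot vectors."""
    vectors = []
    for code in codes:
        vector = [0.0] * num_categories
        vector[int(code)] = 1.0
        vectors.append(vector)
    return vectors


# 0-indexed codes: the form described above for discrete variables.
category_codes = [0.0, 2.0, 1.0]

# One-hot form of the same values, shown only for comparison.
one_hot_vectors = one_hot(category_codes, num_categories=3)
```

A discrete variable with three categories therefore occupies one column in the 0-indexed form and three columns in the one-hot form.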