Timeseries DGAN
The Timeseries DGAN module contains a PyTorch implementation of the DoppelGANger model. See https://arxiv.org/abs/1909.13403 for a detailed description of the model.
import numpy as np
from gretel_synthetics.timeseries_dgan.dgan import DGAN
from gretel_synthetics.timeseries_dgan.config import DGANConfig
# 10,000 examples, each with 3 attributes and a 20-step series of 2 features
attributes = np.random.rand(10000, 3)
features = np.random.rand(10000, 20, 2)
config = DGANConfig(
    max_sequence_len=20,
    sample_len=5,
    batch_size=1000,
    epochs=10
)
model = DGAN(config)
# train_numpy takes features first, so pass both arrays by keyword
model.train_numpy(features=features, attributes=attributes)
synthetic_attributes, synthetic_features = model.generate_numpy(1000)

class
gretel_synthetics.timeseries_dgan.config.
DGANConfig
(max_sequence_len: int, sample_len: int, attribute_noise_dim: int = 10, feature_noise_dim: int = 10, attribute_num_layers: int = 3, attribute_num_units: int = 100, feature_num_layers: int = 1, feature_num_units: int = 100, use_attribute_discriminator: bool = True, normalization: gretel_synthetics.timeseries_dgan.config.Normalization = <Normalization.ZERO_ONE: 0>, apply_feature_scaling: bool = True, apply_example_scaling: bool = True, binary_encoder_cutoff: int = 150, forget_bias: bool = False, gradient_penalty_coef: float = 10.0, attribute_gradient_penalty_coef: float = 10.0, attribute_loss_coef: float = 1.0, generator_learning_rate: float = 0.001, generator_beta1: float = 0.5, discriminator_learning_rate: float = 0.001, discriminator_beta1: float = 0.5, attribute_discriminator_learning_rate: float = 0.001, attribute_discriminator_beta1: float = 0.5, batch_size: int = 1024, epochs: int = 400, discriminator_rounds: int = 1, generator_rounds: int = 1, cuda: bool = True, mixed_precision_training: bool = False)
Config object with parameters for training a DGAN model.
 Parameters
max_sequence_len – length of time series sequences; variable length sequences are not supported, so all training and generated data will have the same sequence length
sample_len – time series steps to generate from each LSTM cell in DGAN, must be a divisor of max_sequence_len
attribute_noise_dim – length of the GAN noise vectors for attribute generation
feature_noise_dim – length of GAN noise vectors for feature generation
attribute_num_layers – # of layers in the GAN discriminator network
attribute_num_units – # of units per layer in the GAN discriminator network
feature_num_layers – # of LSTM layers in the GAN generator network
feature_num_units – # of units per layer in the GAN generator network
use_attribute_discriminator – use a separate discriminator only on attributes, helps DGAN match attribute distributions, Default: True
normalization – default normalization for continuous variables, used when metadata output is not specified during DGAN initialization
apply_feature_scaling – scale each continuous variable to [0,1] or [-1,1] (based on normalization param) before training and rescale to original range during generation; if False, training data must already be within that range and DGAN will only generate values in [0,1] or [-1,1], Default: True
apply_example_scaling – compute midpoint and half-range (equivalent to min/max) for each time series variable and include these as additional attributes that are generated; this provides better support for time series with highly variable ranges, e.g., in network data, a dial-up connection has bandwidth usage in [1kb, 10kb], while a fiber connection is in [100mb, 1gb], Default: True
binary_encoder_cutoff – use binary encoder (instead of one hot encoder) for any column with more than this many unique values. This helps reduce memory consumption for datasets with a lot of unique values.
forget_bias – initialize forget gate bias parameters to 1 in LSTM layers; when True, initialization matches tf1 LSTMCell behavior, otherwise default PyTorch initialization is used, Default: False
gradient_penalty_coef – coefficient for gradient penalty in Wasserstein loss, Default: 10.0
attribute_gradient_penalty_coef – coefficient for gradient penalty in Wasserstein loss for the attribute discriminator, Default: 10.0
attribute_loss_coef – coefficient for attribute discriminator loss in comparison to the standard discriminator on attributes and features, higher values should encourage DGAN to learn attribute distributions, Default: 1.0
generator_learning_rate – learning rate for Adam optimizer
generator_beta1 – Adam param for exponential decay of 1st moment
discriminator_learning_rate – learning rate for Adam optimizer
discriminator_beta1 – Adam param for exponential decay of 1st moment
attribute_discriminator_learning_rate – learning rate for Adam optimizer
attribute_discriminator_beta1 – Adam param for exponential decay of 1st moment
batch_size – # of examples used in batches, for both training and generation
epochs – # of epochs to train model
discriminator_rounds – training steps for the discriminator in each batch
generator_rounds – training steps for the generator in each batch
cuda – use GPU if available
mixed_precision_training – enable automatic mixed precision during training to reduce memory cost, bandwidth, and time by identifying the steps that require full precision, using 32-bit floating point for only those steps and 16-bit floating point everywhere else.
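The sample_len constraint above (it must be a divisor of max_sequence_len) is easy to get wrong when choosing a configuration. The following is a minimal sketch, independent of the library, that lists the values satisfying that constraint. The helper name valid_sample_lens is hypothetical and is not part of gretel_synthetics.

```python
from typing import List


def valid_sample_lens(max_sequence_len: int) -> List[int]:
    """Return every sample_len that evenly divides max_sequence_len.

    Hypothetical helper for illustration; not part of gretel_synthetics.
    """
    if max_sequence_len < 1:
        raise ValueError("max_sequence_len must be a positive integer")
    return [n for n in range(1, max_sequence_len + 1) if max_sequence_len % n == 0]


print(valid_sample_lens(20))  # [1, 2, 4, 5, 10, 20]
```

With max_sequence_len=20, a sample_len of 5 (as in the sample usage) is accepted, while a value such as 3 would not be.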

to_dict
()
Return dictionary representation of DGANConfig.
 Returns
Dictionary of member variables, usable to initialize a new config object, e.g., DGANConfig(**config.to_dict())

class
gretel_synthetics.timeseries_dgan.config.
DfStyle
Supported styles for parsing pandas DataFrames.
See train_dataframe method in dgan.py for details.

class
gretel_synthetics.timeseries_dgan.config.
Normalization
Normalization types for continuous variables.
Determines if a sigmoid (ZERO_ONE) or tanh (MINUSONE_ONE) activation is used for the output layers in the generation network.
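The two normalization types correspond to the output ranges of sigmoid ([0,1]) and tanh ([-1,1]). As a rough illustration of what scaling a continuous variable into either range and back involves, here is a plain min/max sketch in NumPy. It is an assumption-laden illustration and not the library's internal transformation code; the function names scale and unscale are hypothetical.

```python
import numpy as np


def scale(values: np.ndarray, minus_one_one: bool = False):
    """Min-max scale values to [0, 1], or to [-1, 1] if minus_one_one is True."""
    lo, hi = float(np.min(values)), float(np.max(values))
    span = hi - lo if hi > lo else 1.0  # avoid division by zero for constant data
    unit = (values - lo) / span  # now in [0, 1]
    scaled = 2.0 * unit - 1.0 if minus_one_one else unit
    return scaled, lo, hi


def unscale(scaled: np.ndarray, lo: float, hi: float, minus_one_one: bool = False):
    """Invert scale(): map values back to the original range."""
    unit = (scaled + 1.0) / 2.0 if minus_one_one else scaled
    return unit * (hi - lo) + lo


bandwidth = np.array([1.0, 4.0, 10.0])
zero_one, lo, hi = scale(bandwidth)                      # sigmoid-style range
tanh_range, _, _ = scale(bandwidth, minus_one_one=True)  # tanh-style range
restored = unscale(tanh_range, lo, hi, minus_one_one=True)
```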

class
gretel_synthetics.timeseries_dgan.config.
OutputType
Supported variable types.
Determines internal representation of variables and output layers in generation network.
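For discrete variables, the documentation elsewhere on this page refers to 0-indexed values in the public interface and one-hot encoding in the internal representation. The sketch below shows that conversion in NumPy. It is a generic illustration under that assumption, not the library's own encoder, and the function names are hypothetical.

```python
import numpy as np


def one_hot(values: np.ndarray, num_categories: int) -> np.ndarray:
    """One-hot encode 0-indexed integer categories into a 2d float array."""
    values = np.asarray(values, dtype=int)
    if values.min() < 0 or values.max() >= num_categories:
        raise ValueError("values must lie in [0, num_categories)")
    return np.eye(num_categories)[values]


def from_one_hot(encoded: np.ndarray) -> np.ndarray:
    """Recover category indices by taking the largest entry in each row."""
    return np.argmax(encoded, axis=-1)


sector = np.array([0.0, 2.0, 1.0, 2.0])  # discrete values stored as floats
encoded = one_hot(sector, num_categories=3)
decoded = from_one_hot(encoded)
```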
PyTorch implementation of DoppelGANger, from https://arxiv.org/abs/1909.13403
Based on tensorflow 1 code in https://github.com/fjxmlzn/DoppelGANger
DoppelGANger is a generative adversarial network (GAN) model for time series. It supports multivariate time series (referred to as features) and fixed variables for each time series (attributes). The combination of attribute values and sequence of feature values is 1 example. Once trained, the model can generate novel examples that exhibit the same temporal correlations as seen in the training data. See https://arxiv.org/abs/1909.13403 for additional details on the model.
As a reference for terminology, consider open-high-low-close (OHLC) data from stock markets. Each stock is an example, with fixed attributes such as exchange, sector, and country. The features, or time series, consist of open, high, low, and closing prices for each time interval (daily). After being trained on historical data, the model can generate more hypothetical stocks and price behavior on the training time range.
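To make the terminology concrete, the OHLC example can be laid out as one 2d attributes array and one 3d features array. The sketch below builds synthetic placeholder data with NumPy; the category counts, price model, and variable names are all hypothetical and chosen only to show the shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
num_stocks, num_days = 100, 20

# Attributes: one row per stock, 0-indexed category codes for
# exchange, sector, and country (hypothetical category counts).
attributes = np.column_stack([
    rng.integers(0, 3, num_stocks),   # exchange
    rng.integers(0, 11, num_stocks),  # sector
    rng.integers(0, 5, num_stocks),   # country
]).astype(float)

# Features: a random-walk close price, with open/high/low derived from it.
close = 100.0 + np.cumsum(rng.normal(0.0, 1.0, (num_stocks, num_days)), axis=1)
open_ = close + rng.normal(0.0, 0.5, close.shape)
high = np.maximum(open_, close) + rng.uniform(0.0, 1.0, close.shape)
low = np.minimum(open_, close) - rng.uniform(0.0, 1.0, close.shape)
features = np.stack([open_, high, low, close], axis=-1)

print(attributes.shape)  # (100, 3)
print(features.shape)    # (100, 20, 4)
```

Each stock is one example: one row of attributes plus one 20 x 4 block of features.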
Sample usage:
import numpy as np
from gretel_synthetics.timeseries_dgan.dgan import DGAN
from gretel_synthetics.timeseries_dgan.config import DGANConfig
# 10,000 examples, each with 3 attributes and a 20-step series of 2 features
attributes = np.random.rand(10000, 3)
features = np.random.rand(10000, 20, 2)
config = DGANConfig(
    max_sequence_len=20,
    sample_len=5,
    batch_size=1000,
    epochs=10
)
model = DGAN(config)
# train_numpy takes features first, so pass both arrays by keyword
model.train_numpy(features=features, attributes=attributes)
synthetic_attributes, synthetic_features = model.generate_numpy(1000)

class
gretel_synthetics.timeseries_dgan.dgan.
DGAN
(config: gretel_synthetics.timeseries_dgan.config.DGANConfig, attribute_outputs: Optional[List[gretel_synthetics.timeseries_dgan.transformations.Output]] = None, feature_outputs: Optional[List[gretel_synthetics.timeseries_dgan.transformations.Output]] = None)
DoppelGANger model.
Interface for training model and generating data based on configuration in a DGANConfig instance.
DoppelGANger uses a specific internal representation for data which is hidden from the user in the public interface. Continuous variables should be in the original space and discrete variables represented as [0.0, 1.0, 2.0, …] when using the train_numpy() and train_dataframe() functions. The generate_numpy() and generate_dataframe() functions will return data in this original space. In standard usage, the detailed transformation info in attribute_outputs and feature_outputs are not needed, those will be created automatically when a train* function is called with data.
If more control is needed and you want to use the normalized values and one-hot encoding directly, use the _train() and _generate() functions. transformations.py contains internal helper functions for working with the Output metadata instances and converting data to and from the internal representation. To dive even deeper into the model structure, see torch_modules.py, which contains the torch implementations of the networks used in DGAN. As internal details, transformations.py and torch_modules.py are not part of the public interface and may change at any time without notice.

__init__
(config: gretel_synthetics.timeseries_dgan.config.DGANConfig, attribute_outputs: Optional[List[gretel_synthetics.timeseries_dgan.transformations.Output]] = None, feature_outputs: Optional[List[gretel_synthetics.timeseries_dgan.transformations.Output]] = None)
Create a DoppelGANger model.
 Parameters
config – DGANConfig containing model parameters
attribute_outputs – custom metadata for attributes, not needed for standard usage
feature_outputs – custom metadata for features, not needed for standard usage

generate_dataframe
(n: Optional[int] = None, attribute_noise: Optional[torch.Tensor] = None, feature_noise: Optional[torch.Tensor] = None) → pandas.core.frame.DataFrame
Generate synthetic data from DGAN model.
Once trained, a DGAN model can generate arbitrary amounts of synthetic data by sampling from the noise distributions. Specify either the number of records to generate, or the specific noise vectors to use.
 Parameters
n – number of examples to generate
attribute_noise – noise vectors to create synthetic data
feature_noise – noise vectors to create synthetic data
 Returns
pandas DataFrame in same format used in ‘train_dataframe’ call

generate_numpy
(n: Optional[int] = None, attribute_noise: Optional[torch.Tensor] = None, feature_noise: Optional[torch.Tensor] = None) → Tuple[Optional[numpy.ndarray], numpy.ndarray]
Generate synthetic data from DGAN model.
Once trained, a DGAN model can generate arbitrary amounts of synthetic data by sampling from the noise distributions. Specify either the number of records to generate, or the specific noise vectors to use.
 Parameters
n – number of examples to generate
attribute_noise – noise vectors to create synthetic data
feature_noise – noise vectors to create synthetic data
 Returns
Tuple of attributes and features as numpy arrays.

classmethod
load
(file_name: str, **kwargs) → gretel_synthetics.timeseries_dgan.dgan.DGAN
Load DGAN model instance from a file.
 Parameters
file_name – location to load from
kwargs – additional parameters passed to torch.load, for example, use map_location=torch.device("cpu") to load a model saved for GPU on a machine without CUDA
 Returns
DGAN model instance

save
(file_name: str, **kwargs)
Save DGAN model to a file.
 Parameters
file_name – location to save serialized model
kwargs – additional parameters passed to torch.save

train_dataframe
(df: pd.DataFrame, attribute_columns: Optional[List[str]] = None, feature_columns: Optional[List[str]] = None, example_id_column: Optional[str] = None, time_column: Optional[str] = None, discrete_columns: Optional[List[str]] = None, df_style: DfStyle = <DfStyle.WIDE: 'wide'>, progress_callback: Optional[Callable[[ProgressInfo]]] = None) → None
Train DGAN model on data in pandas DataFrame.
Training data can be in either “wide” or “long” format. “Wide” format uses one row for each example with 0 or more attribute columns and 1 column per time point in the time series. “Wide” format is restricted to 1 feature variable. “Long” format uses one row per time point, supports multiple feature variables, and uses additional example id to split into examples and time column to sort.
 Parameters
df – DataFrame of training data
attribute_columns – list of column names containing attributes, if None, no attribute columns are used. Must be disjoint from the feature columns.
feature_columns – list of column names containing features, if None all non-attribute columns are used. Must be disjoint from attribute columns.
example_id_column – column name used to split “long” format data frame into multiple examples, if None, data is treated as a single example. This column must not appear in any of the other column parameters.
time_column – column name used to sort “long” format data frame, if None, data frame order of rows/time points is used. This column must not appear in any of the other column parameters.
discrete_columns – column names (either attributes or features) to treat as discrete and one-hot encode, discrete values must be integers in [0,1,2,3…]
df_style – str enum of “wide” or “long” indicating format of the DataFrame
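The “long” format described above maps onto the 3d features layout (examples x time points x feature variables) by grouping rows on the example id and sorting on the time column. The sketch below shows that regrouping with plain NumPy arrays rather than a DataFrame. It illustrates the idea only and is not the library's parsing code; long_to_3d and its arguments are hypothetical names, and it assumes every example has the same number of time points.

```python
import numpy as np


def long_to_3d(example_ids, times, values) -> np.ndarray:
    """Group long-format rows into an (examples, time steps, features) array.

    example_ids and times hold one entry per row; values is a 2d array with
    one row per time point and one column per feature variable.
    """
    example_ids = np.asarray(example_ids)
    times = np.asarray(times)
    values = np.asarray(values, dtype=float)
    # lexsort uses the last key as the primary key: sort by example id,
    # then by time within each example.
    order = np.lexsort((times, example_ids))
    unique_ids, counts = np.unique(example_ids, return_counts=True)
    if len(set(counts.tolist())) != 1:
        raise ValueError("all examples must have the same sequence length")
    return values[order].reshape(len(unique_ids), int(counts[0]), values.shape[1])


# Two examples (ids 0 and 1), two time points each, two feature variables,
# with rows deliberately out of order.
ids = [1, 0, 0, 1]
times = [1, 0, 1, 0]
rows = [[10.0, 11.0], [1.0, 2.0], [3.0, 4.0], [8.0, 9.0]]
features = long_to_3d(ids, times, rows)
print(features.shape)  # (2, 2, 2)
```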

train_numpy
(features: np.ndarray, feature_types: Optional[List[OutputType]] = None, attributes: Optional[np.ndarray] = None, attribute_types: Optional[List[OutputType]] = None, progress_callback: Optional[Callable[[ProgressInfo]]] = None) → None
Train DGAN model on data in numpy arrays.
Training data is passed in 2 numpy arrays, one for attributes (2d) and one for features (3d). This data should be in the original, untransformed space. If the data is already transformed into the internal DGAN representation (continuous variables scaled to [0,1] or [-1,1] and discrete variables one-hot encoded), use the internal _train() function instead of train_numpy(), or specify apply_feature_scaling=False in the DGANConfig.
In standard usage, attribute_types and feature_types should be provided on the first call to train_numpy() to correctly set up the model structure. If not specified, the default is to assume continuous variables. If outputs metadata was specified when the instance was initialized or a train* function was previously called, then attribute_types and feature_types are not needed.
 Parameters
features – 3d numpy array of time series features for the training, size is (# of training examples) X max_sequence_len X (# of features)
feature_types (Optional) – Specification of Discrete or Continuous type for each variable of the features. Discrete features should be 0-indexed (not one-hot encoded). If None, assume all features are continuous. Ignored if the model was already built, either by passing output params at initialization or because a train* function was called previously.
attributes (Optional) – 2d numpy array of attributes for the training examples, size is (# of training examples) X (# of attributes)
attribute_types (Optional) – Specification of Discrete or Continuous type for each variable of the attributes. Discrete attributes should be 0-indexed (not one-hot encoded). If None, assume all attributes are continuous. Ignored if the model was already built, either by passing output params at initialization or because a train* function was called previously.
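Because train_numpy() takes features before attributes, and the two arrays have different dimensionality, a quick shape check before training can catch swapped or misaligned inputs. The following sketch encodes the shape rules stated in the parameter list above. The helper check_training_arrays is hypothetical and not part of the library.

```python
from typing import Optional

import numpy as np


def check_training_arrays(
    features: np.ndarray,
    attributes: Optional[np.ndarray],
    max_sequence_len: int,
) -> None:
    """Raise ValueError if the arrays do not match the documented shapes."""
    if features.ndim != 3:
        raise ValueError("features must be 3d: (examples, max_sequence_len, features)")
    if features.shape[1] != max_sequence_len:
        raise ValueError("features.shape[1] must equal max_sequence_len")
    if attributes is not None:
        if attributes.ndim != 2:
            raise ValueError("attributes must be 2d: (examples, attributes)")
        if attributes.shape[0] != features.shape[0]:
            raise ValueError("attributes and features must have the same number of examples")


# Passes silently for arrays shaped like the sample usage.
check_training_arrays(np.zeros((10, 20, 2)), np.zeros((10, 3)), max_sequence_len=20)
```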


gretel_synthetics.timeseries_dgan.dgan.
find_max_consecutive_nans
(array: numpy.array) → int
Returns the maximum number of consecutive NaNs in an array.
 Parameters
array – 1d numpy array of time series per example.
 Returns
max_cons_nan – The maximum number of consecutive NaNs in a time series array.
 Return type
int
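As an illustration of the quantity being computed, here is a simple standalone version that scans a 1d series and tracks the longest run of NaNs. It is a sketch of the idea, not the library's implementation, and the name max_consecutive_nans is hypothetical.

```python
import numpy as np


def max_consecutive_nans(array) -> int:
    """Return the length of the longest run of consecutive NaNs in a 1d array."""
    longest = current = 0
    for is_nan in np.isnan(np.asarray(array, dtype=float)):
        # Extend the current run on NaN, reset it on any real value.
        current = current + 1 if is_nan else 0
        longest = max(longest, current)
    return longest


series = np.array([1.0, np.nan, np.nan, 3.0, np.nan])
print(max_consecutive_nans(series))  # 2
```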

gretel_synthetics.timeseries_dgan.dgan.
nan_linear_interpolation
(arrays: numpy.ndarray) → numpy.ndarray
Replaces all NaNs via linear interpolation.
 Parameters
arrays – 3d numpy array of continuous features, with shape (#examples, max_sequence_length, #continuous features)
 Returns
arrays – 3d numpy array where NaNs are replaced via linear interpolation.
 Return type
numpy.ndarray
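Linear interpolation of NaNs can be sketched with numpy.interp applied to each (example, feature) series in turn. This is an illustrative version and not the library's code; the function names are hypothetical. One assumption worth noting: numpy.interp holds the nearest known value for NaNs at the start or end of a series, and the library's handling of those edge cases may differ.

```python
import numpy as np


def interpolate_nans_1d(series) -> np.ndarray:
    """Fill NaNs in a 1d series by linear interpolation between known points."""
    series = np.asarray(series, dtype=float).copy()
    missing = np.isnan(series)
    if missing.all() or not missing.any():
        return series  # nothing to anchor on, or nothing to fill
    positions = np.arange(len(series))
    series[missing] = np.interp(
        positions[missing], positions[~missing], series[~missing]
    )
    return series


def interpolate_nans_3d(arrays) -> np.ndarray:
    """Apply interpolate_nans_1d along the time axis of a 3d features array."""
    out = np.asarray(arrays, dtype=float).copy()
    for i in range(out.shape[0]):      # examples
        for j in range(out.shape[2]):  # feature variables
            out[i, :, j] = interpolate_nans_1d(out[i, :, j])
    return out


filled = interpolate_nans_1d([1.0, np.nan, 3.0])
print(filled.tolist())  # [1.0, 2.0, 3.0]
```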

gretel_synthetics.timeseries_dgan.dgan.
validation_check
(array: numpy.ndarray, invalid_examples_ratio_cutoff: float = 0.5, nans_ratio_cutoff: float = 0.1, consecutive_nans_max: int = 5, consecutive_nans_ratio_cutoff: float = 0.05) → numpy.array
Checks if continuous features of examples are valid.
Returns a 1d numpy array of booleans with shape (#examples) indicating valid examples. Examples with continuous features fall into 3 categories: good, valid (fixable), and invalid (non-fixable). “Good” examples have no NaNs. “Valid” examples have a low percentage of NaNs and fewer consecutive NaNs than a threshold. “Invalid” examples are the rest, and are marked “False” in the returned array. Invalid examples are later omitted from training, and if there are too many of them, an error is raised.
 Parameters
array – 3d numpy array of continuous features with shape (#examples, max_sequence_length, #continuous features)
invalid_examples_ratio_cutoff – Error out if the invalid examples ratio in the dataset is higher than this value.
nans_ratio_cutoff – If the fraction of NaNs for any continuous feature in an example is greater than this value, the example is invalid.
consecutive_nans_max – If the maximum number of consecutive NaNs in a continuous feature is greater than this number, then that example is invalid.
consecutive_nans_ratio_cutoff – If the maximum number of consecutive NaNs in a continuous feature is greater than this ratio times the length of the example (number of samples), then the example is invalid.
 Returns
valid_examples – 1d numpy array of booleans indicating valid examples with shape (#examples).
 Return type
numpy.array
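The three per-example rules and the dataset-level cutoff described above can be combined as in the sketch below. It follows the parameter descriptions on this page but is not the library's implementation; the exact comparisons (strict versus non-strict) and the point at which the library raises its error are assumptions, and the function names are hypothetical.

```python
import numpy as np


def longest_nan_run(series) -> int:
    """Length of the longest run of consecutive NaNs in a 1d series."""
    longest = current = 0
    for is_nan in np.isnan(series):
        current = current + 1 if is_nan else 0
        longest = max(longest, current)
    return longest


def valid_examples_mask(
    array,
    invalid_examples_ratio_cutoff: float = 0.5,
    nans_ratio_cutoff: float = 0.1,
    consecutive_nans_max: int = 5,
    consecutive_nans_ratio_cutoff: float = 0.05,
) -> np.ndarray:
    """Return a boolean mask of examples whose NaNs are absent or fixable."""
    array = np.asarray(array, dtype=float)
    num_examples, seq_len, num_features = array.shape
    valid = np.ones(num_examples, dtype=bool)
    for i in range(num_examples):
        for j in range(num_features):
            series = array[i, :, j]
            run = longest_nan_run(series)
            if (
                np.isnan(series).mean() > nans_ratio_cutoff
                or run > consecutive_nans_max
                or run > consecutive_nans_ratio_cutoff * seq_len
            ):
                valid[i] = False
                break  # one bad feature is enough to reject the example
    # Assumption: raise here; the page only says an error occurs "later".
    if num_examples and (~valid).mean() > invalid_examples_ratio_cutoff:
        raise ValueError("too many invalid examples in the dataset")
    return valid


data = np.zeros((3, 20, 1))
data[1, 4:6, 0] = np.nan   # 2 consecutive NaNs: fixable under the cutoffs below
data[2, 4:10, 0] = np.nan  # 6 consecutive NaNs: too many
mask = valid_examples_mask(
    data, nans_ratio_cutoff=0.25, consecutive_nans_ratio_cutoff=0.2
)
```

In this demo the first two examples are kept and the third is rejected; examples kept with NaNs would then be repaired by interpolation before training.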