Timeseries DGAN
The Timeseries DGAN module contains a PyTorch implementation of the DoppelGANger model; see https://arxiv.org/abs/1909.13403 for a detailed description of the model.
import numpy as np
from gretel_synthetics.timeseries_dgan.dgan import DGAN
from gretel_synthetics.timeseries_dgan.config import DGANConfig

# 10,000 examples: 3 fixed attributes each, plus a 20-step, 2-variable time series.
attributes = np.random.rand(10000, 3)
features = np.random.rand(10000, 20, 2)

config = DGANConfig(
    max_sequence_len=20,
    sample_len=5,
    batch_size=1000,
    epochs=10,
)

model = DGAN(config)
# Keyword arguments are used because the train_numpy signature takes
# features first: train_numpy(features, feature_types, attributes, ...).
model.train_numpy(attributes=attributes, features=features)

synthetic_attributes, synthetic_features = model.generate_numpy(1000)
class gretel_synthetics.timeseries_dgan.config.DGANConfig(max_sequence_len: int, sample_len: int, attribute_noise_dim: int = 10, feature_noise_dim: int = 10, attribute_num_layers: int = 3, attribute_num_units: int = 100, feature_num_layers: int = 1, feature_num_units: int = 100, use_attribute_discriminator: bool = True, normalization: gretel_synthetics.timeseries_dgan.config.Normalization = Normalization.ZERO_ONE, apply_feature_scaling: bool = True, apply_example_scaling: bool = True, forget_bias: bool = False, gradient_penalty_coef: float = 10.0, attribute_gradient_penalty_coef: float = 10.0, attribute_loss_coef: float = 1.0, generator_learning_rate: float = 0.001, generator_beta1: float = 0.5, discriminator_learning_rate: float = 0.001, discriminator_beta1: float = 0.5, attribute_discriminator_learning_rate: float = 0.001, attribute_discriminator_beta1: float = 0.5, batch_size: int = 1024, epochs: int = 400, discriminator_rounds: int = 1, generator_rounds: int = 1, cuda: bool = True, mixed_precision_training: bool = False)

Config object with parameters for training a DGAN model.
- Parameters
max_sequence_len – length of time series sequences. Variable-length sequences are not supported, so all training and generated data will have sequences of the same length.
sample_len – number of time series steps to generate from each LSTM cell in DGAN; must be a divisor of max_sequence_len (see the sketch after this parameter list)
attribute_noise_dim – length of the GAN noise vectors for attribute generation
feature_noise_dim – length of GAN noise vectors for feature generation
attribute_num_layers – # of layers in the GAN discriminator network
attribute_num_units – # of units per layer in the GAN discriminator network
feature_num_layers – # of LSTM layers in the GAN generator network
feature_num_units – # of units per layer in the GAN generator network
use_attribute_discriminator – use a separate discriminator that only sees attributes; helps DGAN match attribute distributions, Default: True
normalization – default normalization for continuous variables, used when metadata output is not specified during DGAN initialization
apply_feature_scaling – scale each continuous variable to [0,1] or [-1,1] (based on the normalization param) before training and rescale to the original range during generation; if False, training data must already be within the range and DGAN will only generate values in [0,1] or [-1,1], Default: True
apply_example_scaling – compute midpoint and half-range (equivalent to min/max) for each time series variable and include these as additional attributes that are generated; this provides better support for time series with highly variable ranges, e.g., in network data, a dial-up connection has bandwidth usage in [1kb, 10kb], while a fiber connection is in [100mb, 1gb], Default: True
forget_bias – initialize forget gate bias parameters to 1 in LSTM layers; when True, initialization matches the TF1 LSTMCell behavior, otherwise the default PyTorch initialization is used, Default: False
gradient_penalty_coef – coefficient for gradient penalty in Wasserstein loss, Default: 10.0
attribute_gradient_penalty_coef – coefficient for gradient penalty in Wasserstein loss for the attribute discriminator, Default: 10.0
attribute_loss_coef – coefficient for the attribute discriminator loss relative to the standard discriminator on attributes and features; higher values should encourage DGAN to learn attribute distributions, Default: 1.0
generator_learning_rate – learning rate for Adam optimizer
generator_beta1 – Adam param for exponential decay of 1st moment
discriminator_learning_rate – learning rate for Adam optimizer
discriminator_beta1 – Adam param for exponential decay of 1st moment
attribute_discriminator_learning_rate – learning rate for Adam optimizer
attribute_discriminator_beta1 – Adam param for exponential decay of 1st moment
batch_size – # of examples used in batches, for both training and generation
epochs – # of epochs to train the model
discriminator_rounds – training steps for the discriminator in each batch
generator_rounds – training steps for the generator in each batch
cuda – use GPU if available
mixed_precision_training – enable automatic mixed precision during training to reduce memory cost, bandwidth, and time by identifying the steps that require full precision and using 32-bit floating point only for those steps, while using 16-bit floating point everywhere else.
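As noted for sample_len above, the value must evenly divide max_sequence_len. A minimal sketch of enumerating the valid choices for a given sequence length:

max_sequence_len = 20

# Each LSTM cell emits sample_len time points, so the generator runs
# max_sequence_len / sample_len steps; sample_len must divide evenly.
valid_sample_lens = [s for s in range(1, max_sequence_len + 1) if max_sequence_len % s == 0]
print(valid_sample_lens)  # [1, 2, 4, 5, 10, 20]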
to_dict()

Return dictionary representation of DGANConfig.
- Returns
Dictionary of member variables, usable to initialize a new config object, e.g., DGANConfig(**config.to_dict())
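A short round-trip sketch of the usage described above:

from gretel_synthetics.timeseries_dgan.config import DGANConfig

config = DGANConfig(max_sequence_len=20, sample_len=5)

# Serialize to a plain dict, then rebuild an equivalent config from it.
config_dict = config.to_dict()
restored = DGANConfig(**config_dict)
assert restored.max_sequence_len == config.max_sequence_len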
class gretel_synthetics.timeseries_dgan.config.DfStyle

Supported styles for parsing pandas DataFrames.
See train_dataframe method in dgan.py for details.
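As an illustration (the column names here are hypothetical), a minimal "wide" style frame uses one row per example and one column per time point, and is restricted to a single feature variable:

import numpy as np
import pandas as pd

# "Wide" style: 100 examples, 20 time points, one feature variable.
wide_df = pd.DataFrame(
    np.random.rand(100, 20),
    columns=[f"t{i}" for i in range(20)],
)

A "long" style frame instead uses one row per time point with example id and time columns; see the train_dataframe sketch later on this page.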
class gretel_synthetics.timeseries_dgan.config.Normalization

Normalization types for continuous variables.
Determines if a sigmoid (ZERO_ONE) or tanh (MINUSONE_ONE) activation is used for the output layers in the generation network.
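For example, a minimal sketch selecting the tanh variant through the config:

from gretel_synthetics.timeseries_dgan.config import DGANConfig, Normalization

# Continuous variables are scaled to [-1, 1] and the generator output
# layers use tanh instead of the default sigmoid (ZERO_ONE).
config = DGANConfig(
    max_sequence_len=20,
    sample_len=5,
    normalization=Normalization.MINUSONE_ONE,
)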
class gretel_synthetics.timeseries_dgan.config.OutputType

Supported variable types.
Determines internal representation of variables and output layers in generation network.
PyTorch implementation of DoppelGANger, from https://arxiv.org/abs/1909.13403.
Based on the TensorFlow 1 code in https://github.com/fjxmlzn/DoppelGANger.
DoppelGANger is a generative adversarial network (GAN) model for time series. It supports multi-variate time series (referred to as features) and fixed variables for each time series (attributes). The combination of attribute values and a sequence of feature values is one example. Once trained, the model can generate novel examples that exhibit the same temporal correlations as seen in the training data. See https://arxiv.org/abs/1909.13403 for additional details on the model.
As a reference for terminology, consider open-high-low-close (OHLC) data from stock markets. Each stock is an example, with fixed attributes such as exchange, sector, and country. The features, or time series, consist of the open, high, low, and closing prices for each time interval (e.g., daily). After being trained on historical data, the model can generate new, hypothetical stocks and their price behavior over the training time range.
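A sketch of how that terminology maps onto the arrays train_numpy expects; the shapes and the random placeholder data here are illustrative assumptions:

import numpy as np

# 500 stocks (examples), 252 trading days per sequence,
# 4 feature variables per time step: open, high, low, close.
n_stocks, seq_len = 500, 252

# One row of fixed attributes per stock, e.g. exchange, sector, and
# country encoded as 0-indexed discrete values.
attributes = np.random.randint(0, 3, size=(n_stocks, 3)).astype(float)

# One (seq_len x 4) price series per stock.
features = np.random.rand(n_stocks, seq_len, 4)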
Sample usage:
import numpy as np
from gretel_synthetics.timeseries_dgan.dgan import DGAN
from gretel_synthetics.timeseries_dgan.config import DGANConfig

attributes = np.random.rand(10000, 3)
features = np.random.rand(10000, 20, 2)

config = DGANConfig(
    max_sequence_len=20,
    sample_len=5,
    batch_size=1000,
    epochs=10,
)

model = DGAN(config)
# Keyword arguments, since train_numpy takes features as its first parameter.
model.train_numpy(attributes=attributes, features=features)

synthetic_attributes, synthetic_features = model.generate_numpy(1000)
class gretel_synthetics.timeseries_dgan.dgan.DGAN(config: gretel_synthetics.timeseries_dgan.config.DGANConfig, attribute_outputs: Optional[List[gretel_synthetics.timeseries_dgan.transformations.Output]] = None, feature_outputs: Optional[List[gretel_synthetics.timeseries_dgan.transformations.Output]] = None)

DoppelGANger model.

Interface for training the model and generating data based on the configuration in a DGANConfig instance.
DoppelGANger uses a specific internal representation for data which is hidden from the user in the public interface. When using the train_numpy() and train_dataframe() functions, continuous variables should be in the original space and discrete variables represented as [0.0, 1.0, 2.0, …]. The generate_numpy() and generate_dataframe() functions return data in this same original space. In standard usage, the detailed transformation info in attribute_outputs and feature_outputs is not needed; it will be created automatically when a train* function is called with data.
If more control is needed and you want to use the normalized values and one-hot encoding directly, use the _train() and _generate() functions. transformations.py contains internal helper functions for working with the Output metadata instances and converting data to and from the internal representation. To dive even deeper into the model structure, see torch_modules.py, which contains the torch implementations of the networks used in DGAN. As internal details, transformations.py and torch_modules.py are not part of the public interface and may change at any time without notice.
__init__(config: gretel_synthetics.timeseries_dgan.config.DGANConfig, attribute_outputs: Optional[List[gretel_synthetics.timeseries_dgan.transformations.Output]] = None, feature_outputs: Optional[List[gretel_synthetics.timeseries_dgan.transformations.Output]] = None)

Create a DoppelGANger model.
- Parameters
config – DGANConfig containing model parameters
attribute_outputs – custom metadata for attributes, not needed for standard usage
feature_outputs – custom metadata for features, not needed for standard usage
generate_dataframe(n: Optional[int] = None, attribute_noise: Optional[torch.Tensor] = None, feature_noise: Optional[torch.Tensor] = None) → pandas.core.frame.DataFrame

Generate synthetic data from the DGAN model.
Once trained, a DGAN model can generate arbitrary amounts of synthetic data by sampling from the noise distributions. Specify either the number of records to generate, or the specific noise vectors to use.
- Parameters
n – number of examples to generate
attribute_noise – noise vectors to create synthetic data
feature_noise – noise vectors to create synthetic data
- Returns
pandas DataFrame in the same format used in the train_dataframe call
generate_numpy(n: Optional[int] = None, attribute_noise: Optional[torch.Tensor] = None, feature_noise: Optional[torch.Tensor] = None) → Tuple[Optional[numpy.ndarray], numpy.ndarray]

Generate synthetic data from the DGAN model.
Once trained, a DGAN model can generate arbitrary amounts of synthetic data by sampling from the noise distributions. Specify either the number of records to generate, or the specific noise vectors to use.
- Parameters
n – number of examples to generate
attribute_noise – noise vectors to create synthetic data
feature_noise – noise vectors to create synthetic data
- Returns
Tuple of attributes and features as numpy arrays.
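Continuing the sample usage above, a sketch of both calling styles; the noise tensor shapes are assumptions inferred from the DGANConfig parameters, not a documented contract:

import torch

# Either request a number of examples directly...
attrs, feats = model.generate_numpy(100)

# ...or supply explicit noise vectors. Assumed shapes: one vector of
# length attribute_noise_dim per example, and one feature noise vector
# per LSTM step (max_sequence_len // sample_len steps per sequence).
attribute_noise = torch.randn(100, config.attribute_noise_dim)
feature_noise = torch.randn(
    100, config.max_sequence_len // config.sample_len, config.feature_noise_dim
)
attrs, feats = model.generate_numpy(
    attribute_noise=attribute_noise, feature_noise=feature_noise
)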
classmethod load(file_name: str, **kwargs) → gretel_synthetics.timeseries_dgan.dgan.DGAN

Load a DGAN model instance from a file.
- Parameters
file_name – location to load from
kwargs – additional parameters passed to torch.load, for example, use map_location=torch.device("cpu") to load a model saved for GPU on a machine without cuda
- Returns
DGAN model instance
save(file_name: str, **kwargs)

Save the DGAN model to a file.
- Parameters
file_name – location to save serialized model
kwargs – additional parameters passed to torch.save
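A round-trip sketch using a trained model (the file name is illustrative):

import torch
from gretel_synthetics.timeseries_dgan.dgan import DGAN

# Persist a trained model to disk...
model.save("dgan_model.pt")

# ...and restore it later; map_location lets a model trained on GPU
# be loaded on a CPU-only machine.
model = DGAN.load("dgan_model.pt", map_location=torch.device("cpu"))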
train_dataframe(df: pandas.core.frame.DataFrame, attribute_columns: Optional[List[str]] = None, feature_columns: Optional[List[str]] = None, example_id_column: Optional[str] = None, time_column: Optional[str] = None, discrete_columns: Optional[List[str]] = None, df_style: gretel_synthetics.timeseries_dgan.config.DfStyle = DfStyle.WIDE)

Train DGAN model on data in a pandas DataFrame.
Training data can be in either "wide" or "long" format. "Wide" format uses one row for each example, with zero or more attribute columns and one column per time point in the time series; it is restricted to a single feature variable. "Long" format uses one row per time point and supports multiple feature variables, using an additional example id column to split the rows into examples and a time column to sort them.
- Parameters
df – DataFrame of training data
attribute_columns – list of column names containing attributes, if None, no attribute columns are used, Default: None
feature_columns – list of column names containing features, if None all non-attribute columns are used, Default: None
example_id_column – column name used to split “long” format data frame into multiple examples, if None, data is treated as a single example
time_column – column name used to sort “long” format data frame, if None, data frame order of rows/time points is used
discrete_columns – column names (either attributes or features) to model as discrete with one-hot encoding; discrete values must be integers in [0,1,2,3,…]
df_style – str enum of “wide” or “long” indicating format of the DataFrame
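A sketch of training on a hypothetical "long" format frame; the column names are illustrative, and DfStyle.LONG is assumed to be the long-format counterpart of DfStyle.WIDE:

import numpy as np
import pandas as pd
from gretel_synthetics.timeseries_dgan.dgan import DGAN
from gretel_synthetics.timeseries_dgan.config import DGANConfig, DfStyle

# 100 examples x 20 time points, one discrete attribute ("category",
# constant within each example) and two continuous features.
df = pd.DataFrame({
    "example_id": np.repeat(np.arange(100), 20),
    "time": np.tile(np.arange(20), 100),
    "category": np.repeat(np.random.randint(0, 3, 100), 20),
    "value_a": np.random.rand(2000),
    "value_b": np.random.rand(2000),
})

model = DGAN(DGANConfig(max_sequence_len=20, sample_len=5, epochs=10))
model.train_dataframe(
    df,
    attribute_columns=["category"],
    feature_columns=["value_a", "value_b"],
    example_id_column="example_id",
    time_column="time",
    discrete_columns=["category"],
    df_style=DfStyle.LONG,
)
synthetic_df = model.generate_dataframe(100)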
train_numpy(features: numpy.ndarray, feature_types: Optional[List[gretel_synthetics.timeseries_dgan.config.OutputType]] = None, attributes: Optional[numpy.ndarray] = None, attribute_types: Optional[List[gretel_synthetics.timeseries_dgan.config.OutputType]] = None)

Train DGAN model on data in numpy arrays.
Training data is passed in 2 numpy arrays, one for attributes (2d) and one for features (3d). This data should be in the original space and is not transformed. If the data is already transformed into the internal DGAN representation (continuous variable scaled to [0,1] or [-1,1] and discrete variables one-hot encoded), use the internal _train() function instead of train_numpy(), or specify apply_feature_scaling=False in the DGANConfig.
In standard usage, attribute_types and feature_types should be provided on the first call to train_numpy() to correctly set up the model structure. If not specified, the default is to assume all variables are continuous. If output metadata was specified when the instance was initialized, or if a train* function was previously called, then attribute_types and feature_types are not needed.
- Parameters
features – 3-d numpy array of time series features for the training, size is (# of training examples) X max_sequence_len X (# of features)
feature_types (Optional) – Specification of Discrete or Continuous type for each variable of the features. Discrete features should be 0-indexed (not one-hot encoded). If None, assume all features are continuous. Ignored if the model was already built, either by passing output params at initialization or because a train* function was called previously.
attributes (Optional) – 2-d numpy array of attributes for the training examples, size is (# of training examples) X (# of attributes)
attribute_types (Optional) – Specification of Discrete or Continuous type for each variable of the attributes. Discrete attributes should be 0-indexed (not one-hot encoded). If None, assume all attributes are continuous. Ignored if the model was already built, either by passing output params at initialization or because train_ was called previously.
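A sketch of the standard usage described above, mixing continuous and discrete variable types (the arrays hold random placeholder data):

import numpy as np
from gretel_synthetics.timeseries_dgan.dgan import DGAN
from gretel_synthetics.timeseries_dgan.config import DGANConfig, OutputType

n = 1000
# Two attributes per example: a continuous value and a 0-indexed
# discrete code with 4 categories.
attributes = np.stack(
    [np.random.rand(n), np.random.randint(0, 4, n).astype(float)], axis=1
)
features = np.random.rand(n, 20, 2)  # all features continuous

model = DGAN(DGANConfig(max_sequence_len=20, sample_len=5, epochs=10))
model.train_numpy(
    features=features,
    attributes=attributes,
    attribute_types=[OutputType.CONTINUOUS, OutputType.DISCRETE],
    feature_types=[OutputType.CONTINUOUS, OutputType.CONTINUOUS],
)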