Config
This module provides a set of dataclasses that can be used to hold all necessary configuration parameters for training a model and generating data.
For example usage please see our Jupyter Notebooks.
- class gretel_synthetics.config.BaseConfig(input_data_path: str | None = None, validation_split: bool = True, checkpoint_dir: str | None = None, training_data_path: str | None = None, field_delimiter: str | None = None, field_delimiter_token: str = '<d>', model_type: str | None = None, max_lines: int = 0, overwrite: bool = False, epoch_callback: Callable | None = None, max_training_time_seconds: int | None = None, vocab_size: int = 20000, character_coverage: float = 1.0, pretrain_sentence_count: int = 1000000, max_line_len: int = 2048)
This class should not be used directly; engine-specific classes should be derived from it.
- as_dict()
Serialize the config attrs to a dict
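For example, a config's attributes can be snapshotted to JSON for later inspection. A minimal sketch, assuming illustrative file paths and a TensorFlowConfig:

    import json

    from gretel_synthetics.config import TensorFlowConfig

    # Illustrative paths; substitute your own input data and checkpoint locations
    config = TensorFlowConfig(
        input_data_path="data.csv",
        checkpoint_dir="checkpoints",
    )

    # as_dict() returns a plain dict of the config attributes
    with open("config_snapshot.json", "w") as fh:
        json.dump(config.as_dict(), fh, indent=2)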
- checkpoint_dir: str = None
Directory where model data will be stored, user provided.
- epoch_callback: Callable | None = None
Callback to be invoked at the end of each epoch. It will be invoked with an EpochState instance as its only parameter. Note that the callback is deleted when save_model_params is called; we do not attempt to serialize it to JSON.
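A minimal sketch of an epoch callback. The attribute names read from the EpochState instance (epoch, loss) are assumptions for illustration; check the EpochState definition in your installed version:

    from gretel_synthetics.config import TensorFlowConfig

    def log_epoch(state):
        # `state` is an EpochState instance; the `epoch` and `loss`
        # attribute names below are assumed for illustration
        print(f"finished epoch {state.epoch}, loss={state.loss}")

    config = TensorFlowConfig(
        input_data_path="data.csv",    # illustrative path
        checkpoint_dir="checkpoints",  # illustrative path
        epoch_callback=log_epoch,
    )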
- field_delimiter: str | None = None
If the input data is structured, you may specify a field delimiter which can be used to split the generated text into a list of strings. For more detail please see the GenText class in the generate.py module.
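A hedged sketch of how the delimiter lets generated lines be split back into fields, assuming comma-delimited training data; the num_lines keyword and the values_as_list() helper on GenText are assumptions based on typical usage of the generate module:

    from gretel_synthetics.config import TensorFlowConfig
    from gretel_synthetics.generate import generate_text

    config = TensorFlowConfig(
        input_data_path="records.csv",  # illustrative structured (CSV) input
        checkpoint_dir="checkpoints",   # illustrative path
        field_delimiter=",",
    )

    # After training, each generated line can be split back into its fields
    for line in generate_text(config, num_lines=5):
        print(line.values_as_list())  # assumed GenText helper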
- field_delimiter_token: str = '<d>'
Depending on the tokenizer used, a special token can be used to represent characters. For tokenizers like SentencePiece that support this, we will replace the field delimiter character with this token to provide better learning and generation. If the tokenizer used does not support custom tokens, this value will be ignored.
- abstract get_generator_class() → None
This must be implemented by all specific configs. It should return the class that should be used as the Generator for creating records.
- abstract get_training_callable() → Callable
This must be implemented by all specific configs. It should return a callable that should be used as the entrypoint for training a model.
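As a sketch, an engine-specific config might derive from BaseConfig and satisfy these two abstract methods as follows; the engine classes referenced here are hypothetical placeholders, not part of the package:

    from dataclasses import dataclass
    from typing import Callable

    from gretel_synthetics.config import BaseConfig

    @dataclass
    class MyEngineConfig(BaseConfig):
        """Hypothetical engine-specific config."""

        def get_generator_class(self):
            # Return the class used as the Generator for creating records
            from my_engine.generate import MyEngineGenerator  # hypothetical
            return MyEngineGenerator

        def get_training_callable(self) -> Callable:
            # Return the entrypoint used to train a model for this engine
            from my_engine.train import train_my_engine  # hypothetical
            return train_my_engine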
- gpu_check()
Optionally do a GPU check and warn if a GPU is not available. If not overridden, this does nothing.
- input_data_path: str = None
Path to raw training data, user provided.
- max_lines: int = 0
The maximum number of lines to utilize from the raw input data.
- max_training_time_seconds: int | None = None
If set, training will cease after the number of seconds specified elapses. This timeout will be evaluated after each epoch.
- model_type: str = None
A string version of the model config class. This is used to keep track of what underlying engine was used when writing the config to a file. This will be automatically updated during construction.
- overwrite: bool = False
Set to True to automatically overwrite previously saved model checkpoints. If False, the trainer will generate an error if checkpoints exist in the model directory. Default is False.
- training_data_path: str = None
Where annotated and tokenized training data will be stored. This attr will be modified during construction.
- validation_split: bool = True
Use a fraction of the training data as validation data. Use of a validation set is recommended as it helps prevent over-fitting and memorization. When enabled, 20% of data will be used for validation.
- gretel_synthetics.config.CONFIG_MAP = {'TensorFlowConfig': <class 'gretel_synthetics.config.TensorFlowConfig'>}
A mapping of configuration subclass string names to their actual classes. This can be used to re-instantiate a config from a serialized state.
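A minimal sketch of re-instantiating a config from a serialized state, assuming a JSON snapshot of the config attributes (for example, one written from as_dict()); whether every serialized attribute can be passed straight back to the constructor may vary by version:

    import json

    from gretel_synthetics.config import CONFIG_MAP

    # Illustrative path to a previously serialized config
    with open("config_snapshot.json") as fh:
        params = json.load(fh)

    # `model_type` records which config subclass was originally used
    config_cls = CONFIG_MAP[params.pop("model_type")]
    config = config_cls(**params)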
- gretel_synthetics.config.LocalConfig
alias of TensorFlowConfig
- class gretel_synthetics.config.TensorFlowConfig(input_data_path: str | None = None, validation_split: bool = True, checkpoint_dir: str | None = None, training_data_path: str | None = None, field_delimiter: str | None = None, field_delimiter_token: str = '<d>', model_type: str | None = None, max_lines: int = 0, overwrite: bool = False, epoch_callback: Callable | None = None, max_training_time_seconds: int | None = None, vocab_size: int = 20000, character_coverage: float = 1.0, pretrain_sentence_count: int = 1000000, max_line_len: int = 2048, epochs: int = 100, early_stopping: bool = True, early_stopping_patience: int = 5, best_model_metric: str | None = None, early_stopping_min_delta: float = 0.001, batch_size: int = 64, buffer_size: int = 10000, seq_length: int = 100, embedding_dim: int = 256, rnn_units: int = 256, learning_rate: float = 0.01, dropout_rate: float = 0.2, rnn_initializer: str = 'glorot_uniform', dp: bool = False, dp_noise_multiplier: float = 0.1, dp_l2_norm_clip: float = 3.0, dp_microbatches: int = 1, gen_temp: float = 1.0, gen_chars: int = 0, gen_lines: int = 1000, predict_batch_size: int = 64, reset_states: bool = True, save_all_checkpoints: bool = False, save_best_model: bool = True)
TensorFlow config that contains all of the main parameters for training a model and generating data.
- Parameters:
epochs (optional) – Number of epochs to train the model. An epoch is an iteration over the entire training set provided. For production use cases, 15-50 epochs are recommended. The default is 100 and is intentionally set extra high; by default, early_stopping is also enabled and will stop training once the model is no longer improving.
early_stopping (optional) – If enabled, training will automatically deduce when the model is no longer improving and terminate early. Default is True.
early_stopping_patience (optional) – Number of epochs to wait for when there is no improvement in the model. After this number of epochs, training will terminate. Default is 5.
best_model_metric (optional) – The metric to use to track when a model is no longer improving. Alternative options are “val_acc” or “acc”. An error will be raised if a valid value is not specified.
early_stopping_min_delta (optional) – Minimum change in the monitored metric to qualify as an improvement, i.e. an absolute change of less than min_delta will count as no improvement. Default is 0.001.
batch_size (optional) – Number of samples per gradient update. Using larger batch sizes can help make more efficient use of CPU/GPU parallelization, at the cost of memory. If unspecified, batch_size will default to 64.
buffer_size (optional) – Buffer size which is used to shuffle elements during training. Default size is 10000.
seq_length (optional) – The maximum length sentence we want for a single training input in characters. Note that this setting is different than max_line_len, as seq_length simply affects the length of the training examples passed to the neural network to predict the next token. Default size is 100.
embedding_dim (optional) – Vector size for the lookup table used in the neural network Embedding layer that maps each character's number to a dense vector. Default size is 256.
rnn_units (optional) – Positive integer, dimensionality of the output space for LSTM layers. Default size is 256.
.dropout_rate (optional) – Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs. Using a dropout can help to prevent overfitting by ignoring randomly selected neurons during training. 0.2 (20%) is often used as a good compromise between retaining model accuracy and preventing overfitting. Default is 0.2.
rnn_initializer (optional) – Initializer for the kernel weights matrix, used for the linear transformation of the inputs. Default is glorot_uniform.
dp (optional) – If True, train the model with differential privacy enabled. This setting provides assurances that the model will encode general patterns in the data rather than facts about specific training examples. These additional guarantees can usefully strengthen the protections offered for sensitive data and content, at a small loss in model accuracy and synthetic data quality. The differential privacy epsilon and delta values will be printed when training completes. Default is False.
learning_rate (optional) – The higher the learning rate, the more that each update during training matters. Note: when training with differential privacy enabled, if the updates are noisy (such as when the additive noise is large compared to the clipping threshold), a low learning rate may help with training. Default is 0.01.
dp_noise_multiplier (optional) – The amount of noise sampled and added to gradients during training. Generally, more noise results in better privacy, at the expense of model accuracy. Default is 0.1.
dp_l2_norm_clip (optional) – The maximum Euclidean (L2) norm of each gradient that is applied to update model parameters. This hyperparameter bounds the optimizer’s sensitivity to individual training points. Default is 3.0.
dp_microbatches (optional) – Each batch of data is split into smaller units called micro-batches. Computational overhead can be reduced by increasing the size of micro-batches to include more than one training example. The number of micro-batches should divide evenly into the overall batch_size. Default is 1.
gen_temp (optional) – Controls the randomness of predictions by scaling the logits before applying softmax. Low temperatures result in more predictable text; higher temperatures result in more surprising text. Experiment to find the best setting. Default is 1.0.
gen_chars (optional) – Maximum number of characters to generate per line. Default is 0 (no limit).
gen_lines (optional) – Maximum number of text lines to generate. This setting is used by generate_text and the optional line_validator to make sure that all lines created by the model pass validation. Default is 1000.
predict_batch_size (optional) – How many words to generate in parallel. Higher values may result in increased throughput. The default of 64 should provide reasonable performance for most users.
reset_states (optional) – Resetting the RNN model state between each record created guarantees more consistent record creation over time, at the expense of model accuracy. Default is True.
save_all_checkpoints (optional) – Set to True to save a model checkpoint as each is created, which can be useful for optimal model selection. Set to False to save only the latest checkpoint. Default is False.
save_best_model (optional) – Track the best version of the model (checkpoint). If save_all_checkpoints is disabled, then the saved model will be overwritten by newer ones only if they are better. Defaults to True.
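A minimal construction sketch, enabling differential privacy and capping training time; the file paths are illustrative:

    from gretel_synthetics.config import TensorFlowConfig

    config = TensorFlowConfig(
        input_data_path="customer_records.csv",  # illustrative path
        checkpoint_dir="tf_checkpoints",         # illustrative path
        field_delimiter=",",
        epochs=30,
        early_stopping=True,
        dp=True,                         # train with differential privacy
        dp_noise_multiplier=0.1,
        dp_l2_norm_clip=3.0,
        learning_rate=0.001,             # a lower learning rate often helps DP training
        max_training_time_seconds=3600,  # stop training after roughly one hour
    )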
- get_generator_class()
This must be implemented by all specific configs. It should return the class that should be used as the Generator for creating records.
- get_training_callable()
This must be implemented by all specific configs. It should return a callable that should be used as the entrypoint for training a model.
- gpu_check()
Optionally do a GPU check and warn if a GPU is not available. If not overridden, this does nothing.
- gretel_synthetics.config.config_from_model_dir(model_dir: str) → BaseConfig
Factory that takes a known model directory and returns a config class instance for it. We automatically try to detect the correct BaseConfig sub-class to use based on the saved model params.
If there is no model_type param in the saved config, we assume that the model was saved using an earlier version of the package and will instantiate a TensorFlowConfig.
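A minimal usage sketch, assuming a directory that holds previously saved model parameters:

    from gretel_synthetics.config import config_from_model_dir

    # Illustrative path to a directory containing a previously trained model
    config = config_from_model_dir("tf_checkpoints")

    # The factory resolves the concrete subclass (e.g. TensorFlowConfig)
    print(type(config).__name__)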