Generate¶
Abstract module for generating data. The generate_text
function is the primary entrypoint for
creating text.
-
class
gretel_synthetics.generate.
BaseGenerator
¶ Do not use directly.
Specific generation modules should have a subclass of this ABC that implements the core logic for generating data
-
class
gretel_synthetics.generate.
GenText
(valid: bool = None, text: str = None, explain: str = None, delimiter: str = None)¶
-
gretel_synthetics.generate.
PredString
¶ alias of
gretel_synthetics.generate.pred_string
-
class
gretel_synthetics.generate.
SeedingGenerator
(config: None, *, seed_list: List[str], line_validator: Optional[Callable] = None, max_invalid: int = 1000)¶ A single-threaded line / text generator specifically for use with a list of seeds. This also exposes the Settings class back to the caller so the actual seed list can be directly accessed, which controls the underlying progression of the main text generator. This is useful when you need to manipulate the actual seed list as data is being generated.
-
class
gretel_synthetics.generate.
Settings
(config: None, start_string: Union[str, List[str], None] = None, multi_seed: bool = False, line_validator: Optional[Callable] = None, max_invalid: int = 1000, tokenizer: gretel_synthetics.tokenizers.BaseTokenizer = None, generator: gretel_synthetics.generate.BaseGenerator = None)¶ Do not use directly.
Arguments for a generator generating lines of text.
This class contains basic settings for a generation process. It is separated from the Generator class for ensuring reliable serializability without an excess amount of code tied to it.
This class will also take a provided start string and validate that it can be utilized for text generation. If the start_string is something other than the default, we have to do a couple of things:
- If the config utilizes a field delimiter, the start_string MUST end with that delimiter
- Convert the user-facing delim char into the special delim token specified in the config
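The two start-string checks described above can be sketched as follows. This is a minimal illustration, not the library's code: the function name `normalize_start_string` and the token value `"<d>"` are assumptions made for the example, and the real delimiter and token come from your config.

```python
from typing import Optional

# Hypothetical values for illustration only; the real ones live in the config.
FIELD_DELIMITER = ","
DELIM_TOKEN = "<d>"


def normalize_start_string(
    start_string: str,
    field_delimiter: Optional[str] = FIELD_DELIMITER,
    delim_token: str = DELIM_TOKEN,
) -> str:
    """Validate a seed and swap the user-facing delimiter for the token."""
    if field_delimiter is None:
        # No delimiter configured: the seed is used unchanged.
        return start_string
    if not start_string.endswith(field_delimiter):
        raise ValueError("start_string must end with the field delimiter")
    # Replace every user-facing delimiter with the special token.
    return start_string.replace(field_delimiter, delim_token)


seed = normalize_start_string("foo,bar,")
```

A seed such as "foo,bar" (no trailing delimiter) is rejected in this sketch, which mirrors the "MUST end with that delimiter" rule.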
-
class
gretel_synthetics.generate.
gen_text
(valid: bool = None, text: str = None, explain: str = None, delimiter: str = None)¶ A record that is yielded from the Generator.generate_next generator.
-
valid
¶ True, False, or None. If the line passed a validation function, then this will be True. If the validation function raised an exception, then this will automatically be set to False. If no validation function is used, then this value will be None.
-
text
¶ The actual record as a string
-
explain
¶ A string that describes why a record failed validation. This is the string representation of the Exception that is thrown in a validation function. This will only be set if validation fails; otherwise it will be None.
-
delimiter
¶ If the generated text is a column/field-based record, this will hold the delimiter used to separate the fields from each other.
-
as_dict
() → dict¶ Serialize the generated record to a dictionary
-
values_as_list
() → Optional[List[str]]¶ Attempt to split the generated text on the provided delimiter
- Returns
A list of values that are separated by the object’s delimiter, or None if there is no delimiter in the text
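To make the record fields and the two helper methods concrete, here is a small stand-in written as a dataclass. It is a sketch only: `RecordSketch` is a hypothetical name, not the library's class, and the handling of a missing delimiter is an assumption based on the description above.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class RecordSketch:
    """Stand-in for the documented record; not the library's class."""

    valid: Optional[bool] = None
    text: Optional[str] = None
    explain: Optional[str] = None
    delimiter: Optional[str] = None

    def as_dict(self) -> dict:
        # Serialize all four fields to a plain dictionary.
        return asdict(self)

    def values_as_list(self) -> Optional[List[str]]:
        # Assumption: "no delimiter" means the delimiter is unset or
        # does not occur in the text.
        if self.text is None or not self.delimiter:
            return None
        if self.delimiter not in self.text:
            return None
        return self.text.split(self.delimiter)


record = RecordSketch(valid=True, text="a,b,c", delimiter=",")
fields = record.values_as_list()
```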
-
-
gretel_synthetics.generate.
generate_text
(config: None, start_string: Union[str, List[str], None] = None, line_validator: Optional[Callable] = None, max_invalid: int = 1000, num_lines: Optional[int] = None, parallelism: int = 0) → Iterator[gretel_synthetics.generate.GenText]¶ A generator that will load a model and start creating records.
- Parameters
config – A configuration object, which you must have created previously
start_string –
A prefix string that is used to seed the record generation. By default we use a newline, but you may substitute any initial value here, which will influence how the generator predicts what to generate. If you are working with a field delimiter and you want to seed more than one column value, then you MUST use the field delimiter specified in your config. An example would be “foo,bar,baz,”. Also, if using a field delimiter, the string MUST end with the delimiter value.
Note
This param may also be a list of prefixes. If a list is provided, the generator will attempt to create exactly 1 record for each seed in the list. The num_lines param will be implicitly set to the size of the list, and records will be created at a 1:1 ratio between prefix strings and valid records.
line_validator – An optional callback validator function that takes the raw string value from the generator as a single argument. This validator can execute arbitrary code with the raw string value. The validator function may return a bool to indicate line validity. This boolean value will be set on the yielded gen_text object. Additionally, if the validator throws an exception, the gen_text object will be marked as having failed validation. If the validator returns None, we will assume successful validation.
max_invalid – If using a line_validator, this is the maximum number of invalid lines to generate. If the number of invalid lines exceeds this value, a RuntimeError will be raised.
num_lines –
If not None, this will override the gen_lines value that is provided in the config.
Note
If start_string is a list, this value will be set to the length of that list and any other values for the param are ignored.
parallelism – The number of concurrent workers to use. The default in the signature above is 0. A value of 1 disables parallelization, while a non-positive value means “number of CPUs + x” (i.e., use 0 for as many workers as there are CPUs). A floating-point value is interpreted as a fraction of the available CPUs, rounded down.
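The parallelism rules above can be read as a small mapping from the argument to a worker count. The helper below is a hypothetical sketch of that reading, not the library's implementation; the name `resolve_workers`, the `cpu_count` parameter, and the floor of one worker are all assumptions.

```python
import math
import os
from typing import Optional, Union


def resolve_workers(
    parallelism: Union[int, float], cpu_count: Optional[int] = None
) -> int:
    """Turn a parallelism setting into a concrete number of workers."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    if isinstance(parallelism, float):
        # A float is a fraction of the available CPUs, rounded down.
        workers = math.floor(parallelism * cpu_count)
    elif parallelism <= 0:
        # Non-positive x means "number of CPUs + x".
        workers = cpu_count + parallelism
    else:
        # A positive integer is taken literally; 1 means no parallelization.
        workers = parallelism
    # Assumption: never go below a single worker.
    return max(1, workers)
```

Passing `cpu_count` explicitly keeps the sketch deterministic; for example, on a hypothetical 8-CPU machine, `-2` would resolve to 6 workers and `0.5` to 4.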
Simple validator example:
    def my_validator(raw_line: str):
        parts = raw_line.split(',')
        if len(parts) != 5:
            raise Exception('record does not have 5 fields')
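The way a validator's outcome is described as mapping onto the valid and explain fields can be sketched like this. The wrapper `run_validator` is a hypothetical helper written for illustration and is not part of the library; it only encodes the rules stated for line_validator above (a bool is used as-is, None counts as success, an exception counts as failure).

```python
from typing import Callable, Optional, Tuple


def my_validator(raw_line: str):
    parts = raw_line.split(",")
    if len(parts) != 5:
        raise Exception("record does not have 5 fields")


def run_validator(
    raw_line: str, validator: Optional[Callable] = None
) -> Tuple[Optional[bool], Optional[str]]:
    """Return (valid, explain) for one raw line."""
    if validator is None:
        # No validator in use: validity is unknown.
        return None, None
    try:
        result = validator(raw_line)
    except Exception as err:
        # A raised exception marks the line invalid; its text is the reason.
        return False, str(err)
    if result is None:
        # Returning None is treated as successful validation.
        return True, None
    return bool(result), None


good = run_validator("a,b,c,d,e", my_validator)
bad = run_validator("a,b", my_validator)
```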
Note
gen_lines from the config is important for this function. If a line validator is not provided, each line will count towards the total number of generated lines. When the total number of lines generated is >= gen_lines, we stop. If a line validator is provided, only valid lines will count towards the total number of lines generated. When the total number of valid lines generated is >= gen_lines, we stop.
Note
gen_chars controls the maximum number of characters a single generated line can have. If a newline character has not been generated before reaching this number, then the line will be returned. For example, if gen_chars is 180 and a newline has not been generated, once 180 chars have been created, the line will be returned no matter what. If this value is 0, then each line will generate until a newline is observed.
- Yields
A GenText object for each record that is generated. The generator will stop after the max number of lines is reached (based on your config).
- Raises
RuntimeError – if more than max_invalid invalid lines are generated.
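The stopping rules for gen_lines and max_invalid described in the notes above can be sketched as a plain loop over already-generated lines. This is an illustrative reading of those rules under stated assumptions, not the library's code: `generate_until_done` and its `raw_lines` argument are hypothetical, and the model itself is replaced by a list of strings.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def generate_until_done(
    raw_lines: Iterable[str],
    gen_lines: int,
    line_validator: Optional[Callable] = None,
    max_invalid: int = 1000,
) -> Iterator[Tuple[Optional[bool], str]]:
    """Yield (valid, line) pairs until gen_lines lines have been counted."""
    counted = 0
    invalid = 0
    for line in raw_lines:
        if counted >= gen_lines:
            return
        if line_validator is None:
            # Without a validator, every line counts towards the total.
            counted += 1
            yield None, line
            continue
        try:
            result = line_validator(line)
            ok = True if result is None else bool(result)
        except Exception:
            ok = False
        if ok:
            # With a validator, only valid lines count.
            counted += 1
        else:
            invalid += 1
            if invalid > max_invalid:
                raise RuntimeError("too many invalid lines were generated")
        yield ok, line


lines = ["a,b", "x", "c,d", "e,f"]
results = list(generate_until_done(lines, 2, lambda s: "," in s))
```

Here the invalid line "x" is yielded but does not count, so the loop stops only after the second valid line.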