Tokenizers
Interface definitions for tokenizers. The classes in this module are split into two abstract types: Trainers and Tokenizers. They are kept separate because the parameters used to train a tokenizer are not necessarily loaded back in and used by a trained tokenizer. While it is more explicit to use two types of classes, it also removes any ambiguity about which methods are available during training versus tokenizing.
Trainers require a specific configuration to be provided. Based on the configuration received, the tokenizer trainers will create the actual training data file that will be used by the downstream training process. In this respect, using at least one of these tokenizers is required for training, since it is the tokenizer's responsibility to create the final training data to be used.
The general process that is followed when using these tokenizers is:
1. Create a trainer instance, with desired parameters, including providing the config as a required param.
2. Call the annotate_data method on your tokenizer trainer. Importantly, this method iterates the input data line by line, applies any special processing, and then writes a new data file that will be used for actual training. This new data file is written to the model directory.
3. Call the train method, which will create your tokenization model and save it to the model directory.
4. Use the load() class method from an actual tokenizer class to load the trained model back in; you can then use it on input data. A minimal sketch of these steps follows.
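The sketch below walks through those four steps end to end. It assumes a TensorFlowConfig from gretel_synthetics.config with placeholder data and model paths; substitute your own BaseConfig subclass and real locations.

```python
from gretel_synthetics.config import TensorFlowConfig
from gretel_synthetics.tokenizers import (
    SentencePieceTokenizer,
    SentencePieceTokenizerTrainer,
)

# Example config; the concrete config class and paths are placeholders.
config = TensorFlowConfig(
    input_data_path="training-data.txt",  # raw input data
    checkpoint_dir="tokenizer-model",     # model directory for all outputs
)

# Steps 1 and 2: create a trainer and write the annotated training data file
trainer = SentencePieceTokenizerTrainer(config=config, vocab_size=20000)
trainer.annotate_data()

# Step 3: train the tokenizer and save its settings into the model directory
trainer.train()

# Step 4: load the trained tokenizer back in and use it on input data
tokenizer = SentencePieceTokenizer.load(config.checkpoint_dir)  # model dir from the config
ids = tokenizer.encode_to_ids("some example text")
print(tokenizer.decode_from_ids(ids))
```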
- class gretel_synthetics.tokenizers.Base
High level base class for shared class attrs and validation. Should not be used directly.
- class gretel_synthetics.tokenizers.BaseTokenizer(model_data: Any, model_dir: str)
Base class for loading a tokenizer from disk. Should not be used directly.
- decode_from_ids(ids: List[int]) str
Given a list of token IDs, convert them back into the single original string they were encoded from.
Note
We automatically call a method that can optionally restore any special reserved tokens back to their original values (such as field delimiter values, etc)
- encode_to_ids(data: str) List[int]
Given an input string, convert it to a list of token IDs
- abstract classmethod load(model_dir: str)
Given a model directory, load the specific tokenizer model into an instance. Subclasses implement the logic for how their particular model is loaded back in.
- abstract property total_vocab_size
Return the total count of unique tokens in the vocab, specific to the underlying tokenizer to be used.
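As an illustration of the encode/decode methods above, the snippet below round-trips a string through an already-trained concrete tokenizer; the model directory path is a placeholder and assumes a SentencePiece model was trained into it beforehand.

```python
from gretel_synthetics.tokenizers import SentencePieceTokenizer

# Assumes a SentencePiece model was previously trained into "tokenizer-model"
tokenizer = SentencePieceTokenizer.load("tokenizer-model")

ids = tokenizer.encode_to_ids("hello, world")   # str -> List[int]
text = tokenizer.decode_from_ids(ids)           # List[int] -> str
print(ids, text)  # decoding also restores any special reserved tokens
```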
- class gretel_synthetics.tokenizers.BaseTokenizerTrainer(*, config: None, vocab_size: int | None = None)
Base class for training tokenizers. Should not be used directly.
- annotate_data() Iterator[str]
This should be called _before_ training as it is required to have the annotated training data created in the model directory.
Read in the configuration's raw input data path and create a file I/O pipeline where each line of the input data can optionally be routed through an annotation function before being written out to the training data file specified by the config.
- config: None
A subclass instance of BaseConfig. This will be used to find the input data for tokenization.
- data_iterator() Iterator[str]
Create a generator that will iterate each line of the training data that was created during the annotation step. Synthetic model trainers will most likely need to iterate this to process each line of the annotated training data.
- num_lines: int = 0
The number of lines that were processed after annotate_data is called.
- train()
Train a tokenizer and save the tokenizer settings to a file located in the model directory specified by the config object.
- vocab_size: int
The max size of the vocab (tokens) to be extracted from the input dataset.
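To make the trainer life cycle concrete, here is a small sketch using one of the concrete trainers below; config is assumed to be a BaseConfig subclass as in the earlier workflow example.

```python
from gretel_synthetics.tokenizers import CharTokenizerTrainer

# `config` is an assumed BaseConfig subclass (see the workflow sketch above)
trainer = CharTokenizerTrainer(config=config)

trainer.annotate_data()   # writes the annotated training file to the model dir
print(trainer.num_lines)  # number of input lines processed during annotation

trainer.train()           # trains and saves the tokenizer settings

# Downstream synthetic model trainers can stream the annotated lines:
for line in trainer.data_iterator():
    pass  # feed each annotated line into model training
```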
- class gretel_synthetics.tokenizers.CharTokenizer(model_data: Any, model_dir: str)
Load a simple character tokenizer from disk to conduct encoding and decoding operations
- classmethod load(model_dir: str)
Create an instance of this tokenizer.
- Parameters:
model_dir – The path to the model directory
- property total_vocab_size
Get the number of unique characters (tokens)
- class gretel_synthetics.tokenizers.CharTokenizerTrainer(*, config: None, vocab_size: int | None = None)
Train a simple tokenizer that maps every single character to a unique ID. If vocab_size is not specified, the learned vocab size will be the number of unique characters in the training dataset.
- Parameters:
vocab_size – Max number of tokens (chars) to map.
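A minimal character-level sketch follows; it assumes config is a BaseConfig subclass as in the workflow example above and that its checkpoint_dir attribute points at the model directory.

```python
from gretel_synthetics.tokenizers import CharTokenizer, CharTokenizerTrainer

# `config` and `config.checkpoint_dir` are assumptions for illustration
trainer = CharTokenizerTrainer(config=config, vocab_size=128)
trainer.annotate_data()
trainer.train()

# Load the trained character tokenizer back from the model directory
tokenizer = CharTokenizer.load(config.checkpoint_dir)
print(tokenizer.total_vocab_size)       # number of unique characters kept
ids = tokenizer.encode_to_ids("abc")
print(tokenizer.decode_from_ids(ids))   # "abc"
```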
- class gretel_synthetics.tokenizers.SentencePieceColumnTokenizer(sp: SentencePieceProcessor, model_dir: str)
- class gretel_synthetics.tokenizers.SentencePieceColumnTokenizerTrainer(col_pattern: str = '<col{}>', **kwargs)
- class gretel_synthetics.tokenizers.SentencePieceTokenizer(model_data: Any, model_dir: str)
Load a SentencePiece tokenizer from disk so encoding / decoding can be done
- classmethod load(model_dir: str)
Load a SentencePiece tokenizer from a model directory.
- Parameters:
model_dir – The model directory.
- property total_vocab_size
The number of unique tokens in the model
- class gretel_synthetics.tokenizers.SentencePieceTokenizerTrainer(*, character_coverage: float = 1.0, pretrain_sentence_count: int = 1000000, max_line_len: int = 2048, **kwargs)
Train a tokenizer using Google SentencePiece.
- character_coverage: float
The amount of characters covered by the model. Unknown characters will be replaced with the <unk> tag. Good defaults are 0.995 for languages with rich character sets like Japanese or Chinese, and 1.0 for other languages or machine data. Default is 1.0.
- max_line_len: int
Maximum line length for input training data. Any lines longer than this length will be ignored. Default is 2048.
- pretrain_sentence_count: int
The number of lines spm_train first loads. Remaining lines are simply discarded. Since spm_train loads the entire corpus into memory, this size will depend on the memory size of the machine. It also affects training time. Default is 1000000.
- vocab_size: int
Pre-determined maximum vocabulary size, fixed prior to neural model training, for the subword units (byte-pair-encoding (BPE) and unigram language model) that SentencePiece trains directly from raw sentences. We generally recommend a large vocabulary size of 20,000 to 50,000. Default is 20000.
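The sketch below shows how these attributes map onto trainer construction; the parameter values are illustrative only, and config is again an assumed BaseConfig subclass.

```python
from gretel_synthetics.tokenizers import SentencePieceTokenizerTrainer

# `config` is an assumed BaseConfig subclass; values below are examples only
trainer = SentencePieceTokenizerTrainer(
    config=config,
    vocab_size=20000,                   # max subword vocabulary size
    character_coverage=0.995,           # e.g. for rich character sets like Japanese
    pretrain_sentence_count=1_000_000,  # lines spm_train loads into memory
    max_line_len=2048,                  # longer input lines are ignored
)
trainer.annotate_data()
trainer.train()
```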
- exception gretel_synthetics.tokenizers.TokenizerError
- gretel_synthetics.tokenizers.tokenizer_from_model_dir(model_dir: str) BaseTokenizer
A factory function that will return a tokenizer instance that can be used for encoding / decoding data. It will try to automatically infer what type of class to use based on the stored tokenizer params in the provided model directory.
If no specific tokenizer type is found, we assume that we are restoring a SentencePiece tokenizer because the model is from a version <= 0.14.x
- Parameters:
model_dir – A directory that holds synthetic model data.
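For example, a tokenizer can be restored without knowing which trainer produced it; the directory path below is a placeholder.

```python
from gretel_synthetics.tokenizers import tokenizer_from_model_dir

# Restore whichever tokenizer type was trained into this model directory
tokenizer = tokenizer_from_model_dir("tokenizer-model")
print(type(tokenizer).__name__, tokenizer.total_vocab_size)
ids = tokenizer.encode_to_ids("some example text")
print(tokenizer.decode_from_ids(ids))
```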