parlai.core.dict

Contains code for parsing and building a dictionary from text.

class parlai.core.dict.TokenizationMode(value)

Bases: Enum

An enumeration.

parlai.core.dict.escape(s)

Replace potential special characters with escaped version.

For example, a newline character becomes the two-character sequence \n, and a tab character becomes \t.

Parameters: s – string to escape

parlai.core.dict.unescape(s)

Revert escaped characters back to their special version.

For example, the two-character sequence \n becomes a newline character, and \t becomes a tab character.

Parameters: s – string to unescape
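The round trip between escaped and unescaped text can be sketched as follows. This is a minimal illustration written for this page, not the library's implementation; the names escape_sketch and unescape_sketch are hypothetical, and only newline and tab are handled.

```python
def escape_sketch(s: str) -> str:
    # Hypothetical helper: turn real newline/tab characters into the
    # two-character sequences backslash-n and backslash-t.
    return s.replace("\n", "\\n").replace("\t", "\\t")


def unescape_sketch(s: str) -> str:
    # Hypothetical helper: reverse the mapping above.
    # Assumes the input contains no pre-existing literal backslash-n
    # or backslash-t sequences, which would be ambiguous here.
    return s.replace("\\n", "\n").replace("\\t", "\t")


original = "first\tline\nsecond line"
escaped = escape_sketch(original)
restored = unescape_sketch(escaped)
```

Escaping in this way keeps each token on a single line, which matters for a tab-separated, line-oriented dictionary file.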

parlai.core.dict.find_ngrams(token_dict, text, n)

Break text into ngrams that appear in token_dict.

Parameters

token_dict – dict to check for ngrams
text – str to look for ngrams in
n – int max size of ngrams
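One way to picture n-gram merging is a greedy longest-match pass over a token list. The sketch below is an assumption-laden illustration, not the library's algorithm: greedy_ngrams is a hypothetical name, the input is assumed to be already split into tokens, and n-grams are assumed to be stored in the dictionary as space-joined strings.

```python
def greedy_ngrams(token_dict, tokens, n):
    """Merge adjacent tokens into the longest n-gram found in token_dict.

    Hypothetical sketch: tries sizes n, n-1, ..., 2 at each position and
    falls back to the single token when no n-gram matches.
    """
    out = []
    i = 0
    while i < len(tokens):
        for size in range(min(n, len(tokens) - i), 1, -1):
            candidate = " ".join(tokens[i:i + size])
            if candidate in token_dict:
                out.append(candidate)
                i += size
                break
        else:
            # No n-gram of size >= 2 starts here; keep the unigram.
            out.append(tokens[i])
            i += 1
    return out


known = {"new york": 10, "hello": 3}
merged = greedy_ngrams(known, ["i", "love", "new", "york", "city"], 2)
```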

class parlai.core.dict.DictionaryAgent(opt: Opt, shared=None)

Bases: Agent

Builds and/or loads a dictionary.

The dictionary provides access to the frequency of each token, functions to translate sentences from tokens to their vectors (list of ints, each int is the index of a token in the dictionary) and back from vectors to tokenized text.

classmethod add_cmdline_args(parser: ParlaiParser, partial_opt: Optional[Opt] = None) → ParlaiParser: Add command-line arguments related to the dictionary.

__init__(opt: Opt, shared=None): Initialize DictionaryAgent.

add_additional_special_tokens(additional_special_tokens: List[str])

Add additional special tokens to the dictionary.

Should only be called after initialization of the existing dictionary.

is_prebuilt(): Indicates whether the dictionary is fixed and does not require building.

add_token(word): Add a single token to the dictionary.

keys(): Return all the words in the dictionary.

nltk_tokenize(text, building=False)

Tokenize using NLTK PunktTokenizer.

Uses the NLTK-trained PunktTokenizer for sentence tokenization and the Treebank Word Tokenizer for tokenizing words within sentences.

gpt2_tokenize(text): Tokenize using GPT-2 BPE tokenizer.

slow_bytelevel_bpe_tokenize(text): Tokenize using GPT-2 BPE tokenizer.

bytelevelbpe_tokenize(text): Tokenize using GPT-2 BPE tokenizer.

static re_tokenize(text)

Tokenize using a liberal regular expression.

Find boundaries between word characters, newlines, and non-word non-whitespace tokens (r'[\w\n]+ | [^\w\s] | \n').

This splits along whitespace and punctuation and keeps the newline as a token in the returned list.
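The behavior described above can be approximated with Python's re module. The pattern below is a simplified assumption chosen so that each newline comes out as its own token; it is not copied from the library, and re_tokenize_sketch is a hypothetical name.

```python
import re

# Assumed pattern: a run of word characters, OR a single non-word
# non-whitespace character (punctuation), OR a single newline.
TOKEN_RE = re.compile(r"\w+|[^\w\s]|\n")


def re_tokenize_sketch(text):
    # findall returns every non-overlapping match, left to right.
    # Spaces and tabs match no alternative, so they are dropped.
    return TOKEN_RE.findall(text)


tokens = re_tokenize_sketch("Hello, world!\nBye")
```

Punctuation and the newline each become separate tokens, while ordinary spaces act only as separators.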

static split_tokenize(text)

Tokenize on whitespace and some limited punctuation.

Splits tokens based on whitespace after adding whitespace around punctuation.

Use re_tokenize if you want more robust handling of punctuation.
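A minimal sketch of the pad-then-split approach follows. The punctuation set is an assumption made for illustration, and split_tokenize_sketch is a hypothetical name rather than the library function.

```python
# Assumed punctuation set; the characters the library actually pads
# may differ.
PUNCTUATION = ".,;:!?"


def split_tokenize_sketch(text):
    # Surround each listed punctuation mark with spaces, then split on
    # any run of whitespace.
    for ch in PUNCTUATION:
        text = text.replace(ch, f" {ch} ")
    return text.split()


simple = split_tokenize_sketch("Hi, there! ok")
# Characters outside the set (such as the apostrophe) stay attached.
contraction = split_tokenize_sketch("don't stop.")
```

Because only a fixed set of characters is padded, anything outside that set remains glued to its neighbors, which is the limitation the note about re_tokenize refers to.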

static space_tokenize(text)

Tokenize exactly on spaces.

Useful when text is pre-tokenized.

span_tokenize(text): Tokenize and find the starting index of each token in the original string.
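Recovering token positions can be sketched by scanning the original string with a moving cursor. This is an illustrative assumption about one possible approach; token_starts is a hypothetical helper, and the tokens are assumed to appear verbatim and in order in the text.

```python
def token_starts(text, tokens):
    """Return (token, start_index) pairs for tokens found in order in text."""
    result = []
    cursor = 0
    for tok in tokens:
        # Search from the cursor so repeated tokens map to later positions.
        idx = text.find(tok, cursor)
        if idx < 0:
            raise ValueError(f"token {tok!r} not found after index {cursor}")
        result.append((tok, idx))
        cursor = idx + len(tok)
    return result


sentence = "to be or not to be"
spans = token_starts(sentence, sentence.split())
```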

tokenize(text, building=False)

Return a sequence of tokens from the text.

Also handles special tokens for some tokenizers.

bpe_tokenize(text): Return a sequence of BPE-tokens from the text.

add_to_dict(tokens): Build dictionary from the list of provided tokens.

remove_tail(min_freq): Remove elements below the frequency cutoff from the dictionary.

resize_to_max(maxtokens): Trims the dictionary to the maximum number of tokens.
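The two trimming operations can be illustrated on a plain frequency counter. This sketch assumes a collections.Counter stands in for the dictionary's internal counts; the function names are hypothetical, and special tokens, which a real dictionary would likely protect, are ignored.

```python
from collections import Counter


def remove_tail_sketch(freq, min_freq):
    # Keep only tokens whose count is at least min_freq.
    return Counter({tok: c for tok, c in freq.items() if c >= min_freq})


def resize_to_max_sketch(freq, maxtokens):
    # Keep the maxtokens most frequent tokens.
    return Counter(dict(freq.most_common(maxtokens)))


counts = Counter({"the": 5, "cat": 3, "sat": 1, "mat": 1})
frequent_only = remove_tail_sketch(counts, min_freq=2)
capped = resize_to_max_sketch(counts, maxtokens=3)
```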

load(filename)

Load pre-existing dictionary in ‘token[<TAB>count]’ format.

Initialize counts from other dictionary, or 0 if they aren’t included.

save(filename=None, append=False, sort=True)

Save dictionary to file.

Format is ‘token<TAB>count’ for every token in the dictionary, sorted by count with the most frequent words first.

If append (default False) is set to True, appends instead of overwriting.

If sort (default True), then first sort the dictionary before saving.
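The on-disk format can be sketched with two small helpers. These are hypothetical stand-ins (save_sketch and load_sketch), not the library's methods; they assume tokens contain no raw tab or newline characters, and they treat a missing count as 0.

```python
import os
import tempfile


def save_sketch(freq, path):
    # One "token<TAB>count" line per token, most frequent first.
    ordered = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    with open(path, "w", encoding="utf-8") as f:
        for token, count in ordered:
            f.write(f"{token}\t{count}\n")


def load_sketch(path):
    freq = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # The count is optional; default to 0 when it is absent.
            token, _, count = line.partition("\t")
            freq[token] = int(count) if count else 0
    return freq


with tempfile.TemporaryDirectory() as tmp:
    dict_path = os.path.join(tmp, "example.dict")
    save_sketch({"hello": 3, "world": 7}, dict_path)
    with open(dict_path, encoding="utf-8") as f:
        raw = f.read()
    loaded = load_sketch(dict_path)
```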

sort(trim=True)

Sort the dictionary.

In-place operation. Rearranges the dictionary so that the elements with the lowest index have the highest counts. This reindexes the dictionary according to the sorted frequencies, breaking ties alphabetically by token.

Parameters: trim (bool) – If True, truncate the dictionary based on minfreq and maxtokens.
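The ordering rule (highest count first, ties broken alphabetically) can be expressed with a single sort key. The sketch below is illustrative only: sort_sketch is a hypothetical name, and it ignores special tokens and trimming.

```python
def sort_sketch(freq):
    """Return (tok2ind, ind2tok) with index 0 assigned to the most frequent token."""
    # Negating the count sorts descending by frequency; the token itself
    # is the secondary key, giving alphabetical order among ties.
    ordered = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    tok2ind = {tok: i for i, (tok, _) in enumerate(ordered)}
    ind2tok = {i: tok for tok, i in tok2ind.items()}
    return tok2ind, ind2tok


tok2ind, ind2tok = sort_sketch({"b": 2, "a": 2, "c": 5})
```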

parse(txt_or_vec, vec_type=list)

Parse either text or a vector of indices.

Calls txt2vec if txt_or_vec is a string, or vec2txt otherwise.

Parameters: vec_type – type of the returned vector if the input is a string.

txt2vec(text: str, vec_type=list)

Convert a string to a vector (list of ints).

First runs a sentence tokenizer, then a word tokenizer.

Parameters: vec_type (type) – The type of the returned vector. Suggested types are list, tuple, set, or np.ndarray.

vec2txt(vector, delimiter=' ')

Convert a vector of IDs to a string.

Converts a vector (iterable of ints) into a string, with each token separated by the delimiter (default ' ').
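The text-to-vector mapping and its inverse can be sketched with a tiny stand-in class. MiniDict is hypothetical and written only for this illustration: it tokenizes on whitespace, and the unknown-token string "__unk__" is an assumption.

```python
class MiniDict:
    """Hypothetical, minimal token <-> index mapping (not DictionaryAgent)."""

    def __init__(self, tokens, unk="__unk__"):
        self.unk = unk
        # Index 0 is reserved for the unknown token; other tokens keep
        # first-seen order, with duplicates removed.
        self.ind2tok = [unk] + [t for t in dict.fromkeys(tokens) if t != unk]
        self.tok2ind = {t: i for i, t in enumerate(self.ind2tok)}

    def txt2vec(self, text, vec_type=list):
        unk_idx = self.tok2ind[self.unk]
        # Tokens missing from the mapping fall back to the unknown index.
        return vec_type(self.tok2ind.get(t, unk_idx) for t in text.split())

    def vec2txt(self, vector, delimiter=" "):
        return delimiter.join(self.ind2tok[i] for i in vector)


d = MiniDict("hello world hello there".split())
vec = d.txt2vec("hello there stranger")
as_tuple = d.txt2vec("hello world", vec_type=tuple)
text = d.vec2txt(vec)
```

Out-of-vocabulary words do not survive the round trip: they are mapped to the unknown index and come back as the unknown token.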

act()

Add words in the last observation to the dictionary.

This checks any fields in the message that are listed in the --dict-textfields argument (e.g. “text,labels”).
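How a comma-separated field list might drive counting can be sketched as below. This is an assumption-based illustration: add_observation_to_counts is a hypothetical helper, the observation is a plain dict, and whitespace splitting stands in for whichever tokenizer is configured.

```python
from collections import Counter


def add_observation_to_counts(freq, observation, textfields="text,labels"):
    for field in textfields.split(","):
        value = observation.get(field)
        if value is None:
            continue
        # A field may hold one string or a list of strings (e.g. labels).
        texts = [value] if isinstance(value, str) else value
        for text in texts:
            freq.update(text.split())
    return freq


observation = {
    "text": "hello there",
    "labels": ["hello friend"],
    "episode_done": True,  # not listed in textfields, so it is ignored
}
word_counts = add_observation_to_counts(Counter(), observation)
```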

share(): Share internal dicts.

shutdown(): Save on shutdown if save_path is set.

set_tokenization_mode(mode: TokenizationMode)

Indicate what “kind” of tokenization is being done.

This can be Training Time / Testing Time, and it can be over context or labels.

This is used to signal from TorchAgent to the dict that it’s allowed to enable things like BPE dropout. It is NOT used to indicate whether the dictionary itself is in training time.

Use True for training time, False for not.