parlai.core.dict¶
Contains code for parsing and building a dictionary from text.
- parlai.core.dict.escape(s)[source]¶
Replace potential special characters with escaped version.
For example, \n => \\n and \t => \\t
- Parameters
s – string to escape
- parlai.core.dict.unescape(s)[source]¶
Revert escaped characters back to their special version.
For example, \\n => \n and \\t => \t
- Parameters
s – string to unescape
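A quick round trip through both helpers (a minimal sketch; the escaped form contains literal backslash sequences, so it is safe to store in one-token-per-line dictionary files):

from parlai.core.dict import escape, unescape

s = 'hello\tworld\n'
escaped = escape(s)            # 'hello\\tworld\\n' as a literal string
assert unescape(escaped) == s  # unescape inverts escape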
- parlai.core.dict.find_ngrams(token_dict, text, n)[source]¶
Break text into ngrams that appear in token_dict.
- Parameters
token_dict – dict to check for ngrams
text – str to look for ngrams in
n – int, max size of ngrams
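Although the parameter description says str, the function is applied inside DictionaryAgent.tokenize to an already-tokenized list, merging adjacent tokens into ngrams known to the dictionary. A sketch, assuming (as in the ParlAI implementation) that ngrams are stored space-joined:

from parlai.core.dict import find_ngrams

token_dict = {'new york': 1}           # a frequency dict that knows one bigram
tokens = ['i', 'love', 'new', 'york']
print(find_ngrams(token_dict, tokens, 2))
# expected: ['i', 'love', 'new york']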
- class parlai.core.dict.DictionaryAgent(opt: Opt, shared=None)[source]¶
Bases:
Agent
Builds and/or loads a dictionary.
The dictionary provides access to the frequency of each token, functions to translate sentences from tokens to their vectors (list of ints, each int is the index of a token in the dictionary) and back from vectors to tokenized text.
- classmethod add_cmdline_args(parser: ParlaiParser, partial_opt: Optional[Opt] = None) ParlaiParser [source]¶
Add commandline arguments related to the dictionary.
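A minimal end-to-end sketch: register the dictionary flags on a ParlaiParser, build a small vocabulary, and do the token/vector round trip described above (add_to_dict is the ParlAI method for counting tokens; exact indices depend on defaults and special tokens):

from parlai.core.params import ParlaiParser
from parlai.core.dict import DictionaryAgent

parser = ParlaiParser()
DictionaryAgent.add_cmdline_args(parser, partial_opt=None)
opt = parser.parse_args([])

d = DictionaryAgent(opt)
d.add_to_dict(d.tokenize('hello world hello'))  # count tokens into the dictionary
vec = d.txt2vec('hello world')                  # list of ints, e.g. [4, 5]
print(d.vec2txt(vec))                           # 'hello world'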
- add_additional_special_tokens(additional_special_tokens: List[str])[source]¶
Add additional special tokens to the dictionary.
Should only be called after initialization of the existing dictionary.
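For instance, to reserve markers that the tokenizer must never split (illustrative token names; the default re tokenizer supports this, while some tokenizers raise NotImplementedError):

d = DictionaryAgent(opt)  # opt as built in the sketch above
d.add_additional_special_tokens(['__knowledge__', '__endknowledge__'])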
- nltk_tokenize(text, building=False)[source]¶
Tokenize using NLTK PunktTokenizer.
Uses nltk-trained PunktTokenizer for sentence tokenization and Treebank Word Tokenizer for tokenizing words within sentences.
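A sketch assuming the NLTK punkt sentence model is available locally (nltk.download('punkt') fetches it if it is not):

opt['dict_tokenizer'] = 'nltk'  # select NLTK tokenization before construction
d = DictionaryAgent(opt)
print(d.nltk_tokenize('Dr. Smith arrived. He sat down.'))
# sentence-split first, then Treebank word tokenization, e.g.
# ['Dr.', 'Smith', 'arrived', '.', 'He', 'sat', 'down', '.']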
- static re_tokenize(text)[source]¶
Tokenize using a liberal regular expression.
Find boundaries between word characters, newlines, and non-word non-whitespace tokens (r'[\w\n]+ | [^\w\s] | \n').
This splits along whitespace and punctuation and keeps the newline as a token in the returned list.
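For example (a static method, so no dictionary instance is needed):

print(DictionaryAgent.re_tokenize("I'll see you\nat 8pm!"))
# ['I', "'", 'll', 'see', 'you', '\n', 'at', '8pm', '!']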
- static split_tokenize(text)[source]¶
Tokenize on whitespace and some limited punctuation.
Splits tokens based on whitespace after adding whitespace around punctuation.
Use re_tokenize if you want more robust handling of punctuation.
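A comparison of the two static tokenizers (expected outputs shown as comments; split_tokenize only pads a small punctuation set, so the apostrophe stays attached):

s = "Well, I'll try."
print(DictionaryAgent.split_tokenize(s))  # ['Well', ',', "I'll", 'try', '.']
print(DictionaryAgent.re_tokenize(s))     # ['Well', ',', 'I', "'", 'll', 'try', '.']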
- tokenize(text, building=False)[source]¶
Return a sequence of tokens from the input text.
Also handles special tokens for some tokenizers.
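tokenize dispatches to the tokenizer selected by --dict-tokenizer and applies dictionary-level options such as --dict-lower; a sketch with the default re tokenizer:

opt['dict_lower'] = True
d = DictionaryAgent(opt)
print(d.tokenize('Hello, World!'))  # ['hello', ',', 'world', '!']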
- load(filename)[source]¶
Load pre-existing dictionary in ‘token[<TAB>count]’ format.
Initialize counts from other dictionary, or 0 if they aren’t included.
- save(filename=None, append=False, sort=True)[source]¶
Save dictionary to file.
Format is ‘token<TAB>count’ for every token in the dictionary, sorted by count with the most frequent words first.
If append (default False) is set to True, appends to the file instead of overwriting it.
If sort (default True), the dictionary is sorted by count before saving.
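A save/load round trip (a sketch; load is also invoked automatically at construction when --dict-file points at an existing file):

d.save('/tmp/mydict.tsv', sort=True)  # 'token<TAB>count' lines, most frequent first
d2 = DictionaryAgent(opt)
d2.load('/tmp/mydict.tsv')            # counts are initialized from the file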
- sort(trim=True)[source]¶
Sort the dictionary.
Sorts in place: rearranges the dictionary so that the elements with the lowest indices have the highest counts. This reindexes the dictionary according to the sorted frequencies, breaking ties alphabetically by token.
- Parameters
trim (bool) – If True, truncate the dictionary based on minfreq and maxtokens.
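For example (ind2tok is the index-to-token mapping the ParlAI implementation maintains):

d.sort(trim=False)            # reindex by frequency, but keep every token
print(d.ind2tok[len(d) - 1])  # the least frequent token now has the highest index
d.sort(trim=True)             # additionally apply --dict-minfreq / --dict-maxtokens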
- parse(txt_or_vec, vec_type=<class 'list'>)[source]¶
Parse either text or a vector of indices.
Calls txt2vec if txt_or_vec is a string, or vec2txt otherwise.
- Parameters
vec_type – type of the returned vector if the input is a string.
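For example (the round trip only holds for in-vocabulary tokens; unknown words map to the unknown token):

vec = d.parse('hello world')  # str in, so dispatches to txt2vec
txt = d.parse(vec)            # iterable of ints in, so dispatches to vec2txt
print(txt)                    # 'hello world'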
- txt2vec(text: str, vec_type=<class 'list'>)[source]¶
Convert a string to a vector (list of ints).
First runs a sentence tokenizer, then a word tokenizer.
- Parameters
vec_type (type) – The type of the returned vector if the input is a string. Suggested: list, tuple, set, or np.ndarray.
- vec2txt(vector, delimiter=' ')[source]¶
Convert a vector of IDs to a string.
Converts a vector (iterable of ints) into a string, with each token separated by the delimiter (default ' ').
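A sketch of both directions with a non-default vector type and delimiter:

import numpy as np

arr = d.txt2vec('hello world', vec_type=np.ndarray)  # array of token indices
print(d.vec2txt(arr))                                # 'hello world'
print(d.vec2txt(arr, delimiter='|'))                 # 'hello|world'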
- act()[source]¶
Add words in the last observation to the dictionary.
This checks any fields in the message named in the --dict-textfields argument (e.g. “text,labels”).
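For example (observe comes from the Agent base class and stashes the message that act then reads):

d.observe({'text': 'brand new words', 'labels': ['even newer words']})
d.act()                 # tokenize and count every --dict-textfields field
print(d.freq['brand'])  # now nonzero: the observation's tokens are counted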
- share()[source]¶
Share internal dicts.
- set_tokenization_mode(mode: TokenizationMode)[source]¶
Indicate what “kind” of tokenization is being done.
This can be Training Time / Testing Time, and it can be over context or labels.
This is used to signal from TorchAgent to the dict that it’s allowed to enable things like BPE dropout. It is NOT used to indicate whether the dictionary itself is in training time.
Pass the TokenizationMode value matching the current phase (train vs. test) and field (context vs. labels).
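A sketch, assuming the TokenizationMode members defined in parlai.core.dict (TRAIN_TIME_TEXT, TRAIN_TIME_LABEL, TEST_TIME_TEXT, TEST_TIME_LABEL):

from parlai.core.dict import TokenizationMode

d.set_tokenization_mode(TokenizationMode.TRAIN_TIME_LABEL)  # e.g. allow BPE dropout on labels
d.set_tokenization_mode(TokenizationMode.TEST_TIME_TEXT)    # disable it again for inference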