core.dict

Contains code for parsing and building a dictionary from text.

parlai.core.dict.escape(s)

Replace potential special characters with escaped version.

For example, \n => \\n and \t => \\t (a literal newline becomes a backslash followed by n).

Parameters: s – string to escape
parlai.core.dict.unescape(s)

Revert escaped characters back to their special version.

For example, \\n => \n and \\t => \t (a backslash followed by n becomes a literal newline).

Parameters: s – string to unescape
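
The round trip between the two functions can be sketched as follows. This is a minimal stand-alone illustration, not the library's source: it assumes only newline and tab are handled, as in the examples above, and the `_sketch` names are hypothetical.

```python
def escape_sketch(s):
    # Turn real newline/tab characters into visible two-character sequences.
    return s.replace('\n', '\\n').replace('\t', '\\t')


def unescape_sketch(s):
    # Reverse the mapping: backslash + n becomes a real newline, and so on.
    return s.replace('\\n', '\n').replace('\\t', '\t')


raw = 'a\tb\nc'
escaped = escape_sketch(raw)          # safe to store on a single line
restored = unescape_sketch(escaped)   # back to the original string
```

A scheme this simple is not a perfect inverse for input that already contains a literal backslash followed by n or t.
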
parlai.core.dict.find_ngrams(token_dict, text, n)

Break text into ngrams that appear in token_dict.

Parameters:
  • token_dict – dict to check for ngrams
  • text – str to look for ngrams in
  • n – int, max size of ngrams
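
One way to picture this kind of n-gram merging is a greedy longest-match pass. The sketch below is an illustration under stated assumptions, not the library's algorithm: it assumes the text has already been split into a list of tokens and that multi-word entries are stored in the dict as space-joined strings. The name `greedy_ngrams_sketch` is hypothetical.

```python
def greedy_ngrams_sketch(token_dict, tokens, n):
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest window first, shrinking down to two tokens.
        for size in range(min(n, len(tokens) - i), 1, -1):
            candidate = ' '.join(tokens[i:i + size])
            if candidate in token_dict:
                out.append(candidate)
                i += size
                break
        else:
            # No multi-token match starts here, so keep the single token.
            out.append(tokens[i])
            i += 1
    return out


vocab = {'new york': 1, 'hello': 1}
merged = greedy_ngrams_sketch(vocab, ['hello', 'new', 'york', 'city'], 2)
```
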
class parlai.core.dict.DictionaryAgent(opt, shared=None)

Builds and/or loads a dictionary.

The dictionary provides access to the frequency of each token, functions to translate sentences from tokens to their vectors (list of ints, each int is the index of a token in the dictionary) and back from vectors to tokenized text.

static add_cmdline_args(argparser)

Add commandline arguments related to the dictionary.

__init__(opt, shared=None)

Initialize DictionaryAgent.

__contains__(key)

If key is an int, return whether the key is in the indices. If key is a str, return whether the token is in the dict of tokens.

__getitem__(key)

If key is an int, return the corresponding token. If the index does not exist, return the unknown token. If key is a str, return the token’s index. If the token is not in the dictionary, return the index of the unknown token. If there is no unknown token, return None.

__setitem__(key, value)

If the key is not in the dictionary, add it to the dictionary and set its frequency to value.
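
The lookup rules described for `__contains__`, `__getitem__` and `__setitem__` can be sketched with a small stand-alone class. `TinyDictSketch`, its attribute names, and the `'__unk__'` default are hypothetical choices for illustration; the real agent has more state and options.

```python
class TinyDictSketch:
    def __init__(self, unk='__unk__'):
        self.tok2ind = {}   # token -> index
        self.ind2tok = {}   # index -> token
        self.freq = {}      # token -> count
        self.unk = unk
        if unk is not None:
            self[unk] = 0

    def __contains__(self, key):
        # Ints are checked against indices, strings against tokens.
        if isinstance(key, int):
            return key in self.ind2tok
        return key in self.tok2ind

    def __getitem__(self, key):
        if isinstance(key, int):
            return self.ind2tok.get(key, self.unk)
        if key in self.tok2ind:
            return self.tok2ind[key]
        # Falls back to the unknown token's index, or None without one.
        return self.tok2ind.get(self.unk)

    def __setitem__(self, key, value):
        if key not in self.tok2ind:
            index = len(self.tok2ind)
            self.tok2ind[key] = index
            self.ind2tok[index] = key
        # Assumption of this sketch: existing tokens get their count updated.
        self.freq[key] = value


d = TinyDictSketch()
d['hello'] = 3
```
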

copy_dict(dictionary)

Overwrite own state with any state in the other dictionary. This allows loading of the contents of another dictionary while keeping the current dictionary version.

spacy_span_tokenize(text)

Return a tuple of (tokens, spans).

nltk_tokenize(text, building=False)

Uses nltk-trained PunktTokenizer for sentence tokenization and Treebank Word Tokenizer for tokenizing words within sentences.

static re_tokenize(text)

Find boundaries between word characters, newlines, and non-word non-whitespace tokens (r'[\w\n]+ | [^\w\s] | \n').

This splits along whitespace and punctuation and keeps the newline as a token in the returned list.
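
The described behaviour can be reproduced with Python's `re` module. The pattern below is an assumption: a simplified form of the documented expression, chosen so that newlines come out as separate tokens. It is not necessarily the exact pattern the library compiles.

```python
import re

# Words, single punctuation marks, or a newline; other whitespace is skipped.
TOKEN_PATTERN = re.compile(r'\w+|[^\w\s]|\n')


def re_tokenize_sketch(text):
    return TOKEN_PATTERN.findall(text)


tokens = re_tokenize_sketch('Hello, world!\nBye')
```
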
static split_tokenize(text)

Splits tokens based on whitespace after adding whitespace around punctuation. Use re_tokenize if you want more robust handling of punctuation.
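
A rough sketch of the pad-then-split idea is shown here. The set of punctuation marks is an assumption made for the example, and `split_tokenize_sketch` is a hypothetical name.

```python
def split_tokenize_sketch(text, punctuation='.,!?;:'):
    # Surround each punctuation mark with spaces so it splits off cleanly.
    for mark in punctuation:
        text = text.replace(mark, f' {mark} ')
    # str.split() with no argument collapses runs of whitespace.
    return text.split()


tokens = split_tokenize_sketch('Hi, there!  How are you?')
```

Because `str.split()` discards all whitespace, this approach also drops newlines, which is one reason to prefer a regex-based tokenizer when newlines matter.
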

span_tokenize(text)

Tokenizes, and then calculates the starting index of each token in the original string.
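
Recovering token positions can be sketched by scanning the original string left to right. This illustration assumes every token appears verbatim in the text and returns (start, end) pairs; both the return shape and the name `span_sketch` are assumptions.

```python
def span_sketch(text, tokens):
    spans = []
    cursor = 0
    for token in tokens:
        # Search from the end of the previous token to keep order stable.
        start = text.index(token, cursor)
        end = start + len(token)
        spans.append((start, end))
        cursor = end
    return spans


text = 'Hi, there!'
spans = span_sketch(text, ['Hi', ',', 'there', '!'])
```
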

tokenize(text, building=False)

Return a sequence of tokens from the text.

bpe_tokenize(text)

Return a sequence of BPE-tokens from the text.

add_to_dict(tokens)

Build dictionary from the list of provided tokens.

remove_tail(min_freq)

Remove elements below the frequency cutoff from the dictionary.

resize_to_max(maxtokens)

Trims the dictionary to the maximum number of tokens.
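
The two trimming operations can be illustrated on a plain token-to-count mapping. This is a sketch under assumptions: tokens with a count equal to the cutoff are kept, and the most frequent tokens survive when capping the size. The function names are hypothetical and return new dicts rather than mutating an agent.

```python
def remove_tail_sketch(freq, min_freq):
    # Drop every token seen fewer than min_freq times.
    return {tok: cnt for tok, cnt in freq.items() if cnt >= min_freq}


def resize_to_max_sketch(freq, maxtokens):
    # Keep the maxtokens most frequent tokens, ties broken alphabetically.
    ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:maxtokens])


counts = {'the': 10, 'cat': 3, 'sat': 3, 'zyx': 1}
trimmed = remove_tail_sketch(counts, 2)
capped = resize_to_max_sketch(counts, 2)
```
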

load(filename)

Load pre-existing dictionary in ‘token[<TAB>count]’ format.

Initialize counts from other dictionary, or 0 if they aren’t included.

save(filename=None, append=False, sort=True)

Save dictionary to file. Format is ‘token<TAB>count’ for every token in the dictionary, sorted by count with the most frequent words first.

If append (default False) is set to True, appends instead of overwriting.

If sort (default True), then first sort the dictionary before saving.
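
The file format described for `load` and `save` can be sketched with two small helpers. These are hypothetical stand-ins, not the agent's methods: they write one ‘token<TAB>count’ line per token, most frequent first, and read lines back with a default count of 0 when the count is absent. Tokens are assumed to contain no tabs or newlines.

```python
import os
import tempfile


def save_counts_sketch(freq, path):
    ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    with open(path, 'w', encoding='utf-8') as f:
        for token, count in ranked:
            f.write(f'{token}\t{count}\n')


def load_counts_sketch(path):
    freq = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split('\t')
            # The count column is optional; missing counts default to 0.
            freq[parts[0]] = int(parts[1]) if len(parts) > 1 else 0
    return freq


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.dict')
    save_counts_sketch({'hello': 3, 'world': 5}, path)
    with open(path, 'a', encoding='utf-8') as f:
        f.write('nocount\n')  # a line that carries only a token
    loaded = load_counts_sketch(path)
```
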

sort(trim=True)

Sorts the dictionary, so that the elements with the lowest index have the highest counts. This reindexes the dictionary according to the sorted frequencies, breaking ties alphabetically by token.

Parameters: trim (bool) – If True, truncate the dictionary based on minfreq and maxtokens.
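
The reindexing rule (highest counts first, ties broken alphabetically) can be sketched on a plain mapping. `sorted_index_sketch` is a hypothetical helper that returns a fresh token-to-index dict; it ignores special tokens and trimming.

```python
def sorted_index_sketch(freq):
    # Negative count sorts descending; the token itself breaks ties.
    ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    return {token: index for index, (token, _) in enumerate(ranked)}


new_index = sorted_index_sketch({'b': 2, 'a': 2, 'c': 5})
```
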
parse(txt_or_vec, vec_type=<class 'list'>)

Convenience function for parsing either text or vectors of indices.

vec_type is the type of the returned vector if the input is a string.

txt2vec(text, vec_type=<class 'list'>)

Converts a string to a vector (list of ints).

First runs a sentence tokenizer, then a word tokenizer.

vec_type is the type of the returned vector if the input is a string.

vec2txt(vector, delimiter=' ')

Converts a vector (iterable of ints) into a string, with each token separated by the delimiter (default ' ').
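
The relationship between the two conversions can be sketched with plain dicts. The example assumes whitespace tokenization and an unknown-token index of 0; the function names and the tiny vocabulary are made up for illustration.

```python
def txt2vec_sketch(text, tok2ind, unk_index=0, vec_type=list):
    # Unknown words map to the unknown-token index.
    return vec_type(tok2ind.get(token, unk_index) for token in text.split())


def vec2txt_sketch(vector, ind2tok, delimiter=' '):
    return delimiter.join(ind2tok[index] for index in vector)


tok2ind = {'__unk__': 0, 'hello': 1, 'world': 2}
ind2tok = {index: token for token, index in tok2ind.items()}

vec = txt2vec_sketch('hello brave world', tok2ind)
as_tuple = txt2vec_sketch('hello world', tok2ind, vec_type=tuple)
text = vec2txt_sketch(vec, ind2tok)
```

The round trip is lossy: any word missing from the vocabulary comes back as the unknown token.
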

act()

Add words in the last observation to the dictionary.

This checks any fields in the message present in the --dict-textfields argument (e.g. “text,labels”).
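
How an observation might feed the counts can be sketched as follows. This is an illustration only: it assumes the message is a plain dict, that a field holds either a string or a list of strings, and that tokens are split on whitespace. `add_observation_sketch` is a hypothetical name.

```python
from collections import Counter


def add_observation_sketch(freq, observation, textfields=('text', 'labels')):
    for field in textfields:
        value = observation.get(field)
        if value is None:
            continue
        # Normalize a single string to a one-element list.
        texts = [value] if isinstance(value, str) else value
        for text in texts:
            freq.update(text.split())
    return freq


freq = Counter()
message = {'text': 'hello world', 'labels': ['hello there'], 'id': 'demo'}
add_observation_sketch(freq, message)
```
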

share()

Share internal dicts.

shutdown()

Save on shutdown if save_path is set.

__str__()

Return string representation of frequencies in dictionary.