core.dict

Contains code for parsing and building a dictionary from text.

class parlai.core.dict.DictionaryAgent(opt, shared=None)

Bases: parlai.core.agents.Agent

Builds and/or loads a dictionary.

The dictionary provides access to the frequency of each token, functions to translate sentences from tokens to their vectors (list of ints, each int is the index of a token in the dictionary) and back from vectors to tokenized text.
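
A minimal usage sketch (not from the official docs; the option setup via ParlaiParser and the sample strings are illustrative):

    from parlai.core.params import ParlaiParser
    from parlai.core.dict import DictionaryAgent

    # Build default options; DictionaryAgent registers its own flags.
    parser = ParlaiParser()
    DictionaryAgent.add_cmdline_args(parser)
    opt = parser.parse_args([])

    d = DictionaryAgent(opt)
    d.add_to_dict(d.tokenize('hello world , hello parlai'))

    vec = d.txt2vec('hello world')  # e.g. [4, 5]: indices into the dictionary
    txt = d.vec2txt(vec)            # back to 'hello world'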

__init__(opt, shared=None)

Initialize DictionaryAgent.

act()

Add words in the last observation to the dictionary.

This checks any fields in the message present in the --dict-textfields argument (e.g. "text,labels").
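
For instance, a hedged sketch reusing the agent d from the example above (the message fields are illustrative):

    # observe() stores the message; act() then adds tokens from the
    # fields named in --dict-textfields (e.g. 'text' and 'labels').
    d.observe({'text': 'good morning', 'labels': ['hi there']})
    d.act()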

static add_cmdline_args(argparser)

Add commandline arguments related to the dictionary.

add_to_dict(tokens)

Build dictionary from the list of provided tokens.
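
For example (a small sketch; freq is the agent's token-to-count mapping):

    d.add_to_dict(['hello', 'world', 'hello'])
    d.freq['hello']  # count for 'hello' incremented by 2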

bpe_tokenize(text)

Return a sequence of BPE tokens from the text.

copy_dict(dictionary)

Overwrite own state with any state in the other dictionary. This allows loading of the contents of another dictionary while keeping the current dictionary version.

load(filename)

Load a pre-existing dictionary in 'token[<TAB>count]' format.

Counts are initialized from the file, or set to 0 if they aren't included.
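
A hedged sketch of the format (hypothetical path and counts; d is the agent from the first example):

    # Write a tiny dictionary file in 'token<TAB>count' format.
    with open('/tmp/tiny.dict', 'w') as f:  # hypothetical path
        f.write('hello\t42\nworld\t17\n')

    d.load('/tmp/tiny.dict')
    d.freq['hello']  # 42, initialized from the file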

nltk_tokenize(text, building=False)

Uses the NLTK-trained PunktTokenizer for sentence tokenization and the Treebank word tokenizer for tokenizing words within sentences.
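
A sketch of selecting this tokenizer, assuming the standard --dict-tokenizer flag and an installed NLTK with the punkt model:

    opt = parser.parse_args(['--dict-tokenizer', 'nltk'])
    d_nltk = DictionaryAgent(opt)
    # Sentence-split first, then word-tokenized within each sentence.
    d_nltk.tokenize('Dr. Smith arrived. He sat down.')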

parse(txt_or_vec, vec_type=list)

Convenience function for parsing either text or vectors of indices.

vec_type is the type of the returned vector if the input is a string.

static re_tokenize(text)

Find boundaries between word characters, newlines, and non-word non-whitespace tokens (r'\w+|[^\w\s]|\n').

This splits along whitespace and punctuation and keeps the newline as a token in the returned list.
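
A standalone sketch of the pattern's behavior:

    import re

    # The pattern described above: runs of word characters, single
    # non-word non-whitespace characters, or newlines.
    RETOK = re.compile(r'\w+|[^\w\s]|\n', re.UNICODE)

    RETOK.findall('Hello, world!\nBye.')
    # ['Hello', ',', 'world', '!', '\n', 'Bye', '.']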

remove_tail(min_freq)

Remove elements below the frequency cutoff from the dictionary.

resize_to_max(maxtokens)

Trims the dictionary to contain at most maxtokens tokens.

save(filename=None, append=False, sort=True)

Save dictionary to file. Format is 'token<TAB>count' for every token in the dictionary, sorted by count with the most frequent words first.

If append (default False) is set to True, appends instead of overwriting.

If sort (default True), then first sort the dictionary before saving.
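
For example (hypothetical path; the file mirrors the load format above):

    d.save('/tmp/dict.tsv', sort=True)  # hypothetical path
    # resulting lines, most frequent first, e.g.:
    #   hello\t3
    #   world\t2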

share()

Share internal dicts.

shutdown()

Save on shutdown if save_path is set.

sort(trim=True)

Sorts the dictionary so that the elements with the lowest indices have the highest counts. This reindexes the dictionary according to the sorted frequencies, breaking ties alphabetically by token.

Parameters: trim (bool) – If True, truncate the dictionary based on minfreq and maxtokens.

spacy_span_tokenize(text)

Returns tuple of tokens, spans.

span_tokenize(text)

Tokenizes, and then calculates the starting index of each token in the original string.
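
A brief sketch (the exact span representation is an assumption; offsets refer to the original string):

    tokens, spans = d.span_tokenize('hello world')
    # tokens -> ['hello', 'world']
    # spans  -> character offsets of each token within 'hello world'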

static split_tokenize(text)

Splits tokens based on whitespace after adding whitespace around punctuation. Use re_tokenize if you want more robust handling of punctuation.
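
For example (split_tokenize is a static method):

    DictionaryAgent.split_tokenize('Hello, world!')
    # ['Hello', ',', 'world', '!']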

tokenize(text, building=False)

Returns a sequence of tokens from the text.

txt2vec(text, vec_type=list)

Converts a string to a vector (list of ints).

First runs a sentence tokenizer, then a word tokenizer.

vec_type is the type of the returned vector if the input is a string.

vec2txt(vector, delimiter=' ')

Converts a vector (iterable of ints) into a string, with each token separated by the delimiter (default ' ').

parlai.core.dict.escape(s)

Replace potential special characters with escaped version.

For example, newline => \n and tab => \t

Parameters: s – string to escape
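
For example (a small sketch):

    from parlai.core.dict import escape

    escape('hello\nworld')  # 'hello\\nworld': the newline becomes backslash-n
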
parlai.core.dict.find_ngrams(token_dict, text, n)

Break text into ngrams that appear in token_dict.

Parameters:
  • token_dict (dict) – dictionary to check for ngrams
  • text (str) – text to look for ngrams in
  • n (int) – maximum size of ngrams

parlai.core.dict.unescape(s)

Revert escaped characters back to their special version.

For example, \n => newline and \t => tab

Parameters: s – string to unescape
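
For example, unescape inverts escape (a small sketch):

    from parlai.core.dict import escape, unescape

    s = 'tab\tand newline\n'
    unescape(escape(s)) == s  # True: the round trip restores the original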