TFIDF Retriever

The TFIDF Retriever is an agent that constructs a TF-IDF matrix over all entries in a given task. It responds to a query by returning the highest-scoring documents. It uses a SQLite database for storing the sparse TF-IDF matrix, adapted from here.
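The retrieval idea described above can be illustrated with a minimal, standard-library-only sketch. This is not ParlAI's implementation: the whitespace tokenizer, the smoothed IDF formula, and the function names `build_index` and `retrieve` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercased whitespace tokenization; a stand-in for a real tokenizer.
    return text.lower().split()


def build_index(docs):
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in df.items()}
    # One sparse TF-IDF vector (a dict) per document.
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized
    ]
    return idf, vectors


def retrieve(query, docs, idf, vectors, k=5):
    # Build the query vector, ignoring terms never seen in the documents.
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    # Score each document by its dot product with the query vector.
    scores = [sum(w * vec.get(t, 0.0) for t, w in q.items()) for vec in vectors]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:k]]


docs = [
    "i love hiking in the mountains",
    "my favorite food is pizza",
    "i have two dogs and a cat",
]
idf, vectors = build_index(docs)
print(retrieve("do you like pizza", docs, idf, vectors, k=1))
# prints ['my favorite food is pizza']
```

Only the second document shares a term with the query, so it is the single highest-scoring result.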

Basic Examples

Construct a TFIDF matrix for use in retrieval on the personachat task.

parlai train_model -m tfidf_retriever -t personachat -mf /tmp/personachat_tfidf -dt train:ordered -eps 1

After construction, load and evaluate that model on the Persona-Chat test set.

parlai eval_model -t personachat -mf /tmp/personachat_tfidf -dt test

Alternatively, interact with a Wikipedia-based TFIDF model from the model zoo.

parlai interactive -mf zoo:wikipedia_full/tfidf_retriever/model

TfidfRetrieverAgent Options

Retriever Arguments




Number of CPU processes (for tokenizing, etc.).


Use up to N-size n-grams (e.g. 2 = unigrams + bigrams)

Default: 2.


Number of buckets to use for hashing ngrams

Default: 16777216.
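To make the two options above concrete, here is a hedged sketch of how n-grams up to size N can be generated and then hashed into a fixed number of buckets. The default bucket count of 16777216 equals 2^24. The helper names `ngrams` and `bucket` are hypothetical, and CRC32 is used only because it is deterministic and in the standard library; the actual agent may use a different hash function.

```python
import zlib


def ngrams(tokens, n):
    # All contiguous n-grams of sizes 1..n, joined with spaces.
    return [
        " ".join(tokens[i : i + size])
        for size in range(1, n + 1)
        for i in range(len(tokens) - size + 1)
    ]


def bucket(gram, num_buckets=2 ** 24):
    # Map an n-gram string to a bucket id in [0, num_buckets).
    # CRC32 is an illustrative choice, not necessarily what the agent uses.
    return zlib.crc32(gram.encode("utf-8")) % num_buckets


tokens = "the quick brown fox".split()
grams = ngrams(tokens, 2)  # unigrams + bigrams
print(grams)
# prints ['the', 'quick', 'brown', 'fox', 'the quick', 'quick brown', 'brown fox']

ids = [bucket(g) for g in grams]
```

Hashing keeps the matrix width fixed regardless of vocabulary size, at the cost of occasional collisions between distinct n-grams.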


String option specifying tokenizer type to use.

Default: simple.


How many docs to retrieve.

Default: 5.


Whether to remove the title from the retrieved passage

Default: False.


Whether to retrieve the stored key or the stored value. For example, if you want to return the text of an example, use keys here; if you want to return the label, use values here.

Choices: keys, values.

Default: values.
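The distinction between keys and values can be sketched with an in-memory SQLite table. The schema below (a `documents` table with `doc_key` and `doc_value` columns) and the `lookup` helper are hypothetical, chosen for illustration; they are not the agent's actual storage layout.

```python
import sqlite3

# Hypothetical schema: each stored example has a key (its text)
# and a value (its label).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, doc_key TEXT, doc_value TEXT)"
)
conn.executemany(
    "INSERT INTO documents (id, doc_key, doc_value) VALUES (?, ?, ?)",
    [
        (0, "what is your favorite food?", "i love pizza."),
        (1, "do you have any pets?", "yes, two dogs."),
    ],
)


def lookup(conn, doc_id, mode="values"):
    # Return the stored key or the stored value for an integer doc id.
    if mode not in ("keys", "values"):
        raise ValueError("mode must be 'keys' or 'values'")
    column = "doc_key" if mode == "keys" else "doc_value"
    row = conn.execute(
        f"SELECT {column} FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return row[0] if row else None


print(lookup(conn, 1, mode="keys"))    # prints do you have any pets?
print(lookup(conn, 1, mode="values"))  # prints yes, two dogs.
```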


Whether to index into the database by doc id as an integer. This defaults to True for DBs built using ParlAI. For the DrQA wiki dump, it must be set to False to index into the DB correctly.

Default: True.


Number of past utterances to remember when building flattened batches of data in multi-example episodes.

Default: -1.


Specifies whether or not to include labels as past utterances when building flattened batches of data in multi-example episodes.

Default: True.
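The last two options interact when a multi-example episode is flattened into individual examples. The sketch below shows one plausible reading; it is not ParlAI's code. The function `flatten_episode` is hypothetical, and treating a negative context length as "no limit" is an assumption, since the default of -1 is not explained above.

```python
def flatten_episode(episode, context_length=-1, include_labels=True):
    """Flatten a list of (text, label) turns into (context, label) examples."""
    past = []  # past utterances available as context
    flattened = []
    for text, label in episode:
        if context_length < 0:
            kept = past  # assumption: negative means keep everything
        elif context_length == 0:
            kept = []  # handled separately because past[-0:] returns all
        else:
            kept = past[-context_length:]
        flattened.append(("\n".join(kept + [text]), label))
        past.append(text)
        if include_labels:
            # Labels count as past utterances only when requested.
            past.append(label)
    return flattened


episode = [("hi", "hello"), ("how are you", "fine")]

print(flatten_episode(episode))
# prints [('hi', 'hello'), ('hi\nhello\nhow are you', 'fine')]

print(flatten_episode(episode, include_labels=False))
# prints [('hi', 'hello'), ('hi\nhow are you', 'fine')]
```

With `context_length=1` and labels included, the second example's context would contain only the most recent past utterance, here the label "hello", followed by the current text.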