TFIDF Retriever

The TFIDF Retriever is an agent that constructs a TF-IDF matrix over all entries in a given task. It generates responses by returning the highest-scoring documents for a query. It stores the sparse TF-IDF matrix in a SQLite database, adapted from here.
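To make the retrieval idea concrete, the following is a minimal, self-contained sketch of hashed n-gram TF-IDF retrieval. It is an illustration only, not ParlAI's implementation: the function names (`ngrams`, `bucket`, `build_index`, `retrieve`), the CRC32 hash, the smoothed IDF formula, and the in-memory dictionaries (instead of a SQLite-backed sparse matrix) are all assumptions made for this example. The `n`, `hash_size`, and `k` parameters loosely mirror the `--retriever-ngram`, `--retriever-hashsize`, and `--retriever-num-retrieved` options.

```python
import math
import re
import zlib
from collections import Counter


def ngrams(text, n=2):
    """Return all 1..n-grams of a lowercased, word-tokenized text."""
    tokens = re.findall(r"\w+", text.lower())
    grams = []
    for size in range(1, n + 1):
        for i in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[i:i + size]))
    return grams


def bucket(gram, hash_size):
    """Map an n-gram to one of hash_size buckets (hypothetical hash choice)."""
    return zlib.crc32(gram.encode("utf-8")) % hash_size


def build_index(docs, n=2, hash_size=2 ** 24):
    """Build one sparse TF-IDF vector (dict: bucket -> weight) per document."""
    counts = [Counter(bucket(g, hash_size) for g in ngrams(d, n)) for d in docs]
    # Document frequency: in how many documents does each bucket occur?
    doc_freq = Counter(b for c in counts for b in c)
    num_docs = len(docs)
    # Smoothed IDF; an assumed formula chosen for simplicity.
    idf = {b: math.log((1 + num_docs) / (1 + df)) + 1.0 for b, df in doc_freq.items()}
    vectors = [{b: math.log1p(tf) * idf[b] for b, tf in c.items()} for c in counts]
    return vectors, idf


def retrieve(query, docs, vectors, idf, k=5, n=2, hash_size=2 ** 24):
    """Return the k documents with the highest dot-product score for the query."""
    q_counts = Counter(bucket(g, hash_size) for g in ngrams(query, n))
    # Buckets never seen at index time carry no weight.
    q_vec = {b: math.log1p(tf) * idf[b] for b, tf in q_counts.items() if b in idf}
    scores = [sum(w * vec.get(b, 0.0) for b, w in q_vec.items()) for vec in vectors]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:k]]


docs = [
    "i love hiking in the mountains",
    "my favorite food is pizza",
    "i have two dogs and a cat",
]
vectors, idf = build_index(docs)
top = retrieve("do you like pizza", docs, vectors, idf, k=1)
```

In this toy corpus the only n-gram shared between the query and any document is "pizza", so the second document is ranked first. A real index over a full task would hold far more documents, which is why the agent keeps the matrix sparse and on disk.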

Basic Examples

Construct a TFIDF matrix for use in retrieval for the personachat task:

parlai train_model -m tfidf_retriever -t personachat -mf /tmp/personachat_tfidf -dt train:ordered -eps 1

After construction, load and evaluate that model on the Persona-Chat test set:

parlai eval_model -t personachat -mf /tmp/personachat_tfidf -dt test

Alternatively, interact with a Wikipedia-based TFIDF model from the model zoo:

parlai interactive -mf zoo:wikipedia_full/tfidf_retriever/model

TfidfRetrieverAgent Options

Retriever Arguments

Argument	Description
`--retriever-numworkers`	Number of CPU processes (for tokenizing, etc.).
`--retriever-ngram`	Use up to N-size n-grams (e.g. 2 = unigrams + bigrams). Default: `2`.
`--retriever-hashsize`	Number of buckets to use for hashing n-grams. Default: `16777216`.
`--retriever-tokenizer`	String option specifying the tokenizer type to use. Default: `simple`.
`--retriever-num-retrieved`	How many docs to retrieve. Default: `5`.
`--remove-title`	Whether to remove the title from the retrieved passage. Default: `False`.
`--retriever-mode`	Whether to retrieve the stored key or the stored value. For example, if you want to return the text of an example, use `keys` here; if you want to return the label, use `values` here. Choices: `keys`, `values`. Default: `values`.
`--index-by-int-id`	Whether to index into the database by doc id as an integer. This defaults to true for DBs built using ParlAI. Default: `True`.
`--tfidf-context-length`	Number of past utterances to remember when building flattened batches of data in multi-example episodes. Default: `-1`.
`--tfidf-include-labels`	Specifies whether to include labels as past utterances when building flattened batches of data in multi-example episodes. Default: `True`.
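The last two options control how a multi-turn episode is flattened into standalone examples. The sketch below illustrates one plausible reading of those descriptions; `flatten_episode` is a hypothetical helper written for this example, not a ParlAI function, and it assumes that a context length of `-1` means "keep the full history".

```python
def flatten_episode(episode, context_length=-1, include_labels=True):
    """Turn one multi-turn episode into independent (context, label) pairs.

    episode: list of (text, label) turns in order.
    context_length: number of past utterances to keep; values <= 0 keep all
        (assumed meaning of the -1 default).
    include_labels: whether earlier labels are added to the history.
    """
    history = []
    flattened = []
    for text, label in episode:
        history.append(text)
        if context_length > 0:
            # Keep only the most recent utterances.
            history = history[-context_length:]
        flattened.append(("\n".join(history), label))
        if include_labels:
            # Earlier labels become part of the context for later turns.
            history.append(label)
    return flattened


episode = [("hi", "hello"), ("how are you?", "fine"), ("bye", "see you")]
full = flatten_episode(episode)
short = flatten_episode(episode, context_length=2)
no_labels = flatten_episode(episode, include_labels=False)
```

With the defaults, the context of the final turn contains every earlier utterance and label. With `context_length=2` it shrinks to the last two utterances, and with `include_labels=False` only the input texts accumulate.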