TFIDF Retriever

The TFIDF Retriever is an agent that constructs a TF-IDF matrix for all entries in a given task. It generates responses by returning the highest-scoring documents for a query. It stores the sparse TF-IDF matrix in a SQLite database, adapted from here.
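As a rough illustration of the idea described above, the following sketch builds TF-IDF weights for a handful of documents and returns the indices of the highest-scoring ones for a query. It is not ParlAI's implementation. The whitespace tokenizer, the log-scaled weighting, the in-memory dictionaries (used here instead of a SQLite-backed sparse matrix), and the names build_index and retrieve are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def build_index(docs):
    """Return (idf, vectors): per-term IDF and one sparse TF-IDF dict per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    doc_freq = Counter()
    for count in counts:
        doc_freq.update(count.keys())  # each term counted once per document
    n_docs = len(docs)
    # One common smoothed IDF variant; other weightings are equally plausible.
    idf = {term: math.log(1 + n_docs / freq) for term, freq in doc_freq.items()}
    vectors = [
        {term: math.log1p(tf) * idf[term] for term, tf in count.items()}
        for count in counts
    ]
    return idf, vectors


def retrieve(query, idf, vectors, k=5):
    """Return indices of up to k documents with the highest positive score."""
    query_counts = Counter(tokenize(query))
    # Terms never seen at indexing time carry no weight and are skipped.
    query_vec = {
        term: math.log1p(tf) * idf[term]
        for term, tf in query_counts.items()
        if term in idf
    }
    scored = [
        (sum(weight * vec.get(term, 0.0) for term, weight in query_vec.items()), i)
        for i, vec in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [i for score, i in scored[:k] if score > 0]


docs = [
    "i love hiking in the mountains",
    "my favorite food is pizza",
    "i have two dogs and a cat",
]
idf, vectors = build_index(docs)
print(retrieve("what food do you like", idf, vectors))
```

The score here is a plain dot product between the query vector and each document vector, so a document only ranks if it shares at least one term with the query.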

Basic Examples

Construct a TFIDF matrix for retrieval on the personachat task:

parlai train_model -m tfidf_retriever -t personachat -mf /tmp/personachat_tfidf -dt train:ordered -eps 1

After construction, load and evaluate that model on the Persona-Chat test set:

parlai eval_model -t personachat -mf /tmp/personachat_tfidf -dt test

Alternatively, interact with a Wikipedia-based TFIDF model from the model zoo:

parlai interactive -mf zoo:wikipedia_full/tfidf_retriever/model

TfidfRetrieverAgent Options

Retriever Arguments

| Argument | Description |
|---|---|
| --retriever-numworkers | Number of CPU processes (for tokenizing, etc.). |
| --retriever-ngram | Use up to N-size n-grams (e.g. 2 = unigrams + bigrams). Default: 2. |
| --retriever-hashsize | Number of buckets to use for hashing n-grams. Default: 16777216. |
| --retriever-tokenizer | String option specifying tokenizer type to use. Default: simple. |
| --retriever-num-retrieved | How many docs to retrieve. Default: 5. |
| --remove-title | Whether to remove the title from the retrieved passage. Default: False. |
| --retriever-mode | Whether to retrieve the stored key or the stored value. For example, if you want to return the text of an example, use keys here; if you want to return the label, use values here. Choices: keys, values. Default: values. |
| --index-by-int-id | Whether to index into the database by doc id as an integer. This defaults to True for DBs built using ParlAI; for the DrQA wiki dump, it is necessary to set this to False to index into the DB appropriately. Default: True. |
| --tfidf-context-length | Number of past utterances to remember when building flattened batches of data in multi-example episodes. Default: -1. |
| --tfidf-include-labels | Specifies whether or not to include labels as past utterances when building flattened batches of data in multi-example episodes. Default: True. |
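
To make the --retriever-ngram and --retriever-hashsize options more concrete, here is a minimal sketch of hashing n-grams into a fixed number of buckets. The default hash size of 16777216 equals 2**24. Everything else is an assumption for illustration: the helper names ngrams, bucket, and featurize are hypothetical, and MD5 from the standard library stands in for whatever hash function the agent actually uses.

```python
import hashlib
from collections import Counter

HASH_SIZE = 2 ** 24  # 16777216, the documented default for --retriever-hashsize


def ngrams(tokens, n=2):
    """All contiguous 1..n-grams; n=2 gives unigrams followed by bigrams."""
    return [
        " ".join(tokens[i:i + size])
        for size in range(1, n + 1)
        for i in range(len(tokens) - size + 1)
    ]


def bucket(gram, hash_size=HASH_SIZE):
    # Assumption: MD5 is only a stand-in hash chosen because it is deterministic
    # and available in the standard library.
    digest = hashlib.md5(gram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % hash_size


def featurize(text, n=2, hash_size=HASH_SIZE):
    """Map text to a sparse {bucket_id: count} representation."""
    tokens = text.lower().split()
    return Counter(bucket(gram, hash_size) for gram in ngrams(tokens, n))


print(ngrams("the quick fox".split(), 2))
print(featurize("the quick fox"))
```

A larger hash size lowers the chance that two different n-grams collide in the same bucket, at the cost of a wider (though still sparse) matrix.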