Retrieval-Augmented Generation (RAG)

The code in this directory implements the RAG Model as used in the reducing hallucination project. The README is broken up into the following sections:

Installation instructions
Quick-start tutorial for using RAG
In-depth discussion of RAG options
Tutorial for generating your own embeddings / building your own index
Directory structure/overview

If you have any questions, please reach out to @klshuster or @spencerp.

Installation / Memory Requirements

Before using RAG, you’ll need to make sure that you have installed FAISS; preferably, you should install the faiss-gpu library (installation instructions here), but RAG will work with faiss-cpu as well (faiss-gpu will simply speed up index construction).

To train a RAG model with the default options – RAG-Token with BART-Large generator, and DPR Retrieval over all of Wikipedia – you’ll need the following system requirements:

RAM

Loading the Wikipedia passages into memory requires ~22GB of RAM.

If you use --indexer-type compressed --path-to-index zoo:hallucination/wiki_passages_compressed/compressed_pq, you’ll only require an additional ~3GB of RAM; if you use --indexer-type exact --path-to-index zoo:hallucination/wiki_passages_exact/exact, you’ll need an additional ~80GB of RAM.
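As a rough budgeting aid, the figures above can be added up per index type. The sketch below is only arithmetic over the approximate numbers quoted in this section; the function name and constants are illustrative and are not part of ParlAI.

```python
# Approximate RAM figures quoted in this section, in GB.
PASSAGES_RAM_GB = 22
INDEX_RAM_GB = {"compressed": 3, "exact": 80}


def estimated_ram_gb(indexer_type: str) -> int:
    """Return the rough total RAM (GB): Wikipedia passages plus the chosen index."""
    if indexer_type not in INDEX_RAM_GB:
        raise ValueError(f"unknown indexer type: {indexer_type!r}")
    return PASSAGES_RAM_GB + INDEX_RAM_GB[indexer_type]


if __name__ == "__main__":
    for name in INDEX_RAM_GB:
        print(f"--indexer-type {name}: ~{estimated_ram_gb(name)} GB of RAM")
```

These totals cover only the passages and the index; the model itself and the training process need additional memory.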

GPU

To train BART-Large RAG / FiD models with a batchsize of 16 (or DPR-Poly models with a batchsize of 8), you’ll want to have at least 4x32GB GPUs. You can adjust the batchsize to fit your GPU memory constraints.

To evaluate / interact with any pre-trained models (e.g., those mentioned here), you’ll only need one 16GB GPU.

RAG Quick Start

You can use RAG like any other model in ParlAI; simply specify -m rag, and you’re good to go! Here’s an example command to train RAG on the Wizard of Wikipedia Dataset:

# The flags below are the standard optimization/truncation parameters,
# followed by the BART-Large preset (-o arch/bart_large).
parlai train_model -m rag -t wizard_of_wikipedia -mf /path/to/model_file \
--batchsize 16 --fp16 True --gradient-clip 0.1 --label-truncate 128 \
--log-every-n-secs 30 --lr-scheduler reduceonplateau --lr-scheduler-patience 1 \
--model-parallel True --optimizer adam --text-truncate 512 --truncate 512 \
-lr 1e-05 -vmm min -veps 0.25 -vme 1000 -vmt ppl -vp 5 \
-o arch/bart_large

RAG Options

RAG in ParlAI is quite flexible, and can support a variety of different base seq2seq models, retrievers, and “model types”; we outline the different options below. Bolded options are the default options.

RAG Seq2Seq Generators: `--generation-model`

We support three backbones:

--generation-model bart: The default option uses BART as the backbone generator, which was used in the vast majority of experiments in this paper.
--generation-model transformer/generator: If you want to use/initialize RAG with a standard Transformer model trained in ParlAI, set this to transformer/generator.
--generation-model t5: Finally, we provide T5 as a generator backbone as well; see here for additional T5-specific parameters.

RAG Model Types: `--rag-model-type`

RAG comes in three flavors: RAG-Token, RAG-Sequence, and RAG-Turn. The first two are outlined in the original RAG paper; the third is outlined in our retrieval-augmented dialogue work.

--rag-model-type token: The RAG-Token model jointly attends to all documents, allowing each token to draw from a latent document.
--rag-model-type sequence: The RAG-Sequence model attends to each retrieved document separately, re-ranking generations according to document probabilities.
--rag-model-type turn: The RAG-Turn model retrieves documents for each turn of dialogue context, and either attends jointly over all turns and documents (--rag-turn-marginalize doc_then_turn) or over each turn separately (--rag-turn-marginalize doc_only).
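To make the difference between the first two model types concrete, the toy example below marginalizes over documents in the two ways described in the original RAG paper: RAG-Sequence scores the whole sequence under each document and then sums over documents, while RAG-Token sums over documents at every token and then multiplies across tokens. The probabilities are made up for illustration, and this is plain Python rather than ParlAI code.

```python
import math

# Toy setup: 2 retrieved documents and a 3-token target sequence.
doc_probs = [0.7, 0.3]  # p(z | x) for each document z
# token_probs[z][i] = p(y_i | x, z, y_<i) under document z
token_probs = [
    [0.9, 0.2, 0.8],
    [0.1, 0.9, 0.5],
]


def rag_sequence_likelihood(doc_probs, token_probs):
    """Sum over documents of p(z|x) times the product of token probabilities."""
    return sum(pz * math.prod(toks) for pz, toks in zip(doc_probs, token_probs))


def rag_token_likelihood(doc_probs, token_probs):
    """Product over tokens of the document-marginalized token probability."""
    n_tokens = len(token_probs[0])
    per_token = [
        sum(pz * toks[i] for pz, toks in zip(doc_probs, token_probs))
        for i in range(n_tokens)
    ]
    return math.prod(per_token)


if __name__ == "__main__":
    print("RAG-Sequence:", rag_sequence_likelihood(doc_probs, token_probs))
    print("RAG-Token:   ", rag_token_likelihood(doc_probs, token_probs))
```

In this toy case RAG-Token assigns the sequence a higher likelihood, because each token can lean on whichever document explains it best.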

RAG Retriever Types: `--rag-retriever-type`

We provide a few of the several retrievers considered in our work; we outline them below:

--rag-retriever-type dpr: The canonical retrieval system for RAG uses a Dense Passage Retriever for retrieval over a FAISS Index. The default options retrieve over all of Wikipedia.
--rag-retriever-type tfidf: One can additionally use a TFIDF retriever; the default retrieves over all of Wikipedia.
--rag-retriever-type dpr_then_poly: The RAG DPR-Poly model adds a re-ranking step with a Poly-encoder that re-ranks the retrieved documents from a DPR model.
--rag-retriever-type poly_faiss: If you have trained a Dropout Poly-encoder and have built an index with that model, you can use the PolyFAISS method, which uses a Poly-encoder model directly to both query FAISS and re-rank retrieved documents.

Other RAG Options

All of the options for using RAG can be found in the args.py file; below, we highlight a few that are important.

Number of Retrieved Documents

Set the --n-docs flag to tell RAG how many documents to retrieve.

Thorough Decoding

For RAG-Sequence, and RAG-Turn Doc-Only, you can specify --thorough True to use thorough decoding; this method will rescore hypotheses by running an additional forward pass of the model.

FAISS Indexes

We provide two indexes in our model zoo, which can be specified via the --path-to-index flag:

--path-to-index zoo:hallucination/wiki_passages/exact --indexer-type exact: The “exact” representation of the document embeddings in a FAISS Index. This index requires over 80GB of RAM but provides the fastest/most accurate results.
--path-to-index zoo:hallucination/wiki_passages/compressed --indexer-type compressed: The “compressed” representation of the document embeddings in a FAISS Index. This index requires only ~3GB of RAM but comes at the price of performance degradation. This is the default option as it works quite well despite the compression.

Generating your own FAISS Index

The default RAG parameters use the zoo:hallucination/wiki_passages/psgs_w100.tsv corpus, which is 21m passages spanning all of Wikipedia. You can also generate your own FAISS index by following the steps below:

1a. [Recommended] Obtain/Choose a (Pre-trained) DPR Model

The RAG model works really well with DPR models as the backbone retrievers; check out the DPR repository for some pre-trained DPR models (or, train your own!). Alternatively, you can specify a RAG or FiD model with DPR weights (perhaps, e.g., one from the ParlAI model zoo, such as zoo:hallucination/bart_rag_token/model).

1b. Train your own Dropout Poly-encoder

To utilize the PolyFAISS method, you can train your own DropoutPolyencoder as usual in ParlAI.

2. Generate Dense Embeddings (~1-2 hours minutes if sharded appropriately - 50 x 1 GPU).

WARNING: If you generated passage embeddings prior to 11/19/2021, you may have corrupted embeddings, especially if you were using a relatively small set of passages (anything under ~50k), and found that indexing took excessively long (anything over a couple minutes); see #4199 for more details.

After obtaining a DPR model, you’ll need to generate dense embeddings on a dataset. The data should be in a tab-separated (tsv) file with the following format:

  integer document id starting at zero<tab>document text<tab>document title

Check /path/to/ParlAI/data/models/hallucination/wiki_passages/psgs_w100.tsv for inspiration.
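If your documents are not yet in this layout, a short script can produce it. The sketch below assumes your passages are available as (text, title) pairs and that the file will be read with a standard tab-delimited CSV reader; the helper name `write_passages_tsv` is hypothetical and not part of ParlAI.

```python
import csv


def write_passages_tsv(path, passages):
    """Write (text, title) pairs as id<tab>text<tab>title, with ids starting at 0."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        for doc_id, (text, title) in enumerate(passages):
            # Collapse tabs/newlines inside a field so every row keeps 3 columns.
            clean_text = " ".join(text.split())
            clean_title = " ".join(title.split())
            writer.writerow([doc_id, clean_text, clean_title])


if __name__ == "__main__":
    write_passages_tsv(
        "passages.tsv",
        [
            ("Paris is the capital of France.", "Paris"),
            ("Berlin is the capital of Germany.", "Berlin"),
        ],
    )
```

Collapsing whitespace is a design choice for this sketch: it keeps each passage on a single line at the cost of losing the original line breaks.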

Then, you can use the generate_dense_embeddings.py script to run the following command:

python generate_dense_embeddings.py --model-file /path/to/dpr/model --dpr-model True \
--passages-file /path/to/passages --outfile /path/to/saved/embeddings \
--shard-id <shard_id> --num-shards <num_shards> -bs <batchsize>

If the provided --model-file is either a path to a DPR model or a path to a ParlAI RAG/FiD model, specify --dpr-model True so that the script can appropriately extract the DPR weights; if you use a Dropout Poly-encoder, set --dpr-model to False. The script will generate embeddings with the DPR model for shard <shard_id> of the data, and save two files:

/path/to/saved/embeddings_<shard_id>: The concatenated tensor of embeddings
/path/to/saved/ids_<shard_id>: The list of document ids that corresponds to these embeddings.

An example command would look like this:

python generate_dense_embeddings.py -mf zoo:hallucination/multiset_dpr/hf_bert_base.cp --dpr-model True \
--passages-file zoo:hallucination/wiki_passages/psgs_w100.tsv  \
--outfile /tmp/wiki_passage_embeddings/wiki_passages --num-shards 50 --shard-id 0 -bs 32

--num-shards: If your dataset is relatively small, feel free to generate with only one shard.
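When sharding, the script has to be run once per shard. The sketch below builds one command line per shard so they can be handed to a job scheduler or run in sequence; it uses only the flags shown in the commands above and assumes shard ids run from 0 to `num_shards - 1`. The helper name `shard_commands` is hypothetical.

```python
import shlex


def shard_commands(model_file, passages_file, outfile, num_shards, batchsize=32):
    """Build one generate_dense_embeddings.py command string per shard."""
    commands = []
    for shard_id in range(num_shards):
        args = [
            "python", "generate_dense_embeddings.py",
            "--model-file", model_file,
            "--dpr-model", "True",
            "--passages-file", passages_file,
            "--outfile", outfile,
            "--num-shards", str(num_shards),
            "--shard-id", str(shard_id),
            "-bs", str(batchsize),
        ]
        # shlex.join quotes any argument that needs it for a POSIX shell.
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    for cmd in shard_commands(
        "zoo:hallucination/multiset_dpr/hf_bert_base.cp",
        "zoo:hallucination/wiki_passages/psgs_w100.tsv",
        "/tmp/wiki_passage_embeddings/wiki_passages",
        num_shards=50,
    ):
        print(cmd)
```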

3. Index the Dense Embeddings

The final step is to build the full FAISS index from these dense embeddings. You can use the index_dense_embeddings.py script to achieve this. You can choose one of the following options when indexing your embeddings for varying results, depending on the size of your dataset:

Recommended for large passage sets --indexer-type compressed: This will build a compressed index using FAISS compression techniques; this usually takes only a couple of hours and results in small index files, but comes at the cost of accuracy. Only use this if your machine would struggle to fit all of your dense embedding vectors in memory.
Recommended for small passage sets --indexer-type exact: This will build a large HNSW-style index with the flat embeddings. The resulting index is generally at least as large as the sum of the sizes of the embeddings. Use with caution with large passage sets; however, if you can reasonably fit all of your dense embedding vectors in memory, this is a suitable option.
--indexer-type compressed --compressed-indexer-factory <index_factory>: If you know what you’re doing (and understand how to use the index factory in FAISS), feel free to specify your own Index Factory settings. This method is only recommended if you’re an advanced FAISS user.

If we saved our embedding shards at /path/to/saved/embeddings_0, the script is used as follows:

python index_dense_embeddings.py --retriever-embedding-size <retriever_emb_size>  \
--embeddings-dir /path/to/saved/ --embeddings-name <embeddings> --indexer-type <indexer_type>

Example:

python index_dense_embeddings.py --retriever-embedding-size 768  \
--embeddings-dir /tmp/wiki_passage_embeddings/ --embeddings-name wiki_passages

Note that the default index factory setting is IVF4096_HNSW128,PQ128. If you are processing small passage files, you may encounter errors such as Error: 'nx >= k' failed; in that case, set --compressed-indexer-factory to a different FAISS index factory string, such as HNSW32.
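As far as we understand, the `nx >= k` failure comes from FAISS's k-means training, which needs at least as many training vectors (`nx`) as centroids (`k`); the `IVF4096` component of the default factory asks for 4096 centroids. The sketch below picks a factory string from the passage count on that assumption. The helper names are hypothetical, and the fallback is simply the `HNSW32` setting mentioned above.

```python
DEFAULT_FACTORY = "IVF4096_HNSW128,PQ128"
SMALL_SET_FACTORY = "HNSW32"


def n_ivf_centroids(factory: str) -> int:
    """Return the centroid count of a leading IVF component, or 0 if there is none."""
    first = factory.split(",")[0].split("_")[0]
    if first.startswith("IVF") and first[3:].isdigit():
        return int(first[3:])
    return 0


def pick_index_factory(num_passages: int, factory: str = DEFAULT_FACTORY) -> str:
    """Fall back to HNSW32 when there are fewer passages than IVF centroids."""
    if num_passages < n_ivf_centroids(factory):
        return SMALL_SET_FACTORY
    return factory


if __name__ == "__main__":
    for n in (1_000, 21_000_000):
        print(n, "->", pick_index_factory(n))
```

The returned string would be passed via --compressed-indexer-factory together with --indexer-type compressed. Having just enough vectors avoids the error but may still give a poorly trained index, so treat the threshold as a lower bound rather than a recommendation.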

Directory Structure / Custom Components

I will outline here the structure of the RAG directory, and where you might want to add custom components if you so desire.

args.py: Contains the parameters used to train RAG Models. Explore at your leisure.
conversion_utils.py: Utility functions for converting DPR models to ParlAI-style models.
dpr_agent.py: A wrapper around DPR Models for use in ParlAI.
indexers.py: Contains implementations of “Indexers”, which are essentially wrappers for interacting with FAISS Indexes.
model_types.py: Contains the interfaces for RAG-Token, RAG-Sequence, and RAG-Turn. The interfaces define the model-type-specific functionality for each RAG type.
modules.py: Contains the actual RagModel implementation. The components of a RagModel are model-type-agnostic, and thus they are separate from the implementations in model_types.py.
rag.py: Contains the RagAgent implementation.
retrievers.py: Contains retrievers used in the RagModel.

Custom Components

Sequence to Sequence Models

The RagModel tries to be as generic as possible with the underlying seq2seq architecture; to fit future generator models, one can look to the T5RagModel in modules.py as inspiration for what a custom model may look like.

Retriever Models

The RAG Retriever models are generic as well, and simply require that a retrieve function is defined. The base RagRetriever defines the interface for the retriever, so as long as a subclass implements the necessary functions, adding new retrievers is a straightforward exercise.

BartAgent Options

optional arguments

Argument	Description
`--gpu-beam-blocking`	Set to use CUDA kernel for beam search ngram blocking Default: `False`.
`--verbose-topk`	Return the topk logits in the act message, if verbose mode is set. Default: `-1`.

Transformer Arguments

Argument	Description
`--embedding-size`, `--esz`	Size of all embedding layers. Must be a multiple of --n-heads. Default: `1024`.
`--n-layers`, `--nl`	Number of transformer layers. Default: `2`.
`--ffn-size`, `--hid`	Hidden size of the FFN layers Default: `4096`.
`--dropout`	Dropout used around embeddings and before layer normalizations. This is used in Vaswani 2017 and works well on large datasets. Default: `0.1`.
`--attention-dropout`	Dropout used after attention softmax. This is not used in Vaswani 2017. Default: `0.1`.
`--relu-dropout`	Dropout used after the ReLU in the FFN. Not used in Vaswani 2017, but used in Tensor2Tensor. Default: `0.0`.
`--n-heads`	Number of multihead attention heads Default: `16`.
`--learn-positional-embeddings`	If off, sinusoidal embeddings are used. If on, position embeddings are learned from scratch. Default: `True`.
`--embeddings-scale`	Default: `False`.
`--n-segments`	The number of segments that support the model. If zero no segment and no langs_embedding. Default: `0`.
`--variant`	Chooses locations of layer norms, etc. prelayernorm is used to match some fairseq models Choices: `xlm`, `prelayernorm`, `bart`, `aiayn`. Default: `bart`. Recommended: `xlm`.
`--activation`	Nonlinear activation to use. AIAYN uses relu, but more recent papers prefer gelu. Choices: `gelu`, `relu`. Default: `gelu`. Recommended: `gelu`.
`--output-scaling`	Scale the output of every transformer by this quantity. Default: `1.0`.
`--share-word-embeddings`	Share word embeddings table for candidate and context in the memory network. Default: `True`.
`--n-encoder-layers`, `--nel`	This will override the n-layers for asymmetrical transformers. Default: `12`.
`--n-decoder-layers`, `--ndl`	This will override the n-layers for asymmetrical transformers. Default: `12`.
`--model-parallel`	Shard the layers across multiple GPUs. Default: `False`.
`--checkpoint-activations`	Recompute activations on backward pass to conserve memory. Default: `False`.

Torch Generator Agent

Argument	Description
`--beam-size`	Beam size, if 1 then greedy search Default: `1`.
`--beam-min-length`	Minimum length of prediction to be generated by the beam search Default: `1`.
`--beam-context-block-ngram`	Size n-grams to block in beam search from the context. val <= 0 implies no blocking Default: `-1`.
`--beam-block-ngram`	Size n-grams to block in beam search. val <= 0 implies no blocking Default: `-1`.
`--beam-block-full-context`	Block n-grams from the full history context. Specify False to block up to m tokens in the past, where m is truncation parameter for agent Default: `True`.
`--beam-length-penalty`	Applies a length penalty. Set to 0 for no penalty. Default: `0.65`.
`--inference`	Generation algorithm Choices: `beam`, `nucleus`, `delayedbeam`, `greedy`, `delayednucleusbeam`, `topk`, `factual_nucleus`. Default: `greedy`.
`--topk`	K used in Top K sampling Default: `10`.
`--topp`	P used in nucleus sampling Default: `0.9`.
`--beam-delay`	Used in delayedbeam search Default: `30`.
`--lambda-decay`	Decay factor in factual nucleus sampling Default: `0.9`.
`--omega-bound`	Lower bound in factual nucleus sampling Default: `0.3`.
`--p-reset`	Whether to reset p value in factual nucleus at full stops Default: `True`.
`--beam-block-list-filename`	Load a text file of hard blocks for beam search to never say.
`--temperature`	Temperature to add during decoding Default: `1.0`.
`--compute-tokenized-bleu`	If true, compute tokenized bleu scores Default: `False`.

TorchAgent Arguments

Argument	Description
`--interactive-mode`, `--i`	Whether in full interactive mode or not, which means generating text or retrieving from a full set of candidates, which is necessary to actually do full dialogue. However, during training or quick validation (e.g. PPL for generation or ranking a few candidates for ranking models) you might want these set to off. Typically, scripts can set their preferred default behavior at the start, e.g. eval scripts. Default: `False`.
`--embedding-type`, `--emb`	Choose between different strategies for initializing word embeddings. Default is random, but can also preinitialize from Glove or Fasttext. Preinitialized embeddings can also be fixed so they are not updated during training. Choices: `random`, `glove`, `glove-fixed`, `fasttext`, `fasttext-fixed`, `fasttext_cc`, `fasttext_cc-fixed`. Default: `random`.
`--embedding-projection`, `--embp`	If pretrained embeddings have a different dimensionality than your embedding size, strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append “-force” to your choice. Default: `random`.
`--fp16`	Use fp16 computations. Default: `True`.
`--fp16-impl`	Implementation of FP16 to use Choices: `safe`, `mem_efficient`. Default: `safe`.
`--rank-candidates`, `--rc`	Whether the model should parse candidates for ranking. Default: `False`.
`--truncate`, `--tr`	Truncate input lengths to increase speed / use less memory. Default: `-1`.
`--text-truncate`	Text input truncation length: if not specified, this will default to `truncate`
`--label-truncate`	Label truncation length: if not specified, this will default to `truncate`
`--history-reversed`	Reverse the history Default: `False`.
`--history-size`, `--histsz`	Number of past dialog utterances to remember. Default: `-1`.
`--person-tokens`, `--pt`	Add person tokens to history. adds p1 in front of input text and p2 in front of past labels when available or past utterances generated by the model. these are added to the dictionary during initialization. Default: `False`.
`--split-lines`	Split the dialogue history on newlines and save in separate vectors Default: `False`.
`--delimiter`	Join history lines with this token, defaults to newline Default: `\n`.
`--special-tok-lst`	Comma separated list of special tokens. In case of ambiguous parses from special tokens, the ordering provided in this arg sets precedence.
`-gpu`, `--gpu`	Which GPU to use Default: `-1`.
`--no-cuda`	Disable GPUs even if available. otherwise, will use GPUs if available on the device. Default: `False`.

Optimizer Arguments

Argument	Description
`--optimizer`, `--opt`	Optimizer choice. Possible values: adadelta, adagrad, adam, adamw, sparseadam, adamax, asgd, sgd, radam, rprop, rmsprop, optimizer, nadam, lbfgs, mem_eff_adam, adafactor. Choices: `adadelta`, `adagrad`, `adam`, `adamw`, `sparseadam`, `adamax`, `asgd`, `sgd`, `radam`, `rprop`, `rmsprop`, `optimizer`, `nadam`, `lbfgs`, `mem_eff_adam`, `adafactor`. Default: `sgd`.
`--learningrate`, `--lr`	Learning rate Default: `1`.
`--gradient-clip`, `--clip`	Gradient clipping using l2 norm Default: `0.1`.
`--adafactor-eps`	Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively Default: `1e-30,1e-3`. Recommended: `1e-30,1e-3`.
`--momentum`, `--mom`	If applicable, momentum value for optimizer. Default: `0`.
`--nesterov`	If applicable, whether to use nesterov momentum. Default: `True`.
`--nus`, `--nu`	If applicable, nu value(s) for optimizer. can use a single value like 0.7 or a comma-separated tuple like 0.7,1.0 Default: `0.7`.
`--betas`, `--beta`	If applicable, beta value(s) for optimizer. can use a single value like 0.9 or a comma-separated tuple like 0.9,0.999 Default: `0.9,0.999`.
`--weight-decay`, `--wdecay`	Weight decay on the weights.

BPEHelper Arguments

Argument	Description
`--bpe-vocab`	Path to pre-trained tokenizer vocab
`--bpe-merge`	Path to pre-trained tokenizer merge
`--bpe-dropout`	Use BPE dropout during training.

Learning Rate Scheduler

Argument	Description
`--lr-scheduler`	Learning rate scheduler. Choices: `reduceonplateau`, `none`, `fixed`, `invsqrt`, `cosine`, `linear`. Default: `reduceonplateau`.
`--lr-scheduler-patience`	LR scheduler patience. In number of validation runs. If using fixed scheduler, LR is decayed every <patience> validations. Default: `3`.
`--lr-scheduler-decay`	Decay factor for LR scheduler, or how much LR is multiplied by when it is lowered. Default: `0.5`.
`--invsqrt-lr-decay-gamma`	Constant used only to find the lr multiplier for the invsqrt scheduler. Must be set for --lr-scheduler invsqrt. Default: `-1`.

Bart Args

Argument	Description
`--init-fairseq-model`	Fairseq checkpoint for bart
`--output-conversion-path`	Where to save fairseq conversion

BartRagAgent Options

optional arguments

Argument	Description
`--gpu-beam-blocking`	Set to use CUDA kernel for beam search ngram blocking Default: `False`.
`--verbose-topk`	Return the topk logits in the act message, if verbose mode is set. Default: `-1`.

Transformer Arguments

Argument	Description
`--embedding-size`, `--esz`	Size of all embedding layers. Must be a multiple of --n-heads. Default: `1024`.
`--n-layers`, `--nl`	Number of transformer layers. Default: `2`.
`--ffn-size`, `--hid`	Hidden size of the FFN layers Default: `4096`.
`--dropout`	Dropout used around embeddings and before layer normalizations. This is used in Vaswani 2017 and works well on large datasets. Default: `0.1`.
`--attention-dropout`	Dropout used after attention softmax. This is not used in Vaswani 2017. Default: `0.1`.
`--relu-dropout`	Dropout used after the ReLU in the FFN. Not used in Vaswani 2017, but used in Tensor2Tensor. Default: `0.0`.
`--n-heads`	Number of multihead attention heads Default: `16`.
`--learn-positional-embeddings`	If off, sinusoidal embeddings are used. If on, position embeddings are learned from scratch. Default: `True`.
`--embeddings-scale`	Default: `False`.
`--n-segments`	The number of segments that support the model. If zero no segment and no langs_embedding. Default: `0`.
`--variant`	Chooses locations of layer norms, etc. prelayernorm is used to match some fairseq models Choices: `xlm`, `prelayernorm`, `bart`, `aiayn`. Default: `bart`. Recommended: `xlm`.
`--activation`	Nonlinear activation to use. AIAYN uses relu, but more recent papers prefer gelu. Choices: `gelu`, `relu`. Default: `gelu`. Recommended: `gelu`.
`--output-scaling`	Scale the output of every transformer by this quantity. Default: `1.0`.
`--share-word-embeddings`	Share word embeddings table for candidate and context in the memory network. Default: `True`.
`--n-encoder-layers`, `--nel`	This will override the n-layers for asymmetrical transformers. Default: `12`.
`--n-decoder-layers`, `--ndl`	This will override the n-layers for asymmetrical transformers. Default: `12`.
`--model-parallel`	Shard the layers across multiple GPUs. Default: `False`.
`--checkpoint-activations`	Recompute activations on backward pass to conserve memory. Default: `False`.

Torch Generator Agent

Argument	Description
`--beam-size`	Beam size, if 1 then greedy search Default: `1`.
`--beam-min-length`	Minimum length of prediction to be generated by the beam search Default: `1`.
`--beam-context-block-ngram`	Size n-grams to block in beam search from the context. val <= 0 implies no blocking Default: `-1`.
`--beam-block-ngram`	Size n-grams to block in beam search. val <= 0 implies no blocking Default: `-1`.
`--beam-block-full-context`	Block n-grams from the full history context. Specify False to block up to m tokens in the past, where m is truncation parameter for agent Default: `True`.
`--beam-length-penalty`	Applies a length penalty. Set to 0 for no penalty. Default: `0.65`.
`--inference`	Generation algorithm Choices: `beam`, `nucleus`, `delayedbeam`, `greedy`, `delayednucleusbeam`, `topk`, `factual_nucleus`. Default: `greedy`.
`--topk`	K used in Top K sampling Default: `10`.
`--topp`	P used in nucleus sampling Default: `0.9`.
`--beam-delay`	Used in delayedbeam search Default: `30`.
`--lambda-decay`	Decay factor in factual nucleus sampling Default: `0.9`.
`--omega-bound`	Lower bound in factual nucleus sampling Default: `0.3`.
`--p-reset`	Whether to reset p value in factual nucleus at full stops Default: `True`.
`--beam-block-list-filename`	Load a text file of hard blocks for beam search to never say.
`--temperature`	Temperature to add during decoding Default: `1.0`.
`--compute-tokenized-bleu`	If true, compute tokenized bleu scores Default: `False`.

TorchAgent Arguments

Argument	Description
`--interactive-mode`, `--i`	Whether in full interactive mode or not, which means generating text or retrieving from a full set of candidates, which is necessary to actually do full dialogue. However, during training or quick validation (e.g. PPL for generation or ranking a few candidates for ranking models) you might want these set to off. Typically, scripts can set their preferred default behavior at the start, e.g. eval scripts. Default: `False`.
`--embedding-type`, `--emb`	Choose between different strategies for initializing word embeddings. Default is random, but can also preinitialize from Glove or Fasttext. Preinitialized embeddings can also be fixed so they are not updated during training. Choices: `random`, `glove`, `glove-fixed`, `fasttext`, `fasttext-fixed`, `fasttext_cc`, `fasttext_cc-fixed`. Default: `random`.
`--embedding-projection`, `--embp`	If pretrained embeddings have a different dimensionality than your embedding size, strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append “-force” to your choice. Default: `random`.
`--fp16`	Use fp16 computations. Default: `True`.
`--fp16-impl`	Implementation of FP16 to use Choices: `safe`, `mem_efficient`. Default: `safe`.
`--rank-candidates`, `--rc`	Whether the model should parse candidates for ranking. Default: `False`.
`--truncate`, `--tr`	Truncate input lengths to increase speed / use less memory. Default: `-1`.
`--text-truncate`	Text input truncation length: if not specified, this will default to `truncate`
`--label-truncate`	Label truncation length: if not specified, this will default to `truncate`
`--history-reversed`	Reverse the history Default: `False`.
`--history-size`, `--histsz`	Number of past dialog utterances to remember. Default: `-1`.
`--person-tokens`, `--pt`	Add person tokens to history. adds p1 in front of input text and p2 in front of past labels when available or past utterances generated by the model. these are added to the dictionary during initialization. Default: `False`.
`--split-lines`	Split the dialogue history on newlines and save in separate vectors Default: `False`.
`--delimiter`	Join history lines with this token, defaults to newline Default: `\n`.
`--special-tok-lst`	Comma separated list of special tokens. In case of ambiguous parses from special tokens, the ordering provided in this arg sets precedence.
`-gpu`, `--gpu`	Which GPU to use Default: `-1`.
`--no-cuda`	Disable GPUs even if available. otherwise, will use GPUs if available on the device. Default: `False`.

Optimizer Arguments

Argument	Description
`--optimizer`, `--opt`	Optimizer choice. Possible values: adadelta, adagrad, adam, adamw, sparseadam, adamax, asgd, sgd, radam, rprop, rmsprop, optimizer, nadam, lbfgs, mem_eff_adam, adafactor. Choices: `adadelta`, `adagrad`, `adam`, `adamw`, `sparseadam`, `adamax`, `asgd`, `sgd`, `radam`, `rprop`, `rmsprop`, `optimizer`, `nadam`, `lbfgs`, `mem_eff_adam`, `adafactor`. Default: `sgd`.
`--learningrate`, `--lr`	Learning rate Default: `1`.
`--gradient-clip`, `--clip`	Gradient clipping using l2 norm Default: `0.1`.
`--adafactor-eps`	Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively Default: `1e-30,1e-3`. Recommended: `1e-30,1e-3`.
`--momentum`, `--mom`	If applicable, momentum value for optimizer. Default: `0`.
`--nesterov`	If applicable, whether to use nesterov momentum. Default: `True`.
`--nus`, `--nu`	If applicable, nu value(s) for optimizer. can use a single value like 0.7 or a comma-separated tuple like 0.7,1.0 Default: `0.7`.
`--betas`, `--beta`	If applicable, beta value(s) for optimizer. can use a single value like 0.9 or a comma-separated tuple like 0.9,0.999 Default: `0.9,0.999`.
`--weight-decay`, `--wdecay`	Weight decay on the weights.

BPEHelper Arguments

Argument	Description
`--bpe-vocab`	Path to pre-trained tokenizer vocab
`--bpe-merge`	Path to pre-trained tokenizer merge
`--bpe-dropout`	Use BPE dropout during training.

Learning Rate Scheduler

Argument	Description
`--lr-scheduler`	Learning rate scheduler. Choices: `reduceonplateau`, `none`, `fixed`, `invsqrt`, `cosine`, `linear`. Default: `reduceonplateau`.
`--lr-scheduler-patience`	LR scheduler patience. In number of validation runs. If using fixed scheduler, LR is decayed every <patience> validations. Default: `3`.
`--lr-scheduler-decay`	Decay factor for LR scheduler, or how much LR is multiplied by when it is lowered. Default: `0.5`.
`--invsqrt-lr-decay-gamma`	Constant used only to find the lr multiplier for the invsqrt scheduler. Must be set for --lr-scheduler invsqrt. Default: `-1`.

Bart Args

Argument	Description
`--init-fairseq-model`	Fairseq checkpoint for bart
`--output-conversion-path`	Where to save fairseq conversion

DictionaryAgent Options

BPEHelper Arguments

Argument	Description
`--bpe-vocab`	Path to pre-trained tokenizer vocab
`--bpe-merge`	Path to pre-trained tokenizer merge
`--bpe-dropout`	Use BPE dropout during training.

PolyencoderAgent Options

optional arguments

Argument	Description
`--share-word-embeddings`	Share word embeddings table for candidate and context in the memory network. Default: `True`.

TorchAgent Arguments

Argument	Description
`--interactive-mode`, `--i`	Whether in full interactive mode or not, which means generating text or retrieving from a full set of candidates, which is necessary to actually do full dialogue. However, during training or quick validation (e.g. PPL for generation or ranking a few candidates for ranking models) you might want these set to off. Typically, scripts can set their preferred default behavior at the start, e.g. eval scripts. Default: `False`.
`--embedding-type`, `--emb`	Choose between different strategies for initializing word embeddings. Default is random, but can also preinitialize from Glove or Fasttext. Preinitialized embeddings can also be fixed so they are not updated during training. Choices: `random`, `glove`, `glove-fixed`, `fasttext`, `fasttext-fixed`, `fasttext_cc`, `fasttext_cc-fixed`. Default: `random`.
`--embedding-projection`, `--embp`	If pretrained embeddings have a different dimensionality than your embedding size, strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append “-force” to your choice. Default: `random`.
`--fp16`	Use fp16 computations. Default: `False`.
`--fp16-impl`	Implementation of FP16 to use Choices: `safe`, `mem_efficient`. Default: `safe`.
`--rank-candidates`, `--rc`	Whether the model should parse candidates for ranking. Default: `False`.
`--truncate`, `--tr`	Truncate input lengths to increase speed / use less memory. Default: `1024`.
`--text-truncate`	Text input truncation length: if not specified, this will default to `truncate`
`--label-truncate`	Label truncation length: if not specified, this will default to `truncate`
`--history-reversed`	Reverse the history Default: `False`.
`--history-size`, `--histsz`	Number of past dialog utterances to remember. Default: `-1`.
`--person-tokens`, `--pt`	Add person tokens to history. adds p1 in front of input text and p2 in front of past labels when available or past utterances generated by the model. these are added to the dictionary during initialization. Default: `False`.
`--split-lines`	Split the dialogue history on newlines and save in separate vectors Default: `False`.
`--delimiter`	Join history lines with this token, defaults to newline Default: `\n`.
`--special-tok-lst`	Comma separated list of special tokens. In case of ambiguous parses from special tokens, the ordering provided in this arg sets precedence.
`-gpu`, `--gpu`	Which GPU to use Default: `-1`.
`--no-cuda`	Disable GPUs even if available. otherwise, will use GPUs if available on the device. Default: `False`.

Optimizer Arguments

Argument	Description
`--optimizer`, `--opt`	Optimizer choice. Choices: `adadelta`, `adagrad`, `adam`, `adamw`, `sparseadam`, `adamax`, `asgd`, `sgd`, `radam`, `rprop`, `rmsprop`, `optimizer`, `nadam`, `lbfgs`, `mem_eff_adam`, `adafactor`. Default: `adamax`.
`--learningrate`, `--lr`	Learning rate. Default: `0.0001`.
`--gradient-clip`, `--clip`	Gradient clipping using L2 norm. Default: `0.1`.
`--adafactor-eps`	Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively. Default: `1e-30,1e-3`. Recommended: `1e-30,1e-3`.
`--momentum`, `--mom`	If applicable, momentum value for optimizer. Default: `0`.
`--nesterov`	If applicable, whether to use nesterov momentum. Default: `True`.
`--nus`, `--nu`	If applicable, nu value(s) for optimizer. Can use a single value like 0.7 or a comma-separated tuple like 0.7,1.0. Default: `0.7`.
`--betas`, `--beta`	If applicable, beta value(s) for optimizer. Can use a single value like 0.9 or a comma-separated tuple like 0.9,0.999. Default: `0.9,0.999`.
`--weight-decay`, `--wdecay`	Weight decay on the weights.
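
As an illustration of what `--gradient-clip` controls, here is a minimal sketch of clipping by L2 norm. It uses plain Python lists rather than tensors, and the function name `clip_by_l2_norm` is hypothetical; it is not ParlAI's implementation, only the standard rescaling rule.

```python
import math


def clip_by_l2_norm(grads, max_norm):
    """Rescale `grads` so that their combined L2 norm is at most `max_norm`.

    `grads` is a flat list of floats standing in for all gradient values.
    A non-positive `max_norm` is treated here as "no clipping" (an assumption).
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if max_norm <= 0 or total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]


# Gradients with norm 5.0 are scaled down to norm 0.1 (the default clip value).
clipped = clip_by_l2_norm([3.0, 4.0], max_norm=0.1)
```

The direction of the gradient is preserved; only its magnitude changes.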

Learning Rate Scheduler

Argument	Description
`--lr-scheduler`	Learning rate scheduler. Choices: `reduceonplateau`, `none`, `fixed`, `invsqrt`, `cosine`, `linear`. Default: `reduceonplateau`.
`--lr-scheduler-patience`	LR scheduler patience, in number of validation runs. If using the fixed scheduler, LR is decayed every `patience` validations. Default: `3`.
`--lr-scheduler-decay`	Decay factor for LR scheduler, or how much LR is multiplied by when it is lowered. Default: `0.5`.
`--invsqrt-lr-decay-gamma`	Constant used only to find the LR multiplier for the invsqrt scheduler. Must be set for `--lr-scheduler invsqrt`. Default: `-1`.
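
To make the interaction between patience and decay concrete, the following sketch simulates a reduce-on-plateau schedule over a sequence of validation losses. It assumes the LR is lowered once the number of consecutive non-improving validations exceeds the patience; the exact threshold logic in ParlAI's scheduler may differ, and `plateau_schedule` is a hypothetical helper.

```python
def plateau_schedule(val_losses, lr=1e-4, patience=3, decay=0.5):
    """Return the learning rate in effect after each validation run."""
    best = float("inf")
    bad_runs = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best, bad_runs = loss, 0
        else:
            bad_runs += 1
            if bad_runs > patience:
                lr *= decay  # multiply LR by the decay factor
                bad_runs = 0
        history.append(lr)
    return history


# The loss stops improving after the second validation run.
lrs = plateau_schedule([1.0, 0.9, 0.9, 0.9, 0.9, 0.9])
```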

TorchRankerAgent

Argument	Description
`--candidates`, `--cands`	The source of candidates during training (see TorchRankerAgent._build_candidates() for details). Choices: `batch`, `inline`, `fixed`, `batch-all-cands`. Default: `inline`.
`--eval-candidates`, `--ecands`	The source of candidates during evaluation (defaults to the same value as `--candidates` if no flag is given). Choices: `batch`, `inline`, `fixed`, `vocab`, `batch-all-cands`. Default: `inline`.
`--interactive-candidates`, `--icands`	The source of candidates during interactive mode. Since in interactive mode, batchsize == 1, we cannot use batch candidates. Choices: `fixed`, `inline`, `vocab`. Default: `fixed`.
`--repeat-blocking-heuristic`	Block repeating previous utterances. Helpful for many models that score repeats highly, so switched on by default. Default: `True`.
`--fixed-candidates-path`, `--fcp`	A text file of fixed candidates to use for all examples, one candidate per line
`--fixed-candidate-vecs`	One of “reuse”, “replace”, or a path to a file with vectors corresponding to the candidates at `--fixed-candidates-path`. The default path is a /path/to/model-file.<cands_name>, where <cands_name> is the name of the file (not the full path) passed by the flag `--fixed-candidates-path`. By default, this file is created once and reused. To replace it, use the “replace” option. Default: `reuse`.
`--encode-candidate-vecs`	Cache and save the encoding of the candidate vecs. This might be used when interacting with the model in real time or evaluating on fixed candidate set when the encoding of the candidates is independent of the input. Default: `True`.
`--init-model`	Initialize model with weights from this file.
`--train-predict`	Get predictions and calculate mean rank during the train step. Turning this on may slow down training. Default: `False`.
`--cap-num-predictions`	Limit to the number of predictions in output.text_candidates. Default: `100`.
`--ignore-bad-candidates`	Ignore examples for which the label is not present in the label candidates. Default behavior results in RuntimeError. Default: `False`.
`--rank-top-k`	Ranking returns the top k results if k > 0; otherwise, sorts every single candidate according to the ranking. Default: `-1`.
`--inference`	Final response output algorithm. Choices: `max`, `topk`. Default: `max`.
`--topk`	K used in Top K sampling inference, when selected. Default: `5`.
`--return-cand-scores`	Return sorted candidate scores from eval_step. Default: `False`.
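
The behavior described for `--rank-top-k` can be sketched as follows. This is an illustrative stand-in using plain lists, with a hypothetical function name `rank_candidates`; the real agent scores and sorts candidates in batched tensor form.

```python
def rank_candidates(scores, candidates, rank_top_k=-1):
    """Order candidates by descending score; keep only the top k if k > 0."""
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    if rank_top_k > 0:
        order = order[:rank_top_k]
    return [candidates[i] for i in order]


cands = ["a", "b", "c"]
scores = [0.1, 0.7, 0.2]
full_ranking = rank_candidates(scores, cands)              # default: sort everything
top_two = rank_candidates(scores, cands, rank_top_k=2)     # only the best two
```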

Transformer Arguments

Argument	Description
`--embedding-size`, `--esz`	Size of all embedding layers. Must be a multiple of `--n-heads`. Default: `300`.
`--n-layers`, `--nl`	Number of transformer layers. Default: `2`.
`--ffn-size`, `--hid`	Hidden size of the FFN layers. Default: `300`.
`--dropout`	Dropout used around embeddings and before layer normalizations. This is used in Vaswani 2017 and works well on large datasets. Default: `0.0`.
`--attention-dropout`	Dropout used after attention softmax. This is not used in Vaswani 2017. Default: `0.0`.
`--relu-dropout`	Dropout used after the ReLU in the FFN. Not used in Vaswani 2017, but used in Tensor2Tensor. Default: `0.0`.
`--n-heads`	Number of multihead attention heads Default: `2`.
`--learn-positional-embeddings`	If off, sinusoidal embeddings are used. If on, position embeddings are learned from scratch. Default: `False`.
`--embeddings-scale`	Default: `True`.
`--n-segments`	The number of segments that support the model. If zero no segment and no langs_embedding. Default: `0`.
`--variant`	Chooses locations of layer norms, etc. prelayernorm is used to match some fairseq models Choices: `xlm`, `prelayernorm`, `bart`, `aiayn`. Default: `aiayn`. Recommended: `xlm`.
`--activation`	Nonlinear activation to use. AIAYN uses relu, but more recent papers prefer gelu. Choices: `gelu`, `relu`. Default: `relu`. Recommended: `gelu`.
`--output-scaling`	Scale the output of every transformer by this quantity. Default: `1.0`.
`--n-encoder-layers`, `--nel`	This will override the n-layers for asymmetrical transformers. Default: `-1`.
`--n-decoder-layers`, `--ndl`	This will override the n-layers for asymmetrical transformers. Default: `-1`.
`--model-parallel`	Shard the layers across multiple GPUs. Default: `False`.
`--checkpoint-activations`	Recompute activations on backward pass to conserve memory. Default: `False`.
`--use-memories`	Use memories: must implement the function `_vectorize_memories` to use this Default: `False`.
`--wrap-memory-encoder`	Wrap memory encoder with MLP Default: `False`.
`--memory-attention`	Similarity for basic attention mechanism when using transformer to encode memories Choices: `cosine`, `dot`, `sqrt`. Default: `sqrt`.
`--normalize-sent-emb`	Default: `False`.
`--share-encoders`	Default: `True`.
`--learn-embeddings`	Learn embeddings Default: `True`.
`--data-parallel`	Use model in data parallel, requires multiple gpus Default: `False`.
`--reduction-type`	Type of reduction at the end of transformer Choices: `first`, `max`, `mean`. Default: `mean`.
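
The three `--reduction-type` choices can be illustrated with a small sketch that collapses a sequence of per-token vectors into one vector. It deliberately ignores padding masks and uses nested lists instead of tensors, so it is only an approximation of what a real encoder does; `reduce_outputs` is a hypothetical name.

```python
def reduce_outputs(vectors, reduction_type="mean"):
    """Collapse a list of equal-length token vectors into a single vector."""
    if reduction_type == "first":
        return list(vectors[0])  # take the first position's vector
    if reduction_type == "max":
        return [max(col) for col in zip(*vectors)]  # element-wise max over positions
    if reduction_type == "mean":
        return [sum(col) / len(vectors) for col in zip(*vectors)]
    raise ValueError(f"unknown reduction type: {reduction_type}")


token_vectors = [[1.0, 4.0], [3.0, 2.0]]
first_vec = reduce_outputs(token_vectors, "first")
max_vec = reduce_outputs(token_vectors, "max")
mean_vec = reduce_outputs(token_vectors, "mean")
```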

BPEHelper Arguments

Argument	Description
`--bpe-vocab`	Path to pre-trained tokenizer vocab
`--bpe-merge`	Path to pre-trained tokenizer merge
`--bpe-dropout`	Use BPE dropout during training.

Polyencoder Arguments

Argument	Description
`--polyencoder-type`	Type of polyencoder: either compute vectors using codes + attention, or simply take the first N vectors. Choices: `codes`, `n_first`. Default: `codes`. Recommended: `codes`.
`--poly-n-codes`	Number of vectors used to represent the context. In the case of n_first, this is the number of vectors that are considered. Default: `64`. Recommended: `64`.
`--poly-attention-type`	Type of the top aggregation layer of the poly-encoder (where the candidate representation is the key). Choices: `basic`, `sqrt`, `multihead`. Default: `basic`. Recommended: `basic`.
`--poly-attention-num-heads`	In case poly-attention-type is multihead, specify the number of heads. Default: `4`.
`--codes-attention-type`	Type of the codes attention. Choices: `basic`, `sqrt`, `multihead`. Default: `basic`. Recommended: `basic`.
`--codes-attention-num-heads`	In case codes-attention-type is multihead, specify the number of heads. Default: `4`.

RagAgent Options¶

optional arguments

Argument	Description
`--gpu-beam-blocking`	Set to use CUDA kernel for beam search ngram blocking. Default: `False`.
`--verbose-topk`	Return the topk logits in the act message, if verbose mode is set. Default: `-1`.

TorchRankerAgent

Argument	Description
`--candidates`, `--cands`	The source of candidates during training (see TorchRankerAgent._build_candidates() for details). Choices: `batch`, `inline`, `fixed`, `batch-all-cands`. Default: `inline`.
`--eval-candidates`, `--ecands`	The source of candidates during evaluation (defaults to the same value as `--candidates` if no flag is given). Choices: `batch`, `inline`, `fixed`, `vocab`, `batch-all-cands`. Default: `inline`.
`--interactive-candidates`, `--icands`	The source of candidates during interactive mode. Since in interactive mode, batchsize == 1, we cannot use batch candidates. Choices: `fixed`, `inline`, `vocab`. Default: `fixed`.
`--repeat-blocking-heuristic`	Block repeating previous utterances. Helpful for many models that score repeats highly, so switched on by default. Default: `True`.
`--fixed-candidates-path`, `--fcp`	A text file of fixed candidates to use for all examples, one candidate per line
`--fixed-candidate-vecs`	One of “reuse”, “replace”, or a path to a file with vectors corresponding to the candidates at `--fixed-candidates-path`. The default path is a /path/to/model-file.<cands_name>, where <cands_name> is the name of the file (not the full path) passed by the flag `--fixed-candidates-path`. By default, this file is created once and reused. To replace it, use the “replace” option. Default: `reuse`.
`--encode-candidate-vecs`	Cache and save the encoding of the candidate vecs. This might be used when interacting with the model in real time or evaluating on fixed candidate set when the encoding of the candidates is independent of the input. Default: `True`.
`--init-model`	Initialize model with weights from this file.
`--train-predict`	Get predictions and calculate mean rank during the train step. Turning this on may slow down training. Default: `False`.
`--cap-num-predictions`	Limit to the number of predictions in output.text_candidates. Default: `100`.
`--ignore-bad-candidates`	Ignore examples for which the label is not present in the label candidates. Default behavior results in RuntimeError. Default: `False`.
`--rank-top-k`	Ranking returns the top k results if k > 0; otherwise, sorts every single candidate according to the ranking. Default: `-1`.
`--return-cand-scores`	Return sorted candidate scores from eval_step. Default: `False`.

Transformer Arguments

Argument	Description
`--use-memories`	Use memories: must implement the function `_vectorize_memories` to use this Default: `False`.
`--wrap-memory-encoder`	Wrap memory encoder with MLP Default: `False`.
`--memory-attention`	Similarity for basic attention mechanism when using transformer to encode memories Choices: `cosine`, `dot`, `sqrt`. Default: `sqrt`.
`--normalize-sent-emb`	Default: `False`.
`--share-encoders`	Default: `True`.
`--learn-embeddings`	Learn embeddings Default: `True`.
`--data-parallel`	Use model in data parallel, requires multiple gpus Default: `False`.
`--reduction-type`	Type of reduction at the end of transformer Choices: `first`, `max`, `mean`. Default: `mean`.

Polyencoder Arguments

Argument	Description
`--polyencoder-type`	Type of polyencoder: either compute vectors using codes + attention, or simply take the first N vectors. Choices: `codes`, `n_first`. Default: `codes`. Recommended: `codes`.
`--poly-n-codes`	Number of vectors used to represent the context. In the case of n_first, this is the number of vectors that are considered. Default: `64`. Recommended: `64`.
`--poly-attention-type`	Type of the top aggregation layer of the poly-encoder (where the candidate representation is the key). Choices: `basic`, `sqrt`, `multihead`. Default: `basic`. Recommended: `basic`.
`--poly-attention-num-heads`	In case poly-attention-type is multihead, specify the number of heads. Default: `4`.
`--codes-attention-type`	Type of the codes attention. Choices: `basic`, `sqrt`, `multihead`. Default: `basic`. Recommended: `basic`.
`--codes-attention-num-heads`	In case codes-attention-type is multihead, specify the number of heads. Default: `4`.

Transformer Arguments

Argument	Description
`--embedding-size`, `--esz`	Size of all embedding layers. Must be a multiple of `--n-heads`. Default: `300`.
`--n-layers`, `--nl`	Number of transformer layers. Default: `2`.
`--ffn-size`, `--hid`	Hidden size of the FFN layers. Default: `300`.
`--dropout`	Dropout used around embeddings and before layer normalizations. This is used in Vaswani 2017 and works well on large datasets. Default: `0.0`.
`--attention-dropout`	Dropout used after attention softmax. This is not used in Vaswani 2017. Default: `0.0`.
`--relu-dropout`	Dropout used after the ReLU in the FFN. Not used in Vaswani 2017, but used in Tensor2Tensor. Default: `0.0`.
`--n-heads`	Number of multihead attention heads Default: `2`.
`--learn-positional-embeddings`	If off, sinusoidal embeddings are used. If on, position embeddings are learned from scratch. Default: `False`.
`--embeddings-scale`	Default: `True`.
`--n-segments`	The number of segments that support the model. If zero no segment and no langs_embedding. Default: `0`.
`--variant`	Chooses locations of layer norms, etc. prelayernorm is used to match some fairseq models Choices: `xlm`, `prelayernorm`, `bart`, `aiayn`. Default: `aiayn`. Recommended: `xlm`.
`--activation`	Nonlinear activation to use. AIAYN uses relu, but more recent papers prefer gelu. Choices: `gelu`, `relu`. Default: `relu`. Recommended: `gelu`.
`--output-scaling`	Scale the output of every transformer by this quantity. Default: `1.0`.
`--share-word-embeddings`	Share word embeddings table for candidate and context in the memory network. Default: `True`.
`--n-encoder-layers`, `--nel`	This will override the n-layers for asymmetrical transformers. Default: `-1`.
`--n-decoder-layers`, `--ndl`	This will override the n-layers for asymmetrical transformers. Default: `-1`.
`--model-parallel`	Shard the layers across multiple GPUs. Default: `False`.
`--checkpoint-activations`	Recompute activations on backward pass to conserve memory. Default: `False`.

RAG Model Args

Argument	Description
`--generation-model`	Which generation model to use Choices: `transformer/generator`, `bart`, `t5`. Default: `bart`.
`--query-model`	Which query model to use for DPR. Choices: `bert`, `bert_from_parlai_rag`, `dropout_poly`. Default: `bert`.
`--rag-model-type`	Which rag model decoding to use. Choices: `token`, `sequence`, `turn`. Default: `token`.
`--thorough`	Whether to use thorough decoding for rag sequence. Default: `False`.
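
The difference between the `token` and `sequence` choices of `--rag-model-type` lies in where the marginalization over retrieved documents happens. Following the formulation in the original RAG paper, RAG-Sequence marginalizes once over whole-sequence probabilities, while RAG-Token marginalizes at every output token. The sketch below computes both for toy numbers; the function names are hypothetical, and the `turn` variant is not covered.

```python
import math


def rag_sequence_prob(doc_probs, token_probs):
    """sum_z p(z|x) * prod_i p(y_i | x, z, y_<i)."""
    return sum(p_z * math.prod(toks) for p_z, toks in zip(doc_probs, token_probs))


def rag_token_prob(doc_probs, token_probs):
    """prod_i sum_z p(z|x) * p(y_i | x, z, y_<i)."""
    n_tokens = len(token_probs[0])
    return math.prod(
        sum(p_z * toks[i] for p_z, toks in zip(doc_probs, token_probs))
        for i in range(n_tokens)
    )


# Two retrieved documents, a two-token output (illustrative values only).
doc_probs = [0.6, 0.4]                     # p(z | x)
token_probs = [[0.5, 0.5], [0.25, 1.0]]    # token_probs[z][i] = p(y_i | x, z, y_<i)
p_seq = rag_sequence_prob(doc_probs, token_probs)
p_tok = rag_token_prob(doc_probs, token_probs)
```

The two quantities generally differ, because RAG-Token lets each output token draw on a different document.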

Modified RAG Args

Argument	Description
`--n-extra-positions`	Specify > 0 to include extra positions in the encoder, in which retrieved knowledge will go. In this setup, knowledge is appended instead of prepended. Default: `0`.
`--gold-knowledge-passage-key`	Key in the observation dict that indicates the gold knowledge passage. Specify, along with `--debug`, to compute passage retrieval metrics at train/test time. Default: `checked_sentence`.
`--gold-knowledge-title-key`	Key in the observation dict that indicates the gold knowledge passage title. Specify, along with `--debug`, to compute passage retrieval metrics at train/test time. Default: `title`.

RAG Retriever Args

Argument	Description
`--rag-retriever-query`	What to use as the query for retrieval. `one_turn` retrieves only on the last turn of dialogue; `full_history` retrieves based on the full dialogue history. Choices: `one_turn`, `full_history`. Default: `full_history`.
`--rag-retriever-type`	Which retriever to use Choices: `dpr`, `tfidf`, `dpr_then_poly`, `poly_faiss`, `search_engine`, `search_term_faiss`, `observation_echo_retriever`. Default: `dpr`.
`--retriever-debug-index`	Load specified small index, for debugging. Choices: `None`, `none`, `exact`, `compressed`.
`--n-docs`	How many documents to retrieve Default: `5`.
`--min-doc-token-length`	Minimum amount of information to retain from document. Useful to define if encoder does not use a lot of BPE token context. Default: `64`.
`--max-doc-token-length`	Maximum amount of information to retain from document. Default: `256`.
`--rag-query-truncate`	Max token length of query for retrieval. Default: `512`.
`--print-docs`	Whether to print docs; usually useful during interactive mode. Default: `False`.
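
The two `--rag-retriever-query` modes can be sketched as below. The sketch assumes the dialogue history is available as a list of turn strings and that turns are joined with the history delimiter; `build_retrieval_query` is a hypothetical helper, and query truncation (`--rag-query-truncate`) is not shown.

```python
def build_retrieval_query(history_turns, mode="full_history", delimiter="\n"):
    """Pick the text used as the retrieval query from the dialogue history."""
    if mode == "one_turn":
        return history_turns[-1]  # only the last turn of dialogue
    if mode == "full_history":
        return delimiter.join(history_turns)  # the whole dialogue so far
    raise ValueError(f"unknown query mode: {mode}")


turns = ["hi", "tell me about Paris", "what is its population?"]
one_turn_query = build_retrieval_query(turns, mode="one_turn")
full_history_query = build_retrieval_query(turns, mode="full_history")
```

In this example, `one_turn` loses the referent of "its", which is one reason a full-history query can retrieve more relevant passages in multi-turn dialogue.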

RAG Dense Passage Retriever Args

Argument	Description
`--path-to-index`	Path to FAISS Index. Default: `zoo:hallucination/wiki_index_compressed/compressed_pq`.
`--path-to-dense-embeddings`	Path to dense embeddings directory used to build index. The default, None, assumes the embeddings and index are in the same directory.
`--dpr-model-file`	Path to DPR Model. Default: `zoo:hallucination/multiset_dpr/hf_bert_base.cp`.
`--path-to-dpr-passages`	Path to DPR passages, used to build index. Default: `zoo:hallucination/wiki_passages/psgs_w100.tsv`.
`--retriever-embedding-size`	Embedding size of dense retriever Default: `768`.

RAG TFIDF Retriever Args

Argument	Description
`--tfidf-max-doc-paragraphs`	If > 0, limit documents to this many paragraphs Default: `-1`.
`--tfidf-model-path`	Optionally override TFIDF model. Default: `zoo:wikipedia_full/tfidf_retriever/model`.

RAG DPR-POLY Retriever Args

Argument	Description
`--dpr-num-docs`	In two-stage retrieval, how many DPR documents to retrieve. Default: `25`.
`--poly-score-initial-lambda`	In two-stage retrieval, how much weight to give to the poly scores. Note: this is a learned parameter; specify its initial value here. Default: `0.5`.
`--polyencoder-init-model`	Which init model to initialize polyencoder with. Specify wikito or reddit to use models from the ParlAI zoo; otherwise, provide a path to a trained polyencoder. Default: `wikito`.

RAG PolyFAISS retriever args

Argument	Description
`--poly-faiss-model-file`	Path to poly-encoder for use in poly-faiss retrieval.

RAG ReGReT args

Argument	Description
`--regret`	Retrieve, Generate, Retrieve, Tune. Retrieve, generate, then retrieve again, and finally tune (refine). Default: `False`.
`--regret-intermediate-maxlen`	Maximum length in intermediate regret generation. Default: `32`.
`--regret-model-file`	Path to model for initial round of retrieval.
`--regret-dict-file`	Path to dict file for model for initial round of retrieval.
`--regret-override-index`	Overrides the index used with the ReGReT model, if using separate models; i.e., the initial round of retrieval uses the same index as specified for the second round of retrieval. Default: `False`.

RAG Indexer Args

Argument	Description
`--indexer-type`	Granularity of RAG Indexer. Choose compressed to save on RAM costs, at the possible expense of accuracy. Choices: `exact`, `compressed`. Default: `compressed`.
`--indexer-buffer-size`	Buffer size for adding vectors to the index. Default: `65536`.
`--compressed-indexer-factory`	If specified, builds compressed indexer from a FAISS Index Factory. See https://github.com/facebookresearch/faiss/wiki/The-index-factory for details. Default: `IVF4096_HNSW128,PQ128`.
`--compressed-indexer-nprobe`	How many centroids to search in compressed indexer. See https://github.com/facebookresearch/faiss/wiki/Faiss-indexes#cell-probe-methods-indexivf-indexes for details. Default: `64`.

RAG-Turn Args

Argument	Description
`--rag-turn-n-turns`	How many turns to split up retrieval into. The most recent text is split by delimiter; all turns after (n-1)th turn are combined. Default: `2`.
`--rag-turn-marginalize`	How to marginalize rag-turn. Choices: `doc_only`, `doc_then_turn`. Default: `doc_then_turn`.
`--rag-turn-discount-factor`	Discount factor for turns beyond most recent one. We employ exponential discounting. Only considered if 0 < factor < 1.0. Default: `1.0`.
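
The exponential discounting mentioned for `--rag-turn-discount-factor` can be pictured with the sketch below, which produces one weight per turn, ordered from the most recent turn backwards. How these weights enter the RAG-Turn marginalization is not shown, and `turn_weights` is a hypothetical helper rather than a ParlAI function.

```python
def turn_weights(n_turns, discount_factor=1.0):
    """Weights for each turn, most recent first.

    Discounting is applied only when 0 < discount_factor < 1.0;
    otherwise every turn gets the same weight.
    """
    if not 0 < discount_factor < 1.0:
        return [1.0] * n_turns
    return [discount_factor ** i for i in range(n_turns)]


no_discount = turn_weights(2)           # default factor of 1.0
discounted = turn_weights(3, 0.5)       # older turns count for less
```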

Torch Generator Agent

Argument	Description
`--beam-size`	Beam size; if 1, then greedy search. Default: `1`.
`--beam-min-length`	Minimum length of prediction to be generated by the beam search. Default: `1`.
`--beam-context-block-ngram`	Size of n-grams to block in beam search from the context. val <= 0 implies no blocking. Default: `-1`.
`--beam-block-ngram`	Size of n-grams to block in beam search. val <= 0 implies no blocking. Default: `-1`.
`--beam-block-full-context`	Block n-grams from the full history context. Specify False to block up to m tokens in the past, where m is the truncation parameter for the agent. Default: `True`.
`--beam-length-penalty`	Applies a length penalty. Set to 0 for no penalty. Default: `0.65`.
`--inference`	Generation algorithm Choices: `beam`, `nucleus`, `delayedbeam`, `greedy`, `delayednucleusbeam`, `topk`, `factual_nucleus`. Default: `greedy`.
`--topk`	K used in Top K sampling Default: `10`.
`--topp`	P used in nucleus sampling Default: `0.9`.
`--beam-delay`	Used in delayedbeam search Default: `30`.
`--lambda-decay`	Decay factor in factual nucleus sampling Default: `0.9`.
`--omega-bound`	Lower bound in factual nucleus sampling Default: `0.3`.
`--p-reset`	Whether to reset p value in factual nucleus at full stops Default: `True`.
`--beam-block-list-filename`	Load a text file of hard blocks for beam search to never say.
`--temperature`	Temperature to add during decoding Default: `1.0`.
`--compute-tokenized-bleu`	If true, compute tokenized bleu scores Default: `False`.

TorchAgent Arguments

Argument	Description
`--interactive-mode`, `--i`	Whether in full interactive mode or not, which means generating text or retrieving from a full set of candidates, which is necessary to actually do full dialogue. However, during training or quick validation (e.g. PPL for generation or ranking a few candidates for ranking models) you might want these set to off. Typically, scripts can set their preferred default behavior at the start, e.g. eval scripts. Default: `False`.
`--embedding-type`, `--emb`	Choose between different strategies for initializing word embeddings. Default is random, but can also preinitialize from Glove or Fasttext. Preinitialized embeddings can also be fixed so they are not updated during training. Choices: `random`, `glove`, `glove-fixed`, `fasttext`, `fasttext-fixed`, `fasttext_cc`, `fasttext_cc-fixed`. Default: `random`.
`--embedding-projection`, `--embp`	If pretrained embeddings have a different dimensionality than your embedding size, strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append “-force” to your choice. Default: `random`.
`--fp16`	Use fp16 computations. Default: `False`.
`--fp16-impl`	Implementation of FP16 to use. Choices: `safe`, `mem_efficient`. Default: `safe`.
`--rank-candidates`, `--rc`	Whether the model should parse candidates for ranking. Default: `False`.
`--truncate`, `--tr`	Truncate input lengths to increase speed / use less memory. Default: `-1`.
`--text-truncate`	Text input truncation length: if not specified, this will default to `truncate`
`--label-truncate`	Label truncation length: if not specified, this will default to `truncate`
`--history-reversed`	Reverse the history. Default: `False`.
`--history-size`, `--histsz`	Number of past dialog utterances to remember. Default: `-1`.
`--person-tokens`, `--pt`	Add person tokens to history. Adds p1 in front of input text and p2 in front of past labels when available, or past utterances generated by the model. These are added to the dictionary during initialization. Default: `False`.
`--split-lines`	Split the dialogue history on newlines and save in separate vectors. Default: `False`.
`--delimiter`	Join history lines with this token; defaults to newline. Default: `\n`.
`--special-tok-lst`	Comma-separated list of special tokens. In case of ambiguous parses from special tokens, the ordering provided in this arg sets precedence.
`-gpu`, `--gpu`	Which GPU to use. Default: `-1`.
`--no-cuda`	Disable GPUs even if available. otherwise, will use GPUs if available on the device. Default: `False`.
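
The history-related options (`--history-size`, `--delimiter`, `--truncate`) interact roughly as in the sketch below. It makes two simplifying assumptions that are not taken from the source: tokens are whitespace-separated words rather than dictionary indices, and truncation keeps the most recent tokens. `build_history` is a hypothetical helper.

```python
def build_history(utterances, history_size=-1, delimiter="\n", truncate=-1):
    """Return the token list an agent would see for the given utterances."""
    if history_size > 0:
        utterances = utterances[-history_size:]  # remember only the last N utterances
    tokens = []
    for i, utterance in enumerate(utterances):
        if i > 0:
            tokens.append(delimiter)  # join history lines with the delimiter token
        tokens.extend(utterance.split())
    if truncate > 0:
        tokens = tokens[-truncate:]  # assumption: keep the most recent tokens
    return tokens


dialogue = ["a b", "c d", "e f"]
everything = build_history(dialogue)
last_two = build_history(dialogue, history_size=2)
last_two_truncated = build_history(dialogue, history_size=2, truncate=3)
```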

Optimizer Arguments

Argument	Description
`--optimizer`, `--opt`	Optimizer choice. Choices: `adadelta`, `adagrad`, `adam`, `adamw`, `sparseadam`, `adamax`, `asgd`, `sgd`, `radam`, `rprop`, `rmsprop`, `optimizer`, `nadam`, `lbfgs`, `mem_eff_adam`, `adafactor`. Default: `sgd`.
`--learningrate`, `--lr`	Learning rate. Default: `1`.
`--gradient-clip`, `--clip`	Gradient clipping using L2 norm. Default: `0.1`.
`--adafactor-eps`	Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively. Default: `1e-30,1e-3`. Recommended: `1e-30,1e-3`.
`--momentum`, `--mom`	If applicable, momentum value for optimizer. Default: `0`.
`--nesterov`	If applicable, whether to use nesterov momentum. Default: `True`.
`--nus`, `--nu`	If applicable, nu value(s) for optimizer. Can use a single value like 0.7 or a comma-separated tuple like 0.7,1.0. Default: `0.7`.
`--betas`, `--beta`	If applicable, beta value(s) for optimizer. Can use a single value like 0.9 or a comma-separated tuple like 0.9,0.999. Default: `0.9,0.999`.
`--weight-decay`, `--wdecay`	Weight decay on the weights.

BPEHelper Arguments

Argument	Description
`--bpe-vocab`	Path to pre-trained tokenizer vocab
`--bpe-merge`	Path to pre-trained tokenizer merge
`--bpe-dropout`	Use BPE dropout during training.

Learning Rate Scheduler

Argument	Description
`--lr-scheduler`	Learning rate scheduler. Choices: `reduceonplateau`, `none`, `fixed`, `invsqrt`, `cosine`, `linear`. Default: `reduceonplateau`.
`--lr-scheduler-patience`	LR scheduler patience, in number of validation runs. If using the fixed scheduler, LR is decayed every `patience` validations. Default: `3`.
`--lr-scheduler-decay`	Decay factor for LR scheduler, or how much LR is multiplied by when it is lowered. Default: `0.5`.
`--invsqrt-lr-decay-gamma`	Constant used only to find the LR multiplier for the invsqrt scheduler. Must be set for `--lr-scheduler invsqrt`. Default: `-1`.

T5 Args

Argument	Description
`--t5-model-arch`	Choices: `t5-small`, `t5-base`, `t5-large`, `t5-3b`, `t5-11b`, `google/flan-t5-small`, `google/flan-t5-base`, `google/flan-t5-large`, `google/flan-t5-xl`, `google/flan-t5-xxl`. Default: `t5-base`.
`--t5-model-parallel`	Use HF model parallel Default: `False`.
`--t5-dropout`	Dropout for T5 Default: `0.0`.
`--t5-generation-config`	Task specific generation config for T5 Choices: `summarization`, `translation_en_to_de`, `translation_en_to_fr`, `translation_en_to_ro`.

T5Agent Options¶

optional arguments

Argument	Description
`--gpu-beam-blocking`	Set to use CUDA kernel for beam search ngram blocking. Default: `False`.
`--verbose-topk`	Return the topk logits in the act message, if verbose mode is set. Default: `-1`.

Torch Generator Agent

Argument	Description
`--beam-size`	Beam size; if 1, then greedy search. Default: `1`.
`--beam-min-length`	Minimum length of prediction to be generated by the beam search. Default: `1`.
`--beam-context-block-ngram`	Size of n-grams to block in beam search from the context. val <= 0 implies no blocking. Default: `-1`.
`--beam-block-ngram`	Size of n-grams to block in beam search. val <= 0 implies no blocking. Default: `-1`.
`--beam-block-full-context`	Block n-grams from the full history context. Specify False to block up to m tokens in the past, where m is the truncation parameter for the agent. Default: `True`.
`--beam-length-penalty`	Applies a length penalty. Set to 0 for no penalty. Default: `0.65`.
`--inference`	Generation algorithm Choices: `beam`, `nucleus`, `delayedbeam`, `greedy`, `delayednucleusbeam`, `topk`, `factual_nucleus`. Default: `greedy`.
`--topk`	K used in Top K sampling Default: `10`.
`--topp`	P used in nucleus sampling Default: `0.9`.
`--beam-delay`	Used in delayedbeam search Default: `30`.
`--lambda-decay`	Decay factor in factual nucleus sampling Default: `0.9`.
`--omega-bound`	Lower bound in factual nucleus sampling Default: `0.3`.
`--p-reset`	Whether to reset p value in factual nucleus at full stops Default: `True`.
`--beam-block-list-filename`	Load a text file of hard blocks for beam search to never say.
`--temperature`	Temperature to add during decoding Default: `1.0`.
`--compute-tokenized-bleu`	If true, compute tokenized bleu scores Default: `False`.
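
For `--inference nucleus`, the `--topp` value bounds the probability mass from which the next token is sampled. The sketch below shows only the filtering and renormalization step on a toy distribution stored in a dict; the sampling itself is omitted, and `nucleus_filter` is a hypothetical name rather than part of ParlAI.

```python
def nucleus_filter(probs, topp=0.9):
    """Keep the smallest set of most-likely tokens whose mass reaches `topp`.

    Returns a dict of the kept tokens with probabilities renormalized to sum to 1.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        total += p
        if total >= topp:
            break
    return {token: p / total for token, p in kept}


next_token_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
nucleus = nucleus_filter(next_token_probs, topp=0.9)
```

With these values, the low-probability token `d` is excluded before sampling.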

TorchAgent Arguments

Argument	Description
`--interactive-mode`, `--i`	Whether in full interactive mode or not, which means generating text or retrieving from a full set of candidates, which is necessary to actually do full dialogue. However, during training or quick validation (e.g. PPL for generation or ranking a few candidates for ranking models) you might want these set to off. Typically, scripts can set their preferred default behavior at the start, e.g. eval scripts. Default: `False`.
`--embedding-type`, `--emb`	Choose between different strategies for initializing word embeddings. Default is random, but can also preinitialize from Glove or Fasttext. Preinitialized embeddings can also be fixed so they are not updated during training. Choices: `random`, `glove`, `glove-fixed`, `fasttext`, `fasttext-fixed`, `fasttext_cc`, `fasttext_cc-fixed`. Default: `random`.
`--embedding-projection`, `--embp`	If pretrained embeddings have a different dimensionality than your embedding size, strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append “-force” to your choice. Default: `random`.
`--fp16`	Use fp16 computations. Default: `False`.
`--fp16-impl`	Implementation of FP16 to use. Choices: `safe`, `mem_efficient`. Default: `safe`.
`--rank-candidates`, `--rc`	Whether the model should parse candidates for ranking. Default: `False`.
`--truncate`, `--tr`	Truncate input lengths to increase speed / use less memory. Default: `-1`.
`--text-truncate`	Text input truncation length: if not specified, this will default to `truncate`
`--label-truncate`	Label truncation length: if not specified, this will default to `truncate`
`--history-reversed`	Reverse the history. Default: `False`.
`--history-size`, `--histsz`	Number of past dialog utterances to remember. Default: `-1`.
`--person-tokens`, `--pt`	Add person tokens to history. Adds p1 in front of input text and p2 in front of past labels when available, or past utterances generated by the model. These are added to the dictionary during initialization. Default: `False`.
`--split-lines`	Split the dialogue history on newlines and save in separate vectors. Default: `False`.
`--delimiter`	Join history lines with this token; defaults to newline. Default: `\n`.
`--special-tok-lst`	Comma-separated list of special tokens. In case of ambiguous parses from special tokens, the ordering provided in this arg sets precedence.
`-gpu`, `--gpu`	Which GPU to use. Default: `-1`.
`--no-cuda`	Disable GPUs even if available. otherwise, will use GPUs if available on the device. Default: `False`.

Optimizer Arguments

Argument	Description
`--optimizer`, `--opt`	Optimizer choice. Choices: `adadelta`, `adagrad`, `adam`, `adamw`, `sparseadam`, `adamax`, `asgd`, `sgd`, `radam`, `rprop`, `rmsprop`, `optimizer`, `nadam`, `lbfgs`, `mem_eff_adam`, `adafactor`. Default: `sgd`.
`--learningrate`, `--lr`	Learning rate. Default: `1`.
`--gradient-clip`, `--clip`	Gradient clipping using L2 norm. Default: `0.1`.
`--adafactor-eps`	Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively. Default: `1e-30,1e-3`. Recommended: `1e-30,1e-3`.
`--momentum`, `--mom`	If applicable, momentum value for optimizer. Default: `0`.
`--nesterov`	If applicable, whether to use nesterov momentum. Default: `True`.
`--nus`, `--nu`	If applicable, nu value(s) for optimizer. Can use a single value like 0.7 or a comma-separated tuple like 0.7,1.0. Default: `0.7`.
`--betas`, `--beta`	If applicable, beta value(s) for optimizer. Can use a single value like 0.9 or a comma-separated tuple like 0.9,0.999. Default: `0.9,0.999`.
`--weight-decay`, `--wdecay`	Weight decay on the weights.
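
For readers unfamiliar with what "gradient clipping using l2 norm" means for `--gradient-clip`, the sketch below shows the general technique on a flat list of gradient values. It is a plain-Python illustration only; the helper name `clip_by_global_l2_norm` is hypothetical, and treating a non-positive threshold as "no clipping" is an assumption rather than documented behavior.

```python
import math
from typing import List


def clip_by_global_l2_norm(grads: List[float], max_norm: float) -> List[float]:
    """Rescale gradients so that their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Assumption: a non-positive threshold disables clipping.
    if max_norm <= 0 or total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    # Every component shrinks by the same factor, so the direction is unchanged.
    return [g * scale for g in grads]
```

A gradient of `[3.0, 4.0]` has norm 5, so with the default threshold of `0.1` each component would be scaled by `0.02`.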

BPEHelper Arguments

Argument	Description
`--bpe-vocab`	Path to pre-trained tokenizer vocab
`--bpe-merge`	Path to pre-trained tokenizer merge
`--bpe-dropout`	Use BPE dropout during training.

Learning Rate Scheduler

Argument	Description
`--lr-scheduler`	Learning rate scheduler. Choices: `reduceonplateau`, `none`, `fixed`, `invsqrt`, `cosine`, `linear`. Default: `reduceonplateau`.
`--lr-scheduler-patience`	LR scheduler patience, in number of validation runs. If using the fixed scheduler, LR is decayed every `patience` validations. Default: `3`.
`--lr-scheduler-decay`	Decay factor for LR scheduler, or how much LR is multiplied by when it is lowered. Default: `0.5`.
`--invsqrt-lr-decay-gamma`	Constant used only to find the LR multiplier for the invsqrt scheduler. Must be set for `--lr-scheduler invsqrt`. Default: `-1`.
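
As a rough sketch of how `--lr-scheduler fixed` combines with `--lr-scheduler-patience` and `--lr-scheduler-decay`, the function below computes the learning rate after a given number of validation runs. It assumes the LR is multiplied by the decay factor once per `patience` validations; the function name `fixed_schedule_lr` is hypothetical and this is not the library's scheduler class.

```python
def fixed_schedule_lr(
    base_lr: float, decay: float, patience: int, num_validations: int
) -> float:
    """Learning rate after num_validations validation runs under step decay."""
    if patience <= 0:
        raise ValueError("patience must be positive")
    # One decay is applied for every full block of `patience` validations.
    num_decays = num_validations // patience
    return base_lr * decay ** num_decays
```

With the defaults shown in the table (decay `0.5`, patience `3`), a starting LR of 1 would be halved after the third validation and halved again after the sixth.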

T5 Args

Argument	Description
`--t5-model-arch`	Choices: `t5-small`, `t5-base`, `t5-large`, `t5-3b`, `t5-11b`, `google/flan-t5-small`, `google/flan-t5-base`, `google/flan-t5-large`, `google/flan-t5-xl`, `google/flan-t5-xxl`. Default: `t5-base`.
`--t5-model-parallel`	Use HF model parallel Default: `False`.
`--t5-dropout`	Dropout for T5 Default: `0.0`.
`--t5-generation-config`	Task specific generation config for T5 Choices: `summarization`, `translation_en_to_de`, `translation_en_to_fr`, `translation_en_to_ro`.

T5RagAgent Options¶

optional arguments

Argument	Description
`--gpu-beam-blocking`	Set to use CUDA kernel for beam search ngram blocking Default: `False`.
`--verbose-topk`	Return the topk logits in the act message, if verbose mode is set. Default: `-1`.

Torch Generator Agent

Argument	Description
`--beam-size`	Beam size, if 1 then greedy search Default: `1`.
`--beam-min-length`	Minimum length of prediction to be generated by the beam search Default: `1`.
`--beam-context-block-ngram`	Size n-grams to block in beam search from the context. val <= 0 implies no blocking Default: `-1`.
`--beam-block-ngram`	Size n-grams to block in beam search. val <= 0 implies no blocking Default: `-1`.
`--beam-block-full-context`	Block n-grams from the full history context. Specify False to block up to m tokens in the past, where m is truncation parameter for agent Default: `True`.
`--beam-length-penalty`	Applies a length penalty. Set to 0 for no penalty. Default: `0.65`.
`--inference`	Generation algorithm Choices: `beam`, `nucleus`, `delayedbeam`, `greedy`, `delayednucleusbeam`, `topk`, `factual_nucleus`. Default: `greedy`.
`--topk`	K used in Top K sampling Default: `10`.
`--topp`	P used in nucleus sampling Default: `0.9`.
`--beam-delay`	Used in delayedbeam search Default: `30`.
`--lambda-decay`	Decay factor in factual nucleus sampling Default: `0.9`.
`--omega-bound`	Lower bound in factual nucleus sampling Default: `0.3`.
`--p-reset`	Whether to reset p value in factual nucleus at full stops Default: `True`.
`--beam-block-list-filename`	Load a text file of hard blocks for beam search to never say.
`--temperature`	Temperature to add during decoding Default: `1.0`.
`--compute-tokenized-bleu`	If true, compute tokenized bleu scores Default: `False`.
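
The `--topk` and `--topp` options correspond to two common ways of truncating the next-token distribution before sampling. The sketch below illustrates both on a small dictionary of token probabilities. It is a generic illustration of the techniques, not ParlAI's decoding code, and the function names are hypothetical.

```python
from typing import Dict


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most probable tokens and renormalize them."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def nucleus_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest set of top tokens whose mass reaches p, then renormalize."""
    kept = []
    mass = 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        mass += prob
        if mass >= p:
            break
    return {tok: prob / mass for tok, prob in kept}
```

In both cases the next token would then be sampled from the returned distribution. Top-k always keeps a fixed number of candidates, while nucleus sampling keeps more candidates when the distribution is flat and fewer when it is peaked.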

TorchAgent Arguments

Argument	Description
`--interactive-mode`, `--i`	Whether in full interactive mode or not, which means generating text or retrieving from a full set of candidates, which is necessary to actually do full dialogue. However, during training or quick validation (e.g. PPL for generation or ranking a few candidates for ranking models) you might want these set to off. Typically, scripts can set their preferred default behavior at the start, e.g. eval scripts. Default: `False`.
`--embedding-type`, `--emb`	Choose between different strategies for initializing word embeddings. Default is random, but can also preinitialize from Glove or Fasttext. Preinitialized embeddings can also be fixed so they are not updated during training. Choices: `random`, `glove`, `glove-fixed`, `fasttext`, `fasttext-fixed`, `fasttext_cc`, `fasttext_cc-fixed`. Default: `random`.
`--embedding-projection`, `--embp`	If pretrained embeddings have a different dimensionality than your embedding size, strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append “-force” to your choice. Default: `random`.
`--fp16`	Use fp16 computations. Default: `False`.
`--fp16-impl`	Implementation of FP16 to use Choices: `safe`, `mem_efficient`. Default: `safe`.
`--rank-candidates`, `--rc`	Whether the model should parse candidates for ranking. Default: `False`.
`--truncate`, `--tr`	Truncate input lengths to increase speed / use less memory. Default: `-1`.
`--text-truncate`	Text input truncation length: if not specified, this will default to `truncate`
`--label-truncate`	Label truncation length: if not specified, this will default to `truncate`
`--history-reversed`	Reverse the history Default: `False`.
`--history-size`, `--histsz`	Number of past dialog utterances to remember. Default: `-1`.
`--person-tokens`, `--pt`	Add person tokens to history. Adds p1 in front of input text and p2 in front of past labels (when available) or past utterances generated by the model. These are added to the dictionary during initialization. Default: `False`.
`--split-lines`	Split the dialogue history on newlines and save in separate vectors Default: `False`.
`--delimiter`	Join history lines with this token, defaults to newline Default: `\n`.
`--special-tok-lst`	Comma separated list of special tokens. In case of ambiguous parses from special tokens, the ordering provided in this arg sets precedence.
`-gpu`, `--gpu`	Which GPU to use Default: `-1`.
`--no-cuda`	Disable GPUs even if available. otherwise, will use GPUs if available on the device. Default: `False`.

Optimizer Arguments

Argument	Description
`--optimizer`, `--opt`	Optimizer choice. Choices: `adadelta`, `adagrad`, `adam`, `adamw`, `sparseadam`, `adamax`, `asgd`, `sgd`, `radam`, `rprop`, `rmsprop`, `optimizer`, `nadam`, `lbfgs`, `mem_eff_adam`, `adafactor`. Default: `sgd`.
`--learningrate`, `--lr`	Learning rate Default: `1`.
`--gradient-clip`, `--clip`	Gradient clipping using l2 norm Default: `0.1`.
`--adafactor-eps`	Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively Default: `1e-30,1e-3`. Recommended: `1e-30,1e-3`.
`--momentum`, `--mom`	If applicable, momentum value for optimizer. Default: `0`.
`--nesterov`	If applicable, whether to use nesterov momentum. Default: `True`.
`--nus`, `--nu`	If applicable, nu value(s) for optimizer. can use a single value like 0.7 or a comma-separated tuple like 0.7,1.0 Default: `0.7`.
`--betas`, `--beta`	If applicable, beta value(s) for optimizer. can use a single value like 0.9 or a comma-separated tuple like 0.9,0.999 Default: `0.9,0.999`.
`--weight-decay`, `--wdecay`	Weight decay on the weights.

BPEHelper Arguments

Argument	Description
`--bpe-vocab`	Path to pre-trained tokenizer vocab
`--bpe-merge`	Path to pre-trained tokenizer merge
`--bpe-dropout`	Use BPE dropout during training.

Learning Rate Scheduler

Argument	Description
`--lr-scheduler`	Learning rate scheduler. Choices: `reduceonplateau`, `none`, `fixed`, `invsqrt`, `cosine`, `linear`. Default: `reduceonplateau`.
`--lr-scheduler-patience`	LR scheduler patience, in number of validation runs. If using the fixed scheduler, LR is decayed every `patience` validations. Default: `3`.
`--lr-scheduler-decay`	Decay factor for LR scheduler, or how much LR is multiplied by when it is lowered. Default: `0.5`.
`--invsqrt-lr-decay-gamma`	Constant used only to find the LR multiplier for the invsqrt scheduler. Must be set for `--lr-scheduler invsqrt`. Default: `-1`.

T5 Args

Argument	Description
`--t5-model-arch`	Choices: `t5-small`, `t5-base`, `t5-large`, `t5-3b`, `t5-11b`, `google/flan-t5-small`, `google/flan-t5-base`, `google/flan-t5-large`, `google/flan-t5-xl`, `google/flan-t5-xxl`. Default: `t5-base`.
`--t5-model-parallel`	Use HF model parallel Default: `False`.
`--t5-dropout`	Dropout for T5 Default: `0.0`.
`--t5-generation-config`	Task specific generation config for T5 Choices: `summarization`, `translation_en_to_de`, `translation_en_to_fr`, `translation_en_to_ro`.

TransformerGeneratorAgent Options¶

optional arguments

Argument	Description
`--gpu-beam-blocking`	Set to use CUDA kernel for beam search ngram blocking Default: `False`.
`--verbose-topk`	Return the topk logits in the act message, if verbose mode is set. Default: `-1`.

Transformer Arguments

Argument	Description
`--embedding-size`, `--esz`	Size of all embedding layers. Must be a multiple of `--n-heads`. Default: `300`.
`--n-layers`, `--nl`	Number of transformer layers. Default: `2`.
`--ffn-size`, `--hid`	Hidden size of the FFN layers Default: `300`.
`--dropout`	Dropout used around embeddings and before layer normalizations. This is used in Vaswani 2017 and works well on large datasets. Default: `0.0`.
`--attention-dropout`	Dropout used after attention softmax. This is not used in Vaswani 2017. Default: `0.0`.
`--relu-dropout`	Dropout used after the ReLU in the FFN. Not used in Vaswani 2017, but used in Tensor2Tensor. Default: `0.0`.
`--n-heads`	Number of multihead attention heads Default: `2`.
`--learn-positional-embeddings`	If off, sinusoidal embeddings are used. If on, position embeddings are learned from scratch. Default: `False`.
`--embeddings-scale`	Default: `True`.
`--n-segments`	The number of segments that support the model. If zero no segment and no langs_embedding. Default: `0`.
`--variant`	Chooses locations of layer norms, etc. prelayernorm is used to match some fairseq models Choices: `xlm`, `prelayernorm`, `bart`, `aiayn`. Default: `aiayn`. Recommended: `xlm`.
`--activation`	Nonlinear activation to use. AIAYN uses relu, but more recent papers prefer gelu. Choices: `gelu`, `relu`. Default: `relu`. Recommended: `gelu`.
`--output-scaling`	Scale the output of every transformer by this quantity. Default: `1.0`.
`--share-word-embeddings`	Share word embeddings table for candidate and context in the memory network. Default: `True`.
`--n-encoder-layers`, `--nel`	This will override `--n-layers` for asymmetrical transformers. Default: `-1`.
`--n-decoder-layers`, `--ndl`	This will override `--n-layers` for asymmetrical transformers. Default: `-1`.
`--model-parallel`	Shard the layers across multiple GPUs. Default: `False`.
`--checkpoint-activations`	Recompute activations on backward pass to conserve memory. Default: `False`.
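
Two constraints in the table above are easy to miss: the embedding size must divide evenly across the attention heads, and the encoder/decoder layer overrides only take effect when set. The sketch below shows one plausible way to resolve these values. The function name `resolve_transformer_shape` is hypothetical, and treating `-1` (or any non-positive value) as "fall back to `--n-layers`" is an assumption based on the defaults listed.

```python
from typing import Tuple


def resolve_transformer_shape(
    embedding_size: int = 300,
    n_heads: int = 2,
    n_layers: int = 2,
    n_encoder_layers: int = -1,
    n_decoder_layers: int = -1,
) -> Tuple[int, int, int]:
    """Return (encoder_layers, decoder_layers, per_head_dim)."""
    if embedding_size % n_heads != 0:
        raise ValueError("embedding size must be a multiple of the number of heads")
    # Assumption: a non-positive override means "use n_layers".
    enc = n_encoder_layers if n_encoder_layers > 0 else n_layers
    dec = n_decoder_layers if n_decoder_layers > 0 else n_layers
    return enc, dec, embedding_size // n_heads
```

With the defaults, each of the 2 heads would operate on a 150-dimensional slice of the 300-dimensional embedding.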

Torch Generator Agent

Argument	Description
`--beam-size`	Beam size, if 1 then greedy search Default: `1`.
`--beam-min-length`	Minimum length of prediction to be generated by the beam search Default: `1`.
`--beam-context-block-ngram`	Size n-grams to block in beam search from the context. val <= 0 implies no blocking Default: `-1`.
`--beam-block-ngram`	Size n-grams to block in beam search. val <= 0 implies no blocking Default: `-1`.
`--beam-block-full-context`	Block n-grams from the full history context. Specify False to block up to m tokens in the past, where m is truncation parameter for agent Default: `True`.
`--beam-length-penalty`	Applies a length penalty. Set to 0 for no penalty. Default: `0.65`.
`--inference`	Generation algorithm Choices: `beam`, `nucleus`, `delayedbeam`, `greedy`, `delayednucleusbeam`, `topk`, `factual_nucleus`. Default: `greedy`.
`--topk`	K used in Top K sampling Default: `10`.
`--topp`	P used in nucleus sampling Default: `0.9`.
`--beam-delay`	Used in delayedbeam search Default: `30`.
`--lambda-decay`	Decay factor in factual nucleus sampling Default: `0.9`.
`--omega-bound`	Lower bound in factual nucleus sampling Default: `0.3`.
`--p-reset`	Whether to reset p value in factual nucleus at full stops Default: `True`.
`--beam-block-list-filename`	Load a text file of hard blocks for beam search to never say.
`--temperature`	Temperature to add during decoding Default: `1.0`.
`--compute-tokenized-bleu`	If true, compute tokenized bleu scores Default: `False`.
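
To illustrate what `--beam-block-ngram` is meant to prevent, the sketch below computes which next tokens would repeat an n-gram already present in a partial hypothesis. It is a simplified, pure-Python picture of n-gram blocking in general, not the library's beam search; the function name `blocked_next_tokens` is hypothetical.

```python
from typing import List, Set


def blocked_next_tokens(hypothesis: List[str], n: int) -> Set[str]:
    """Tokens that would complete an n-gram already present in the hypothesis."""
    # n <= 0 means no blocking; a hypothesis shorter than n has no full n-gram yet.
    if n <= 0 or len(hypothesis) < n:
        return set()
    # The last n-1 tokens form the prefix of whatever n-gram comes next.
    prefix = tuple(hypothesis[len(hypothesis) - (n - 1):]) if n > 1 else ()
    blocked = set()
    for i in range(len(hypothesis) - n + 1):
        ngram = hypothesis[i : i + n]
        if tuple(ngram[:-1]) == prefix:
            blocked.add(ngram[-1])
    return blocked
```

For the hypothesis `the cat sat on the` with `n=2`, the token `cat` would be blocked because `the cat` has already been generated. Context blocking (`--beam-context-block-ngram`) applies the same idea with n-grams drawn from the input context instead.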

TorchAgent Arguments

Argument	Description
`--interactive-mode`, `--i`	Whether in full interactive mode or not, which means generating text or retrieving from a full set of candidates, which is necessary to actually do full dialogue. However, during training or quick validation (e.g. PPL for generation or ranking a few candidates for ranking models) you might want these set to off. Typically, scripts can set their preferred default behavior at the start, e.g. eval scripts. Default: `False`.
`--embedding-type`, `--emb`	Choose between different strategies for initializing word embeddings. Default is random, but can also preinitialize from Glove or Fasttext. Preinitialized embeddings can also be fixed so they are not updated during training. Choices: `random`, `glove`, `glove-fixed`, `fasttext`, `fasttext-fixed`, `fasttext_cc`, `fasttext_cc-fixed`. Default: `random`.
`--embedding-projection`, `--embp`	If pretrained embeddings have a different dimensionality than your embedding size, strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append “-force” to your choice. Default: `random`.
`--fp16`	Use fp16 computations. Default: `False`.
`--fp16-impl`	Implementation of FP16 to use Choices: `safe`, `mem_efficient`. Default: `safe`.
`--rank-candidates`, `--rc`	Whether the model should parse candidates for ranking. Default: `False`.
`--truncate`, `--tr`	Truncate input lengths to increase speed / use less memory. Default: `-1`.
`--text-truncate`	Text input truncation length: if not specified, this will default to `truncate`
`--label-truncate`	Label truncation length: if not specified, this will default to `truncate`
`--history-reversed`	Reverse the history Default: `False`.
`--history-size`, `--histsz`	Number of past dialog utterances to remember. Default: `-1`.
`--person-tokens`, `--pt`	Add person tokens to history. Adds p1 in front of input text and p2 in front of past labels (when available) or past utterances generated by the model. These are added to the dictionary during initialization. Default: `False`.
`--split-lines`	Split the dialogue history on newlines and save in separate vectors Default: `False`.
`--delimiter`	Join history lines with this token, defaults to newline Default: `\n`.
`--special-tok-lst`	Comma separated list of special tokens. In case of ambiguous parses from special tokens, the ordering provided in this arg sets precedence.
`-gpu`, `--gpu`	Which GPU to use Default: `-1`.
`--no-cuda`	Disable GPUs even if available. otherwise, will use GPUs if available on the device. Default: `False`.

Optimizer Arguments

Argument	Description
`--optimizer`, `--opt`	Optimizer choice. Choices: `adadelta`, `adagrad`, `adam`, `adamw`, `sparseadam`, `adamax`, `asgd`, `sgd`, `radam`, `rprop`, `rmsprop`, `optimizer`, `nadam`, `lbfgs`, `mem_eff_adam`, `adafactor`. Default: `sgd`.
`--learningrate`, `--lr`	Learning rate Default: `1`.
`--gradient-clip`, `--clip`	Gradient clipping using l2 norm Default: `0.1`.
`--adafactor-eps`	Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively Default: `1e-30,1e-3`. Recommended: `1e-30,1e-3`.
`--momentum`, `--mom`	If applicable, momentum value for optimizer. Default: `0`.
`--nesterov`	If applicable, whether to use nesterov momentum. Default: `True`.
`--nus`, `--nu`	If applicable, nu value(s) for optimizer. can use a single value like 0.7 or a comma-separated tuple like 0.7,1.0 Default: `0.7`.
`--betas`, `--beta`	If applicable, beta value(s) for optimizer. can use a single value like 0.9 or a comma-separated tuple like 0.9,0.999 Default: `0.9,0.999`.
`--weight-decay`, `--wdecay`	Weight decay on the weights.

BPEHelper Arguments

Argument	Description
`--bpe-vocab`	Path to pre-trained tokenizer vocab
`--bpe-merge`	Path to pre-trained tokenizer merge
`--bpe-dropout`	Use BPE dropout during training.

Learning Rate Scheduler

Argument	Description
`--lr-scheduler`	Learning rate scheduler. Choices: `reduceonplateau`, `none`, `fixed`, `invsqrt`, `cosine`, `linear`. Default: `reduceonplateau`.
`--lr-scheduler-patience`	LR scheduler patience, in number of validation runs. If using the fixed scheduler, LR is decayed every `patience` validations. Default: `3`.
`--lr-scheduler-decay`	Decay factor for LR scheduler, or how much LR is multiplied by when it is lowered. Default: `0.5`.
`--invsqrt-lr-decay-gamma`	Constant used only to find the LR multiplier for the invsqrt scheduler. Must be set for `--lr-scheduler invsqrt`. Default: `-1`.

TransformerGeneratorRagAgent Options¶

optional arguments

Argument	Description
`--gpu-beam-blocking`	Set to use CUDA kernel for beam search ngram blocking Default: `False`.
`--verbose-topk`	Return the topk logits in the act message, if verbose mode is set. Default: `-1`.

Transformer Arguments

Argument	Description
`--embedding-size`, `--esz`	Size of all embedding layers. Must be a multiple of `--n-heads`. Default: `300`.
`--n-layers`, `--nl`	Number of transformer layers. Default: `2`.
`--ffn-size`, `--hid`	Hidden size of the FFN layers Default: `300`.
`--dropout`	Dropout used around embeddings and before layer normalizations. This is used in Vaswani 2017 and works well on large datasets. Default: `0.0`.
`--attention-dropout`	Dropout used after attention softmax. This is not used in Vaswani 2017. Default: `0.0`.
`--relu-dropout`	Dropout used after the ReLU in the FFN. Not used in Vaswani 2017, but used in Tensor2Tensor. Default: `0.0`.
`--n-heads`	Number of multihead attention heads Default: `2`.
`--learn-positional-embeddings`	If off, sinusoidal embeddings are used. If on, position embeddings are learned from scratch. Default: `False`.
`--embeddings-scale`	Default: `True`.
`--n-segments`	The number of segments that support the model. If zero no segment and no langs_embedding. Default: `0`.
`--variant`	Chooses locations of layer norms, etc. prelayernorm is used to match some fairseq models Choices: `xlm`, `prelayernorm`, `bart`, `aiayn`. Default: `aiayn`. Recommended: `xlm`.
`--activation`	Nonlinear activation to use. AIAYN uses relu, but more recent papers prefer gelu. Choices: `gelu`, `relu`. Default: `relu`. Recommended: `gelu`.
`--output-scaling`	Scale the output of every transformer by this quantity. Default: `1.0`.
`--share-word-embeddings`	Share word embeddings table for candidate and context in the memory network. Default: `True`.
`--n-encoder-layers`, `--nel`	This will override `--n-layers` for asymmetrical transformers. Default: `-1`.
`--n-decoder-layers`, `--ndl`	This will override `--n-layers` for asymmetrical transformers. Default: `-1`.
`--model-parallel`	Shard the layers across multiple GPUs. Default: `False`.
`--checkpoint-activations`	Recompute activations on backward pass to conserve memory. Default: `False`.

Torch Generator Agent

Argument	Description
`--beam-size`	Beam size, if 1 then greedy search Default: `1`.
`--beam-min-length`	Minimum length of prediction to be generated by the beam search Default: `1`.
`--beam-context-block-ngram`	Size n-grams to block in beam search from the context. val <= 0 implies no blocking Default: `-1`.
`--beam-block-ngram`	Size n-grams to block in beam search. val <= 0 implies no blocking Default: `-1`.
`--beam-block-full-context`	Block n-grams from the full history context. Specify False to block up to m tokens in the past, where m is truncation parameter for agent Default: `True`.
`--beam-length-penalty`	Applies a length penalty. Set to 0 for no penalty. Default: `0.65`.
`--inference`	Generation algorithm Choices: `beam`, `nucleus`, `delayedbeam`, `greedy`, `delayednucleusbeam`, `topk`, `factual_nucleus`. Default: `greedy`.
`--topk`	K used in Top K sampling Default: `10`.
`--topp`	P used in nucleus sampling Default: `0.9`.
`--beam-delay`	Used in delayedbeam search Default: `30`.
`--lambda-decay`	Decay factor in factual nucleus sampling Default: `0.9`.
`--omega-bound`	Lower bound in factual nucleus sampling Default: `0.3`.
`--p-reset`	Whether to reset p value in factual nucleus at full stops Default: `True`.
`--beam-block-list-filename`	Load a text file of hard blocks for beam search to never say.
`--temperature`	Temperature to add during decoding Default: `1.0`.
`--compute-tokenized-bleu`	If true, compute tokenized bleu scores Default: `False`.
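
The `--lambda-decay`, `--omega-bound`, and `--p-reset` options all shape how the nucleus mass changes over the course of a generation under `--inference factual_nucleus`. The sketch below assumes the commonly described rule `p_t = max(omega, p * lambda^(t-1))`, with `t` restarting after a full stop when resetting is enabled. That rule, the function name `factual_nucleus_p_schedule`, and the use of a literal `"."` token as the full stop are assumptions for illustration and may differ from the actual implementation.

```python
from typing import List


def factual_nucleus_p_schedule(
    tokens: List[str],
    topp: float = 0.9,
    lambda_decay: float = 0.9,
    omega_bound: float = 0.3,
    p_reset: bool = True,
) -> List[float]:
    """Return the nucleus mass used when generating each token in `tokens`."""
    schedule = []
    steps_since_reset = 0
    for tok in tokens:
        # Decay p geometrically, but never let it fall below the lower bound.
        p_t = max(omega_bound, topp * lambda_decay ** steps_since_reset)
        schedule.append(p_t)
        steps_since_reset += 1
        # Assumption: a literal "." token marks a full stop.
        if p_reset and tok == ".":
            steps_since_reset = 0
    return schedule
```

Under these assumptions, sampling starts each sentence with the full `--topp` mass and becomes progressively more conservative toward the end of the sentence, which is the intuition behind the decay and lower-bound parameters.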

TorchAgent Arguments

Argument	Description
`--interactive-mode`, `--i`	Whether in full interactive mode or not, which means generating text or retrieving from a full set of candidates, which is necessary to actually do full dialogue. However, during training or quick validation (e.g. PPL for generation or ranking a few candidates for ranking models) you might want these set to off. Typically, scripts can set their preferred default behavior at the start, e.g. eval scripts. Default: `False`.
`--embedding-type`, `--emb`	Choose between different strategies for initializing word embeddings. Default is random, but can also preinitialize from Glove or Fasttext. Preinitialized embeddings can also be fixed so they are not updated during training. Choices: `random`, `glove`, `glove-fixed`, `fasttext`, `fasttext-fixed`, `fasttext_cc`, `fasttext_cc-fixed`. Default: `random`.
`--embedding-projection`, `--embp`	If pretrained embeddings have a different dimensionality than your embedding size, strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append “-force” to your choice. Default: `random`.
`--fp16`	Use fp16 computations. Default: `False`.
`--fp16-impl`	Implementation of FP16 to use Choices: `safe`, `mem_efficient`. Default: `safe`.
`--rank-candidates`, `--rc`	Whether the model should parse candidates for ranking. Default: `False`.
`--truncate`, `--tr`	Truncate input lengths to increase speed / use less memory. Default: `-1`.
`--text-truncate`	Text input truncation length: if not specified, this will default to `truncate`
`--label-truncate`	Label truncation length: if not specified, this will default to `truncate`
`--history-reversed`	Reverse the history Default: `False`.
`--history-size`, `--histsz`	Number of past dialog utterances to remember. Default: `-1`.
`--person-tokens`, `--pt`	Add person tokens to history. Adds p1 in front of input text and p2 in front of past labels (when available) or past utterances generated by the model. These are added to the dictionary during initialization. Default: `False`.
`--split-lines`	Split the dialogue history on newlines and save in separate vectors Default: `False`.
`--delimiter`	Join history lines with this token, defaults to newline Default: `\n`.
`--special-tok-lst`	Comma separated list of special tokens. In case of ambiguous parses from special tokens, the ordering provided in this arg sets precedence.
`-gpu`, `--gpu`	Which GPU to use Default: `-1`.
`--no-cuda`	Disable GPUs even if available. otherwise, will use GPUs if available on the device. Default: `False`.

Optimizer Arguments

Argument	Description
`--optimizer`, `--opt`	Optimizer choice. Choices: `adadelta`, `adagrad`, `adam`, `adamw`, `sparseadam`, `adamax`, `asgd`, `sgd`, `radam`, `rprop`, `rmsprop`, `optimizer`, `nadam`, `lbfgs`, `mem_eff_adam`, `adafactor`. Default: `sgd`.
`--learningrate`, `--lr`	Learning rate Default: `1`.
`--gradient-clip`, `--clip`	Gradient clipping using l2 norm Default: `0.1`.
`--adafactor-eps`	Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively Default: `1e-30,1e-3`. Recommended: `1e-30,1e-3`.
`--momentum`, `--mom`	If applicable, momentum value for optimizer. Default: `0`.
`--nesterov`	If applicable, whether to use nesterov momentum. Default: `True`.
`--nus`, `--nu`	If applicable, nu value(s) for optimizer. can use a single value like 0.7 or a comma-separated tuple like 0.7,1.0 Default: `0.7`.
`--betas`, `--beta`	If applicable, beta value(s) for optimizer. can use a single value like 0.9 or a comma-separated tuple like 0.9,0.999 Default: `0.9,0.999`.
`--weight-decay`, `--wdecay`	Weight decay on the weights.

BPEHelper Arguments

Argument	Description
`--bpe-vocab`	Path to pre-trained tokenizer vocab
`--bpe-merge`	Path to pre-trained tokenizer merge
`--bpe-dropout`	Use BPE dropout during training.

Learning Rate Scheduler

Argument	Description
`--lr-scheduler`	Learning rate scheduler. Choices: `reduceonplateau`, `none`, `fixed`, `invsqrt`, `cosine`, `linear`. Default: `reduceonplateau`.
`--lr-scheduler-patience`	LR scheduler patience, in number of validation runs. If using the fixed scheduler, LR is decayed every `patience` validations. Default: `3`.
`--lr-scheduler-decay`	Decay factor for LR scheduler, or how much LR is multiplied by when it is lowered. Default: `0.5`.
`--invsqrt-lr-decay-gamma`	Constant used only to find the LR multiplier for the invsqrt scheduler. Must be set for `--lr-scheduler invsqrt`. Default: `-1`.
