Fusion in Decoder (FiD)

The FiD model was first described in Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering (G. Izacard, E. Grave, 2020); the original implementation can be found here. The implementation we provide uses the RAG models as a backbone; thus, instructions for the options to use when running a FiD model can be found in the RAG README, as well as on the corresponding project page. Simply swap `--model rag` with `--model fid`, and you're good to go; for example:
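A minimal sketch of the swap, assuming a working ParlAI installation; the task, model file, and batch size shown are illustrative:

```bash
# Train a RAG model on Wizard of Wikipedia...
parlai train_model --model rag --task wizard_of_wikipedia \
    --model-file /tmp/rag_model --batchsize 16

# ...then the equivalent FiD run: only the --model flag changes.
parlai train_model --model fid --task wizard_of_wikipedia \
    --model-file /tmp/fid_model --batchsize 16
```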
DictionaryAgent Options

BPEHelper Arguments

Argument | Description
---|---
 | Path to pre-trained tokenizer vocab
 | Path to pre-trained tokenizer merge
 | Use BPE dropout during training.
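As a sketch, these tokenizer options are passed on the command line; the flag names below (`--dict-tokenizer`, `--bpe-vocab`, `--bpe-merge`, `--bpe-dropout`) are the ones used by ParlAI's BPEHelper, but verify them with `--helpfull` for your version:

```bash
parlai train_model --model fid --task wizard_of_wikipedia \
    --dict-tokenizer bytelevelbpe \
    --bpe-vocab /path/to/vocab.json \
    --bpe-merge /path/to/merges.txt \
    --bpe-dropout 0.1  # enable BPE dropout during training
```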
FidAgent Options

optional arguments

Argument | Description
---|---
 | Set to use CUDA kernel for beam search ngram blocking
 | Return the topk logits in the act message, if verbose mode is set.
TorchRankerAgent

Argument | Description
---|---
 | The source of candidates during training (see TorchRankerAgent._build_candidates() for details).
 | The source of candidates during evaluation (defaults to the same value as --candidates if no flag is given)
 | The source of candidates during interactive mode. Since batchsize == 1 in interactive mode, we cannot use batch candidates.
 | Block repeating previous utterances. Helpful for many models that score repeats highly, so switched on by default.
 | A text file of fixed candidates to use for all examples, one candidate per line
 | One of "reuse", "replace", or a path to a file with vectors corresponding to the candidates at --fixed-candidates-path. The default path is /path/to/model-file.<cands_name>, where <cands_name> is the name of the file (not the full path) passed by the flag --fixed-candidates-path. By default, this file is created once and reused. To replace it, use the "replace" option.
 | Cache and save the encoding of the candidate vecs. This might be used when interacting with the model in real time or evaluating on a fixed candidate set when the encoding of the candidates is independent of the input.
 | Initialize model with weights from this file.
 | Get predictions and calculate mean rank during the train step. Turning this on may slow down training.
 | Limit on the number of predictions in output.text_candidates
 | Ignore examples for which the label is not present in the label candidates. The default behavior results in a RuntimeError.
 | If k > 0, ranking returns the top k results; otherwise, every single candidate is sorted according to the ranking.
 | Return sorted candidate scores from eval_step
Transformer Arguments

Argument | Description
---|---
 | Use memories: must implement the function
 | Wrap memory encoder with MLP
 | Similarity for basic attention mechanism when using transformer to encode memories
 | Default:
 | Default:
 | Learn embeddings
 | Use model in data parallel, requires multiple gpus
 | Type of reduction at the end of the transformer
Polyencoder Arguments

Argument | Description
---|---
 | Type of polyencoder: either we compute vectors using codes + attention, or we simply take the first N vectors.
 | Number of vectors used to represent the context. In the case of n_first, this is the number of vectors that are considered.
 | Type of the top aggregation layer of the poly-encoder (where the candidate representation is the key)
 | In case poly-attention-type is multihead, specify the number of heads
 | Type
 | In case codes-attention-type is multihead, specify the number of heads
Transformer Arguments

Argument | Description
---|---
 | Size of all embedding layers. Must be a multiple of --n-heads.
 | Number of transformer layers.
 | Hidden size of the FFN layers
 | Dropout used around embeddings and before layer normalizations. This is used in Vaswani 2017 and works well on large datasets.
 | Dropout used after attention softmax. This is not used in Vaswani 2017.
 | Dropout used after the ReLU in the FFN. Not used in Vaswani 2017, but used in Tensor2Tensor.
 | Number of multihead attention heads
 | If off, sinusoidal embeddings are used. If on, position embeddings are learned from scratch.
 | Default:
 | The number of segments that support the model. If zero, no segment and no langs_embedding.
 | Chooses locations of layer norms, etc. prelayernorm is used to match some fairseq models
 | Nonlinear activation to use. AIAYN uses relu, but more recent papers prefer gelu.
 | Scale the output of every transformer by this quantity.
 | Share word embeddings table for candidate and context in the memory network
 | This will override the n-layers for asymmetrical transformers
 | This will override the n-layers for asymmetrical transformers
 | Shard the layers across multiple GPUs.
 | Recompute activations on backward pass to conserve memory.
RAG Model Args

Argument | Description
---|---
 | Which generation model to use
 | Which query model to use for DPR.
 | Which RAG model decoding to use.
 | Whether to use thorough decoding for RAG-Sequence.
Modified RAG Args

Argument | Description
---|---
 | Specify > 0 to include extra positions in the encoder, in which retrieved knowledge will go. In this setup, knowledge is appended instead of prepended.
 | Key in the observation dict that indicates the gold knowledge passage. Specify, along with --debug, to compute passage retrieval metrics at train/test time.
 | Key in the observation dict that indicates the gold knowledge passage title. Specify, along with --debug, to compute passage retrieval metrics at train/test time.
RAG Retriever Args

Argument | Description
---|---
 | What to use as the query for retrieval.
 | Which retriever to use
 | Load specified small index, for debugging.
 | How many documents to retrieve
 | Minimum amount of information to retain from document. Useful to define if the encoder does not use a lot of BPE token context.
 | Maximum amount of information to retain from document.
 | Max token length of query for retrieval.
 | Whether to print docs; usually useful during interactive mode.
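A sketch of a common retriever configuration, assuming the flag names from the ParlAI RAG implementation (`--rag-retriever-type`, `--n-docs`, `--print-docs`); values are illustrative:

```bash
parlai interactive --model-file /tmp/fid_model \
    --rag-retriever-type dpr --n-docs 5 \
    --print-docs true  # print retrieved docs in interactive mode
```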
RAG Dense Passage Retriever Args

Argument | Description
---|---
 | Path to FAISS Index.
 | Path to dense embeddings directory used to build index. Default None will assume embeddings and index are in the same directory.
 | Path to DPR Model.
 | Path to DPR passages, used to build index.
 | Embedding size of dense retriever
RAG TFIDF Retriever Args

Argument | Description
---|---
 | If > 0, limit documents to this many paragraphs
 | Optionally override TFIDF model.
RAG DPR-POLY Retriever Args

Argument | Description
---|---
 | In two stage retrieval, how many DPR documents to retrieve
 | In two stage retrieval, how much weight to give to the poly scores. Note: Learned parameter. Specify initial value here
 | Which init model to initialize polyencoder with. Specify wikito or reddit to use models from the ParlAI zoo; otherwise, provide a path to a trained polyencoder
RAG PolyFAISS retriever args

Argument | Description
---|---
 | Path to poly-encoder for use in poly-faiss retrieval.
RAG ReGReT args

Argument | Description
---|---
 | Retrieve, Generate, Retrieve, Tune. Retrieve, generate, then retrieve again, and finally tune (refine).
 | Maximum length in intermediate regret generation
 | Path to model for initial round of retrieval.
 | Path to dict file for model for initial round of retrieval.
 | Overrides the index used with the ReGReT model, if using separate models. I.e., the initial round of retrieval uses the same index as specified for the second round of retrieval
RAG Indexer Args

Argument | Description
---|---
 | Granularity of RAG Indexer. Choose compressed to save on RAM costs, at the possible expense of accuracy.
 | Buffer size for adding vectors to the index
 | If specified, builds compressed indexer from a FAISS Index Factory. See https://github.com/facebookresearch/faiss/wiki/The-index-factory for details
 | How many centroids to search in compressed indexer. See https://github.com/facebookresearch/faiss/wiki/Faiss-indexes#cell-probe-methods-indexivf-indexes for details
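A sketch of a compressed-index setup, assuming the flag names from the ParlAI RAG implementation (`--indexer-type`, `--compressed-indexer-factory`, `--compressed-indexer-nprobe`); the factory string is one example of the FAISS index-factory syntax linked above:

```bash
parlai eval_model --model-file /tmp/fid_model --task wizard_of_wikipedia \
    --indexer-type compressed \
    --compressed-indexer-factory IVF4096_HNSW128,PQ128 \
    --compressed-indexer-nprobe 64  # centroids searched at query time
```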
RAG-Turn Args

Argument | Description
---|---
 | How many turns to split up retrieval into. The most recent text is split by delimiter; all turns after the (n-1)th turn are combined.
 | How to marginalize rag-turn.
 | Discount factor for turns beyond the most recent one. We employ exponential discounting. Only considered if 0 < factor < 1.0.
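One plausible reading of the exponential discounting described above, with turns indexed t = 0 (most recent) through T - 1 and discount factor 0 < delta < 1 (this indexing and normalization are assumptions, not taken from the source):

```latex
w_t = \frac{\delta^{\,t}}{\sum_{t'=0}^{T-1} \delta^{\,t'}}, \qquad t = 0, \dots, T - 1
```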
Torch Generator Agent

Argument | Description
---|---
 | Beam size, if 1 then greedy search
 | Minimum length of prediction to be generated by the beam search
 | Size of n-grams to block in beam search from the context. val <= 0 implies no blocking
 | Size of n-grams to block in beam search. val <= 0 implies no blocking
 | Block n-grams from the full history context. Specify False to block up to m tokens in the past, where m is the truncation parameter for the agent
 | Applies a length penalty. Set to 0 for no penalty.
 | Generation algorithm
 | K used in Top K sampling
 | P used in nucleus sampling
 | Used in delayedbeam search
 | Decay factor in factual nucleus sampling
 | Lower bound in factual nucleus sampling
 | Whether to reset p value in factual nucleus at full stops
 | Load a text file of hard blocks for beam search to never say.
 | Temperature to add during decoding
 | If true, compute tokenized bleu scores
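For instance, a typical decoding setup using standard ParlAI generator flags (`--inference`, `--beam-size`, `--beam-min-length`, `--beam-block-ngram`, `--beam-context-block-ngram`, `--topp`); values are illustrative:

```bash
# Beam search with n-gram blocking:
parlai interactive --model-file /tmp/fid_model \
    --inference beam --beam-size 5 --beam-min-length 20 \
    --beam-block-ngram 3 --beam-context-block-ngram 3

# Or nucleus sampling instead:
parlai interactive --model-file /tmp/fid_model \
    --inference nucleus --topp 0.9
```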
TorchAgent Arguments

Argument | Description
---|---
 | Whether in full interactive mode or not, which means generating text or retrieving from a full set of candidates, which is necessary to actually do full dialogue. However, during training or quick validation (e.g. PPL for generation or ranking a few candidates for ranking models) you might want these set to off. Typically, scripts can set their preferred default behavior at the start, e.g. eval scripts.
 | Choose between different strategies for initializing word embeddings. Default is random, but can also preinitialize from GloVe or fastText. Preinitialized embeddings can also be fixed so they are not updated during training.
 | If pretrained embeddings have a different dimensionality than your embedding size, strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append "-force" to your choice.
 | Use fp16 computations.
 | Implementation of FP16 to use
 | Whether the model should parse candidates for ranking.
 | Truncate input lengths to increase speed / use less memory.
 | Text input truncation length: if not specified, this will default to
 | Label truncation length: if not specified, this will default to
 | Reverse the history
 | Number of past dialog utterances to remember.
 | Add person tokens to history. Adds p1 in front of input text and p2 in front of past labels when available or past utterances generated by the model. These are added to the dictionary during initialization.
 | Split the dialogue history on newlines and save in separate vectors
 | Join history lines with this token; defaults to newline
 | Comma-separated list of special tokens. In case of ambiguous parses from special tokens, the ordering provided in this arg sets precedence.
 | Which GPU to use
 | Disable GPUs even if available. Otherwise, will use GPUs if available on the device.
Optimizer Arguments

Argument | Description
---|---
 | Optimizer choice. Possible values: adadelta, adagrad, adam, adamw, sparseadam, adamax, asgd, sgd, radam, rprop, rmsprop, optimizer, nadam, lbfgs, mem_eff_adam, adafactor.
 | Learning rate
 | Gradient clipping using l2 norm
 | Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively
 | If applicable, momentum value for optimizer.
 | If applicable, whether to use nesterov momentum.
 | If applicable, nu value(s) for optimizer. Can use a single value like 0.7 or a comma-separated tuple like 0.7,1.0
 | If applicable, beta value(s) for optimizer. Can use a single value like 0.9 or a comma-separated tuple like 0.9,0.999
 | Weight decay on the weights.
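A sketch of a common optimizer configuration using ParlAI's `--optimizer`, `--learningrate`, `--gradient-clip`, and `--betas` flags; values are illustrative:

```bash
parlai train_model --model fid --task wizard_of_wikipedia \
    --optimizer adam --learningrate 1e-5 \
    --gradient-clip 0.1 --betas 0.9,0.999
```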
BPEHelper Arguments

Argument | Description
---|---
 | Path to pre-trained tokenizer vocab
 | Path to pre-trained tokenizer merge
 | Use BPE dropout during training.
Learning Rate Scheduler

Argument | Description
---|---
 | Learning rate scheduler.
 | LR scheduler patience. In number of validation runs. If using fixed scheduler, LR is decayed every
 | Decay factor for LR scheduler, or how much LR is multiplied by when it is lowered.
 | Constant used only to find the lr multiplier for the invsqrt scheduler. Must be set for --lr-scheduler invsqrt
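For example, pairing the optimizer settings above with a plateau-based scheduler (`--lr-scheduler`, `--lr-scheduler-patience`, and `--lr-scheduler-decay` are standard ParlAI flags; values are illustrative):

```bash
parlai train_model --model fid --task wizard_of_wikipedia \
    --lr-scheduler reduceonplateau \
    --lr-scheduler-patience 3 --lr-scheduler-decay 0.5
```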
T5 Args

Argument | Description
---|---
 | Choices:
 | Use HF model parallel
 | Dropout for T5
 | Task-specific generation config for T5
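A sketch of running FiD with a T5 backbone, assuming the T5 flag names from ParlAI's HuggingFace integration (`--t5-model-arch`, `--t5-model-parallel`, `--t5-dropout`); verify the exact names and architecture choices with `--helpfull`:

```bash
parlai train_model --model fid --task wizard_of_wikipedia \
    --generation-model t5 --t5-model-arch t5-base \
    --t5-model-parallel false --t5-dropout 0.1
```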
RagAgent Options

The RagAgent exposes the same option groups as the FidAgent above (optional arguments, TorchRankerAgent, Transformer, Polyencoder, RAG, Torch Generator Agent, TorchAgent, Optimizer, BPEHelper, Learning Rate Scheduler, and T5 arguments); see FidAgent Options for the full tables.
SearchQueryFAISSIndexFiDAgent Options

optional arguments

Argument | Description
---|---
 | Set to use CUDA kernel for beam search ngram blocking
 | Return the topk logits in the act message, if verbose mode is set.
 | Document chunk size (in characters).
The remaining option groups (TorchRankerAgent through the T5 Args) are identical to those listed under FidAgent Options above.
Search Query FiD Params

Argument | Description
---|---
 | Path to a query generator model.
 | Generation algorithm for the search query generator model
 | The beam_min_length opt for the search query generator model
 | The beam_size opt for the search query generator model
 | Truncates the input to the search query generator model
 | The number of tokens in each document split
 | Split the docs by white space (word) or dict tokens.
 | Number of document chunks to keep if a document is too long and has to be split.
 | How to rank doc chunks.
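A sketch of wiring up the search-query generator, assuming the flag names used by ParlAI's search-query FiD agents (`--search-query-generator-model-file`, `--search-query-generator-inference`, `--search-query-generator-beam-size`, `--search-query-generator-beam-min-length`) and, as one example, the BlenderBot2 query generator from the model zoo; treat all of these as assumptions and check `--helpfull`:

```bash
parlai interactive --model-file /tmp/fid_model \
    --search-query-generator-model-file zoo:blenderbot2/query_generator/model \
    --search-query-generator-inference beam \
    --search-query-generator-beam-size 1 \
    --search-query-generator-beam-min-length 2
```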
SearchQueryFiDAgent Options

The SearchQueryFiDAgent exposes the same option groups as the SearchQueryFAISSIndexFiDAgent above, including the Search Query FiD Params; see the tables under SearchQueryFAISSIndexFiDAgent Options and FidAgent Options.
SearchQuerySearchEngineFiDAgent Options¶
optional arguments
Argument |
Description |
---|---|
|
Set to use CUDA kernel for beam search ngram blocking |
|
Return the topk logits in the act message, if verbose mode is set. |
|
Document chunk size (in characters). |
TorchRankerAgent
Argument |
Description |
---|---|
|
The source of candidates during training (see TorchRankerAgent._build_candidates() for details). |
|
The source of candidates during evaluation (defaults to the samevalue as –candidates if no flag is given) |
|
The source of candidates during interactive mode. Since in interactive mode, batchsize == 1, we cannot use batch candidates. |
|
Block repeating previous utterances. Helpful for many models that score repeats highly, so switched on by default. |
|
A text file of fixed candidates to use for all examples, one candidate per line |
|
One of “reuse”, “replace”, or a path to a file with vectors corresponding to the candidates at –fixed-candidates-path. The default path is a /path/to/model-file.<cands_name>, where <cands_name> is the name of the file (not the full path) passed by the flag –fixed-candidates-path. By default, this file is created once and reused. To replace it, use the “replace” option. |
|
Cache and save the encoding of the candidate vecs. This might be used when interacting with the model in real time or evaluating on fixed candidate set when the encoding of the candidates is independent of the input. |
|
Initialize model with weights from this file. |
|
Get predictions and calculate mean rank during the train step. Turning this on may slow down training. |
|
Limit to the number of predictions in output.text_candidates |
|
Ignore examples for which the label is not present in the label candidates. Default behavior results in RuntimeError. |
|
Ranking returns the top k results of k > 0, otherwise sorts every single candidate according to the ranking. |
|
Return sorted candidate scores from eval_step |
Transformer Arguments
Argument |
Description |
---|---|
|
Use memories: must implement the function |
|
Wrap memory encoder with MLP |
|
Similarity for basic attention mechanism when using transformer to encode memories |
|
Default: |
|
Default: |
|
Learn embeddings |
|
Use model in data parallel, requires multiple gpus |
|
Type of reduction at the end of transformer |
Polyencoder Arguments
Argument |
Description |
---|---|
|
Type of polyencoder, either we computevectors using codes + attention, or we simply take the first N vectors. |
|
Number of vectors used to represent the contextin the case of n_first, those are the numberof vectors that are considered. |
|
Type of the top aggregation layer of the poly-encoder (where the candidate representation isthe key) |
|
In case poly-attention-type is multihead, specify the number of heads |
|
Type |
|
In case codes-attention-type is multihead, specify the number of heads |
Transformer Arguments
Argument |
Description |
---|---|
|
Size of all embedding layers. Must be a multiple of –n-heads. |
|
Number of transformer layers. |
|
Hidden size of the FFN layers |
|
Dropout used around embeddings and before layer layer normalizations. This is used in Vaswani 2017 and works well on large datasets. |
|
Dropout used after attention softmax. This is not used in Vaswani 2017. |
|
Dropout used after the ReLU in the FFN. Not used in Vaswani 2017, but used in Tensor2Tensor. |
|
Number of multihead attention heads |
|
If off, sinusoidal embeddings are used. If on, position embeddings are learned from scratch. |
|
Default: |
|
The number of segments that support the model. If zero no segment and no langs_embedding. |
|
Chooses locations of layer norms, etc. prelayernorm is used to match some fairseq models |
|
Nonlinear activation to use. AIAYN uses relu, but more recent papers prefer gelu. |
|
Scale the output of every transformer by this quantity. |
|
Share word embeddings table for candidate and contextin the memory network |
|
This will overidde the n-layers for asymmetrical transformers |
|
This will overidde the n-layers for asymmetrical transformers |
|
Shard the layers across multiple GPUs. |
|
Recompute activations on backward pass to conserve memory. |
RAG Model Args
Argument |
Description |
---|---|
|
Which generation model to use |
|
Which query model to use for DPR. |
|
Which rag model decoding to use. |
|
Whether to use thorough decoding for rag sequence. |
Modified RAG Args
Argument |
Description |
---|---|
|
Specify > 0 to include extra positions in the encoder, in which retrieved knowledge will go. In this setup, knowledge is appended instead of prepended. |
|
Key in the observation dict that indicates the gold knowledge passage. Specify, along with –debug, to compute passage retrieval metrics at train/test time. |
|
Key in the observation dict that indicates the gold knowledge passage title. Specify, along with –debug, to compute passage retrieval metrics at train/test time. |
RAG Retriever Args
Argument |
Description |
---|---|
|
What to use as the query for retrieval. |
|
Which retriever to use |
|
Load specified small index, for debugging. |
|
How many documents to retrieve |
|
Minimum amount of information to retain from document. Useful to define if encoder does not use a lot of BPE token context. |
|
Maximum amount of information to retain from document. |
|
Max token length of query for retrieval. |
|
Whether to print docs; usually useful during interactive mode. |
RAG Dense Passage Retriever Args
Argument |
Description |
---|---|
|
Path to FAISS Index. |
|
Path to dense embeddings directory used to build index. Default None will assume embeddings and index are in the same directory. |
|
Path to DPR Model. |
|
Path to DPR passages, used to build index. |
|
Embedding size of dense retriever |
RAG TFIDF Retriever Args
Argument |
Description |
---|---|
|
If > 0, limit documents to this many paragraphs |
|
Optionally override TFIDF model. |
RAG DPR-POLY Retriever Args
Argument |
Description |
---|---|
|
In two stage retrieval, how many DPR documents to retrieve |
|
In two stage retrieval, how much weight to give to the poly scores. Note: Learned parameter. Specify initial value here |
|
Which init model to initialize polyencoder with. Specify wikito or reddit to use models from the ParlAI zoo; otherwise, provide a path to a trained polyencoder |
RAG PolyFAISS retriever args
Argument |
Description |
---|---|
|
Path to poly-encoder for use in poly-faiss retrieval. |
RAG ReGReT args
Argument |
Description |
---|---|
|
Retrieve, Generate, Retrieve, Tune. Retrieve, generate, then retrieve again, and finally tune (refine). |
|
Maximum length in intermediate regret generation |
|
Path to model for initial round of retrieval. |
|
Path to dict file for model for initial round of retrieval. |
|
Overrides the index used with the ReGReT model, if using separate models; i.e., the initial round of retrieval uses the same index as specified for the second round of retrieval. |
RAG Indexer Args
Argument |
Description |
---|---|
|
Granularity of RAG Indexer. Choose compressed to save on RAM costs, at the possible expense of accuracy. |
|
Buffer size for adding vectors to the index |
|
If specified, builds the compressed indexer from a FAISS Index Factory; see https://github.com/facebookresearch/faiss/wiki/The-index-factory for details. |
|
How many centroids to search in compressed indexer. See https://github.com/facebookresearch/faiss/wiki/Faiss-indexes#cell-probe-methods-indexivf-indexes for details |
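A hedged example of the compressed indexer options (the factory string mirrors a common ParlAI default; treat it as illustrative):

```bash
# Sketch: compressed IVF+PQ indexing via the FAISS index factory,
# probing 64 centroids at search time.
parlai train_model \
  --model rag --task wizard_of_wikipedia \
  --indexer-type compressed \
  --compressed-indexer-factory IVF4096_HNSW128,PQ128 \
  --compressed-indexer-nprobe 64 \
  --model-file /tmp/rag_demo
```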
RAG-Turn Args
Argument |
Description |
---|---|
|
How many turns to split retrieval into. The most recent text is split by the delimiter; all turns after the (n-1)th turn are combined. |
|
How to marginalize rag-turn. |
|
Discount factor for turns beyond the most recent one. We employ exponential discounting. Only considered if 0 < factor < 1.0. |
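A sketch of a RAG-Turn configuration follows; the `--rag-turn-*` flag names are assumed from the ParlAI RAG implementation and are worth verifying with `parlai train_model --help`.

```bash
# Sketch: RAG-Turn over the two most recent turns,
# marginalizing over documents and then turns.
parlai train_model \
  --model rag --rag-model-type turn --task wizard_of_wikipedia \
  --rag-turn-n-turns 2 \
  --rag-turn-marginalize doc_then_turn \
  --rag-turn-discount-factor 1.0 \
  --model-file /tmp/rag_turn_demo
```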
Torch Generator Agent
Argument |
Description |
---|---|
|
Beam size; if 1, greedy search is used. |
|
Minimum length of prediction to be generated by the beam search |
|
Size of n-grams to block in beam search from the context. A value <= 0 implies no blocking |
|
Size of n-grams to block in beam search. A value <= 0 implies no blocking |
|
Block n-grams from the full history context. Specify False to block up to m tokens in the past, where m is the truncation parameter for the agent |
|
Applies a length penalty. Set to 0 for no penalty. |
|
Generation algorithm |
|
K used in Top K sampling |
|
P used in nucleus sampling |
|
Used in delayedbeam search |
|
Decay factor in factual nucleus sampling |
|
Lower bound in factual nucleus sampling |
|
Whether to reset p value in factual nucleus at full stops |
|
Load a text file of hard blocks for beam search to never say. |
|
Temperature to apply during decoding |
|
If true, compute tokenized bleu scores |
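To make the decoding flags concrete, here is a sketch contrasting blocked beam search with nucleus sampling (the model-file path is a placeholder):

```bash
# Sketch: beam search with trigram blocking against both generation and context...
parlai eval_model --model-file /tmp/rag_demo \
  --inference beam --beam-size 3 --beam-min-length 20 \
  --beam-block-ngram 3 --beam-context-block-ngram 3

# ...versus nucleus (top-p) sampling.
parlai eval_model --model-file /tmp/rag_demo \
  --inference nucleus --topp 0.9
```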
TorchAgent Arguments
Argument |
Description |
---|---|
|
Whether to run in full interactive mode, i.e., generating text or retrieving from a full set of candidates, which is necessary for actual full dialogue. During training or quick validation (e.g., PPL for generation, or ranking a few candidates for ranking models), you may want this off. Scripts can typically set their preferred default behavior at the start, e.g., eval scripts. |
|
Choose between different strategies for initializing word embeddings. Default is random, but embeddings can also be preinitialized from GloVe or fastText. Preinitialized embeddings can also be fixed so they are not updated during training. |
|
If pretrained embeddings have a different dimensionality than your embedding size, strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append “-force” to your choice. |
|
Use fp16 computations. |
|
Implementation of FP16 to use |
|
Whether the model should parse candidates for ranking. |
|
Truncate input lengths to increase speed / use less memory. |
|
Text input truncation length: if not specified, this will default to |
|
Label truncation length: if not specified, this will default to |
|
Reverse the history |
|
Number of past dialog utterances to remember. |
|
Add person tokens to history. Adds p1 in front of input text and p2 in front of past labels when available, or past utterances generated by the model. These are added to the dictionary during initialization. |
|
Split the dialogue history on newlines and save in separate vectors |
|
Join history lines with this token, defaults to newline |
|
Comma separated list of special tokens. In case of ambiguous parses from special tokens, the ordering provided in this arg sets precedence. |
|
Which GPU to use |
|
Disable GPUs even if available. Otherwise, GPUs will be used if available on the device. |
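A sketch of common TorchAgent memory and speed settings (the task and paths are placeholders):

```bash
# Sketch: truncate inputs and labels, keep the full history, and train in fp16.
parlai train_model \
  --model fid --task wizard_of_wikipedia \
  --truncate 512 --text-truncate 512 --label-truncate 128 \
  --history-size -1 --fp16 true \
  --model-file /tmp/fid_demo
```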
Optimizer Arguments
Argument |
Description |
---|---|
|
Optimizer choice. Possible values: adadelta, adagrad, adam, adamw, sparseadam, adamax, asgd, sgd, radam, rprop, rmsprop, optimizer, nadam, lbfgs, mem_eff_adam, adafactor. |
|
Learning rate |
|
Gradient clipping using l2 norm |
|
Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively |
|
If applicable, momentum value for optimizer. |
|
If applicable, whether to use nesterov momentum. |
|
If applicable, nu value(s) for optimizer. Can use a single value like 0.7 or a comma-separated tuple like 0.7,1.0 |
|
If applicable, beta value(s) for optimizer. Can use a single value like 0.9 or a comma-separated tuple like 0.9,0.999 |
|
Weight decay on the weights. |
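For example, a hedged AdamW configuration (the values are illustrative, not recommendations):

```bash
# Sketch: AdamW with gradient clipping, custom betas, and weight decay.
parlai train_model \
  --model fid --task wizard_of_wikipedia \
  --optimizer adamw --learningrate 1e-5 \
  --gradient-clip 0.1 --betas 0.9,0.999 --weight-decay 0.01 \
  --model-file /tmp/fid_demo
```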
BPEHelper Arguments
Argument |
Description |
---|---|
|
Path to pre-trained tokenizer vocab |
|
Path to pre-trained tokenizer merge |
|
Use BPE dropout during training. |
Learning Rate Scheduler
Argument |
Description |
---|---|
|
Learning rate scheduler. |
|
LR scheduler patience, in number of validation runs. If using the fixed scheduler, LR is decayed every |
|
Decay factor for LR scheduler, or how much LR is multiplied by when it is lowered. |
|
Constant used only to find the lr multiplier for the invsqrt scheduler. Must be set for --lr-scheduler invsqrt |
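A sketch tying these scheduler flags together (values illustrative):

```bash
# Sketch: halve the LR after 3 validation runs without improvement.
parlai train_model \
  --model fid --task wizard_of_wikipedia \
  --lr-scheduler reduceonplateau \
  --lr-scheduler-patience 3 --lr-scheduler-decay 0.5 \
  --model-file /tmp/fid_demo
```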
T5 Args
Argument |
Description |
---|---|
|
Choices: |
|
Use HF model parallel |
|
Dropout for T5 |
|
Task specific generation config for T5 |
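A hedged example of swapping in a T5 generator (the architecture choice is illustrative):

```bash
# Sketch: RAG with a T5-large generator; --t5-model-parallel uses
# Hugging Face model parallelism across available GPUs.
parlai train_model \
  --model rag --task wizard_of_wikipedia \
  --generation-model t5 --t5-model-arch t5-large \
  --t5-model-parallel true --t5-dropout 0.1 \
  --model-file /tmp/rag_t5_demo
```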
Search Query FiD Params
Argument |
Description |
---|---|
|
Path to a query generator model. |
|
Generation algorithm for the search query generator model |
|
The beam_min_length opt for the search query generator model |
|
The beam_size opt for the search query generator model |
|
Truncates the input to the search query generator model |
|
The number of tokens in each document split |
|
Split the docs by white space (word) or dict tokens. |
|
Number of document chunks to keep if a document is too long and has to be split. |
|
How to rank doc chunks. |
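A sketch of a search-query FiD setup follows; the `--search-query-generator-*` flag names and the `zoo:sea/bart_sq_gen/model` path follow ParlAI's search-engine FiD examples, and should be treated as assumptions to verify.

```bash
# Sketch: generate a search query with a BART query generator, then split
# retrieved documents into 256-token chunks and keep the top-ranked one.
parlai interactive \
  --model-file /tmp/search_fid_demo \
  --search-query-generator-model-file zoo:sea/bart_sq_gen/model \
  --search-query-generator-inference beam \
  --search-query-generator-beam-size 1 \
  --search-query-generator-beam-min-length 2 \
  --splitted-chunk-length 256 --n-ranked-doc-chunks 1
```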
Search Engine FiD Params
Argument |
Description |
---|---|
|
A search server address. |
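And a minimal sketch of pointing a search-engine FiD model at a running search server (the address is a placeholder; the retriever-type value is assumed from the ParlAI retriever choices):

```bash
# Sketch: route retrieval through an external search server.
parlai interactive \
  --model-file /tmp/search_fid_demo \
  --rag-retriever-type search_engine \
  --search-server 0.0.0.0:8080
```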
WizIntGoldDocRetrieverFiDAgent Options¶
optional arguments
Argument |
Description |
---|---|
|
Set to use CUDA kernel for beam search ngram blocking |
|
Return the topk logits in the act message, if verbose mode is set. |
|
Document chunk size (in characters). |
TorchRankerAgent
Argument |
Description |
---|---|
|
The source of candidates during training (see TorchRankerAgent._build_candidates() for details). |
|
The source of candidates during evaluation (defaults to the same value as --candidates if no flag is given) |
|
The source of candidates during interactive mode. Since in interactive mode, batchsize == 1, we cannot use batch candidates. |
|
Block repeating previous utterances. Helpful for many models that score repeats highly, so switched on by default. |
|
A text file of fixed candidates to use for all examples, one candidate per line |
|
One of “reuse”, “replace”, or a path to a file with vectors corresponding to the candidates at --fixed-candidates-path. The default path is /path/to/model-file.<cands_name>, where <cands_name> is the name of the file (not the full path) passed by the flag --fixed-candidates-path. By default, this file is created once and reused. To replace it, use the “replace” option. |
|
Cache and save the encoding of the candidate vecs. This might be used when interacting with the model in real time or evaluating on a fixed candidate set when the encoding of the candidates is independent of the input. |
|
Initialize model with weights from this file. |
|
Get predictions and calculate mean rank during the train step. Turning this on may slow down training. |
|
Limit to the number of predictions in output.text_candidates |
|
Ignore examples for which the label is not present in the label candidates. Default behavior results in a RuntimeError. |
|
Ranking returns the top k results if k > 0; otherwise, every single candidate is sorted according to the ranking. |
|
Return sorted candidate scores from eval_step |
Transformer Arguments
Argument |
Description |
---|---|
|
Use memories: must implement the function |
|
Wrap memory encoder with MLP |
|
Similarity for basic attention mechanism when using transformer to encode memories |
|
Default: |
|
Default: |
|
Learn embeddings |
|
Use model in data parallel, requires multiple GPUs |
|
Type of reduction at the end of transformer |
Polyencoder Arguments
Argument |
Description |
---|---|
|
Type of polyencoder: either we compute vectors using codes + attention, or we simply take the first N vectors. |
|
Number of vectors used to represent the context. In the case of n_first, this is the number of vectors that are considered. |
|
Type of the top aggregation layer of the poly-encoder (where the candidate representation is the key) |
|
In case poly-attention-type is multihead, specify the number of heads |
|
Type |
|
In case codes-attention-type is multihead, specify the number of heads |
Transformer Arguments
Argument |
Description |
---|---|
|
Size of all embedding layers. Must be a multiple of --n-heads. |
|
Number of transformer layers. |
|
Hidden size of the FFN layers |
|
Dropout used around embeddings and before layer normalizations. This is used in Vaswani 2017 and works well on large datasets. |
|
Dropout used after attention softmax. This is not used in Vaswani 2017. |
|
Dropout used after the ReLU in the FFN. Not used in Vaswani 2017, but used in Tensor2Tensor. |
|
Number of multihead attention heads |
|
If off, sinusoidal embeddings are used. If on, position embeddings are learned from scratch. |
|
Default: |
|
The number of segments supported by the model. If zero, no segment and no langs_embedding are used. |
|
Chooses the locations of layer norms, etc. The prelayernorm variant is used to match some fairseq models. |
|
Nonlinear activation to use. AIAYN uses relu, but more recent papers prefer gelu. |
|
Scale the output of every transformer by this quantity. |
|
Share word embeddings table for candidate and context in the memory network |
|
This will override the n-layers for asymmetrical transformers |
|
This will override the n-layers for asymmetrical transformers |
|
Shard the layers across multiple GPUs. |
|
Recompute activations on backward pass to conserve memory. |
RAG Model Args
Argument |
Description |
---|---|
|
Which generation model to use |
|
Which query model to use for DPR. |
|
Which rag model decoding to use. |
|
Whether to use thorough decoding for rag sequence. |
Modified RAG Args
Argument |
Description |
---|---|
|
Specify > 0 to include extra positions in the encoder, in which retrieved knowledge will go. In this setup, knowledge is appended instead of prepended. |
|
Key in the observation dict that indicates the gold knowledge passage. Specify, along with --debug, to compute passage retrieval metrics at train/test time. |
|
Key in the observation dict that indicates the gold knowledge passage title. Specify, along with --debug, to compute passage retrieval metrics at train/test time. |
RAG Retriever Args
Argument |
Description |
---|---|
|
What to use as the query for retrieval. |
|
Which retriever to use |
|
Load specified small index, for debugging. |
|
How many documents to retrieve |
|
Minimum amount of information to retain from a document. Useful to set if the encoder does not use much BPE token context. |
|
Maximum amount of information to retain from a document. |
|
Max token length of query for retrieval. |
|
Whether to print docs; usually useful during interactive mode. |
RAG Dense Passage Retriever Args
Argument |
Description |
---|---|
|
Path to FAISS Index. |
|
Path to dense embeddings directory used to build index. Default None will assume embeddings and index are in the same directory. |
|
Path to DPR Model. |
|
Path to DPR passages, used to build index. |
|
Embedding size of dense retriever |
RAG TFIDF Retriever Args
Argument |
Description |
---|---|
|
If > 0, limit documents to this many paragraphs |
|
Optionally override TFIDF model. |
RAG DPR-POLY Retriever Args
Argument |
Description |
---|---|
|
In two stage retrieval, how many DPR documents to retrieve |
|
In two stage retrieval, how much weight to give to the poly scores. Note: this is a learned parameter; specify its initial value here. |
|
Which init model to initialize polyencoder with. Specify wikito or reddit to use models from the ParlAI zoo; otherwise, provide a path to a trained polyencoder |
RAG PolyFAISS retriever args
Argument |
Description |
---|---|
|
Path to poly-encoder for use in poly-faiss retrieval. |
RAG ReGReT args
Argument |
Description |
---|---|
|
Retrieve, Generate, Retrieve, Tune. Retrieve, generate, then retrieve again, and finally tune (refine). |
|
Maximum length in intermediate regret generation |
|
Path to model for initial round of retrieval. |
|
Path to dict file for model for initial round of retrieval. |
|
Overrides the index used with the ReGReT model, if using separate models; i.e., the initial round of retrieval uses the same index as specified for the second round of retrieval. |
RAG Indexer Args
Argument |
Description |
---|---|
|
Granularity of RAG Indexer. Choose compressed to save on RAM costs, at the possible expense of accuracy. |
|
Buffer size for adding vectors to the index |
|
If specified, builds the compressed indexer from a FAISS Index Factory; see https://github.com/facebookresearch/faiss/wiki/The-index-factory for details. |
|
How many centroids to search in compressed indexer. See https://github.com/facebookresearch/faiss/wiki/Faiss-indexes#cell-probe-methods-indexivf-indexes for details |
RAG-Turn Args
Argument |
Description |
---|---|
|
How many turns to split retrieval into. The most recent text is split by the delimiter; all turns after the (n-1)th turn are combined. |
|
How to marginalize rag-turn. |
|
Discount factor for turns beyond the most recent one. We employ exponential discounting. Only considered if 0 < factor < 1.0. |
Torch Generator Agent
Argument |
Description |
---|---|
|
Beam size; if 1, greedy search is used. |
|
Minimum length of prediction to be generated by the beam search |
|
Size of n-grams to block in beam search from the context. A value <= 0 implies no blocking |
|
Size of n-grams to block in beam search. A value <= 0 implies no blocking |
|
Block n-grams from the full history context. Specify False to block up to m tokens in the past, where m is the truncation parameter for the agent |
|
Applies a length penalty. Set to 0 for no penalty. |
|
Generation algorithm |
|
K used in Top K sampling |
|
P used in nucleus sampling |
|
Used in delayedbeam search |
|
Decay factor in factual nucleus sampling |
|
Lower bound in factual nucleus sampling |
|
Whether to reset p value in factual nucleus at full stops |
|
Load a text file of hard blocks for beam search to never say. |
|
Temperature to apply during decoding |
|
If true, compute tokenized bleu scores |
TorchAgent Arguments
Argument |
Description |
---|---|
|
Whether to run in full interactive mode, i.e., generating text or retrieving from a full set of candidates, which is necessary for actual full dialogue. During training or quick validation (e.g., PPL for generation, or ranking a few candidates for ranking models), you may want this off. Scripts can typically set their preferred default behavior at the start, e.g., eval scripts. |
|
Choose between different strategies for initializing word embeddings. Default is random, but embeddings can also be preinitialized from GloVe or fastText. Preinitialized embeddings can also be fixed so they are not updated during training. |
|
If pretrained embeddings have a different dimensionality than your embedding size, strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append “-force” to your choice. |
|
Use fp16 computations. |
|
Implementation of FP16 to use |
|
Whether the model should parse candidates for ranking. |
|
Truncate input lengths to increase speed / use less memory. |
|
Text input truncation length: if not specified, this will default to |
|
Label truncation length: if not specified, this will default to |
|
Reverse the history |
|
Number of past dialog utterances to remember. |
|
Add person tokens to history. Adds p1 in front of input text and p2 in front of past labels when available, or past utterances generated by the model. These are added to the dictionary during initialization. |
|
Split the dialogue history on newlines and save in separate vectors |
|
Join history lines with this token, defaults to newline |
|
Comma separated list of special tokens. In case of ambiguous parses from special tokens, the ordering provided in this arg sets precedence. |
|
Which GPU to use |
|
Disable GPUs even if available. Otherwise, GPUs will be used if available on the device. |
Optimizer Arguments
Argument |
Description |
---|---|
|
Optimizer choice. Possible values: adadelta, adagrad, adam, adamw, sparseadam, adamax, asgd, sgd, radam, rprop, rmsprop, optimizer, nadam, lbfgs, mem_eff_adam, adafactor. |
|
Learning rate |
|
Gradient clipping using l2 norm |
|
Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively |
|
If applicable, momentum value for optimizer. |
|
If applicable, whether to use nesterov momentum. |
|
If applicable, nu value(s) for optimizer. Can use a single value like 0.7 or a comma-separated tuple like 0.7,1.0 |
|
If applicable, beta value(s) for optimizer. Can use a single value like 0.9 or a comma-separated tuple like 0.9,0.999 |
|
Weight decay on the weights. |
BPEHelper Arguments
Argument |
Description |
---|---|
|
Path to pre-trained tokenizer vocab |
|
Path to pre-trained tokenizer merge |
|
Use BPE dropout during training. |
Learning Rate Scheduler
Argument |
Description |
---|---|
|
Learning rate scheduler. |
|
LR scheduler patience, in number of validation runs. If using the fixed scheduler, LR is decayed every |
|
Decay factor for LR scheduler, or how much LR is multiplied by when it is lowered. |
|
Constant used only to find the lr multiplier for the invsqrt scheduler. Must be set for --lr-scheduler invsqrt |
T5 Args
Argument |
Description |
---|---|
|
Choices: |
|
Use HF model parallel |
|
Dropout for T5 |
|
Task specific generation config for T5 |
Search Query FiD Params
Argument |
Description |
---|---|
|
Path to a query generator model. |
|
Generation algorithm for the search query generator model |
|
The beam_min_length opt for the search query generator model |
|
The beam_size opt for the search query generator model |
|
Truncates the input to the search query generator model |
|
The number of tokens in each document split |
|
Split the docs by white space (word) or dict tokens. |
|
Number of document chunks to keep if a document is too long and has to be split. |
|
How to rank doc chunks. |