Image+Seq2Seq¶
The Image+Seq2Seq agent is a model that incorporates image features into a sequence-to-sequence transformer generator. It is a core component of the dodecaDialogue task.
Basic Examples¶
Train an Image+Seq2Seq model on an image captioning task:
python parlai/scripts/train_model.py -m image_seq2seq -t flickr30k --image-mode resnext101_32x48d_wsl -mf /tmp/model
Train an Image+Seq2Seq model on a dialogue task:
python parlai/scripts/train_model.py -m image_seq2seq -t convai2 -mf /tmp/model
Multi-task train an Image+Seq2Seq model on a dialogue and captioning task:
python parlai/scripts/train_model.py -m image_seq2seq -t flickr30k,convai2 -mf /tmp/model --image-mode resnext101_32x48d_wsl
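Evaluate a trained model with ParlAI's standard evaluation script (an illustrative command reusing the model file and image mode from the examples above):
python parlai/scripts/eval_model.py -m image_seq2seq -t flickr30k -mf /tmp/model --image-mode resnext101_32x48d_wsl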
DictionaryAgent Options¶
BPEHelper Arguments
Argument | Description
---|---
--bpe-vocab | Path to pre-trained tokenizer vocab
--bpe-merge | Path to pre-trained tokenizer merge
--bpe-dropout | Use BPE dropout during training.
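For example, a byte-level BPE dictionary can be pointed at existing tokenizer files; the paths below are placeholders, and --dict-tokenizer is a DictionaryAgent option not listed in this table:
python parlai/scripts/train_model.py -m image_seq2seq -t convai2 -mf /tmp/model --dict-tokenizer bytelevelbpe --bpe-vocab /path/to/vocab.json --bpe-merge /path/to/merges.txt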
ImageSeq2seqAgent Options¶
optional arguments
Argument | Description
---|---
--gpu-beam-blocking | Set to use CUDA kernel for beam search ngram blocking
--verbose-topk | Return the topk logits in the act message, if verbose mode is set.
Transformer Arguments
Argument | Description
---|---
--embedding-size | Size of all embedding layers. Must be a multiple of --n-heads.
--n-layers | Number of transformer layers.
--ffn-size | Hidden size of the FFN layers.
--dropout | Dropout used around embeddings and before layer normalizations. This is used in Vaswani 2017 and works well on large datasets.
--attention-dropout | Dropout used after attention softmax. This is not used in Vaswani 2017.
--relu-dropout | Dropout used after the ReLU in the FFN. Not used in Vaswani 2017, but used in Tensor2Tensor.
--n-heads | Number of multihead attention heads.
--learn-positional-embeddings | If off, sinusoidal embeddings are used. If on, position embeddings are learned from scratch.
--embeddings-scale | Default: True
--n-segments | The number of segments that support the model. If zero, no segment and no langs_embedding.
--variant | Chooses locations of layer norms, etc. prelayernorm is used to match some fairseq models.
--activation | Nonlinear activation to use. AIAYN uses relu, but more recent papers prefer gelu.
--output-scaling | Scale the output of every transformer by this quantity.
--share-word-embeddings | Share word embeddings table for candidate and context in the memory network.
--n-encoder-layers | This will override the n-layers for asymmetrical transformers.
--n-decoder-layers | This will override the n-layers for asymmetrical transformers.
--model-parallel | Shard the layers across multiple GPUs.
--checkpoint-activations | Recompute activations on backward pass to conserve memory.
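These flags are set on the training command line; the sketch below uses illustrative (not recommended) values, keeping --embedding-size a multiple of --n-heads:
python parlai/scripts/train_model.py -m image_seq2seq -t convai2 -mf /tmp/model --n-layers 8 --n-heads 16 --embedding-size 512 --ffn-size 2048 --variant xlm --activation gelu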
Torch Generator Agent
Argument | Description
---|---
--beam-size | Beam size, if 1 then greedy search
--beam-min-length | Minimum length of prediction to be generated by the beam search
--beam-context-block-ngram | Size n-grams to block in beam search from the context. val <= 0 implies no blocking
--beam-block-ngram | Size n-grams to block in beam search. val <= 0 implies no blocking
--beam-block-full-context | Block n-grams from the full history context. Specify False to block up to m tokens in the past, where m is the truncation parameter for the agent
--beam-length-penalty | Applies a length penalty. Set to 0 for no penalty.
--inference | Generation algorithm
--topk | K used in Top K sampling
--topp | P used in nucleus sampling
--beam-delay | Used in delayedbeam search
--lambda-decay | Decay factor in factual nucleus sampling
--omega-bound | Lower bound in factual nucleus sampling
--p-reset | Whether to reset p value in factual nucleus at full stops
--beam-block-list-filename | Load a text file of hard blocks for beam search to never say.
--temperature | Temperature to add during decoding
--compute-tokenized-bleu | If true, compute tokenized bleu scores
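Generation flags are usually supplied at evaluation or interaction time; a sketch combining beam search with n-gram blocking (values are illustrative):
python parlai/scripts/eval_model.py -m image_seq2seq -t convai2 -mf /tmp/model --inference beam --beam-size 5 --beam-min-length 10 --beam-block-ngram 3 --beam-context-block-ngram 3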
TorchAgent Arguments
Argument | Description
---|---
--interactive-mode | Whether in full interactive mode or not, which means generating text or retrieving from a full set of candidates, which is necessary to actually do full dialogue. However, during training or quick validation (e.g. PPL for generation or ranking a few candidates for ranking models) you might want these set to off. Typically, scripts can set their preferred default behavior at the start, e.g. eval scripts.
--embedding-type | Choose between different strategies for initializing word embeddings. Default is random, but can also preinitialize from GloVe or fastText. Preinitialized embeddings can also be fixed so they are not updated during training.
--embedding-projection | If pretrained embeddings have a different dimensionality than your embedding size, strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append "-force" to your choice.
--fp16 | Use fp16 computations.
--fp16-impl | Implementation of FP16 to use
--rank-candidates | Whether the model should parse candidates for ranking.
--truncate | Truncate input lengths to increase speed / use less memory.
--text-truncate | Text input truncation length: if not specified, this will default to --truncate
--label-truncate | Label truncation length: if not specified, this will default to --truncate
--history-reversed | Reverse the history
--history-size | Number of past dialog utterances to remember.
--person-tokens | Add person tokens to history. Adds p1 in front of input text and p2 in front of past labels when available or past utterances generated by the model. These are added to the dictionary during initialization.
--split-lines | Split the dialogue history on newlines and save in separate vectors
--delimiter | Join history lines with this token, defaults to newline
--special-tok-lst | Comma separated list of special tokens. In case of ambiguous parses from special tokens, the ordering provided in this arg sets precedence.
--gpu | Which GPU to use
--no-cuda | Disable GPUs even if available. Otherwise, GPUs will be used if available on the device.
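For example, truncation and history handling might be adjusted as follows (the values are placeholders, not tuned settings):
python parlai/scripts/train_model.py -m image_seq2seq -t convai2 -mf /tmp/model --truncate 512 --history-size 3 --person-tokens true --fp16 true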
Optimizer Arguments
Argument | Description
---|---
--optimizer | Optimizer choice. Possible values: adadelta, adagrad, adam, adamw, sparseadam, adamax, asgd, sgd, radam, rprop, rmsprop, optimizer, nadam, lbfgs, mem_eff_adam, adafactor.
--learningrate | Learning rate
--gradient-clip | Gradient clipping using l2 norm
--adafactor-eps | Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively
--momentum | If applicable, momentum value for optimizer.
--nesterov | If applicable, whether to use nesterov momentum.
--nus | If applicable, nu value(s) for optimizer. Can use a single value like 0.7 or a comma-separated tuple like 0.7,1.0
--betas | If applicable, beta value(s) for optimizer. Can use a single value like 0.9 or a comma-separated tuple like 0.9,0.999
--weight-decay | Weight decay on the weights.
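As a sketch, Adam with gradient clipping could be configured like this (the learning rate is an arbitrary example, not a recommendation):
python parlai/scripts/train_model.py -m image_seq2seq -t convai2 -mf /tmp/model --optimizer adam --learningrate 1e-4 --betas 0.9,0.999 --gradient-clip 0.1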
BPEHelper Arguments
Argument | Description
---|---
--bpe-vocab | Path to pre-trained tokenizer vocab
--bpe-merge | Path to pre-trained tokenizer merge
--bpe-dropout | Use BPE dropout during training.
Learning Rate Scheduler
Argument | Description
---|---
--lr-scheduler | Learning rate scheduler.
--lr-scheduler-patience | LR scheduler patience, in number of validation runs. If using the fixed scheduler, LR is decayed every --lr-scheduler-patience validations.
--lr-scheduler-decay | Decay factor for LR scheduler, or how much LR is multiplied by when it is lowered.
--invsqrt-lr-decay-gamma | Constant used only to find the lr multiplier for the invsqrt scheduler. Must be set for --lr-scheduler invsqrt
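For instance, a reduce-on-plateau schedule with a patience of 3 validation runs (illustrative values):
python parlai/scripts/train_model.py -m image_seq2seq -t convai2 -mf /tmp/model --lr-scheduler reduceonplateau --lr-scheduler-patience 3 --lr-scheduler-decay 0.5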
Image args
Argument | Description
---|---
--image-features-dim | Dimensionality of image features
--image-encoder-num-layers | Number of linear layers to encode image features with
--n-image-tokens | Number of tokens that the image encoding will consist of. Specify to spread image encoding over multiple tokens
--n-image-channels | Number of channels that the image encoding will consist of. Specify if incoming image is multidimensional
Image Encoder Args
Argument | Description
---|---
--include-image-token | If true, include image token (or no image token) for each example
--image-fusion-type | Which fusion type to use
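Putting the image arguments together, a captioning run might look like the following sketch; the 2048-dimensional features match the ResNeXt backbone, but treat all values as placeholders:
python parlai/scripts/train_model.py -m image_seq2seq -t flickr30k -mf /tmp/model --image-mode resnext101_32x48d_wsl --image-features-dim 2048 --image-fusion-type late --n-image-tokens 1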
TransformerGeneratorAgent Options¶
optional arguments
Argument | Description
---|---
--gpu-beam-blocking | Set to use CUDA kernel for beam search ngram blocking
--verbose-topk | Return the topk logits in the act message, if verbose mode is set.
Transformer Arguments
Argument | Description
---|---
--embedding-size | Size of all embedding layers. Must be a multiple of --n-heads.
--n-layers | Number of transformer layers.
--ffn-size | Hidden size of the FFN layers.
--dropout | Dropout used around embeddings and before layer normalizations. This is used in Vaswani 2017 and works well on large datasets.
--attention-dropout | Dropout used after attention softmax. This is not used in Vaswani 2017.
--relu-dropout | Dropout used after the ReLU in the FFN. Not used in Vaswani 2017, but used in Tensor2Tensor.
--n-heads | Number of multihead attention heads.
--learn-positional-embeddings | If off, sinusoidal embeddings are used. If on, position embeddings are learned from scratch.
--embeddings-scale | Default: True
--n-segments | The number of segments that support the model. If zero, no segment and no langs_embedding.
--variant | Chooses locations of layer norms, etc. prelayernorm is used to match some fairseq models.
--activation | Nonlinear activation to use. AIAYN uses relu, but more recent papers prefer gelu.
--output-scaling | Scale the output of every transformer by this quantity.
--share-word-embeddings | Share word embeddings table for candidate and context in the memory network.
--n-encoder-layers | This will override the n-layers for asymmetrical transformers.
--n-decoder-layers | This will override the n-layers for asymmetrical transformers.
--model-parallel | Shard the layers across multiple GPUs.
--checkpoint-activations | Recompute activations on backward pass to conserve memory.
Torch Generator Agent
Argument | Description
---|---
--beam-size | Beam size, if 1 then greedy search
--beam-min-length | Minimum length of prediction to be generated by the beam search
--beam-context-block-ngram | Size n-grams to block in beam search from the context. val <= 0 implies no blocking
--beam-block-ngram | Size n-grams to block in beam search. val <= 0 implies no blocking
--beam-block-full-context | Block n-grams from the full history context. Specify False to block up to m tokens in the past, where m is the truncation parameter for the agent
--beam-length-penalty | Applies a length penalty. Set to 0 for no penalty.
--inference | Generation algorithm
--topk | K used in Top K sampling
--topp | P used in nucleus sampling
--beam-delay | Used in delayedbeam search
--lambda-decay | Decay factor in factual nucleus sampling
--omega-bound | Lower bound in factual nucleus sampling
--p-reset | Whether to reset p value in factual nucleus at full stops
--beam-block-list-filename | Load a text file of hard blocks for beam search to never say.
--temperature | Temperature to add during decoding
--compute-tokenized-bleu | If true, compute tokenized bleu scores
TorchAgent Arguments
Argument | Description
---|---
--interactive-mode | Whether in full interactive mode or not, which means generating text or retrieving from a full set of candidates, which is necessary to actually do full dialogue. However, during training or quick validation (e.g. PPL for generation or ranking a few candidates for ranking models) you might want these set to off. Typically, scripts can set their preferred default behavior at the start, e.g. eval scripts.
--embedding-type | Choose between different strategies for initializing word embeddings. Default is random, but can also preinitialize from GloVe or fastText. Preinitialized embeddings can also be fixed so they are not updated during training.
--embedding-projection | If pretrained embeddings have a different dimensionality than your embedding size, strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append "-force" to your choice.
--fp16 | Use fp16 computations.
--fp16-impl | Implementation of FP16 to use
--rank-candidates | Whether the model should parse candidates for ranking.
--truncate | Truncate input lengths to increase speed / use less memory.
--text-truncate | Text input truncation length: if not specified, this will default to --truncate
--label-truncate | Label truncation length: if not specified, this will default to --truncate
--history-reversed | Reverse the history
--history-size | Number of past dialog utterances to remember.
--person-tokens | Add person tokens to history. Adds p1 in front of input text and p2 in front of past labels when available or past utterances generated by the model. These are added to the dictionary during initialization.
--split-lines | Split the dialogue history on newlines and save in separate vectors
--delimiter | Join history lines with this token, defaults to newline
--special-tok-lst | Comma separated list of special tokens. In case of ambiguous parses from special tokens, the ordering provided in this arg sets precedence.
--gpu | Which GPU to use
--no-cuda | Disable GPUs even if available. Otherwise, GPUs will be used if available on the device.
Optimizer Arguments
Argument | Description
---|---
--optimizer | Optimizer choice. Possible values: adadelta, adagrad, adam, adamw, sparseadam, adamax, asgd, sgd, radam, rprop, rmsprop, optimizer, nadam, lbfgs, mem_eff_adam, adafactor.
--learningrate | Learning rate
--gradient-clip | Gradient clipping using l2 norm
--adafactor-eps | Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively
--momentum | If applicable, momentum value for optimizer.
--nesterov | If applicable, whether to use nesterov momentum.
--nus | If applicable, nu value(s) for optimizer. Can use a single value like 0.7 or a comma-separated tuple like 0.7,1.0
--betas | If applicable, beta value(s) for optimizer. Can use a single value like 0.9 or a comma-separated tuple like 0.9,0.999
--weight-decay | Weight decay on the weights.
BPEHelper Arguments
Argument | Description
---|---
--bpe-vocab | Path to pre-trained tokenizer vocab
--bpe-merge | Path to pre-trained tokenizer merge
--bpe-dropout | Use BPE dropout during training.
Learning Rate Scheduler
Argument | Description
---|---
--lr-scheduler | Learning rate scheduler.
--lr-scheduler-patience | LR scheduler patience, in number of validation runs. If using the fixed scheduler, LR is decayed every --lr-scheduler-patience validations.
--lr-scheduler-decay | Decay factor for LR scheduler, or how much LR is multiplied by when it is lowered.
--invsqrt-lr-decay-gamma | Constant used only to find the lr multiplier for the invsqrt scheduler. Must be set for --lr-scheduler invsqrt