Transformer

We offer a variety of agent implementations whose core model is the transformer, a self-attention-based encoding mechanism first described in Vaswani et al. (2017).

Agent Variations

  • transformer/biencoder - A retrieval-based agent that encodes a context sequence and a candidate sequence with separate BERT-based Transformers. A candidate is chosen via the highest dot-product score between the context and candidate encodings. See Humeau et al 2019 for more details.

  • transformer/classifier - A classifier agent with a Transformer as the model.

  • transformer/crossencoder - A retrieval-based agent that jointly encodes a context and candidate sequence in a single BERT-based Transformer, with a final linear layer used to compute a score. A candidate is chosen via the highest scoring encoding. See Humeau et al 2019 for more details.

  • transformer/generator - A generative agent that performs seq2seq encoding/decoding with transformer encoders/decoders.

  • transformer/polyencoder - A retrieval-based agent that, similar to the bi-encoder agent, encodes context and candidate sequences with separate BERT-based Transformers. However, to compute a final score, the agent performs an additional layer of attention using global context vectors before computing the final dot product, thus incorporating the candidate encoding into the context encoding prior to producing a dot-product score. See Humeau et al 2019 for more details.

  • transformer/ranker - A retrieval-based agent that encodes a context sequence and a candidate sequence with separate Transformers, before computing a dot-product to obtain a score for a candidate encoding.
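
The dot-product scoring shared by the bi-encoder and ranker agents can be illustrated with a minimal sketch. The vectors below are hand-written stand-ins for encoder outputs, and the helper names (dot, rank_candidates) are hypothetical; this is not the agents' actual code.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def rank_candidates(
    context_vec: Sequence[float],
    candidate_vecs: List[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (candidate index, score) pairs sorted by descending score."""
    scores = [(i, dot(context_vec, c)) for i, c in enumerate(candidate_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy encodings standing in for real context and candidate encoder outputs.
context = [1.0, 0.0, 2.0]
candidates = [[0.5, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 3.0, 0.1]]
ranking = rank_candidates(context, candidates)
best_index = ranking[0][0]  # index of the highest-scoring candidate
```

Because the candidate encodings do not depend on the context in this setup, they can be computed once and reused across contexts.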

TransformerClassifierAgent Options

optional arguments

--share-word-embeddings

Share word embeddings table for candidate and context in the memory network

Default: True.

--load-from-pretrained-ranker

Load model from base transformer ranking model (used for pretraining)

Default: False.

TorchRankerAgent

-cands, --candidates

The source of candidates during training (see TorchRankerAgent._build_candidates() for details).

Choices: batch, inline, fixed, batch-all-cands.

Default: inline.

-ecands, --eval-candidates

The source of candidates during evaluation (defaults to the same value as --candidates if no flag is given)

Choices: batch, inline, fixed, vocab, batch-all-cands.

Default: inline.

-icands, --interactive-candidates

The source of candidates during interactive mode. Because batchsize == 1 in interactive mode, batch candidates cannot be used.

Choices: fixed, inline, vocab.

Default: fixed.

--repeat-blocking-heuristic

Block repeating previous utterances. Helpful for many models that score repeats highly, so switched on by default.

Default: True.

-fcp, --fixed-candidates-path

A text file of fixed candidates to use for all examples, one candidate per line

--fixed-candidate-vecs

One of “reuse”, “replace”, or a path to a file with vectors corresponding to the candidates at --fixed-candidates-path. The default path is /path/to/model-file.<cands_name>, where <cands_name> is the name of the file (not the full path) passed by the flag --fixed-candidates-path. By default, this file is created once and reused. To replace it, use the “replace” option.

Default: reuse.

--encode-candidate-vecs

Cache and save the encoding of the candidate vecs. This might be used when interacting with the model in real time or evaluating on fixed candidate set when the encoding of the candidates is independent of the input.

Default: True.

--init-model

Initialize model with weights from this file.

--train-predict

Get predictions and calculate mean rank during the train step. Turning this on may slow down training.

Default: False.

--cap-num-predictions

Limit on the number of predictions in output.text_candidates

Default: 100.

--ignore-bad-candidates

Ignore examples for which the label is not present in the label candidates. Default behavior results in RuntimeError.

Default: False.

--rank-top-k

Ranking returns the top k results if k > 0; otherwise it sorts every single candidate according to the ranking.

Default: -1.
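
A minimal sketch of the behavior described for --rank-top-k, assuming plain Python lists of scores. The function name rank is hypothetical, and heapq.nlargest stands in for whatever top-k routine the agent actually uses.

```python
import heapq
from typing import List, Tuple


def rank(scores: List[float], rank_top_k: int = -1) -> List[Tuple[int, float]]:
    """Return (index, score) pairs in descending score order.

    If rank_top_k > 0, only the k best pairs are returned;
    otherwise every candidate is sorted.
    """
    indexed = list(enumerate(scores))
    if rank_top_k > 0:
        # Partial selection: avoids sorting the full candidate list.
        return heapq.nlargest(rank_top_k, indexed, key=lambda pair: pair[1])
    return sorted(indexed, key=lambda pair: pair[1], reverse=True)
```

With a large fixed candidate set, a positive k can save time because the remaining candidates never need to be fully ordered.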

--inference

Final response output algorithm

Choices: max, topk.

Default: max.

--topk

K used in Top K sampling inference, when selected

Default: 5.
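
The two --inference choices can be sketched as follows. This assumes that topk samples uniformly among the k best-ranked candidates, which is an assumption made for illustration; choose_response is a hypothetical helper.

```python
import random
from typing import List, Optional


def choose_response(
    ranked: List[str],
    inference: str = "max",
    topk: int = 5,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick the final response from candidates sorted best-first."""
    if not ranked:
        raise ValueError("no candidates to choose from")
    if inference == "max":
        return ranked[0]
    if inference == "topk":
        rng = rng or random.Random()
        # Assumption: uniform sampling among the k best candidates.
        return rng.choice(ranked[:topk])
    raise ValueError(f"unknown inference type: {inference}")
```

Sampling among the top k trades some accuracy for more varied responses.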

--return-cand-scores

Return sorted candidate scores from eval_step

Default: False.

Transformer Arguments

-esz, --embedding-size

Size of all embedding layers

Default: 300.

-nl, --n-layers

Default: 2.

-hid, --ffn-size

Hidden size of the FFN layers

Default: 300.

--dropout

Dropout used in Vaswani 2017.

Default: 0.0.

--attention-dropout

Dropout used after attention softmax.

Default: 0.0.

--relu-dropout

Dropout used after ReLU. From tensor2tensor.

Default: 0.0.

--n-heads

Number of multihead attention heads

Default: 2.

--learn-positional-embeddings

Default: False.

--embeddings-scale

Default: True.

--n-segments

The number of segments the model supports. If zero, no segment embedding and no langs_embedding are used.

Default: 0.

--variant

Chooses locations of layer norms, etc. prelayernorm is used to match some fairseq models

Choices: bart, prelayernorm, xlm, aiayn.

Default: aiayn. Recommended: xlm.

--activation

Nonlinear activation to use. AIAYN uses relu, but more recent papers prefer gelu.

Choices: gelu, relu.

Default: relu. Recommended: gelu.

--output-scaling

Scale the output of every transformer by this quantity.

Default: 1.0.

-nel, --n-encoder-layers

Overrides --n-layers for the encoder in asymmetrical transformers

Default: -1.

-ndl, --n-decoder-layers

Overrides --n-layers for the decoder in asymmetrical transformers

Default: -1.
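
One way to read the interaction between --n-layers, --n-encoder-layers and --n-decoder-layers is sketched below. It assumes that a non-positive value means "not set, fall back to --n-layers"; resolve_layers is a hypothetical helper, not part of the library.

```python
from typing import Tuple


def resolve_layers(
    n_layers: int = 2,
    n_encoder_layers: int = -1,
    n_decoder_layers: int = -1,
) -> Tuple[int, int]:
    """Return (encoder layers, decoder layers) after applying overrides."""
    # Assumption: values <= 0 mean the override was not given.
    encoder = n_encoder_layers if n_encoder_layers > 0 else n_layers
    decoder = n_decoder_layers if n_decoder_layers > 0 else n_layers
    return encoder, decoder
```

For example, under this reading, setting only the encoder override to 8 gives an 8-layer encoder and a decoder that keeps the --n-layers value.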

--model-parallel

Shard the layers across multiple GPUs.

Default: False.

--use-memories

Use memories: must implement the function _vectorize_memories to use this

Default: False.

--wrap-memory-encoder

Wrap memory encoder with MLP

Default: False.

--memory-attention

Similarity for basic attention mechanism when using transformer to encode memories

Choices: cosine, dot, sqrt.

Default: sqrt.
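
The three --memory-attention choices can be illustrated on plain vectors. The sketch assumes that sqrt means a dot product divided by the square root of the vector dimension (scaled dot-product attention); that reading, and the helper name similarity, are assumptions.

```python
import math
from typing import Sequence


def similarity(query: Sequence[float], memory: Sequence[float], kind: str = "sqrt") -> float:
    """Similarity between a query vector and a memory vector."""
    if len(query) != len(memory):
        raise ValueError("vectors must have the same length")
    d = sum(q * m for q, m in zip(query, memory))
    if kind == "dot":
        return d
    if kind == "sqrt":
        # Assumption: dot product scaled by sqrt of the dimension.
        return d / math.sqrt(len(query))
    if kind == "cosine":
        norm = math.sqrt(sum(q * q for q in query)) * math.sqrt(sum(m * m for m in memory))
        return d / norm if norm > 0 else 0.0
    raise ValueError(f"unknown similarity: {kind}")
```

Scaling keeps the scores in a similar range as the dimension grows, which tends to keep the subsequent softmax from saturating.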

--normalize-sent-emb

Default: False.

--share-encoders

Default: True.

--learn-embeddings

Learn embeddings

Default: True.

--reduction-type

Type of reduction at the end of transformer

Choices: first, max, mean.

Default: first.
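
The --reduction-type choices collapse a sequence of per-token vectors into a single vector. A minimal sketch on nested lists follows; it ignores padding, which a batched implementation would have to mask out, and reduce_sequence is a hypothetical name.

```python
from typing import List


def reduce_sequence(token_vecs: List[List[float]], reduction_type: str = "mean") -> List[float]:
    """Reduce a [sequence length x dim] list of vectors to one [dim] vector."""
    if not token_vecs:
        raise ValueError("empty sequence")
    if reduction_type == "first":
        # Use the encoding of the first token only.
        return list(token_vecs[0])
    columns = list(zip(*token_vecs))  # one tuple per embedding dimension
    if reduction_type == "max":
        return [max(col) for col in columns]
    if reduction_type == "mean":
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f"unknown reduction type: {reduction_type}")
```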

TorchAgent Arguments

-i, --interactive-mode

Whether in full interactive mode or not, which means generating text or retrieving from a full set of candidates, which is necessary to actually do full dialogue. However, during training or quick validation (e.g. PPL for generation or ranking a few candidates for ranking models) you might want these set to off. Typically, scripts can set their preferred default behavior at the start, e.g. eval scripts.

Default: False.

-emb, --embedding-type

Choose between different strategies for initializing word embeddings. Default is random, but can also preinitialize from Glove or Fasttext. Preinitialized embeddings can also be fixed so they are not updated during training.

Choices: random, glove, glove-fixed, fasttext, fasttext-fixed, fasttext_cc, fasttext_cc-fixed.

Default: random.

-embp, --embedding-projection

If pretrained embeddings have a different dimensionality than your embedding size, strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append “-force” to your choice.

Default: random.

--fp16

Use fp16 computations.

Default: False.

--fp16-impl

Implementation of FP16 to use

Choices: apex, mem_efficient.

Default: apex.

-rc, --rank-candidates

Whether the model should parse candidates for ranking.

Default: False.

-tr, --truncate

Truncate input lengths to increase speed / use less memory.

Default: -1.

--text-truncate

Text input truncation length: if not specified, this will default to truncate

--label-truncate

Label truncation length: if not specified, this will default to truncate

-histsz, --history-size

Number of past dialog utterances to remember.

Default: -1.

-pt, --person-tokens

Add person tokens to history. Adds __p1__ in front of input text and __p2__ in front of past labels when available or past utterances generated by the model. These are added to the dictionary during initialization.

Default: False.

--split-lines

Split the dialogue history on newlines and save in separate vectors

Default: False.

--delimiter

Join history lines with this token, defaults to newline

Default: \n.
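
How --history-size, --person-tokens and --delimiter combine can be sketched on a toy dialogue. The ("input"/"label", text) turn representation and the build_history helper are hypothetical, and treating a non-positive history size as "no limit" is an assumption.

```python
from typing import List, Tuple

P1_TOKEN = "__p1__"
P2_TOKEN = "__p2__"


def build_history(
    turns: List[Tuple[str, str]],
    history_size: int = -1,
    person_tokens: bool = False,
    delimiter: str = "\n",
) -> str:
    """Join past turns into a single context string.

    Each turn is (speaker, text), where speaker is "input" for incoming
    text and "label" for past labels or model utterances.
    """
    lines = []
    for speaker, text in turns:
        if person_tokens:
            token = P1_TOKEN if speaker == "input" else P2_TOKEN
            text = f"{token} {text}"
        lines.append(text)
    # Assumption: values <= 0 keep the whole history.
    if history_size > 0:
        lines = lines[-history_size:]
    return delimiter.join(lines)
```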

-gpu, --gpu

Which GPU to use

Default: -1.

--no-cuda

Disable GPUs even if available. Otherwise, will use GPUs if available on the device.

Default: False.

Optimizer Arguments

-opt, --optimizer

Choose between pytorch optimizers. Any member of torch.optim should be valid.

Choices: adadelta, adagrad, adam, adamw, sparseadam, adamax, asgd, sgd, rprop, rmsprop, optimizer, lbfgs, mem_eff_adam, adafactor.

Default: sgd.

-lr, --learningrate

Learning rate

Default: 1.

-clip, --gradient-clip

Gradient clipping using l2 norm

Default: 0.1.
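
Clipping by l2 norm rescales the whole gradient so that its norm does not exceed the given threshold. The sketch below works on a flat list of gradient values; clip_by_l2_norm is a hypothetical helper, and treating a non-positive threshold as "no clipping" is an assumption.

```python
import math
from typing import List


def clip_by_l2_norm(grads: List[float], max_norm: float) -> List[float]:
    """Scale grads so that their l2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Assumption: a non-positive threshold disables clipping.
    if max_norm <= 0 or total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]
```

The direction of the gradient is preserved; only its magnitude changes.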

--adafactor-eps

Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively

Default: 1e-30,1e-3. Recommended: 1e-30,1e-3.

-mom, --momentum

If applicable, momentum value for optimizer.

Default: 0.

--nesterov

If applicable, whether to use nesterov momentum.

Default: True.

-nu, --nus

If applicable, nu value(s) for optimizer. Can use a single value like 0.7 or a comma-separated tuple like 0.7,1.0

Default: 0.7.

-beta, --betas

If applicable, beta value(s) for optimizer. Can use a single value like 0.9 or a comma-separated tuple like 0.9,0.999

Default: 0.9,0.999.

-wdecay, --weight-decay

Weight decay on the weights.

BPEHelper Arguments

--bpe-vocab

Path to pre-trained tokenizer vocab

--bpe-merge

Path to pre-trained tokenizer merge

Learning Rate Scheduler

--lr-scheduler

Learning rate scheduler.

Choices: reduceonplateau, none, fixed, invsqrt, cosine, linear.

Default: reduceonplateau.

--lr-scheduler-patience

LR scheduler patience. In number of validation runs. If using fixed scheduler, LR is decayed every <patience> validations.

Default: 3.

--lr-scheduler-decay

Decay factor for LR scheduler, or how much LR is multiplied by when it is lowered.

Default: 0.5.
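
For the fixed scheduler, the patience and decay options together describe a step schedule: every <patience> validations, the learning rate is multiplied by the decay factor. A minimal sketch of that reading follows; fixed_schedule_lr is a hypothetical helper, and the exact update timing in the library may differ.

```python
def fixed_schedule_lr(
    base_lr: float,
    num_validations: int,
    patience: int = 3,
    decay: float = 0.5,
) -> float:
    """Learning rate after num_validations validation runs."""
    if patience <= 0:
        raise ValueError("patience must be positive")
    # One decay step is applied per completed block of `patience` validations.
    return base_lr * decay ** (num_validations // patience)
```

With the defaults shown above, a base learning rate of 1.0 would drop to 0.5 after three validations and to 0.25 after six.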

--max-lr-steps

Number of train steps the scheduler should take after warmup. Training is terminated after this many steps. This should only be set for --lr-scheduler cosine or linear

Default: -1.

--invsqrt-lr-decay-gamma

Constant used only to find the lr multiplier for the invsqrt scheduler. Must be set for --lr-scheduler invsqrt

Default: -1.

Torch Classifier Arguments

--classes

The name of the classes.

--class-weights

Weight of each of the classes for the softmax

--threshold

During evaluation, threshold for choosing ref class; only applies to binary classification

Default: 0.5.
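
The effect of --threshold in binary classification can be sketched as follows. The ref_class parameter and the predict_binary helper are hypothetical, and using a strict "greater than" comparison is an assumption.

```python
from typing import Sequence


def predict_binary(
    ref_prob: float,
    classes: Sequence[str],
    ref_class: str,
    threshold: float = 0.5,
) -> str:
    """Choose the reference class only if its probability exceeds threshold."""
    if len(classes) != 2:
        raise ValueError("threshold only applies to binary classification")
    if ref_class not in classes:
        raise ValueError("ref_class must be one of the classes")
    other = classes[0] if classes[1] == ref_class else classes[1]
    # Assumption: strict comparison against the threshold.
    return ref_class if ref_prob > threshold else other
```

Raising the threshold makes the classifier more conservative about predicting the reference class.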

--print-scores

Print probability of chosen class during interactive mode

Default: False.

--data-parallel

Uses nn.DataParallel for multi GPU

Default: False.

--classes-from-file

Loads the list of classes from a file

--ignore-labels

Ignore labels provided to model

TransformerGeneratorAgent Options

Transformer Arguments

-esz, --embedding-size

Size of all embedding layers

Default: 300.

-nl, --n-layers

Default: 2.

-hid, --ffn-size

Hidden size of the FFN layers

Default: 300.

--dropout

Dropout used in Vaswani 2017.

Default: 0.0.

--attention-dropout

Dropout used after attention softmax.

Default: 0.0.

--relu-dropout

Dropout used after ReLU. From tensor2tensor.

Default: 0.0.

--n-heads

Number of multihead attention heads

Default: 2.

--learn-positional-embeddings

Default: False.

--embeddings-scale

Default: True.

--n-segments

The number of segments the model supports. If zero, no segment embedding and no langs_embedding are used.

Default: 0.

--variant

Chooses locations of layer norms, etc. prelayernorm is used to match some fairseq models

Choices: bart, prelayernorm, xlm, aiayn.

Default: aiayn. Recommended: xlm.

--activation

Nonlinear activation to use. AIAYN uses relu, but more recent papers prefer gelu.

Choices: gelu, relu.

Default: relu. Recommended: gelu.

--output-scaling

Scale the output of every transformer by this quantity.

Default: 1.0.

--share-word-embeddings

Share word embeddings table for candidate and context in the memory network

Default: True.

-nel, --n-encoder-layers

Overrides --n-layers for the encoder in asymmetrical transformers

Default: -1.

-ndl, --n-decoder-layers

Overrides --n-layers for the decoder in asymmetrical transformers

Default: -1.

--model-parallel

Shard the layers across multiple GPUs.

Default: False.

Torch Generator Agent

--beam-size

Beam size, if 1 then greedy search

Default: 1.

--beam-min-length

Minimum length of prediction to be generated by the beam search

Default: 1.

--beam-context-block-ngram

Size of n-grams from the context to block in beam search. A value <= 0 implies no blocking

Default: -1.

--beam-block-ngram

Size of n-grams to block in beam search. A value <= 0 implies no blocking

Default: -1.
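
N-gram blocking prevents a hypothesis from producing an n-gram it has already produced. The check can be sketched on token lists as below; would_repeat_ngram is a hypothetical helper, not the library's beam search code. Context blocking works the same way, except that the set of seen n-grams would come from the context rather than from the hypothesis.

```python
from typing import List


def would_repeat_ngram(tokens: List[str], next_token: str, n: int) -> bool:
    """True if appending next_token recreates an n-gram already in tokens."""
    if n <= 0:
        return False  # a value <= 0 implies no blocking
    candidate = tokens + [next_token]
    if len(candidate) < n:
        return False
    new_ngram = tuple(candidate[-n:])
    seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return new_ngram in seen
```

During search, a token for which this check is true would be given a very low score so that the hypothesis is never extended with it.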

--beam-length-penalty

Applies a length penalty. Set to 0 for no penalty.

Default: 0.65.

--inference

Generation algorithm

Choices: nucleus, topk, greedy, delayedbeam, beam.

Default: greedy.

--topk

K used in Top K sampling

Default: 10.

--topp

P used in nucleus sampling

Default: 0.9.
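
Top-k and nucleus (top-p) sampling both restrict the next-token distribution before sampling from it. The sketch below shows only the filtering step on a plain probability list; the helper names are hypothetical and the code is an illustration of the general techniques, not the library's implementation.

```python
from typing import Dict, List


def _renormalize(kept: Dict[int, float]) -> Dict[int, float]:
    """Rescale the kept probabilities so they sum to 1."""
    total = sum(kept.values())
    return {i: p / total for i, p in kept.items()}


def top_k_filter(probs: List[float], k: int) -> Dict[int, float]:
    """Keep the k most probable token indices."""
    if k <= 0:
        raise ValueError("k must be positive")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize({i: probs[i] for i in order[:k]})


def nucleus_filter(probs: List[float], p: float) -> Dict[int, float]:
    """Keep the smallest set of top tokens whose total probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: Dict[int, float] = {}
    cumulative = 0.0
    for i in order:
        kept[i] = probs[i]
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(kept)
```

Top-k keeps a fixed number of tokens, while the nucleus adapts its size to how peaked the distribution is.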

--beam-delay

Used in delayedbeam search

Default: 30.

--beam-block-list-filename

Load a text file of hard blocks for beam search to never say.

--temperature

Temperature to apply during decoding

Default: 1.0.
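
Temperature is commonly applied by dividing the logits by it before the softmax. The sketch below assumes that convention; softmax_with_temperature is a hypothetical helper.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits divided by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Values below 1.0 sharpen the distribution toward the most likely token, and values above 1.0 flatten it.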

--compute-tokenized-bleu

If true, compute tokenized bleu scores

Default: False.

TorchAgent Arguments

-i, --interactive-mode

Whether in full interactive mode or not, which means generating text or retrieving from a full set of candidates, which is necessary to actually do full dialogue. However, during training or quick validation (e.g. PPL for generation or ranking a few candidates for ranking models) you might want these set to off. Typically, scripts can set their preferred default behavior at the start, e.g. eval scripts.

Default: False.

-emb, --embedding-type

Choose between different strategies for initializing word embeddings. Default is random, but can also preinitialize from Glove or Fasttext. Preinitialized embeddings can also be fixed so they are not updated during training.

Choices: random, glove, glove-fixed, fasttext, fasttext-fixed, fasttext_cc, fasttext_cc-fixed.

Default: random.

-embp, --embedding-projection

If pretrained embeddings have a different dimensionality than your embedding size, strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append “-force” to your choice.

Default: random.

--fp16

Use fp16 computations.

Default: False.

--fp16-impl

Implementation of FP16 to use

Choices: apex, mem_efficient.

Default: apex.

-rc, --rank-candidates

Whether the model should parse candidates for ranking.

Default: False.

-tr, --truncate

Truncate input lengths to increase speed / use less memory.

Default: -1.

--text-truncate

Text input truncation length: if not specified, this will default to truncate

--label-truncate

Label truncation length: if not specified, this will default to truncate

-histsz, --history-size

Number of past dialog utterances to remember.

Default: -1.

-pt, --person-tokens

Add person tokens to history. Adds __p1__ in front of input text and __p2__ in front of past labels when available or past utterances generated by the model. These are added to the dictionary during initialization.

Default: False.

--split-lines

Split the dialogue history on newlines and save in separate vectors

Default: False.

--delimiter

Join history lines with this token, defaults to newline

Default: \n.

-gpu, --gpu

Which GPU to use

Default: -1.

--no-cuda

Disable GPUs even if available. Otherwise, will use GPUs if available on the device.

Default: False.

Optimizer Arguments

-opt, --optimizer

Choose between pytorch optimizers. Any member of torch.optim should be valid.

Choices: adadelta, adagrad, adam, adamw, sparseadam, adamax, asgd, sgd, rprop, rmsprop, optimizer, lbfgs, mem_eff_adam, adafactor.

Default: sgd.

-lr, --learningrate

Learning rate

Default: 1.

-clip, --gradient-clip

Gradient clipping using l2 norm

Default: 0.1.

--adafactor-eps

Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively

Default: 1e-30,1e-3. Recommended: 1e-30,1e-3.

-mom, --momentum

If applicable, momentum value for optimizer.

Default: 0.

--nesterov

If applicable, whether to use nesterov momentum.

Default: True.

-nu, --nus

If applicable, nu value(s) for optimizer. Can use a single value like 0.7 or a comma-separated tuple like 0.7,1.0

Default: 0.7.

-beta, --betas

If applicable, beta value(s) for optimizer. Can use a single value like 0.9 or a comma-separated tuple like 0.9,0.999

Default: 0.9,0.999.

-wdecay, --weight-decay

Weight decay on the weights.

BPEHelper Arguments

--bpe-vocab

Path to pre-trained tokenizer vocab

--bpe-merge

Path to pre-trained tokenizer merge

Learning Rate Scheduler

--lr-scheduler

Learning rate scheduler.

Choices: reduceonplateau, none, fixed, invsqrt, cosine, linear.

Default: reduceonplateau.

--lr-scheduler-patience

LR scheduler patience. In number of validation runs. If using fixed scheduler, LR is decayed every <patience> validations.

Default: 3.

--lr-scheduler-decay

Decay factor for LR scheduler, or how much LR is multiplied by when it is lowered.

Default: 0.5.

--max-lr-steps

Number of train steps the scheduler should take after warmup. Training is terminated after this many steps. This should only be set for --lr-scheduler cosine or linear

Default: -1.

--invsqrt-lr-decay-gamma

Constant used only to find the lr multiplier for the invsqrt scheduler. Must be set for --lr-scheduler invsqrt

Default: -1.

TransformerRankerAgent Options

optional arguments

--share-word-embeddings

Share word embeddings table for candidate and context in the memory network

Default: True.

TorchAgent Arguments

-i, --interactive-mode

Whether in full interactive mode or not, which means generating text or retrieving from a full set of candidates, which is necessary to actually do full dialogue. However, during training or quick validation (e.g. PPL for generation or ranking a few candidates for ranking models) you might want these set to off. Typically, scripts can set their preferred default behavior at the start, e.g. eval scripts.

Default: False.

-emb, --embedding-type

Choose between different strategies for initializing word embeddings. Default is random, but can also preinitialize from Glove or Fasttext. Preinitialized embeddings can also be fixed so they are not updated during training.

Choices: random, glove, glove-fixed, fasttext, fasttext-fixed, fasttext_cc, fasttext_cc-fixed.

Default: random.

-embp, --embedding-projection

If pretrained embeddings have a different dimensionality than your embedding size, strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append “-force” to your choice.

Default: random.

--fp16

Use fp16 computations.

Default: False.

--fp16-impl

Implementation of FP16 to use

Choices: apex, mem_efficient.

Default: apex.

-rc, --rank-candidates

Whether the model should parse candidates for ranking.

Default: False.

-tr, --truncate

Truncate input lengths to increase speed / use less memory.

Default: 1024.

--text-truncate

Text input truncation length: if not specified, this will default to truncate

--label-truncate

Label truncation length: if not specified, this will default to truncate

-histsz, --history-size

Number of past dialog utterances to remember.

Default: -1.

-pt, --person-tokens

Add person tokens to history. Adds __p1__ in front of input text and __p2__ in front of past labels when available or past utterances generated by the model. These are added to the dictionary during initialization.

Default: False.

--split-lines

Split the dialogue history on newlines and save in separate vectors

Default: False.

--delimiter

Join history lines with this token, defaults to newline

Default: \n.

-gpu, --gpu

Which GPU to use

Default: -1.

--no-cuda

Disable GPUs even if available. Otherwise, will use GPUs if available on the device.

Default: False.

Optimizer Arguments

-opt, --optimizer

Choose between pytorch optimizers. Any member of torch.optim should be valid.

Choices: adadelta, adagrad, adam, adamw, sparseadam, adamax, asgd, sgd, rprop, rmsprop, optimizer, lbfgs, mem_eff_adam, adafactor.

Default: adamax.

-lr, --learningrate

Learning rate

Default: 0.0001.

-clip, --gradient-clip

Gradient clipping using l2 norm

Default: 0.1.

--adafactor-eps

Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively

Default: 1e-30,1e-3. Recommended: 1e-30,1e-3.

-mom, --momentum

If applicable, momentum value for optimizer.

Default: 0.

--nesterov

If applicable, whether to use nesterov momentum.

Default: True.

-nu, --nus

If applicable, nu value(s) for optimizer. Can use a single value like 0.7 or a comma-separated tuple like 0.7,1.0

Default: 0.7.

-beta, --betas

If applicable, beta value(s) for optimizer. Can use a single value like 0.9 or a comma-separated tuple like 0.9,0.999

Default: 0.9,0.999.

-wdecay, --weight-decay

Weight decay on the weights.

Learning Rate Scheduler

--lr-scheduler

Learning rate scheduler.

Choices: reduceonplateau, none, fixed, invsqrt, cosine, linear.

Default: reduceonplateau.

--lr-scheduler-patience

LR scheduler patience. In number of validation runs. If using fixed scheduler, LR is decayed every <patience> validations.

Default: 3.

--lr-scheduler-decay

Decay factor for LR scheduler, or how much LR is multiplied by when it is lowered.

Default: 0.5.

--max-lr-steps

Number of train steps the scheduler should take after warmup. Training is terminated after this many steps. This should only be set for --lr-scheduler cosine or linear

Default: -1.

--invsqrt-lr-decay-gamma

Constant used only to find the lr multiplier for the invsqrt scheduler. Must be set for --lr-scheduler invsqrt

Default: -1.

TorchRankerAgent

-cands, --candidates

The source of candidates during training (see TorchRankerAgent._build_candidates() for details).

Choices: batch, inline, fixed, batch-all-cands.

Default: inline.

-ecands, --eval-candidates

The source of candidates during evaluation (defaults to the same value as --candidates if no flag is given)

Choices: batch, inline, fixed, vocab, batch-all-cands.

Default: inline.

-icands, --interactive-candidates

The source of candidates during interactive mode. Because batchsize == 1 in interactive mode, batch candidates cannot be used.

Choices: fixed, inline, vocab.

Default: fixed.

--repeat-blocking-heuristic

Block repeating previous utterances. Helpful for many models that score repeats highly, so switched on by default.

Default: True.

-fcp, --fixed-candidates-path

A text file of fixed candidates to use for all examples, one candidate per line

--fixed-candidate-vecs

One of “reuse”, “replace”, or a path to a file with vectors corresponding to the candidates at --fixed-candidates-path. The default path is /path/to/model-file.<cands_name>, where <cands_name> is the name of the file (not the full path) passed by the flag --fixed-candidates-path. By default, this file is created once and reused. To replace it, use the “replace” option.

Default: reuse.

--encode-candidate-vecs

Cache and save the encoding of the candidate vecs. This might be used when interacting with the model in real time or evaluating on fixed candidate set when the encoding of the candidates is independent of the input.

Default: True.

--init-model

Initialize model with weights from this file.

--train-predict

Get predictions and calculate mean rank during the train step. Turning this on may slow down training.

Default: False.

--cap-num-predictions

Limit on the number of predictions in output.text_candidates

Default: 100.

--ignore-bad-candidates

Ignore examples for which the label is not present in the label candidates. Default behavior results in RuntimeError.

Default: False.

--rank-top-k

Ranking returns the top k results if k > 0; otherwise it sorts every single candidate according to the ranking.

Default: -1.

--inference

Final response output algorithm

Choices: max, topk.

Default: max.

--topk

K used in Top K sampling inference, when selected

Default: 5.

--return-cand-scores

Return sorted candidate scores from eval_step

Default: False.

Transformer Arguments

-esz, --embedding-size

Size of all embedding layers

Default: 300.

-nl, --n-layers

Default: 2.

-hid, --ffn-size

Hidden size of the FFN layers

Default: 300.

--dropout

Dropout used in Vaswani 2017.

Default: 0.0.

--attention-dropout

Dropout used after attention softmax.

Default: 0.0.

--relu-dropout

Dropout used after ReLU. From tensor2tensor.

Default: 0.0.

--n-heads

Number of multihead attention heads

Default: 2.

--learn-positional-embeddings

Default: False.

--embeddings-scale

Default: True.

--n-segments

The number of segments the model supports. If zero, no segment embedding and no langs_embedding are used.

Default: 0.

--variant

Chooses locations of layer norms, etc. prelayernorm is used to match some fairseq models

Choices: bart, prelayernorm, xlm, aiayn.

Default: aiayn. Recommended: xlm.

--activation

Nonlinear activation to use. AIAYN uses relu, but more recent papers prefer gelu.

Choices: gelu, relu.

Default: relu. Recommended: gelu.

--output-scaling

Scale the output of every transformer by this quantity.

Default: 1.0.

-nel, --n-encoder-layers

Overrides --n-layers for the encoder in asymmetrical transformers

Default: -1.

-ndl, --n-decoder-layers

Overrides --n-layers for the decoder in asymmetrical transformers

Default: -1.

--model-parallel

Shard the layers across multiple GPUs.

Default: False.

--use-memories

Use memories: must implement the function _vectorize_memories to use this

Default: False.

--wrap-memory-encoder

Wrap memory encoder with MLP

Default: False.

--memory-attention

Similarity for basic attention mechanism when using transformer to encode memories

Choices: cosine, dot, sqrt.

Default: sqrt.

--normalize-sent-emb

Default: False.

--share-encoders

Default: True.

--learn-embeddings

Learn embeddings

Default: True.

--data-parallel

Use model in data parallel, requires multiple gpus

Default: False.

--reduction-type

Type of reduction at the end of transformer

Choices: first, max, mean.

Default: mean.

BPEHelper Arguments

--bpe-vocab

Path to pre-trained tokenizer vocab

--bpe-merge

Path to pre-trained tokenizer merge