Seq2Seq Agent

The Seq2Seq agent takes an input sequence and produces an output sequence.

The agent supports encoding and decoding via a variety of RNN flavors, including LSTMs and GRUs. It also supports numerous decoding strategies, like beam search and nucleus decoding.

The following papers outline more information regarding this model:

Seq2seqAgent Options

Seq2Seq Arguments

-hs, --hiddensize

Size of the hidden layers

Default: 128.

-esz, --embeddingsize

Size of the token embeddings

Default: 128.

-nl, --numlayers

Number of hidden layers

Default: 2.

-dr, --dropout

Dropout rate

Default: 0.1.

-bi, --bidirectional

Whether to encode the context with a bidirectional RNN.

Default: False.

-att, --attention

Type of attention to use. If set to local, also set --attention-length (see arxiv.org/abs/1508.04025).

Choices: none, concat, general, dot, local.

Default: none.

-attl, --attention-length

Length of local attention.

Default: 48.

--attention-time

Whether to apply attention before or after decoding.

Choices: pre, post.

Default: post.

-rnn, --rnn-class

Choose between different types of RNNs.

Choices: rnn, gru, lstm.

Default: lstm.

-dec, --decoder

Choose between different decoder modules. The default, “same”, uses the same class as the encoder, while “shared” also uses the same weights. Note that shared disables some encoder options, in particular bidirectionality.

Choices: same, shared.

Default: same.

-lt, --lookuptable

The encoder, decoder, and output modules can share weights, or not. Unique has independent embeddings for each. Enc_dec shares the embedding for the encoder and decoder. Dec_out shares decoder embedding and output weights. All shares all three weights.

Choices: unique, enc_dec, dec_out, all.

Default: unique.

-soft, --numsoftmax

If greater than 1, uses a mixture of softmaxes (see arxiv.org/abs/1711.03953).

Default: 1.
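
As an illustration of the mixture-of-softmaxes idea, here is a minimal pure-Python sketch. It assumes the output distribution is a weighted sum of K softmax distributions, with the weights themselves produced by a softmax. The function names and inputs are hypothetical and do not come from the agent's code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_softmaxes(prior_logits, component_logits):
    """Combine K softmax distributions using softmax-normalized mixture weights.

    prior_logits: K unnormalized mixture weights (hypothetical input).
    component_logits: K lists of vocabulary logits, one per component.
    """
    weights = softmax(prior_logits)
    components = [softmax(row) for row in component_logits]
    vocab_size = len(components[0])
    return [
        sum(w * comp[v] for w, comp in zip(weights, components))
        for v in range(vocab_size)
    ]


# Two equally weighted components that each favor a different token.
mixed = mixture_of_softmaxes([0.0, 0.0], [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
# With a single component the result reduces to an ordinary softmax.
single = mixture_of_softmaxes([0.0], [[2.0, 0.0, 0.0]])
```

With --numsoftmax 1 there is only one component, so the mixture is the same as a standard softmax output layer.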

-idr, --input-dropout

Probability of replacing tokens with UNK in training.

Default: 0.0.
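
The effect of input dropout can be sketched as follows. This is a simplified illustration, not the agent's implementation: the UNK string and the function name are placeholders chosen for the example.

```python
import random

UNK = "__unk__"  # hypothetical placeholder for the dictionary's unknown token


def input_dropout(tokens, p, rng=random):
    """Return a copy of tokens with each one replaced by UNK with probability p."""
    return [UNK if rng.random() < p else tok for tok in tokens]


tokens = ["how", "are", "you", "today"]
unchanged = input_dropout(tokens, 0.0)  # p = 0.0 never replaces a token
all_unk = input_dropout(tokens, 1.0)    # p = 1.0 replaces every token
```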

Torch Generator Agent

--beam-size

Beam size. If 1, uses greedy search.

Default: 1.

--beam-min-length

Minimum length of prediction to be generated by the beam search

Default: 1.

--beam-context-block-ngram

Size of n-grams to block in beam search from the context. A value <= 0 implies no blocking.

Default: -1.

--beam-block-ngram

Size of n-grams to block in beam search. A value <= 0 implies no blocking.

Default: -1.
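
The two n-gram blocking options can be pictured with the sketch below. It assumes that a candidate token is rejected when appending it would complete an n-gram already present in the hypothesis (for --beam-block-ngram) or in the context (for --beam-context-block-ngram). The helper names are hypothetical.

```python
def ngrams(tokens, n):
    """Return the set of all n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_blocked(hypothesis, next_token, n, source=None):
    """Check whether appending next_token completes an n-gram that must be blocked.

    source=None blocks repeats within the hypothesis itself;
    passing the context tokens blocks n-grams copied from the context.
    """
    if n <= 0:
        return False  # a value <= 0 implies no blocking
    candidate = hypothesis + [next_token]
    if len(candidate) < n:
        return False  # not enough tokens yet to form an n-gram
    new_ngram = tuple(candidate[-n:])
    seen = ngrams(hypothesis if source is None else source, n)
    return new_ngram in seen


hyp = ["i", "like", "cats", "and", "i"]
repeat = is_blocked(hyp, "like", 2)   # ("i", "like") already occurs in hyp
fresh = is_blocked(hyp, "love", 2)    # ("i", "love") is new
disabled = is_blocked(hyp, "like", -1)
from_context = is_blocked(["do"], "you", 2, source=["do", "you", "like"])
```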

--beam-length-penalty

Applies a length penalty. Set to 0 for no penalty.

Default: 0.65.
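
This page does not state the formula behind the length penalty. The sketch below assumes a GNMT-style normalization, in which a hypothesis's log-probability is divided by a factor that grows with its length. The constants and function names are assumptions made for illustration only.

```python
def length_penalty(length, alpha):
    """GNMT-style penalty factor (an assumed form); equals 1.0 when alpha is 0."""
    return ((5 + length) / 6) ** alpha


def normalized_score(log_prob, length, alpha=0.65):
    """Divide a (negative) log-probability by the length penalty."""
    return log_prob / length_penalty(length, alpha)


raw = normalized_score(-6.0, 10, alpha=0)  # alpha = 0 leaves the score untouched
penalized = normalized_score(-6.0, 10)     # longer hypotheses are penalized less
```

Because log-probabilities are negative, dividing by a factor greater than 1 raises the score, which offsets the bias of beam search toward short outputs.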

--inference

Generation algorithm

Choices: nucleus, topk, greedy, delayedbeam, beam.

Default: greedy.

--topk

K used in Top K sampling

Default: 10.

--topp

P used in nucleus sampling

Default: 0.9.
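
To make the roles of K and P concrete, here is a minimal sketch of top-k and nucleus filtering over a next-token distribution. It uses plain Python lists rather than tensors, and the function names are hypothetical. It illustrates the general techniques, not necessarily the agent's exact code.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def sample(filtered, rng=random):
    """Draw one token id from a filtered {token_id: probability} mapping."""
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]


probs = [0.5, 0.25, 0.125, 0.125]
top2 = top_k_filter(probs, 2)          # keeps token ids 0 and 1
nucleus = nucleus_filter(probs, 0.7)   # 0.5 + 0.25 >= 0.7, so ids 0 and 1
token = sample(top_k_filter(probs, 1)) # only token id 0 can be drawn
```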

--beam-delay

Used in delayedbeam search

Default: 30.

--beam-block-list-filename

Load a text file of hard blocks for beam search to never say.

--temperature

Temperature to apply during decoding.

Default: 1.0.
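
The sketch below assumes the usual convention, in which logits are divided by the temperature before the softmax. Values below 1.0 sharpen the distribution and values above 1.0 flatten it. The function name is hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits scaled by 1 / temperature (assumed convention)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
neutral = softmax_with_temperature(logits, 1.0)
sharp = softmax_with_temperature(logits, 0.5)  # more mass on the top token
flat = softmax_with_temperature(logits, 2.0)   # closer to uniform
```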

--compute-tokenized-bleu

If true, compute tokenized bleu scores

Default: False.

TorchAgent Arguments

-i, --interactive-mode

Whether to run in full interactive mode, that is, generating text or retrieving from a full set of candidates, as is necessary for actual full dialogue. During training or quick validation (e.g. PPL for generation, or ranking a few candidates for ranking models) you might want this turned off. Typically, scripts such as eval scripts set their preferred default behavior at the start.

Default: False.

-emb, --embedding-type

Choose between different strategies for initializing word embeddings. Default is random, but can also preinitialize from Glove or Fasttext. Preinitialized embeddings can also be fixed so they are not updated during training.

Choices: random, glove, glove-fixed, fasttext, fasttext-fixed, fasttext_cc, fasttext_cc-fixed.

Default: random.

-embp, --embedding-projection

If pretrained embeddings have a different dimensionality than your embedding size, strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append “-force” to your choice.

Default: random.

--fp16

Use fp16 computations.

Default: False.

--fp16-impl

Implementation of FP16 to use

Choices: apex, mem_efficient.

Default: apex.

-rc, --rank-candidates

Whether the model should parse candidates for ranking.

Default: False.

-tr, --truncate

Truncate input lengths to increase speed / use less memory.

Default: -1.

--text-truncate

Text input truncation length: if not specified, this will default to truncate

--label-truncate

Label truncation length: if not specified, this will default to truncate
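
The fallback between the three truncation options can be sketched as below. The helper names are hypothetical. Treating a non-positive limit as "no truncation", and the choice of which end of the sequence to keep, are assumptions made for the example.

```python
def resolve_truncate(truncate=-1, text_truncate=None, label_truncate=None):
    """Fall back to the shared truncate value when a specific limit is unset."""
    text_limit = truncate if text_truncate is None else text_truncate
    label_limit = truncate if label_truncate is None else label_truncate
    return text_limit, label_limit


def truncate_tokens(tokens, limit, keep_end=False):
    """Cut tokens to at most limit items; a non-positive limit means no truncation."""
    if limit is None or limit <= 0:
        return list(tokens)
    return list(tokens[-limit:]) if keep_end else list(tokens[:limit])


shared = resolve_truncate(truncate=64)                        # (64, 64)
override = resolve_truncate(truncate=64, text_truncate=128)   # (128, 64)
head = truncate_tokens([1, 2, 3, 4, 5], 3)                    # keeps the first 3
tail = truncate_tokens([1, 2, 3, 4, 5], 3, keep_end=True)     # keeps the last 3
```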

-histsz, --history-size

Number of past dialog utterances to remember.

Default: -1.

-pt, --person-tokens

Add person tokens to history. Adds __p1__ in front of input text and __p2__ in front of past labels when available, or past utterances generated by the model. These are added to the dictionary during initialization.

Default: False.

--split-lines

Split the dialogue history on newlines and save in separate vectors

Default: False.

--delimiter

Join history lines with this token, defaults to newline

Default: \n.
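
How the history options interact can be pictured with this simplified sketch. It assumes that a non-positive --history-size keeps every utterance and that person tokens are prepended with a space. The function and its signature are hypothetical.

```python
def build_history(utterances, history_size=-1, person_tokens=False, delimiter="\n"):
    """Join past utterances into one context string.

    utterances: list of (speaker, text) pairs, where speaker 1 is the input side
    and speaker 2 is the label or model side.
    """
    if history_size > 0:
        utterances = utterances[-history_size:]  # remember only the latest turns
    lines = []
    for speaker, text in utterances:
        if person_tokens:
            text = f"__p{speaker}__ {text}"
        lines.append(text)
    return delimiter.join(lines)


turns = [(1, "hi"), (2, "hello"), (1, "how are you")]
full = build_history(turns)
recent = build_history(turns, history_size=2, person_tokens=True, delimiter=" ")
```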

-gpu, --gpu

Which GPU to use

Default: -1.

--no-cuda

Disable GPUs even if available. Otherwise, GPUs are used if available on the device.

Default: False.

Optimizer Arguments

-opt, --optimizer

Choose between pytorch optimizers. Any member of torch.optim should be valid.

Choices: adadelta, adagrad, adam, adamw, sparseadam, adamax, asgd, sgd, rprop, rmsprop, optimizer, lbfgs, mem_eff_adam, adafactor.

Default: sgd.

-lr, --learningrate

Learning rate

Default: 1.

-clip, --gradient-clip

Gradient clipping using l2 norm

Default: 0.1.
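
Clipping by L2 norm rescales the whole gradient vector when its norm exceeds the threshold. The pure-Python sketch below shows the idea on a flat list of values. The function name is hypothetical, and treating a non-positive threshold as "no clipping" is an assumption.

```python
import math


def clip_by_l2_norm(grads, max_norm):
    """Rescale grads so that their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if max_norm <= 0 or norm <= max_norm:
        return list(grads)  # already within the limit, or clipping disabled
    scale = max_norm / norm
    return [g * scale for g in grads]


clipped = clip_by_l2_norm([3.0, 4.0], 0.1)      # norm 5.0 is scaled down to 0.1
untouched = clip_by_l2_norm([0.03, 0.04], 0.1)  # norm 0.05 is left as-is
```

The direction of the gradient is preserved; only its magnitude changes.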

--adafactor-eps

Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively

Default: 1e-30,1e-3. Recommended: 1e-30,1e-3.

-mom, --momentum

If applicable, momentum value for optimizer.

Default: 0.

--nesterov

If applicable, whether to use nesterov momentum.

Default: True.

-nu, --nus

If applicable, nu value(s) for optimizer. Can be a single value like 0.7 or a comma-separated tuple like 0.7,1.0.

Default: 0.7.

-beta, --betas

If applicable, beta value(s) for optimizer. Can be a single value like 0.9 or a comma-separated tuple like 0.9,0.999.

Default: 0.9,0.999.

-wdecay, --weight-decay

Weight decay on the weights.

BPEHelper Arguments

--bpe-vocab

Path to pre-trained tokenizer vocab

--bpe-merge

Path to pre-trained tokenizer merge

Learning Rate Scheduler

--lr-scheduler

Learning rate scheduler.

Choices: reduceonplateau, none, fixed, invsqrt, cosine, linear.

Default: reduceonplateau.

--lr-scheduler-patience

LR scheduler patience, in number of validation runs. If using the fixed scheduler, LR is decayed every <patience> validations.

Default: 3.

--lr-scheduler-decay

Decay factor for LR scheduler, or how much LR is multiplied by when it is lowered.

Default: 0.5.
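
For the fixed scheduler, the patience and decay options combine as in the sketch below, which follows the descriptions above: the learning rate is multiplied by the decay factor once every <patience> validation runs. The function name is hypothetical, and warmup is ignored.

```python
def fixed_schedule_lr(base_lr, num_validations, patience=3, decay=0.5):
    """Learning rate after num_validations validation runs under a fixed schedule."""
    return base_lr * decay ** (num_validations // patience)


start = fixed_schedule_lr(1.0, 0)        # no decay yet
first_drop = fixed_schedule_lr(1.0, 3)   # decayed once after 3 validations
second_drop = fixed_schedule_lr(1.0, 7)  # decayed twice after 6 validations
```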

--max-lr-steps

Number of train steps the scheduler should take after warmup. Training is terminated after this many steps. This should only be set for --lr-scheduler cosine or linear.

Default: -1.

--invsqrt-lr-decay-gamma

Constant used only to find the LR multiplier for the invsqrt scheduler. Must be set for --lr-scheduler invsqrt.

Default: -1.