HRED Agent

The HRED agent uses a traditional LSTM encoder-decoder and adds a context LSTM that encodes the dialogue history.

The following papers outline more information regarding this model:

An important difference is that the model currently only supports LSTM RNN units, rather than the GRU units used in the papers. It also supports the decoding strategies in TorchGeneratorModel (such as beam search and greedy).

Example script to run on dailydialog:

parlai train_model -t dailydialog -mf /tmp/dailydialog_hred -bs 4 -eps 5 --model hred

HredAgent Options

optional arguments




Set to use CUDA kernel for beam search ngram blocking

Default: False.


Return the topk logits in the act message, if verbose mode is set.

Default: -1.

HRED Arguments



--hiddensize, --hs

Size of the hidden layers

Default: 128.

--embeddingsize, --esz

Size of the token embeddings

Default: 128.

--numlayers, --nl

Number of hidden layers

Default: 2.

--dropout, --dr

Dropout rate

Default: 0.1.

--lookuptable, --lt

The encoder, decoder, and output modules can share weights, or not. Unique has independent embeddings for each. Enc_dec shares the embedding for the encoder and decoder. Dec_out shares decoder embedding and output weights. All shares all three weights.

Choices: unique, enc_dec, dec_out, all.

Default: unique.
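
The four sharing modes can be pictured with plain Python objects standing in for weight matrices. This is only an illustrative sketch: `build_tables` and the dictionary keys are hypothetical names, not ParlAI's implementation.

```python
def build_tables(lookuptable):
    """Return stand-in weight tables for the encoder, decoder and output layer.

    Each list stands in for a weight matrix. Two entries that are the same
    object represent shared (tied) weights. Hypothetical helper.
    """
    if lookuptable not in ("unique", "enc_dec", "dec_out", "all"):
        raise ValueError(f"unknown lookuptable mode: {lookuptable}")
    encoder, decoder, output = [0.0], [0.0], [0.0]
    if lookuptable in ("enc_dec", "all"):
        decoder = encoder  # encoder and decoder share one embedding
    if lookuptable in ("dec_out", "all"):
        output = decoder   # output layer reuses the decoder embedding
    return {"encoder": encoder, "decoder": decoder, "output": output}
```

With `all`, the three entries end up as one object, so an update to any of them is seen by the other two.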

--input-dropout, --idr

Probability of replacing tokens with UNK in training.

Default: 0.0.
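
As a rough illustration of input dropout, the sketch below replaces each token with an unknown-word token with probability `p`. The function name, the `__unk__` string and the `rng` parameter are assumptions made for this example, and the real agent operates on token index tensors rather than strings.

```python
import random

def input_dropout(tokens, p, unk="__unk__", rng=None):
    """Replace each token with `unk` independently with probability p."""
    rng = rng or random.Random()
    # rng.random() is in [0, 1), so p=0.0 never replaces and p=1.0 always does.
    return [unk if rng.random() < p else tok for tok in tokens]
```

For example, `input_dropout(["hello", "there"], 0.0)` returns the input unchanged.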

Torch Generator Agent




Beam size; if 1, greedy search is used.

Default: 1.


Minimum length of prediction to be generated by the beam search

Default: 1.


Size of n-grams to block in beam search from the context. A value <= 0 implies no blocking.

Default: -1.


Size of n-grams to block in beam search. A value <= 0 implies no blocking.

Default: -1.
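
One way to picture n-gram blocking is to ask which next tokens would complete an n-gram the hypothesis already contains. The sketch below is a simplified illustration with a hypothetical function name, not the routine ParlAI uses internally.

```python
def banned_next_tokens(hypothesis, n):
    """Return the tokens that would repeat an n-gram already in `hypothesis`."""
    if n <= 0 or len(hypothesis) < n:
        return set()  # blocking disabled, or no complete n-gram exists yet
    # The last n-1 tokens are the prefix that the next token would extend.
    prefix = tuple(hypothesis[len(hypothesis) - (n - 1):])
    banned = set()
    for i in range(len(hypothesis) - n + 1):
        ngram = tuple(hypothesis[i:i + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned
```

A search procedure could set the score of every banned token to negative infinity before picking the next step.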


Block n-grams from the full history context. Specify False to block only up to m tokens in the past, where m is the truncation parameter for the agent.

Default: True.


Applies a length penalty. Set to 0 for no penalty.

Default: 0.65.
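
The sketch below shows how a length penalty of this kind can rescore a finished hypothesis. It assumes a GNMT-style formula in which the summed log-probability is divided by ((1 + length) / 6) raised to the penalty. Both the formula and the function name are assumptions made for illustration.

```python
import math

def rescore(log_prob_sum, length, length_penalty=0.65):
    """Divide a hypothesis score by an assumed GNMT-style length penalty."""
    # An exponent of 0 makes the divisor 1, i.e. no penalty is applied.
    penalty = math.pow((1 + length) / 6, length_penalty)
    return log_prob_sum / penalty
```

Because log-probabilities are negative, dividing by a larger penalty raises the score of longer hypotheses, which offsets the bias of beam search toward short outputs.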


Generation algorithm

Choices: beam, nucleus, delayedbeam, greedy, delayednucleusbeam, topk, factual_nucleus.

Default: greedy.


K used in Top K sampling

Default: 10.
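
Top-k sampling keeps only the k most probable tokens and renormalizes before sampling. The following is a minimal pure-Python sketch over a list of probabilities, with hypothetical function names. A real implementation works on batched tensors.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their mass."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}

def sample_top_k(probs, k, rng=None):
    """Draw one token id from the top-k filtered distribution."""
    rng = rng or random.Random()
    filtered = top_k_filter(probs, k)
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]
```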


P used in nucleus sampling

Default: 0.9.
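
Nucleus sampling keeps the smallest set of most probable tokens whose cumulative probability reaches p, then samples from that set. The sketch below shows only the filtering step, using a hypothetical function name.

```python
def nucleus_filter(probs, p):
    """Keep the most probable token ids until their cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / cumulative for i in kept}
```

Unlike top-k, the number of candidates adapts to the distribution: a peaked distribution keeps few tokens, a flat one keeps many.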


Used in delayedbeam search

Default: 30.


Decay factor in factual nucleus sampling

Default: 0.9.


Lower bound in factual nucleus sampling

Default: 0.3.


Whether to reset p value in factual nucleus at full stops

Default: True.


Load a text file of hard blocks for beam search to never say.


Temperature to add during decoding

Default: 1.0.
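
Temperature is commonly applied by dividing the logits by the temperature before the softmax. The sketch below assumes that convention and uses a hypothetical function name.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature (assumed convention)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Values below 1.0 sharpen the distribution toward the most likely token, while values above 1.0 flatten it.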


If true, compute tokenized BLEU scores.

Default: False.

TorchAgent Arguments



--interactive-mode, --i

Whether in full interactive mode or not. Full interactive mode means generating text or retrieving from a full set of candidates, which is necessary to actually do full dialogue. However, during training or quick validation (e.g. PPL for generation, or ranking a few candidates for ranking models) you might want this turned off. Typically, scripts set their preferred default behavior at the start, e.g. eval scripts.

Default: False.

--embedding-type, --emb

Choose between different strategies for initializing word embeddings. Default is random, but can also preinitialize from Glove or Fasttext. Preinitialized embeddings can also be fixed so they are not updated during training.

Choices: random, glove, glove-fixed, fasttext, fasttext-fixed, fasttext_cc, fasttext_cc-fixed.

Default: random.

--embedding-projection, --embp

If pretrained embeddings have a different dimensionality than your embedding size, strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append “-force” to your choice.

Default: random.


Use fp16 computations.

Default: False.


Implementation of FP16 to use

Choices: safe, mem_efficient.

Default: safe.

--rank-candidates, --rc

Whether the model should parse candidates for ranking.

Default: False.

--truncate, --tr

Truncate input lengths to increase speed / use less memory.

Default: -1.


Text input truncation length: if not specified, this will default to truncate


Label truncation length: if not specified, this will default to truncate


Reverse the history

Default: False.

--history-size, --histsz

Number of past dialog utterances to remember.

Default: -1.

--person-tokens, --pt

Add person tokens to history. Adds p1 in front of input text and p2 in front of past labels when available, or in front of past utterances generated by the model. These are added to the dictionary during initialization.

Default: False.


Split the dialogue history on newlines and save in separate vectors

Default: False.


Join history lines with this token, defaults to newline

Default: \n.


Comma separated list of special tokens. In case of ambiguous parses from special tokens, the ordering provided in this arg sets precedence.

-gpu, --gpu

Which GPU to use

Default: -1.


Disable GPUs even if available. Otherwise, GPUs are used if available on the device.

Default: False.

Optimizer Arguments



--optimizer, --opt

Optimizer choice.

Choices: adadelta, adagrad, adam, adamw, sparseadam, adamax, asgd, sgd, radam, rprop, rmsprop, optimizer, nadam, lbfgs, mem_eff_adam, adafactor.

Default: sgd.

--learningrate, --lr

Learning rate

Default: 1.

--gradient-clip, --clip

Gradient clipping using the L2 norm.

Default: 0.1.
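
Clipping by L2 norm rescales the whole gradient vector when its norm exceeds the threshold, which preserves its direction. The sketch below works on a flat list of floats and uses a hypothetical function name. In practice this is done over all parameter tensors at once.

```python
import math

def clip_by_l2_norm(grads, max_norm):
    """Scale grads so that their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)  # already within the limit
    scale = max_norm / norm
    return [g * scale for g in grads]
```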


Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively

Default: 1e-30,1e-3. Recommended: 1e-30,1e-3.

--momentum, --mom

If applicable, momentum value for optimizer.

Default: 0.


If applicable, whether to use nesterov momentum.

Default: True.

--nus, --nu

If applicable, nu value(s) for optimizer. Can use a single value like 0.7 or a comma-separated tuple like 0.7,1.0.

Default: 0.7.

--betas, --beta

If applicable, beta value(s) for optimizer. Can use a single value like 0.9 or a comma-separated tuple like 0.9,0.999.

Default: 0.9,0.999.

--weight-decay, --wdecay

Weight decay on the weights.

BPEHelper Arguments




Path to pre-trained tokenizer vocab


Path to pre-trained tokenizer merge


Use BPE dropout during training.

Learning Rate Scheduler




Learning rate scheduler.

Choices: reduceonplateau, none, fixed, invsqrt, cosine, linear.

Default: reduceonplateau.


LR scheduler patience, in number of validation runs. If using the fixed scheduler, the LR is decayed every <patience> validations.

Default: 3.


Decay factor for LR scheduler, or how much LR is multiplied by when it is lowered.

Default: 0.5.
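
To show how patience and the decay factor interact in a reduce-on-plateau schedule, here is a minimal sketch. The class name is hypothetical, and it assumes the learning rate is lowered once the validation loss has failed to improve for more than `patience` consecutive validation runs.

```python
class PlateauSchedule:
    """Toy reduce-on-plateau schedule driven by validation loss."""

    def __init__(self, lr, patience=3, decay=0.5):
        self.lr = lr
        self.patience = patience
        self.decay = decay
        self.best = float("inf")
        self.bad_runs = 0

    def step(self, valid_loss):
        """Record one validation run and return the (possibly lowered) LR."""
        if valid_loss < self.best:
            self.best = valid_loss
            self.bad_runs = 0
        else:
            self.bad_runs += 1
        if self.bad_runs > self.patience:
            self.lr *= self.decay  # multiply the LR by the decay factor
            self.bad_runs = 0
        return self.lr
```

With the defaults shown, a loss that stays flat halves the learning rate on the fourth validation run that brings no improvement.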


Constant used only to find the lr multiplier for the invsqrt scheduler. Must be set for --lr-scheduler invsqrt.

Default: -1.
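
The role of this constant can be sketched as follows. The formula is an assumption about how an inverse square root schedule typically uses such a constant (here called `gamma`), and the function name is hypothetical.

```python
import math

def invsqrt_multiplier(step, gamma):
    """LR multiplier that starts at 1 and decays as 1 / sqrt(gamma + step)."""
    decay_factor = math.sqrt(max(1, gamma))
    return decay_factor / math.sqrt(max(1, gamma + step))
```

Under this assumption a larger `gamma` delays the decay: with `gamma=100`, the multiplier only drops to one half after 300 steps.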