BERT Classifier

This directory contains an implementation of a classifier based on the pretrained language model BERT (Devlin et al.). It relies on the PyTorch implementation provided by Hugging Face.

Basic Examples

Train a classifier on the SNLI task:

python examples/ -m bert_classifier -t snli --classes 'entailment' 'contradiction' 'neutral' -mf /tmp/BERT_snli -bs 20

BertClassifierAgent Options

TorchAgent Arguments

-i, --interactive-mode

Whether to run in full interactive mode, which means generating text or retrieving from a full set of candidates; this is necessary to actually carry out full dialogue. During training or quick validation (e.g. PPL for generation, or ranking a few candidates for ranking models) you might want this set to off. Typically, scripts such as eval scripts set their preferred default behavior at the start.

Default: False.

-emb, --embedding-type

Choose between different strategies for initializing word embeddings. The default is random, but embeddings can also be preinitialized from GloVe or fastText. Preinitialized embeddings can also be fixed so they are not updated during training.

Choices: random, glove, glove-fixed, fasttext, fasttext-fixed, fasttext_cc, fasttext_cc-fixed.

Default: random.

-embp, --embedding-projection

Strategy for projecting pretrained embeddings to the correct size when their dimensionality differs from your embedding size. If the dimensions are the same, this is ignored unless you append “-force” to your choice.

Default: random.


Use fp16 computations.

Default: False.


Implementation of FP16 to use

Choices: apex, mem_efficient.

Default: apex.

-rc, --rank-candidates

Whether the model should parse candidates for ranking.

Default: False.

-tr, --truncate

Truncate input lengths to increase speed / use less memory.

Default: -1.


Text input truncation length. If not specified, this defaults to the value of truncate.


Label truncation length. If not specified, this defaults to the value of truncate.
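
The fallback between the general truncation length and the text and label lengths can be sketched as follows. This is a minimal illustration, not the agent's actual code: the function names are hypothetical, a negative limit is assumed to mean "no truncation" (matching the default of -1), and the sketch keeps the first tokens, whereas the agent may keep the other end.

```python
from typing import List, Optional


def resolve_limit(specific: Optional[int], truncate: int) -> int:
    """Return the specific limit when given, else the general truncate value."""
    return truncate if specific is None else specific


def truncate_tokens(tokens: List[int], limit: int) -> List[int]:
    """Cut a token list down to at most `limit` items.

    Assumption: a negative limit disables truncation.
    Assumption: the first `limit` tokens are kept.
    """
    if limit < 0:
        return list(tokens)
    return list(tokens[:limit])


# Text length not specified, so the general value (2) applies.
text_limit = resolve_limit(None, truncate=2)
short_text = truncate_tokens([11, 12, 13, 14], text_limit)
```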

-histsz, --history-size

Number of past dialog utterances to remember.

Default: -1.
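
A bounded history of this kind can be sketched with a fixed-length queue. The helper name below is hypothetical, and treating a non-positive size (such as the default -1) as "remember everything" is an assumption about the intended behavior.

```python
from collections import deque
from typing import Deque


def make_history(history_size: int) -> Deque[str]:
    """Create a container that keeps at most `history_size` utterances.

    Assumption: a non-positive size means no limit.
    """
    maxlen = history_size if history_size > 0 else None
    return deque(maxlen=maxlen)


history = make_history(2)
for utterance in ["hi", "how are you?", "fine, thanks"]:
    # Once the queue is full, the oldest utterance is dropped.
    history.append(utterance)
```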

-pt, --person-tokens

Add person tokens to history. Adds __p1__ in front of input text and __p2__ in front of past labels when available, or in front of past utterances generated by the model. These tokens are added to the dictionary during initialization.

Default: False.
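
As a rough illustration of the effect on the history, the sketch below prefixes each turn with the matching token. The helper name and the single space after the token are assumptions made for this example only.

```python
P1_TOKEN = "__p1__"  # marks input text
P2_TOKEN = "__p2__"  # marks past labels or past model utterances


def tag_turn(text: str, is_input: bool) -> str:
    """Prefix a turn with its person token (hypothetical helper)."""
    token = P1_TOKEN if is_input else P2_TOKEN
    return f"{token} {text}"


tagged = [
    tag_turn("A man is sleeping.", is_input=True),
    tag_turn("neutral", is_input=False),
]
```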


Split the dialogue history on newlines and save in separate vectors

Default: False.


Join history lines with this token. Defaults to newline.

Default: \n.

-gpu, --gpu

Which GPU to use

Default: -1.


Disable GPUs even if available. Otherwise, GPUs are used if available on the device.

Default: False.

Optimizer Arguments

-opt, --optimizer

Choose between PyTorch optimizers. Any member of torch.optim should be valid.

Choices: adadelta, adagrad, adam, adamw, sparseadam, adamax, asgd, sgd, rprop, rmsprop, optimizer, lbfgs, mem_eff_adam, adafactor.

Default: sgd.

-lr, --learningrate

Learning rate

Default: 1.

-clip, --gradient-clip

Gradient clipping using the L2 norm.

Default: 0.1.
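
To make the L2-norm clipping concrete, here is a dependency-free sketch that operates on a flat list of gradient values. It is only an illustration of the technique: the function name is hypothetical, and treating a non-positive threshold as "no clipping" is an assumption. In PyTorch the equivalent operation is provided by torch.nn.utils.clip_grad_norm_.

```python
import math
from typing import List


def clip_by_l2_norm(grads: List[float], max_norm: float) -> List[float]:
    """Rescale gradients so their overall L2 norm is at most `max_norm`.

    Assumption: max_norm <= 0 disables clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if max_norm <= 0 or total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]


# The input has norm 5.0, so it is scaled down to norm 0.1.
clipped = clip_by_l2_norm([3.0, 4.0], max_norm=0.1)
```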


Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively

Default: 1e-30,1e-3. Recommended: 1e-30,1e-3.

-mom, --momentum

If applicable, momentum value for optimizer.

Default: 0.


If applicable, whether to use nesterov momentum.

Default: True.

-nu, --nus

If applicable, nu value(s) for optimizer. Can use a single value like 0.7 or a comma-separated tuple like 0.7,1.0.

Default: 0.7.

-beta, --betas

If applicable, beta value(s) for optimizer. Can use a single value like 0.9 or a comma-separated tuple like 0.9,0.999.

Default: 0.9,0.999.
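
Both the nu and beta options accept either a single value or a comma-separated tuple. One way such a string could be turned into numbers is sketched below; the function name is hypothetical and the agent's own parsing may differ.

```python
from typing import Tuple


def parse_values(arg: str) -> Tuple[float, ...]:
    """Parse '0.9' or '0.9,0.999' into a tuple of floats."""
    return tuple(float(part) for part in arg.split(","))


betas = parse_values("0.9,0.999")
nus = parse_values("0.7")
```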

-wdecay, --weight-decay

Weight decay on the weights.

BPEHelper Arguments


Path to pre-trained tokenizer vocab


Path to pre-trained tokenizer merge

Learning Rate Scheduler


Learning rate scheduler.

Choices: reduceonplateau, none, fixed, invsqrt, cosine, linear.

Default: reduceonplateau.


LR scheduler patience. In number of validation runs. If using fixed scheduler, LR is decayed every <patience> validations.

Default: 3.


Decay factor for LR scheduler, or how much LR is multiplied by when it is lowered.

Default: 0.5.
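
For the fixed scheduler, the interaction between patience and decay factor can be sketched as a closed-form function of the number of validation runs. The function name is hypothetical, and the exact validation at which each decay takes effect is an assumption.

```python
def fixed_schedule_lr(base_lr: float, num_validations: int,
                      patience: int = 3, decay: float = 0.5) -> float:
    """LR after `num_validations` runs, decayed once every `patience` runs.

    Assumption: the first decay applies once `patience` validations are done.
    """
    num_decays = num_validations // patience
    return base_lr * decay ** num_decays


# With the defaults, the LR is halved after every third validation.
schedule = [fixed_schedule_lr(1.0, n) for n in range(7)]
```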


Number of train steps the scheduler should take after warmup. Training is terminated after this many steps. This should only be set for --lr-scheduler cosine or linear.

Default: -1.


Constant used only to find the lr multiplier for the invsqrt scheduler. Must be set for --lr-scheduler invsqrt.

Default: -1.

Torch Classifier Arguments


The name of the classes.


Weight of each of the classes for the softmax


During evaluation, threshold for choosing ref class; only applies to binary classification

Default: 0.5.
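
The thresholding rule for binary classification can be sketched as below. The function and argument names are hypothetical, and the use of a strict comparison is an assumption; the agent's handling of the boundary case may differ.

```python
def choose_class(ref_prob: float, ref_class: str, other_class: str,
                 threshold: float = 0.5) -> str:
    """Pick the ref class when its probability exceeds the threshold.

    Assumption: a probability exactly equal to the threshold
    selects the other class.
    """
    return ref_class if ref_prob > threshold else other_class


# Raising the threshold makes the ref class harder to predict.
lenient = choose_class(0.6, "entailment", "contradiction", threshold=0.5)
strict = choose_class(0.6, "entailment", "contradiction", threshold=0.8)
```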


Print probability of chosen class during interactive mode

Default: False.


Uses nn.DataParallel for multi GPU

Default: False.


Give prec/recall metrics for all classes

Default: True.


Loads the list of classes from a file


Ignore labels provided to model

BERT Classifier Arguments


Which part of the encoder to optimize (defaults to all encoder layers).

Choices: additional_layers, top_layer, top4_layers, all_encoder_layers, all.

Default: all_encoder_layers.


Add [CLS] token to text vec

Default: True.


Separate the last utterance into a different segment, with a [SEP] token in between.

Default: False.
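
The combined effect of the [CLS] and [SEP] options on the input sequence can be sketched as follows. This is an illustration on string tokens with hypothetical names; the agent works on token ids and may add further special tokens (for example a trailing [SEP]) that are not shown here.

```python
from typing import List


def build_input(history: List[str], last_utterance: List[str],
                add_cls: bool = True, sep_last_utt: bool = False) -> List[str]:
    """Assemble the token sequence fed to the encoder (sketch only)."""
    tokens: List[str] = []
    if add_cls:
        tokens.append("[CLS]")
    tokens.extend(history)
    if sep_last_utt:
        # Put the last utterance in its own segment.
        tokens.append("[SEP]")
    tokens.extend(last_utterance)
    return tokens


default_input = build_input(["hello", "there"], ["how", "are", "you"])
separated = build_input(["hello", "there"], ["how", "are", "you"],
                        sep_last_utt=True)
```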

BertDictionaryAgent Options

BPEHelper Arguments


Path to pre-trained tokenizer vocab


Path to pre-trained tokenizer merge