BERT Ranker
This directory contains several implementations of a ranker based on the pretrained language model BERT (Devlin et al., https://arxiv.org/abs/1810.04805). It relies on the PyTorch implementation provided by Hugging Face (https://github.com/huggingface/pytorch-pretrained-BERT).
Content
This directory contains 3 Torch Ranker Agents (see parlai/core/torch_ranker_agent.py). All of them are rankers: given a context, they try to predict the next utterance among a set of candidates.
BiEncoderRankerAgent associates one vector with the context and one vector with every candidate utterance, and is trained to maximize the dot product between the context and the correct utterance.
CrossEncoderRankerAgent concatenates the context with a candidate utterance and produces a score. This scales much worse than BiEncoderRankerAgent at inference time, since you cannot precompute a vector per candidate. However, it tends to give higher accuracy.
BothEncoderRankerAgent does both: it ranks candidates with a BiEncoder, then re-ranks the top N with a CrossEncoder, resulting in a system that is both scalable and precise.
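The difference between the two scoring strategies can be sketched as follows (a minimal NumPy illustration with random stand-in encodings, not the actual BERT encoders):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_cands = 8, 5

# Bi-encoder: context and every candidate are encoded *independently*,
# so the candidate vectors can be precomputed once and reused.
context_vec = rng.standard_normal(dim)
candidate_vecs = rng.standard_normal((n_cands, dim))  # precomputable offline
bi_scores = candidate_vecs @ context_vec              # one dot product per candidate
best_bi = int(np.argmax(bi_scores))

# Cross-encoder: context and candidate are encoded *jointly*, so every
# (context, candidate) pair needs its own forward pass (mocked here by
# a toy scoring function over the concatenated pair).
def cross_encode(context, candidate):
    pair = np.concatenate([context, candidate])  # stand-in for BERT([ctx; cand])
    return float(pair.sum())

cross_scores = [cross_encode(context_vec, c) for c in candidate_vecs]
best_cross = int(np.argmax(cross_scores))
```

This is why the bi-encoder scales better: with a fixed candidate set, inference is one context encoding plus cheap dot products, while the cross-encoder must run the full model once per candidate.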
Preliminary
In order to use these agents you need to install pytorch-pretrained-bert (https://github.com/huggingface/pytorch-pretrained-BERT). If you have not installed it, running the model will prompt you to run:
pip install pytorch-pretrained-bert
Basic Examples
Train a BiEncoder BERT model on ConvAI2:
parlai train_model -t convai2 -m bert_ranker/bi_encoder_ranker --batchsize 20 -vtim 30 --model-file /tmp/bert_biencoder_test --data-parallel True
Train a CrossEncoder BERT model on ConvAI2:
parlai train_model -t convai2 -m bert_ranker/cross_encoder_ranker --batchsize 2 -vtim 30 --model-file /tmp/bert_crossencoder_test --data-parallel True
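Once training finishes, the saved model file can be evaluated or used interactively with the standard ParlAI commands (shown here for the bi-encoder; adjust the task and model file path to your setup):

```shell
# Evaluate the trained bi-encoder on the ConvAI2 validation set
parlai eval_model -t convai2 -mf /tmp/bert_biencoder_test

# Chat with the trained model directly
parlai interactive -mf /tmp/bert_biencoder_test
```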
BiEncoderRankerAgent Options
TorchAgent Arguments
| Argument | Description |
|---|---|
| `--interactive-mode` | Whether in full interactive mode or not, which means generating text or retrieving from a full set of candidates, which is necessary to actually do full dialogue. However, during training or quick validation (e.g. PPL for generation or ranking a few candidates for ranking models) you might want this set to off. Typically, scripts can set their preferred default behavior at the start, e.g. eval scripts. |
| `--embedding-type` | Choose between different strategies for initializing word embeddings. Default is random, but can also preinitialize from GloVe or fastText. Preinitialized embeddings can also be fixed so they are not updated during training. |
| `--embedding-projection` | If pretrained embeddings have a different dimensionality than your embedding size, strategy for projecting to the correct size. If the dimensions are the same, this is ignored unless you append "-force" to your choice. |
| `--fp16` | Use fp16 computations. |
| `--fp16-impl` | Implementation of FP16 to use. |
| `--rank-candidates` | Whether the model should parse candidates for ranking. |
| `--truncate` | Truncate input lengths to increase speed / use less memory. |
| `--text-truncate` | Text input truncation length: if not specified, this will default to `--truncate`. |
| `--label-truncate` | Label truncation length: if not specified, this will default to `--truncate`. |
| `--history-reverse` | Reverse the history. |
| `--history-size` | Number of past dialog utterances to remember. |
| `--person-tokens` | Add person tokens to history. Adds p1 in front of input text and p2 in front of past labels when available or past utterances generated by the model. These are added to the dictionary during initialization. |
| `--split-lines` | Split the dialogue history on newlines and save in separate vectors. |
| `--delimiter` | Join history lines with this token; defaults to newline. |
| `--special-tok-lst` | Comma-separated list of special tokens. In case of ambiguous parses from special tokens, the ordering provided in this arg sets precedence. |
| `--gpu` | Which GPU to use. |
| `--no-cuda` | Disable GPUs even if available. Otherwise, GPUs will be used if available on the device. |
Optimizer Arguments
| Argument | Description |
|---|---|
| `--optimizer` | Optimizer choice. Possible values: adadelta, adagrad, adam, adamw, sparseadam, adamax, asgd, sgd, radam, rprop, rmsprop, optimizer, nadam, lbfgs, mem_eff_adam, adafactor. |
| `--learningrate` | Learning rate. |
| `--gradient-clip` | Gradient clipping using l2 norm. |
| `--adafactor-eps` | Epsilon values for adafactor optimizer: regularization constants for square gradient and parameter scale respectively. |
| `--momentum` | If applicable, momentum value for optimizer. |
| `--nesterov` | If applicable, whether to use nesterov momentum. |
| `--nus` | If applicable, nu value(s) for optimizer. Can use a single value like 0.7 or a comma-separated tuple like 0.7,1.0. |
| `--betas` | If applicable, beta value(s) for optimizer. Can use a single value like 0.9 or a comma-separated tuple like 0.9,0.999. |
| `--weight-decay` | Weight decay on the weights. |
BPEHelper Arguments
| Argument | Description |
|---|---|
| `--bpe-vocab` | Path to pre-trained tokenizer vocab. |
| `--bpe-merge` | Path to pre-trained tokenizer merge. |
| `--bpe-dropout` | Use BPE dropout during training. |
Learning Rate Scheduler
| Argument | Description |
|---|---|
| `--lr-scheduler` | Learning rate scheduler. |
| `--lr-scheduler-patience` | LR scheduler patience, in number of validation runs. If using the fixed scheduler, LR is decayed every `--lr-scheduler-patience` validation runs. |
| `--lr-scheduler-decay` | Decay factor for LR scheduler, i.e. how much LR is multiplied by when it is lowered. |
| `--invsqrt-lr-decay-gamma` | Constant used only to find the LR multiplier for the invsqrt scheduler. Must be set for `--lr-scheduler invsqrt`. |
TorchRankerAgent
| Argument | Description |
|---|---|
| `--candidates` | The source of candidates during training (see `TorchRankerAgent._build_candidates()` for details). |
| `--eval-candidates` | The source of candidates during evaluation (defaults to the same value as `--candidates` if no flag is given). |
| `--interactive-candidates` | The source of candidates during interactive mode. Since batchsize == 1 in interactive mode, batch candidates cannot be used. |
| `--repeat-blocking-heuristic` | Block repeating previous utterances. Helpful for many models that score repeats highly, so switched on by default. |
| `--fixed-candidates-path` | A text file of fixed candidates to use for all examples, one candidate per line. |
| `--fixed-candidate-vecs` | One of "reuse", "replace", or a path to a file with vectors corresponding to the candidates at `--fixed-candidates-path`. The default path is /path/to/model-file.<cands_name>, where <cands_name> is the name of the file (not the full path) passed by the flag `--fixed-candidates-path`. By default, this file is created once and reused. To replace it, use the "replace" option. |
| `--encode-candidate-vecs` | Cache and save the encoding of the candidate vecs. Useful when interacting with the model in real time, or when evaluating on a fixed candidate set whose encoding is independent of the input. |
| `--init-model` | Initialize model with weights from this file. |
| `--train-predict` | Get predictions and calculate mean rank during the train step. Turning this on may slow down training. |
| `--cap-num-predictions` | Limit on the number of predictions in `output.text_candidates`. |
| `--ignore-bad-candidates` | Ignore examples for which the label is not present in the label candidates. The default behavior results in a RuntimeError. |
| `--rank-top-k` | If k > 0, ranking returns the top k results; otherwise, every candidate is sorted according to the ranking. |
| `--inference` | Final response output algorithm. |
| `--topk` | K used in top-k sampling inference, when selected. |
| `--return-cand-scores` | Return sorted candidate scores from `eval_step`. |
Bert Ranker Arguments
| Argument | Description |
|---|---|
| `--add-transformer-layer` | Also add a transformer layer on top of BERT. |
| `--pull-from-layer` | Which layer of BERT to use (default -1, i.e. the last one). |
| `--out-dim` | For the biencoder, the output dimension. |
| `--topn` | For the biencoder: how many elements to return. |
| `--data-parallel` | Use the model in data-parallel mode; requires multiple GPUs. Note: this is incompatible with distributed training. |
| `--bert-aggregation` | How to aggregate the list of BERT outputs into a single vector. |
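The aggregation option above decides how the per-token BERT outputs are collapsed into one vector. The two common strategies look roughly like this (an illustrative NumPy sketch with random data, not ParlAI's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden = 6, 4
# BERT produces one hidden vector per input token
token_outputs = rng.standard_normal((seq_len, hidden))

# "first" aggregation: keep only the first ([CLS]) token's output
first_vec = token_outputs[0]

# "mean" aggregation: average the outputs over all token positions
mean_vec = token_outputs.mean(axis=0)
```

Either way, the result is a single fixed-size vector that the biencoder can compare against candidate vectors by dot product.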
CrossEncoderRankerAgent Options
CrossEncoderRankerAgent accepts the same arguments as BiEncoderRankerAgent, listed above.