# Speeding up training

Author: Stephen Roller

This tutorial walks you through a few ways to massively speed up your training runs in ParlAI. These tricks tend to work best with generative models, but some can also be used with others.

A summary of the speedups is in this table:

| Method | Train | Eval | Total | Speedup |
| --- | --- | --- | --- | --- |
| Baseline | 504s | 48s | 552s | 1.0x |
| Skip generation | 504s | 16s | 520s | 1.1x |
| Dynamic batching | 254s | 11s | 265s | 2.1x |
| FP16 | 197s | 8s | 205s | 2.7x |
| Larger batchsize (FP16) | 151s | 7s | 158s | 3.5x |
| Using 4 GPUs | 47s | 3s | 50s | 11.0x |

## Setting a baseline

We’ll start with an example training command, which trains a transformer/generator on ConvAI2 for one epoch, with a batchsize of 64 and a roughly 20M-parameter model. We’ll train with the Adam optimizer at a learning rate of 1e-3, and build the dictionary ahead of time to ensure it stays identical across runs.

mkdir fastmodels
parlai build_dict -t convai2 -df dictfile
parlai train -df dictfile -m transformer/generator -t convai2 -eps 1.0 -bs 64 \
--embedding-size 250 --ffn-size 1000 --n-layers 8 -opt adam -lr 1e-3


On my computer, using a 16GB V100, this takes about 550s: 500s is spent in training, and another 50s in evaluation.

We will modify this training command throughout the tutorial, but you are free to mix and match options.

## Skip generation

You may notice your model is taking a long time to evaluate, even though the evaluation dataset is much smaller. This is because we are doing full generation through the model, including beam search. You can get a massive speedup by turning off this generation step, with --skip-generation true.

parlai train -df dictfile -m transformer/generator -t convai2 -eps 1.0 -bs 64 \
--embedding-size 250 --ffn-size 1000 --n-layers 8 -opt adam -lr 1e-3 \
--skip-generation true


This brings evaluation time down to 16s, but doesn’t affect training time. Just remember to turn --skip-generation back off whenever you want statistics like BLEU or F1. Also, --skip-generation is only available in generative models; ranking models have analogous options, like -cands batch.

Warning

Models trained with --skip-generation true will remember this option when loaded back up. You will need to manually set it back to false whenever you want evaluations with generations.
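For example, you can override the saved flag at evaluation time. The model path here is hypothetical; substitute your own --model-file:

```shell
# Re-enable generation for a model that was trained with --skip-generation true.
# "mymodel" is a placeholder path, not one created earlier in this tutorial.
parlai eval_model -mf mymodel -t convai2 --skip-generation false
```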

## Dynamic batching

Dynamic batching groups conversations of similar length into the same batch, minimizing the amount of wasted padding in the tensors. Furthermore, dynamic batching increases the batch size to use the maximum amount of memory available on your GPU.
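The grouping idea can be sketched in a few lines of plain Python. This is only an illustration of the concept, not ParlAI’s actual implementation; the helper name and toy data are made up:

```python
def bucket_batches(examples, batchsize):
    """Toy dynamic batching: sort examples by length, then slice
    consecutive chunks so each batch holds similarly-sized examples,
    minimizing the padding needed to square off the tensor."""
    ordered = sorted(examples, key=len)
    return [ordered[i:i + batchsize] for i in range(0, len(ordered), batchsize)]

# Toy "conversations" of varying token length.
convs = [[0] * n for n in (3, 17, 4, 16, 5, 18)]
batches = bucket_batches(convs, batchsize=3)
# Each batch now needs very little padding: lengths (3, 4, 5) and (16, 17, 18).
print([[len(c) for c in b] for b in batches])  # [[3, 4, 5], [16, 17, 18]]
```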

Add --dynamic-batching full (-dynb full) to your training command. Note that dynamic batching requires a --truncate option to be set; we’ll use 256, since that is longer than almost all of the conversations in our data. We also raise --eval-batchsize, since evaluation needs no backward pass and can therefore fit larger batches in memory.

parlai train -df dictfile -m transformer/generator -t convai2 -eps 1.0 -bs 64 \
--embedding-size 250 --ffn-size 1000 --n-layers 8 -opt adam -lr 1e-3 \
--skip-generation true \
--eval-batchsize 128 \
--dynamic-batching full --truncate 256


You should notice that your memory utilization is much higher in this mode. This is actually an advantage, since it makes it easier to find your maximum batchsize. To get the full benefit of dynamic batching, use the largest batchsize you can.

Overall, this results in a large speedup of about 2x, bringing training down to 254s and evaluation to 11s.

Warning

You may find perplexity is quite a bit worse than without dynamic batching. This is because we use larger batches, and therefore take fewer optimizer steps. You can usually increase your learning rate substantially when using dynamic batching to compensate.
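A common heuristic for this adjustment (an assumption of this sketch, not a ParlAI-specific rule) is to scale the learning rate linearly with the effective batch size:

```python
def scale_lr(base_lr, base_batchsize, new_batchsize):
    """Linear LR scaling heuristic: larger batches take fewer steps,
    so scale the learning rate up proportionally."""
    return base_lr * new_batchsize / base_batchsize

# If dynamic batching roughly doubles the average batch size:
print(scale_lr(1e-3, 64, 128))  # 0.002
```

Treat the result as a starting point for tuning, not a guarantee; very large scalings often also need warmup.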

## FP16

If you have access to an NVIDIA GPU with tensor cores (V100, RTX 2080, etc.), then you can get large speedups by switching on the option --fp16 true. The default implementation of FP16 requires that you install APEX, but you can use a simplified version (which doesn’t depend on APEX) with --fp16 true --fp16-impl mem_efficient.

Note that in order to get the full benefit of FP16, all of our hidden dimensions must be multiples of 8; otherwise the hardware cannot use its tensor cores. We’ll slightly adjust the network parameters (--embedding-size and --ffn-size) to conform to this.
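Rounding a dimension up to the nearest multiple of 8 is simple arithmetic; this generic helper is not part of ParlAI:

```python
def round_up_to_multiple(n, base=8):
    """Round n up to the nearest multiple of base, giving
    tensor-core-friendly dimensions for FP16."""
    return ((n + base - 1) // base) * base

# The baseline sizes from this tutorial:
print(round_up_to_multiple(250))   # 256 (the new --embedding-size)
print(round_up_to_multiple(1000))  # 1000 is already a multiple of 8,
                                   # though the tutorial bumps it to 1024 anyway
```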

parlai train -df dictfile -m transformer/generator -t convai2 -eps 1.0 -bs 64 \
--embedding-size 256 --ffn-size 1024 --n-layers 8 -opt adam -lr 1e-3 \
--skip-generation true \
--eval-batchsize 128 \
--dynamic-batching full --truncate 256 \
--fp16 true --fp16-impl mem_efficient


Notice also that FP16 significantly lowers the memory footprint of your model and activations (by almost a factor of 2). This means you can usually get away with significantly increasing the batchsize (and eval batchsize).

parlai train -df dictfile -m transformer/generator -t convai2 -eps 1.0 \
--embedding-size 256 --ffn-size 1024 --n-layers 8 -opt adam -lr 1e-3 \
--skip-generation true \
--eval-batchsize 256 \
--dynamic-batching full --truncate 256 \
--fp16 true -bs 128


In this example, we see about a 25% speedup from FP16 alone. Generally you can expect larger speedups with larger models: models of >300M parameters often see ~50%. Combined with the increased batch size, the total speedup reaches 3.5x over the baseline.

Warning

Without a GPU with tensor cores, you may find that FP16 actually slows your program down. You may still benefit from the reduced memory usage, though.

## Use multiple GPUs

If you have multiple GPUs, you can utilize them by switching from train to multiprocessing_train. With 4 GPUs, you’ll find training is roughly 3.5x faster. The training arguments are otherwise left the same.

parlai multiprocessing_train \
-df dictfile -m transformer/generator -t convai2 -eps 1.0 -bs 64 \
--embedding-size 256 --ffn-size 1024 --n-layers 8 -opt adam -lr 1e-3 \
--skip-generation true \
--eval-batchsize 128 \
--dynamic-batching full --truncate 256 \
--fp16 true


Note that we leave the batchsize the same: it is the batchsize PER GPU. On my system with 4 GPUs, this makes things a little under 4x faster.

Similarly, there is a multiprocessing_eval command for using multiple GPUs in evaluation.
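For example, assuming a model saved at a hypothetical path mymodel:

```shell
# Evaluate across all available GPUs; "mymodel" is a placeholder path.
parlai multiprocessing_eval -mf mymodel -t convai2 --skip-generation true
```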

Danger

This should never be mixed with options like --model-parallel true or --data-parallel true, as those options distribute work across GPUs without multiprocessing. The BlenderBot3B and BlenderBot9B models both use those options, so take care when combining them with multiprocessing_train.