The new Megatron-DeepSpeed release contains a variety of improvements and optimizations that enable pre-training Transformer-based architectures with significantly longer sequences than was previously possible.
Attention map memory optimization, where we first generate the attention mask in CPU memory and then move it into GPU memory to avoid out-of-memory errors when training with very long sequences.
Position embedding partitioning, where we split the weights of the position encoding across all GPUs when sequence parallelism is enabled, to further reduce the memory footprint.
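To make the position embedding partitioning concrete, here is a minimal sketch; it is not the Megatron-DeepSpeed implementation. It assumes the table is split evenly along the position (sequence) dimension, with one contiguous slice per GPU. The function names, the `hidden_size` value, and the reading of 420k as 420,000 tokens are hypothetical and chosen only for illustration; the 32 GPUs correspond to 4 nodes with 8 GPUs each.

```python
def position_shard(seq_len: int, world_size: int, rank: int) -> range:
    """Positions whose embedding rows are stored on `rank` (assumed even split)."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    if seq_len % world_size != 0:
        raise ValueError("this sketch assumes seq_len is divisible by world_size")
    rows_per_rank = seq_len // world_size
    start = rank * rows_per_rank
    return range(start, start + rows_per_rank)


def shard_bytes(seq_len: int, hidden_size: int, world_size: int,
                bytes_per_param: int = 2) -> int:
    """Bytes each GPU needs for its slice of the position embedding table."""
    rows = len(position_shard(seq_len, world_size, 0))
    return rows * hidden_size * bytes_per_param


if __name__ == "__main__":
    # Hypothetical values: 420,000 positions, 32 GPUs, hidden size 4096, 2-byte params.
    seq_len, world_size, hidden_size = 420_000, 32, 4096
    full = seq_len * hidden_size * 2
    per_gpu = shard_bytes(seq_len, hidden_size, world_size)
    print(f"rank 0 holds positions {position_shard(seq_len, world_size, 0)}")
    print(f"full table: {full / 2**20:.1f} MiB, per GPU: {per_gpu / 2**20:.1f} MiB")
```

Under these assumptions each GPU stores only `1 / world_size` of the table, so the saving grows with both the sequence length and the number of GPUs.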

    cd ./genslm/examples/long-sequences
    # create a new virtual environment
    mkdir -p "venvs/${MACHINE}/${CONDA_DATE}"
    python3 -m venv "venvs/${MACHINE}/${CONDA_DATE}" --system-site-packages
    source "venvs/${MACHINE}/${CONDA_DATE}/bin/activate"

Create a new folder (genslm/examples/long-sequences/deps/${MACHINE}) where we'll install dependencies locally:

    mkdir -p "deps/${MACHINE}"
    cd "deps/${MACHINE}"

Dependencies
We provide below the details needed to install each of the required dependencies.
These newly introduced optimizations, in combination with ZeRO-Offload, allow us to go even further.
By employing ZeRO-Offload, we are able to free up additional GPU memory, which can be used for even longer sequences.
Though work is still ongoing, this is a promising direction that will allow us to consider significantly larger genomes than previously possible.
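For readers unfamiliar with ZeRO-Offload, the sketch below shows roughly what enabling it looks like in a DeepSpeed configuration. The `zero_optimization`, `stage`, `offload_optimizer`, `device`, and `pin_memory` keys are standard DeepSpeed config fields, but the specific values (ZeRO stage 2, optimizer state offloaded to CPU) are assumptions made for illustration and are not taken from the experiments described here.

```python
import json

# Assumed settings for illustration only; the configuration actually used
# in these experiments is not given in the text.
ds_config = {
    "zero_optimization": {
        "stage": 2,  # assumption: ZeRO stage 2
        "offload_optimizer": {
            "device": "cpu",     # keep optimizer state in host memory
            "pin_memory": True,  # pinned host memory for faster transfers
        },
    },
}

# DeepSpeed reads its configuration as JSON, so serialize the dict.
ds_config_json = json.dumps(ds_config, indent=2)

if __name__ == "__main__":
    print(ds_config_json)
```

Moving optimizer state to host memory is what frees GPU memory for activations, which is the part of the footprint that grows with sequence length.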
We use Weights & Biases to track these experiments, and have aggregated our initial results in the W&B Report below.
We can evaluate the performance of our model by looking at two different metrics for throughput: samples_per_sec and TFLOPS.
Explicitly, we see that we are able to scale up to significantly longer sequences (420k / 128k ~ 3.3x) with only a minimal impact on throughput performance (81 / 105 ~ 77%)[^4].
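As a rough illustration of the two throughput metrics, the sketch below assumes that samples_per_sec is the global batch size divided by the wall-clock time of one training step, and that TFLOPS is reported per GPU. The number of floating-point operations per step depends on the model and is taken as an input here; all function names and example numbers are hypothetical.

```python
def samples_per_sec(global_batch_size: int, step_time_s: float) -> float:
    """Training samples processed per second of wall-clock time."""
    return global_batch_size / step_time_s


def tflops_per_gpu(flops_per_step: float, step_time_s: float, num_gpus: int) -> float:
    """Achieved TFLOPS per GPU, given the total FLOPs executed in one step."""
    return flops_per_step / (step_time_s * num_gpus * 1e12)


if __name__ == "__main__":
    # Hypothetical numbers, not measurements from the report.
    print(samples_per_sec(global_batch_size=8, step_time_s=2.0))
    print(tflops_per_gpu(flops_per_step=6.4e15, step_time_s=2.0, num_gpus=32))
```

Because samples_per_sec falls as each sample gets longer, TFLOPS is the more natural metric for comparing runs at different sequence lengths.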
Table 2: Impact on TFLOPS as a function of increasing sequence length. Table from: throughput/TFLOPS

| Name | Sequence Length (k) | (seq_len / min_seq_len) | TFLOPS | TFLOPS (% of peak) |
|--------|-----|---------|----------|----------|
| GPT25B | 420 | 3.28125 | 81.77225 | 77.867 |
| GPT25B | 400 | 3.125 | 90.62 | 86.297 |
| GPT25B | 360 | 2.8125 | 81.6325 | 77.7348 |
| GPT25B | 360 | 2.8125 | 82.6824 | 78.7346 |
| GPT25B | 192 | 1.5 | 115.8228 | 110.2927 |
| GPT25B | 128 | 1 | 106.672 | 101.5788 |
| GPT25B | 128 | 1 | 105.014 | 100.00 |
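The two derived columns of Table 2 appear to be computed relative to the 128k run that achieved 105.014 TFLOPS (the row listed at 100%). This is inferred from the numbers rather than stated in the report, so the sketch below should be read as a consistency check under that assumption; the constant and function names are hypothetical.

```python
# Assumed baseline: the 128k run listed at 100% in Table 2.
BASELINE_SEQ_LEN_K = 128
BASELINE_TFLOPS = 105.014

# (sequence length in k, measured TFLOPS), copied from Table 2.
ROWS = [
    (420, 81.77225),
    (400, 90.62),
    (360, 81.6325),
    (360, 82.6824),
    (192, 115.8228),
    (128, 106.672),
    (128, 105.014),
]


def derived(seq_len_k: float, tflops: float) -> tuple[float, float]:
    """Return (seq_len / min_seq_len, TFLOPS as a percentage of the baseline)."""
    return seq_len_k / BASELINE_SEQ_LEN_K, 100.0 * tflops / BASELINE_TFLOPS


if __name__ == "__main__":
    for seq_len_k, tflops in ROWS:
        ratio, pct = derived(seq_len_k, tflops)
        print(f"{seq_len_k:>4}k  ratio={ratio:.5f}  TFLOPS={tflops:.4f}  pct={pct:.3f}")
```

Read this way, values above 100% (for example the 192k run) simply mean that the run was faster than the baseline run, not that it exceeded a hardware limit.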
Figure 3: Weights & Biases Report
Footnotes
The described experiments were performed on 4 NVIDIA DGX A100-40GB nodes, all using TPSIZE=32[^tpsize], connected through 8 HDR InfiniBand (200Gb/s per HDR).