Parallel Training Methods

Sam Foreman

@ Intro to AI-driven Science on Supercomputers

ALCF

2024-11-05

Why Distributed Training?

N workers, each processing a unique batch of data:
- [micro_batch_size = 1] $\times$ [N GPUs] $\rightarrow$ [global_batch_size = N]
Improved gradient estimators
- Smoother loss landscape
- Fewer iterations needed for the same number of epochs
  - common to scale the learning rate: lr *= sqrt(N)
See: Large Batch Training of Convolutional Networks
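
As a rough illustration of the batch-size arithmetic and the square-root learning-rate heuristic above, here is a minimal sketch. The function name `scale_for_data_parallel` and the base learning rate are hypothetical and chosen only for this example; the right scaling rule for a given model is an empirical question.

```python
import math


def scale_for_data_parallel(base_lr, micro_batch_size, num_workers):
    """Return (global_batch_size, scaled_lr) for N data-parallel workers.

    Assumes every worker sees a distinct micro-batch and that the
    learning rate is scaled with the sqrt(N) heuristic.
    """
    global_batch_size = micro_batch_size * num_workers
    scaled_lr = base_lr * math.sqrt(num_workers)
    return global_batch_size, scaled_lr


# Hypothetical setup: 16 GPUs, one sample per GPU, base lr of 1e-3.
global_batch_size, lr = scale_for_data_parallel(
    base_lr=1e-3, micro_batch_size=1, num_workers=16
)
print(global_batch_size, lr)
```

With `micro_batch_size = 1` the global batch size equals the number of workers, matching the bracketed expression above.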

Going beyond Data Parallelism: DeepSpeed + `ZeRO`

Depending on the ZeRO stage (1, 2, 3), we can partition (shard) across the data-parallel workers:
1. Stage 1: optimizer states $\left(P_{\mathrm{os}}\right)$
2. Stage 2: gradients + opt. states $\left(P_{\mathrm{os}+\mathrm{g}}\right)$
3. Stage 3: model params + grads + opt. states $\left(P_{\mathrm{os}+\mathrm{g}+\mathrm{p}}\right)$

Figure 14: DeepSpeed + ZeRO
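
To get a feel for what each stage saves, the sketch below estimates per-GPU memory for the model states only. It assumes the usual mixed-precision Adam accounting (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and K = 12 for the fp32 optimizer states); the function name `zero_memory_gb` is hypothetical, and activations, buffers and fragmentation are ignored.

```python
def zero_memory_gb(num_params, num_gpus, stage, k=12):
    """Approximate per-GPU model-state memory in GB for a ZeRO stage.

    Assumption: mixed-precision Adam, i.e. 2 bytes/param for weights,
    2 bytes/param for gradients, and k bytes/param of optimizer state.
    stage=0 means plain data parallelism (everything replicated).
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    params = 2.0 * num_params
    grads = 2.0 * num_params
    opt_states = float(k) * num_params
    if stage >= 1:  # P_os: shard optimizer states
        opt_states /= num_gpus
    if stage >= 2:  # P_os+g: also shard gradients
        grads /= num_gpus
    if stage >= 3:  # P_os+g+p: also shard parameters
        params /= num_gpus
    return (params + grads + opt_states) / 1e9


# Illustrative numbers: a 7.5B-parameter model on 64 GPUs.
for stage in (0, 1, 2, 3):
    print(stage, round(zero_memory_gb(7.5e9, 64, stage), 1))
```

Under these assumptions the replicated baseline needs about 120 GB per GPU, while full Stage 3 sharding brings the model states under 2 GB per GPU.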

Tensor (/ Model) Parallel Training: Example

Want to compute: $y = \sum_{i} x_{i} W_{i} = x_0 * W_0 + x_1 * W_1 + x_2 * W_2$
where each GPU has only its portion of the full weights, as shown below

Compute: $y_{0} = x_{0} * W_{0}\rightarrow$ GPU1
Compute: $y_{1} = y_{0} + x_{1} * W_{1}\rightarrow$ GPU2
Compute: $y = y_{1} + x_{2} * W_{2} = \sum_{i} x_{i} W_{i}$ ✅

Figure 19
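
The three steps above can be mimicked on a single machine. The sketch below uses plain Python lists in place of device tensors and assumes the input and weights are split along the contracted dimension, so that each "GPU" holds one `x_i` and one `W_i`. The helper names and the numeric values are made up for illustration; a real implementation would pass the partial result between devices with a communication library.

```python
def matvec(x, W):
    """Row vector x (length k) times matrix W (k rows, m columns)."""
    return [sum(x[r] * W[r][c] for r in range(len(x)))
            for c in range(len(W[0]))]


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [u + v for u, v in zip(a, b)]


# Full input (length 6) and full weights (6 x 2); values are arbitrary.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
     [2.0, 0.0], [0.0, 2.0], [1.0, -1.0]]

# Shard along the contracted dimension: "GPU" i holds only x_i and W_i.
num_shards = 3
rows = len(x) // num_shards
x_shards = [x[i * rows:(i + 1) * rows] for i in range(num_shards)]
W_shards = [W[i * rows:(i + 1) * rows] for i in range(num_shards)]

# Running partial sum handed from one device to the next:
# y_0 = x_0 W_0, then y_1 = y_0 + x_1 W_1, then y = y_1 + x_2 W_2.
y = [0.0] * len(W[0])
for x_i, W_i in zip(x_shards, W_shards):
    y = add(y, matvec(x_i, W_i))

y_reference = matvec(x, W)  # unsharded computation, for comparison
print(y, y_reference)
```

The sharded result agrees with the unsharded product because the sum over `i` only regroups the terms of the same dot products.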

Deciding on a Parallelism Strategy

Single GPU

Model fits onto a single GPU:
- Normal use
Model DOES NOT fit on a single GPU:
- ZeRO + Offload CPU (or, optionally, NVMe)
Largest layer DOES NOT fit on a single GPU:
- ZeRO + Enable Memory Centric Tiling (MCT)
  - MCT allows running arbitrarily large layers by automatically splitting them and executing the pieces sequentially.

Multiple GPUs

Model fits onto a single GPU:
- DDP
- ZeRO

Model DOES NOT fit onto a single GPU:
1. Pipeline Parallelism (PP)
2. ZeRO
3. Tensor Parallelism (TP)

With sufficiently fast connectivity between nodes, these three strategies should be comparable.
- Otherwise, PP > ZeRO $\simeq$ TP.

When you have fast inter-node connectivity:
- ZeRO (virtually NO modifications)
- PP + ZeRO + TP + DP (less communication, at the cost of MAJOR modifications)

When you have slow inter-node connectivity and are still low on GPU memory:
```
DP + PP + TP + ZeRO-1
```
- NOTE: TP is almost always used within a single node, e.g.
  TP <= GPUS_PER_NODE
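
When combining DP, PP and TP, the three degrees have to multiply to the total number of GPUs. The sketch below shows one way to sanity-check a proposed layout; the function name `data_parallel_degree` and the example cluster shape are hypothetical, and the `TP <= GPUS_PER_NODE` check is treated as a hard constraint here although in practice it is a strong recommendation.

```python
def data_parallel_degree(world_size, tp, pp, gpus_per_node):
    """Return the DP degree implied by a TP x PP layout.

    Assumes world_size = DP * PP * TP and that tensor parallelism
    stays inside a single node (TP <= GPUS_PER_NODE).
    """
    if tp > gpus_per_node:
        raise ValueError("TP should not exceed the number of GPUs per node")
    if world_size % (tp * pp) != 0:
        raise ValueError("world_size must be divisible by TP * PP")
    return world_size // (tp * pp)


# Hypothetical cluster: 16 nodes with 4 GPUs each, TP=4 and PP=2.
dp = data_parallel_degree(world_size=64, tp=4, pp=2, gpus_per_node=4)
print(dp)
```

Here each model replica spans `TP * PP = 8` GPUs, leaving 8 replicas for data parallelism.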



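| Year | Author   | GPU  | Batch Size | # GPUs | Time (s) | Accuracy |
|------|----------|------|------------|--------|----------|----------|
| 2016 | He       | P100 | 256        | 8      | 104,400  | 75.30%   |
| 2019 | Yamazaki | V100 | 81,920     | 2048   | 72       | 75.08%   |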