Deep Learning and Foundation Models at Scale

Sam Foreman

ALCF

2024-10-29

Tensor (/ Model) Parallel Training: Example

Want to compute: $y = \sum_{i} x_{i} W_{i} = x_0 * W_0 + x_1 * W_1 + x_2 * W_2$
where each GPU holds only its own portion of the full weights, as shown below.

Compute: $y_{0} = x_{0} * W_{0}\rightarrow$ GPU1
Compute: $y_{1} = y_{0} + x_{1} * W_{1}\rightarrow$ GPU2
Compute: $y = y_{1} + x_{2} * W_{2} = \sum_{i} x_{i} W_{i}$ ✅

Figure 15
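The accumulation above can be sketched in plain Python. This is only an illustration under stated assumptions: the shard values and the helper names (`matvec`, `add`, `tensor_parallel_sum`) are hypothetical, each list entry stands in for one GPU, and the hand-off of the partial sum between devices is modeled as an ordinary loop rather than real communication.

```python
# Sketch: each "GPU" owns one shard (x_i, W_i) and adds its partial
# product to the running sum received from the previous device.

def matvec(x, W):
    """Row vector x (length k) times a k-by-n matrix W."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]

def add(a, b):
    """Elementwise sum of two equal-length vectors."""
    return [ai + bi for ai, bi in zip(a, b)]

# Hypothetical shards: three devices, each holding a slice of x and of W.
x_shards = [[1.0, 2.0], [3.0], [4.0, 5.0]]
W_shards = [
    [[1.0, 0.0], [0.0, 1.0]],   # W_0
    [[2.0, 2.0]],               # W_1
    [[1.0, 1.0], [1.0, -1.0]],  # W_2
]

def tensor_parallel_sum(x_shards, W_shards):
    y = None
    for x_i, W_i in zip(x_shards, W_shards):
        partial = matvec(x_i, W_i)                     # local compute on one device
        y = partial if y is None else add(y, partial)  # pass the running sum along
    return y

y = tensor_parallel_sum(x_shards, W_shards)

# Reference: the same product computed with the unsharded x and W.
x_full = [v for shard in x_shards for v in shard]
W_full = [row for shard in W_shards for row in shard]
y_ref = matvec(x_full, W_full)

print(y, y_ref)
```

Both results agree, which is the point of the example: no single device ever needs the full weight matrix.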

Deciding on a Parallelism Strategy

Model fits onto a single GPU:
- Normal use
Model DOES NOT fit on a single GPU:
- ZeRO + Offload CPU (or, optionally, NVMe)
Largest layer DOES NOT fit on a single GPU:
- ZeRO + Enable Memory Centric Tiling (MCT)
  - MCT allows running arbitrarily large layers by automatically splitting them into tiles and executing the tiles sequentially.
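The idea behind tiling can be sketched as follows. This is a toy illustration, not DeepSpeed's implementation: the helpers `linear` and `tiled_linear` and the parameter `tile_cols` are hypothetical names, and the sketch only tiles along the output dimension of a single linear layer.

```python
# Sketch: compute x @ W one column tile at a time, so only one tile of W
# needs to be resident while it is being used.

def linear(x, W):
    """Row vector x (length k) times a k-by-n matrix W."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]

def tiled_linear(x, W, tile_cols):
    """Same result as linear(x, W), computed tile by tile."""
    out = []
    n_cols = len(W[0])
    for start in range(0, n_cols, tile_cols):
        # Only this slice of the weights is needed for this step.
        tile = [row[start:start + tile_cols] for row in W]
        out.extend(linear(x, tile))
    return out

x = [1.0, 2.0, 3.0]
W = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [0.0, 1.0, 0.0, 1.0, 0.0],
    [2.0, 0.0, 2.0, 0.0, 2.0],
]

full = linear(x, W)
tiled = tiled_linear(x, W, tile_cols=2)
print(full, tiled)
```

The tiles are processed one after another, trading some speed for a smaller peak memory footprint.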

Model fits onto a single GPU (multiple GPUs available):
- DDP
- ZeRO

Model DOES NOT fit onto a single GPU (multiple GPUs available):
- PP
- ZeRO
- TP

With sufficiently fast connectivity between nodes, these three strategies should be comparable.
- Otherwise, PP $>$ ZeRO $\simeq$ TP.

When you have fast inter-node connectivity:
- ZeRO (virtually NO modifications)
- PP + ZeRO + TP + DP (less communication, at the cost of MAJOR modifications)

When you have slow inter-node connectivity and are still low on GPU memory:
```
DP + PP + TP + ZeRO-1
```
- NOTE: TP is almost always used within a single node, i.e. TP <= GPUS_PER_NODE
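The rules of thumb above can be collected into a small helper. This is a sketch under assumptions, not an official recipe: the function names and parameters are hypothetical, and grouping the cases by "one GPU", "several GPUs" and "several nodes" is one reading of the lists above. The layout check also assumes that the product DP × PP × TP must equal the total number of GPUs, in addition to the stated constraint TP <= GPUS_PER_NODE.

```python
def suggest_strategies(model_fits, largest_layer_fits=True,
                       num_gpus=1, num_nodes=1, fast_interconnect=True):
    """Return candidate strategies for a setup (hypothetical helper)."""
    if num_nodes > 1:
        # Multi-node: the choice is driven by inter-node connectivity.
        # (The slow case also assumes GPU memory is still tight.)
        if fast_interconnect:
            return ["ZeRO", "PP + ZeRO + TP + DP"]
        return ["DP + PP + TP + ZeRO-1"]
    if num_gpus > 1:
        if model_fits:
            return ["DDP", "ZeRO"]
        # Comparable on fast links; otherwise ordered PP > ZeRO ~ TP.
        return ["PP", "ZeRO", "TP"]
    # Single GPU.
    if not largest_layer_fits:
        return ["ZeRO + Memory Centric Tiling (MCT)"]
    if not model_fits:
        return ["ZeRO + CPU offload (optionally NVMe)"]
    return ["Normal use"]


def valid_layout(dp, pp, tp, gpus_per_node, num_nodes):
    """Check a DP x PP x TP layout against the available GPUs."""
    world_size = gpus_per_node * num_nodes
    return dp * pp * tp == world_size and tp <= gpus_per_node


print(suggest_strategies(model_fits=False, num_gpus=8))
print(valid_layout(dp=4, pp=2, tp=4, gpus_per_node=4, num_nodes=8))
```

For example, with 8 nodes of 4 GPUs each, `tp=8` is rejected because a tensor-parallel group would have to span two nodes.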



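| Year | Author   | GPU  | Batch Size | # GPUs | Time (s) | Accuracy |
|------|----------|------|------------|--------|----------|----------|
| 2016 | He       | P100 | 256        | 8      | 104,400  | 75.30%   |
| 2019 | Yamazaki | V100 | 81,920     | 2048   | 72       | 75.08%   |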
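As a quick worked example, the two rows can be compared directly. The dictionary layout and variable names below are hypothetical; the numbers are taken from the table. Note that the two runs also differ in GPU generation and batch size, so the ratio is not a pure measure of scaling efficiency.

```python
# Numbers copied from the table above.
runs = {
    "He (2016)": {"gpus": 8, "time_s": 104_400},
    "Yamazaki (2019)": {"gpus": 2048, "time_s": 72},
}

base = runs["He (2016)"]
fast = runs["Yamazaki (2019)"]

speedup = base["time_s"] / fast["time_s"]   # how much shorter the wall-clock time is
gpu_ratio = fast["gpus"] / base["gpus"]     # how many more GPUs were used

print(f"speedup: {speedup:.0f}x with {gpu_ratio:.0f}x as many GPUs")
```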