Training Foundation Models on Supercomputers

Sam Foreman
[email protected]

ALCF

2025-09-24

👀 Scaling: Overview

  • ✅ Goal:
    • Minimize: Cost (i.e. amount of time spent training)
    • Maximize: Performance

    📑 Note

    See 🤗 Performance and Scalability for more details

🐢 Training on a Single Device

  • See also:
    • Scientific AI at Scale: Distributed Training
    • 🤗 Methods and tools for efficient training on a single GPU

Figure 1: SLOW!! Model size limited by GPU memory

👬 Training on Multiple GPUs: Data Parallelism

Figure 2: Each GPU receives unique data at each step

➡️ Data Parallel: Forward Pass

Figure 3: Average gradients across all GPUs

⬅️ Data Parallel: Backward Pass

Figure 4: Send global updates back to each GPU. See: PyTorch / Distributed Data Parallel

🔄 Data Parallel: Training

  • Each GPU:
    • has identical copy of model
    • works on a unique subset of data
  • Easy to get started (minor modifications to code; see the sketch after this list):
    • saforem2/ezpz
    • 🔥 PyTorch / DDP
    • 🤗 HF / Accelerate
    • Microsoft / DeepSpeed
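
As a rough sketch of what those "minor modifications" look like with plain PyTorch DDP (assuming a CUDA machine and a torchrun-style launcher that sets RANK / WORLD_SIZE / LOCAL_RANK; ezpz automates this setup for you):

    # Minimal DDP sketch: every rank holds a full model copy and trains on its own data
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group(backend="nccl")      # "ccl" / "gloo" on other hardware
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(128, 128).to(f"cuda:{local_rank}"), device_ids=[local_rank])
    optimizer = torch.optim.Adam(model.parameters())

    x = torch.rand((64, 128), device=f"cuda:{local_rank}")  # this rank's (unique) batch
    loss = ((model(x) - x) ** 2).sum()
    loss.backward()                               # gradients are all-reduced automatically
    optimizer.step()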

📡 Communication

  • Requires global communication
    • every rank must participate (collective communication) !!
  • Need mechanism(s) for communicating across GPUs:
    • mpi4py
    • torch.distributed
  • Collective Communication:
    • Nvidia: Collective Communications Library (NCCL)
    • Intel: oneAPI Collective Communications Library (oneCCL)
    • AMD: ROCm Communication Collectives Library (RCCL)
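
For example, averaging a tensor across every rank with torch.distributed is a single collective call; a minimal sketch, assuming the process group is already initialized and the tensor lives on a device supported by the chosen backend:

    import torch
    import torch.distributed as dist

    # every rank starts with a different value ...
    t = torch.ones(1) * dist.get_rank()
    # ... and every rank must make the same collective call
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    t /= dist.get_world_size()   # now every rank holds the global average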

⌛ Timeouts

  • Every rank must make the same collective call in order to form a complete collective operation.
    • If any rank fails to participate, the remaining ranks will block and wait indefinitely (until the timeout fires), as in the sketch below
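
A deliberately broken sketch of how this happens (assuming an initialized process group): if any rank skips the collective, every other rank blocks until the backend timeout fires.

    import torch
    import torch.distributed as dist

    t = torch.ones(1)

    # BUG: rank 0 never calls the collective, so all other ranks
    # sit inside all_reduce until the (NCCL / oneCCL) timeout expires
    if dist.get_rank() != 0:
        dist.all_reduce(t)

    # CORRECT: every rank makes the same call
    dist.all_reduce(t)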

🚧 Common Pitfalls

  • Each worker needs to be fed a unique batch of data at each step
  • Only perform File I/O on one worker (i.e. rank==0)
    • When loading from a checkpoint, read in on one worker and broadcast to others
  • Collective operations must be called by all workers
    • Ensure that all workers are using the same version of code / libraries

Figure 5: To ensure all workers have the same copies, we load on RANK==0 and broadcast
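
A minimal sketch of this pattern with torch.distributed (the checkpoint path is hypothetical, and the model is assumed to be constructed identically on every rank before loading):

    import torch
    import torch.distributed as dist

    model = torch.nn.Linear(128, 128)      # stand-in for your real model
    ckpt_path = "checkpoint.pt"            # hypothetical path

    # only rank 0 touches the filesystem ...
    state = torch.load(ckpt_path, map_location="cpu") if dist.get_rank() == 0 else None

    # ... then the (pickled) state dict is broadcast to every other rank
    obj = [state]
    dist.broadcast_object_list(obj, src=0)
    model.load_state_dict(obj[0])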

🎀 Best Practices

  • Use parallel IO whenever possible
    • Feed each rank from different files
    • Use MPI IO to have each rank read its own batch from a file
    • Use several ranks to read data, MPI to scatter to remaining ranks
      • Most practical for large, at-scale training runs
  • Take advantage of data storage
    • Use striping on Lustre
  • Use the right optimizations for Aurora, Polaris, etc.
  • Preload data when possible
    • Offloading to a GPU frees CPU cycles for loading the next batch of data
      • this minimizes I/O latency (see the data-loading sketch below)
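
One simple way to feed each rank a unique shard while overlapping host-side loading with device compute is torch.utils.data.DistributedSampler; a sketch, assuming an initialized process group (an at-scale run would swap the toy dataset for parallel / MPI-IO-backed readers):

    import torch
    from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

    dataset = TensorDataset(torch.rand(10_000, 128))   # toy stand-in for a real dataset

    # DistributedSampler partitions the indices so each rank sees a disjoint shard
    sampler = DistributedSampler(dataset, shuffle=True)
    loader = DataLoader(
        dataset,
        batch_size=64,
        sampler=sampler,
        num_workers=4,      # overlap loading of the next batch with GPU compute
        pin_memory=True,    # faster host-to-device copies
    )

    for epoch in range(10):
        sampler.set_epoch(epoch)   # reshuffle the shards each epoch
        for (batch,) in loader:
            pass                   # forward / backward / step as usual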

⏰ Keeping things in Sync

Computation stalls during communication !!

Keeping the communication to computation ratio small is important for effective scaling.
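
One common way to shrink that ratio with DDP is gradient accumulation: skip the gradient all-reduce on most micro-steps via no_sync() and synchronize only every N steps. A sketch, where model is assumed to already be wrapped in DistributedDataParallel and loader / loss_fn / optimizer are placeholders:

    import contextlib

    def train_with_accumulation(model, loader, loss_fn, optimizer, accum_steps=4):
        """Accumulate gradients locally; all-reduce only every `accum_steps` steps."""
        for step, (x, y) in enumerate(loader):
            sync_now = (step + 1) % accum_steps == 0
            # DDP's no_sync() skips the gradient all-reduce for this backward pass
            ctx = contextlib.nullcontext() if sync_now else model.no_sync()
            with ctx:
                loss = loss_fn(model(x), y) / accum_steps
                loss.backward()
            if sync_now:
                optimizer.step()
                optimizer.zero_grad()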

🤔 Plan of Attack

Figure 6: General strategy for scaling model training

🚀 Going Beyond Data Parallelism

  • ✅ Useful when model fits on single GPU:
    • ultimately limited by GPU memory
    • model performance limited by size
  • ⚠️ When model does not fit on a single GPU:
    • Offloading (can only get you so far…):
      • DeepSpeed + ZeRO
      • 🔥 PyTorch + FSDP
    • Otherwise, resort to model parallelism strategies
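
A minimal sketch of sharding a model with PyTorch FSDP (assuming an initialized process group on a CUDA system; the wrapping policy and device handling will differ for real models and for Aurora's XPUs):

    import torch
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    model = torch.nn.Sequential(           # toy stand-in for a model too large for one GPU
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    )

    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # gathering parameters on the fly for each forward / backward pass
    model = FSDP(model, device_id=torch.cuda.current_device())
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)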

Going beyond Data Parallelism: DeepSpeed + ZeRO

  • Depending on the ZeRO stage (1, 2, 3), we can partition (and optionally offload) more and more of the training state, as sketched in the config below:
    1. Stage 1: optimizer states $\left(P_{\mathrm{os}}\right)$
    2. Stage 2: gradients + optimizer states $\left(P_{\mathrm{os+g}}\right)$
    3. Stage 3: model params + gradients + optimizer states $\left(P_{\mathrm{os+g+p}}\right)$
Figure 7: DeepSpeed + ZeRO
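
The stage is selected in the DeepSpeed config; a sketch of the relevant fragment passed to deepspeed.initialize (the batch size, dtype, and offload settings are illustrative, and the toy model stands in for your real one):

    import torch
    import deepspeed

    model = torch.nn.Linear(1024, 1024)    # stand-in for your real model

    ds_config = {
        "train_micro_batch_size_per_gpu": 8,
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 2,                              # 1: P_os, 2: P_os+g, 3: P_os+g+p
            "offload_optimizer": {"device": "cpu"},  # optional: push optimizer state to CPU
        },
    }

    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )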

🕸️ Additional Parallelism Strategies

  • Tensor (/ Model) Parallelism (TP):
    • 🤗 Tensor Parallelism
    • 🔥 Large Scale Transformer model training with Tensor Parallel (TP)
  • Pipeline Parallelism (PP):
    • 🔥 PyTorch, DeepSpeed
  • Sequence Parallelism (SP):
    • DeepSpeed Ulysses
    • Megatron / Context Parallelism
    • Unified Sequence Parallel (USP)
      • feifeibear/long-context-attention
    • Supports 4D Parallelism (DP + TP + PP + SP)

Pipeline Parallelism (PP)

  • Model is split up vertically (layer-level) across multiple GPUs
  • Each GPU:
    • has a portion of the full model
    • processes its stage of the pipeline in parallel with the other GPUs (on a small chunk of the batch)
  • See:
    • 🔥 PyTorch / Pipeline Parallelism
    • DeepSpeed / Pipeline Parallelism

Figure 8: Pipeline Parallelism
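
Conceptually, a (naive) two-stage pipeline just places different groups of layers on different devices and hands activations across; real schedules (e.g. GPipe, 1F1B) additionally split each batch into micro-batches so that all stages stay busy. A toy sketch, assuming two CUDA devices:

    import torch

    # stage 0: first half of the layers on GPU 0
    stage0 = torch.nn.Sequential(torch.nn.Linear(1024, 4096), torch.nn.ReLU()).to("cuda:0")
    # stage 1: second half of the layers on GPU 1
    stage1 = torch.nn.Sequential(torch.nn.Linear(4096, 1024)).to("cuda:1")

    x = torch.rand(64, 1024, device="cuda:0")
    h = stage0(x)                 # stage 0 runs on GPU 0
    y = stage1(h.to("cuda:1"))    # activations are handed to stage 1 on GPU 1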

Tensor Parallel (TP)

  • Each tensor is split up into multiple chunks
  • Each shard of the tensor resides on its designated GPU
  • Each shard is processed separately (and in parallel) on its designated GPU
    • results are synced at the end of the step
  • See: 🤗 Model Parallelism for additional details

Figure 9: Tensor Parallel Training

Tensor Parallel (TP)

  • Suitable when the model is too large to fit onto a single device (CPU / GPU)
  • Typically more complicated to implement than data parallel training
    • This is what one may call horizontal parallelism
    • Communication is required whenever data flows between two subsets
  • argonne-lcf/Megatron-DeepSpeed
  • 🤗 huggingface/nanotron

Figure 10: Tensor Parallel Training
  • Split up the network over multiple workers
  • Each worker receives a disjoint subset
  • All communication associated with the subsets is distributed across workers

Tensor (/ Model) Parallel Training: Example

Want to compute: $y = \sum_{i} x_{i} W_{i} = x_0 * W_0 + x_1 * W_1 + x_2 * W_2$,
where each GPU has only its own portion of the full weights, as shown below

  1. Compute: $y_{0} = x_{0} * W_{0}$ → GPU1
  2. Compute: $y_{1} = y_{0} + x_{1} * W_{1}$ → GPU2
  3. Compute: $y = y_{1} + x_{2} * W_{2} = \sum_{i} x_{i} W_{i}$ ✅

Figure 11
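
The same sum can also be formed with a single collective instead of passing partial results from GPU to GPU: each rank computes its local x_i * W_i and an all-reduce adds them. A sketch, assuming an initialized process group with one weight shard per rank (shapes are illustrative):

    import torch
    import torch.distributed as dist

    # each rank holds only its own slice of the input and its own weight shard
    x_i = torch.rand(64, 256)
    w_i = torch.rand(256, 128)

    y = x_i @ w_i                             # local partial result: x_i * W_i
    dist.all_reduce(y, op=dist.ReduceOp.SUM)  # y = sum_i x_i * W_i on every rank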

🧬 MProt-DPO: Scaling Results

Figure 12: Scaling results for 3.5B model across ~38,400 GPUs
  • ~4 EFLOPS @ Aurora
  • 38,400 XPUs = 3,200 [nodes] x 12 [XPU / node]

  • 🔔 2024 ACM Gordon Bell Finalist (Dharuman et al. 2024): MProt-DPO: Breaking the ExaFLOPS Barrier for Multimodal Protein Design Workflows

🌎 AERIS: Scaling Results

Figure 13: AERIS: Scaling Results
  • 10 EFLOPS (sustained) @ 120,960 GPUs
  • See Hatanpää et al. (2025), arXiv:2509.13523, for additional details

🍋 ezpz

Write once, run anywhere

  • Setup (optional1):

    source <(curl -L https://bit.ly/ezpz-utils)
    ezpz_setup_env
  • Install:

    uv pip install "git+https://github.com/saforem2/ezpz" --no-cache --link-mode=copy
  • See also:

    • saforem2/ezpz
    • LLMS on Aurora: Hands-On
    • ezpz docs
  1. Can be skipped if you already have an environment with torch + mpi4py

🍋 ezpz @ ALCF

  • Polaris:

    uv venv --python=3.12
    source .venv/bin/activate
    module use /soft/modulefiles
    module load gcc-native cudatoolkit/12.8.1
    uv pip install --no-cache --link-mode=copy torch torchvision torchaudio transformers deepspeed datasets accelerate torchinfo
    CC=mpicc CXX=mpicxx uv pip install --no-cache --link-mode=copy --no-binary=mpi4py mpi4py
    uv run --with "git+https://github.com/saforem2/ezpz@saforem2/tests" --with "numpy<2" ezpz-test

🐣 Getting Started

  1. Submit interactive job:

    qsub -I -l select=2 -l walltime=01:00:00 \
        -l filesystems=home:flare \
        -A gpu_hack \
        -q gpu_hack_prio
  2. Source1 the ezpz/bin/utils.sh script (using curl to download it2):

    source <(curl -L https://bit.ly/ezpz-utils)
  1. In general, you should be wary of running random scripts from the internet.

  2. https://bit.ly/ezpz-utils, since https://raw.githubusercontent.com/saforem2/ezpz/main/bin/utils.sh is a bit of a pain

🏖️ Shell Environment

  1. Setup environment:

    ezpz_setup_env

🔍 Environment Setup with ezpz_setup_env

  • Wrapper around ezpz_setup_job && ezpz_setup_python
  1. ezpz_setup_job: Determine the specifics of our active (PBS, SLURM) job1

  2. ezpz_setup_python:

    • if @ ALCF:
      • Load the appropriate modules and activate base conda env
    • else:
      • Look for an active conda environment
        • If found, use it to build a new virtual environment
    • Activate the newly created venvs/$(basename ${CONDA_PREFIX}) environment
  1. e.g. ${NHOSTS}, ${NGPU_PER_HOST}, ${NGPUS}, …

⏱️ Working with Job Scheduler(s)

  • ezpz integrates directly with your favorite job scheduler (PBS, slurm)
    • has mechanisms for getting information about our currently running jobs
  • 🪄 Automagically:
    • Determine the specifics of our active (PBS, SLURM) job
      (e.g. ${NHOSTS}, ${NGPU_PER_HOST}, ${NGPUS}, …)
    • Load the appropriate modules1
    • Create (or activate) a virtual environment on top of a base conda environment
  1. On any of the ALCF systems, including: Aurora, Polaris, …, etc.

🐍 Python Environments

  • ALWAYS work inside a virtual environment
    • best practice is to maintain separate virtual environments for:
      • each project you work on
      • different versions of a specific package you’re working with
        e.g. you would want different envs for torch==2.X vs. torch==2.Y
    • Mangled python environments are one of the most common issues faced by users

🧪 Simple Distributed Test

  1. Run distributed test:

    ezpz-test
  2. Launch arbitrary Python (a module or a string) with ezpz-launch

    • Launch a module:

      ezpz-launch -m ezpz.test_dist
    • Launch a python string:

      ezpz-launch -c "'import ezpz; ezpz.setup_torch()'"

➕ How to Modify Existing Code

+ import ezpz
+ _ = ezpz.setup_torch()

- model.to('cuda')
+ model.to(ezpz.get_torch_device_type())
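
Putting those two changes together, a single-device training script becomes device-agnostic and distributed-ready; a sketch using only the ezpz calls shown on these slides:

    import ezpz
    import torch

    _ = ezpz.setup_torch()                  # initialize torch.distributed across all ranks
    device = ezpz.get_torch_device_type()   # "xpu", "cuda", "mps", or "cpu"

    model = torch.nn.Linear(128, 128)
    model.to(device)                        # instead of model.to('cuda')

    if ezpz.get_world_size() > 1:
        from torch.nn.parallel import DistributedDataParallel as DDP
        model = DDP(model, device_ids=[ezpz.get_local_rank()])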

✨ Features

  • Initializing PyTorch across multiple processes

    import ezpz
    _ = ezpz.setup_torch()
    rank = ezpz.get_rank()
    world_size = ezpz.get_world_size()
    local_rank = ezpz.get_local_rank()
  • Automatic device detection (xpu, cuda, mps, cpu, …)

    x = torch.rand((10, 10)).to(ezpz.get_torch_device_type())
  • Automatic (single-process) logging

    logger = ezpz.get_logger(__name__)
  • Distributed debugger:

    try:
        buggy_code()
    except Exception:
        ezpz.breakpoint(0)

🧪 Experiment Tracking

import ezpz
rank = ezpz.setup_torch()
logger = ezpz.get_logger(__name__)
if rank == 0:                   # -- [1.] --
    try:
        _ = ezpz.setup_wandb(
            "ezpz.examples.minimal"
        )
    except Exception:
        logger.exception(
            "Failed to initialize wandb, continuing without it"
        )

# ...build {model, optimizer}, etc...

for i in range(train_iters):
    metrics = train_step(...)
    logger.info(                 # -- [2.] --
        history.update(metrics)  # -- [3.] --
    )

if rank == 0:
    history.finalize()
  1. Initialize W&B (if WANDB_DISABLED is not set)
  2. Log summary of metrics to stdout
  3. Update history.history with metrics1
  1. Will automatically be reported to W&B if a run is detected

🤏 Minimal Example

  • See ezpz/examples/minimal.py
import os
import time
import ezpz
import torch

logger = ezpz.get_logger(__name__)


class Network(torch.nn.Module):
    def __init__(
        self,
        input_dim: int,
        output_dim: int,
        sizes: list[int] | None,
    ):
        super(Network, self).__init__()
        nh = output_dim if sizes is None else sizes[0]
        layers = [torch.nn.Linear(input_dim, nh), torch.nn.ReLU()]
        if sizes is not None and len(sizes) > 1:
            for idx, size in enumerate(sizes[1:]):
                layers.extend(
                    [torch.nn.Linear(sizes[idx], size), torch.nn.ReLU()]
                )
            layers.append(torch.nn.Linear(sizes[-1], output_dim))
        self.layers = torch.nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)


@ezpz.timeitlogit(rank=ezpz.get_rank())
def train(
    model: torch.nn.Module, optimizer: torch.optim.Optimizer
) -> ezpz.History:
    unwrapped_model = (
        model.module
        if isinstance(model, torch.nn.parallel.DistributedDataParallel)
        else model
    )
    history = ezpz.History()
    device_type = ezpz.get_torch_device_type()
    dtype = unwrapped_model.layers[0].weight.dtype
    bsize = int(os.environ.get("BATCH_SIZE", 64))
    isize = unwrapped_model.layers[0].in_features
    warmup = int(os.environ.get("WARMUP_ITERS", 10))
    log_freq = int(os.environ.get("LOG_FREQ", 1))
    model.train()
    for step in range(int(os.environ.get("TRAIN_ITERS", 500))):
        with torch.autocast(
            device_type=device_type,
            dtype=dtype,
        ):
            t0 = time.perf_counter()
            x = torch.rand((bsize, isize), dtype=dtype).to(device_type)
            y = model(x)
            loss = ((y - x) ** 2).sum()
            dtf = (t1 := time.perf_counter()) - t0
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            dtb = time.perf_counter() - t1
            if step % log_freq == 0 and step > warmup:
                logger.info(
                    history.update(
                        {
                            "iter": step,
                            "loss": loss.item(),
                            "dt": dtf + dtb,
                            "dtf": dtf,
                            "dtb": dtb,
                        }
                    )
                )
    return history


@ezpz.timeitlogit(rank=ezpz.get_rank())
def setup():
    rank = ezpz.setup_torch()
    if os.environ.get("WANDB_DISABLED", False):
        logger.info("WANDB_DISABLED is set, not initializing wandb")
    elif rank == 0:
        try:
            _ = ezpz.setup_wandb(
                project_name=os.environ.get(
                    "PROJECT_NAME", "ezpz.examples.minimal"
                )
            )
        except Exception:
            logger.exception(
                "Failed to initialize wandb, continuing without it"
            )
    device_type = ezpz.get_torch_device_type()
    model = Network(
        input_dim=int((os.environ.get("INPUT_SIZE", 128))),
        output_dim=int(os.environ.get("OUTPUT_SIZE", 128)),
        sizes=[
            int(x)
            for x in os.environ.get("LAYER_SIZES", "1024,512,256,128").split(
                ","
            )
        ],
    )
    model.to(device_type)
    model.to(getattr(torch, os.environ.get("DTYPE", "bfloat16")))
    logger.info(f"{model=}")
    optimizer = torch.optim.Adam(model.parameters())
    if ezpz.get_world_size() > 1:
        from torch.nn.parallel import DistributedDataParallel as DDP

        model = DDP(model, device_ids=[ezpz.get_local_rank()])

    return model, optimizer


def main():
    model, optimizer = setup()
    history = train(model, optimizer)
    if ezpz.get_rank() == 0:
        dataset = history.finalize()
        logger.info(f"{dataset=}")


if __name__ == "__main__":
    main()

🏃‍♂️ Running the Minimal Example

To run the previous example we:

  1. Source the ezpz utils script:

    source <(curl -L https://bit.ly/ezpz-utils)
  2. Setup our environment:

    ezpz_setup_env
  3. Run the example:

    ezpz-launch -m ezpz.examples.minimal

📝 ezpz-test

  • ezpz-test is a simple test script that trains a small model using DDP across all available GPUs

    • It will automatically detect the number of GPUs and launch an appropriate mpiexec command to run the training script across all GPUs
  • See: ezpz/test.py

  • Command:

    ezpz-test

🦜 Generate Text

  • See: ezpz/generate.py

  • Command:

    python3 -m ezpz.generate --model_name meta-llama/Llama-3.1-8B

🤗 Huggingface Trainer

  • See ezpz/hf_trainer.py

  • Command:

    ezpz-launch -m ezpz.hf_trainer \
        --dataset_name=eliplutchok/fineweb-small-sample \
        --streaming \
        --model_name_or_path=meta-llama/Llama-3.2-1B \
        --bf16=true \
        --do_train=true \
        --do_eval=true \
        --report-to=wandb \
        --logging-steps=1 \
        --include-tokens-per-second=true \
        --block-size=128 \
        --max-steps=10 \
        --include-num-input-tokens-seen=true \
        --auto_find_batch_size=true \
        --gradient_checkpointing=true \
        --optim=adamw_torch \
        --overwrite-output-dir=true \
        --logging-first-step \
        --include-for-metrics='inputs,loss' \
        --max-eval-samples=50 \
        --ddp-backend=ccl

🏎️ Megatron-DeepSpeed

git clone https://github.com/argonne-lcf/Megatron-DeepSpeed
cd Megatron-DeepSpeed
source <(curl -L https://bit.ly/ezpz-utils)
python3 -m pip install -e \
    deepspeed \
    "git+https://github.com/saforem2/ezpz"
bash train_alcf.sh

🙌 Acknowledgements

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

📓 References

Dharuman, Gautham, Kyle Hippe, Alexander Brace, Sam Foreman, Väinö Hatanpää, Varuni K. Sastry, Huihuo Zheng, et al. 2024. “MProt-DPO: Breaking the ExaFLOPS Barrier for Multimodal Protein Design Workflows with Direct Preference Optimization.” In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis. SC ’24. Atlanta, GA, USA: IEEE Press. https://doi.org/10.1109/SC41406.2024.00013.
Hatanpää, Väinö, Eugene Ku, Jason Stock, Murali Emani, Sam Foreman, Chunyong Jung, Sandeep Madireddy, et al. 2025. “AERIS: Argonne Earth Systems Model for Reliable and Skillful Predictions.” https://arxiv.org/abs/2509.13523.

samforeman.me/talks/2025/09/24/slides
