Training Foundation Models on Supercomputers

@ Georgia Institute of Technology

2025-10-15

🌐 Distributed Training

🚀 Scaling: Overview

🐢 Training on a Single Device

Figure: Training on a single device: at each step, the next batch (x0, x1, x2, …) flows from the data loader through the network on GPU0 to compute the loss, one step at a time.

SLOW! And the model size is limited by the memory of a single GPU.

🕸️ Parallelism Strategies

  • Data Parallelism
    • Split data across workers
    • Easiest to implement
    • No changes to model
  • Model Parallelism
    • Split model across workers
  • Hybrid Parallelism
    • Combine data + model parallelism
    • More complex to implement
    • Requires changes to model

👬 Training on Multiple GPUs: Data Parallelism

Figure 1: Data parallelism: each of GPU0, GPU1, GPU2 holds a full copy of the network (NN) and receives a unique batch (x0, x1, x2) at each step, computing its own loss.

▶️ Data Parallel: Forward Pass

Figure 2: Forward pass: each GPU computes the loss on its own batch; the resulting gradients are then averaged across all GPUs, (∑ₙgₙ)/N.

◀️ Data Parallel: Backward Pass

Figure 3: Backward pass: the globally averaged updates are sent back to each GPU, keeping every replica of the model in sync. See: PyTorch / Distributed Data Parallel.
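
Below is a minimal, hedged sketch of what this looks like with PyTorch DDP (linked above); the model, loss, and hyperparameters are toy placeholders, not the configuration used in this talk.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model; in practice, the full network goes here.
    model = torch.nn.Linear(1024, 1024).cuda()
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 1024, device="cuda")   # each rank sees a different batch
        loss = model(x).square().mean()
        loss.backward()                             # DDP averages gradients across ranks here
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, e.g., `torchrun --standalone --nproc_per_node=4 ddp_example.py`, each process drives one GPU and the gradient averaging happens automatically during `backward()`.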

🔄 Collective Communication

  • Broadcast: Send data from one node to all other nodes
  • Reduce: Aggregate data from all nodes to one node
    • AllReduce: Aggregate data from all nodes to all nodes
  • Gather: Collect data from all nodes to one node
    • AllGather: Collect data from all nodes to all nodes
  • Scatter: Distribute data from one node to all other nodes

Reduce

  • Perform a reduction on data across ranks and send the result to a single rank

Figure 4: Reduce operation: z = reduce(x, 2, SUM). Each rank i (0–3) contributes its value xi; rank 2 receives the reduction z = x0 + x1 + x2 + x3.
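
A small sketch of these collectives with torch.distributed (this assumes an already-initialized process group with at least 3 ranks, each bound to its own GPU, e.g. via torchrun):

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has been called and world_size >= 3,
# with torch.cuda.set_device(local_rank) already done on each rank.
rank = dist.get_rank()

# Each rank contributes its own value x_i (here, simply its rank id).
x = torch.tensor([float(rank)], device="cuda")

# Reduce: only the destination rank (dst=2) ends up holding z = sum_i x_i.
dist.reduce(x, dst=2, op=dist.ReduceOp.SUM)

# AllReduce: every rank ends up holding the sum.
y = torch.tensor([float(rank)], device="cuda")
dist.all_reduce(y, op=dist.ReduceOp.SUM)
```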

🐣 Getting Started: In Practice

  • 🧠 Memory Management:
    • FSDP vs. ZeRO
    • Activation Checkpointing
    • Mixed Precision Training
    • Gradient Accumulation (a sketch combining both follows this list)
    • Offloading to CPU/NVMe
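
As a rough illustration (not the exact AuroraGPT setup), mixed precision and gradient accumulation can be combined in a plain PyTorch loop; the model and data below are toy placeholders:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()              # toy placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()                     # rescales the loss to avoid fp16 underflow
accum_steps = 8                                          # gradient accumulation factor

for step, x in enumerate(torch.randn(64, 32, 1024)):     # 64 toy batches of shape (32, 1024)
    x = x.cuda()
    with torch.cuda.amp.autocast():                      # forward pass in mixed precision
        loss = model(x).square().mean() / accum_steps    # scale loss by the accumulation factor
    scaler.scale(loss).backward()                        # accumulate scaled gradients
    if (step + 1) % accum_steps == 0:                    # update only every accum_steps micro-batches
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```

Gradient accumulation gives a larger effective batch size without increasing peak activation memory, at the cost of more micro-steps per update.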

🔄 Keeping things in Sync

Computation stalls during communication!

Keeping the communication to computation ratio small is important for effective scaling.
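
One standard way to keep that ratio small (a sketch assuming an already-initialized process group, not the author's exact setup) is to launch collectives asynchronously and overlap them with independent computation; DDP does this automatically by all-reducing gradient buckets while the backward pass is still running.

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called.
grads = torch.randn(10_000_000, device="cuda")            # stand-in for a gradient bucket

# Launch the all-reduce without blocking...
work = dist.all_reduce(grads, op=dist.ReduceOp.SUM, async_op=True)

# ...do independent computation while the collective is in flight...
a = torch.randn(4096, 4096, device="cuda")
b = a @ a

# ...and only wait for the communication right before the result is needed.
work.wait()
grads /= dist.get_world_size()                             # turn the sum into an average
```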

📝 Plan of Attack

Figure 5: General strategy for scaling model training: if the model is already perfect, we are done; otherwise, if memory is available, make the model larger, and if not, free up memory first, then repeat.

🚀 Going Beyond Data Parallelism

Going beyond Data Parallelism: DeepSpeed + ZeRO

  • Depending on the ZeRO stage (1, 2, 3), we can partition across data-parallel workers (see the configuration sketch below):
    1. Stage 1: optimizer states \left(P_{\mathrm{os}}\right)
    2. Stage 2: gradients + opt. states \left(P_{\mathrm{os}+\mathrm{g}}\right)
    3. Stage 3: model params + grads + opt. states \left(P_{\mathrm{os}+\mathrm{g}+\mathrm{p}}\right)
Figure 6: DeepSpeed + ZeRO
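
A minimal sketch of enabling ZeRO through a DeepSpeed config dict; the model, batch size, and learning rate here are placeholders, and only a handful of the available config fields are shown:

```python
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)          # placeholder; in practice, the full transformer

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},       # 1: P_os, 2: P_os+g, 3: P_os+g+p
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-4}},
}

# deepspeed.initialize wraps the model and optimizer in the ZeRO engine.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```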

🕸️ Additional Parallelism Strategies

Pipeline Parallelism (PP)

Figure 7: Pipeline parallelism: the model's layers (Layer 0–3) are split into consecutive stages placed on GPU 0 and GPU 1.
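
A minimal sketch of placing consecutive layers on different GPUs (naive layer splitting that assumes two visible GPUs; real pipeline parallelism additionally slices each batch into micro-batches, e.g. GPipe or 1F1B schedules, so that both stages stay busy):

```python
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Stage 0 (e.g. Layers 0-1) lives on GPU 0, stage 1 (Layers 2-3) on GPU 1.
        self.stage0 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        h = self.stage0(x.to("cuda:0"))
        return self.stage1(h.to("cuda:1"))    # activations cross the GPU boundary here

model = TwoStageModel()
y = model(torch.randn(8, 1024))
```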

Tensor Parallel (TP)

  • Each tensor is split up into multiple chunks
  • Each shard of the tensor resides on its designated GPU
  • During processing each shard gets processed separately (and in parallel) on different GPUs
    • synced at the end of the step
  • See: 🤗 Model Parallelism for additional details

Figure 8: Tensor parallel training: every layer (Layer 0–3) appears on both GPU0 and GPU1, with each GPU holding its own shard of each layer's tensors.

Tensor Parallel (TP)

  • Suitable when the model is too large to fit onto a single device (CPU / GPU)
  • Typically more complicated to implement than data parallel training
    • This is what one may call horizontal parallelism
    • Communication is required whenever data flows between the two subsets
  • argonne-lcf/Megatron-DeepSpeed
  • 🤗 huggingface/nanotron

Figure 9: Tensor parallel training: each GPU holds a shard of every layer (Layer 0–3).

Tensor (/ Model) Parallel Training: Example

Want to compute: y = \sum_{i} x_{i} W_{i} = x_0 * W_0 + x_1 * W_1 + x_2 * W_2
where each GPU has only its portion of the full weights as shown below

  1. Compute: y_{0} = x_{0} * W_{0}\rightarrow GPU1
  2. Compute: y_{1} = y_{0} + x_{1} * W_{1}\rightarrow GPU2
  3. Compute: y = y_{1} + x_{2} * W_{2} = \sum_{i} x_{i} W_{i}

Figure 10: Each GPU holds only its own weight shard Wi; the partial sums (x0 W0, then x0 W0 + x1 W1) are passed from GPU0 to GPU1 to GPU2 to accumulate the full result y.
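
The chain above can be sketched with point-to-point sends (ranks, shapes, and data here are illustrative; Megatron-style tensor parallelism instead splits individual weight matrices and synchronizes with all-reduce/all-gather):

```python
import torch
import torch.distributed as dist

# Assumes an initialized 3-rank process group; rank i holds only x_i and W_i.
rank, world_size = dist.get_rank(), dist.get_world_size()
x_i = torch.randn(16, 256, device="cuda")
W_i = torch.randn(256, 256, device="cuda")

partial = torch.zeros(16, 256, device="cuda")
if rank > 0:
    dist.recv(partial, src=rank - 1)          # receive the running partial sum

partial = partial + x_i @ W_i                 # add this rank's contribution x_i W_i

if rank < world_size - 1:
    dist.send(partial, dst=rank + 1)          # pass the partial sum to the next GPU
else:
    y = partial                               # the last rank holds y = sum_i x_i W_i
```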

🔭 AI-for-Science
Figure: source (@tenderizzation); ChatGPT: explain this image.

🏗️ Aurora

Table 1: Aurora Specs

| Property | Value   |
|----------|---------|
| Racks    | 166     |
| Nodes    | 10,624  |
| XPUs     | 127,488 |
| CPUs     | 21,248  |
| NICs     | 84,992  |
| HBM      | 8 PB    |
| DDR5     | 10 PB   |

Figure 11: Aurora: Fact Sheet.

🌌 AuroraGPT (2024–)

AuroraGPT: a general-purpose scientific LLM, broadly trained on general corpora plus scientific {papers, texts, data}

  • Explore pathways towards a “Scientific Assistant” model
  • Build with international partners (RIKEN, BSC, others)
  • Multilingual: English, Japanese, French, German, Spanish
  • Multimodal: images, tables, equations, proofs, time series, graphs, fields, sequences, etc.
Figure 12: Image from Hannibal046 / Awesome-LLM

🧪 AuroraGPT: Open Science Foundation Model

Figure 13: High-level overview of AuroraGPT project

🧰 AuroraGPT: Toolbox

  • Datasets and data pipelines (how do we deal with scientific data?)
  • Software infrastructure and workflows (scalable, robust, extensible)
  • Evaluation of state-of-the-art LLMs (how do they perform on scientific tasks?)

🚂 Training

argonne-lcf/Megatron-DeepSpeed
Large Model Training: Any Scale, Any Accelerator

🏃‍♂️ Running

argonne-lcf/inference-endpoints
Inference endpoints for LLMs, hosted @ ALCF

🏋️ Challenges: In Practice

This is incredibly difficult in practice, due in part to:

  • Brand new {hardware, architecture, software}
  • Lack of native support in existing frameworks (though getting better!)
  • General system stability
    +10k Nodes \left(\times \frac{12\,\,\mathrm{XPU}}{1\,\,\mathrm{Node}}\right)\Rightarrow +100k XPUs
    • network performance
    • file system stability (impacted by other users!)
    • many unexpected difficulties occur at increasingly large scales
  • Combinatorial explosion of possible configurations and experiments
    • {hyperparameters, architectures, tokenizers, learning rates, …}

💾 AuroraGPT: Training

  • Training a fixed model on trillions of tokens requires:
    1. Aggregating data from multiple different corpora
      (e.g. ArXiv, Reddit, StackExchange, GitHub, Wikipedia, etc.)
    2. Sampling each training batch according to a fixed distribution across corpora
    3. Building indices that map batches of tokens into these files (indexing)

    The original implementation was slow:

    • Designed to run serially on a single device
    • Major bottleneck when debugging data pipeline at scale
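
As a rough illustration of step 2 above (not the Megatron-DeepSpeed indexing code itself), sampling each batch from several corpora according to fixed weights might look like this; the corpus names and weights are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical corpora (stand-ins for document/token indices) and fixed mixing weights.
corpora = {
    "arxiv":     np.arange(0, 1_000),
    "github":    np.arange(1_000, 2_000),
    "wikipedia": np.arange(2_000, 3_000),
}
weights = {"arxiv": 0.5, "github": 0.3, "wikipedia": 0.2}   # must sum to 1

def sample_batch(batch_size: int):
    """Draw one batch whose corpus composition follows the fixed distribution."""
    names = list(corpora)
    probs = np.array([weights[n] for n in names])
    picks = rng.choice(len(names), size=batch_size, p=probs)
    return [int(rng.choice(corpora[names[i]])) for i in picks]

batch = sample_batch(8)
```

In the actual pipeline, this mapping is realized through the pre-built indices of step 3 rather than computed on the fly.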

🍹 AuroraGPT: Blending Data, Efficiently

  • 🐢 Original implementation:
    • Slow (serial, single device)
    • ~ 1 hr/2T tokens
  • 🐇 New implementation:
    • Fast! (distributed, asynchronous)
    • ~ 2 min/2T tokens
      (30x faster!)
Figure 14: Time spent preparing 2T tokens

📉 Loss Curve: Training AuroraGPT-7B on 2T Tokens

Figure 15: Loss curve during training on 2T tokens.

✨ Features

  • 🕸️ Parallelism:
    • {data, tensor, pipeline, sequence, …}
  • ♻️ Checkpoint Converters:
    • Megatron ⇄ 🤗 HF ⇄ ZeRO ⇄ Universal
  • 🔀 DeepSpeed Integration:
    • ZeRO Offloading
    • Activation checkpointing
    • AutoTP (WIP)
    • Ability to leverage features from the DeepSpeed community

✨ Features (even more!)

  • 🧗 Optimizers:
    • Support for many different optimizers:
      • Distributed Shampoo, Muon, Adopt, Sophia, Lamb, GaLore, ScheduleFree, …
    • See full list
    • Large batch training
  • 📊 Experiment Tracking:
    • Automatic experiment and metric tracking with Weights & Biases
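
A minimal sketch of what the automatic tracking boils down to (the project name, config, and metric below are placeholders):

```python
import wandb

run = wandb.init(project="megatron-deepspeed-demo", config={"lr": 2e-4, "num_layers": 32})

for step in range(100):
    loss = 1.0 / (step + 1)                       # placeholder metric
    wandb.log({"train/loss": loss, "step": step})

run.finish()
```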

🧬 MProt-DPO

🧬 Scaling Results (2024)

Figure 16: Scaling results for 3.5B model across ~38,400 GPUs

🧬 MProt-DPO: Scaling Results

Figure 17: 3.5B model
Figure 18: 7B model

🚂 Loooooooooong Sequence Lengths

Figure 19: Maximum (achievable) SEQ_LEN for both the 25B and 33B models (See: Song et al. (2023))

🌎 AERIS (2025)

Figure 20: arXiv:2509.13523
Figure 21: Pixel-level Swin diffusion transformer in sizes from [1–80]B

👀 High-Level Overview of AERIS

Figure 22: Rollout of AERIS model, specific humidity at 700m.
Table 2: Overview of AERIS model and training setup

| Property           | Description      |
|--------------------|------------------|
| Domain             | Global           |
| Resolution         | 0.25° & 1.4°     |
| Training Data      | ERA5 (1979–2018) |
| Model Architecture | Swin Transformer |
| Speedup            | O(10k–100k)      |

➕ Contributions

☔ AERIS

First billion-parameter diffusion model for weather + climate

  • Operates at the pixel level (1 × 1 patch size), guided by physical priors
  • Medium-range forecast skill:
    • Surpasses IFS ENS, competitive with GenCast (Price et al. 2024)
    • Uniquely stable on seasonal scales to 90 days

🌀 SWiPe

A novel 3D (sequence-window-pipeline) parallelism strategy for training transformers across high-resolution inputs

  • Enables scalable small-batch training on large supercomputers
    • 10.21 ExaFLOPS
    • @ 121,000 Intel XPUs (Aurora)

⚠️ Issues with the Deterministic Approach

  • Transformers:
    • Deterministic
    • Single input → single forecast
  • Diffusion:
    • Probabilistic
    • Single input → ensemble of forecasts
    • Captures uncertainty and variability in weather predictions
    • Enables ensemble forecasting for better risk assessment

🎲 Transitioning to a Probabilistic Model

Figure 23: Reverse diffusion with the input condition, individual sampling steps t_{0} \rightarrow t_{64}, the next time step estimate and the target output.

Reverse Diffusion Process (\mathcal{N}\rightarrow \pi)

Forward Diffusion Process (\pi\rightarrow \mathcal{N})
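
For reference, the two processes are commonly written in the standard DDPM form below (the exact parameterization used by AERIS may differ):

Forward: q\left(x_{t} \mid x_{t-1}\right) = \mathcal{N}\left(x_{t};\ \sqrt{1-\beta_{t}}\, x_{t-1},\ \beta_{t}\mathbf{I}\right)

Reverse: p_{\theta}\left(x_{t-1} \mid x_{t}\right) = \mathcal{N}\left(x_{t-1};\ \mu_{\theta}(x_{t}, t),\ \Sigma_{\theta}(x_{t}, t)\right)

where the forward process gradually corrupts the data distribution \pi toward \mathcal{N}, and the learned reverse process maps noise back to data.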

🌀 Sequence-Window-Pipeline Parallelism SWiPe

  • SWiPe is a novel parallelism strategy for Swin-based Transformers
  • Hybrid 3D Parallelism strategy, combining:
    • Sequence parallelism (SP)
    • Window parallelism (WP)
    • Pipeline parallelism (PP)
Figure 24
Figure 25: SWiPe Communication Patterns

🚀 AERIS: Scaling Results

Figure 26: AERIS: Scaling Results
  • 10 EFLOPs (sustained) @ 120,960 GPUs
  • See Hatanpää et al. (2025) for additional details
  • arXiv:2509.13523

🌪️ Hurricane Laura

Figure 27: Hurricane Laura tracks (top) and intensity (bottom). Initialized 7(a), 5(b) and 3(c) days prior to 2020-08-28T00z.

📓 References

Hatanpää, Väinö, Eugene Ku, Jason Stock, Murali Emani, Sam Foreman, Chunyong Jung, Sandeep Madireddy, et al. 2025. “AERIS: Argonne Earth Systems Model for Reliable and Skillful Predictions.” https://arxiv.org/abs/2509.13523.
Price, Ilan, Alvaro Sanchez-Gonzalez, Ferran Alet, Tom R. Andersson, Andrew El-Kadi, Dominic Masters, Timo Ewalds, et al. 2024. “GenCast: Diffusion-Based Ensemble Forecasting for Medium-Range Weather.” https://arxiv.org/abs/2312.15796.
Song, Shuaiwen Leon, Bonnie Kruft, Minjia Zhang, Conglong Li, Shiyang Chen, Chengming Zhang, Masahiro Tanaka, et al. 2023. “DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery Through Sophisticated AI System Technologies.” https://arxiv.org/abs/2310.04610.

❤️ Acknowledgements

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.