AuroraGPT: Training Foundation Models on Supercomputers

Sam Foreman

AuroraGPT: Training Foundation Models on Supercomputers

Sam Foreman

Argonne National Laboratory

@ Argonne National Laboratory

2025-12-16

🏋️ Challenges

This is incredibly difficult in practice, due in part to:

Brand new {hardware, architecture, software}
Lack of native support in existing frameworks (though getting better!)
General system stability
+10k Nodes $\left(\times \frac{12\,\,\mathrm{XPU}}{1\,\,\mathrm{Node}}\right)\Rightarrow$ +100k XPUs
- network performance
- file system stability (impacted by other users !)
- many unexpected difficulties occur at increasingly large scales
Combinatorial explosion of possible configurations and experiments
- {hyperparameters, architectures, tokenizers, learning rates, …}

AuroraGPT: Training Foundation Models on Supercomputers Sam Foreman Argonne National Laboratory @ Argonne National Laboratory 2025-12-16

Slides

Tools

Close
AuroraGPT: Training Foundation Models on Supercomputers
🧰 AuroraGPT: Toolbox
👥 Team Leads
🤝 Teams
🏋️ Challenges
💾 AuroraGPT: Training
🍹 AuroraGPT: Blending Data, Efficiently
📉 Training AuroraGPT-7B on 2T Tokens
📉 Training AuroraGPT-2B on 7T Tokens
✨ Features
✨ Features (even more!)
🧬 MProt-DPO
🧬 Scaling Results (2024)
🧬 MProt-DPO: Scaling Results
🚂 Loooooooooong Sequence Lengths
🌎 AERIS (2025)
👀 High-Level Overview of AERIS
➕ Contributions
⚠️ Issues with the Deterministic Approach
🎲 Transitioning to a Probabilistic Model
🌀 Sequence-Window-Pipeline Parallelism SWiPe
🚀 AERIS: Scaling Results
🌪️ Hurricane Laura
📓 References
❤️ Acknowledgements
Extras
f Fullscreen
s Speaker View
o Slide Overview
e PDF Export Mode
r Scroll View Mode
b Toggle Chalkboard
c Toggle Notes Canvas
d Download Drawings
? Keyboard Help

Property	Description
Domain	Global
Resolution	0.25° & 1.4°
Training Data	ERA5 (1979–2018)
Model Architecture	Swin Transformer
Speedup¹	O(10k–100k)