Training Foundation Models on Supercomputers

Sam Foreman

Argonne National Laboratory


@ University of Illinois at Urbana-Champaign

2025-10-24

🧑🏻‍💻 About Me

  • 🏡 samforeman.me
  • UIUC (2015):
    • Engineering Physics + Applied Mathematics
  • University of Iowa (2015–2019):
    • Ph.D. in Physics1
  • ANL (2019–2022): Postdoctoral Researcher
  • ANL (2022–Present): Assistant Computational Scientist
    • Member of the AI/ML Group at ALCF

Current Research:

  • AuroraGPT: Foundation Models for Science
  • AERIS: Argonne’s Earth System Model
    • Finalist for the 2025 ACM Gordon Bell Prize in Climate Modeling
  • MProt-DPO: Multimodal Protein Design
    • Finalist for the ACM Gordon Bell Prize 2024
  • GenSLMs: Genome Scale Language Models.
    • Winner of the ACM Gordon Bell Special Prize for HPC-Based COVID-19 Research
  1. A Machine Learning Approach to Lattice Gauge Theory

Argonne Leadership Computing Facility (ALCF)

The ALCF enables breakthroughs in science and engineering by providing supercomputing resources and expertise to the research community.

Images from The Computer That Will Change Everything – Chicago Magazine

🏗️ Aurora

Table 1: Aurora1 Specs

  Property   Value
  --------   -------
  Racks      166
  Nodes      10,624
  XPUs2      127,488
  CPUs       21,248
  NICs       84,992
  HBM        8 PB
  DDR5       10 PB

Figure 1: Aurora Fact Sheet.
  1. 🏆 Aurora Supercomputer Ranks Fastest for AI

  2. Each node has 6 Intel Data Center GPU Max 1550 GPUs (code-named “Ponte Vecchio”), each comprising 2 tiles (XPUs), i.e. 12 XPUs per node.
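As a quick sanity check, the per-node counts implied by Table 1 and footnote 2 (6 GPUs × 2 tiles, 2 CPUs, and 8 NICs per node) multiply out to the system totals:

```latex
% Per-node resources scaled to the full 10,624-node system (values from Table 1)
\begin{aligned}
\mathrm{XPUs} &= 10{,}624 \times 6\,\tfrac{\mathrm{GPUs}}{\mathrm{node}} \times 2\,\tfrac{\mathrm{tiles}}{\mathrm{GPU}} = 127{,}488 \\
\mathrm{CPUs} &= 10{,}624 \times 2\,\tfrac{\mathrm{CPUs}}{\mathrm{node}} = 21{,}248 \\
\mathrm{NICs} &= 10{,}624 \times 8\,\tfrac{\mathrm{NICs}}{\mathrm{node}} = 84{,}992
\end{aligned}
```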

🤖 ALCF AI Testbed

  • ALCF AI Testbed Systems are in production and available for allocations to the research community
  • Significant improvement in time-to-solution and energy-efficiency for diverse AI for science applications.
  • NAIRR Pilot

Up to 25× improvement for genomic foundation models with 6.5× energy efficiency

Figure 2: SambaNova SN-30: 2nd Gen, 8 nodes with 64 AI Accelerators
Figure 3: Graphcore Bow: Pod-64 configuration with 64 accelerators
Figure 4: Cerebras: 2x CS-2 WSE with Memory-X and Swarm-X technologies
Figure 5: GroqRack: 9 nodes, 8 GroqChip v1.5 Tensor Streaming Processor accelerators per node

🔭 AI-for-Science
Image source: @tenderizzation

ChatGPT: explain this image

🌌 AuroraGPT (2024–)

AuroraGPT: a general-purpose scientific LLM, broadly trained on general corpora plus scientific {papers, texts, data}

  • Explore pathways towards a “Scientific Assistant” model
  • Build with international partners (RIKEN, BSC, others)
  • Multimodal: images, tables, equations, proofs, time series, graphs, fields, sequences, etc
Figure 6: Image from Hannibal046 / Awesome-LLM

🧪 AuroraGPT: Open Science Foundation Model

Figure 7: High-level overview of AuroraGPT project

🧰 AuroraGPT: Toolbox

  • Datasets and data pipelines (how do we deal with scientific data?)
  • Software infrastructure and workflows (scalable, robust, extensible)
  • Evaluation of state-of-the-art LLM Models (how do they perform on scientific tasks?)

🚂 Training

argonne-lcf/Megatron-DeepSpeed
Large Model Training: Any Scale, Any Accelerator

🏃‍♂️ Running

argonne-lcf/inference-endpoints
Inference endpoints for LLMs, hosted @ ALCF

👥 Team Leads

  • Planning: Rick Stevens1, Ian Foster, Rinku Gupta, Mike Papka, Arvind Ramanathan, Fangfang Xia
  • Data: Ian Foster, Robert Underwood
  • Training: Venkat Vishwanath, Sam Foreman
  • Evaluation: Franck Cappello, Sandeep Madireddy, Bo Li
  • Post: Eliu Huerta, Azton Wells
  • Inference: Rajeev Thakur
  • Comms: Charlie Catlett, David Martin
  • Distribution: Brad Ullrich
  1. Lead

🤝 Teams

  • Planning
  • Data Prep
    • Accumulate 20+ T tokens of high-quality scientific text and structured data
  • Models / Training1
    • Train (entirely from scratch) a series of models on publicly available data
  • Evaluation
    • Skills, trustworthiness, safety, robustness, privacy, machine ethics
  • Post-Training
    • Fine-tuning, alignment
  • Inference
    • Model serving, API development / public-facing web services
  • Distribution
    • Licensing, generating and distributing artifacts for public consumption
  • Communication
  1. Co-led by: Venkat Vishwanath, Sam Foreman

🏋️ Challenges: In Practice

This is incredibly difficult in practice, due in part to:

  • Brand new {hardware, architecture, software}
  • Lack of native support in existing frameworks (though getting better!)
  • General system stability
    10k+ Nodes $\left(\times\ \frac{12\ \mathrm{XPU}}{1\ \mathrm{Node}}\right) \Rightarrow$ 100k+ XPUs
    • network performance
    • file system stability (impacted by other users!)
    • many unexpected difficulties occur at increasingly large scales
  • Combinatorial explosion of possible configurations and experiments
    • {hyperparameters, architectures, tokenizers, learning rates, …}

💾 AuroraGPT: Training

  • To train a fixed model on trillions of tokens requires:
    1. Aggregating data from multiple different corpora
      (e.g. ArXiv, Reddit, StackExchange, GitHub, Wikipedia, etc.)
    2. Sampling each training batch according to a fixed distribution across corpora
    3. Building indices that map batches of tokens into these files (indexing)

    The original implementation was slow:

    • Designed to run serially on a single device
    • Major bottleneck when debugging data pipeline at scale
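As a concrete (and heavily simplified) illustration of steps 2–3, the sketch below samples a batch according to a fixed corpus distribution and records a toy index mapping each sample back to its corpus. The corpus names, sizes, and weights are made up; this is not the Megatron-DeepSpeed implementation.

```python
"""
Toy sketch of corpus-weighted sampling and indexing (illustrative only).
"""
import numpy as np

# Hypothetical corpora with document counts and target sampling weights
corpora = {"arxiv": 1_000_000, "github": 4_000_000, "wikipedia": 500_000}
weights = {"arxiv": 0.3, "github": 0.5, "wikipedia": 0.2}  # must sum to 1

rng = np.random.default_rng(seed=0)
names = list(corpora)
probs = np.array([weights[n] for n in names])

def build_index(num_samples: int) -> list[tuple[str, int]]:
    """Map each global sample id -> (corpus name, local document id)."""
    choices = rng.choice(len(names), size=num_samples, p=probs)
    index = []
    for c in choices:
        corpus = names[c]
        index.append((corpus, int(rng.integers(corpora[corpus]))))
    return index

# One "batch" of 8 samples drawn according to the fixed corpus distribution
print(build_index(8))
```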

🍹 AuroraGPT: Blending Data, Efficiently

  • 🐢 Original implementation:
    • Slow (serial, single device)
    • ~1 hr / 2T tokens
  • 🐇 New implementation:
    • Fast! (distributed, asynchronous)
    • ~2 min / 2T tokens
      (~30× faster!)
Figure 8: Time spent preparing 2T tokens
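A minimal sketch of the idea behind the distributed, asynchronous version: assign corpora to MPI ranks round-robin so index construction proceeds in parallel, then gather the partial results. The helper build_corpus_index() and the corpus list are placeholders, not the actual AuroraGPT code.

```python
"""
Toy sketch of parallelizing per-corpus index construction across MPI ranks.
Run with e.g.: mpiexec -n 4 python blend.py
"""
from mpi4py import MPI

def build_corpus_index(corpus: str) -> dict:
    # Placeholder for the expensive per-corpus indexing work
    return {"corpus": corpus, "num_docs": hash(corpus) % 1_000_000}

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

corpora = ["arxiv", "github", "wikipedia", "stackexchange", "reddit", "pes2o"]

# Round-robin assignment of corpora to ranks; each rank works independently
local = [build_corpus_index(c) for c in corpora[rank::size]]

# Gather every rank's partial indices so all ranks see the full blend
full_index = [entry for part in comm.allgather(local) for entry in part]
if rank == 0:
    print(f"Built {len(full_index)} corpus indices across {size} ranks")
```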

📉 Loss Curve: Training AuroraGPT-7B on 2T Tokens

Figure 9: Loss curve during training on 2T tokens.

AuroraGPT-2B Pre-Training

Figure 10: Loss curve for (new) AuroraGPT-2B model trained on 7T tokens.

✨ Features

  • 🕸️ Parallelism:
    • {data, tensor, pipeline, sequence, …}
  • ♻️ Checkpoint Converters:
    • Megatron ⇄ 🤗 HF ⇄ ZeRO ⇄ Universal
  • 🔀 DeepSpeed Integration:
    • ZeRO Offloading
    • Activation checkpointing
    • AutoTP (WIP)
    • ability to leverage features from DeepSpeed community
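For a sense of what the DeepSpeed integration looks like from user code, here is a hedged, minimal sketch of a ZeRO-3 config with CPU offloading and activation checkpointing passed to deepspeed.initialize(). The actual AuroraGPT configuration lives in argonne-lcf/Megatron-DeepSpeed and differs in detail.

```python
"""
Minimal DeepSpeed sketch (illustrative only); launch with the deepspeed
launcher or mpiexec so the distributed backend is initialized.
"""
import deepspeed
import torch

model = torch.nn.Linear(1024, 1024)  # stand-in for the real transformer

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # partition params, grads, optimizer state
        "offload_optimizer": {"device": "cpu"},  # ZeRO offloading
    },
    "activation_checkpointing": {"partition_activations": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 3e-4}},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```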

✨ Features (even more!)

  • 🧗 Optimizers1:
    • Support for many different optimizers:
      • Distributed Shampoo, Muon, ADOPT, Sophia, LAMB, GaLore, ScheduleFree, …
    • See full list
    • Large batch training
  • 📊 Experiment Tracking:
    • Automatic experiment and metric tracking with Weights & Biases
  1. Implemented by Marieme Ngom
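The automatic metric tracking amounts to logging scalars to Weights & Biases at each step; a minimal sketch (project and run names here are invented):

```python
"""
Minimal Weights & Biases logging sketch (illustrative names, toy loss).
"""
import wandb

run = wandb.init(project="AuroraGPT", name="agpt-7b-2T-tokens")
for step in range(100):
    loss = 10.0 / (step + 1)                    # stand-in for the training loss
    run.log({"train/loss": loss, "train/step": step})
run.finish()
```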

🧬 MProt-DPO

  • Finalist: SC’24 ACM Gordon Bell Prize
    • MProt-DPO: Breaking the ExaFLOPS Barrier for Multimodal Protein Design Workflows with Direct Preference Optimization (Dharuman et al. (2024))
  • One of the first protein design toolkits that integrates:
    • Text, (protein/gene) sequence, structure/conformational sampling modalities to build aligned representations for protein sequence-function mapping
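For reference, the DPO objective at the heart of the workflow rewards the policy for ranking preferred sequences above rejected ones relative to a frozen reference model. A minimal PyTorch sketch of the standard DPO loss (not the exact MProt-DPO code):

```python
"""
Standard DPO loss sketch: inputs are summed log-probs of chosen/rejected
sequences under the trained policy and the frozen reference model.
"""
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    # How much more the policy prefers each sequence than the reference does
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)
    # -log sigmoid of the reward margin: push chosen above rejected
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with a batch of 4 sequence log-probabilities
lp = lambda: torch.randn(4)
print(dpo_loss(lp(), lp(), lp(), lp()))
```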

🧬 Scaling Results (2024)

Figure 11: Scaling results for 3.5B model across ~38,400 GPUs
  • ~ 4 EFLOPS @ Aurora

  • 38,400 XPUs = 3,200 [nodes] × 12 [XPU / node]

  • 🎖️ Gordon Bell Finalist:

    • MProt-DPO: Breaking the ExaFLOPS Barrier for Multimodal Protein Design Workflows (Dharuman et al. (2024))

This novel work presents a scalable, multimodal workflow for protein design that trains an LLM to generate protein sequences, computationally evaluates the generated sequences, and then exploits them to fine-tune the model.

Direct Preference Optimization steers the LLM toward the generation of preferred sequences, and enhanced workflow technology enables its efficient execution. A 3.5B and a 7B model demonstrate scalability and exceptional mixed precision performance of the full workflow on ALPS, Aurora, Frontier, Leonardo and PDX.

🧬 MProt-DPO: Scaling Results

Figure 12: 3.5B model
Figure 13: 7B model

🚂 Loooooooooong Sequence Lengths

  • Working with Microsoft/DeepSpeed team to enable longer sequence lengths (context windows) for LLMs
    • See my blog post for additional details

Figure 14: Maximum (achievable) SEQ_LEN for the 25B and 33B models (see Song et al. (2023))

scaling4science
Megatron-DS-Benchmarking
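A rough sketch of why sequence parallelism helps: if attention scores were materialized naively (ignoring fused/flash-attention kernels), their per-device memory would grow with the square of the sequence length, and partitioning the sequence across SP ranks divides it. The model dimensions below are illustrative, not the 25B/33B configurations.

```python
"""
Back-of-the-envelope activation memory for naive attention scores, per layer,
in bf16, as the sequence-parallel degree grows (illustrative numbers only).
"""
def attn_scores_gib(seq_len, n_heads=32, sp_degree=1, bytes_per_el=2):
    # One layer's attention score matrix: heads x (seq/sp) x seq
    return n_heads * (seq_len // sp_degree) * seq_len * bytes_per_el / 2**30

for sp in (1, 4, 16, 64):
    print(f"SP={sp:>3}: {attn_scores_gib(seq_len=128_000, sp_degree=sp):8.1f} GiB / layer")
```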

🌎 AERIS (2025)

Figure 15: arXiv:2509.13523
Figure 16: Pixel-level Swin diffusion transformer in sizes from [1–80]B

We demonstrate a significant advancement in AI weather and climate modeling with AERIS by efficient scaling of window-based transformer models. We have performed global medium-range forecasts with performance competitive with GenCast and surpassing the IFS ENS model, with longer, 90-day rollouts showing our ability to learn atmospheric dynamics on seasonal scales without collapsing, becoming the first diffusion-based model that can work across forecast scales from 6 hours all the way to 3 months, with remarkably accurate out-of-distribution predictions of extreme events.

👀 High-Level Overview of AERIS

Figure 17: Rollout of AERIS model, specific humidity at 700m.
Table 2: Overview of AERIS model and training setup

  Property             Description
  ------------------   -----------------
  Domain               Global
  Resolution           0.25° & 1.4°
  Training Data        ERA5 (1979–2018)
  Model Architecture   Swin Transformer
  Speedup1             O(10k–100k)

  1. Relative to PDE-based models, e.g. GFS

➕ Contributions

☔ AERIS

First billion-parameter diffusion model for weather + climate

  • Operates at the pixel level (1 × 1 patch size), guided by physical priors
  • Medium-range forecast skill:
    • Surpasses IFS ENS, competitive with GenCast1
    • Uniquely stable on seasonal scales to 90 days

🌀 SWiPe

A novel 3D (sequence-window-pipeline) parallelism strategy for training transformers across high-resolution inputs

  • Enables scalable small-batch training on large supercomputers2
    • 10.21 ExaFLOPS
    • @ 121,000 Intel XPUs (Aurora)
  1. GenCast: A Generative Model for Medium-Range Global Weather Forecasting (Price et al. (2024))

  2. Demonstrated on up to 120,960 GPUs on Aurora and 8,064 GPUs on LUMI.

⚠️ Issues with the Deterministic Approach

  • Transformers:
    • Deterministic
    • Single input → single forecast
  • Diffusion:
    • Probabilistic
    • Single input → ensemble of forecasts
    • Captures uncertainty and variability in weather predictions
    • Enables ensemble forecasting for better risk assessment

🎲 Transitioning to a Probabilistic Model

Figure 18: Reverse diffusion with the input condition, individual sampling steps $t_0 \rightarrow t_{64}$, the next time step estimate, and the target output.

Reverse Diffusion Process ($\mathcal{N} \rightarrow \pi$)

Forward Diffusion Process ($\pi \rightarrow \mathcal{N}$)
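For concreteness, a generic DDPM-style statement of the two processes named above (the exact AERIS parameterization may differ):

```latex
% Forward (noising) process, \pi -> \mathcal{N}: data x_0 is gradually corrupted with Gaussian noise
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)

% Reverse (denoising) process, \mathcal{N} -> \pi: a learned model walks back from noise to a sample
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
```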

🌀 Sequence-Window-Pipeline Parallelism (SWiPe)

  • SWiPe is a novel parallelism strategy for Swin-based Transformers
  • Hybrid 3D Parallelism strategy, combining:
    • Sequence parallelism (SP)
    • Window parallelism (WP)
    • Pipeline parallelism (PP)
Figure 19
Figure 20: SWiPe Communication Patterns
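A toy sketch of how a 3D SWiPe-style rank layout could be constructed: each global rank gets (pipeline, window, sequence) coordinates, and the communication groups along each axis fall out of the decomposition. The degrees and ordering here are invented for illustration; the actual implementation is described in the AERIS paper.

```python
"""
Illustrative 3D (sequence x window x pipeline) rank decomposition.
"""
from itertools import product

SP, WP, PP = 4, 2, 3            # sequence / window / pipeline parallel degrees
world_size = SP * WP * PP

# Enumerate coordinates in (pp, wp, sp) order: ranks in the same sp group are contiguous
coords = list(product(range(PP), range(WP), range(SP)))
rank_of = {c: r for r, c in enumerate(coords)}

def groups(axis):
    """Ranks that communicate along one parallel axis (fix the other two coords)."""
    out = []
    if axis == "sp":
        for pp, wp in product(range(PP), range(WP)):
            out.append([rank_of[(pp, wp, sp)] for sp in range(SP)])
    elif axis == "wp":
        for pp, sp in product(range(PP), range(SP)):
            out.append([rank_of[(pp, wp, sp)] for wp in range(WP)])
    else:  # "pp"
        for wp, sp in product(range(WP), range(SP)):
            out.append([rank_of[(pp, wp, sp)] for pp in range(PP)])
    return out

print(f"world size = {world_size}")
print("one sequence-parallel group:", groups("sp")[0])
print("one window-parallel group:  ", groups("wp")[0])
print("one pipeline group:         ", groups("pp")[0])
```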

🚀 AERIS: Scaling Results

Figure 21: AERIS: Scaling Results
  • 10 EFLOPS (sustained) @ 120,960 GPUs
  • See Hatanpää et al. (2025) for additional details
  • arXiv:2509.13523

🌪️ Hurricane Laura

Figure 22: Hurricane Laura tracks (top) and intensity (bottom). Initialized 7(a), 5(b) and 3(c) days prior to 2020-08-28T00z.

📓 References

Dharuman, Gautham, Kyle Hippe, Alexander Brace, Sam Foreman, Väinö Hatanpää, Varuni K. Sastry, Huihuo Zheng, et al. 2024. “MProt-DPO: Breaking the ExaFLOPS Barrier for Multimodal Protein Design Workflows with Direct Preference Optimization.” In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis. SC ’24. Atlanta, GA, USA: IEEE Press. https://doi.org/10.1109/SC41406.2024.00013.
Hatanpää, Väinö, Eugene Ku, Jason Stock, Murali Emani, Sam Foreman, Chunyong Jung, Sandeep Madireddy, et al. 2025. “AERIS: Argonne Earth Systems Model for Reliable and Skillful Predictions.” https://arxiv.org/abs/2509.13523.
Price, Ilan, Alvaro Sanchez-Gonzalez, Ferran Alet, Tom R. Andersson, Andrew El-Kadi, Dominic Masters, Timo Ewalds, et al. 2024. “GenCast: Diffusion-Based Ensemble Forecasting for Medium-Range Weather.” https://arxiv.org/abs/2312.15796.
Song, Shuaiwen Leon, Bonnie Kruft, Minjia Zhang, Conglong Li, Shiyang Chen, Chengming Zhang, Masahiro Tanaka, et al. 2023. “DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery Through Sophisticated AI System Technologies.” https://arxiv.org/abs/2310.04610.

❤️ Acknowledgements

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

Extras

Timeline

How I got here:

  • 2015: UIUC, B.S. in Engineering Physics + Applied Mathematics
  • 2018: Received SCGSR Fellowship (allowed me to finish my PhD at ANL)
  • 2019: University of Iowa, Ph.D. in Physics
  • 2022: Argonne Leadership Computing Facility, Postdoctoral Researcher
  • Present: Argonne National Laboratory, Computational Scientist (HPC & Large-Scale AI)

samforeman.me/talks/2025/10/24/slides
