Training Foundation Models on Supercomputers

Sam Foreman

Argonne National Laboratory


@ University of Illinois at Urbana-Champaign

2025-10-24

🧑🏻‍💻 About Me

  • 🏡 samforeman.me
  • UIUC (2015):
    • Engineering Physics + Applied Mathematics
  • University of Iowa (2015–2019):
    • Ph.D. in Physics1
  • ANL (2019–2022): Postdoctoral Researcher
  • ANL (2022–Present): Assistant Computational Scientist
    • Member of the AI/ML Group at ALCF

Current Research:

  • AuroraGPT: Foundation Models for Science
  • AERIS: Argonne’s Earth System Model
    • Finalist for the 2025 ACM Gordon Bell Prize in Climate Modeling
  • MProt-DPO: Multimodal Protein Design
    • Finalist for the ACM Gordon Bell Prize 2024
  • GenSLMs: Genome Scale Language Models.
    • Winner of the ACM Gordon Bell Special Prize for HPC-Based COVID-19 Research
  1. A Machine Learning Approach to Lattice Gauge Theory

Argonne Leadership Computing Facility (ALCF)

The ALCF enables breakthroughs in science and engineering by providing supercomputing resources and expertise to the research community.

Images from The Computer That Will Change Everything – Chicago Magazine

🏗️ Aurora

Table 1: Aurora1 Specs

  Property   Value
  --------   -------
  Racks      166
  Nodes      10,624
  XPUs2      127,488
  CPUs       21,248
  NICs       84,992
  HBM        8 PB
  DDR5       10 PB

Figure 1: Aurora Fact Sheet.
  1. 🏆 Aurora Supercomputer Ranks Fastest for AI

  2. Each node has 6 Intel Data Center GPU Max 1550 GPUs (code-named “Ponte Vecchio”), each comprising 2 tiles (XPUs), i.e. 12 XPUs per node.
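As a quick sanity check, the per-node counts implied by Table 1 and footnote 2 (6 GPUs × 2 tiles, 2 CPUs, and 8 NICs per node) multiply out to the system totals:

```latex
% Per-node resources scaled to the full 10,624-node system (values from Table 1)
\begin{aligned}
\mathrm{XPUs} &= 10{,}624 \times 6\,\tfrac{\mathrm{GPUs}}{\mathrm{node}} \times 2\,\tfrac{\mathrm{tiles}}{\mathrm{GPU}} = 127{,}488 \\
\mathrm{CPUs} &= 10{,}624 \times 2\,\tfrac{\mathrm{CPUs}}{\mathrm{node}} = 21{,}248 \\
\mathrm{NICs} &= 10{,}624 \times 8\,\tfrac{\mathrm{NICs}}{\mathrm{node}} = 84{,}992
\end{aligned}
```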

🤖 ALCF AI Testbed

  • ALCF AI Testbed Systems are in production and available for allocations to the research community
  • Significant improvement in time-to-solution and energy-efficiency for diverse AI for science applications.
  • NAIRR Pilot

Up to 25× improvement for genomic foundation models with 6.5× energy efficiency

Figure 2: SambaNova SN-30: 2nd Gen, 8 nodes with 64 AI Accelerators
Figure 3: Graphcore Bow: Pod-64 configuration with 64 accelerators
Figure 4: Cerebras: 2x CS-2 WSE with Memory-X and Swarm-X technologies
Figure 5: GroqRack: 9 nodes, 8 GroqChip v1.5 Tensor Streaming Processor accelerators per node

🔭 AI-for-Science
Image source: @tenderizzation

ChatGPT: explain this image

🌌 AuroraGPT (2024–)

AuroraGPT: a general-purpose scientific LLM, broadly trained on general corpora plus scientific {papers, texts, data}

  • Explore pathways towards a “Scientific Assistant” model
  • Build with international partners (RIKEN, BSC, others)
  • Multimodal: images, tables, equations, proofs, time series, graphs, fields, sequences, etc
Figure 6: Image from Hannibal046 / Awesome-LLM

🧪 AuroraGPT: Open Science Foundation Model

Figure 7: High-level overview of AuroraGPT project

🧰 AuroraGPT: Toolbox

  • Datasets and data pipelines (how do we deal with scientific data?)
  • Software infrastructure and workflows (scalable, robust, extensible)
  • Evaluation of state-of-the-art LLM Models (how do they perform on scientific tasks?)

🚂 Training

argonne-lcf/Megatron-DeepSpeed
Large Model Training: Any Scale, Any Accelerator

🏃‍♂️ Running

argonne-lcf/inference-endpoints
Inference endpoints for LLMs, hosted @ ALCF

👥 Team Leads

  • Planning: Rick Stevens1, Ian Foster, Rinku Gupta, Mike Papka, Arvind Ramanathan, Fangfang Xia
  • Data: Ian Foster, Robert Underwood
  • Training: Venkat Vishwanath, Sam Foreman
  • Evaluation: Franck Cappello, Sandeep Madireddy, Bo Li
  • Post: Eliu Huerta, Azton Wells
  • Inference: Rajeev Thakur
  • Comms: Charlie Catlett, David Martin
  • Distribution: Brad Ullrich
  1. Lead

🤝 Teams

  • Planning
  • Data Prep
    • Accumulate 20+ T tokens of high-quality scientific text and structured data
  • Models / Training1
    • Train (entirely from scratch) a series of models on publicly available data
  • Evaluation
    • Skills, trustworthiness, safety, robustness, privacy, machine ethics
  • Post-Training
    • Fine-tuning, alignment
  • Inference
    • Model serving, API development / public-facing web services
  • Distribution
    • Licensing, generating and distributing artifacts for public consumption
  • Communication
  1. Co-led by: Venkat Vishwanath, Sam Foreman

🏋️ Challenges: In Practice

This is incredibly difficult in practice, due in part to:

  • Brand new {hardware, architecture, software}
  • Lack of native support in existing frameworks (though getting better!)
  • General system stability
    10k+ Nodes $\left(\times\ \frac{12\ \mathrm{XPU}}{1\ \mathrm{Node}}\right) \Rightarrow$ 100k+ XPUs
    • network performance
    • file system stability (impacted by other users!)
    • many unexpected difficulties occur at increasingly large scales
  • Combinatorial explosion of possible configurations and experiments
    • {hyperparameters, architectures, tokenizers, learning rates, …}

💾 AuroraGPT: Training

  • To train a fixed model on trillions of tokens requires:
    1. Aggregating data from multiple different corpora
      (e.g. ArXiv, Reddit, StackExchange, GitHub, Wikipedia, etc.)
    2. Sampling each training batch according to a fixed distribution across corpora
    3. Building indices that map batches of tokens into these files (indexing)

    The original implementation was slow:

    • Designed to run serially on a single device
    • Major bottleneck when debugging data pipeline at scale
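As a concrete (and heavily simplified) illustration of steps 2–3, the sketch below samples a batch according to a fixed corpus distribution and records a toy index mapping each sample back to its corpus. The corpus names, sizes, and weights are made up; this is not the Megatron-DeepSpeed implementation.

```python
"""
Toy sketch of corpus-weighted sampling and indexing (illustrative only).
"""
import numpy as np

# Hypothetical corpora with document counts and target sampling weights
corpora = {"arxiv": 1_000_000, "github": 4_000_000, "wikipedia": 500_000}
weights = {"arxiv": 0.3, "github": 0.5, "wikipedia": 0.2}  # must sum to 1

rng = np.random.default_rng(seed=0)
names = list(corpora)
probs = np.array([weights[n] for n in names])

def build_index(num_samples: int) -> list[tuple[str, int]]:
    """Map each global sample id -> (corpus name, local document id)."""
    choices = rng.choice(len(names), size=num_samples, p=probs)
    index = []
    for c in choices:
        corpus = names[c]
        index.append((corpus, int(rng.integers(corpora[corpus]))))
    return index

# One "batch" of 8 samples drawn according to the fixed corpus distribution
print(build_index(8))
```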

🍹 AuroraGPT: Blending Data, Efficiently

  • 🐢 Original implementation:
    • Slow (serial, single device)
    • ~1 hr / 2T tokens
  • 🐇 New implementation:
    • Fast! (distributed, asynchronous)
    • ~2 min / 2T tokens
      (~30× faster!)
Figure 8: Time spent preparing 2T tokens
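A minimal sketch of the idea behind the distributed, asynchronous version: assign corpora to MPI ranks round-robin so index construction proceeds in parallel, then gather the partial results. The helper build_corpus_index() and the corpus list are placeholders, not the actual AuroraGPT code.

```python
"""
Toy sketch of parallelizing per-corpus index construction across MPI ranks.
Run with e.g.: mpiexec -n 4 python blend.py
"""
from mpi4py import MPI

def build_corpus_index(corpus: str) -> dict:
    # Placeholder for the expensive per-corpus indexing work
    return {"corpus": corpus, "num_docs": hash(corpus) % 1_000_000}

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

corpora = ["arxiv", "github", "wikipedia", "stackexchange", "reddit", "pes2o"]

# Round-robin assignment of corpora to ranks; each rank works independently
local = [build_corpus_index(c) for c in corpora[rank::size]]

# Gather every rank's partial indices so all ranks see the full blend
full_index = [entry for part in comm.allgather(local) for entry in part]
if rank == 0:
    print(f"Built {len(full_index)} corpus indices across {size} ranks")
```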

📉 Loss Curve: Training AuroraGPT-7B on 2T Tokens

Figure 9: Loss curve during training on 2T tokens.

AuroraGPT-2B Pre-Training

Figure 10: Loss curve for (new) AuroraGPT-2B model trained on 7T tokens.

✨ Features

  • 🕸️ Parallelism:
    • {data, tensor, pipeline, sequence, …}
  • ♻️ Checkpoint Converters:
    • Megatron ⇄ 🤗 HF ⇄ ZeRO ⇄ Universal
  • 🔀 DeepSpeed Integration:
    • ZeRO Offloading
    • Activation checkpointing
    • AutoTP (WIP)
    • ability to leverage features from DeepSpeed community
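For a sense of what the DeepSpeed integration looks like from user code, here is a hedged, minimal sketch of a ZeRO-3 config with CPU offloading and activation checkpointing passed to deepspeed.initialize(). The actual AuroraGPT configuration lives in argonne-lcf/Megatron-DeepSpeed and differs in detail.

```python
"""
Minimal DeepSpeed sketch (illustrative only); launch with the deepspeed
launcher or mpiexec so the distributed backend is initialized.
"""
import deepspeed
import torch

model = torch.nn.Linear(1024, 1024)  # stand-in for the real transformer

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # partition params, grads, optimizer state
        "offload_optimizer": {"device": "cpu"},  # ZeRO offloading
    },
    "activation_checkpointing": {"partition_activations": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 3e-4}},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```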

✨ Features (even more!)

  • 🧗 Optimizers1:
    • Support for many different optimizers:
      • Distributed Shampoo, Muon, ADOPT, Sophia, LAMB, GaLore, ScheduleFree, …
    • See full list
    • Large batch training
  • 📊 Experiment Tracking:
    • Automatic experiment and metric tracking with Weights & Biases
  1. Implemented by Marieme Ngom
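The automatic metric tracking amounts to logging scalars to Weights & Biases at each step; a minimal sketch (project and run names here are invented):

```python
"""
Minimal Weights & Biases logging sketch (illustrative names, toy loss).
"""
import wandb

run = wandb.init(project="AuroraGPT", name="agpt-7b-2T-tokens")
for step in range(100):
    loss = 10.0 / (step + 1)                    # stand-in for the training loss
    run.log({"train/loss": loss, "train/step": step})
run.finish()
```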

🧬 MProt-DPO

  • Finalist: SC’24 ACM Gordon Bell Prize
    • MProt-DPO: Breaking the ExaFLOPS Barrier for Multimodal Protein Design Workflows with Direct Preference Optimization (Dharuman et al. (2024))
  • One of the first protein design toolkits that integrates:
    • Text, (protein/gene) sequence, structure/conformational sampling modalities to build aligned representations for protein sequence-function mapping
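For reference, the DPO objective at the heart of the workflow rewards the policy for ranking preferred sequences above rejected ones relative to a frozen reference model. A minimal PyTorch sketch of the standard DPO loss (not the exact MProt-DPO code):

```python
"""
Standard DPO loss sketch: inputs are summed log-probs of chosen/rejected
sequences under the trained policy and the frozen reference model.
"""
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    # How much more the policy prefers each sequence than the reference does
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)
    # -log sigmoid of the reward margin: push chosen above rejected
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with a batch of 4 sequence log-probabilities
lp = lambda: torch.randn(4)
print(dpo_loss(lp(), lp(), lp(), lp()))
```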

🧬 Scaling Results (2024)

Figure 11: Scaling results for 3.5B model across ~38,400 GPUs
  • ~ 4 EFLOPS @ Aurora

  • 38,400 XPUs = 3,200 [nodes] × 12 [XPU / node]

  • 🎖️ Gordon Bell Finalist:

    • MProt-DPO: Breaking the ExaFLOPS Barrier for Multimodal Protein Design Workflows (Dharuman et al. (2024))

This novel work presents a scalable, multimodal workflow for protein design that trains an LLM to generate protein sequences, computationally evaluates the generated sequences, and then exploits them to fine-tune the model.

Direct Preference Optimization steers the LLM toward the generation of preferred sequences, and enhanced workflow technology enables its efficient execution. A 3.5B and a 7B model demonstrate scalability and exceptional mixed precision performance of the full workflow on ALPS, Aurora, Frontier, Leonardo and PDX.

🧬 MProt-DPO: Scaling Results

Figure 12: 3.5B model
Figure 13: 7B model

🚂 Loooooooooong Sequence Lengths

  • Working with Microsoft/DeepSpeed team to enable longer sequence lengths (context windows) for LLMs
    • See my blog post for additional details

Figure 14: Maximum (achievable) SEQ_LEN for the 25B and 33B models (see Song et al. (2023))

scaling4science
Megatron-DS-Benchmarking
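A rough sketch of why sequence parallelism helps: if attention scores were materialized naively (ignoring fused/flash-attention kernels), their per-device memory would grow with the square of the sequence length, and partitioning the sequence across SP ranks divides it. The model dimensions below are illustrative, not the 25B/33B configurations.

```python
"""
Back-of-the-envelope activation memory for naive attention scores, per layer,
in bf16, as the sequence-parallel degree grows (illustrative numbers only).
"""
def attn_scores_gib(seq_len, n_heads=32, sp_degree=1, bytes_per_el=2):
    # One layer's attention score matrix: heads x (seq/sp) x seq
    return n_heads * (seq_len // sp_degree) * seq_len * bytes_per_el / 2**30

for sp in (1, 4, 16, 64):
    print(f"SP={sp:>3}: {attn_scores_gib(seq_len=128_000, sp_degree=sp):8.1f} GiB / layer")
```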

🌎 AERIS (2025)

Figure 15: arXiv:2509.13523
Figure 16: Pixel-level Swin diffusion transformer in sizes from [1–80]B

We demonstrate a significant advancement in AI weather and climate modeling with AERIS by efficient scaling of window-based transformer models. We have performed global medium-range forecasts with performance competitive with GenCast and surpassing the IFS ENS model, with longer, 90-day rollouts showing our ability to learn atmospheric dynamics on seasonal scales without collapsing, becoming the first diffusion-based model that can work across forecast scales from 6 hours all the way to 3 months, with remarkably accurate out-of-distribution predictions of extreme events.

👀 High-Level Overview of AERIS

Figure 17: Rollout of AERIS model, specific humidity at 700m.
Table 2: Overview of AERIS model and training setup

  Property             Description
  ------------------   -----------------
  Domain               Global
  Resolution           0.25° & 1.4°
  Training Data        ERA5 (1979–2018)
  Model Architecture   Swin Transformer
  Speedup1             O(10k–100k)

  1. Relative to PDE-based models, e.g. GFS

➕ Contributions

☔ AERIS

First billion-parameter diffusion model for weather + climate

  • Operates at the pixel level (1 × 1 patch size), guided by physical priors
  • Medium-range forecast skill:
    • Surpasses IFS ENS, competitive with GenCast1
    • Uniquely stable on seasonal scales to 90 days

🌀 SWiPe

A novel 3D (sequence-window-pipeline) parallelism strategy for training transformers across high-resolution inputs

  • Enables scalable small-batch training on large supercomputers2
    • 10.21 ExaFLOPS
    • @ 121,000 Intel XPUs (Aurora)
  1. GenCast: A Generative Model for Medium-Range Global Weather Forecasting (Price et al. (2024))

  2. Demonstrated on up to 120,960 GPUs on Aurora and 8,064 GPUs on LUMI.

⚠️ Issues with the Deterministic Approach

  • Transformers:
    • Deterministic
    • Single input → single forecast
  • Diffusion:
    • Probabilistic
    • Single input → ensemble of forecasts
    • Captures uncertainty and variability in weather predictions
    • Enables ensemble forecasting for better risk assessment

🎲 Transitioning to a Probabilistic Model

Figure 18: Reverse diffusion with the input condition, individual sampling steps $t_0 \rightarrow t_{64}$, the next time step estimate, and the target output.

Reverse Diffusion Process ($\mathcal{N} \rightarrow \pi$)

Forward Diffusion Process ($\pi \rightarrow \mathcal{N}$)
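For concreteness, a generic DDPM-style statement of the two processes named above (the exact AERIS parameterization may differ):

```latex
% Forward (noising) process, \pi -> \mathcal{N}: data x_0 is gradually corrupted with Gaussian noise
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)

% Reverse (denoising) process, \mathcal{N} -> \pi: a learned model walks back from noise to a sample
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
```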

🌀 Sequence-Window-Pipeline Parallelism (SWiPe)

  • SWiPe is a novel parallelism strategy for Swin-based Transformers
  • Hybrid 3D Parallelism strategy, combining:
    • Sequence parallelism (SP)
    • Window parallelism (WP)
    • Pipeline parallelism (PP)
Figure 19
Figure 20: SWiPe Communication Patterns
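A toy sketch of how a 3D SWiPe-style rank layout could be constructed: each global rank gets (pipeline, window, sequence) coordinates, and the communication groups along each axis fall out of the decomposition. The degrees and ordering here are invented for illustration; the actual implementation is described in the AERIS paper.

```python
"""
Illustrative 3D (sequence x window x pipeline) rank decomposition.
"""
from itertools import product

SP, WP, PP = 4, 2, 3            # sequence / window / pipeline parallel degrees
world_size = SP * WP * PP

# Enumerate coordinates in (pp, wp, sp) order: ranks in the same sp group are contiguous
coords = list(product(range(PP), range(WP), range(SP)))
rank_of = {c: r for r, c in enumerate(coords)}

def groups(axis):
    """Ranks that communicate along one parallel axis (fix the other two coords)."""
    out = []
    if axis == "sp":
        for pp, wp in product(range(PP), range(WP)):
            out.append([rank_of[(pp, wp, sp)] for sp in range(SP)])
    elif axis == "wp":
        for pp, sp in product(range(PP), range(SP)):
            out.append([rank_of[(pp, wp, sp)] for wp in range(WP)])
    else:  # "pp"
        for wp, sp in product(range(WP), range(SP)):
            out.append([rank_of[(pp, wp, sp)] for pp in range(PP)])
    return out

print(f"world size = {world_size}")
print("one sequence-parallel group:", groups("sp")[0])
print("one window-parallel group:  ", groups("wp")[0])
print("one pipeline group:         ", groups("pp")[0])
```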

🚀 AERIS: Scaling Results

Figure 21: AERIS: Scaling Results
  • 10 EFLOPS (sustained) @ 120,960 GPUs
  • See Hatanpää et al. (2025) for additional details
  • arXiv:2509.13523

🌪️ Hurricane Laura

Figure 22: Hurricane Laura tracks (top) and intensity (bottom). Initialized 7(a), 5(b) and 3(c) days prior to 2020-08-28T00z.

📓 References

Dharuman, Gautham, Kyle Hippe, Alexander Brace, Sam Foreman, Väinö Hatanpää, Varuni K. Sastry, Huihuo Zheng, et al. 2024. “MProt-DPO: Breaking the ExaFLOPS Barrier for Multimodal Protein Design Workflows with Direct Preference Optimization.” In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis. SC ’24. Atlanta, GA, USA: IEEE Press. https://doi.org/10.1109/SC41406.2024.00013.
Hatanpää, Väinö, Eugene Ku, Jason Stock, Murali Emani, Sam Foreman, Chunyong Jung, Sandeep Madireddy, et al. 2025. “AERIS: Argonne Earth Systems Model for Reliable and Skillful Predictions.” https://arxiv.org/abs/2509.13523.
Price, Ilan, Alvaro Sanchez-Gonzalez, Ferran Alet, Tom R. Andersson, Andrew El-Kadi, Dominic Masters, Timo Ewalds, et al. 2024. “GenCast: Diffusion-Based Ensemble Forecasting for Medium-Range Weather.” https://arxiv.org/abs/2312.15796.
Song, Shuaiwen Leon, Bonnie Kruft, Minjia Zhang, Conglong Li, Shiyang Chen, Chengming Zhang, Masahiro Tanaka, et al. 2023. “DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery Through Sophisticated AI System Technologies.” https://arxiv.org/abs/2310.04610.

❤️ Acknowledgements

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

Extras

Timeline

How I got here:

  • 2015: UIUC, B.S. in Engineering Physics + Applied Mathematics
  • 2018: Received SCGSR Fellowship (allowed me to finish my PhD at ANL)
  • 2019: University of Iowa, Ph.D. in Physics
  • 2022: Argonne Leadership Computing Facility, Postdoctoral Researcher
  • Present: Argonne National Laboratory, Computational Scientist (HPC & Large-Scale AI)

samforeman.me/talks/2025/10/24/slides
