Training Foundation Models on Supercomputers

Sam Foreman

Argonne National Laboratory


@ Georgia Institute of Technology

2025-10-15

🌐 Distributed Training

🚀 Scaling: Overview

  • ✅ Goal:
    • Minimize: Cost (i.e. amount of time spent training)
    • Maximize: Performance

    📑 Note

    See 🤗 Performance and Scalability for more details

In this talk, we explore what it takes to train foundation models on supercomputers: the architecture of these models, their computational requirements, and the strategies used to make training efficient at scale, including recent advances in the hardware and software that support it.

🐢 Training on a Single Device

[Diagram: at each step, the next batch (x₀, x₁, x₂, …) flows through Data → GPU0 → Network → Loss, one batch at a time on a single GPU]

SLOW!: Model size limited by GPU memory

🕸️ Parallelism Strategies

  • Data Parallelism
    • Split data across workers
    • Easiest to implement
    • No changes to model
  • Model Parallelism
    • Split model across workers
  • Hybrid Parallelism
    • Combine data + model parallelism
    • More complex to implement
    • Requires changes to model

👬 Training on Multiple GPUs: Data Parallelism

[Diagram: identical copies of the network (NN) on GPU0–GPU2, each processing a different shard of the data (x₀, x₁, x₂) and computing its own loss]
Figure 1: Each GPU receives unique data at each step
  • See 🤗 Methods and tools for efficient training on a single GPU

▶️ Data Parallel: Forward Pass

[Diagram: each GPU (GPU0–GPU2) runs the forward pass on its own batch and computes a local loss; gradients are then averaged across GPUs: (∑ₙ gₙ)/N]
Figure 2: Average gradients across all GPUs

◀️ Data Parallel: Backward Pass

[Diagram: the averaged updates are sent back to each GPU (GPU0–GPU2), keeping all model replicas identical]
Figure 3: Send global updates back to each GPU. See: PyTorch / Distributed Data Parallel
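Below is a minimal, hedged sketch of this pattern with PyTorch DistributedDataParallel; the toy model, sizes, and training loop are placeholders (not the author's setup), and it assumes one process per GPU launched with `torchrun` (e.g. `torchrun --nproc_per_node=4 train_ddp.py`, where the script name is hypothetical):

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")                # one process per GPU
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))    # set by torchrun
    torch.cuda.set_device(local_rank)

    model = DDP(nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(32, 1024, device="cuda")           # each rank gets different data
        loss = model(x).square().mean()                     # placeholder loss
        loss.backward()                                     # DDP averages grads across ranks here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```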

🔄 Collective Communication

  • Broadcast: Send data from one node to all other nodes
  • Reduce: Aggregate data from all nodes to one node
    • AllReduce: Aggregate data from all nodes to all nodes
  • Gather: Collect data from all nodes to one node
    • AllGather: Collect data from all nodes to all nodes
  • Scatter: Distribute data from one node to all other nodes

Reduce

  • Perform a reduction on data across ranks and send the result to an individual (root) rank (sketch below)

[Diagram: ranks 0–3 each hold a value xᵢ; z = reduce(x, 2, SUM) leaves z = x₀ + x₁ + x₂ + x₃ on rank 2]
Figure 4: Reduce operation: one rank receives the reduction of input values across ranks
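A hedged sketch of the same collectives with `torch.distributed` (assumes the process group has already been initialized, e.g. via `torchrun`; the tensor values are illustrative):

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called on every rank.
rank = dist.get_rank()

# Reduce: rank 2 receives the SUM of x over all ranks (cf. z = reduce(x, 2, SUM))
x = torch.tensor([float(rank)])
dist.reduce(x, dst=2, op=dist.ReduceOp.SUM)

# AllReduce: every rank ends up with the same reduced value
y = torch.tensor([float(rank)])
dist.all_reduce(y, op=dist.ReduceOp.SUM)
```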

🐣 Getting Started: In Practice

  • 📦 Distributed Training Frameworks:
    • 🍋 saforem2 / ezpz
    • 🤖 Megatron-LM
    • 🤗 Accelerate
    • 🔥 PyTorch
      • DDP / FSDP
  • 🚀 DeepSpeed
    • ZeRO Offloading
    • Megatron-DeepSpeed
  • 🧠 Memory Management (sketch after this list):
    • FSDP vs. ZeRO
    • Activation Checkpointing
    • Mixed Precision Training
    • Gradient Accumulation
    • Offloading to CPU/NVMe
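As a hedged illustration of three of the memory-management techniques listed above (mixed precision, gradient accumulation, activation checkpointing), here is a minimal single-GPU sketch; the model, sizes, and loss are placeholders and assume a CUDA device with bf16 support:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)]).cuda()  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 4                                    # effective batch = 4 x micro-batch


def forward_with_checkpointing(x):
    # Activation checkpointing: recompute activations in backward instead of storing them
    for layer in model:
        x = checkpoint(layer, x, use_reentrant=False)
    return x


for step in range(100):
    x = torch.randn(8, 1024, device="cuda")        # micro-batch
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):   # mixed precision
        loss = forward_with_checkpointing(x).square().mean() / accum_steps
    loss.backward()                                # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:              # gradient accumulation
        optimizer.step()
        optimizer.zero_grad()
```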

🔄 Keeping things in Sync

Computation stalls during communication!

Keeping the communication-to-computation ratio small is important for effective scaling.

📝 Plan of Attack

[Flowchart: Is the model good enough? If yes → done. If no → is more memory available? If yes → make the model larger; if no → free up memory, then repeat]
Figure 5: General strategy for scaling model training

🚀 Going Beyond Data Parallelism

  • ✅ Useful when model fits on single GPU:
    • ultimately limited by GPU memory
    • model performance limited by size
  • ⚠️ When model does not fit on a single GPU:
    • Offloading (can only get you so far…):
      • DeepSpeed + ZeRO
      • 🔥 PyTorch + FSDP (sketch below)
    • Otherwise, resort to model parallelism strategies
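For the FSDP route above, a hedged minimal sketch (placeholder model; assumes a process group launched with `torchrun`, one GPU per process):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(16)]).cuda()  # placeholder
model = FSDP(model)          # parameters, grads, optimizer states sharded across ranks

# Construct the optimizer after wrapping, so it sees the sharded parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device="cuda")
loss = model(x).square().mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
dist.destroy_process_group()
```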

Going beyond Data Parallelism: DeepSpeed + ZeRO

  • Depending on the ZeRO stage (1, 2, 3), we can partition (and optionally offload; config sketch below):
    1. Stage 1: optimizer states $\left(P_{\mathrm{os}}\right)$
    2. Stage 2: gradients + optimizer states $\left(P_{\mathrm{os+g}}\right)$
    3. Stage 3: model parameters + gradients + optimizer states $\left(P_{\mathrm{os+g+p}}\right)$
Figure 6: DeepSpeed + ZeRO
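As a hedged illustration (values are placeholders, not a recommended configuration), a minimal DeepSpeed config enabling ZeRO stage 2 with optimizer-state offload to CPU might look like:

```python
# Passed as the `config` argument to deepspeed.initialize(...); values are illustrative.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                               # shard gradients + optimizer states
        "offload_optimizer": {"device": "cpu"},   # push optimizer states to host memory
    },
}
```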

🕸️ Additional Parallelism Strategies

  • Tensor (/ Model) Parallelism (TP):
    • 🤗 Tensor Parallelism
    • 🔥 Large Scale Transformer model training with Tensor Parallel (TP)
  • Pipeline Parallelism (PP):
    • 🔥 PyTorch, DeepSpeed
  • Sequence Parallelism (SP):
    • DeepSpeed Ulysses
    • Megatron / Context Parallelism
    • Unified Sequence Parallel (USP)
      • feifeibear/long-context-attention
    • Supports 4D Parallelism (DP + TP + PP + SP)

Pipeline Parallelism (PP)

  • Model is split up vertically (layer-level) across multiple GPUs
  • Each GPU:
    • has a portion of the full model
    • processes a different stage of the pipeline, in parallel, on a small chunk (micro-batch) of the full batch (sketch after Figure 7)
  • See:
    • 🔥 PyTorch / Pipeline Parallelism
    • DeepSpeed / Pipeline Parallelism

[Diagram: the model's layers (Layer 0–Layer 3) split across GPU 0 and GPU 1, each GPU holding a contiguous subset of layers]
Figure 7: Pipeline Parallelism
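The sketch below is a hedged, purely conceptual illustration of the layer split and micro-batching (placeholder model; assumes two visible CUDA devices). Real pipeline engines such as PyTorch's pipelining support or DeepSpeed additionally overlap the stages so neither GPU sits idle:

```python
import torch
import torch.nn as nn

# Layer-level split: first half of the "model" on GPU 0, second half on GPU 1
stage0 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")  # Layers 0-1
stage1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:1")  # Layers 2-3

batch = torch.randn(32, 1024)
micro_batches = batch.chunk(4)            # split the batch into micro-batches

outputs = []
for mb in micro_batches:
    h = stage0(mb.to("cuda:0"))           # stage 0 runs on GPU 0
    y = stage1(h.to("cuda:1"))            # activations hop to GPU 1 for stage 1
    outputs.append(y)
out = torch.cat(outputs)
```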

Tensor Parallel (TP)

  • Each tensor is split up into multiple chunks
  • Each shard of the tensor resides on its designated GPU
  • Each shard is processed separately (and in parallel) on its own GPU
    • results are synced at the end of the step
  • See: 🤗 Model Parallelism for additional details

[Diagram: each layer (Layer 0–Layer 3) is split into shards, with one shard of every layer on GPU0 and the other on GPU1]
Figure 8: Tensor Parallel Training

Tensor Parallel (TP)

  • Suitable when the model is too large to fit onto a single device (CPU / GPU)
  • Typically more complicated to implement than data parallel training
    • This is what one may call horizontal parallelism
    • Communication is required whenever data flows between two subsets
  • argonne-lcf/Megatron-DeepSpeed
  • 🤗 huggingface/nanotron

[Diagram: as in Figure 8, each layer (Layer 0–Layer 3) is sharded across GPU0 and GPU1]
Figure 9: Tensor Parallel Training
  • Split the network over multiple workers
  • Each worker receives a disjoint subset
  • All communication associated with the subsets is distributed

Tensor (/ Model) Parallel Training: Example

Want to compute $y = \sum_{i} x_{i} W_{i} = x_{0} W_{0} + x_{1} W_{1} + x_{2} W_{2}$,
where each GPU holds only its portion of the full weights, as shown below (a numerical sketch follows Figure 10)

  1. GPU0: compute $y_{0} = x_{0} W_{0}$ $\rightarrow$ GPU1
  2. GPU1: compute $y_{1} = y_{0} + x_{1} W_{1}$ $\rightarrow$ GPU2
  3. GPU2: compute $y = y_{1} + x_{2} W_{2} = \sum_{i} x_{i} W_{i}$ ✅

[Diagram: GPU0 holds $W_{0}$, GPU1 holds $W_{1}$, GPU2 holds $W_{2}$; the running partial sum ($x_{0} W_{0}$, then $x_{0} W_{0} + x_{1} W_{1}$) is passed from GPU to GPU until the full sum $y$ is formed]
Figure 10
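A hedged, purely illustrative sketch of the partial-sum computation above, run in a single process with three weight shards standing in for three GPUs (shapes and values are placeholders):

```python
import torch

torch.manual_seed(0)
x = [torch.randn(4) for _ in range(3)]       # x0, x1, x2: one input slice per "GPU"
W = [torch.randn(4, 4) for _ in range(3)]    # W0, W1, W2: one weight shard per "GPU"

y = x[0] @ W[0]                              # "GPU0": y0 = x0 W0, pass to GPU1
y = y + x[1] @ W[1]                          # "GPU1": y1 = y0 + x1 W1, pass to GPU2
y = y + x[2] @ W[2]                          # "GPU2": y  = y1 + x2 W2

# Matches the unsharded computation y = sum_i x_i W_i
assert torch.allclose(y, sum(xi @ Wi for xi, Wi in zip(x, W)))
```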

🔭 AI-for-Science
[Image (source: @tenderizzation): "ChatGPT: explain this image"]

🏗️ Aurora

Table 1: Aurora¹ Specs
Property Value
Racks 166
Nodes 10,624
XPUs² 127,488
CPUs 21,248
NICs 84,992
HBM 8 PB
DDR5 10 PB
Figure 11: Aurora: Fact Sheet.
  1. 🏆 Aurora Supercomputer Ranks Fastest for AI

  2. Each node has 6 Intel Data Center GPU Max 1550 GPUs (code-named “Ponte Vecchio”), each comprising 2 tiles (XPUs), i.e. 12 XPUs per node.

🌌 AuroraGPT (2024–)

AuroraGPT: a general-purpose scientific LLM, broadly trained on general corpora plus scientific {papers, texts, data}

  • Explore pathways towards a “Scientific Assistant” model
  • Build with international partners (RIKEN, BSC, others)
  • Multilingual: English, 日本語, French, German, Spanish
  • Multimodal: images, tables, equations, proofs, time series, graphs, fields, sequences, etc.
Figure 12: Image from Hannibal046 / Awesome-LLM

🧪 AuroraGPT: Open Science Foundation Model

Figure 13: High-level overview of AuroraGPT project

🧰 AuroraGPT: Toolbox

  • Datasets and data pipelines (how do we deal with scientific data?)
  • Software infrastructure and workflows (scalable, robust, extensible)
  • Evaluation of state-of-the-art LLM Models (how do they perform on scientific tasks?)

🚂 Training

argonne-lcf/Megatron-DeepSpeed
Large Model Training: Any Scale, Any Accelerator

🏃‍♂️ Running

argonne-lcf/inference-endpoints
Inference endpoints for LLMs, hosted @ ALCF

🏋️ Challenges: In Practice

This is incredibly difficult in practice, due in part to:

  • Brand new {hardware, architecture, software}
  • Lack of native support in existing frameworks (though getting better!)
  • General system stability
    +10k Nodes $\left(\times \frac{12\,\,\mathrm{XPU}}{1\,\,\mathrm{Node}}\right) \Rightarrow$ +100k XPUs
    • network performance
    • file system stability (impacted by other users !)
    • many unexpected difficulties occur at increasingly large scales
  • Combinatorial explosion of possible configurations and experiments
    • {hyperparameters, architectures, tokenizers, learning rates, …}

💾 AuroraGPT: Training

  • To train a fixed model on trillions of tokens requires:
    1. Aggregating data from multiple different corpora
      (e.g. ArXiv, Reddit, StackExchange, GitHub, Wikipedia, etc.)
    2. Sampling each training batch according to a fixed distribution across corpora
    3. Building indices that map batches of tokens into these files (indexing)

    The original implementation was slow:

    • Designed to run serially on a single device
    • Major bottleneck when debugging data pipeline at scale

🍹 AuroraGPT: Blending Data, Efficiently

  • 🐢 Original implementation:
    • Slow (serial, single device)
    • ~ 1 hr/2T tokens
  • 🐇 New implementation:
    • Fast! (distributed, asynchronous)
    • ~ 2 min/2T tokens
      (30x faster !!)
Figure 14: Time spent preparing 2T tokens

📉 Loss Curve: Training AuroraGPT-7B on 2T Tokens

Figure 15: Loss curve during training on 2T tokens.

✨ Features

  • 🕸️ Parallelism:
    • {data, tensor, pipeline, sequence, …}
  • ♻️ Checkpoint Converters:
    • Megatron ⇄ 🤗 HF ⇄ ZeRO ⇄ Universal
  • 🔀 DeepSpeed Integration:
    • ZeRO Offloading
    • Activation checkpointing
    • AutoTP (WIP)
    • ability to leverage features from DeepSpeed community

✨ Features (even more!)

  • 🧗 Optimizers1:
    • Support for many different optimizers:
      • Distributed Shampoo, Muon, Adopt, Sophia, Lamb, GaLORE, ScheduleFree, …
    • See full list
    • Large batch training
  • 📊 Experiment Tracking:
    • Automatic experiment and metric tracking with Weights & Biases
  1. Implemented by Marieme Ngom

🧬 MProt-DPO

  • Finalist: SC’24 ACM Gordon Bell Prize
    • MProt-DPO: Breaking the ExaFLOPS Barrier for Multimodal Protein Design Workflows with Direct Preference Optimization
  • One of the first protein design toolkits that integrates:
    • Text, (protein/gene) sequence, structure/conformational sampling modalities to build aligned representations for protein sequence-function mapping

One of the first multimodal protein design toolkits that:
  • integrates text, (protein/gene) sequence, and structure/conformational-sampling modalities to build aligned representations for protein sequence-function mapping
  • scales preference-optimization strategies to include the various design constraints imposed by diverse protein design tasks
Two application scenarios:
  • Protein design: at least 5x gains in productive designs
  • Antibody optimization: designs result in greater complementarity and exploration of sequence space
High-water marks for training/fine-tuning multimodal models:
  • achieves ~4.11 EFLOPS sustained performance (peak 5.57 EFLOPS) on Aurora, with >1 EFLOPS on each HPC resource, including the NVIDIA DGX cloud
  • novel integrated workflow that supports diverse backbone foundation models as well as custom models

🧬 Scaling Results (2024)

Figure 16: Scaling results for 3.5B model across ~38,400 GPUs
  • ~ 4 EFLOPS @ Aurora

  • 38,400 XPUs
    = 3200 [node] x 12 [XPU / node]

  • 🎖️ Gordon Bell Finalist:

    • MProt-DPO: Breaking the ExaFLOPS Barrier for Multimodal Protein Design Workflows (1)

This novel work presents a scalable, multimodal workflow for protein design that trains an LLM to generate protein sequences, computationally evaluates the generated sequences, and then exploits them to fine-tune the model.

Direct Preference Optimization steers the LLM toward the generation of preferred sequences, and enhanced workflow technology enables its efficient execution. A 3.5B and a 7B model demonstrate scalability and exceptional mixed precision performance of the full workflow on ALPS, Aurora, Frontier, Leonardo and PDX.

🧬 MProt-DPO: Scaling Results

Figure 17: 3.5B model
Figure 18: 7B model

🚂 Loooooooooong Sequence Lengths

  • Working with Microsoft/DeepSpeed team to enable longer sequence lengths (context windows) for LLMs
    • See my blog post for additional details

Figure 19: Maximum (achievable) SEQ_LEN for both 25B and 33B models (See: Song et al. (2023))

scaling4science
Megatron-DS-Benchmarking

🌎 AERIS (2025)

Figure 20: arXiv:2509.13523
Figure 21: Pixel-level Swin diffusion transformer in sizes from [1–80]B

We demonstrate a significant advancement in AI weather and climate modeling with AERIS through efficient scaling of window-based transformer models. We have performed global medium-range forecasts with performance competitive with GenCast and surpassing the IFS ENS model. Longer, 90-day rollouts show the model's ability to learn atmospheric dynamics on seasonal scales without collapsing, making AERIS the first diffusion-based model to work across forecast scales from 6 hours to 3 months, with remarkably accurate out-of-distribution predictions of extreme events.

👀 High-Level Overview of AERIS

Figure 22: Rollout of AERIS model, specific humidity at 700m.
Table 2: Overview of AERIS model and training setup
Property Description
Domain Global
Resolution 0.25° & 1.4°
Training Data ERA5 (1979–2018)
Model Architecture Swin Transformer
Speedup¹ O(10k–100k)
  1. Relative to PDE-based models, e.g. GFS

➕ Contributions

☔ AERIS

First billion-parameter diffusion model for weather + climate

  • Operates at the pixel level (1 × 1 patch size), guided by physical priors
  • Medium-range forecast skill:
    • Surpasses IFS ENS, competitive with GenCast1
    • Uniquely stable on seasonal scales to 90 days

🌀 SWiPe

A novel 3D (sequence-window-pipeline) parallelism strategy for training transformers across high-resolution inputs

  • Enables scalable small-batch training on large supercomputers2
    • 10.21 ExaFLOPS
    • @ 121,000 Intel XPUs (Aurora)
  1. GenCast: A Generative Model for Medium-Range Global Weather Forecasting (Price et al. (2024))

  2. Demonstrated on up to 120,960 GPUs on Aurora and 8,064 GPUs on LUMI.

⚠️ Issues with the Deterministic Approach

  • Transformers:
    • Deterministic
    • Single input → single forecast
  • Diffusion:
    • Probabilistic
    • Single input → ensemble of forecasts
    • Captures uncertainty and variability in weather predictions
    • Enables ensemble forecasting for better risk assessment

🎲 Transitioning to a Probabilistic Model

Figure 23: Reverse diffusion with the input condition, individual sampling steps $t_{0} \rightarrow t_{64}$, the next time step estimate, and the target output.

Reverse Diffusion Process ($\mathcal{N} \rightarrow \pi$)

Forward Diffusion Process ($\pi \rightarrow \mathcal{N}$)
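As a hedged, generic illustration of the forward/reverse processes above (a DDPM-style sketch, not the specific AERIS formulation; the noise schedule, step count, and field shape are placeholders):

```python
import torch

T = 64                                        # number of sampling steps (cf. t0 -> t64)
betas = torch.linspace(1e-4, 0.02, T)         # illustrative linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)


def forward_noise(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Forward process (pi -> N): sample x_t ~ q(x_t | x_0) for step t."""
    eps = torch.randn_like(x0)
    return alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * eps


x0 = torch.randn(1, 2, 32, 32)                # stand-in for an atmospheric state field
xT = forward_noise(x0, T - 1)                 # nearly pure Gaussian noise

# The reverse process (N -> pi) runs a trained denoiser from x_T back toward x_0,
# conditioned on the previous atmospheric state; different noise draws give an
# ensemble of forecasts rather than a single deterministic prediction.
```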

🌀 Sequence-Window-Pipeline Parallelism SWiPe

  • SWiPe is a novel parallelism strategy for Swin-based Transformers
  • Hybrid 3D Parallelism strategy, combining:
    • Sequence parallelism (SP)
    • Window parallelism (WP)
    • Pipeline parallelism (PP)
Figure 24
Figure 25: SWiPe Communication Patterns

🚀 AERIS: Scaling Results

Figure 26: AERIS: Scaling Results
  • 10 EFLOPs (sustained) @ 120,960 GPUs
  • See Hatanpää et al. (2025) for additional details
  • arXiv:2509.13523

🌪️ Hurricane Laura

Figure 27: Hurricane Laura tracks (top) and intensity (bottom). Initialized 7(a), 5(b) and 3(c) days prior to 2020-08-28T00z.

📓 References

Hatanpää, Väinö, Eugene Ku, Jason Stock, Murali Emani, Sam Foreman, Chunyong Jung, Sandeep Madireddy, et al. 2025. “AERIS: Argonne Earth Systems Model for Reliable and Skillful Predictions.” https://arxiv.org/abs/2509.13523.
Price, Ilan, Alvaro Sanchez-Gonzalez, Ferran Alet, Tom R. Andersson, Andrew El-Kadi, Dominic Masters, Timo Ewalds, et al. 2024. “GenCast: Diffusion-Based Ensemble Forecasting for Medium-Range Weather.” https://arxiv.org/abs/2312.15796.
Song, Shuaiwen Leon, Bonnie Kruft, Minjia Zhang, Conglong Li, Shiyang Chen, Chengming Zhang, Masahiro Tanaka, et al. 2023. “DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery Through Sophisticated AI System Technologies.” https://arxiv.org/abs/2310.04610.

❤️ Acknowledgements

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

samforeman.me/talks/2025/10/15
