Sam Foreman
@ Georgia Institute of Technology
2025-10-15
📑 Note
See 🤗 Performance and Scalability for more details
SLOW!: model size is limited by a single GPU's memory
ezpz
🔄 Keeping things in Sync
Computation stalls during communication!!
Keeping the communication-to-computation ratio small is important for effective scaling.
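As an illustration of hiding that cost, here is a minimal sketch, assuming PyTorch with `torch.distributed` already initialized: gradients are all-reduced asynchronously so the collectives can overlap with any remaining independent computation (in practice, `DistributedDataParallel` does a bucketed version of this automatically).

```python
import torch
import torch.distributed as dist

def allreduce_grads_overlapped(model: torch.nn.Module) -> None:
    """Average gradients across data-parallel ranks, launching the
    collectives asynchronously so communication can overlap with any
    remaining (independent) computation before we block on them."""
    world_size = dist.get_world_size()
    pending = []
    for p in model.parameters():
        if p.grad is not None:
            # async_op=True returns a handle immediately; the all-reduce
            # proceeds in the background on the communication stream.
            handle = dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, async_op=True)
            pending.append((handle, p.grad))
    # ... other independent work could run here, hiding the communication ...
    for handle, grad in pending:
        handle.wait()
        grad.div_(world_size)  # SUM -> mean across data-parallel ranks
```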
ZeRO
Depending on the ZeRO (Zero Redundancy Optimizer) stage (1, 2, 3), we can shard, and optionally offload (see the example config after this list):
- Stage 1: optimizer states
- Stage 2: optimizer states + gradients
- Stage 3: optimizer states + gradients + model parameters
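As a concrete illustration, a minimal DeepSpeed configuration sketch for ZeRO stage 3 with CPU offload; the batch size, dtype, and offload targets are example values, not recommendations.

```python
# Minimal DeepSpeed config sketch: ZeRO stage 3 with optimizer and parameter
# offload to CPU. All values are illustrative, not tuned recommendations.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                               # 1, 2, or 3
        "offload_optimizer": {"device": "cpu"},   # offload optimizer states
        "offload_param": {"device": "cpu"},       # stage 3 only: offload params
    },
}

# Typical usage (inside a distributed launch):
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config
# )
```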
Tensor Parallelism (TP): shard individual weight matrices (layers) across GPUs; each rank computes a partial result that is combined with a collective.
Pipeline Parallelism (PP): split the model depth-wise into stages of consecutive layers, passing activations from stage to stage (a minimal sketch follows this list).
Sequence Parallelism (SP): split the input along the sequence dimension, enabling longer sequence lengths.
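A minimal sketch of the pipeline-parallel idea, assuming two ranks and PyTorch point-to-point ops with `torch.distributed` already initialized; the two-stage split and the shape-preserving stage modules are illustrative simplifications (real implementations such as Megatron-DeepSpeed also micro-batch and interleave to keep every stage busy).

```python
import torch
import torch.distributed as dist

def pipeline_forward(stage_module: torch.nn.Module, shape=(8, 1024)):
    """Toy 2-stage pipeline: rank 0 runs the first half of the model and
    sends its activations to rank 1, which runs the second half.
    For simplicity, each stage is assumed to preserve the tensor shape."""
    rank = dist.get_rank()
    if rank == 0:
        x = torch.randn(shape)              # stand-in for a real micro-batch
        activations = stage_module(x)       # first stage's layers
        dist.send(activations, dst=1)       # hand off to the next stage
        return None
    activations = torch.empty(shape)        # receiver must know the shape
    dist.recv(activations, src=0)
    return stage_module(activations)        # second stage's layers
```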
These strategies can be combined (DP + TP + PP + SP; see the rank-layout sketch below), as implemented in:
- argonne-lcf/Megatron-DeepSpeed
- huggingface/nanotron
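To make the combination concrete, a small sketch of how a flat set of ranks could be laid out on a (DP, PP, TP) grid; the grid shape and the ordering (TP fastest-varying) are illustrative choices, not the exact layout used by either framework.

```python
def rank_to_coords(rank: int, dp: int, pp: int, tp: int):
    """Map a flat global rank onto (dp, pp, tp) grid coordinates.

    With tp fastest-varying, ranks that differ only in their tp coordinate
    are adjacent -- a common choice, since tensor parallelism needs the
    highest-bandwidth (intra-node) links."""
    assert 0 <= rank < dp * pp * tp
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx

# Example: 16 GPUs as 2-way DP x 4-way PP x 2-way TP
for r in range(16):
    print(r, rank_to_coords(r, dp=2, pp=4, tp=2))
```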
Want to compute:
$$ y = \sum_{i} x_i W_i = x_0 W_0 + x_1 W_1 + x_2 W_2 $$
where each GPU has only its portion of the full weights, as shown below.
[Figure: each GPU (GPU1, GPU2, …) computes its local partial product $x_i W_i$; the partial results are summed across GPUs.]
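A minimal sketch of that reduction, assuming PyTorch with `torch.distributed` initialized across the participating GPUs: each rank multiplies its input shard by its weight shard, and an all-reduce sums the partial results (the row-wise split and the choice of `all_reduce` are illustrative).

```python
import torch
import torch.distributed as dist

def row_parallel_matmul(x_shard: torch.Tensor, w_shard: torch.Tensor) -> torch.Tensor:
    """Compute y = sum_i x_i @ W_i across ranks.

    Each rank holds one slice x_i of the input features and the matching
    rows W_i of the weight matrix; after the all-reduce, every rank holds
    the full output y."""
    y_partial = x_shard @ w_shard                      # local partial product x_i @ W_i
    dist.all_reduce(y_partial, op=dist.ReduceOp.SUM)   # sum partials across GPUs
    return y_partial
```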
🔭 AI-for-Science
ChatGPT: explain this image
| Property | Value   |
|----------|---------|
| Racks    | 166     |
| Nodes    | 10,624  |
| XPUs     | 127,488 |
| CPUs     | 21,248  |
| NICs     | 84,992  |
| HBM      | 8 PB    |
| DDR5     | 10 PB   |
AuroraGPT: General-purpose scientific LLM, broadly trained on general corpora plus scientific {papers, texts, data}
Awesome-LLM
🚂 Training
argonne-lcf/Megatron-DeepSpeed
Large Model Training: Any Scale, Any Accelerator
🏃‍♂️ Running
argonne-lcf/inference-endpoints
Inference endpoints for LLMs, hosted @ ALCF
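For example, a minimal sketch of querying such an endpoint, assuming it exposes an OpenAI-compatible chat-completions API; the URL, model name, and token handling below are placeholders, not the actual ALCF endpoints.

```python
import os
import requests

# Placeholders: substitute the real endpoint URL, model name, and access token.
ENDPOINT = os.environ.get("ENDPOINT_URL", "https://example.invalid/v1/chat/completions")
MODEL = os.environ.get("MODEL_NAME", "Meta-Llama-3.1-8B-Instruct")
TOKEN = os.environ.get("ACCESS_TOKEN", "")

resp = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "What is tensor parallelism?"}],
        "temperature": 0.0,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```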
This is incredibly difficult in practice, due in part to:
The original implementation was slow:
~ 4 EFLOPS @ Aurora: 38,400 XPUs = 3,200 [nodes] × 12 [XPU / node]
SEQ_LEN for both 25B and 33B models (see Song et al. (2023))
☔ AERIS
First billion-parameter diffusion model for weather + climate
🌀 SWiPe
A novel 3D (sequence-window-pipeline) parallelism strategy for training transformers across high-resolution inputs
SWiPe is a novel parallelism strategy for Swin-based Transformers, combining:
- Sequence parallelism (SP)
- Window parallelism (WP)
- Pipeline parallelism (PP)

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.