Sam Foreman
@ Argonne National Laboratory
2025-12-16
🍋 ezpz
saforem2/ezpz
Write once, run anywhere
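A minimal sketch of that "write once, run anywhere" workflow. The setup_torch() and get_world_size() helpers are assumed here from the ezpz README and are illustrative, not verified:

```python
# Hedged sketch: ezpz aims to make torch.distributed initialization
# identical across machines and launchers (PBS, Slurm, mpiexec, ...).
import ezpz

rank = ezpz.setup_torch()           # assumed API: picks backend + launcher for this host
world_size = ezpz.get_world_size()  # assumed helper
if rank == 0:
    print(f"Initialized {world_size} ranks")
```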
🚂 Training
argonne-lcf/Megatron-DeepSpeed
For the largest of large language models
🏃‍♂️ Running
argonne-lcf/inference-endpoints
Inference endpoints for LLMs, hosted @ ALCF
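A hedged sketch of querying one of these endpoints, assuming they expose an OpenAI-compatible API; the base URL, token, and model name below are all placeholders (see argonne-lcf/inference-endpoints for actual values):

```python
# Placeholder values throughout; consult the repo for real endpoints.
from openai import OpenAI

client = OpenAI(
    base_url="https://<alcf-endpoint>/v1",  # placeholder endpoint URL
    api_key="<access-token>",               # placeholder credential
)
response = client.chat.completions.create(
    model="<hosted-model-name>",            # placeholder model id
    messages=[{"role": "user", "content": "Hello from Aurora!"}],
)
print(response.choices[0].message.content)
```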
This is incredibly difficult in practice, due in part to:
The original implementation was slow:
argonne-lcf/Megatron-DeepSpeed
~4 EFLOPS @ Aurora
38,400 XPUs = 3,200 [nodes] × 12 [XPU / node]
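A quick sanity check of those scale figures; the per-XPU number is just the quoted totals divided out:

```python
# Sanity check on the Aurora scale figures quoted above.
nodes, xpus_per_node = 3200, 12
total_xpus = nodes * xpus_per_node
assert total_xpus == 38_400

total_flops = 4e18                  # ~4 EFLOPS sustained
print(f"{total_flops / total_xpus / 1e12:.0f} TFLOPS per XPU")  # ≈ 104
```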
SEQ_LEN for both the 25B and 33B models (see Song et al. (2023))
☔ AERIS
First billion-parameter diffusion model for weather + climate
🌀 SWiPe
A novel 3D (sequence-window-pipeline) parallelism strategy for training transformers across high-resolution inputs
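An illustrative sketch, not the AERIS implementation: one way a flat rank could be mapped onto a 3D sequence-window-pipeline grid. The function name and grid sizes are hypothetical:

```python
# Hypothetical illustration of a 3D (sequence-window-pipeline) device grid.
def swipe_coords(rank: int, sp: int, wp: int, pp: int) -> tuple[int, int, int]:
    """Decompose a flat rank on an sp x wp x pp grid into coordinates."""
    assert 0 <= rank < sp * wp * pp
    pp_idx, rem = divmod(rank, sp * wp)   # slowest-varying: pipeline stage
    wp_idx, sp_idx = divmod(rem, sp)      # then window group, then sequence shard
    return sp_idx, wp_idx, pp_idx

# e.g. 24 ranks split as 2 (sequence) x 3 (window) x 4 (pipeline)
for r in (0, 1, 5, 23):
    print(r, swipe_coords(r, sp=2, wp=3, pp=4))
```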
SWiPe is a novel parallelism strategy for Swin-based Transformers, combining:
- Sequence parallelism (SP)
- Window parallelism (WP)
- Pipeline parallelism (PP)

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.