LLMs on Aurora: 🌌 AuroraGPT
Sam Foreman
ALCF INCITE GPU Hackathon, May 20–22, 2025
2025-05-21
AuroraGPT: General-purpose scientific LLM
Broadly trained on general corpora plus scientific {papers, texts, data}
- Datasets and data pipelines for preparing science training data
- Software infrastructure and workflows to train, evaluate, and deploy LLMs at scale for scientific research purposes (a minimal distributed-setup sketch follows below)
- Evaluation of state-of-the-art LLM models
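To make the training-infrastructure bullet concrete, here is a minimal distributed-setup sketch in plain PyTorch. It is illustrative only: the actual stack (Megatron-DeepSpeed + ezpz, noted later in these slides) wraps and extends this kind of boilerplate, and the backend and environment-variable names here are the generic `torchrun` defaults rather than anything Aurora-specific.

```python
import os
import torch.distributed as dist

def setup_distributed() -> int:
    """Initialize the default process group from torchrun-style env vars.

    Illustrative sketch only: on Aurora the real stack would pick an
    appropriate backend (e.g. oneCCL via Intel extensions); 'gloo' is used
    here just so the example runs anywhere, even single-process.
    """
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
    return rank

if __name__ == "__main__":
    rank = setup_distributed()
    # A real training loop (model sharding, optimizer, rank-aware data
    # loading) would go here; this just confirms every rank checked in.
    print(f"rank {rank} / {dist.get_world_size()} initialized")
    dist.destroy_process_group()
```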
|        | Aurora |
|:-------|-------:|
| Racks  | 166    |
| Nodes  | 10,624 |
| CPUs   | 21,248 |
| GPUs   | 63,744 |
| NICs   | 84,992 |
| HBM    | 8 PB   |
| DDR5   | 10 PB  |
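These totals are consistent with the per-node makeup of Aurora (2 Xeon Max CPUs, 6 GPU Max devices, and 8 NICs per node, with 64 nodes per rack), as the quick check below shows.

```python
# Quick consistency check of the table above: totals are just the
# per-node counts multiplied out over 10,624 nodes (64 nodes/rack).
NODES = 10_624
PER_NODE = {"CPUs": 2, "GPUs": 6, "NICs": 8}

for part, n in PER_NODE.items():
    print(f"{part}: {NODES * n:,}")  # 21,248 CPUs, 63,744 GPUs, 84,992 NICs
print(f"Racks: {NODES // 64}")       # 166
```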
Up to ≈25× throughput improvement for genomic foundation models (FMs), with 6.5× better energy efficiency
✅ Goal: Assemble a large corpus of documents (general and scientific) to train and fine-tune AuroraGPT models
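As a rough sketch of what such a data pipeline does at its first stage, the snippet below tokenizes a directory of plain-text documents into a flat token-id array. It is hypothetical: `gpt2` stands in for whatever tokenizer the real pipeline uses, and the actual AuroraGPT preprocessing (filtering, deduplication, sequence packing, sharding) is far more involved.

```python
from pathlib import Path

import numpy as np
from transformers import AutoTokenizer

# Hypothetical sketch: "gpt2" is a stand-in tokenizer, not the one used
# for AuroraGPT.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize_corpus(doc_dir: str) -> np.ndarray:
    """Concatenate every *.txt document into one stream of token ids."""
    ids: list[int] = []
    for path in sorted(Path(doc_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        ids.extend(tokenizer.encode(text))
        ids.append(tokenizer.eos_token_id)  # mark document boundaries
    return np.asarray(ids, dtype=np.uint32)

# tokens = tokenize_corpus("corpus/")  # billions of tokens at full scale
```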
The original implementation was slow:
✅ Goals
❌ Challenges
≈4 EFLOPS @ Aurora
38,400 XPUs = 3,200 [nodes] × 12 [XPUs / node]
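The XPU count follows directly from the node count (each node's 6 GPUs expose 2 tiles apiece, i.e. 12 XPUs per node); dividing the ≈4 EFLOPS figure by that count gives a rough per-XPU throughput. The sketch below is purely a back-of-the-envelope consistency check, not a measured number.

```python
# Back-of-the-envelope check on the figures above (not measurements):
nodes = 3_200
xpus_per_node = 12             # 6 GPUs per node x 2 tiles (XPUs) per GPU
xpus = nodes * xpus_per_node
print(f"{xpus:,} XPUs")        # 38,400

total_flops = 4e18             # ~4 EFLOPS across the run (from the slide above)
print(f"~{total_flops / xpus / 1e12:.0f} TFLOPS per XPU")  # ~104
```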
🔔 Gordon Bell Finalist:
Megatron-DeepSpeed
ezpz

🙏 Acknowledgements
This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
SEQ_LEN scaling for both the 25B and 33B models (see Song et al. (2023))