---
title: "AuroraGPT"
subtitle: "Large Scale Training on Diverse Accelerators"
description: "Presented at the 2025 Scalable Deep Learning Session at the SIAM Annual Meeting"
date: 2025-07-31
email: "[email protected]"
number-sections: false
# location: "[SIAM Annual Meeting 2025](https://meetings.siam.org/sess/dsp_programsess.cfm?SESSIONCODE=84772)"
location: "[SIAM Annual Meeting 2025](https://www.siam.org/conferences-events/siam-conferences/an25/)"
sublocation: "[Scalable Deep Learning](https://meetings.siam.org/sess/dsp_programsess.cfm?SESSIONCODE=84772)"
location-url: "https://meetings.siam.org/sess/dsp_programsess.cfm?SESSIONCODE=84772"
image: ./assets/thumbnail.png
author: "[Sam Foreman](https://samforeman.me)"
affiliation: "Argonne National Laboratory"
editor:
render-on-save: true
freeze: auto
twitter-card:
image: "./assets/thumbnail.png"
creator: "saforem2"
site: "saforem2"
title: "AuroraGPT: Large Scale Training on Diverse Accelerators"
description: "Presented at the 2025 Scalable Deep Learning Session at the SIAM Annual Meeting"
open-graph:
title: "AuroraGPT: Large Scale Training on Diverse Accelerators"
description: "Presented at the 2025 Scalable Deep Learning Session at the SIAM Annual Meeting"
image: "./assets/thumbnail.png"
citation:
author: Sam Foreman
type: speech
url: https://samforeman.me/talks/AuroraGPT-SIAM25/slides.html
format:
revealjs:
callout-style: simple
# # width: 1024
# # height: 768
# max-scale: 4.0
# min-scale: 0.1
# # title: "AuroraGPT"
# pdf-separate-fragments: true
# shift-heading-level-by: -1
# center: false
margin: 0.1
center: true
# background-color: "#FFFFFF"
shift-heading-level-by: -1
footer: "[samforeman.me/talks/AuroraGPT-SIAM25/slides](https://samforeman.me/talks/AuroraGPT-SIAM25/slides)"
slide-url: samforeman.me/talks/AuroraGPT-SIAM25/slides
template-partials:
- title-slide.html
title-slide-attributes:
data-background-size: cover
data-background-iframe: https://emilhvitfeldt.github.io/quarto-iframe-examples/stars/index.html
html: default
gfm:
output-file: "README.md"
revealjs-plugins:
- revealjs-text-resizer
---
<!--
::: {.content-visible when-format="revealjs"}
## {background-iframe="https://emilhvitfeldt.github.io/quarto-iframe-examples/stars/index.html"}
::: {style="text-align:left; padding: 10pt; margin-left: auto; margin-right: auto; line-height: 1.25lh!important; background-color: oklch(from #364FC7 calc(l * 1.15) c h/0.1); border: 2px solid hsla(230, 57%, 50%, 1.0); box-shadow: RGBA(0, 0, 0, 0.35) 0px 5px 15px;"}
[AuroraGPT]{style="font-size: 2.0em; font-weight: 700; color: hsla(230, 57%, 50%, 1.0)!important; margin-block: 1rem;"}
[Large Scale Training on Diverse Accelerators]{style="font-size: 1.5em; font-weight: 600; color:hsla(230, 57%, 50%, 0.7); margin-block: 1rem;"}
::: {.flex-container style="flex-direction: row; gap: 0.5rem; justify-content: space-between; align-items: flex-end;"}
::: {style="margin-block-start: 0.5lh;"}
[[Sam Foreman]{style="color:hsla(0, 0%, 0%, 1.0); font-weight: 600;"}](https://samforeman.me)]
[[[email protected]]{style="color:hsla(0, 0%, 10%, 1.0); font-weight: 650;"}](mailto://[email protected])
[Argonne National Laboratory]{style="color:hsla(0, 0%, 20%, 1.0); font-weight: 500;"}
:::
::: {style="margin-block-start: 0.5lh; text-align: right;"}
[Scalable Deep Learning]{style="color: hsla(0, 0%, 0%, 1.0)!important;"}
[@ 2025 SIAM Annual Meeting]{style="color: hsla(0, 0%, 10%, 1.0)!important;"}
[*2025-07-31*]{style="color:hsla(0, 0%, 20%, 1.0);"}
:::
:::
:::
:::
-->
<!--
## Scalable Deep Learning {background-color="white"}
- [Scalable Deep Learning](https://meetings.siam.org/sess/dsp_programsess.cfm?SESSIONCODE=84772)
- [2025 ALCF INCITE GPU Hackathon (20-May 22, 2025)](https://www.alcf.anl.gov/events/alcf-incite-gpu-hackathon)
- LLMs on Aurora[^my-talks]:
- [🍋 Hands-On: ezpz](https://samforeman.me/talks/incite-hackathon-2025/ezpz/slides)
- [🌌 Overview: AuroraGPT](https://samforeman.me/talks/AuroraGPT-SIAM25)
[^my-talks]: _my_ talks can be found at:
[https://samforeman.me/talks/incite-hackathon-2025](https://samforeman.me/talks/incite-hackathon-2025)
-->
## 🎯 AuroraGPT: Goals {.smaller background-color="white"}
::: {.flex-container style="justify-content: space-around;"}
::: {.column style="width: 50%"}
::: {.blue-card}
[**AuroraGPT**](https://auroragpt.anl.gov): *General purpose scientific LLM*
Broadly trained on general corpora plus scientific {papers, texts, data}
:::
- **Explore pathways** towards a "Scientific Assistant" model
- **Build with international partners** (RIKEN, BSC, others)
- **Multilingual**: English, 日本語, French, German, Spanish
- **Multimodal**: images, tables, equations, proofs, time series, graphs, fields, sequences, etc.
:::
::: {.column style="text-align: center;"}
::: {#fig-awesome-llm}

Image from {{< iconify fa github >}}
[Hannibal046 / `Awesome-LLM`](https://github.com/Hannibal046/Awesome-LLM)
:::
:::
:::
<!--
::: {#fig-timeline}
{width="75%" style="margin-left:auto;margin-right:auto;"}
Credit to the entire AuroraGPT team for slides.
:::
-->
::: {.notes}
- Here to talk about AuroraGPT, Argonne's internal effort to build a general
purpose scientific LLM, broadly trained on general corpora of text +
scientific {papers, text, data}
- As part of this effort, we plan to...
- Explore pathways, build with international partners, multi-{lingual, modal}
- Rough timeline of the project and deliverables:
- 202{3,4}: text-only models, plan to release a series of {7B, 70B, 1T} models
- 202{4,5}: Basic multi-modal models
- 202{5,6}: Advanced scientific multimodal models
- AuroraGPT: Exascale Pre-Training of Large Language Models on Diverse Accelerators
> [argonne-lcf/Megatron-DeepSpeed](https://github.com/argonne-lcf/Megatron-DeepSpeed)
> Large Model Training: any scale, any accelerator
- Topics to cover:
- [x] {tensor, pipeline, sequence}-parallelism
- [x] DeepSpeed integration (ZeRO offloading, activation checkpointing, ...)
- [x] Robust mechanisms for automatic experiment {configuration, tracking, ...}
- [x] Support for modern (and experimental!) optimizers
- [x] Large batch training
- Goals
- Issues with existing models
- AuroraGPT
- Project Details
- Teams, Ongoing Efforts
- Scientific Evaluations
- Scaling Results
- MProt-DPO
- ~~aeris~~ (??)
:::
<!--
## 🦙 Issues with "Publicly Available" LLMs {background-color="white"}
- **Trust** and **Safety**:
- Skepticism about deployment in critical infrastructure
- Correctness and reliability of model outputs
- **Transparency**:
- Data governance, _what was used for pre-training_? fine-tuning?
- **generally unknown**
- What is _open source_?
- Model weights?
- Pre-training \{code, logs, metrics\} ?
::: {.notes}
- Why are we doing this?
- What is the issue with current LLMs?
- **Trust and safety**
- Hallucinations, false confidence
- Can this be reliably mitigated?
- Scaling up inference compute seems to help
- reasoning models, TTT, etc.
- **Transparency**
- Different frontier labs have different definitions of "open source"
- e.g. Llama no longer releases base models
- Libgen ??
- AllenAI institute, olmo models good example
:::
-->
## 🧪 AuroraGPT: Open Science Foundation Model {background-color="white"}
::: {#fig-aurora-gpt .r-stretch style="vertical-align:center;"}

High-level overview of AuroraGPT project
:::
::: {.notes}
- AuroraGPT will be a publicly distributed, open source foundation model for
open science
- Is being trained on:
- Scientific / engineering structured data
- General text, media, news, etc.
- Large amounts of low to medium quality data
- Much less high quality data (that is publicly available for use)
- This data is then cleaned, processed, de-duplicated and used for the initial
pre-training phase of the model
- The vast majority of the overall compute is spent during this initial
pre-training phase
- This is the group I help to lead and will be talking a bit about today
- The initial pre-training phase is currently underway
- Eventually, given a bit of time, effort and magic, the model will be
ready for fine-tuning and additional training for a variety of downstream
tasks
- The pretrained model will then be handed off for additional fine-tuning on a
variety of downstream tasks
- Scientific discovery
- Accelerate scientific tasks
- Digital twins
- Inverse design
- Code optimization
- Accelerated simulations
- Autonomous experiments
- Co-design
- Becoming increasingly clear that LLMs have the potential to drastically
accelerate computational science
- We've seen this already for {GenSLMs, Weather / Climate / Earth Systems
Modeling, Particle Physics, etc.}
:::
## 🧰 AuroraGPT: Toolbox {background-color="white"}
- **Datasets and data pipelines**
(how do we deal with scientific data?)
- **Software infrastructure and workflows**
(scalable, robust, extensible)
- **Evaluation of state-of-the-art LLM Models**
(how do they perform on scientific tasks?)
<!-- to train, evaluate and deploy LLMs at scale for scientific research purposes -->
::: {.flex-container style="gap: 5pt;"}
::: {.callout-note icon=false title="🚂 Training"}
{{< fa brands github >}} [argonne-lcf/Megatron-DeepSpeed](https://github.com/argonne-lcf/Megatron-DeepSpeed)
[Large Model Training: Any Scale, Any Accelerator]{.dim-text}
:::
::: {.callout-important icon=false title="🏃‍♂️ Running"}
{{< fa brands github >}} [argonne-lcf/inference-endpoints](https://github.com/argonne-lcf/inference-endpoints)
[Inference endpoints for LLMs, hosted @ ALCF]{.dim-text}
:::
:::
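
The hosted endpoints can be queried like any OpenAI-compatible API (as is typical for vLLM-style serving). Below is a minimal sketch only: the base URL, access token, and model name are placeholders, not real values; see [argonne-lcf/inference-endpoints](https://github.com/argonne-lcf/inference-endpoints) for the actual endpoints and authentication details.

```python
from openai import OpenAI

# Sketch only: assumes an OpenAI-compatible endpoint; the URL, token, and
# model name below are placeholders -- see argonne-lcf/inference-endpoints
# for the actual endpoints and access instructions.
client = OpenAI(
    base_url="https://<alcf-inference-endpoint>/v1",  # hypothetical
    api_key="<access-token>",                         # hypothetical
)
response = client.chat.completions.create(
    model="<hosted-model-name>",                      # hypothetical
    messages=[{"role": "user", "content": "Summarize the AuroraGPT project."}],
)
print(response.choices[0].message.content)
```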
<!--
## 📚 What do we hope to get? {background-color="white"}
- **Assessment of the approach** of augmenting web training data with two forms
of data specific to science:
- Full text scientific papers
- Structured scientific datasets (suitably mapped to narrative form)
- **Research grade artifacts** (**models**) for scientific community for
adaptation for downstream uses[^mprot-dpo-1]
- **Promotion of responsible AI** best practices where we can figure them out
- **International Collaborations** around the long term goal of _AGI for science_
[^mprot-dpo-1]:|
🔔 Gordon Bell Finalist: [MProt-DPO](https://dl.acm.org/doi/10.1109/SC41406.2024.00013) [@mprot-dpo2024]
::: {.notes}
- Deliverables:
- datasets, pipelines
- software infrastructure, workflows to interface with science applications
- checkpoints, models, logs, workbook, insights, etc.
- Hope to understand:
- How different state-of-the-art models perform at different scientific tasks
- where deep data may have an impact
- feasibility of generically augmenting text with scientific structured data
- Huge undertaking that will require large international collaborations around
long term goal of AGI for science
- Extra points:
- Well known that LLMs are good for non-consequential tasks
- Known to "hallucinate" and create false information
- Can this be mitigated reliably ??
:::
-->
## 🌌 Aurora {background-color="white"}
::: {.flex-container style="align-items: center;"}
::: {.column style="width:5%;"}
::: {#tbl-aurora}
| <!-- --> | <!-- --> |
| -------: | :------- |
| Racks | 166 |
| Nodes | 10,624 |
| CPUs | 21,248 |
| GPUs | 63,744 |
| NICs | 84,992 |
| HBM | 8 PB |
| DDR5 | 10 PB |
Aurora Specs {.responsive .striped .hover}
:::
:::
::: {.column style="text-align:center"}
::: {#fig-aurora .r-stretch}

Aurora[^fastest]: [Fact Sheet](https://www.alcf.anl.gov/sites/default/files/2024-07/Aurora_FactSheet_2024.pdf).
:::
[^fastest]:|
🏆 [Aurora Supercomputer Ranks Fastest for AI](https://www.intel.com/content/www/us/en/newsroom/news/intel-powered-aurora-supercomputer-breaks-exascale-barrier.html)
:::
:::
## 🤝 Teams {auto-animate=true background-color="white"}
::: {.flex-container}
::: {.column}
- **Planning**
- **Data**
- Aggregate existing data and generate new (synthetic) data
- [**Models / Training**]{style="background: oklch(from #ff1a8f calc(l * 1.15) c h / 0.1); border: 1px solid #ff1a8f; border-radius: 0.25px;"}[^me]
- Pre-train a series of models on publicly available data
- **Post-Training**
- Fine-tuning, alignment, reinforcement learning
[^me]: **Sam Foreman** (co-lead), Varuni Sastry, Marieme Ngom, ...
:::
::: {.column}
- **Evaluation**
- Skills, trustworthiness, safety, robustness, privacy, machine ethics
- **Inference**
- Model serving, API development / public-facing web services
- **Distribution**
- Licensing, generating and distributing artifacts for public consumption
- **Communication**
:::
:::
::: {.notes}
- Generating, curating / aggregating, cleaning / understanding new data for training, including:
  - MCQs + scientific narratives
  - new scientific data modalities (gene sequences, geospatial data, ...)
:::
<!--
## 📚 Data {background-color="white"}
::: {.green-card}
✅ **Goal**: Assemble a large corpus of documents (general and scientific) to train and fine-tune AuroraGPT models
:::
- **Challenges**: Avoid / detect contamination with benchmarks
- Respect copyright (ACM Digital Library), privacy, and ethical
considerations
- **Performance Challenges**: _High throughput_ data processing
- Converting PDF $\rightarrow$ text (math formula, figures)
- Convert science information (data) into text (narratives)
- De-duplication (syntactic and semantic) of scientific documents (to avoid
memorization, bias)
- **Quantity**: Considering 20+ Trillion tokens $\rightarrow\approx$ 100M
papers
- **Domains**: All (long-term) scientific domains, starting with:
- Material science, Physics, Biology, Computer Science, Climate Science
-->
<!--
## {background-color="white"}
:::: {.flex-container style="align-items: center;"}
::: {.column style="width:40%"}

:::
::: {.column style="width:60%;"}
:::
::::
-->
::: {.content-visible when-format="html" unless-format="revealjs"}
## 🦜 Training Large Models
:::
::: {.content-visible when-format="revealjs"}
## {.smaller background-color="white"}
:::
:::: {.flex-container style="align-items: flex-start; gap: 10pt;"}
::: {.column}
### 🍎 Training LLMs
- Want to **minimize** *cost* of training
- ~~**Maximize** *throughput*~~ (?)
- Data parallelism takes us only so far [@mccandlish2018empiricalmodellargebatchtraining]... (see the sketch below)
- _Possible_ directions:
- Large batch training (?)
- New (second-order?) optimizers
- Better tokenization schemes (no tokenizers?)
- Better data (?)
- Alternative architecture(s) (?)
- Diffusion / flow-matching
- Sub-quadratic attention (state space models, ...)
{{< fa brands github >}} [argonne-lcf/Megatron-DeepSpeed](https://github.com/argonne-lcf/Megatron-DeepSpeed)
:::
::: {.column}

:::
:::
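
For reference, a minimal sketch of what plain data parallelism boils down to: each replica computes gradients on its own shard of the batch, and the gradients are summed and averaged across replicas. In practice this is handled (and overlapped with the backward pass) by DDP / DeepSpeed / Megatron-DeepSpeed; this is illustration only.

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Data parallelism, distilled: sum gradients across replicas, then average."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)
```

Every additional replica grows the effective (global) batch size, and beyond a critical batch size the returns diminish [@mccandlish2018empiricalmodellargebatchtraining], which is what motivates the other directions listed above.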
## 🎯 Goals
We *need* our implementation[^implementation] to be:
- 💯 **Correct**
- Consistent across systems
- Requires being able to run _the same code_ on multiple different machines
- Independent of hardware and communication library
(e.g. `CUDA`, `ROCm`, `XPU`, `CPU`, `MPS`, ...; see the sketch below)
- 🚀 **Scalable**
- Performant across thousands of GPUs
- Highly configurable and extensible
- Parallelizable across (tensor, pipeline, sequence) dimension(s)
- _Robust against {hardware, network, filesystem, transient}
failures_[^robust]
[^implementation]:|
{{< fa brands github >}} [argonne-lcf/Megatron-DeepSpeed](https://github.com/argonne-lcf/Megatron-DeepSpeed)
[^robust]:|
*Very much a WIP*
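
A minimal sketch of the kind of device / backend detection this requires. The choices below are assumptions for illustration (e.g. the XPU branch assumes Intel's oneCCL bindings for PyTorch are installed, as on Aurora); `ezpz` and Megatron-DeepSpeed handle this logic internally.

```python
import torch

def get_device_and_backend() -> tuple[str, str]:
    """Pick an accelerator type and a matching torch.distributed backend (sketch)."""
    if torch.cuda.is_available():                            # NVIDIA (CUDA) or AMD (ROCm)
        return "cuda", "nccl"
    if hasattr(torch, "xpu") and torch.xpu.is_available():   # Intel XPU (Aurora)
        return "xpu", "ccl"                                  # assumes oneCCL bindings
    if torch.backends.mps.is_available():                    # Apple silicon
        return "mps", "gloo"
    return "cpu", "gloo"
```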
## 🏋️ Challenges: In Practice
This is _incredibly_ difficult in practice, due in part to:
- Brand new {hardware, architecture, software}
- Lack of native support in existing frameworks (though getting better!)
- General system stability
\+10k Nodes $\left(\times \frac{12\,\,\mathrm{XPU}}{1\,\,\mathrm{Node}}\right)\Rightarrow$ \+**100k** XPUs
- network performance
- file system stability (impacted by _other users_!)
- _many_ unexpected difficulties occur at increasingly large scales
- Combinatorial explosion of possible configurations and experiments
- {hyperparameters, architectures, tokenizers, learning rates, ...}
## 💾 Training: 2T Tokens {background-color="white"}
- To train a fixed model on trillions of tokens requires:
1. **Aggregating** data from multiple different _corpora_
(e.g. ArXiv, Reddit, StackExchange, GitHub, Wikipedia, ...)
1. **Sampling** _each training batch_ according to a fixed distribution
across corpora (sketched below)
1. **Building** indices that map batches of tokens into these files
(indexing)
::: {.red-card}
The original implementation was _slow_:
- Designed to run _serially_ on a **single device**
- **Major bottleneck** when debugging data pipeline at scale
:::
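
A minimal sketch of what step 2 amounts to: each sample in a batch is drawn from a corpus according to fixed blending weights. The corpus names and weights below are made up for illustration; the real pipeline reads them from the training data config.

```python
import numpy as np

# Hypothetical corpora and blending weights (must sum to 1) -- illustration only.
corpora = ["arxiv", "reddit", "stackexchange", "github", "wikipedia"]
weights = np.array([0.30, 0.20, 0.20, 0.15, 0.15])

def sample_batch_sources(global_batch_size: int, rng: np.random.Generator) -> np.ndarray:
    """For each sample in a batch, pick which corpus it is drawn from."""
    return rng.choice(len(corpora), size=global_batch_size, p=weights)

rng = np.random.default_rng(0)
counts = np.bincount(sample_batch_sources(2048, rng), minlength=len(corpora))
print(dict(zip(corpora, counts.tolist())))  # empirical mix ~ target weights
```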
## 🍹 Blending Data, Efficiently {background-color="white"}
::: {.flex-container style="padding: 10pt; justify-content: space-around; align-items: flex-start;"}
::: {.column style="width:25%;"}
- 🐢 Original implementation:
- **Slow** (serial, single device)
- [\~ 1 hr]{.dim-text}/2T tokens
- 🐇 New implementation:
- **Fast!** (distributed, asynchronous)
- [\~ **2 min**]{style="color:#2296F3;"}/2T tokens
(**30x** faster!!)
:::
::: {.column}
{#fig-data-processing .r-stretch}
:::
:::
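
The idea behind the speedup is simple: instead of building every corpus index serially on a single device, the per-corpus work is spread across ranks and the results are shared. A rough sketch of that pattern using `torch.distributed` (not the actual implementation; `build_index` is a placeholder for the real per-corpus work):

```python
import torch.distributed as dist

def build_index(corpus: str) -> str:
    """Placeholder for the expensive per-corpus work (tokenize, index, write files)."""
    return f"{corpus}.idx"

def build_all_indices(corpora: list[str]) -> list[str]:
    """Round-robin the corpora across ranks instead of looping serially on one rank."""
    rank, world_size = dist.get_rank(), dist.get_world_size()
    local = [build_index(c) for c in corpora[rank::world_size]]
    gathered = [None] * world_size
    dist.all_gather_object(gathered, local)   # every rank ends up with every index
    return [idx for per_rank in gathered for idx in per_rank]
```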
## 📉 Loss Curve: Training AuroraGPT-7B on 2T Tokens {background-color="white"}
::: {.content-visible when-format="html" unless-format="revealjs"}
::: {#fig-loss-curve}
{width="90%" style="margin-left:auto;margin-right:auto;"}
Loss curve during training on 2T tokens.
:::
:::
::: {.content-visible when-format="revealjs"}
::: {#fig-loss-curve}
{width="90%" style="margin-left:auto;margin-right:auto;"}
Loss curve during training on 2T tokens.
:::
:::
## ✨ Features {background-color="white"}
- 🕸️ **Parallelism**:
- {data, tensor, pipeline, sequence, ...}
- ♻️ **Checkpoint Converters**:
- Megatron ⇄ 🤗 HF ⇄ ZeRO ⇄ Universal
- 🔀 **DeepSpeed Integration**:
- ZeRO Offloading
- Activation checkpointing
- AutoTP (*WIP*)
- ability to leverage features from DeepSpeed community
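
For context, these features are driven by a DeepSpeed config; the sketch below (written as a Python dict) shows roughly what ZeRO offloading plus activation checkpointing look like there. The values are illustrative placeholders, *not* the settings used for AuroraGPT.

```python
# Illustrative DeepSpeed config (placeholder values, not AuroraGPT's settings);
# typically passed to deepspeed.initialize(..., config=ds_config).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {"device": "cpu"},   # ZeRO offloading
    },
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": True,
    },
}
```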
## ✨ Features (even more!) {background-color="white"}
- 🧗 **Optimizers**[^marieme]:
- Support for *many* different optimizers:
- Distributed Shampoo, Muon, Adopt, Sophia, Lamb, GaLORE, ScheduleFree, ...
- See
[full list](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/e3b0398d2f2d3f8ec543e99373ca14bd18a1e4f8/megatron/arguments.py#L1477-L1502)
- Large batch training
- 📊 **Experiment Tracking**:
- Automatic experiment and metric tracking with Weights \& Biases
[^marieme]: Implemented by Marieme Ngom
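
The tracking itself follows the standard Weights \& Biases pattern; a minimal sketch (project and metric names here are placeholders):

```python
import wandb

run = wandb.init(project="example-project", config={"lr": 3e-4})  # placeholder names
for step in range(10):
    loss = 1.0 / (step + 1)                   # stand-in for the real training loss
    wandb.log({"train/loss": loss}, step=step)
run.finish()
```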
::: {.content-visible when-format="revealjs"}
## {.smaller background-color="#040406"}
:::
::: {.flex-container style="align-items: center; gap: 5pt;"}
::: {.column style="width:55%; text-align: center;"}
<!-- ::: {.callout-tip icon=false title="🔭 LLMs for Science"} -->
[🔭 LLMs for Science]{style="font-weight: 600; font-size: 1.5em;"}
{{< iconify fa twitter >}} [source](https://x.com/tenderizzation/status/1944591320796090606)
([\@tenderizzation](https://twitter.com/tenderizzation))
ChatGPT: [explain this image](https://chatgpt.com/share/688ab77e-9ca0-800a-8ab0-a293e06b3cce)
<!-- ::: -->
:::
::: {.column}

:::
:::
## 🤔 Evaluating Models on Scientific Applications {background-color="white"}
- What to measure?
- **Knowledge Extraction, Retrieval, Distillation, Synthesis**: LLM is
provided a question or instruction and a truthful answer is expected
- **Text Grounded**: Answers are expected to be fully grounded on
peer-reviewed references to support responses
- **Reasoning**: LLMs are expected to solve both deductive problems (prove a
theory or hypothesis from formal logic and observations) and inductive
problems (validate / explain observations from theories)
- **Creativity**: A creative answer is expected from a question or
instruction
- thoughtful dialogue, coding, etc.
## ⚖️ Evaluating FM Skills for Science: Criteria {background-color="white"}
- Criteria for all of the above:
- **Correctness** of facts
- **Accuracy** of solutions and inferences
- **Reliability**: consistently good quality and performance
- **Speed**: how quickly a response is produced
- **\# shots**: how many examples are needed for good quality
- Extent of _prompt engineering_
## 🧬 MProt-DPO: Scaling Results {.smaller background-color="white"}
::: {.columns}
::: {.column style="width:70%;"}
::: {.flex-container style="align-items: center; text-align: center; margin-left: auto; margin-right: auto;"}
::: {#fig-mprot-3p5B-scaling0}
{width=100% style="margin:0; padding-unset;"}
Scaling results for the `3.5B` model across 38,400 XPUs
:::
:::
:::
::: {.column style="width:30%;"}
- ~ [4 EFLOPS]{.highlight-blue} @ Aurora
- 38,400 XPUs
= 3200 \[node\] x 12 \[XPU / node\]
- 🔔 Gordon Bell Finalist[^mprot-dpo]:
- [MProt-DPO: Breaking the ExaFLOPS Barrier for Multimodal Protein Design Workflows](https://dl.acm.org/doi/10.1109/SC41406.2024.00013)
:::
:::
[^mprot-dpo]: [@mprot-dpo2024]
<!-- [^aggregate]: -->
<!-- [^gb-finalist]: 🔔 Gordon Bell Finalist: [MProt-DPO](https://dl.acm.org/doi/10.1109/SC41406.2024.00013) [@mprot-dpo2024] -->
<!-- Scaling results[^gb-finalist] for `3.5B` Model -->
## 🧬 MProt-DPO: Scaling Results {.smaller background-color="#1c1c1c"}
::: {.columns}
::: {.column #fig-mprot-3p5B-scaling}

`3.5B` model
:::
::: {.column #fig-mprot-7B-scaling}

`7B` model
:::
:::
## 🚂 Loooooooooong Sequence Lengths {.smaller background-color="#1c1c1c"}
::: {.flex-container style="align-items: center; justify-content: center;"}
{style="height:50pt; margin: unset; padding: 0"}
[{{< iconify ic baseline-plus >}}]{.dim-text style="font-size: 2.0em;"}
{style="height:50pt; margin: unset; padding: 0;"}
:::
- Working with
[{{< fa brands microsoft >}} Microsoft/DeepSpeed](https://github.com/microsoft/DeepSpeed)
team to enable longer sequence lengths (context windows) for LLMs
- See my [blog post](https://samforeman.me/posts/auroragpt/long-sequences/) for additional details (a rough memory estimate is sketched below)
::: {#fig-long-seq}
::: {.flex-container}


:::
Maximum (achievable) `SEQ_LEN` for both `25B` and `33B` models (See: @song2023ds4sci)
:::
::: aside
[{{< fa brands github >}} `scaling4science`](https://github.com/saforem2/scaling4science)
[{{< fa brands github >}} `Megatron-DS-Benchmarking`](https://github.com/saforem2/Megatron-DS-Benchmarking)
:::
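
To see why long contexts have to be split across devices, here is a rough back-of-the-envelope estimate of the attention-score activations for a naive (non-fused) attention implementation. The model sizes are hypothetical, and fused / flash kernels avoid materializing these tensors, so treat this as intuition only:

```python
def attention_scores_gib(seq_len: int, n_heads: int, n_layers: int,
                         micro_batch: int = 1, bytes_per_el: int = 2) -> float:
    """Memory for the (batch, heads, seq, seq) attention scores, summed over layers."""
    per_layer = micro_batch * n_heads * seq_len * seq_len * bytes_per_el
    return n_layers * per_layer / 2**30

# Quadratic growth with sequence length (hypothetical 32-head, 48-layer model):
for s in (2_048, 32_768, 262_144):
    print(f"{s:>8d} tokens -> {attention_scores_gib(s, n_heads=32, n_layers=48):>14,.1f} GiB")
```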
<!--
## 🌎 Aeris: Scaling Results
::: {#tbl-aeris}
| Model | #Nodes | DP | GBS | TFLOPS/tile | MFU | EF(S) | EF(P) |
|-----------|--------|-----|------|-------------|------|-------|-------|
| 0.6B | 2304 | 32 | 1152 | 19.0 | 9% | 0.53 | 0.59 |
| 4B | 4352 | 8 | 576 | 18.3 | 8% | 0.95 | 1.00 |
| 16B | 8704 | 16 | 1152 | **42.7** | **20%**|**4.46**|**5.09**|
| 37B | **9000**| 5 | **500**| 32.3 | 16% | 3.80 | 3.98 |
Sustained and peak training throughput for Aeris on Aurora, across different
model sizes.
Note: `EF(S)` -- sustained ExaFLOPS, `EF(P)` -- peak ExaFLOPS {.responsive .striped .hover}
:::
::: aside
The gap between peak and sustained ExaFLOPS is primarily due to the time spent on the
optimizer step and gradient reduction. These components occur outside the
pipelined forward-backward pass and thus contribute to the reduction in
sustained throughput relative to the peak.
:::
-->
## 📓 References {background-color="white"}
::: {.flex-container style="gap: 2pt;"}
::: {.column}
- {{< fa brands github >}} [argonne-lcf / `Megatron-DeepSpeed`](https://github.com/argonne-lcf/Megatron-DeepSpeed)
[For the largest of large language models.]{.dim-text}
- {{< fa brands github >}} [saforem2 / `ezpz`](https://github.com/saforem2/ezpz)
[Distributed training, ezpz. 🍋]{.dim-text}
- 📊 See my other slides at [samforeman.me/talks](https://samforeman.me/talks):
- [LLMs from Scratch](https://saforem2.github.io/llm-workshop-talk)
- [Creating Small(\~ish) LLMs](https://saforem2.github.io/LLM-tutorial)
- [Parallel Training Techniques](https://saforem2.github.io/parallel-training-slides)
- [LLMs on Polaris](https://samforeman.me/talks/llms-on-polaris/#/title-slide)
- [Training LLMs at Scale](https://samforeman.me/talks/llms-at-scale/)
:::
::: {.column}
- 👀 See also:
- [New international consortium for generative AI models for science](https://www.anl.gov/article/new-international-consortium-formed-to-create-trustworthy-and-reliable-generative-ai-models-for)
- [PyTorch Distributed Overview](https://pytorch.org/tutorials/beginner/dist_overview.html)
- [🤗 Efficient Training on Multiple GPUs](https://huggingface.co/docs/transformers/en/perf_train_gpu_many)
- [Getting Started - DeepSpeed](https://www.deepspeed.ai/getting-started/)
- 🕸️ [Quality Measures for Dynamic Graph Generative Models](https://openreview.net/forum?id=8bjspmAMBk)
[@hosseini2025quality]
:::
:::
## ❤️ Thank you! {background-color="white"}
- Organizers
- Feel free to reach out!
<split even>
[<i class="fas fa-home"></i>](https://samforeman.me)
[<i class="far fa-paper-plane"></i>](mailto:///[email protected])
[<i class="fab fa-twitter"></i>](https://www.twitter.com/saforem2)
</split>
::: {.callout-note icon=false title="🙏 Acknowledgements" collapse="false"}
This research used resources of the Argonne Leadership Computing Facility,
which is a DOE Office of Science User Facility supported under Contract
DE-AC02-06CH11357.
:::
## 📑 Bibliography {background-color="white"}
- Refs:
- @wei2022emergentabilitieslargelanguage
- Animations from [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
::: {#refs}
:::
<!--
## ♻️ Life Cycle of the LLM {background-color="white"}
::: {.panel-tabset style="text-align:center"}
### 📝 Pre-training {background-color="white"}
::: {#fig-pretraining style="width:90%; text-align: center; margin-left: auto; margin-right: auto;"}

**Pre-training**: Virtually all of the compute used during pretraining phase
:::
<!--
### 🎀 Fine-Tuning {background-color="white"}
::: {#fig-fine-tuning style="width:90%; text-align: center; margin-left: auto; margin-right: auto;"}

**Fine-tuning**: Fine-tuning actually updates the model's weights to make the model better at a certain task.
:::
:::
-->
<!--
## 🍎 Training LLMs {.smaller background-color="white"}
:::: {.flex-container style="align-items: flex-end; "}
::: {.column style="width:221pt;"}
::: {#fig-it-hungers}

It's hungry!
:::
:::
::: {.column style="width:60%;"}
::: {#fig-evolution}

Visualization from @yang2023harnessing
:::
:::
::::
-->
<!--
### 💾 Evaluating Checkpoints {background-color="white"}
```python
from typing import Optional
import os
from pathlib import Path
from transformers import LlamaForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
def load_model(ckpt_dir) -> LlamaForCausalLM:
return LlamaForCausalLM.from_pretrained(ckpt_dir)
def eval_model(model, max_length: int, prompt: str) -> str:
return (
tokenizer.batch_decode(
model.generate(
**tokenizer(prompt, return_tensors="pt"),
max_length=max_length,
),
clean_up_tokenization_spaces=True,
skip_special_tokens=True,
)[0]
)
def load_and_eval_model_from_checkpoint(
step: int,
max_length: int = 64,
prompt: Optional[str] = None,
ckpt_root: Optional[os.PathLike | Path | str] = None,
) -> str:
print(f"Loading model from checkpoint at global step: {step}")
prompt = "What is it like in there?" if prompt is None else prompt
ckpt_root = Path("checkpoints") if ckpt_root is None else Path(ckpt_root)
ckpt_dir = ckpt_root.joinpath(f"global_step{step}")
return (
eval_model(
model=load_model(ckpt_dir.as_posix()),
max_length=max_length,
prompt=prompt,
)
)
```
### Model Evaluations {background-color="white"}
::: {.panel-tabset}
#### 7000
Tokens: 88B
```python
>>> print(load_checkpoint(7000))
Loading model from checkpoint at global step: 7000
"What is it like in there?"
"""
I'm not sure if it's a good idea to use a different name for the same thing,
but I'm sure it's a good idea to use a different name for the same thing.
I'm not sure if it's a good idea to use a different name for the same thing,
but I'm sure it's a good idea to use a different name for the same thing.
I'm not sure if it's a good idea to use a different name for the same thing,
but I'm sure it
"""
```
#### 12000
Tokens: 150B
```python
>>> print(load_checkpoint(12000))
Loading model from checkpoint at global step: 12000
"What is it like in there?"
"""
What is it like in there?
The people are very friendly and helpful.
What is it like in there?
The people are very friendly and helpful.
What is it like in there?
The people are very friendly and helpful.
What is it like in there?
The people are very friendly and helpful.
What is it like in there?
The people are very friendly and helpful.
What is it like in there?
"""
```
#### 17000
Tokens: 215B
```python
>>> print(load_checkpoint(17000))
Loading model from checkpoint at global step: 17000
"What is it like in there?"
"""
I’m not sure what to expect. I’m not sure what to expect from the people I’m
with. I’m not sure what to expect from the people I’m with. I’m not sure what
to expect from the people I’m with. I’m not sure what to expect from the people
I’m with.
I’m not sure what to expect from the people I’m with.
I’m not sure what to expect from the people I’m with.
I’m not sure what to expect from the people
"""
```
#### 22000
Tokens: 277B
```python
>>> print(load_checkpoint(22000))
Loading model from checkpoint at global step: 22000
"What is it like in there?"
"""
I’m a 20 year old guy from the UK. I’m a student at the University of
Manchester, studying Computer Science. I’m a big fan of the band, The Beatles,
and I’m a huge fan of the movie, The Wizard of Oz. I’m a huge fan of the band,
The Beatles, and I’m a huge fan of the movie, The Wizard of Oz.
I’m a big fan of the band, The Beatles, and I’m a huge fan of the movie
"""
```
#### 32000
Tokens: 400B
```python
>>> print(load_checkpoint(32000))
Loading model from checkpoint at global step: 32000
"What is it like in there?"
"""
I've been to the US and I've been to Canada.
In the US, it's a lot like the US.
In Canada, it's a lot like the US.
In the US, it's a lot like the US.
In Canada, it's a lot like the US.
In the US, it's a lot like the US.
In Canada, it's a lot like the US.
In the US, it's
"""
```
#### 40000
Tokens: 503B
```python
>>> print(load_checkpoint(40000))
Loading model from checkpoint at global step: 40000
"What is it like in there?"
"""
The first thing you notice when you enter the room is the size. It’s huge. It’s
like a football field. It’s a lot of space.
The second thing you notice is the light. It’s bright. It’s bright.
The third thing you notice is the sound. It’s loud. It’s loud.
The fourth thing you notice is the smell. It’s a lot of smells. It’s a lot of smells.
The fifth thing you notice is the temperature. It’s hot.
"""
```
:::
-->
<!-- ::: -->
<!--
- Being trained on:
:::: {.flex-container style="flex-direction:row; justify-content: space-around;"}
::: {.flex-container style="flex-direction:column;"}
🇺🇸English
🇯🇵日本語
🇫🇷French
🇩🇪Deutsch
🇪🇸Español[^bsc]
🇮🇹Italian
:::
::: {.flex-container style="flex-direction:column;"}
🧪 scientific text
🖼️ images
📊 tables
➕ equations
📖 proofs
:::
::: {.flex-container style="flex-direction:column;"}
📆 structured data
⛓️ sequences
⏰ time-series
🕸️ graphs
🌀 fields
:::
::::
[^riken]:|
[Argonne and RIKEN sign a MOU in support of AI for science](https://www.anl.gov/article/argonne-and-riken-sign-a-memorandum-of-understanding-in-support-of-ai-for-science)
[^bsc]:|
Collaborations with Barcelona Supercomputing Center
-->