---
title: "AuroraGPT"
subtitle: "Large Scale Training on Diverse Accelerators"
description: "Presented at the 2025 Scalable Deep Learning Session at the SIAM Annual Meeting"
date: 2025-07-31
email: "[email protected]"
number-sections: false
# location: "[SIAM Annual Meeting 2025](https://meetings.siam.org/sess/dsp_programsess.cfm?SESSIONCODE=84772)"
location: "[SIAM Annual Meeting 2025](https://www.siam.org/conferences-events/siam-conferences/an25/)"
sublocation: "[Scalable Deep Learning](https://meetings.siam.org/sess/dsp_programsess.cfm?SESSIONCODE=84772)"
location-url: "https://meetings.siam.org/sess/dsp_programsess.cfm?SESSIONCODE=84772"
image: ./assets/thumbnail.png
author: "[Sam Foreman](https://samforeman.me)"
affiliation: "Argonne National Laboratory"
editor:
render-on-save: true
freeze: auto
twitter-card:
image: "./assets/thumbnail.png"
creator: "saforem2"
site: "saforem2"
title: "AuroraGPT: Large Scale Training on Diverse Accelerators"
description: "Presented at the 2025 Scalable Deep Learning Session at the SIAM Annual Meeting"
open-graph:
title: "AuroraGPT: Large Scale Training on Diverse Accelerators"
description: "Presented at the 2025 Scalable Deep Learning Session at the SIAM Annual Meeting"
image: "./assets/thumbnail.png"
citation:
author: Sam Foreman
type: speech
url: https://samforeman.me/talks/AuroraGPT-SIAM25/slides.html
format:
revealjs:
callout-style: simple
# # width: 1024
# # height: 768
# max-scale: 4.0
# min-scale: 0.1
# # title: "AuroraGPT"
# pdf-separate-fragments: true
# shift-heading-level-by: -1
# center: false
margin: 0.1
center: true
# background-color: "#FFFFFF"
shift-heading-level-by: -1
footer: "[samforeman.me/talks/AuroraGPT-SIAM25/slides](https://samforeman.me/talks/AuroraGPT-SIAM25/slides)"
slide-url: samforeman.me/talks/AuroraGPT-SIAM25/slides
template-partials:
- title-slide.html
title-slide-attributes:
data-background-size: cover
data-background-iframe: https://emilhvitfeldt.github.io/quarto-iframe-examples/stars/index.html
html: default
gfm:
output-file: "README.md"
revealjs-plugins:
- revealjs-text-resizer
---
<!--
::: {.content-visible when-format="revealjs"}
## {background-iframe="https://emilhvitfeldt.github.io/quarto-iframe-examples/stars/index.html"}
::: {style="text-align:left; padding: 10pt; margin-left: auto; margin-right: auto; line-height: 1.25lh!important; background-color: oklch(from #364FC7 calc(l * 1.15) c h/0.1); border: 2px solid hsla(230, 57%, 50%, 1.0); box-shadow: RGBA(0, 0, 0, 0.35) 0px 5px 15px;"}
[AuroraGPT]{style="font-size: 2.0em; font-weight: 700; color: hsla(230, 57%, 50%, 1.0)!important; margin-block: 1rem;"}
[Large Scale Training on Diverse Accelerators]{style="font-size: 1.5em; font-weight: 600; color:hsla(230, 57%, 50%, 0.7); margin-block: 1rem;"}
::: {.flex-container style="flex-direction: row; gap: 0.5rem; justify-content: space-between; align-items: flex-end;"}
::: {style="margin-block-start: 0.5lh;"}
[[Sam Foreman]{style="color:hsla(0, 0%, 0%, 1.0); font-weight: 600;"}](https://samforeman.me)]
[[[email protected]]{style="color:hsla(0, 0%, 10%, 1.0); font-weight: 650;"}](mailto://[email protected])
[Argonne National Laboratory]{style="color:hsla(0, 0%, 20%, 1.0); font-weight: 500;"}
:::
::: {style="margin-block-start: 0.5lh; text-align: right;"}
[Scalable Deep Learning]{style="color: hsla(0, 0%, 0%, 1.0)!important;"}
[@ 2025 SIAM Annual Meeting]{style="color: hsla(0, 0%, 10%, 1.0)!important;"}
[*2025-07-31*]{style="color:hsla(0, 0%, 20%, 1.0);"}
:::
:::
:::
:::
-->
<!--
## Scalable Deep Learning {background-color="white"}
- [Scalable Deep Learning](https://meetings.siam.org/sess/dsp_programsess.cfm?SESSIONCODE=84772)
- [2025 ALCF INCITE GPU Hackathon (20-May 22, 2025)](https://www.alcf.anl.gov/events/alcf-incite-gpu-hackathon)
- LLMs on Aurora[^my-talks]:
- [🍋 Hands-On: ezpz](https://samforeman.me/talks/incite-hackathon-2025/ezpz/slides)
- [🌌 Overview: AuroraGPT](https://samforeman.me/talks/AuroraGPT-SIAM25)
[^my-talks]: _my_ talks can be found at:
[https://samforeman.me/talks/incite-hackathon-2025](https://samforeman.me/talks/incite-hackathon-2025)
-->
## 🎯 AuroraGPT: Goals {.smaller background-color="white"}
::: {.flex-container style="justify-content: space-around;"}
::: {.column style="width: 50%"}
::: {.blue-card}
[**AuroraGPT**](https://auroragpt.anl.gov): *General purpose scientific LLM*
Broadly trained on general corpora plus scientific {papers, texts, data}
:::
- **Explore pathways** towards a "Scientific Assistant" model
- **Build with international partners** (RIKEN, BSC, others)
- **Multilingual**: English, 日本語, French, German, Spanish
- **Multimodal**: images, tables, equations, proofs, time series, graphs, fields, sequences, etc.
:::
::: {.column style="text-align: center;"}
::: {#fig-awesome-llm}

Image from {{< iconify fa github >}}
[Hannibal046 / `Awesome-LLM`](https://github.com/Hannibal046/Awesome-LLM)
:::
:::
:::
<!--
::: {#fig-timeline}
{width="75%" style="margin-left:auto;margin-right:auto;"}
Credit to the entire AuroraGPT team for slides.
:::
-->
::: {.notes}
- Here to talk about AuroraGPT, Argonne's internal effort to build a general
purpose scientific LLM, broadly trained on general corpora of text +
scientific {papers, text, data}
- As part of this effort, we plan to...
- Explore pathways, build with international partners, multi-{lingual, modal}
- Rough timeline of the project and deliverables:
- 202{3,4}: text-only models, plan to release a series of {7B, 70B, 1T} models
- 202{4,5}: Basic multi-modal models
- 202{5,6}: Advanced scientific multimodal models
- AuroraGPT: Exascale Pre-Training of Large Language Models on Diverse Accelerators
> [argonne-lcf/Megatron-DeepSpeed](https://github.com/argonne-lcf/Megatron-DeepSpeed)
> Large Model Training: any scale, any accelerator
- Topics to cover:
- [x] {tensor, pipeline, sequence}-parallelism
- [x] DeepSpeed integration (ZeRO offloading, activation checkpointing, ...)
- [x] Robust mechanisms for automatic experiment {configuration, tracking, ...}
- [x] Support for modern (and experimental!) optimizers
- [x] Large batch training
- Goals
- Issues with existing models
- AuroraGPT
- Project Details
- Teams, Ongoing Efforts
- Scientific Evaluations
- Scaling Results
- MProt-DPO
- ~~aeris~~ (??)
:::
<!--
## 🦙 Issues with "Publicly Available" LLMs {background-color="white"}
- **Trust** and **Safety**:
- Skepticism about deployment in critical infrastructure
- Correctness and reliability of model outputs
- **Transparency**:
- Data governance, _what was used for pre-training_? fine-tuning?
- **generally unknown**
- What is _open source_?
- Model weights?
- Pre-training \{code, logs, metrics\} ?
::: {.notes}
- Why are we doing this?
- What is the issue with current LLMs?
- **Trust and safety**
- Hallucinations, false confidence
- Can this be reliably mitigated?
- Scaling up inference compute seems to help
- reasoning models, TTT, etc.
- **Transparency**
- Different frontier labs have different definitions of "open source"
- e.g. Llama no longer releases base models
- Libgen ??
- AllenAI institute, olmo models good example
:::
-->
## 🧪 AuroraGPT: Open Science Foundation Model {background-color="white"}
::: {#fig-aurora-gpt .r-stretch style="vertical-align:center;"}

High-level overview of AuroraGPT project
:::
::: {.notes}
- AuroraGPT will be a publicly distributed, open source foundation model for
open science
- Is being trained on:
- Scientific / engineering structured data
- General text, media, news, etc.
- Large amounts of low to medium quality data
- Much less high quality data (that is publicly available for use)
- This data is then cleaned, processed, de-duplicated and used for the initial
pre-training phase of the model
- The vast majority of the overall compute is spent during this initial
pre-training phase
- This is the group I help to lead and will be talking a bit about today
- The initial pre-training phase is currently underway
- Eventually, given a bit of time, effort and magic, the model will be
ready for fine-tuning and additional training for a variety of downstream
tasks
- The pretrained model will then be handed off for additional fine-tuning on a
variety of downstream tasks
- Scientific discovery
- Accelerate scientific tasks
- Digital twins
- Inverse design
- Code optimization
- Accelerated simulations
- Autonomous experiments
- Co-design
- Becoming increasingly clear that LLMs have the potential to drastically
accelerate computational science
- We've seen this already for {GenSLMs, Weather / Climate / Earth Systems
Modeling, Particle Physics, etc.}
:::
## 🧰 AuroraGPT: Toolbox {background-color="white"}
- **Datasets and data pipelines**
(how do we deal with scientific data?)
- **Software infrastructure and workflows**
(scalable, robust, extensible)
- **Evaluation of state-of-the-art LLM Models**
(how do they perform on scientific tasks?)
<!-- to train, evaluate and deploy LLMs at scale for scientific research purposes -->
::: {.flex-container style="gap: 5pt;"}
::: {.callout-note icon=false title="🚂 Training"}
{{< fa brands github >}} [argonne-lcf/Megatron-DeepSpeed](https://github.com/argonne-lcf/Megatron-DeepSpeed)
[Large Model Training: Any Scale, Any Accelerator]{.dim-text}
:::
::: {.callout-important icon=false title="🏃‍♂️ Running"}
{{< fa brands github >}} [argonne-lcf/inference-endpoints](https://github.com/argonne-lcf/inference-endpoints)
[Inference endpoints for LLMs, hosted @ ALCF]{.dim-text}
:::
:::
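
The hosted endpoints can be queried like any OpenAI-compatible API (as is typical for vLLM-style serving). Below is a minimal sketch only: the base URL, access token, and model name are placeholders, not real values; see [argonne-lcf/inference-endpoints](https://github.com/argonne-lcf/inference-endpoints) for the actual endpoints and authentication details.

```python
from openai import OpenAI

# Sketch only: assumes an OpenAI-compatible endpoint; the URL, token, and
# model name below are placeholders -- see argonne-lcf/inference-endpoints
# for the actual endpoints and access instructions.
client = OpenAI(
    base_url="https://<alcf-inference-endpoint>/v1",  # hypothetical
    api_key="<access-token>",                         # hypothetical
)
response = client.chat.completions.create(
    model="<hosted-model-name>",                      # hypothetical
    messages=[{"role": "user", "content": "Summarize the AuroraGPT project."}],
)
print(response.choices[0].message.content)
```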
<!--
## 📚 What do we hope to get? {background-color="white"}
- **Assessment of the approach** of augmenting web training data with two forms
of data specific to science:
- Full text scientific papers
- Structured scientific datasets (suitably mapped to narrative form)
- **Research grade artifacts** (**models**) for scientific community for
adaptation for downstream uses[^mprot-dpo-1]
- **Promotion of responsible AI** best practices where we can figure them out
- **International Collaborations** around the long term goal of _AGI for science_
[^mprot-dpo-1]:|
🔔 Gordon Bell Finalist: [MProt-DPO](https://dl.acm.org/doi/10.1109/SC41406.2024.00013) [@mprot-dpo2024]
::: {.notes}
- Deliverables:
- datasets, pipelines
- software infrastructure, workflows to interface with science applications
- checkpoints, models, logs, workbook, insights, etc.
- Hope to understand:
- How different state-of-the-art models perform at different scientific tasks
- where deep data may have an impact
- feasibility of generically augmenting text with scientific structured data
- Huge undertaking that will require large international collaborations around
long term goal of AGI for science
- Extra points:
- Well known that LLMs are good for non-consequential tasks
- Known to "hallucinate" and create false information
- Can this be mitigated reliably ??
:::
-->
## 🌌 Aurora {background-color="white"}
::: {.flex-container style="align-items: center;"}
::: {.column style="width:5%;"}
::: {#tbl-aurora}
| <!-- --> | <!-- --> |
| -------: | :------- |
| Racks | 166 |
| Nodes | 10,624 |
| CPUs | 21,248 |
| GPUs | 63,744 |
| NICs | 84,992 |
| HBM | 8 PB |
| DDR5 | 10 PB |
Aurora Specs {.responsive .striped .hover}
:::
:::
::: {.column style="text-align:center"}
::: {#fig-aurora .r-stretch}

Aurora[^fastest]: [Fact Sheet](https://www.alcf.anl.gov/sites/default/files/2024-07/Aurora_FactSheet_2024.pdf).
:::
[^fastest]:|
🏆 [Aurora Supercomputer Ranks Fastest for AI](https://www.intel.com/content/www/us/en/newsroom/news/intel-powered-aurora-supercomputer-breaks-exascale-barrier.html)
:::
:::
## 🤝 Teams {auto-animate=true background-color="white"}
::: {.flex-container}
::: {.column}
- **Planning**
- **Data**
- Aggregate existing data and generate new (synthetic) data
- [**Models / Training**]{style="background: oklch(from #ff1a8f calc(l * 1.15) c h / 0.1); border: 1px solid #ff1a8f; border-radius: 0.25px;"}[^me]
- Pre-train a series of models on publicly available data
- **Post-Training**
- Fine-tuning, alignment, reinforcement learning
[^me]: **Sam Foreman** (co-lead), Varuni Sastry, Marieme Ngom, ...
:::
::: {.column}
- **Evaluation**
- Skills, trustworthiness, safety, robustness, privacy, machine ethics
- **Inference**
- Model serving, API development / public-facing web services
- **Distribution**
- Licensing, generating and distributing artifacts for public consumption
- **Communication**
:::
:::
::: {.notes}
- Generating, curating / aggregating, cleaning / understanding new data for training, including:
  - MCQs + scientific narratives
  - new scientific data modalities (gene sequences, geospatial data, ...)
:::
<!--
## 📚 Data {background-color="white"}
::: {.green-card}
✅ **Goal**: Assemble a large corpus of documents (general and scientific) to train and fine-tune AuroraGPT models
:::
- **Challenges**: Avoid / detect contamination with benchmarks
- Respect copyright (ACM Digital Library), privacy, and ethical
considerations
- **Performance Challenges**: _High throughput_ data processing
- Converting PDF $\rightarrow$ text (math formula, figures)
- Convert science information (data) into text (narratives)
- De-duplication (syntactic and semantic) of scientific documents (to avoid
memorization, bias)
- **Quantity**: Considering 20+ Trillion tokens $\rightarrow\approx$ 100M
papers
- **Domains**: All (long-term) scientific domains, starting with:
- Material science, Physics, Biology, Computer Science, Climate Science
-->
<!--
## {background-color="white"}
:::: {.flex-container style="align-items: center;"}
::: {.column style="width:40%"}

:::
::: {.column style="width:60%;"}
:::
::::
-->
::: {.content-visible when-format="html" unless-format="revealjs"}
## 🦜 Training Large Models
:::
::: {.content-visible when-format="revealjs"}
## {.smaller background-color="white"}
:::
:::: {.flex-container style="align-items: flex-start; gap: 10pt;"}
::: {.column}
### 🍎 Training LLMs
- Want to **minimize** *cost* of training
- ~~**Maximize** *throughput*~~ (?)
- Data parallelism takes us only so far [@mccandlish2018empiricalmodellargebatchtraining]... (see the sketch below)
- _Possible_ directions:
- Large batch training (?)
- New (second-order?) optimizers
- Better tokenization schemes (no tokenizers?)
- Better data (?)
- Alternative architecture(s) (?)
- Diffusion / flow-matching
- Sub-quadratic attention (state space models, ...)
{{< fa brands github >}} [argonne-lcf/Megatron-DeepSpeed](https://github.com/argonne-lcf/Megatron-DeepSpeed)
:::
::: {.column}

:::
:::
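
For reference, a minimal sketch of what plain data parallelism boils down to: each replica computes gradients on its own shard of the batch, and the gradients are summed and averaged across replicas. In practice this is handled (and overlapped with the backward pass) by DDP / DeepSpeed / Megatron-DeepSpeed; this is illustration only.

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Data parallelism, distilled: sum gradients across replicas, then average."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)
```

Every additional replica grows the effective (global) batch size, and beyond a critical batch size the returns diminish [@mccandlish2018empiricalmodellargebatchtraining], which is what motivates the other directions listed above.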
## 🎯 Goals
We *need* our implementation[^implementation] to be:
- 💯 **Correct**
- Consistent across systems
- Requires being able to run _the same code_ on multiple different machines
- Independent of hardware and communication library
(e.g. `CUDA`, `ROCm`, `XPU`, `CPU`, `MPS`, ...; see the sketch below)
- 🚀 **Scalable**
- Performant across thousands of GPUs
- Highly configurable and extensible
- Parallelizable across (tensor, pipeline, sequence) dimension(s)
- _Robust against {hardware, network, filesystem, transient}
failures_[^robust]
[^implementation]:|
{{< fa brands github >}} [argonne-lcf/Megatron-DeepSpeed](https://github.com/argonne-lcf/Megatron-DeepSpeed)
[^robust]:|
*Very much a WIP*
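
A minimal sketch of the kind of device / backend detection this requires. The choices below are assumptions for illustration (e.g. the XPU branch assumes Intel's oneCCL bindings for PyTorch are installed, as on Aurora); `ezpz` and Megatron-DeepSpeed handle this logic internally.

```python
import torch

def get_device_and_backend() -> tuple[str, str]:
    """Pick an accelerator type and a matching torch.distributed backend (sketch)."""
    if torch.cuda.is_available():                            # NVIDIA (CUDA) or AMD (ROCm)
        return "cuda", "nccl"
    if hasattr(torch, "xpu") and torch.xpu.is_available():   # Intel XPU (Aurora)
        return "xpu", "ccl"                                  # assumes oneCCL bindings
    if torch.backends.mps.is_available():                    # Apple silicon
        return "mps", "gloo"
    return "cpu", "gloo"
```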
## 🏋️ Challenges: In Practice
This is _incredibly_ difficult in practice, due in part to:
- Brand new {hardware, architecture, software}
- Lack of native support in existing frameworks (though getting better!)
- General system stability
\+10k Nodes $\left(\times \frac{12\,\,\mathrm{XPU}}{1\,\,\mathrm{Node}}\right)\Rightarrow$ \+**100k** XPUs
- network performance
- file system stability (impacted by _other users_!)
- _many_ unexpected difficulties occur at increasingly large scales
- Combinatorial explosion of possible configurations and experiments
- {hyperparameters, architectures, tokenizers, learning rates, ...}
## 💾 Training: 2T Tokens {background-color="white"}
- To train a fixed model on trillions of tokens requires:
1. **Aggregating** data from multiple different _corpora_
(e.g. ArXiv, Reddit, StackExchange, GitHub, Wikipedia, ...)
1. **Sampling** _each training batch_ according to a fixed distribution
across corpora (sketched below)
1. **Building** indices that map batches of tokens into these files
(indexing)
::: {.red-card}
The original implementation was _slow_:
- Designed to run _serially_ on a **single device**
- **Major bottleneck** when debugging data pipeline at scale
:::
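
A minimal sketch of what step 2 amounts to: each sample in a batch is drawn from a corpus according to fixed blending weights. The corpus names and weights below are made up for illustration; the real pipeline reads them from the training data config.

```python
import numpy as np

# Hypothetical corpora and blending weights (must sum to 1) -- illustration only.
corpora = ["arxiv", "reddit", "stackexchange", "github", "wikipedia"]
weights = np.array([0.30, 0.20, 0.20, 0.15, 0.15])

def sample_batch_sources(global_batch_size: int, rng: np.random.Generator) -> np.ndarray:
    """For each sample in a batch, pick which corpus it is drawn from."""
    return rng.choice(len(corpora), size=global_batch_size, p=weights)

rng = np.random.default_rng(0)
counts = np.bincount(sample_batch_sources(2048, rng), minlength=len(corpora))
print(dict(zip(corpora, counts.tolist())))  # empirical mix ~ target weights
```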
## 🍹 Blending Data, Efficiently {background-color="white"}
::: {.flex-container style="padding: 10pt; justify-content: space-around; align-items: flex-start;"}
::: {.column style="width:25%;"}
- 🐢 Original implementation:
- **Slow** (serial, single device)
- [\~ 1 hr]{.dim-text}/2T tokens
- 🐇 New implementation:
- **Fast!** (distributed, asynchronous)
- [\~ **2 min**]{style="color:#2296F3;"}/2T tokens
(**30x** faster!!)
:::
::: {.column}
{#fig-data-processing .r-stretch}
:::
:::
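
The idea behind the speedup is simple: instead of building every corpus index serially on a single device, the per-corpus work is spread across ranks and the results are shared. A rough sketch of that pattern using `torch.distributed` (not the actual implementation; `build_index` is a placeholder for the real per-corpus work):

```python
import torch.distributed as dist

def build_index(corpus: str) -> str:
    """Placeholder for the expensive per-corpus work (tokenize, index, write files)."""
    return f"{corpus}.idx"

def build_all_indices(corpora: list[str]) -> list[str]:
    """Round-robin the corpora across ranks instead of looping serially on one rank."""
    rank, world_size = dist.get_rank(), dist.get_world_size()
    local = [build_index(c) for c in corpora[rank::world_size]]
    gathered = [None] * world_size
    dist.all_gather_object(gathered, local)   # every rank ends up with every index
    return [idx for per_rank in gathered for idx in per_rank]
```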
## 📉 Loss Curve: Training AuroraGPT-7B on 2T Tokens {background-color="white"}
::: {.content-visible when-format="html" unless-format="revealjs"}
::: {#fig-loss-curve}
{width="90%" style="margin-left:auto;margin-right:auto;"}
Loss curve during training on 2T tokens.
:::
:::
::: {.content-visible when-format="revealjs"}
::: {#fig-loss-curve}
{width="90%" style="margin-left:auto;margin-right:auto;"}
Loss curve during training on 2T tokens.
:::
:::
## ✨ Features {background-color="white"}
- 🕸️ **Parallelism**:
- {data, tensor, pipeline, sequence, ...}
- ♻️ **Checkpoint Converters**:
- Megatron ⇄ 🤗 HF ⇄ ZeRO ⇄ Universal
- 🔀 **DeepSpeed Integration**:
- ZeRO Offloading
- Activation checkpointing
- AutoTP (*WIP*)
- ability to leverage features from DeepSpeed community
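
For context, these features are driven by a DeepSpeed config; the sketch below (written as a Python dict) shows roughly what ZeRO offloading plus activation checkpointing look like there. The values are illustrative placeholders, *not* the settings used for AuroraGPT.

```python
# Illustrative DeepSpeed config (placeholder values, not AuroraGPT's settings);
# typically passed to deepspeed.initialize(..., config=ds_config).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {"device": "cpu"},   # ZeRO offloading
    },
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": True,
    },
}
```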
## ✨ Features (even more!) {background-color="white"}
- 🧗 **Optimizers**[^marieme]:
- Support for *many* different optimizers:
- Distributed Shampoo, Muon, Adopt, Sophia, Lamb, GaLORE, ScheduleFree, ...
- See
[full list](https://github.com/argonne-lcf/Megatron-DeepSpeed/blob/e3b0398d2f2d3f8ec543e99373ca14bd18a1e4f8/megatron/arguments.py#L1477-L1502)
- Large batch training
- 📊 **Experiment Tracking**:
- Automatic experiment and metric tracking with Weights \& Biases
[^marieme]: Implemented by Marieme Ngom
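
The tracking itself follows the standard Weights \& Biases pattern; a minimal sketch (project and metric names here are placeholders):

```python
import wandb

run = wandb.init(project="example-project", config={"lr": 3e-4})  # placeholder names
for step in range(10):
    loss = 1.0 / (step + 1)                   # stand-in for the real training loss
    wandb.log({"train/loss": loss}, step=step)
run.finish()
```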
::: {.content-visible when-format="revealjs"}
## {.smaller background-color="#040406"}
:::
::: {.flex-container style="align-items: center; gap: 5pt;"}
::: {.column style="width:55%; text-align: center;"}
<!-- ::: {.callout-tip icon=false title="🔭 LLMs for Science"} -->
[🔭 LLMs for Science]{style="font-weight: 600; font-size: 1.5em;"}
{{< iconify fa twitter >}} [source](https://x.com/tenderizzation/status/1944591320796090606)
([\@tenderizzation](https://twitter.com/tenderizzation))
ChatGPT: [explain this image](https://chatgpt.com/share/688ab77e-9ca0-800a-8ab0-a293e06b3cce)
<!-- ::: -->
:::
::: {.column}

:::
:::
## 🤔 Evaluating Models on Scientific Applications {background-color="white"}
- What to measure?
- **Knowledge Extraction, Retrieval, Distillation, Synthesis**: LLM is
provided a question or instruction and a truthful answer is expected
- **Text Grounded**: Answers are expected to be fully grounded on
peer-reviewed references to support responses
- **Reasoning**: LLMs are expected to solve both deductive problems (prove a
theory or hypothesis from formal logic and observations) and inductive
problems (validate / explain observations from theories)
- **Creativity**: A creative answer is expected from a question or
instruction
- thoughtful dialogue, coding, etc.
## ⚖️ Evaluating FM Skills for Science: Criteria {background-color="white"}
- Criteria for all of the above:
- **Correctness** of facts
- **Accuracy** of solutions and inferences
- **Reliability**: consistently good quality and performance
- **Speed**: how quickly a response is produced
- **\# shots**: how many examples are needed for good quality
- Extent of _prompt engineering_
## 🧬 MProt-DPO: Scaling Results {.smaller background-color="white"}
::: {.columns}
::: {.column style="width:70%;"}
::: {.flex-container style="align-items: center; text-align: center; margin-left: auto; margin-right: auto;"}
::: {#fig-mprot-3p5B-scaling0}
{width=100% style="margin:0; padding-unset;"}
Scaling results for the `3.5B` model across 38,400 XPUs
:::
:::
:::
::: {.column style="width:30%;"}
- ~ [4 EFLOPS]{.highlight-blue} @ Aurora
- 38,400 XPUs
= 3200 \[node\] x 12 \[XPU / node\]
- 🔔 Gordon Bell Finalist[^mprot-dpo]:
- [MProt-DPO: Breaking the ExaFLOPS Barrier for Multimodal Protein Design Workflows](https://dl.acm.org/doi/10.1109/SC41406.2024.00013)
:::
:::
[^mprot-dpo]: [@mprot-dpo2024]
<!-- [^aggregate]: -->
<!-- [^gb-finalist]: 🔔 Gordon Bell Finalist: [MProt-DPO](https://dl.acm.org/doi/10.1109/SC41406.2024.00013) [@mprot-dpo2024] -->
<!-- Scaling results[^gb-finalist] for `3.5B` Model -->
## 🧬 MProt-DPO: Scaling Results {.smaller background-color="#1c1c1c"}
::: {.columns}
::: {.column #fig-mprot-3p5B-scaling}

`3.5B` model
:::
::: {.column #fig-mprot-7B-scaling}

`7B` model
:::
:::
## 🚂 Loooooooooong Sequence Lengths {.smaller background-color="#1c1c1c"}
::: {.flex-container style="align-items: center; justify-content: center;"}
{style="height:50pt; margin: unset; padding: 0"}
[{{< iconify ic baseline-plus >}}]{.dim-text style="font-size: 2.0em;"}
{style="height:50pt; margin: unset; padding: 0;"}
:::
- Working with
[{{< fa brands microsoft >}} Microsoft/DeepSpeed](https://github.com/microsoft/DeepSpeed)
team to enable longer sequence lengths (context windows) for LLMs
- See my [blog post](https://samforeman.me/posts/auroragpt/long-sequences/) for additional details (a rough memory estimate is sketched below)
::: {#fig-long-seq}
::: {.flex-container}


:::
Maximum (achievable) `SEQ_LEN` for both `25B` and `33B` models (See: @song2023ds4sci)
:::
::: aside
[{{< fa brands github >}} `scaling4science`](https://github.com/saforem2/scaling4science)
[{{< fa brands github >}} `Megatron-DS-Benchmarking`](https://github.com/saforem2/Megatron-DS-Benchmarking)
:::
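
To see why long contexts have to be split across devices, here is a rough back-of-the-envelope estimate of the attention-score activations for a naive (non-fused) attention implementation. The model sizes are hypothetical, and fused / flash kernels avoid materializing these tensors, so treat this as intuition only:

```python
def attention_scores_gib(seq_len: int, n_heads: int, n_layers: int,
                         micro_batch: int = 1, bytes_per_el: int = 2) -> float:
    """Memory for the (batch, heads, seq, seq) attention scores, summed over layers."""
    per_layer = micro_batch * n_heads * seq_len * seq_len * bytes_per_el
    return n_layers * per_layer / 2**30

# Quadratic growth with sequence length (hypothetical 32-head, 48-layer model):
for s in (2_048, 32_768, 262_144):
    print(f"{s:>8d} tokens -> {attention_scores_gib(s, n_heads=32, n_layers=48):>14,.1f} GiB")
```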
<!--
## 🌎 Aeris: Scaling Results
::: {#tbl-aeris}
| Model | #Nodes | DP | GBS | TFLOPS/tile | MFU | EF(S) | EF(P) |
|-----------|--------|-----|------|-------------|------|-------|-------|
| 0.6B | 2304 | 32 | 1152 | 19.0 | 9% | 0.53 | 0.59 |
| 4B | 4352 | 8 | 576 | 18.3 | 8% | 0.95 | 1.00 |
| 16B | 8704 | 16 | 1152 | **42.7** | **20%**|**4.46**|**5.09**|
| 37B | **9000**| 5 | **500**| 32.3 | 16% | 3.80 | 3.98 |
Sustained and peak training throughput for Aeris on Aurora, across different
model sizes.
Note: `EF(S)` -- sustained ExaFLOPS, `EF(P)` -- peak ExaFLOPS {.responsive .striped .hover}
:::
::: aside
The gap between peak and sustained ExaFLOPS is primarily due to the time spent on the
optimizer step and gradient reduction. These components occur outside the
pipelined forward-backward pass and thus contribute to the reduction in
sustained throughput relative to the peak.
:::
-->
## 📓 References {background-color="white"}
::: {.flex-container style="gap: 2pt;"}
::: {.column}
- {{< fa brands github >}} [argonne-lcf / `Megatron-DeepSpeed`](https://github.com/argonne-lcf/Megatron-DeepSpeed)
[For the largest of large language models.]{.dim-text}
- {{< fa brands github >}} [saforem2 / `ezpz`](https://github.com/saforem2/ezpz)
[Distributed training, ezpz. 🍋]{.dim-text}
- 📊 See my other slides at [samforeman.me/talks](https://samforeman.me/talks):
- [LLMs from Scratch](https://saforem2.github.io/llm-workshop-talk)
- [Creating Small(\~ish) LLMs](https://saforem2.github.io/LLM-tutorial)
- [Parallel Training Techniques](https://saforem2.github.io/parallel-training-slides)
- [LLMs on Polaris](https://samforeman.me/talks/llms-on-polaris/#/title-slide)
- [Training LLMs at Scale](https://samforeman.me/talks/llms-at-scale/)
:::
::: {.column}
- 👀 See also:
- [New international consortium for generative AI models for science](https://www.anl.gov/article/new-international-consortium-formed-to-create-trustworthy-and-reliable-generative-ai-models-for)
- [PyTorch Distributed Overview](https://pytorch.org/tutorials/beginner/dist_overview.html)
- [🤗 Efficient Training on Multiple GPUs](https://huggingface.co/docs/transformers/en/perf_train_gpu_many)
- [Getting Started - DeepSpeed](https://www.deepspeed.ai/getting-started/)
- 🕸️ [Quality Measures for Dynamic Graph Generative Models](https://openreview.net/forum?id=8bjspmAMBk)
[@hosseini2025quality]
:::
:::
## ❤️ Thank you! {background-color="white"}
- Organizers
- Feel free to reach out!
<split even>
[<i class="fas fa-home"></i>](https://samforeman.me)
[<i class="far fa-paper-plane"></i>](mailto:///[email protected])
[<i class="fab fa-twitter"></i>](https://www.twitter.com/saforem2)
</split>
::: {.callout-note icon=false title="🙏 Acknowledgements" collapse="false"}
This research used resources of the Argonne Leadership Computing Facility,
which is a DOE Office of Science User Facility supported under Contract
DE-AC02-06CH11357.
:::
## 📑 Bibliography {background-color="white"}
- Refs:
- @wei2022emergentabilitieslargelanguage
- Animations from [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
::: {#refs}
:::
<!--
## ♻️ Life Cycle of the LLM {background-color="white"}
::: {.panel-tabset style="text-align:center"}
### 📝 Pre-training {background-color="white"}
::: {#fig-pretraining style="width:90%; text-align: center; margin-left: auto; margin-right: auto;"}

**Pre-training**: Virtually all of the compute used during pretraining phase
:::
<!--
### 🎀 Fine-Tuning {background-color="white"}
::: {#fig-fine-tuning style="width:90%; text-align: center; margin-left: auto; margin-right: auto;"}

**Fine-tuning**: Fine-tuning actually updates the model's weights to make the model better at a certain task.
:::
:::
-->
<!--
## 🍎 Training LLMs {.smaller background-color="white"}
:::: {.flex-container style="align-items: flex-end; "}
::: {.column style="width:221pt;"}
::: {#fig-it-hungers}

It's hungry!
:::
:::
::: {.column style="width:60%;"}
::: {#fig-evolution}

Visualization from @yang2023harnessing
:::
:::
::::
-->
<!--
### 💾 Evaluating Checkpoints {background-color="white"}
```python
from typing import Optional
import os
from pathlib import Path
from transformers import LlamaForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
def load_model(ckpt_dir) -> LlamaForCausalLM:
return LlamaForCausalLM.from_pretrained(ckpt_dir)
def eval_model(model, max_length: int, prompt: str) -> str:
return (
tokenizer.batch_decode(
model.generate(
**tokenizer(prompt, return_tensors="pt"),
max_length=max_length,
),
clean_up_tokenization_spaces=True,
skip_special_tokens=True,
)[0]
)
def load_and_eval_model_from_checkpoint(
step: int,
max_length: int = 64,
prompt: Optional[str] = None,
ckpt_root: Optional[os.PathLike | Path | str] = None,
) -> str:
print(f"Loading model from checkpoint at global step: {step}")
prompt = "What is it like in there?" if prompt is None else prompt
ckpt_root = Path("checkpoints") if ckpt_root is None else Path(ckpt_root)
ckpt_dir = ckpt_root.joinpath(f"global_step{step}")
return (
eval_model(
model=load_model(ckpt_dir.as_posix()),
max_length=max_length,
prompt=prompt,
)
)
```
### Model Evaluations {background-color="white"}
::: {.panel-tabset}
#### 7000
Tokens: 88B
```python
>>> print(load_checkpoint(7000))
Loading model from checkpoint at global step: 7000
"What is it like in there?"
"""
I'm not sure if it's a good idea to use a different name for the same thing,
but I'm sure it's a good idea to use a different name for the same thing.
I'm not sure if it's a good idea to use a different name for the same thing,
but I'm sure it's a good idea to use a different name for the same thing.
I'm not sure if it's a good idea to use a different name for the same thing,
but I'm sure it
"""
```
#### 12000
Tokens: 150B
```python
>>> print(load_checkpoint(12000))
Loading model from checkpoint at global step: 12000
"What is it like in there?"
"""
What is it like in there?
The people are very friendly and helpful.
What is it like in there?
The people are very friendly and helpful.
What is it like in there?
The people are very friendly and helpful.
What is it like in there?
The people are very friendly and helpful.
What is it like in there?
The people are very friendly and helpful.
What is it like in there?
"""
```
#### 17000
Tokens: 215B
```python
>>> print(load_checkpoint(17000))
Loading model from checkpoint at global step: 17000
"What is it like in there?"
"""
I’m not sure what to expect. I’m not sure what to expect from the people I’m
with. I’m not sure what to expect from the people I’m with. I’m not sure what
to expect from the people I’m with. I’m not sure what to expect from the people
I’m with.
I’m not sure what to expect from the people I’m with.
I’m not sure what to expect from the people I’m with.
I’m not sure what to expect from the people
"""
```
#### 22000
Tokens: 277B
```python
>>> print(load_checkpoint(22000))
Loading model from checkpoint at global step: 22000
"What is it like in there?"
"""
I’m a 20 year old guy from the UK. I’m a student at the University of
Manchester, studying Computer Science. I’m a big fan of the band, The Beatles,
and I’m a huge fan of the movie, The Wizard of Oz. I’m a huge fan of the band,
The Beatles, and I’m a huge fan of the movie, The Wizard of Oz.
I’m a big fan of the band, The Beatles, and I’m a huge fan of the movie
"""
```
#### 32000
Tokens: 400B
```python
>>> print(load_checkpoint(32000))
Loading model from checkpoint at global step: 32000
"What is it like in there?"
"""
I've been to the US and I've been to Canada.
In the US, it's a lot like the US.
In Canada, it's a lot like the US.
In the US, it's a lot like the US.
In Canada, it's a lot like the US.
In the US, it's a lot like the US.
In Canada, it's a lot like the US.
In the US, it's
"""
```
#### 40000
Tokens: 503B
```python
>>> print(load_checkpoint(40000))
Loading model from checkpoint at global step: 40000
"What is it like in there?"
"""
The first thing you notice when you enter the room is the size. It’s huge. It’s
like a football field. It’s a lot of space.
The second thing you notice is the light. It’s bright. It’s bright.
The third thing you notice is the sound. It’s loud. It’s loud.
The fourth thing you notice is the smell. It’s a lot of smells. It’s a lot of smells.
The fifth thing you notice is the temperature. It’s hot.
"""
```
:::
-->
<!-- ::: -->
<!--
- Being trained on:
:::: {.flex-container style="flex-direction:row; justify-content: space-around;"}
::: {.flex-container style="flex-direction:column;"}
🇺🇸English
🇯🇵日本語
🇫🇷French
🇩🇪Deutsch
🇪🇸Español[^bsc]
🇮🇹Italian
:::
::: {.flex-container style="flex-direction:column;"}
🧪 scientific text
🖼️ images
📊 tables
➕ equations
📖 proofs
:::
::: {.flex-container style="flex-direction:column;"}
📆 structured data
⛓️ sequences
⏰ time-series
🕸️ graphs
🌀 fields
:::
::::
[^riken]:|
[Argonne and RIKEN sign a MOU in support of AI for science](https://www.anl.gov/article/argonne-and-riken-sign-a-memorandum-of-understanding-in-support-of-ai-for-science)
[^bsc]:|
Collaborations with Barcelona Supercomputing Center
-->