# AuroraGPT

Sam Foreman

2024-10-30

- [🎯 AuroraGPT Goals](#dart-auroragpt-goals)
- [🧪 AuroraGPT: Open Science Foundation Models](#test_tube-auroragpt-open-science-foundation-models)
- [📊 AuroraGPT Outcomes](#bar_chart-auroragpt-outcomes)
- [🌌 Aurora](#milky_way-aurora)
- [🤖 ALCF AI Testbed](#robot-alcf-ai-testbed)
- [👥 Team Leads](#busts_in_silhouette-team-leads)
- [🤝 Teams](#handshake-teams)
- [🦜 Model Training](#parrot-model-training)
- [🚀 Accelerating Dataset Processing at Scale for Training](#rocket-accelerating-dataset-processing-at-scale-for-training)
- [🚀 Accelerating Dataset Processing: Results](#rocket-accelerating-dataset-processing-results)
- [📓 References](#notebook-references)
- [❤️ Thank you!](#heart-thank-you)
- [📑 Bibliography](#bookmark_tabs-bibliography)
- [🎁 Extras](#gift-extras)
  - [🚂 Loooooooooong Sequence Lengths](#steam_locomotive-loooooooooong-sequence-lengths)
  - [♻️ Life Cycle of the LLM](#recycle-life-cycle-of-the-llm)
  - [🍎 Training LLMs](#apple-training-llms)

## 🎯 AuroraGPT Goals

**AuroraGPT**: *General purpose scientific LLM*

Broadly trained on a general corpus plus scientific {papers, texts, data}

- **Explore pathways** towards a “Scientific Assistant” model
- **Build with international partners** (RIKEN, BSC, others)
- **Multilingual**: English, 日本語, French, German, Spanish
- **Multimodal**: images, tables, equations, proofs, time series, graphs, fields, sequences, etc.

![](./assets/llms.gif) Figure 1: Image from [`Hannibal046/Awesome-LLM`](https://github.com/Hannibal046/Awesome-LLM)

Figure 2: Credit to the entire AuroraGPT team for slides.

- Here to talk about AuroraGPT, Argonne’s internal effort to build a general purpose scientific LLM, broadly trained on a general corpus of text + scientific {papers, text, data}
- As part of this effort, we plan to…
  - Explore pathways, build with international partners, multi-{lingual, modal}
- Rough timeline of the project and deliverables:
  - 202{3,4}: text-only models, plan to release a series of {7B, 70B, 1T} models
  - 202{4,5}: Basic multi-modal models
  - 202{5,6}: Advanced scientific multimodal models

## 🧪 AuroraGPT: Open Science Foundation Models

![](./assets/AuroraGPT.svg) Figure 3: [`argonne-lcf/Megatron-DeepSpeed`](https://github.com/argonne-lcf/Megatron-DeepSpeed)

- AuroraGPT will be a publicly distributed, open source foundation model for open science
- Is being trained on:
  - Scientific / engineering structured data
  - General text, media, news, etc.
  - Large amounts of low to medium quality data
  - Much less high quality data (that is publicly available for use)
- This data is then cleaned, processed, de-duplicated and used for the initial pre-training phase of the model
- The vast majority of the overall compute is spent during this initial pre-training phase
  - This is the group I help to lead and will be talking a bit about today
- The initial pre-training phase is currently underway
  - Eventually, given a bit of time, effort and magic, the model will be ready for fine-tuning and additional training for a variety of downstream tasks
- The pretrained model will then be handed off for additional fine-tuning on a variety of downstream tasks
  - Scientific discovery
  - Accelerate scientific tasks
  - Digital twins
  - Inverse design
  - Code optimization
  - Accelerated simulations
  - Autonomous experiments
  - Co-design
- Becoming increasingly clear that LLMs have the potential to drastically accelerate computational science
  - We’ve seen this already for {GenSLMs, Weather / Climate / Earth Systems Modeling, Particle Physics, etc.}

## 📊 AuroraGPT Outcomes

- **Datasets and data pipelines** for preparing science training data
- **Software infrastructure and workflows** to train, evaluate and deploy LLMs at scale for scientific research purposes
- **Evaluation of state-of-the-art LLM Models** to determine where they fall short in deep scientific tasks and where deep data may have an impact
- **Assessment of the approach** of augmenting web training data with two forms of data specific to science
  - Full text scientific papers
  - Structured scientific datasets (suitably mapped to narrative form)
- **Research grade artifacts** (**models**) for the scientific community to adapt for downstream uses
- **Promotion of responsible AI** best practices where we can figure them out
- **International Collaborations** around the long term goal of *AGI for science*

- Deliverables:
  - datasets, pipelines
  - software infrastructure, workflows to interface with science applications
  - checkpoints, models, logs, workbook, insights, etc.
- Hope to understand:
  - How different state-of-the-art models perform at different scientific tasks
  - where deep data may have an impact
  - feasibility of generically augmenting text with scientific structured data
- Huge undertaking that will require large international collaborations around long term goal of AGI for science
- Extra points:
  - Well known that LLMs are good for non-consequential tasks
  - Known to “hallucinate” and create false information
  - Can this be mitigated reliably?

## 🌌 Aurora

Table 1: Aurora Specs

|       |        |
|------:|:-------|
| Racks | 166    |
| Nodes | 10,624 |
| CPUs  | 21,248 |
| GPUs  | 63,744 |
| NICs  | 84,992 |
| HBM   | 8 PB   |
| DDR5c | 10 PB  |

![](./assets/aurora.png) Figure 4: [Aurora Fact Sheet](https://www.alcf.anl.gov/sites/default/files/2024-07/Aurora_FactSheet_2024.pdf)
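
The system-wide counts in Table 1 can be turned into per-node figures with a little arithmetic. The sketch below only divides the table's totals by the node count; it assumes every node is configured identically, which the table itself does not state.

```python
# System-wide totals copied from Table 1 (Aurora Specs).
AURORA = {
    "racks": 166,
    "nodes": 10_624,
    "cpus": 21_248,
    "gpus": 63_744,
    "nics": 84_992,
}


def per_node(spec: dict[str, int]) -> dict[str, int]:
    """Divide each system-wide device count by the node count.

    Assumes a homogeneous machine; raises ValueError if a total is not an
    exact multiple of the node count.
    """
    nodes = spec["nodes"]
    result = {}
    for key in ("cpus", "gpus", "nics"):
        count, remainder = divmod(spec[key], nodes)
        if remainder:
            raise ValueError(f"{key} is not a multiple of the node count")
        result[key] = count
    return result


nodes_per_rack = AURORA["nodes"] // AURORA["racks"]

print(per_node(AURORA))  # {'cpus': 2, 'gpus': 6, 'nics': 8}
print(nodes_per_rack)    # 64
```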

## 🤖 ALCF AI Testbed

- ALCF AI Testbed Systems are in production and [available for allocations](https://accounts.alcf.anl.gov/#/allocationRequests) to the research community
- Significant improvement in time-to-solution and energy-efficiency for diverse AI for science applications.
- [NAIRR Pilot](https://nairrpilot.org/)

Up to **25**$\times$ improvement for genomic foundation models with **6.5**$\times$ energy efficiency

![](./assets/sambanova.jpeg) Figure 5: **SambaNova SN-30**: 2nd Gen, 8 nodes with 64 AI Accelerators

![](./assets/graphcore.png) Figure 6: **Graphcore Bow**: generation accelerators: Pod-64 configuration with 64 accelerators

![](./assets/cerebras.jpeg) Figure 7: **Cerebras**: 2x CS-2 WSE with Memory-X and Swarm-X technologies

![](./assets/groq.jpeg) Figure 8: **GroqRack**: 9 nodes, 8 GroqChip v1.5 Tensor Streaming Processor accelerators per node

## 👥 Team Leads

**Planning** Rick Stevens

**Data**

**Training** Venkat Vishwanath

**Evaluation** Franck Cappello

**Post**

**Inference** Rajeev Thakur

**Comms**

**Distribution** Brad Ullrich

## 🤝 Teams

- **Planning**
- **Data Prep**
  - Accumulate 20+ T tokens of high-quality scientific text and structured data
- **Models / Training**[^1]
  - Train (entirely from scratch) a series of models on publicly available data
- **Evaluation**
  - Skills, trustworthiness, safety, robustness, privacy, machine ethics

- **Post-Training**
  - Fine-tuning, alignment
- **Inference**
  - Model serving, API development / public-facing web services
- **Distribution**
  - Licensing, generating and distributing artifacts for public consumption
- **Communication**

## 🦜 Model Training

✅ **Goals**

- Want training runs at scale to be:
  - efficient
  - stable
  - reproducible
- This requires:
  - robust data pipelines / file IO
  - effectively overlapping compute with communication
  - stability across {network, filesystem, machine}
- 3D / Multi-dimensional Parallelism strategies
- Large batch training
- Second order optimizers
- Sub-quadratic attention
- State space models
- *Highly optimized GPU kernels*
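
To make "3D parallelism" concrete, here is a minimal sketch of how a flat set of ranks can be factored into data, pipeline and tensor parallel groups. The function names and the rank ordering (tensor index varying fastest) are illustrative assumptions, not the layout used by `Megatron-DeepSpeed`.

```python
from typing import NamedTuple


class ParallelCoords(NamedTuple):
    data: int      # which data-parallel replica this rank belongs to
    pipeline: int  # which pipeline stage this rank holds
    tensor: int    # which tensor-parallel shard this rank holds


def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    """Ranks left over for data parallelism once tp * pp form one replica."""
    if tp < 1 or pp < 1 or world_size % (tp * pp) != 0:
        raise ValueError("world_size must be a positive multiple of tp * pp")
    return world_size // (tp * pp)


def rank_to_coords(rank: int, world_size: int, tp: int, pp: int) -> ParallelCoords:
    """Map a flat rank to (data, pipeline, tensor) coordinates.

    Hypothetical ordering: tensor index varies fastest, then pipeline stage,
    then data-parallel replica.
    """
    dp = data_parallel_size(world_size, tp, pp)
    if not 0 <= rank < world_size:
        raise ValueError("rank out of range")
    coords = ParallelCoords(
        data=rank // (tp * pp),
        pipeline=(rank // tp) % pp,
        tensor=rank % tp,
    )
    assert coords.data < dp
    return coords


# 24 ranks split into 4 replicas, each with 3 pipeline stages of 2 shards.
print(data_parallel_size(24, tp=2, pp=3))  # 4
print(rank_to_coords(7, 24, tp=2, pp=3))   # ParallelCoords(data=1, pipeline=0, tensor=1)
```

Placing tensor-parallel peers on adjacent ranks is a common choice because that dimension typically communicates most often, but the right layout depends on the machine's network topology.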

❌ **Challenges**

- *Looong time* to train, can be:
  - weeks (even months) of continuous training
  - order of magnitude longer than typical NN training jobs
- Stability issues:
  - failures are expensive (but inevitable)
  - stragglers common at scale
- Individual jobs are:
  - **fragile**
  - only as good as the worst rank
  - one hang or bad worker can crash job
  - network / filesystem / other-user(s) dependent
- Cost / benefits of different collective communication algorithms
  - depend on optimized / efficient implementations
- Network performance
- *Highly optimized GPU kernels*
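
Two of the points above ("only as good as the worst rank" and "one hang or bad worker can crash job") can be illustrated with a toy model. This sketch assumes fully synchronous steps and independent node failures; the per-rank timings and the failure probability are made-up numbers, and only the node count comes from Table 1.

```python
def synchronous_step_time(per_rank_seconds: list[float]) -> float:
    """With a barrier at every step, the step takes as long as the slowest rank."""
    return max(per_rank_seconds)


def straggler_slowdown(per_rank_seconds: list[float]) -> float:
    """Ratio of the actual step time to the mean per-rank time."""
    mean = sum(per_rank_seconds) / len(per_rank_seconds)
    return synchronous_step_time(per_rank_seconds) / mean


def job_survival_probability(p_fail_per_node: float, n_nodes: int) -> float:
    """Chance that no node fails, assuming independent failures."""
    return (1.0 - p_fail_per_node) ** n_nodes


# Hypothetical timings: 63 ranks take 1.0 s, a single straggler takes 2.0 s.
times = [1.0] * 63 + [2.0]
print(synchronous_step_time(times))         # 2.0
print(round(straggler_slowdown(times), 2))  # 1.97

# Hypothetical 0.01% failure chance per node over some interval, on 10,624 nodes.
print(round(job_survival_probability(1e-4, 10_624), 2))  # 0.35
```

Even a tiny per-node failure rate compounds quickly at this node count, which is one reason checkpointing and fast restarts matter for long runs.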

[argonne-lcf / `Megatron-DeepSpeed`](https://github.com/argonne-lcf/Megatron-DeepSpeed)

## 🚀 Accelerating Dataset Processing at Scale for Training

- Training a fixed model on trillions of tokens requires:
  - Aggregating data from multiple different *corpora* (e.g. Reddit, StackExchange, GitHub, etc.)
  - Sampling *each training batch* according to a fixed distribution across corpora
  - Building indices that map batches of tokens into these files (indexing)
- The original implementation was slow, and designed to run on a single device
  - Major bottleneck when debugging data pipeline at scale
- [x] Completely rewrote it as an asynchronous, distributed implementation that *significantly* improves performance

## 🚀 Accelerating Dataset Processing: Results

- [x] Completely rewrote it as an asynchronous, distributed implementation that *significantly* improves performance (~~60 min~~ $\rightarrow$ 4 min)

Figure 9: Time spent building `BlendableDataset`

Figure 10: Time spent building `GPTDataset`
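
For readers unfamiliar with the indexing step, the sketch below shows one simple way to build a blending index: for every global sample, record which corpus it comes from and which sample within that corpus. It is a single-process, greedy illustration with a hypothetical function name (`blend_indices`); it is not the asynchronous, distributed implementation described above, and whether it matches that implementation's algorithm is an assumption.

```python
from collections import Counter


def blend_indices(weights: list[float], num_samples: int) -> tuple[list[int], list[int]]:
    """Greedily assign each global sample to a corpus.

    At every step, pick the corpus whose sample count lags furthest behind
    its target share. Returns (corpus_index, index_within_corpus) lists.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    shares = [w / total for w in weights]

    counts = [0] * len(shares)
    corpus_index: list[int] = []
    sample_index: list[int] = []
    for i in range(num_samples):
        # Deficit = samples this corpus should have by now minus samples it has.
        deficits = [share * (i + 1) - count for share, count in zip(shares, counts)]
        chosen = max(range(len(shares)), key=deficits.__getitem__)
        corpus_index.append(chosen)
        sample_index.append(counts[chosen])
        counts[chosen] += 1
    return corpus_index, sample_index


# Three hypothetical corpora blended 50% / 30% / 20%.
corpus_index, sample_index = blend_indices([0.5, 0.3, 0.2], num_samples=10)
print(Counter(corpus_index))  # corpus 0 drawn 5 times, corpus 1 3 times, corpus 2 twice
```

The loop is inherently sequential and runs once per sample, so at trillions of tokens this kind of index construction is costly, which is consistent with the bottleneck described above.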

## 📓 References

- See my other slides at [samforeman.me/talks](https://samforeman.me/talks)
  - [LLMs from Scratch](https://saforem2.github.io/llm-workshop-talk)
  - [Creating Small(~ish) LLMs](https://saforem2.github.io/LLM-tutorial)
  - [Parallel Training Techniques](https://saforem2.github.io/parallel-training-slides)
  - [LLMs on Polaris](https://samforeman.me/talks/llms-on-polaris/#/title-slide)
  - [Training LLMs at Scale](https://samforeman.me/talks/llms-at-scale/)
- [argonne-lcf / `Megatron-DeepSpeed`](https://github.com/argonne-lcf/Megatron-DeepSpeed)
  For the largest of large language models.
- [saforem2 / `ezpz`](https://github.com/saforem2/ezpz)
  Distributed training, ezpz. 🍋

- 👀 See also:
  - [New international consortium for generative AI models for science](https://www.anl.gov/article/new-international-consortium-formed-to-create-trustworthy-and-reliable-generative-ai-models-for)
  - [PyTorch Distributed Overview](https://pytorch.org/tutorials/beginner/dist_overview.html)
  - [Distributed Data Parallel — PyTorch master documentation](https://pytorch.org/docs/master/notes/ddp.html)
  - [🤗 Efficient Training on Multiple GPUs](https://huggingface.co/docs/transformers/en/perf_train_gpu_many)
  - [Getting Started - DeepSpeed](https://www.deepspeed.ai/getting-started/)

## ❤️ Thank you!

- Organizers
- Feel free to reach out!
  - [samforeman.me](https://samforeman.me)
  - [foremans@anl.gov](mailto:///foremans@anl.gov)
  - [twitter.com/saforem2](https://www.twitter.com/saforem2)

> [!NOTE]
>
> ### 🙏 Acknowledgements
>
> This research used resources of the Argonne Leadership Computing
> Facility, which is a DOE Office of Science User Facility supported
> under Contract DE-AC02-06CH11357.

## 📑 Bibliography

- Refs:
  - Wei et al. (2022)
  - Animations from [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)

Song, Shuaiwen Leon, Bonnie Kruft, Minjia Zhang, Conglong Li, Shiyang Chen, Chengming Zhang, Masahiro Tanaka, et al. 2023. “DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery Through Sophisticated AI System Technologies.”

Wei, Jason, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, et al. 2022. “Emergent Abilities of Large Language Models.”

Yang, Jingfeng, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, and Xia Hu. 2023. “Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond.”

## 🎁 Extras

### 🚂 Loooooooooong Sequence Lengths

- Working with the [Microsoft/DeepSpeed](https://github.com/microsoft/DeepSpeed) team to enable longer sequence lengths (context windows) for LLMs
  - See my [blog post](https://samforeman.me/posts/auroragpt/long-sequences/) for additional details

![25B](https://raw.githubusercontent.com/saforem2/scaling4science/main/assets/25B.svg) ![33B](https://raw.githubusercontent.com/saforem2/scaling4science/main/assets/33B.svg)

Figure 11: Maximum (achievable) `SEQ_LEN` for both `25B` and `33B` models (See: Song et al. (2023))

[`scaling4science`](https://github.com/saforem2/scaling4science) [`Megatron-DS-Benchmarking`](https://github.com/saforem2/Megatron-DS-Benchmarking)
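
One reason long sequences are hard: a naive attention implementation materializes a `seq_len × seq_len` score matrix per head, so memory grows quadratically with context length. The sketch below estimates that cost. The head count, micro-batch size and 2-byte element size are illustrative assumptions, not the settings behind Figure 11, and memory-efficient or sequence-parallel implementations avoid storing the full matrix.

```python
def attention_scores_bytes(
    seq_len: int,
    num_heads: int,
    micro_batch: int = 1,
    bytes_per_element: int = 2,  # assumes 16-bit (fp16 / bf16) scores
) -> int:
    """Memory for one layer's full attention score matrix (naive attention)."""
    return micro_batch * num_heads * seq_len * seq_len * bytes_per_element


# Hypothetical model with 32 attention heads.
for seq_len in (2_048, 32_768):
    gib = attention_scores_bytes(seq_len, num_heads=32) / 2**30
    print(f"seq_len={seq_len:>6}: {gib:g} GiB per layer")
# seq_len=  2048: 0.25 GiB per layer
# seq_len= 32768: 64 GiB per layer
```

A 16× longer sequence costs 256× more memory for this one tensor.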

### ♻️ Life Cycle of the LLM

### 📝 Pre-training

![](https://jalammar.github.io/images/gpt3/03-gpt3-training-step-back-prop.gif) Figure 12: **Pre-training**: Virtually all of the compute is used during the pre-training phase

### 🎀 Fine-Tuning

![](https://jalammar.github.io/images/gpt3/10-gpt3-fine-tuning.gif) Figure 13: **Fine-tuning**: Fine-tuning actually updates the model’s weights to make the model better at a certain task.
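
To get a rough feel for why pre-training dominates the compute budget, one can use the common rule of thumb that training a dense transformer costs about `6 × parameters × tokens` FLOPs. The sketch below applies it; the 7B and 70B sizes echo the planned model series, but both token counts are hypothetical placeholders, not AuroraGPT's actual budgets.

```python
def training_flops(num_params: int, num_tokens: int) -> int:
    """Approximate training cost via the ~6 * N * D rule of thumb."""
    return 6 * num_params * num_tokens


def pretraining_share(num_params: int, pretrain_tokens: int, finetune_tokens: int) -> float:
    """Fraction of total training FLOPs spent in pre-training."""
    pre = training_flops(num_params, pretrain_tokens)
    fine = training_flops(num_params, finetune_tokens)
    return pre / (pre + fine)


PRETRAIN_TOKENS = 2_000_000_000_000  # hypothetical: 2T tokens
FINETUNE_TOKENS = 1_000_000_000      # hypothetical: 1B tokens

for num_params in (7_000_000_000, 70_000_000_000):
    flops = training_flops(num_params, PRETRAIN_TOKENS)
    share = pretraining_share(num_params, PRETRAIN_TOKENS, FINETUNE_TOKENS)
    print(f"{num_params / 1e9:.0f}B params: {flops:.1e} FLOPs, {share:.2%} in pre-training")
# 7B params: 8.4e+22 FLOPs, 99.95% in pre-training
# 70B params: 8.4e+23 FLOPs, 99.95% in pre-training
```

Under this approximation the share depends only on the ratio of token counts, not on model size.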

### 🍎 Training LLMs

![](https://github.com/saforem2/llm-lunch-talk/blob/main/docs/assets/it_hungers.jpeg?raw=true) Figure 14: It’s hungry!

![](https://github.com/Mooler0410/LLMsPracticalGuide/raw/main/imgs/survey-gif-test.gif) Figure 15: Visualization from Yang et al. (2023)

[^1]: Co-led by: Venkat Vishwanath, Sam Foreman