AuroraGPT
🎯 AuroraGPT Goals
AuroraGPT: general-purpose scientific LLM
Broadly trained on general corpora plus scientific {papers, texts, data}
- Explore pathways towards a "Scientific Assistant" model
- Build with international partners (RIKEN, BSC, others)
- Multilingual: English, Japanese, French, German, Spanish
- Multimodal: images, tables, equations, proofs, time series, graphs, fields, sequences, etc.
Image from Hannibal046/Awesome-LLM
Here to talk about AuroraGPT, Argonne's internal effort to build a general-purpose scientific LLM, broadly trained on general corpora of text plus scientific {papers, text, data}
As part of this effort, we plan to…
- Explore pathways, build with international partners, multi-{lingual, modal}
Rough timeline of the project and deliverables:
- 202{3,4}: text-only models, plan to release a series of {7B, 70B, 1T} models
- 202{4,5}: Basic multi-modal models
- 202{5,6}: Advanced scientific multimodal models
🧪 AuroraGPT: Open Science Foundation Models
- AuroraGPT will be a publicly distributed, open source foundation model for open science
- Is being trained on:
- Scientific / engineering structured data
- General text, media, news, etc.
- Large amounts of low- to medium-quality data
- Much less high-quality data (that is publicly available for use)
- This data is then cleaned, processed, de-duplicated, and used for the initial pre-training phase of the model (see the de-duplication sketch at the end of this section)
- The vast majority of the overall compute is spent during this initial pre-training phase
- This is the group I help to lead and will be talking a bit about today
- The initial pre-training phase is currently underway
- Eventually, given a bit of time, effort, and magic, the pretrained model will be handed off for fine-tuning and additional training on a variety of downstream tasks:
- Scientific discovery
- Accelerate scientific tasks
- Digital twins
- Inverse design
- Code optimization
- Accelerated simulations
- Autonomous experiments
- Co-design
- It is becoming increasingly clear that LLMs have the potential to drastically accelerate computational science
- We've seen this already for {GenSLMs, Weather / Climate / Earth Systems Modeling, Particle Physics, etc.}
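A minimal sketch of the cleaning / de-duplication step mentioned above, assuming simple exact-match de-duplication on normalized text; the actual AuroraGPT data pipeline is more involved (e.g. fuzzy / near-duplicate detection), and the function names here are purely illustrative:

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def deduplicate(documents):
    """Drop exact duplicates by hashing the normalized text (illustrative only)."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["The quick brown fox.", "the  quick brown fox.", "A different document."]
print(deduplicate(docs))  # -> ['The quick brown fox.', 'A different document.']
```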
AuroraGPT Outcomes
- Datasets and data pipelines for preparing science training data
- Software infrastructure and workflows to train, evaluate, and deploy LLMs at scale for scientific research purposes
- Evaluation of state-of-the-art LLMs to determine where they fall short on deep scientific tasks and where deep data may have an impact
- Assessment of the approach of augmenting web training data with two forms of data specific to science
- Full text scientific papers
- Structured scientific datasets (suitably mapped to narrative form)
- Research-grade artifacts (models) for the scientific community to adapt for downstream uses
- Promotion of responsible AI best practices where we can figure them out
- International collaborations around the long-term goal of AGI for science
Deliverables:
- datasets, pipelines
- software infrastructure, workflows to interface with science applications
- checkpoints, models, logs, workbook, insights, etc.
Hope to understand:
- How different state-of-the-art models perform at different scientific tasks
- where deep data may have an impact
- feasibility of generically augmenting text with scientific structured data
A huge undertaking that will require large international collaborations around the long-term goal of AGI for science
Extra points:
- Well known that LLMs are good for non-consequential (low-stakes) tasks
- Known to "hallucinate" and create false information
- Can this be mitigated reliably?
Aurora
|       | Aurora |
|-------|-------:|
| Racks | 166    |
| Nodes | 10,624 |
| CPUs  | 21,248 |
| GPUs  | 63,744 |
| NICs  | 84,992 |
| HBM   | 8 PB   |
| DDR5c | 10 PB  |
🤖 ALCF AI Testbed
- ALCF AI Testbed Systems are in production and available for allocations to the research community
- Significant improvement in time-to-solution and energy efficiency for diverse AI-for-science applications.
- NAIRR Pilot
Up to 25X improvement for genomic foundation models with 6.5X energy efficiency
👥 Team Leads
- Planning
- Data
- Models / Training
- Evaluation
- Post-Training
- Inference
- Communication
- Distribution
🤝 Teams
- Planning
- Data Prep
- Accumulate 20+ T tokens of high-quality scientific text and structured data
- Models / Training
- Train (entirely from scratch) a series of models on publicly available data
- Evaluation
- Skills, trustworthiness, safety, robustness, privacy, machine ethics
- Post-Training
- Fine-tuning, alignment
- Inference
- Model serving, API development / public-facing web services
- Distribution
- Licensing, generating and distributing artifacts for public consumption
- Communication
Model Training
Goals
- Want training runs at scale to be:
- efficient
- stable
- reproducible
- This requires:
- robust data pipelines / file IO
- effectively overlapping compute with communication
- stability across {network, filesystem, machine}
- 3D / multi-dimensional parallelism strategies (see the sketch after this list)
- Large batch training
- Second-order optimizers
- Sub-quadratic attention
- State space models
- Highly optimized GPU kernels
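As a rough illustration of the multi-dimensional parallelism listed above, the sketch below factors a hypothetical set of ranks into data-, pipeline-, and tensor-parallel coordinates; real frameworks such as Megatron-DeepSpeed build their communication groups from a mapping of this general shape, though the exact ordering and sizes are configuration-dependent:

```python
# Toy illustration of a 3D (data x pipeline x tensor) parallel layout.
# All sizes below are hypothetical.

WORLD_SIZE = 48               # total number of GPUs
TP = 4                        # tensor-model-parallel size
PP = 3                        # pipeline-model-parallel size
DP = WORLD_SIZE // (TP * PP)  # data-parallel size = 4

assert DP * TP * PP == WORLD_SIZE

def coords(rank: int) -> tuple[int, int, int]:
    """Map a global rank to (data, pipeline, tensor) coordinates.

    With this convention, adjacent ranks share a tensor-parallel group.
    """
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    return dp, pp, tp

for rank in range(8):
    print(rank, coords(rank))
```

Keeping the tensor-parallel dimension innermost (i.e. on adjacent ranks) is a common choice because its collectives are the most bandwidth-hungry and benefit most from fast intra-node links.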
Challenges
- Looong time to train, can be:
- weeks (even months) of continuous training
- an order of magnitude longer than typical NN training jobs
- Stability issues:
- failures are expensive (but inevitable); see the checkpointing sketch after this list
- stragglers are common at scale
- Individual jobs are:
- fragile
- only as good as the worst rank
- one hang or bad worker can crash the job
- dependent on the network, filesystem, and other users
- Cost / benefits of different collective communication algorithms
- depend on optimized / efficient implementations
- Network performance
- Highly optimized GPU kernels
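Because failures are expensive but inevitable, long runs have to be resumable. The sketch below is a minimal, single-process illustration of periodic checkpointing with PyTorch; the file name, save cadence, and toy loss are assumptions, and production runs would rely on the framework's own (distributed) checkpointing machinery:

```python
import os
import torch
import torch.nn as nn

def train_with_checkpoints(model, optimizer, batches, ckpt_path, save_every=100):
    """Periodically checkpoint model/optimizer state so a crashed job can resume.

    Illustrative only: a real run would tune the cadence against I/O cost
    vs. expected time between failures.
    """
    step = 0
    if os.path.exists(ckpt_path):  # resume if a previous attempt left a checkpoint
        state = torch.load(ckpt_path, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        step = state["step"]

    for batch in batches:
        loss = model(batch).pow(2).mean()  # toy loss for illustration
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
        if step % save_every == 0:
            torch.save(
                {"model": model.state_dict(),
                 "optimizer": optimizer.state_dict(),
                 "step": step},
                ckpt_path,
            )

# Toy usage: a linear model on random data.
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batches = (torch.randn(4, 8) for _ in range(500))
train_with_checkpoints(model, optimizer, batches, "ckpt.pt")
```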
Accelerating Dataset Processing at Scale for Training
- To train a fixed model on trillions of tokens requires:
- Aggregating data from multiple corpora (e.g. Reddit, StackExchange, GitHub, etc.)
- Sampling each training batch according to a fixed distribution across corpora (see the sketch after this list)
- Building indices that map batches of tokens into these files (indexing)
- The original implementation was slow and designed to run on a single device
- Major bottleneck when debugging data pipeline at scale
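A toy version of the blended-corpus sampling and indexing described above: the corpora, weights, and index format are hypothetical, but the core idea of precomputing a deterministic (corpus, document) sample index according to a fixed mixture is the same one the indexing code implements, and it is exactly this precomputation that becomes a bottleneck when run serially on a single device:

```python
import numpy as np

# Hypothetical corpora with (number of documents, sampling weight).
corpora = {
    "reddit":        {"num_docs": 1_000_000, "weight": 0.2},
    "stackexchange": {"num_docs":   400_000, "weight": 0.3},
    "github":        {"num_docs":   600_000, "weight": 0.5},
}

rng = np.random.default_rng(seed=0)
names = list(corpora)
weights = np.array([corpora[n]["weight"] for n in names])
weights = weights / weights.sum()  # normalize the mixture

def build_sample_index(num_samples: int) -> list[tuple[str, int]]:
    """Precompute (corpus, document-id) pairs according to the fixed mixture.

    Precomputing (and caching) the index is what lets every data-parallel
    rank read the same deterministic sample order during training.
    """
    corpus_choices = rng.choice(len(names), size=num_samples, p=weights)
    index = []
    for c in corpus_choices:
        doc_id = int(rng.integers(corpora[names[c]]["num_docs"]))
        index.append((names[c], doc_id))
    return index

print(build_sample_index(8))
```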
References
- 💡 samforeman.me:
  - Talks: see my other slides on:
- argonne-lcf/Megatron-DeepSpeed: For the largest of large language models.
- saforem2/ezpz: Distributed training, ezpz.
- See also:
❤️ Thank you!
Bibliography
- Refs:
- Wei et al. (2022)
- Animations from The Illustrated Transformer
Song, Shuaiwen Leon, Bonnie Kruft, Minjia Zhang, Conglong Li, Shiyang Chen, Chengming Zhang, Masahiro Tanaka, et al. 2023. "DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery Through Sophisticated AI System Technologies." https://arxiv.org/abs/2310.04610.
Wei, Jason, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, et al. 2022. "Emergent Abilities of Large Language Models." https://arxiv.org/abs/2206.07682.
Yang, Jingfeng, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, and Xia Hu. 2023. "Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond." https://arxiv.org/abs/2304.13712.
Extras
Loooooooooong Sequence Lengths
- Working with the Microsoft DeepSpeed team to enable longer sequence lengths (context windows) for LLMs (see the sketch below)
- See my blog post for additional details
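As a back-of-the-envelope illustration of why long contexts are hard: the attention-score tensor alone grows quadratically with sequence length. The head count, batch size, and precision below are assumptions for illustration, not the configuration used in that work:

```python
def attention_score_bytes(seq_len: int, num_heads: int = 32,
                          micro_batch: int = 1, bytes_per_el: int = 2) -> float:
    """Rough size, in GiB, of the (batch, heads, seq, seq) attention-score tensor.

    This quadratic growth is why naive attention becomes the bottleneck at long
    context lengths, motivating sequence-parallel / memory-efficient attention.
    """
    n = micro_batch * num_heads * seq_len * seq_len * bytes_per_el
    return n / 2**30

for s in (4_096, 32_768, 131_072):
    print(f"{s:>8} tokens -> {attention_score_bytes(s):10.1f} GiB")
```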
♻️ Life Cycle of the LLM
Training LLMs
Citation
BibTeX citation:
@unpublished{foreman2024,
author = {Foreman, Sam},
title = {AuroraGPT},
date = {2024-09-04},
url = {https://samforeman.me/talks/hpc-user-forum/slides},
langid = {en}
}
For attribution, please cite this work as:
Foreman, Sam. 2024. โAuroraGPT.โ September 4. https://samforeman.me/talks/hpc-user-forum/slides.