LLMs on Aurora: Hands-On

πŸ‹ ezpz

Sam Foreman
[email protected]

ALCF

2025-05-07

ALCF INCITE GPU HACKATHON
May 20–22, 2025


πŸŽ₯ recording

πŸ“ Currently

Figure 1: Current state of LLM Pretraining. [Source]

πŸ’¬ LLMs on Aurora

  • πŸ‹ ezpz
  • πŸ€— transformers
  • 🏎️ Megatron-DeepSpeed

πŸ‹ ezpz

Write once, run anywhere

🐣 Getting Started

  1. Submit interactive job:

    qsub -I -l select=2 -l walltime=01:00:00 \
        -l filesystems=home:flare \
        -A gpu_hack \
        -q gpu_hack_prio
  2. Source1 the ezpz/bin/utils.sh script (using curl to download it2):

    source <(curl -L https://bit.ly/ezpz-utils)
  1. In general, you should be wary of running random scripts from the internet.

  2. https://bit.ly/ezpz-utils, since https://raw.githubusercontent.com/saforem2/ezpz/main/bin/utils.sh is a bit of a pain

πŸ–οΈ Shell Environment

  1. Set up the environment:

    ezpz_setup_env

 

πŸ” Environment Setup with ezpz_setup_env

  • Wrapper around ezpz_setup_job && ezpz_setup_python
  1. ezpz_setup_job: Determine the specifics of our active (PBS, SLURM) job1

  2. ezpz_setup_python:

    • if @ ALCF:
      • Load the appropriate modules and activate base conda env
    • else:
      • Look for an active conda environment
        • If found, use it to build a new virtual environment
    • Activate the newly created venvs/$(basename ${CONDA_PREFIX}) environment
  1. e.g. ${NHOSTS}, ${NGPU_PER_HOST}, ${NGPUS}, …

⏱️ Working with Job Scheduler(s)

  • ezpz integrates directly with the ALCF job scheduler1
    • has mechanisms for getting information about our currently running jobs
  • πŸͺ„ Automagically:
    • Determine the specifics of our active (PBS, SLURM) job
      (e.g. ${NHOSTS}, ${NGPU_PER_HOST}, ${NGPUS}, …)
    • Load the appropriate modules2
    • Create (or activate) a virtual environment on top of a base conda environment
  1. Should also work with SLURM (needs further testing)

  2. On any of the ALCF systems, including: Aurora, Polaris, …, etc.
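
As a rough illustration, the scheduler-derived quantities above can be cross-checked from Python once the job is set up. This is a small sketch that assumes ezpz_setup_job has exported NHOSTS, NGPU_PER_HOST, and NGPUS into the environment (variable names taken from the footnote above):

    import os
    import ezpz

    # Assumed to be exported by `ezpz_setup_job` (see above)
    nhosts = int(os.environ.get("NHOSTS", 1))
    ngpu_per_host = int(os.environ.get("NGPU_PER_HOST", 1))
    ngpus = int(os.environ.get("NGPUS", nhosts * ngpu_per_host))

    _ = ezpz.setup_torch()
    logger = ezpz.get_logger(__name__)
    logger.info(
        f"{nhosts=}, {ngpu_per_host=}, {ngpus=}, world_size={ezpz.get_world_size()}"
    )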

πŸ”„ Use Custom Node Lists

  • Experiment1 with custom hostfile(s), e.g.:

    source <(curl -L https://bit.ly/ezpz-utils)
    # 1. If no `hostfile` specified, find and use `$PBS_NODEFILE` 
    ezpz_setup_job
    # 2. Grab a subset of nodes:
    head -n 2 $PBS_NODEFILE > nodefile-0-1
    # 3. Pass custom `nodefile-0-1`:
    ezpz_setup_job nodefile-0-1  # will use `nodefile-0-1`
  1. Or, for example, if you would like to exclude a node you suspect is having issues

🐍 Python Environments

  • ALWAYS work inside a virtual environment
    • best practice is to maintain separate virtual environments for:
      • each project you work on
      • different versions of a specific package you’re working with
        e.g. you would want different envs for torch==2.X vs torch==2.Y
    • Mangled python environments are one of the most common issues faced by users
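
As a quick sanity check (a small sketch, not part of ezpz), you can confirm from inside Python that you are actually running in a virtual environment before installing anything:

    import sys

    # In a venv, `sys.prefix` points at the venv while `sys.base_prefix`
    # points at the interpreter the venv was built on top of.
    in_venv = sys.prefix != sys.base_prefix
    print(f"{sys.prefix=}")
    print(f"{in_venv=}")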

πŸ“¦ Install ezpz

  1. Install1:

    python3 -m pip install "git+https://github.com/saforem2/ezpz"
  2. Run distributed test:

    ezpz-test
  3. Launch arbitrary Python from Python:

    • Launch a module:

      ezpz-launch -m ezpz.test_dist
    • Launch a python string:

      ezpz-launch -c "'import ezpz; ezpz.setup_torch()'"
  1. You should always be working in a virtual environment. See: πŸ–οΈ Shell Environment

βž• How to Modify Existing Code

+ import ezpz
+ _ = ezpz.setup_torch()

- model.to('cuda')
+ model.to(ezpz.get_torch_device_type())
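
Putting those two changes together, a typical single-device training script becomes device-agnostic with only a handful of added lines. A minimal sketch, using only the ezpz calls shown in these slides:

    import ezpz
    import torch

    _ = ezpz.setup_torch()                   # initialize distributed PyTorch
    device = ezpz.get_torch_device_type()    # 'xpu' on Aurora, else 'cuda', 'mps', 'cpu', ...

    model = torch.nn.Linear(128, 128).to(device)   # instead of model.to('cuda')
    if ezpz.get_world_size() > 1:
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[ezpz.get_local_rank()]
        )

    x = torch.rand((64, 128)).to(device)
    y = model(x)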

✨ Features

  • Initializing PyTorch across multiple processes

    import ezpz
    _ = ezpz.setup_torch()
    rank = ezpz.get_rank()
    world_size = ezpz.get_world_size()
    local_rank = ezpz.get_local_rank()
  • Automatic device detection (xpu, cuda, mps, cpu, …)

    x = torch.rand((10, 10)).to(ezpz.get_torch_device_type())
  • Automatic (single-process) logging

    logger = ezpz.get_logger(__name__)
  • Distributed debugger:

    try:
        buggy_code()
    except Exception:
        ezpz.breakpoint(0)
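
Combined, the features above fit together in just a few lines (a minimal sketch):

    import ezpz
    import torch

    _ = ezpz.setup_torch()
    logger = ezpz.get_logger(__name__)
    device = ezpz.get_torch_device_type()

    try:
        x = torch.rand((10, 10)).to(device)
        logger.info(
            f"rank={ezpz.get_rank()}/{ezpz.get_world_size()}: {x.shape=}, {x.device=}"
        )
    except Exception:
        ezpz.breakpoint(0)   # distributed debugger, as shown above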

πŸ§ͺ Experiment Tracking

import ezpz
rank = ezpz.setup_torch()
logger = ezpz.get_logger(__name__)
if rank == 0:                   # -- [1.] --
    try:
        _ = ezpz.setup_wandb(
            "ezpz.examples.minimal"
        )
    except Exception:
        logger.exception(
            "Failed to initialize wandb, continuing without it"
        )

# ...build {model, optimizer}, etc...

for i in range(train_iters):
    metrics = train_step(...)
    logger.info(                 # -- [2.] --
        history.update(metrics)  # -- [3.] --
    )

if rank == 0:
    history.finalize()
  1. Initialize W&B (if WANDB_DISABLED is not set)
  2. Log summary of metrics to stdout
  3. Update history.history with metrics1
  1. Will automatically be reported to W&B if a run is detected

🀏 Minimal Example

  • See ezpz/examples/minimal.py
import os
import time
import ezpz
import torch

logger = ezpz.get_logger(__name__)


class Network(torch.nn.Module):
    def __init__(
        self,
        input_dim: int,
        output_dim: int,
        sizes: list[int] | None,
    ):
        super(Network, self).__init__()
        nh = output_dim if sizes is None else sizes[0]
        layers = [torch.nn.Linear(input_dim, nh), torch.nn.ReLU()]
        if sizes is not None and len(sizes) > 1:
            for idx, size in enumerate(sizes[1:]):
                layers.extend(
                    [torch.nn.Linear(sizes[idx], size), torch.nn.ReLU()]
                )
            layers.append(torch.nn.Linear(sizes[-1], output_dim))
        self.layers = torch.nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)


@ezpz.timeitlogit(rank=ezpz.get_rank())
def train(
    model: torch.nn.Module, optimizer: torch.optim.Optimizer
) -> ezpz.History:
    unwrapped_model = (
        model.module
        if isinstance(model, torch.nn.parallel.DistributedDataParallel)
        else model
    )
    history = ezpz.History()
    device_type = ezpz.get_torch_device_type()
    dtype = unwrapped_model.layers[0].weight.dtype
    bsize = int(os.environ.get("BATCH_SIZE", 64))
    isize = unwrapped_model.layers[0].in_features
    warmup = int(os.environ.get("WARMUP_ITERS", 10))
    log_freq = int(os.environ.get("LOG_FREQ", 1))
    model.train()
    for step in range(int(os.environ.get("TRAIN_ITERS", 500))):
        with torch.autocast(
            device_type=device_type,
            dtype=dtype,
        ):
            t0 = time.perf_counter()
            x = torch.rand((bsize, isize), dtype=dtype).to(device_type)
            y = model(x)
            loss = ((y - x) ** 2).sum()
            dtf = (t1 := time.perf_counter()) - t0
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            dtb = time.perf_counter() - t1
            if step % log_freq == 0 and step > warmup:
                logger.info(
                    history.update(
                        {
                            "iter": step,
                            "loss": loss.item(),
                            "dt": dtf + dtb,
                            "dtf": dtf,
                            "dtb": dtb,
                        }
                    )
                )
    return history


@ezpz.timeitlogit(rank=ezpz.get_rank())
def setup():
    rank = ezpz.setup_torch()
    if os.environ.get("WANDB_DISABLED", False):
        logger.info("WANDB_DISABLED is set, not initializing wandb")
    elif rank == 0:
        try:
            _ = ezpz.setup_wandb(
                project_name=os.environ.get(
                    "PROJECT_NAME", "ezpz.examples.minimal"
                )
            )
        except Exception:
            logger.exception(
                "Failed to initialize wandb, continuing without it"
            )
    device_type = ezpz.get_torch_device_type()
    model = Network(
        input_dim=int(os.environ.get("INPUT_SIZE", 128)),
        output_dim=int(os.environ.get("OUTPUT_SIZE", 128)),
        sizes=[
            int(x)
            for x in os.environ.get("LAYER_SIZES", "1024,512,256,128").split(
                ","
            )
        ],
    )
    model.to(device_type)
    model.to(getattr(torch, os.environ.get("DTYPE", "bfloat16")))
    logger.info(f"{model=}")
    optimizer = torch.optim.Adam(model.parameters())
    if ezpz.get_world_size() > 1:
        from torch.nn.parallel import DistributedDataParallel as DDP

        model = DDP(model, device_ids=[ezpz.get_local_rank()])

    return model, optimizer


def main():
    model, optimizer = setup()
    history = train(model, optimizer)
    if ezpz.get_rank() == 0:
        dataset = history.finalize()
        logger.info(f"{dataset=}")


if __name__ == "__main__":
    main()

πŸƒβ€β™‚οΈ Running the Minimal Example

To run the previous example we:

  1. Source the ezpz utils script:

    source <(curl -L https://bit.ly/ezpz-utils)
  2. Set up our environment:

    ezpz_setup_env
  3. Run the example:

    ezpz-launch -m ezpz.examples.minimal

πŸ“ ezpz-test

  • ezpz-test is a simple test script that trains a small model using DDP across all available GPUs

    • It will automatically detect the number of GPUs and launch an appropriate mpiexec command to run the training script across all of them
  • See: ezpz/test.py

  • Command:

    ezpz-test

🦜 Generate Text

  • See: ezpz/generate.py

  • Command:

    python3 -m ezpz.generate --model_name meta-llama/Llama-3.1-8B
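
Under the hood this is essentially standard Hugging Face text generation on the ezpz-detected device. A rough, hypothetical sketch of equivalent code (the actual ezpz.generate implementation may differ):

    import ezpz
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = ezpz.get_torch_device_type()
    model_name = "meta-llama/Llama-3.1-8B"

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16
    ).to(device)

    inputs = tokenizer("Aurora is", return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))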

πŸ€— Huggingface Trainer

  • See ezpz/hf_trainer.py

  • Command:

    ezpz-launch -m ezpz.hf_trainer \
        --dataset_name=eliplutchok/fineweb-small-sample \
        --streaming \
        --model_name_or_path=meta-llama/Llama-3.2-1B \
        --bf16=true \
        --do_train=true \
        --do_eval=true \
        --report-to=wandb \
        --logging-steps=1 \
        --include-tokens-per-second=true \
        --block-size=128 \
        --max-steps=10 \
        --include-num-input-tokens-seen=true \
        --auto_find_batch_size=true \
        --gradient_checkpointing=true \
        --optim=adamw_torch \
        --overwrite-output-dir=true \
        --logging-first-step \
        --include-for-metrics='inputs,loss' \
        --max-eval-samples=50 \
        --ddp-backend=ccl

🏎️ Megatron-DeepSpeed

git clone https://github.com/argonne-lcf/Megatron-DeepSpeed
cd Megatron-DeepSpeed
source <(curl -L https://bit.ly/ezpz-utils)
python3 -m pip install \
    deepspeed \
    "git+https://github.com/saforem2/ezpz"
bash train_alcf.sh

πŸ™Œ Acknowledgements

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

samforeman.me/talks/incite-hackathon-2025/ezpz/slides
