LLMs on Aurora: Hands-On

πŸ‹ ezpz

Sam Foreman
[email protected]

ALCF

2025-05-07

ALCF INCITE GPU HACKATHON
May 20–22, 2025


πŸŽ₯ recording

πŸ“ Currently

Figure 1: Current state of LLM Pretraining. [Source]

πŸ’¬ LLMs on Aurora

  • πŸ‹ ezpz
  • πŸ€— transformers
  • 🏎️ Megatron-DeepSpeed

πŸ‹ ezpz

Write once, run anywhere

🐣 Getting Started

  1. Submit interactive job:

    qsub -I -l select=2 -l walltime=01:00:00 \
        -l filesystems=home:flare \
        -A gpu_hack \
        -q gpu_hack_prio
  2. Source1 the ezpz/bin/utils.sh script (using curl to download it2):

    source <(curl -L https://bit.ly/ezpz-utils)
  1. In general, you should be wary of running random scripts from the internet.

  2. https://bit.ly/ezpz-utils, since https://raw.githubusercontent.com/saforem2/ezpz/main/bin/utils.sh is a bit of a pain

πŸ–οΈ Shell Environment

  1. Set up the environment:

    ezpz_setup_env

 

πŸ” Environment Setup with ezpz_setup_env

  • Wrapper around ezpz_setup_job && ezpz_setup_python
  1. ezpz_setup_job: Determine the specifics of our active (PBS, SLURM) job1

  2. ezpz_setup_python:

    • if @ ALCF:
      • Load the appropriate modules and activate base conda env
    • else:
      • Look for an active conda environment
        • If found, use it to build a new virtual environment
    • Activate the newly created venvs/$(basename ${CONDA_PREFIX}) environment
  1. e.g. ${NHOSTS}, ${NGPU_PER_HOST}, ${NGPUS}, …

⏱️ Working with Job Scheduler(s)

  • ezpz integrates directly with the ALCF job scheduler1
    • has mechanisms for getting information about our currently running jobs
  • πŸͺ„ Automagically:
    • Determine the specifics of our active (PBS, SLURM) job
      (e.g. ${NHOSTS}, ${NGPU_PER_HOST}, ${NGPUS}, …)
    • Load the appropriate modules2
    • Create (or activate) a virtual environment on top of a base conda environment
  1. Should also work with SLURM (needs further testing)

  2. On any of the ALCF systems, including: Aurora, Polaris, …, etc.
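
As a rough illustration, the scheduler-derived quantities above can be cross-checked from Python once the job is set up. This is a small sketch that assumes ezpz_setup_job has exported NHOSTS, NGPU_PER_HOST, and NGPUS into the environment (variable names taken from the footnote above):

    import os
    import ezpz

    # Assumed to be exported by `ezpz_setup_job` (see above)
    nhosts = int(os.environ.get("NHOSTS", 1))
    ngpu_per_host = int(os.environ.get("NGPU_PER_HOST", 1))
    ngpus = int(os.environ.get("NGPUS", nhosts * ngpu_per_host))

    _ = ezpz.setup_torch()
    logger = ezpz.get_logger(__name__)
    logger.info(
        f"{nhosts=}, {ngpu_per_host=}, {ngpus=}, world_size={ezpz.get_world_size()}"
    )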

πŸ”„ Use Custom Node Lists

  • Experiment1 with custom hostfile(s), e.g.:

    source <(curl -L https://bit.ly/ezpz-utils)
    # 1. If no `hostfile` specified, find and use `$PBS_NODEFILE` 
    ezpz_setup_job
    # 2. Grab a subset of nodes:
    head -n 2 $PBS_NODEFILE > nodefile-0-1
    # 3. Pass custom `nodefile-0-1`:
    ezpz_setup_job nodefile-0-1  # will use `nodefile-0-1`
  1. Or, for example, if you would like to exclude a node you suspect is having issues

🐍 Python Environments

  • ALWAYS work inside a virtual environment
    • best practice is to maintain separate virtual environments for:
      • each project you work on
      • different versions of a specific package you’re working with
        e.g. you would want different envs for torch==2.X vs torch==2.Y
    • Mangled python environments are one of the most common issues faced by users
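
As a quick sanity check (a small sketch, not part of ezpz), you can confirm from inside Python that you are actually running in a virtual environment before installing anything:

    import sys

    # In a venv, `sys.prefix` points at the venv while `sys.base_prefix`
    # points at the interpreter the venv was built on top of.
    in_venv = sys.prefix != sys.base_prefix
    print(f"{sys.prefix=}")
    print(f"{in_venv=}")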

πŸ“¦ Install ezpz

  1. Install1:

    python3 -m pip install "git+https://github.com/saforem2/ezpz"
  2. Run distributed test:

    ezpz-test
  3. Launch arbitrary Python from Python:

    • Launch a module:

      ezpz-launch -m ezpz.test_dist
    • Launch a python string:

      ezpz-launch -c "'import ezpz; ezpz.setup_torch()'"
  1. You should always be working in a virtual environment. See: πŸ–οΈ Shell Environment

βž• How to Modify Existing Code

+ import ezpz
+ _ = ezpz.setup_torch()

- model.to('cuda')
+ model.to(ezpz.get_torch_device_type())
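
Putting those two changes together, a typical single-device training script becomes device-agnostic with only a handful of added lines. A minimal sketch, using only the ezpz calls shown in these slides:

    import ezpz
    import torch

    _ = ezpz.setup_torch()                   # initialize distributed PyTorch
    device = ezpz.get_torch_device_type()    # 'xpu' on Aurora, else 'cuda', 'mps', 'cpu', ...

    model = torch.nn.Linear(128, 128).to(device)   # instead of model.to('cuda')
    if ezpz.get_world_size() > 1:
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[ezpz.get_local_rank()]
        )

    x = torch.rand((64, 128)).to(device)
    y = model(x)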

✨ Features

  • Initializing PyTorch across multiple processes

    import ezpz
    _ = ezpz.setup_torch()
    rank = ezpz.get_rank()
    world_size = ezpz.get_world_size()
    local_rank = ezpz.get_local_rank()
  • Automatic device detection (xpu, cuda, mps, cpu, …)

    x = torch.rand((10, 10)).to(ezpz.get_torch_device_type())
  • Automatic (single-process) logging

    logger = ezpz.get_logger(__name__)
  • Distributed debugger:

    try:
        buggy_code()
    except Exception:
        ezpz.breakpoint(0)
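
Combined, the features above fit together in just a few lines (a minimal sketch):

    import ezpz
    import torch

    _ = ezpz.setup_torch()
    logger = ezpz.get_logger(__name__)
    device = ezpz.get_torch_device_type()

    try:
        x = torch.rand((10, 10)).to(device)
        logger.info(
            f"rank={ezpz.get_rank()}/{ezpz.get_world_size()}: {x.shape=}, {x.device=}"
        )
    except Exception:
        ezpz.breakpoint(0)   # distributed debugger, as shown above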

πŸ§ͺ Experiment Tracking

import ezpz
rank = ezpz.setup_torch()
logger = ezpz.get_logger(__name__)
if rank == 0:                   # -- [1.] --
    try:
        _ = ezpz.setup_wandb(
            "ezpz.examples.minimal"
        )
    except Exception:
        logger.exception(
            "Failed to initialize wandb, continuing without it"
        )

# ...build {model, optimizer}, etc...

for i in range(train_iters):
    metrics = train_step(...)
    logger.info(                 # -- [2.] --
        history.update(metrics)  # -- [3.] --
    )

if rank == 0:
    history.finalize()
  1. Initialize W&B (if WANDB_DISABLED is not set)
  2. Log summary of metrics to stdout
  3. Update history.history with metrics1
  1. Will automatically be reported to W&B if a run is detected

🀏 Minimal Example

  • See ezpz/examples/minimal.py
import os
import time
import ezpz
import torch

logger = ezpz.get_logger(__name__)


class Network(torch.nn.Module):
    def __init__(
        self,
        input_dim: int,
        output_dim: int,
        sizes: list[int] | None,
    ):
        super(Network, self).__init__()
        nh = output_dim if sizes is None else sizes[0]
        layers = [torch.nn.Linear(input_dim, nh), torch.nn.ReLU()]
        if sizes is not None and len(sizes) > 1:
            for idx, size in enumerate(sizes[1:]):
                layers.extend(
                    [torch.nn.Linear(sizes[idx], size), torch.nn.ReLU()]
                )
            layers.append(torch.nn.Linear(sizes[-1], output_dim))
        self.layers = torch.nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)


@ezpz.timeitlogit(rank=ezpz.get_rank())
def train(
    model: torch.nn.Module, optimizer: torch.optim.Optimizer
) -> ezpz.History:
    unwrapped_model = (
        model.module
        if isinstance(model, torch.nn.parallel.DistributedDataParallel)
        else model
    )
    history = ezpz.History()
    device_type = ezpz.get_torch_device_type()
    dtype = unwrapped_model.layers[0].weight.dtype
    bsize = int(os.environ.get("BATCH_SIZE", 64))
    isize = unwrapped_model.layers[0].in_features
    warmup = int(os.environ.get("WARMUP_ITERS", 10))
    log_freq = int(os.environ.get("LOG_FREQ", 1))
    model.train()
    for step in range(int(os.environ.get("TRAIN_ITERS", 500))):
        with torch.autocast(
            device_type=device_type,
            dtype=dtype,
        ):
            t0 = time.perf_counter()
            x = torch.rand((bsize, isize), dtype=dtype).to(device_type)
            y = model(x)
            loss = ((y - x) ** 2).sum()
            dtf = (t1 := time.perf_counter()) - t0
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            dtb = time.perf_counter() - t1
            if step % log_freq == 0 and step > warmup:
                logger.info(
                    history.update(
                        {
                            "iter": step,
                            "loss": loss.item(),
                            "dt": dtf + dtb,
                            "dtf": dtf,
                            "dtb": dtb,
                        }
                    )
                )
    return history


@ezpz.timeitlogit(rank=ezpz.get_rank())
def setup():
    rank = ezpz.setup_torch()
    if os.environ.get("WANDB_DISABLED", False):
        logger.info("WANDB_DISABLED is set, not initializing wandb")
    elif rank == 0:
        try:
            _ = ezpz.setup_wandb(
                project_name=os.environ.get(
                    "PROJECT_NAME", "ezpz.examples.minimal"
                )
            )
        except Exception:
            logger.exception(
                "Failed to initialize wandb, continuing without it"
            )
    device_type = ezpz.get_torch_device_type()
    model = Network(
        input_dim=int(os.environ.get("INPUT_SIZE", 128)),
        output_dim=int(os.environ.get("OUTPUT_SIZE", 128)),
        sizes=[
            int(x)
            for x in os.environ.get("LAYER_SIZES", "1024,512,256,128").split(
                ","
            )
        ],
    )
    model.to(device_type)
    model.to(getattr(torch, os.environ.get("DTYPE", "bfloat16")))
    logger.info(f"{model=}")
    optimizer = torch.optim.Adam(model.parameters())
    if ezpz.get_world_size() > 1:
        from torch.nn.parallel import DistributedDataParallel as DDP

        model = DDP(model, device_ids=[ezpz.get_local_rank()])

    return model, optimizer


def main():
    model, optimizer = setup()
    history = train(model, optimizer)
    if ezpz.get_rank() == 0:
        dataset = history.finalize()
        logger.info(f"{dataset=}")


if __name__ == "__main__":
    main()

πŸƒβ€β™‚οΈ Running the Minimal Example

To run the previous example we:

  1. Source the ezpz utils script:

    source <(curl -L https://bit.ly/ezpz-utils)
  2. Set up our environment:

    ezpz_setup_env
  3. Run the example:

    ezpz-launch -m ezpz.examples.minimal

πŸ“ ezpz-test

  • ezpz-test is a simple test script that trains a small model using DDP across all available GPUs

    • It will automatically detect the number of GPUs and launch an appropriate mpiexec command to run the training script across all of them
  • See: ezpz/test.py

  • Command:

    ezpz-test

🦜 Generate Text

  • See: ezpz/generate.py

  • Command:

    python3 -m ezpz.generate --model_name meta-llama/Llama-3.1-8B
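
Under the hood this is essentially standard Hugging Face text generation on the ezpz-detected device. A rough, hypothetical sketch of equivalent code (the actual ezpz.generate implementation may differ):

    import ezpz
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = ezpz.get_torch_device_type()
    model_name = "meta-llama/Llama-3.1-8B"

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16
    ).to(device)

    inputs = tokenizer("Aurora is", return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))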

πŸ€— Huggingface Trainer

  • See ezpz/hf_trainer.py

  • Command:

    ezpz-launch -m ezpz.hf_trainer \
        --dataset_name=eliplutchok/fineweb-small-sample \
        --streaming \
        --model_name_or_path=meta-llama/Llama-3.2-1B \
        --bf16=true \
        --do_train=true \
        --do_eval=true \
        --report-to=wandb \
        --logging-steps=1 \
        --include-tokens-per-second=true \
        --block-size=128 \
        --max-steps=10 \
        --include-num-input-tokens-seen=true \
        --auto_find_batch_size=true \
        --gradient_checkpointing=true \
        --optim=adamw_torch \
        --overwrite-output-dir=true \
        --logging-first-step \
        --include-for-metrics='inputs,loss' \
        --max-eval-samples=50 \
        --ddp-backend=ccl

🏎️ Megatron-DeepSpeed

git clone https://github.com/argonne-lcf/Megatron-DeepSpeed
cd Megatron-DeepSpeed
source <(curl -L https://bit.ly/ezpz-utils)
python3 -m pip install \
    deepspeed \
    "git+https://github.com/saforem2/ezpz"
bash train_alcf.sh

πŸ™Œ Acknowledgements

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

samforeman.me/talks/incite-hackathon-2025/ezpz/slides
