⏰ Starting Up Distributed Training on Aurora
Response
In Measuring / Calculating Startup Time, I provide a summary of how the startup time is identified and calculated.
I'm not sure I understand exactly.
Will the measurement methodology be the same for distributed training? For example, can we measure the startup time on rank 0?
The startup time is being measured for distributed training (logs are only created on RANK = 0).
I discuss in Minimal Working Example a minimal example that can be used to measure the startup times.
- This uses a library I've been working on, ezpz, that is designed to simplify setting up and initializing distributed training across many GPUs.
Measuring / Calculating Startup Time
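The startup time was identified by parsing the logfiles from existing runs and calculating the difference $\delta t = t_{1} - t_{0}$, where:
- $t_{0}$ is the timestamp at the very beginning of the shell script (defined here) which then launches
  - $t_{0}$ appears in the logfile as, for example: Job started at: 2023-11-02-183323 on x3004c0s13b0n0
- $t_{1}$ is the timestamp associated with the completion of the first training step
  - $t_{1}$ appears in the logfile as, for example: [2023-11-02 18:34:13,122] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]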
Below is an example of the shell loop used to parse the logfiles and identify these timestamps, along with its output:
$ for f in $(tail -5 logfiles) ; do echo $f; cat $f | grep -E "Job started|step=0\," | uniq ; echo "\n" ; done

/lus/grand/projects/datascience/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/outputs/gpt_actCkpt_GPT1T_4L_z1_seqlen2048_mp8_pp2_sp1_nl4_hs25600_gb16_mb1/logs/foremans-x3004c0s13b0n0-nhosts4-ngpu16-2023-11-02-183323.log
Job started at: 2023-11-02-183323 on x3004c0s13b0n0
[2023-11-02 18:34:13,122] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]

/lus/grand/projects/datascience/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/outputs/gpt_SP_actCkpt_GPT125M_z0_seqlen2048_mp16_pp1_sp1_nl12_hs768_gb1_mb1/logs/foremans-x3015c0s37b0n0-nhosts4-ngpu16-2023-11-02-184240.log
Job started at: 2023-11-02-184240 on x3015c0s37b0n0

/lus/grand/projects/datascience/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/outputs/gpt_SP_actCkpt_GPT125M_z0_seqlen2048_mp16_pp1_sp1_nl12_hs768_gb1_mb1/logs/foremans-x3015c0s37b0n0-nhosts4-ngpu16-2023-11-02-184259.log
Job started at: 2023-11-02-184259 on x3015c0s37b0n0
[2023-11-02 18:43:23,385] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]

/lus/grand/projects/datascience/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/outputs/gpt_SP_actCkpt_GPT125M_z0_seqlen2048_mp16_pp1_sp1_nl12_hs768_gb1_mb1/logs/foremans-x3004c0s13b0n0-nhosts4-ngpu16-2023-11-02-184407.log
Job started at: 2023-11-02-184407 on x3004c0s13b0n0
[2023-11-02 18:44:32,804] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]

/lus/grand/projects/datascience/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/outputs/gpt_actCkpt_GPT1T_4L_z1_seqlen2048_mp8_pp2_sp1_nl4_hs25600_gb16_mb2/logs/foremans-x3108c0s25b1n0-nhosts2-ngpu8-2023-11-02-192739.log
Job started at: 2023-11-02-192739 on x3108c0s25b1n0
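The two timestamps pulled out by the loop above can be turned into a startup time programmatically. The following is a minimal Python sketch, not the script used for the measurements: the function name `startup_seconds` and the regular expressions are assumptions based only on the two line formats visible in the output above, and it assumes both timestamps are written in the same timezone.

```python
import re
from datetime import datetime

# Patterns inferred from the log excerpts above (assumptions, not an official format).
START_RE = re.compile(r"Job started at: (\d{4}-\d{2}-\d{2}-\d{6})")
STEP0_RE = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\].*step=0,")


def startup_seconds(log_text):
    """Return t1 - t0 in seconds, or None if either timestamp is missing."""
    start = START_RE.search(log_text)
    step0 = STEP0_RE.search(log_text)
    if start is None or step0 is None:
        return None  # e.g. a run that never completed its first step
    t0 = datetime.strptime(start.group(1), "%Y-%m-%d-%H%M%S")
    t1 = datetime.strptime(step0.group(1), "%Y-%m-%d %H:%M:%S,%f")
    return (t1 - t0).total_seconds()


# The two lines printed for the first logfile in the output above.
sample = (
    "Job started at: 2023-11-02-183323 on x3004c0s13b0n0\n"
    "[2023-11-02 18:34:13,122] [INFO] [logging.py:96:log_dist] [Rank 0] "
    "step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]\n"
)
print(startup_seconds(sample))  # 50.122
```

Logfiles that contain a "Job started" line but no `step=0` line (the second and fifth files above) return `None` rather than a misleading number.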
Minimal Working Example
As for 3:
If we need to report the startup time for the DL applications, do we need to collect measurements using the actual Aurora NRE workloads or some small benchmarking test cases? For example, we can try to recreate the typical start-up scenarios, like library imports, and measure those separately as shown below.
I've been working on a library to help simplify this:
ezpz: a minimal library that handles the initialization of distributed training
- Setup / Install:
# launch job
$ qsub -q EarlyAppAccess -A Aurora_Deployment -l walltime=2:00:00 -l select=4 -I
# load frameworks
$ module use -a /soft/modulefiles ; module --ignore_cache load frameworks
$ module load frameworks/.2023.12.15.001
# install `ezpz`
$ git clone https://github.com/saforem2/ezpz
$ cd ezpz
$ mkdir -p venvs/aurora/2023.12.15.001
$ python3 -m venv venvs/aurora/2023.12.15.001 --system-site-packages
$ source venvs/aurora/2023.12.15.001/bin/activate
$ python3 -m pip install -e .
# print job info and define `launch` alias
$ source ezpz/src/ezpz/bin/savejobenv
┌──────────────────────────────────────────────────────────────────
│ [Hosts]:
│   • x4415c6s5b0n0.hostmgmt2415.cm.aurora.alcf.anl.gov
│     x4415c6s6b0n0.hostmgmt2415.cm.aurora.alcf.anl.gov
│     x4415c6s7b0n0.hostmgmt2415.cm.aurora.alcf.anl.gov
│     x4415c7s0b0n0.hostmgmt2415.cm.aurora.alcf.anl.gov
└──────────────────────────────────────────────────────────────────
┌──────────────────────────────────────────────────────────────────
│ [DIST INFO]:
│   • Loading job env from: /home/foremans/.pbsenv
│   • HOSTFILE: /var/spool/pbs/aux/297306.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
│   • NHOSTS: 4
│   • NGPU_PER_HOST: 12
│   • NGPUS (NHOSTS x NGPU_PER_HOST): 48
│   • DIST_LAUNCH: mpiexec --verbose --envall -n 48 -ppn 12 --hostfile /var/spool/pbs/aux/297306.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
│   • Defining alias: launch: aliased to mpiexec --verbose --envall -n 48 -ppn 12 --hostfile /var/spool/pbs/aux/297306.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
└──────────────────────────────────────────────────────────────────
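The [DIST INFO] block above derives the launch command from the PBS hostfile: NGPUS = NHOSTS x NGPU_PER_HOST, which is then passed to mpiexec as -n and -ppn. As a rough illustration of that arithmetic, here is a Python sketch. It is not how savejobenv is implemented; the function name `build_launch_cmd` and the default of 12 GPUs per host are assumptions taken from the output shown above.

```python
import os
from pathlib import Path


def build_launch_cmd(hostfile, ngpu_per_host=12):
    """Build an mpiexec command that starts one process per GPU.

    Hypothetical helper: it mirrors the DIST_LAUNCH line printed above.
    """
    lines = Path(hostfile).read_text().splitlines()
    # One hostname per line; ignore blanks and repeated entries.
    hosts = {line.strip() for line in lines if line.strip()}
    nhosts = len(hosts)
    ngpus = nhosts * ngpu_per_host
    return (
        f"mpiexec --verbose --envall -n {ngpus} "
        f"-ppn {ngpu_per_host} --hostfile {hostfile}"
    )


if __name__ == "__main__":
    # PBS exports the hostfile path as PBS_NODEFILE inside a job.
    nodefile = os.environ.get("PBS_NODEFILE")
    if nodefile:
        print(build_launch_cmd(nodefile))
```

With the four-host hostfile from the session above, this would produce the same `-n 48 -ppn 12` command that the `launch` alias expands to.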
- Launch with framework=pytorch, backend=DDP:
# ----------------------------------------------------------
# launch + startup on all workers with
# • `framework` ∈ {`pytorch`, `tensorflow`}
# • `backend` ∈ {`horovod`, `deepspeed`, `DDP`}
# where `deepspeed` and `DDP` only available for `pytorch`
# ----------------------------------------------------------
$ launch python3 -m ezpz framework=pytorch backend=DDP
[2023-12-19 13:33:24][INFO][dist.py:292] - Using device='xpu'
[2023-12-19 13:33:26][INFO][dist.py:243] - Using DDP for distributed training
[2023-12-19 13:33:26][WARNING][dist.py:104] - Using backend='ccl'
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 1 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 2 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 3 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 4 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 0 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 5 / 47
[2023-12-19 13:33:35][INFO][__main__.py:49] - {
    "_target_": "ezpz.configs.TrainConfig",
    "framework": "pytorch",
    "backend": "DDP",
    "ds_config_path": null,
    "port": null,
    "seed": null,
    "use_wandb": true,
    "wandb_project_name": null,
    "precision": null,
    "ngpus": null
}
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 9 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 10 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 11 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 7 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 8 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 6 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 12 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 13 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 14 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 15 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 18 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 19 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 20 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 21 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 22 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 23 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 24 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 25 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 26 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 27 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 30 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 16 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 17 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 28 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 32 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 33 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 36 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 37 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 38 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 39 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 43 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 46 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 29 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 47 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 31 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 34 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 35 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 42 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 41 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 44 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 45 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 40 / 47
[2023-12-19 13:33:47][INFO][dist.py:415] - Setting up wandb from rank: 0
[2023-12-19 13:33:47][INFO][dist.py:416] - Using: WB PROJECT: ezpz
[2023-12-19 13:33:58][INFO][dist.py:448] - W&B RUN: [flowing-wood-8](https://wandb.ai/l2hmc-qcd/ezpz/runs/uya29gm5)
[2023-12-19 13:33:58][INFO][dist.py:490] - Running on x4415c6s5b0n0.hostmgmt2415.cm.aurora.alcf.anl.gov
[2023-12-19 13:33:58][INFO][dist.py:506] - Reading hosts from /var/spool/pbs/aux/297306.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
[2023-12-19 13:33:58][INFO][__main__.py:57] - Output dir: /lus/gecko/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/src/ezpz/outputs/runs/pytorch/DDP/2023-12-19/13-33-17
[2023-12-19 13:33:58][CRITICAL][dist.py:519] - 🚀 flowing-wood-8
[2023-12-19 13:33:58][CRITICAL][dist.py:520] - 🔗 https://wandb.ai/l2hmc-qcd/ezpz/runs/uya29gm5
[2023-12-19 13:33:58][CRITICAL][dist.py:521] - 📂/: /lus/gecko/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/src/ezpz/outputs/runs/pytorch/DDP/2023-12-19/13-33-17/wandb/run-20231219_133354-uya29gm5/files
[2023-12-19 13:33:58][INFO][dist.py:563] - Adding /lus/gecko/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/src/ezpz/ezpz-pt-DDP-xpu.log to W&B artifact...
[2023-12-19 13:33:58][INFO][dist.py:563] - Adding /lus/gecko/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/src/ezpz/outputs/runs/pytorch/DDP/2023-12-19/13-33-17/__main__.log to W&B artifact...
[2023-12-19 13:33:58][INFO][dist.py:563] - Adding /lus/gecko/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/src/ezpz/outputs/runs/pytorch/DDP/2023-12-19/13-33-17/main_debug.log to W&B artifact...
[2023-12-19 13:33:58][INFO][dist.py:563] - Adding /lus/gecko/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/src/ezpz/outputs/runs/pytorch/DDP/2023-12-19/13-33-16/__main__.log to W&B artifact...
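A log like the one above can also be summarized after the fact to check that every rank reported in and to see how long the initialization took. The Python sketch below is an illustration only: the function name `summarize` and the regular expressions are assumptions based on the line format visible in this log, and the world size is inferred from the "RANK: i / 47" lines as 47 + 1. Note that the elapsed time here spans the first to the last log line, which is not the same quantity as the t_{1} - t_{0} defined earlier, since the shell-script start time does not appear in this log.

```python
import re
from datetime import datetime

# Line format inferred from the log above: [timestamp][LEVEL][file:line] - message
LINE_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\[(\w+)\]\[[^\]]+\] - (.*)$"
)
RANK_RE = re.compile(r"RANK: (\d+) / (\d+)")


def summarize(log_text):
    """Return elapsed seconds between first and last log line, plus rank counts."""
    stamps = []
    ranks = set()
    world_size = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # continuation lines (e.g. the JSON config) carry no timestamp
        stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
        rank = RANK_RE.search(match.group(3))
        if rank:
            ranks.add(int(rank.group(1)))
            world_size = int(rank.group(2)) + 1  # ranks are numbered from 0
    elapsed = (max(stamps) - min(stamps)).total_seconds() if stamps else None
    return {"elapsed_s": elapsed, "ranks_seen": len(ranks), "world_size": world_size}


# A few lines copied from the log above.
sample = """\
[2023-12-19 13:33:24][INFO][dist.py:292] - Using device='xpu'
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 1 / 47
[2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 0 / 47
[2023-12-19 13:33:58][INFO][dist.py:490] - Running on x4415c6s5b0n0.hostmgmt2415.cm.aurora.alcf.anl.gov
"""
print(summarize(sample))
```

Comparing `ranks_seen` against `world_size` on the full log is a quick way to spot a rank that never finished initializing.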
Citation
@online{foreman2024,
author = {Foreman, Sam},
title = {⏰ {Starting} {Up} {Distributed} {Training} on {Aurora}},
date = {2024-03-21},
url = {https://samforeman.me/posts/AuroraGPT/startup-times/},
langid = {en}
}