๐ ezpz
@ ALCF
Work smarter, not harder.
๐ฃ Getting Started
There are two main, distinct components of ezpz
:
- ๐ Python Library
- ๐ Shell Utilities
๐ Shell Utilities
The Shell Utilities can be roughly broken up further into two main components
We provide a variety of helper functions designed to make your life easier when working with job schedulers (e.g. PBS Pro
@ ALCF or slurm
elsewhere).
All of these functions are:
To use these, we can source
the file directly via:
export PBS_O_WORKDIR=$(pwd) # if on ALCF
source /dev/stdin <<< $(curl 'https://raw.githubusercontent.com/saforem2/ezpz/refs/heads/main/src/ezpz/bin/utils.sh')
โ๏ธ Setup Environment
We would like to write our application in such a way that it is able to take full advantage of the resources allocated by the job scheduler.
That is to say, we want to have a single script with the ability to dynamically launch
python applications across any number of accelerators on any of the systems under consideration.
In order to do this, there is some basic setup and information gathering that needs to occur.
In particular, we need mechanisms for:
- Setting up a python environment
- Determining what system / machine weโre on
- + what job scheduler weโre using (e.g.
PBS Pro
@ ALCF orslurm
elsewhere)
- + what job scheduler weโre using (e.g.
- Determining how many nodes have been allocated in the current job (
NHOSTS
)- + Determining how many accelerators exist on each of these nodes (
NGPU_PER_HOST
)
- + Determining how many accelerators exist on each of these nodes (
This allows us to calculate the total number of accelerators (GPUs) as: N_{\mathrm{GPU}} = N_{\mathrm{HOST}} \times n_{\mathrm{GPU}}
where n_{\mathrm{GPU}} = N_{\mathrm{GPU}} / N_{\mathrm{HOST}} is the number of GPUs per host.
With this we have everything we need to build the appropriate {mpi
{run
, exec
}, slurm
} command for launching our python application across them.
Now, there are a few functions in particular worth elaborating on.
Function | Description | |
---|---|---|
Setup Environment | ezpz_setup_env |
Wrapper around ezpz_setup_python && ezpz_setup_job |
Setup Job | ezpz_setup_job |
Determine {NGPUS , NGPU_PER_HOST , NHOSTS }, build launch command alias |
Setup Python | ezpz_setup_python |
Load python modules and activate virtual environment |
Setup Conda | ezpz_setup_conda |
Identify appropriate conda module to load |
Setup Virtual Environment | ezpz_setup_venv_from_conda |
From ${CONDA_NAME} , build or activate the virtual env located in venvs/${CONDA_NAME}/ |
Some of the ezpz_*
functions (e.g. ezpz_setup_python
), will try to create / look for certain directories.
In an effort to be explicit, these directories will be defined relative to a WORKING_DIR
(e.g. "${WORKING_DIR}/venvs/"
)
This WORKING_DIR
will be assigned to the first non-zero match found below:
PBS_O_WORKDIR
: If found in environment, paths will be relative to thisSLURM_SUBMIT_DIR
: Next in line. If not @ ALCF, maybe usingslurm
โฆ$(pwd)
: Otherwise, no worries. Use your actual working directory.
๐ Example
Clone repo:
Setup environment:
Output
$ export PBS_O_WORKDIR=$(pwd) && source src/ezpz/bin/utils.sh && ezpz_setup_env Using WORKING_DIR: /gila/Aurora_deployment/foremans/projects/saforem2/ezpz No conda_prefix OR virtual_env found in environment... Setting up conda... Due to MODULEPATH changes, the following have been reloaded: 1) mpich/icc-all-pmix-gpu/20240717 The following have been reloaded with a version change: 1) oneapi/eng-compiler/2024.07.30.002 => oneapi/release/2024.2.1 Found conda at: /opt/aurora/24.180.1/frameworks/aurora_nre_models_frameworks-2024.2.1_u1 No VIRTUAL_ENV found in environment! - Trying to setup from /opt/aurora/24.180.1/frameworks/aurora_nre_models_frameworks-2024.2.1_u1 - Using VENV_DIR=/gila/Aurora_deployment/foremans/projects/saforem2/ezpz/venvs/aurora_nre_models_frameworks-2024.2.1_u1 - Creating a new virtual env on top of aurora_nre_models_frameworks-2024.2.1_u1 in /gila/Aurora_deployment/foremans/projects/saforem2/ezpz/venvs/aurora_nre_models_frameworks-2024.2.1_u1 [python] Using /gila/Aurora_deployment/foremans/projects/saforem2/ezpz/venvs/aurora_nre_models_frameworks-2024.2.1_u1/bin/python3 [๐ ezpz/bin/utils.sh] โข USER=foremans โข MACHINE=sunspot โข HOST=x1922c5s0b0n0 โข TSTAMP=2024-11-28-133756 [ezpz_setup_host_pbs] โข Using hostfile: /var/spool/pbs/aux/10283088.amn-0001 โข Found in environment: โข HOSTFILE: /var/spool/pbs/aux/10283088.amn-0001 โข Writing PBS vars to: /home/foremans/.pbsenv [ezpz_save_pbs_env] โข Setting: โข HOSTFILE: /var/spool/pbs/aux/10283088.amn-0001 โข JOBENV_FILE: /home/foremans/.pbsenv [HOSTS] โข [host:0] - x1922c5s0b0n0.hostmgmt2001.cm.americas.sgi.com โข [host:1] - x1922c5s2b0n0.hostmgmt2001.cm.americas.sgi.com โข [host:2] - x1922c5s4b0n0.hostmgmt2001.cm.americas.sgi.com โข [host:3] - x1922c6s1b0n0.hostmgmt2001.cm.americas.sgi.com [DIST INFO] โข NGPUS=48 โข NHOSTS=4 โข NGPU_PER_HOST=12 โข HOSTFILE=/var/spool/pbs/aux/10283088.amn-0001 โข DIST_LAUNCH=mpiexec --verbose --envall -n 48 -ppn 12 --hostfile /var/spool/pbs/aux/10283088.amn-0001 --cpu-bind depth -d 8 [LAUNCH]: โข To launch across all available GPUs, use: launch launch = mpiexec --verbose --envall -n 48 -ppn 12 --hostfile /var/spool/pbs/aux/10283088.amn-0001 --cpu-bind depth -d 8 took: 0h:00m:12s
Install
ezpz
:Check
launch
:Run
ezpz.test_dist
:Output
#[๐ aurora_nre_models_frameworks-2023.2.1_u1](๐ป aurora_nre_models_frameworks-2024.2.1_u1) #[๐ป][01:42:37 PM][foremans@x1922c5s0b0n0][โฆ/ezpz][๐ฑ saforem2-patch-1]via โจ v1.4.552 $ launch python3 -m ezpz.test_dist Disabling local launch: multi-node application Connected to tcp://x1922c5s0b0n0.hostmgmt2001.cm.americas.sgi.com:7919 Found executable /gila/Aurora_deployment/foremans/projects/saforem2/ezpz/venvs/aurora_nre_models_frameworks-2024.2.1_u1/bin/python3 Launching application ecd9868b-2b3b-4a0e-b2e6-79769ecc7eff [2024-11-28 13:43:41.385960][INFO][dist.py:348] - [device='xpu'][rank=3/47][local_rank=3/11][node=3/3] [2024-11-28 13:43:41.386603][INFO][dist.py:348] - [device='xpu'][rank=1/47][local_rank=1/11][node=1/3] [2024-11-28 13:43:41.386605][INFO][dist.py:348] - [device='xpu'][rank=2/47][local_rank=2/11][node=2/3] [2024-11-28 13:43:41.386834][INFO][dist.py:348] - [device='xpu'][rank=6/47][local_rank=6/11][node=2/3] [2024-11-28 13:43:41.387707][INFO][dist.py:348] - [device='xpu'][rank=8/47][local_rank=8/11][node=0/3] [2024-11-28 13:43:41.387290][INFO][dist.py:348] - [device='xpu'][rank=4/47][local_rank=4/11][node=0/3] [2024-11-28 13:43:41.387235][INFO][dist.py:348] - [device='xpu'][rank=10/47][local_rank=10/11][node=2/3] [2024-11-28 13:43:41.387362][INFO][dist.py:348] - [device='xpu'][rank=11/47][local_rank=11/11][node=3/3] [2024-11-28 13:43:41.387761][INFO][dist.py:348] - [device='xpu'][rank=5/47][local_rank=5/11][node=1/3] [2024-11-28 13:43:41.387958][INFO][dist.py:348] - [device='xpu'][rank=9/47][local_rank=9/11][node=1/3] [2024-11-28 13:43:46.384505][INFO][dist.py:348] - [device='xpu'][rank=7/47][local_rank=7/11][node=3/3] [2024-11-28 13:44:32.816252][INFO][dist.py:348] - [device='xpu'][rank=15/47][local_rank=3/11][node=3/3] [2024-11-28 13:44:32.821175][INFO][dist.py:348] - [device='xpu'][rank=17/47][local_rank=5/11][node=1/3] [2024-11-28 13:44:32.824021][INFO][dist.py:348] - [device='xpu'][rank=22/47][local_rank=10/11][node=2/3] [2024-11-28 13:44:32.825905][INFO][dist.py:348] - [device='xpu'][rank=41/47][local_rank=5/11][node=1/3] [2024-11-28 13:44:32.826590][INFO][dist.py:348] - [device='xpu'][rank=38/47][local_rank=2/11][node=2/3] [2024-11-28 13:44:32.838048][INFO][dist.py:348] - [device='xpu'][rank=23/47][local_rank=11/11][node=3/3] [2024-11-28 13:44:32.838526][INFO][dist.py:348] - [device='xpu'][rank=21/47][local_rank=9/11][node=1/3] [2024-11-28 13:44:32.838825][INFO][dist.py:348] - [device='xpu'][rank=19/47][local_rank=7/11][node=3/3] [2024-11-28 13:44:32.838817][INFO][dist.py:348] - [device='xpu'][rank=36/47][local_rank=0/11][node=0/3] [2024-11-28 13:44:32.838665][INFO][dist.py:348] - [device='xpu'][rank=35/47][local_rank=11/11][node=3/3] [2024-11-28 13:44:32.839033][INFO][dist.py:348] - [device='xpu'][rank=25/47][local_rank=1/11][node=1/3] [2024-11-28 13:44:32.838855][INFO][dist.py:348] - [device='xpu'][rank=26/47][local_rank=2/11][node=2/3] [2024-11-28 13:44:32.839144][INFO][dist.py:348] - [device='xpu'][rank=33/47][local_rank=9/11][node=1/3] [2024-11-28 13:44:32.840785][INFO][dist.py:348] - [device='xpu'][rank=37/47][local_rank=1/11][node=1/3] [2024-11-28 13:44:32.840740][INFO][dist.py:348] - [device='xpu'][rank=47/47][local_rank=11/11][node=3/3] [2024-11-28 13:44:32.844721][INFO][dist.py:348] - [device='xpu'][rank=46/47][local_rank=10/11][node=2/3] [2024-11-28 13:44:32.845202][INFO][dist.py:348] - [device='xpu'][rank=30/47][local_rank=6/11][node=2/3] [2024-11-28 13:44:32.845888][INFO][dist.py:348] - [device='xpu'][rank=18/47][local_rank=6/11][node=2/3] [2024-11-28 13:44:32.849905][INFO][dist.py:348] - [device='xpu'][rank=20/47][local_rank=8/11][node=0/3] [2024-11-28 13:44:32.849947][INFO][dist.py:348] - [device='xpu'][rank=32/47][local_rank=8/11][node=0/3] [2024-11-28 13:44:32.850232][INFO][dist.py:348] - [device='xpu'][rank=39/47][local_rank=3/11][node=3/3] [2024-11-28 13:44:32.850301][INFO][dist.py:348] - [device='xpu'][rank=44/47][local_rank=8/11][node=0/3] [2024-11-28 13:44:32.850795][INFO][dist.py:348] - [device='xpu'][rank=14/47][local_rank=2/11][node=2/3] [2024-11-28 13:44:32.851872][INFO][dist.py:348] - [device='xpu'][rank=28/47][local_rank=4/11][node=0/3] [2024-11-28 13:44:32.852078][INFO][dist.py:348] - [device='xpu'][rank=12/47][local_rank=0/11][node=0/3] [2024-11-28 13:44:32.853836][INFO][dist.py:348] - [device='xpu'][rank=16/47][local_rank=4/11][node=0/3] [2024-11-28 13:44:32.853997][INFO][dist.py:348] - [device='xpu'][rank=24/47][local_rank=0/11][node=0/3] [2024-11-28 13:44:32.855526][INFO][dist.py:348] - [device='xpu'][rank=13/47][local_rank=1/11][node=1/3] [2024-11-28 13:44:32.856609][INFO][dist.py:348] - [device='xpu'][rank=40/47][local_rank=4/11][node=0/3] [2024-11-28 13:44:32.857991][INFO][dist.py:348] - [device='xpu'][rank=42/47][local_rank=6/11][node=2/3] [2024-11-28 13:44:32.863239][INFO][dist.py:348] - [device='xpu'][rank=29/47][local_rank=5/11][node=1/3] [2024-11-28 13:44:32.864956][INFO][dist.py:348] - [device='xpu'][rank=45/47][local_rank=9/11][node=1/3] [2024-11-28 13:44:32.867388][INFO][dist.py:348] - [device='xpu'][rank=27/47][local_rank=3/11][node=3/3] [2024-11-28 13:44:32.867764][INFO][dist.py:348] - [device='xpu'][rank=43/47][local_rank=7/11][node=3/3] [2024-11-28 13:44:32.873106][INFO][dist.py:348] - [device='xpu'][rank=34/47][local_rank=10/11][node=2/3] [2024-11-28 13:44:32.877043][INFO][dist.py:348] - [device='xpu'][rank=31/47][local_rank=7/11][node=3/3] [2024-11-28 13:44:32.887609][INFO][dist.py:92] - [dist_info]: โข DEVICE=xpu โข DEVICE_ID=xpu:0 โข DISTRIBUTED_BACKEND=ccl โข GPUS_PER_NODE=12 โข HOSTS=['x1922c5s0b0n0.hostmgmt2001.cm.americas.sgi.com', 'x1922c5s2b0n0.hostmgmt2001.cm.americas.sgi.com', 'x1922c5s4b0n0.hostmgmt2001.cm.americas.sgi.com', 'x1922c6s1b0n0.hostmgmt2001.cm.americas.sgi.com'] โข HOSTFILE=/var/spool/pbs/aux/10283088.amn-0001 โข HOSTNAME=x1922c5s0b0n0.hostmgmt2001.cm.americas.sgi.com โข LOCAL_RANK=0 โข MACHINE=SunSpot โข NUM_NODES=4 โข NGPUS=48 โข NGPUS_AVAILABLE=48 โข NODE_ID=0 โข RANK=0 โข SCHEDULER=PBS โข WORLD_SIZE_TOTAL=48 โข WORLD_SIZE_IN_USE=48 โข LAUNCH_CMD=mpiexec --verbose --envall -n 48 -ppn 12 --hostfile /var/spool/pbs/aux/10283088.amn-0001 --cpu-bind depth -d 16 [2024-11-28 13:44:32.933415][INFO][dist.py:725] - Using oneccl_bindings from: /opt/aurora/24.180.1/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages/oneccl_bindings_for_pytorch/__init__.py [2024-11-28 13:44:32.933861][INFO][dist.py:727] - Using ipex from: /opt/aurora/24.180.1/frameworks/aurora_nre_models_frameworks-2024.2.1_u1/lib/python3.10/site-packages/intel_extension_for_pytorch/__init__.py [2024-11-28 13:44:32.934238][INFO][dist.py:728] - [0/48] Using device='xpu' with backend='DDP' + 'ccl' for distributed training. [2024-11-28 13:44:32.940188][INFO][dist.py:348] - [device='xpu'][rank=0/47][local_rank=0/11][node=0/3] [2024-11-28 13:44:32.940746][WARNING][_logger.py:68] - Using [48 / 48] available "xpu" devices !! [2024-11-28 13:44:34.193788][INFO][dist.py:882] - Setting up wandb from rank: 0 [2024-11-28 13:44:34.194312][INFO][dist.py:883] - Using: WB PROJECT: ezpz.test_dist wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information. wandb: Currently logged in as: foremans (aurora_gpt). Use `wandb login --relogin` to force relogin wandb: Tracking run with wandb version 0.18.7 wandb: Run data is saved locally in /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/wandb/run-20241128_134434-4mwyy84l wandb: Run `wandb offline` to turn off syncing. wandb: Syncing run driven-wind-683 wandb: โญ๏ธ View project at https://wandb.ai/aurora_gpt/ezpz.test_dist wandb: ๐ View run at https://wandb.ai/aurora_gpt/ezpz.test_dist/runs/4mwyy84l [2024-11-28 13:44:35.481364][INFO][dist.py:908] - W&B RUN: [driven-wind-683](https://wandb.ai/aurora_gpt/ezpz.test_dist/runs/4mwyy84l) [2024-11-28 13:44:35.495959][INFO][dist.py:304] - Updating wandb.run: driven-wind-683 config with "DIST_INFO" [2024-11-28 13:44:35.500222][INFO][dist.py:936] - Running on machine='SunSpot' [2024-11-28 13:44:35.501304][INFO][dist.py:92] - [CONFIG]: โข warmup=0 โข log_freq=1 โข batch_size=64 โข input_size=128 โข output_size=128 โข dtype=torch.float32 โข device=xpu โข world_size=48 โข train_iters=100 [2024-11-28 13:44:35.575161][INFO][test_dist.py:147] - model=Network( (layers): Sequential( (0): Linear(in_features=128, out_features=1024, bias=True) (1): Linear(in_features=1024, out_features=512, bias=True) (2): Linear(in_features=512, out_features=256, bias=True) (3): Linear(in_features=256, out_features=128, bias=True) (4): Linear(in_features=128, out_features=128, bias=True) ) ) [2024-11-28 13:44:48.340596][INFO][test_dist.py:228] - iter=1 dt=0.019713 dtf=0.002959 dtb=0.016754 loss=1821.131958 sps=3246.551823 [2024-11-28 13:44:48.346527][INFO][test_dist.py:228] - iter=2 dt=0.004083 dtf=0.000822 dtb=0.003261 loss=1348.032227 sps=15673.877930 [2024-11-28 13:44:48.351387][INFO][test_dist.py:228] - iter=3 dt=0.003260 dtf=0.000735 dtb=0.002525 loss=1072.068359 sps=19634.479814 [2024-11-28 13:44:48.356043][INFO][test_dist.py:228] - iter=4 dt=0.003117 dtf=0.000716 dtb=0.002401 loss=928.493774 sps=20534.414826 [2024-11-28 13:44:48.360830][INFO][test_dist.py:228] - iter=5 dt=0.003181 dtf=0.000798 dtb=0.002384 loss=867.221558 sps=20116.568892 [2024-11-28 13:44:48.365608][INFO][test_dist.py:228] - iter=6 dt=0.003146 dtf=0.000738 dtb=0.002408 loss=794.671204 sps=20341.929281 [2024-11-28 13:44:48.370473][INFO][test_dist.py:228] - iter=7 dt=0.003149 dtf=0.000751 dtb=0.002398 loss=748.687500 sps=20324.551047 [2024-11-28 13:44:48.375126][INFO][test_dist.py:228] - iter=8 dt=0.003105 dtf=0.000719 dtb=0.002385 loss=720.116943 sps=20614.492256 [2024-11-28 13:44:48.379744][INFO][test_dist.py:228] - iter=9 dt=0.003057 dtf=0.000698 dtb=0.002359 loss=714.217957 sps=20937.174740 [2024-11-28 13:44:48.384429][INFO][test_dist.py:228] - iter=10 dt=0.003083 dtf=0.000703 dtb=0.002380 loss=718.780884 sps=20760.967373 [2024-11-28 13:44:48.390098][INFO][test_dist.py:228] - iter=11 dt=0.003931 dtf=0.000802 dtb=0.003129 loss=699.931152 sps=16281.262548 [2024-11-28 13:44:48.395638][INFO][test_dist.py:228] - iter=12 dt=0.003698 dtf=0.000865 dtb=0.002832 loss=687.315857 sps=17308.773026 [2024-11-28 13:44:48.400852][INFO][test_dist.py:228] - iter=13 dt=0.003502 dtf=0.000792 dtb=0.002709 loss=685.231995 sps=18276.152786 [2024-11-28 13:44:48.406145][INFO][test_dist.py:228] - iter=14 dt=0.003501 dtf=0.000788 dtb=0.002714 loss=678.643860 sps=18278.977230 [2024-11-28 13:44:48.411186][INFO][test_dist.py:228] - iter=15 dt=0.003366 dtf=0.000760 dtb=0.002606 loss=675.734375 sps=19012.740109 [2024-11-28 13:44:48.416092][INFO][test_dist.py:228] - iter=16 dt=0.003308 dtf=0.000764 dtb=0.002545 loss=656.803894 sps=19345.774997 [2024-11-28 13:44:48.421088][INFO][test_dist.py:228] - iter=17 dt=0.003326 dtf=0.000800 dtb=0.002526 loss=663.846558 sps=19240.551489 [2024-11-28 13:44:48.426305][INFO][test_dist.py:228] - iter=18 dt=0.003439 dtf=0.000805 dtb=0.002634 loss=646.870605 sps=18607.582468 [2024-11-28 13:44:48.431107][INFO][test_dist.py:228] - iter=19 dt=0.003217 dtf=0.000730 dtb=0.002488 loss=631.541504 sps=19893.098888 [2024-11-28 13:44:48.436171][INFO][test_dist.py:228] - iter=20 dt=0.003447 dtf=0.000707 dtb=0.002741 loss=645.819580 sps=18564.526433 [2024-11-28 13:44:48.441333][INFO][test_dist.py:228] - iter=21 dt=0.003465 dtf=0.000921 dtb=0.002543 loss=642.620361 sps=18470.919500 [2024-11-28 13:44:48.446923][INFO][test_dist.py:228] - iter=22 dt=0.003732 dtf=0.000822 dtb=0.002910 loss=639.229980 sps=17149.966074 [2024-11-28 13:44:48.451965][INFO][test_dist.py:228] - iter=23 dt=0.003386 dtf=0.000778 dtb=0.002608 loss=627.894897 sps=18902.966756 [2024-11-28 13:44:48.457006][INFO][test_dist.py:228] - iter=24 dt=0.003367 dtf=0.000807 dtb=0.002560 loss=604.797424 sps=19007.381386 [2024-11-28 13:44:48.461850][INFO][test_dist.py:228] - iter=25 dt=0.003164 dtf=0.000721 dtb=0.002443 loss=614.561523 sps=20229.778931 [2024-11-28 13:44:48.466983][INFO][test_dist.py:228] - iter=26 dt=0.003473 dtf=0.000759 dtb=0.002714 loss=617.550781 sps=18428.700435 [2024-11-28 13:44:48.471925][INFO][test_dist.py:228] - iter=27 dt=0.003289 dtf=0.000758 dtb=0.002531 loss=618.434082 sps=19457.441792 [2024-11-28 13:44:48.477516][INFO][test_dist.py:228] - iter=28 dt=0.003872 dtf=0.000930 dtb=0.002942 loss=607.261475 sps=16528.336617 [2024-11-28 13:44:48.482581][INFO][test_dist.py:228] - iter=29 dt=0.003365 dtf=0.000773 dtb=0.002591 loss=601.454590 sps=19021.673645 [2024-11-28 13:44:48.487622][INFO][test_dist.py:228] - iter=30 dt=0.003288 dtf=0.000825 dtb=0.002463 loss=594.649170 sps=19463.454229 [2024-11-28 13:44:48.492493][INFO][test_dist.py:228] - iter=31 dt=0.003211 dtf=0.000734 dtb=0.002477 loss=582.087036 sps=19933.062380 [2024-11-28 13:44:48.497510][INFO][test_dist.py:228] - iter=32 dt=0.003360 dtf=0.000851 dtb=0.002509 loss=582.850586 sps=19050.324061 [2024-11-28 13:44:48.502534][INFO][test_dist.py:228] - iter=33 dt=0.003299 dtf=0.000751 dtb=0.002547 loss=574.619019 sps=19402.524122 [2024-11-28 13:44:48.507516][INFO][test_dist.py:228] - iter=34 dt=0.003318 dtf=0.000843 dtb=0.002475 loss=571.530273 sps=19291.396984 [2024-11-28 13:44:48.512324][INFO][test_dist.py:228] - iter=35 dt=0.003162 dtf=0.000718 dtb=0.002444 loss=569.056335 sps=20239.766403 [2024-11-28 13:44:48.517314][INFO][test_dist.py:228] - iter=36 dt=0.003338 dtf=0.000803 dtb=0.002535 loss=570.773315 sps=19171.898987 [2024-11-28 13:44:48.522111][INFO][test_dist.py:228] - iter=37 dt=0.003111 dtf=0.000712 dtb=0.002399 loss=564.691101 sps=20571.738756 [2024-11-28 13:44:48.527333][INFO][test_dist.py:228] - iter=38 dt=0.003581 dtf=0.000698 dtb=0.002883 loss=555.157349 sps=17870.995302 [2024-11-28 13:44:48.532827][INFO][test_dist.py:228] - iter=39 dt=0.003596 dtf=0.000841 dtb=0.002755 loss=542.374756 sps=17796.934508 [2024-11-28 13:44:48.538391][INFO][test_dist.py:228] - iter=40 dt=0.003673 dtf=0.000872 dtb=0.002801 loss=538.616821 sps=17424.614226 [2024-11-28 13:44:48.543835][INFO][test_dist.py:228] - iter=41 dt=0.003606 dtf=0.000837 dtb=0.002769 loss=546.055054 sps=17746.977167 [2024-11-28 13:44:48.548728][INFO][test_dist.py:228] - iter=42 dt=0.003171 dtf=0.000786 dtb=0.002385 loss=541.741455 sps=20183.008810 [2024-11-28 13:44:48.553392][INFO][test_dist.py:228] - iter=43 dt=0.003091 dtf=0.000675 dtb=0.002415 loss=541.895630 sps=20707.731509 [2024-11-28 13:44:48.558674][INFO][test_dist.py:228] - iter=44 dt=0.003598 dtf=0.000797 dtb=0.002801 loss=532.636841 sps=17789.399613 [2024-11-28 13:44:48.563691][INFO][test_dist.py:228] - iter=45 dt=0.003305 dtf=0.000726 dtb=0.002579 loss=527.679077 sps=19363.041186 [2024-11-28 13:44:48.568927][INFO][test_dist.py:228] - iter=46 dt=0.003472 dtf=0.000776 dtb=0.002696 loss=519.220581 sps=18435.547756 [2024-11-28 13:44:48.573801][INFO][test_dist.py:228] - iter=47 dt=0.003203 dtf=0.000723 dtb=0.002480 loss=527.749268 sps=19982.521481 [2024-11-28 13:44:48.578761][INFO][test_dist.py:228] - iter=48 dt=0.003253 dtf=0.000782 dtb=0.002471 loss=524.344238 sps=19672.846752 [2024-11-28 13:44:48.583369][INFO][test_dist.py:228] - iter=49 dt=0.003042 dtf=0.000652 dtb=0.002390 loss=514.100464 sps=21038.853899 [2024-11-28 13:44:48.588292][INFO][test_dist.py:228] - iter=50 dt=0.003210 dtf=0.000763 dtb=0.002447 loss=513.998962 sps=19936.401969 [2024-11-28 13:44:48.593155][INFO][test_dist.py:228] - iter=51 dt=0.003210 dtf=0.000763 dtb=0.002447 loss=506.444519 sps=19937.409848 [2024-11-28 13:44:48.598482][INFO][test_dist.py:228] - iter=52 dt=0.003648 dtf=0.000781 dtb=0.002868 loss=505.063721 sps=17542.989327 [2024-11-28 13:44:48.603395][INFO][test_dist.py:228] - iter=53 dt=0.003258 dtf=0.000742 dtb=0.002516 loss=500.011047 sps=19642.561466 [2024-11-28 13:44:48.608385][INFO][test_dist.py:228] - iter=54 dt=0.003292 dtf=0.000797 dtb=0.002495 loss=508.445740 sps=19439.362133 [2024-11-28 13:44:48.613115][INFO][test_dist.py:228] - iter=55 dt=0.003151 dtf=0.000699 dtb=0.002452 loss=492.626648 sps=20310.433009 [2024-11-28 13:44:48.618036][INFO][test_dist.py:228] - iter=56 dt=0.003251 dtf=0.000727 dtb=0.002524 loss=487.402435 sps=19684.735824 [2024-11-28 13:44:48.623122][INFO][test_dist.py:228] - iter=57 dt=0.003449 dtf=0.000979 dtb=0.002469 loss=474.962097 sps=18556.851343 [2024-11-28 13:44:48.628167][INFO][test_dist.py:228] - iter=58 dt=0.003343 dtf=0.000811 dtb=0.002532 loss=479.064941 sps=19143.536594 [2024-11-28 13:44:48.633070][INFO][test_dist.py:228] - iter=59 dt=0.003256 dtf=0.000787 dtb=0.002468 loss=471.197083 sps=19658.706785 [2024-11-28 13:44:48.638373][INFO][test_dist.py:228] - iter=60 dt=0.003548 dtf=0.000853 dtb=0.002696 loss=469.964081 sps=18037.965649 [2024-11-28 13:44:48.643270][INFO][test_dist.py:228] - iter=61 dt=0.003225 dtf=0.000736 dtb=0.002489 loss=476.972076 sps=19844.166872 [2024-11-28 13:44:48.648256][INFO][test_dist.py:228] - iter=62 dt=0.003214 dtf=0.000745 dtb=0.002469 loss=463.572174 sps=19912.478524 [2024-11-28 13:44:48.652939][INFO][test_dist.py:228] - iter=63 dt=0.003095 dtf=0.000683 dtb=0.002411 loss=462.910156 sps=20679.849741 [2024-11-28 13:44:48.657892][INFO][test_dist.py:228] - iter=64 dt=0.003239 dtf=0.000746 dtb=0.002492 loss=457.325439 sps=19762.175344 [2024-11-28 13:44:48.662903][INFO][test_dist.py:228] - iter=65 dt=0.003293 dtf=0.000729 dtb=0.002563 loss=453.347168 sps=19438.021843 [2024-11-28 13:44:48.667970][INFO][test_dist.py:228] - iter=66 dt=0.003276 dtf=0.000788 dtb=0.002488 loss=450.351135 sps=19534.356115 [2024-11-28 13:44:48.672824][INFO][test_dist.py:228] - iter=67 dt=0.003192 dtf=0.000735 dtb=0.002457 loss=450.714233 sps=20047.789409 [2024-11-28 13:44:48.677882][INFO][test_dist.py:228] - iter=68 dt=0.003330 dtf=0.000799 dtb=0.002530 loss=440.284546 sps=19220.840096 [2024-11-28 13:44:48.682532][INFO][test_dist.py:228] - iter=69 dt=0.003081 dtf=0.000663 dtb=0.002418 loss=444.536011 sps=20770.230742 [2024-11-28 13:44:48.687492][INFO][test_dist.py:228] - iter=70 dt=0.003307 dtf=0.000766 dtb=0.002541 loss=446.201233 sps=19354.965715 [2024-11-28 13:44:48.692350][INFO][test_dist.py:228] - iter=71 dt=0.003225 dtf=0.000737 dtb=0.002488 loss=427.167328 sps=19845.047963 [2024-11-28 13:44:48.697422][INFO][test_dist.py:228] - iter=72 dt=0.003415 dtf=0.000881 dtb=0.002534 loss=434.620087 sps=18740.536560 [2024-11-28 13:44:48.702393][INFO][test_dist.py:228] - iter=73 dt=0.003299 dtf=0.000740 dtb=0.002559 loss=425.558777 sps=19402.652860 [2024-11-28 13:44:48.707240][INFO][test_dist.py:228] - iter=74 dt=0.003180 dtf=0.000789 dtb=0.002391 loss=428.168945 sps=20126.919392 [2024-11-28 13:44:48.712103][INFO][test_dist.py:228] - iter=75 dt=0.003322 dtf=0.000680 dtb=0.002642 loss=417.153503 sps=19263.558998 [2024-11-28 13:44:48.717120][INFO][test_dist.py:228] - iter=76 dt=0.003405 dtf=0.000787 dtb=0.002618 loss=412.435059 sps=18794.204421 [2024-11-28 13:44:48.722012][INFO][test_dist.py:228] - iter=77 dt=0.003208 dtf=0.000722 dtb=0.002487 loss=426.690186 sps=19947.873515 [2024-11-28 13:44:48.727040][INFO][test_dist.py:228] - iter=78 dt=0.003330 dtf=0.000775 dtb=0.002554 loss=404.138733 sps=19220.615648 [2024-11-28 13:44:48.731946][INFO][test_dist.py:228] - iter=79 dt=0.003218 dtf=0.000757 dtb=0.002461 loss=396.998413 sps=19886.330391 [2024-11-28 13:44:48.736956][INFO][test_dist.py:228] - iter=80 dt=0.003362 dtf=0.000818 dtb=0.002544 loss=405.073059 sps=19036.951133 [2024-11-28 13:44:48.742154][INFO][test_dist.py:228] - iter=81 dt=0.003309 dtf=0.000840 dtb=0.002468 loss=408.205780 sps=19341.447610 [2024-11-28 13:44:48.747100][INFO][test_dist.py:228] - iter=82 dt=0.003301 dtf=0.000801 dtb=0.002500 loss=391.203918 sps=19389.697222 [2024-11-28 13:44:48.752235][INFO][test_dist.py:228] - iter=83 dt=0.003488 dtf=0.000732 dtb=0.002756 loss=401.911407 sps=18347.650097 [2024-11-28 13:44:48.757468][INFO][test_dist.py:228] - iter=84 dt=0.003567 dtf=0.000886 dtb=0.002681 loss=390.872192 sps=17942.862497 [2024-11-28 13:44:48.762431][INFO][test_dist.py:228] - iter=85 dt=0.003272 dtf=0.000769 dtb=0.002502 loss=390.741089 sps=19562.065370 [2024-11-28 13:44:48.767527][INFO][test_dist.py:228] - iter=86 dt=0.003390 dtf=0.000795 dtb=0.002596 loss=393.982513 sps=18877.408330 [2024-11-28 13:44:48.772363][INFO][test_dist.py:228] - iter=87 dt=0.003198 dtf=0.000727 dtb=0.002471 loss=378.682495 sps=20014.348794 [2024-11-28 13:44:48.777779][INFO][test_dist.py:228] - iter=88 dt=0.003666 dtf=0.000914 dtb=0.002752 loss=378.739502 sps=17459.234394 [2024-11-28 13:44:48.783160][INFO][test_dist.py:228] - iter=89 dt=0.003529 dtf=0.000832 dtb=0.002697 loss=382.028931 sps=18136.158201 [2024-11-28 13:44:48.788498][INFO][test_dist.py:228] - iter=90 dt=0.003442 dtf=0.000809 dtb=0.002633 loss=371.271118 sps=18594.284077 [2024-11-28 13:44:48.793524][INFO][test_dist.py:228] - iter=91 dt=0.003358 dtf=0.000754 dtb=0.002603 loss=371.135925 sps=19060.348912 [2024-11-28 13:44:48.798593][INFO][test_dist.py:228] - iter=92 dt=0.003336 dtf=0.000791 dtb=0.002545 loss=368.860168 sps=19183.369504 [2024-11-28 13:44:48.803491][INFO][test_dist.py:228] - iter=93 dt=0.003227 dtf=0.000725 dtb=0.002502 loss=371.283356 sps=19831.370522 [2024-11-28 13:44:48.808397][INFO][test_dist.py:228] - iter=94 dt=0.003250 dtf=0.000769 dtb=0.002481 loss=362.983154 sps=19693.397864 [2024-11-28 13:44:48.813075][INFO][test_dist.py:228] - iter=95 dt=0.003098 dtf=0.000703 dtb=0.002396 loss=365.535828 sps=20656.015873 [2024-11-28 13:44:48.818465][INFO][test_dist.py:228] - iter=96 dt=0.003581 dtf=0.000872 dtb=0.002709 loss=344.663239 sps=17873.310048 [2024-11-28 13:44:48.823831][INFO][test_dist.py:228] - iter=97 dt=0.003509 dtf=0.000834 dtb=0.002675 loss=361.620361 sps=18238.825661 [2024-11-28 13:44:48.828930][INFO][test_dist.py:228] - iter=98 dt=0.003326 dtf=0.000804 dtb=0.002522 loss=347.258301 sps=19245.190902 [2024-11-28 13:44:48.834103][INFO][test_dist.py:228] - iter=99 dt=0.003474 dtf=0.000740 dtb=0.002733 loss=346.517212 sps=18424.412945 Failed to download font: IBM Plex Sans, skipping! Failed to download font: IBM Plex Sans Condensed, skipping! Failed to download font: IBM Plex Serif, skipping! [2024-11-28 13:44:50.885911][INFO][history.py:696] - Saving train_iter plot to: /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/test-dist-plots/mplot [2024-11-28 13:44:51.498182][INFO][history.py:696] - Saving train_dt plot to: /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/test-dist-plots/mplot [2024-11-28 13:44:51.758706][INFO][history.py:696] - Saving train_dtf plot to: /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/test-dist-plots/mplot [2024-11-28 13:44:52.022463][INFO][history.py:696] - Saving train_dtb plot to: /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/test-dist-plots/mplot [2024-11-28 13:44:52.309521][INFO][history.py:696] - Saving train_loss plot to: /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/test-dist-plots/mplot [2024-11-28 13:44:52.562425][INFO][history.py:696] - Saving train_sps plot to: /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/test-dist-plots/mplot train_iter [2024-11-28-134452] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 99.0โค โโโโโ โ โโโโ โ โ โโโโ โ 82.7โค โโโโ โ โ โโโโ โ โ โโโโโ โ 66.3โค โโโโ โ โ โโโโโ โ 50.0โค โโโโโ โ โ โโโโ โ โ โโโโโ โ 33.7โค โโโโ โ โ โโโโ โ โ โโโโโ โ 17.3โค โโโโ โ โ โโโโ โ โ โโโโ โ 1.0โคโโโโ โ โโโฌโโฌโโโโฌโโโฌโโโฌโโโโโฌโโโฌโโโฌโโโโโฌโโโโฌโโโฌโโโฌโโโฌโโโโฌโโโฌโโโโฌโโโฌโโโฌโโโฌโโโโโฌโโ 2 5 11 16 20 27 31 36 43 48 53 57 61 68 71 78 81 86 91 98 train_iter train/iter [2024-11-28 13:44:52.868990][INFO][plot.py:220] - Appending plot to: /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/test-dist-plots/tplot/train_iter.txt text saved in /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/test-dist-plots/tplot/train_iter.txt train_dt [2024-11-28-134452] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 0.0197โคโ โ โโ โ โโ โ 0.0169โคโ โ โโ โ โโ โ 0.0142โคโ โ โโ โ 0.0114โคโ โ โโ โ โโ โ 0.0086โคโ โ โโ โ โโ โ 0.0058โคโ โ โโ โ โโ โ โ โ 0.0030โค โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โโโฌโโฌโโโโฌโโโฌโโโฌโโโโโฌโโโฌโโโฌโโโโฌโโโโฌโโโฌโโโโโฌโโโโโฌโโโโฌโโโโโฌโโโฌโโโโฌโโโโฌโโ 2 5 11 16 20 27 32 36 43 48 53 61 68 74 81 86 91 98 train_dt train/iter [2024-11-28 13:44:52.888254][INFO][plot.py:220] - Appending plot to: /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/test-dist-plots/tplot/train_dt.txt text saved in /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/test-dist-plots/tplot/train_dt.txt train_dtf [2024-11-28-134452] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 0.00296โคโ โ โโ โ โโ โ 0.00257โคโ โ โโ โ โโ โ 0.00219โคโ โ โโ โ 0.00181โคโ โ โโ โ โโ โ 0.00142โคโ โ โโ โ โโ โ 0.00104โคโ โ โโ โ โ โ โ โ โ โโโโ โโโโโโโโโโโโโโโโโโ โโโโ โโ โโโ โโโ โ โโโ โโโโโโโโโโโโโโโโโโ 0.00065โค โโโโโโ โ โ โ โโโ โโโโโโ โโโโ โโโโโโโโโโโโ โโ โ โโโ โโ โโโฌโโฌโโโโฌโโโฌโโโฌโโโโฌโโโฌโโโฌโโโโโฌโโโฌโโโฌโโโฌโโโฌโโโโฌโโโโฌโโโโโฌโโโฌโโโโฌโโโโฌโโ 2 5 11 16 20 27 31 36 43 48 53 57 61 68 74 81 86 91 98 train_dtf train/iter [2024-11-28 13:44:52.904574][INFO][plot.py:220] - Appending plot to: /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/test-dist-plots/tplot/train_dtf.txt text saved in /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/test-dist-plots/tplot/train_dtf.txt train_dtb [2024-11-28-134452] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 0.0168โคโ โ โโ โ โโ โ 0.0144โคโ โ โโ โ โโ โ 0.0120โคโ โ โโ โ 0.0096โคโ โ โโ โ โโ โ 0.0072โคโ โ โโ โ โโ โ 0.0048โคโ โ โโ โ โโ โ โ 0.0024โค โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โโโฌโโฌโโโโฌโโโฌโโโฌโโโโโฌโโโฌโโโฌโโโโฌโโโโฌโโโฌโโโโโฌโโโโโฌโโโโฌโโโโโฌโโโฌโโโโฌโโโโฌโโ 2 5 11 16 20 27 32 36 43 48 53 61 68 74 81 86 91 98 train_dtb train/iter [2024-11-28 13:44:52.920733][INFO][plot.py:220] - Appending plot to: /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/test-dist-plots/tplot/train_dtb.txt text saved in /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/test-dist-plots/tplot/train_dtb.txt train_loss [2024-11-28-134452] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 1821.1โคโ โ โโ โ โโ โ 1575.1โคโ โ โโ โ โโ โ 1329.0โคโ โ โโโ โ 1082.9โค โ โ โ โ โ โ โโ โ 836.8โค โ โ โ โโ โ โ โโโโโโ โ โ 590.7โค โโโโโโโโโโโโ โ โ โโโโโโโโโโโโโโโโ โ โ โโโโโโโโโโโโโโโ โ โ 344.7โค โโโโโโโโโโโโโโโโ โโโฌโโฌโโโโฌโโโฌโโโฌโโโโโฌโโโฌโโโฌโโโโฌโโโโฌโโโฌโโโโโฌโโโโโฌโโโโฌโโโโโฌโโโฌโโโโฌโโโโฌโโ 2 5 11 16 20 27 32 36 43 48 53 61 68 74 81 86 91 98 train_loss train/iter [2024-11-28 13:44:52.939625][INFO][plot.py:220] - Appending plot to: /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/test-dist-plots/tplot/train_loss.txt text saved in /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/test-dist-plots/tplot/train_loss.txt train_sps [2024-11-28-134452] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 21038.9โค โโโโโ โ โ โโ โ โ โ โ โ โ โโโ โ โโโ โโโ โโโโโโ โโโโโ โ โโโ โโโโโโโโโโโโโโโโ โโ โโโ โ โ โ โ โโโโโโโโโโโโโโ โโ โโโโโ โโ โโโ โ โ โโ โโโโโ โโโ โโโโ 18073.5โค โ โโ โโ โโ โโโ โ โโ โ โ โโโ โโ โ โโ โโ โ โโ โ โโ โ 15108.1โคโ โ โโ โ 12142.7โคโ โ โโ โ โโ โ 9177.3โคโ โ โโ โ โโ โ 6211.9โคโ โ โโ โ โโ โ 3246.6โคโ โ โโโฌโโฌโโโโฌโโโฌโโโฌโโโโฌโโโฌโโโฌโโโโโฌโโโฌโโโฌโโโฌโโโฌโโโโฌโโโโฌโโโโโฌโโโฌโโโโฌโโโโฌโโ 2 5 11 16 20 27 31 36 43 48 53 57 61 68 74 81 86 91 98 train_sps train/iter [2024-11-28 13:44:52.955997][INFO][plot.py:220] - Appending plot to: /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/test-dist-plots/tplot/train_sps.txt text saved in /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/test-dist-plots/tplot/train_sps.txt [2024-11-28 13:44:53.002178][INFO][test_dist.py:246] - dataset=<xarray.Dataset> Size: 5kB Dimensions: (draw: 99) Coordinates: * draw (draw) int64 792B 0 1 2 3 4 5 6 7 8 ... 91 92 93 94 95 96 97 98 Data variables: train_iter (draw) int64 792B 1 2 3 4 5 6 7 8 9 ... 92 93 94 95 96 97 98 99 train_dt (draw) float64 792B 0.01971 0.004083 ... 0.003326 0.003474 train_dtf (draw) float64 792B 0.002959 0.0008221 ... 0.0008038 0.0007404 train_dtb (draw) float64 792B 0.01675 0.003261 ... 0.002522 0.002733 train_loss (draw) float32 396B 1.821e+03 1.348e+03 ... 347.3 346.5 train_sps (draw) float64 792B 3.247e+03 1.567e+04 ... 1.925e+04 1.842e+04 _ ._ __/__ _ _ _ _ _/_ Recorded: 13:44:35 Samples: 2127 /_//_/// /_\ / //_// / //_'/ // Duration: 17.440 CPU time: 29.088 / _/ v5.0.0 Profile at /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/src/ezpz/profile.py:101 17.440 <module> ezpz/test_dist.py:1 โโ 17.440 main ezpz/test_dist.py:177 โโ 10.558 build_model_and_optimizer ezpz/test_dist.py:136 โ โโ 10.547 DistributedDataParallel.__init__ torch/nn/parallel/distributed.py:622 โ โโ 10.502 _verify_param_shape_across_processes torch/distributed/utils.py:266 โ โโ 10.502 PyCapsule._verify_params_across_processes <built-in> โโ 3.256 History.plot_all ezpz/history.py:636 โ โโ 1.809 savefig matplotlib/pyplot.py:974 โ โ [38 frames hidden] matplotlib, PIL, <built-in>, numpy โ โโ 0.895 <module> seaborn/__init__.py:1 โ โ [8 frames hidden] seaborn, scipy โ โโ 0.227 History.get_dataset ezpz/history.py:752 โ โ โโ 0.183 History.to_DataArray ezpz/history.py:712 โ โ โโ 0.183 DataArray.__init__ xarray/core/dataarray.py:437 โ โ [5 frames hidden] xarray โ โโ 0.179 make_ridgeplots ezpz/plot.py:900 โโ 2.016 _forward_step ezpz/test_dist.py:193 โ โโ 1.963 DistributedDataParallel._wrapped_call_impl torch/nn/modules/module.py:1528 โ [4 frames hidden] torch โ 1.945 Network._call_impl torch/nn/modules/module.py:1534 โ โโ 1.943 Network.forward ezpz/test_dist.py:128 โ โโ 1.943 Sequential._wrapped_call_impl torch/nn/modules/module.py:1528 โ [6 frames hidden] torch, <built-in> โโ 0.716 <module> ambivalent/__init__.py:1 โ [20 frames hidden] ambivalent, requests, urllib3, http, ... โโ 0.495 _backward_step ezpz/test_dist.py:198 โโ 0.457 Tensor.backward torch/_tensor.py:466 [3 frames hidden] torch, <built-in> [2024-11-28 13:44:54.356012][INFO][profile.py:115] - Saving pyinstrument profile output to: /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/ezpz_pyinstrument_profiles [2024-11-28 13:44:54.356642][INFO][profile.py:123] - PyInstrument profile saved (as html) to: /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/ezpz_pyinstrument_profiles/pyinstrument-profile-2024-11-28-134454.html [2024-11-28 13:44:54.357100][INFO][profile.py:131] - PyInstrument profile saved (as text) to: /lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/ezpz_pyinstrument_profiles/pyinstrument-profile-2024-11-28-134454.txt [2024-11-28 13:44:54.716119][INFO][profile.py:143] - Finished with pyinstrument profiler. Took: 17.44008s [2024-11-28 13:44:54.717891][INFO][test_dist.py:269] - [0] runtime=73.419312s wandb: ๐ View run driven-wind-683 at: https://wandb.ai/aurora_gpt/ezpz.test_dist/runs/4mwyy84l wandb: Find logs at: ../../../../../../lus/gila/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/wandb/run-20241128_134434-4mwyy84l/logs Application ecd9868b resources: utime=2078s stime=374s maxrss=2509188KB inblock=786752 oublock=12104 minflt=13378945 majflt=181980 nvcsw=1704995 nivcsw=216931 took: 0h:01m:32s
๐ ๏ธ Setup Python
This will:
Automatically load and activate
conda
using theezpz_setup_conda
function.How this is done, in practice, varies from machine to machine:
ALCF2: Automatically load the most recent
conda
module and activate the base environment.Frontier: Load the appropriate AMD modules (e.g.
rocm
,RCCL
, etc.), and activate baseconda
Perlmutter: Load the appropriate
pytorch
module and activate environmentUnknown: In this case, we will look for a
conda
,mamba
, ormicromamba
executable, and if found, use that to activate the base environment.
Using your ownconda
If you are already in a conda environment when calling
ezpz_setup_python
then it will try and use this instead.For example, if you have a custom
conda
env at~/conda/envs/custom
, then this would bootstrap thecustom
conda environment and create the virtual env invenvs/custom/
Build (or activate, if found) a virtual environment on top of (the active) base
conda
environment.By default, it will try looking in:
$PBS_O_WORKDIR
, otherwise${SLURM_SUBMIT_DIR}
, otherwise$(pwd)
for a nested folder named
"venvs/${CONDA_NAME}"
.If this doesnโt exist, it will attempt to create a new virtual environment at this location using:
(where weโve pulled in the
--system-site-packages
from conda).
๐งฐ Setup Job
Now that we are in a suitable python environment, we need to construct the command that we will use to run python on each of our acceleratorss.
To do this, we need a few things:
- What machine weโre on (and what scheduler is it using i.e. {PBS, SLURM})
- How many nodes are available in our active job
- How many GPUs are on each of those nodes
- What type of GPUs are they
With this information, we can then use mpi{exec,run}
or srun
to launch python across all of our accelerators.
Again, how this is done will vary from machine to machine and will depend on the job scheduler in use.
To identify where we are, we look at our $(hostname)
and see if weโre running on one of the known machines:
- ALCF3: Using PBS Pro via
qsub
andmpiexec
/mpirun
.x4*
: Aurora- Aurora:
x4*
(oraurora*
on login nodes) - Sunspot:
x1*
(oruan*
) - Sophia:
sophia-*
- Polaris / Sirius:
x3*
- to determine between the two, we look at
"${PBS_O_HOST}"
- to determine between the two, we look at
OLCF: Using Slurm via
sbatch
/srun
.frontier*
: Frontier, using Slurmnid*
: Perlmutter, using Slurm
Unknown machine: If
$(hostname)
does not match one of these patterns we assume that we are running on an unknown machine and will try to usempirun
as our generic launch commandOnce we have this, we can:
Get
PBS_NODEFILE
from$(hostname)
:ezpz_qsme_running
: For each (running) job owned by${USER}
, print out both the jobid as well as a list of hosts the job is running on, e.g.:ezpz_get_pbs_nodefile_from_hostname
: Look for$(hostname)
in the output from the above command to determine our${PBS_JOBID}
.Once weโve identified our
${PBS_JOBID}
we then know the location of our${PBS_NODEFILE}
since they are named according to:
Identify number of available accelerators:
๐ Python Library
To install4:
- ๐
ezpz
/src
/ezpz/
- ๐
bin/
:utils.sh
: Shell utilities forezpz
- ๐
conf/
:- โ๏ธ
config.yaml
: DefaultTrainConfig
object - โ๏ธ
ds_config.json
: DeepSpeed configuration
- โ๏ธ
- ๐
log/
: Logging configuration. - ๐
__about__.py
: Version information - ๐
__init__.py
: Main module - ๐
__main__.py
: Entry point - ๐
configs.py
: Configuration module - ๐
cria.py
: Baby Llama - ๐
dist.py
: Distributed training module - ๐
history.py
: History module - ๐
jobs.py
: Jobs module - ๐
model.py
: Model module - ๐
plot.py
: Plot modul - ๐
profile.py
: Profile module - ๐
runtime.py
: Runtime module - ๐
test.py
: Test module - ๐
test_dist.py
: Distributed training test module - ๐
train.py
: train module - ๐
trainer.py
: trainer module - ๐
utils.py
: utility module
- ๐
๐ /ezpz/src/ezpz/
โฃโโ ๐ bin/
โ โฃโโ ๐ affinity.sh
โ โฃโโ ๐ getjobenv
โ โฃโโ ๐ savejobenv
โ โฃโโ ๐ saveslurmenv
โ โฃโโ ๐ setup.sh
โ โฃโโ ๐ train.sh
โ โโโ ๐ utils.sh
โฃโโ ๐ conf/
โ โฃโโ ๐ hydra/
โ โ โโโ ๐ job_logging/
โ โ โฃโโ โ๏ธ colorlog1.yaml
โ โ โฃโโ โ๏ธ custom.yaml
โ โ โโโ โ๏ธ enrich.yaml
โ โฃโโ ๐ logdir/
โ โ โโโ โ๏ธ default.yaml
โ โฃโโ โ๏ธ config.yaml
โ โฃโโ ๐ ds_config.json
โ โโโ โ๏ธ ds_config.yaml
โฃโโ ๐ log/
โ โฃโโ ๐ conf/
โ โ โโโ ๐ hydra/
โ โ โโโ ๐ job_logging/
โ โ โโโ โ๏ธ enrich.yaml
โ โฃโโ ๐ __init__.py
โ โฃโโ ๐ __main__.py
โ โฃโโ ๐ config.py
โ โฃโโ ๐ console.py
โ โฃโโ ๐ handler.py
โ โฃโโ ๐ style.py
โ โฃโโ ๐ test.py
โ โโโ ๐ test_log.py
โฃโโ ๐ __about__.py
โฃโโ ๐ __init__.py
โฃโโ ๐ __main__.py
โฃโโ ๐ configs.py
โฃโโ ๐ cria.py
โฃโโ ๐ dist.py
โฃโโ ๐ history.py
โฃโโ ๐ jobs.py
โฃโโ ๐ loadjobenv.py
โฃโโ ๐ model.py
โฃโโ ๐ plot.py
โฃโโ ๐ profile.py
โฃโโ ๐ runtime.py
โฃโโ ๐ savejobenv.py
โฃโโ ๐ test.py
โฃโโ ๐ test_dist.py
โฃโโ ๐ train.py
โฃโโ ๐ trainer.py
โโโ ๐ utils.py
Footnotes
Plus this is useful for tab-completions in your shell, e.g.:
โฉ๏ธAny of {Aurora, Polaris, Sophia, Sunspot, Sirius}โฉ๏ธ
At ALCF, if our
$(hostname)
starts withx*
, weโre on a compute node.โฉ๏ธNote the
--require-virtualenv
isnโt strictly required, but I highly recommend to always try and work within a virtual environment, when possible.โฉ๏ธ
Citation
@article{foreman2024,
author = {Foreman, Sam},
title = {๐ Ezpz at {ALCF}},
date = {2024-08-23},
url = {https://samforeman.me/posts/ezpz-at-alcf},
langid = {en}
}