๐ ezpz
@ ALCF
Work smarter, not harder.
๐ฃ Getting Started
There are two main, distinct components of ezpz
:
- ๐ Shell Utilities
- ๐ Python Library
๐ Shell
[!IMPORTANT]
We provide a variety of helper functions designed to make your life easier.Theyโre entirely self-contained in
utils.sh
and are all prefixed withezpz_
To take advantage of these helper functions we need to source
this utils.sh
.
This can be done by:
PBS_O_WORKDIR=$(pwd) source /dev/stdin <<< $(curl 'https://raw.githubusercontent.com/saforem2/ezpz/refs/heads/main/src/ezpz/bin/utils.sh')
Now, there are a few functions in particular worth elaborating on.
ezpz_setup_python
: This is done in two parts.ezpz_setup_conda
: Will first attempt to load and activate a suitableconda
environment.Depending on where weโre running from, this will behave slightly differently:
ALCF1: Automatically load the most recent
conda
module and activate the base environment.Frontier: Load the appropriate AMD modules (e.g.
rocm
,RCCL
, etc.), and activate baseconda
Perlmutter: Load the appropriate
pytorch
module and activate environmentUnknown: In this case, we will look for a
conda
,mamba
, ormicromamba
executable, and if found, use that to activate the base environment.
ezpz_setup_venv
: Ideally, we are now inside a suitableconda
environment.Next, we will look for a virtual environment located in:
If this exists, activate it and continue. Otherwise, we will create a new
venv
via: virtualwhere weโve pulled in the
--system-site-packages
from conda.
ezpz_setup_job
ezpz_setup_env
[!WARNING]
Some of the
ezpz_*
functions (e.g.ezpz_setup_python
), will try to create / look for certain directories.In an effort to be explicit, these directories will be defined relative to a
WORKING_DIR
(e.g."${WORKING_DIR}/venvs/"
)This
WORKING_DIR
will be assigned to the first non-zero match found below:
PBS_O_WORKDIR
: If found in environment, paths will be relative to thisSLURM_SUBMIT_DIR
: Next in line. If not @ ALCF, maybe usingslurm
โฆ$(pwd)
: Otherwise, no worries. Use your actual working directory.
๐ Python
To install2:
๐ /ezpz/src/ezpz/
โฃโโ ๐ bin/
โ โฃโโ ๐ affinity.sh
โ โฃโโ ๐ getjobenv
โ โฃโโ ๐ savejobenv
โ โฃโโ ๐ saveslurmenv
โ โฃโโ ๐ setup.sh
โ โฃโโ ๐ train.sh
โ โโโ ๐ utils.sh
โฃโโ ๐ conf/
โ โฃโโ ๐ hydra/
โ โ โโโ ๐ job_logging/
โ โ โฃโโ โ๏ธ colorlog1.yaml
โ โ โฃโโ โ๏ธ custom.yaml
โ โ โโโ โ๏ธ enrich.yaml
โ โฃโโ ๐ logdir/
โ โ โโโ โ๏ธ default.yaml
โ โฃโโ โ๏ธ config.yaml
โ โฃโโ ๐ ds_config.json
โ โโโ โ๏ธ ds_config.yaml
โฃโโ ๐ log/
โ โฃโโ ๐ conf/
โ โ โโโ ๐ hydra/
โ โ โโโ ๐ job_logging/
โ โ โโโ โ๏ธ enrich.yaml
โ โฃโโ ๐ __init__.py
โ โฃโโ ๐ __main__.py
โ โฃโโ ๐ config.py
โ โฃโโ ๐ console.py
โ โฃโโ ๐ handler.py
โ โฃโโ ๐ style.py
โ โฃโโ ๐ test.py
โ โโโ ๐ test_log.py
โฃโโ ๐ __about__.py
โฃโโ ๐ __init__.py
โฃโโ ๐ __main__.py
โฃโโ ๐ configs.py
โฃโโ ๐ cria.py
โฃโโ ๐ dist.py
โฃโโ ๐ history.py
โฃโโ ๐ jobs.py
โฃโโ ๐ loadjobenv.py
โฃโโ ๐ model.py
โฃโโ ๐ plot.py
โฃโโ ๐ profile.py
โฃโโ ๐ runtime.py
โฃโโ ๐ savejobenv.py
โฃโโ ๐ test.py
โฃโโ ๐ test_dist.py
โฃโโ ๐ train.py
โฃโโ ๐ trainer.py
โโโ ๐ utils.py
Old README
s
README
(old)
ezpz
๐
Sam Foreman
2024-05-14
๐ Overview
ezpz
๐Launch, train and communicate across all your accelerators,
ezpz
.Full support for your favorite framework + backend combo โค๏ธ.
ezpz
simplifies the process of:
Setting up + launching distributed training:
import ezpz as ez
RANK =
ez.setup_torch(backend=backend)
forbackend
\in {DDP
,deepspeed
,horovod
}
RANK =
ez.get_rank()
LOCAL_RANK =
ez.get_local_rank()
WORLD_SIZE =
ez.get_world_size()
(see
ezpz/dist.py
for more details).Writing device agnostic code:
Full support for any {
device
+framework
+backend
}:
- device: {
GPU
,XPU
,MPS
,CPU
}- framework: {
torch
,deepspeed
,horovod
,tensorflow
}- backend: {
DDP
,deepspeed
,horovod
}
test_dist.py
import os
import logging
import time
from typing import Optional
import torch
import ezpz as ez
# backend can be any of DDP, deespepeed, horovod
RANK = ez.setup_torch(
backend=(
backend := os.environ.get('BACKEND', 'DDP')
),
port=(
port := os.environ.get("MASTER_PORT", "29500")
)
)
# RANK = DIST_INIT['rank']
# WORLD_SIZE = DIST_INIT['world_size']
# LOCAL_RANK = DIST_INIT['local_rank']
# if DEVICE == "cuda" and torch.cuda.is_available():
# torch.cuda.set_device(LOCAL_RANK)
DEVICE = ez.get_torch_device()
WORLD_SIZE = ez.get_world_size()
LOCAL_RANK = ez.get_local_rank()
DEVICE_ID = f"{DEVICE}:{LOCAL_RANK}"
# log only from RANK == 0
logger = logging.getLogger(__name__)
logger.setLevel("INFO") if RANK == 0 else logger.setLevel("CRITICAL")
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", 64)) # 64
INPUT_SIZE = int(os.environ.get("INPUT_SIZE", 128)) # 128
OUTPUT_SIZE = int(os.environ.get("OUTPUT_SIZE", 128)) # 128
DTYPE = os.environ.get("DTYPE", torch.get_default_dtype())
TRAIN_ITERS = int(os.environ.get("TRAIN_ITERS", 50))
# logger.info(f"{DIST_INIT=}")
class Network(torch.nn.Module):
def __init__(
self,
input_dim: int = 128,
output_dim: int = 128,
sizes: Optional[list[int]] = None,
):
super(Network, self).__init__()
if sizes is None:
self.layers = torch.nn.Linear(input_dim, output_dim)
elif len(sizes) > 0:
layers = [torch.nn.Linear(input_dim, sizes[0])]
for idx, size in enumerate(sizes[1:]):
layers.append(
torch.nn.Linear(sizes[idx], size)
)
layers.append(torch.nn.Linear(sizes[-1], output_dim))
self.layers = torch.nn.Sequential(*layers)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.layers(x)
def calc_loss(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
return (y - x).pow(2).sum()
def plot_losses(losses: dict) -> None:
import plotext as pltx
# y = list(losses.values())
pltx.theme('clear')
pltx.scatter(list(losses.values()))
pltx.show()
pltx.save_fig("test_dist_losses.txt")
pltx.ylabel("loss")
pltx.xlabel("iteration")
def main():
model = Network(
input_dim=INPUT_SIZE,
output_dim=OUTPUT_SIZE,
sizes=[1024, 512, 256, 128]
)
model.to(DEVICE)
model.to(DEVICE_ID)
logger.info(f'{model=}')
optimizer = torch.optim.Adam(model.parameters())
if backend.lower() == 'ddp':
if WORLD_SIZE > 1:
from torch.nn.parallel import DistributedDataParallel as DDP
model = DDP(
model,
device_ids=[]
)
elif backend.lower() in ('ds', 'deepspeed'):
import deepspeed
# config = ez.load_ds_config().update(
# {"train_micro_batch_size_per_gpu": BATCH_SIZE}
# )
import argparse
parser = argparse.ArgumentParser(
description='My training script.'
)
parser.add_argument(
'--local_rank',
required=False,
type=int,
default=-1,
# default=ez.get_local_rank()),
help='local rank passed from distributed launcher',
)
# Include DeepSpeed configuration arguments
parser = deepspeed.add_config_arguments(parser)
cmd_args = parser.parse_args()
logger.info(f'{cmd_args=}')
model, optimizer, *_ = deepspeed.initialize(
args=cmd_args,
model=model,
optimizer=optimizer,
)
losses = {}
for iter in range(TRAIN_ITERS):
t0 = time.perf_counter()
x = torch.rand((BATCH_SIZE, INPUT_SIZE), dtype=DTYPE).to(DEVICE)
y = model(x)
loss = calc_loss(x, y)
losses[iter] = loss
dtf = ((t1 := time.perf_counter()) - t0)
if backend == 'deepspeed':
model.backward(loss)
model.step(loss)
else:
loss.backward()
optimizer.step()
optimizer.zero_grad()
dtb = time.perf_counter() - t1
logger.info(
', '.join([
f'{iter=}',
f'loss={loss.item():.5f}',
f'dt={dtf+dtb:.3f}',
f'{dtf=:.3f}',
f'{dtb=:.3f}'
])
)
if RANK == 0:
plot_losses(losses)
if __name__ == '__main__':
main()
README
(grandparent)
ezpz
Simplifies the process of setting up distributed training for:
pytorch
+{DDP, deepspeed, horovod}
tensorflow
+horovod
Setup
Install:
Determine available resources:
[ "$(hostname)==theta*" ] && HOSTFILE="${COBALT_NODEFILE}" # ThetaGPU @ ALCF [ "$(hostname)==x3*" ] && HOSTFILE="${PBS_NODEFILE}" # Polaris @ ALCF [ "$(hostname)==nid*" ] && HOSTFILE="${SLURM_NODELIST}" # Perlmutter @ NERSC NHOSTS=$(wc -l < "${HOSTFILE}") NGPU_PER_HOST=$(nvidia-smi -L | wc -l) NGPUS="$((${NHOSTS}*${NGPU_PER_HOST}))"; echo $NHOSTS $NGPU_PER_HOST $NGPUS 2 4 8
Example
python
script:""" ezpz/test.py """ from ezpz import setup_torch, setup_tensorflow def test( framework: str = 'pytorch', backend: str = 'deepspeed', port: int | str = '5432' ): if framework == 'pytorch': _ = setup_torch( backend=backend, port=port, ) elif framework == 'tensorflow': _ = setup_tensorflow() else: raise ValueError if __name__ == '__main__': import sys try: framework = sys.argv[1] except IndexError: framework = 'pytorch' try: backend = sys.argv[2] except IndexError: backend = 'deepspeed' try: port = sys.argv[3] except IndexError: port = '5432' test(framework=framework, backend=backend, port=port)
Tests
You can test a {framework, backend}
combination by:
for framework
\in {pytorch, tensorflow}
and backend
\in {horovod, deepspeed, DDP}
3
โ PyTorch + DDP:
โ PyTorch + DeepSpeed:
โ PyTorch + Horovod:
โ TensorFlow + Horovod:
Helper Utilities
src/ezpz/bin/savejobenv
: Shell script to save relevant job related environment variables to a file which can be sourced from new login instances.src/ezpz/bin/getjobenv
: Shell script that, when sourced, will populate the current environment with the necessary job-related variables.
Launch a job, clone (or navigate into)
ezpz
, and runsrc/ezpz/bin/savejobenv
:(thetalogin5) $ qsub-gpu -A datascience -n 4 -q full-node --attrs="filesystems=home,grand,eagle,theta-fs0:ssds=required" -t 12:00 -I (thetagpu13) $ git clone https://github.com/saforem2/ezpz (thetagpu13) $ cd ezpz/src/ezpz (thetagpu13) $ ./bin/savejobenv โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ [DIST INFO]: โ โข Writing Job info to /home/foremans/.cobaltenv โ โข NHOSTS: 4 โ โข NGPU_PER_HOST: 8 โ โข NGPUS = (NHOSTS * NGPU_PER_HOST) = 32 โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ Saving COBALT env to /home/foremans/.cobaltenv from thetagpu13 โ Writing COBALT vars to /home/foremans/.cobaltenv โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ Copying COBALT_NODEFILE to clipboard... โ COBALT_NODEFILE: /var/tmp/cobalt.10154591 โ [Hosts]: โ thetagpu13 thetagpu12 thetagpu19 thetagpu18 โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ Run 'source getjobenv' in a NEW SHELL to automatically set env vars โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
now, in a NEW SHELL
(localhost) $ ssh foremans@theta (thetalogin5) $ ssh thetagpu18 (thetagpu18) $ module load conda/2023-01-11; cond activate base (thetagpu18) $ cd ezpz (thetagpu18) $ mkdir -p venvs/thetaGPU/2023-01-11 (thetagpu18) $ python3 -m venv venvs/thetaGPU/2023-01-11 --system-site-packages (thetagpu18) $ source venvs/thetaGPU/2023-01-11/bin/activate (thetagpu18) $ python3 -m pip install -e . (thetagpu18) $ cd ezpz/src/ezpz (thetagpu18) $ source bin/getjobenv RUNNING_JOB_FILE: /var/tmp/cobalt-running-job JOBID: 10154591 Loading job env from: /home/foremans/.cobaltenv Defining alias mpilaunch: mpilaunch: aliased to mpirun -n 32 -N 8 --hostfile /var/tmp/cobalt.10154591 -x PATH -x LD_LIBRARY_PATH HOSTFILE: /var/tmp/cobalt.10154591 NHOSTS: 4 NGPU_PER_HOST: 8 NGPUS (NHOSTS x NGPU_PER_HOST): 32 HOSTS: thetagpu13 thetagpu12 thetagpu19 thetagpu18 (thetagpu18) $ mpilaunch python3 -m ezpz pytorch DDP Using DDP for distributed training RANK: 0 / 31 RANK: 25 / 31 RANK: 24 / 31 RANK: 15 / 31 RANK: 26 / 31 RANK: 31 / 31 RANK: 2 / 31 RANK: 12 / 31 RANK: 1 / 31 RANK: 28 / 31 RANK: 3 / 31 RANK: 14 / 31 RANK: 4 / 31 RANK: 10 / 31 RANK: 27 / 31 RANK: 5 / 31 RANK: 30 / 31 RANK: 29 / 31 RANK: 9 / 31 RANK: 7 / 31 RANK: 6 / 31 RANK: 13 / 31 RANK: 8 / 31 RANK: 11 / 31 RANK: 18 / 31 RANK: 16 / 31 RANK: 21 / 31 RANK: 20 / 31 RANK: 22 / 31 RANK: 19 / 31 RANK: 17 / 31 RANK: 23 / 31
while this example looked at ThetaGPU, the exact same process will work on any of
{ThetaGPU, Polaris, Perlmutter}
.2ez
Deprecated:
Complete Example
Gist
๐ฆ Clone Repo(s)
argonne-lcf/Megatron-DeepSpeed
:$ git clone https://github.com/argonne-lcf/Megatron-DeepSpeed Cloning into 'Megatron-DeepSpeed'... remote: Enumerating objects: 15538, done. remote: Counting objects: 100% (21/21), done. remote: Compressing objects: 100% (11/11), done. remote: Total 15538 (delta 10), reused 18 (delta 10), pack-reused 15517 Receiving objects: 100% (15538/15538), 6.25 MiB | 32.32 MiB/s, done. Resolving deltas: 100% (11482/11482), done. Updating files: 100% (596/596), done.
-
$ cd Megatron-DeepSpeed $ git clone https://github.com/saforem2/ezpz deps/ezpz Cloning into 'deps/ezpz'... remote: Enumerating objects: 2161, done. remote: Counting objects: 100% (390/390), done. remote: Compressing objects: 100% (181/181), done. remote: Total 2161 (delta 214), reused 285 (delta 151), pack-reused 1771 Receiving objects: 100% (2161/2161), 4.28 MiB | 25.35 MiB/s, done. Resolving deltas: 100% (1134/1134), done.
๐ Setup Job
Source
ezpz/bin/utils.sh
:ezpz_setup_alcf
:$ ezpz_setup_alcf [ezpz/bin/utils.sh] [2024-07-23-221417] โข USER=foremans โข MACHINE=polaris โข HOST=x3006c0s25b1n0 [ezpz_get_pbs_env]: Caught 0 arguments โข hostfile: /var/spool/pbs/aux/2036165.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov โข jobenv_file: /home/foremans/.pbsenv [ezpz_setup_host] โข Using hostfile: /var/spool/pbs/aux/2036165.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov โข Found in environment: โข HOSTFILE: /var/spool/pbs/aux/2036165.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov โข Writing PBS vars to: /home/foremans/.pbsenv [ezpz_save_pbs_env] โข Setting: โข HOSTFILE: /var/spool/pbs/aux/2036165.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov โข JOBENV_FILE: /home/foremans/.pbsenv [HOSTS] โข [host:0] - x3006c0s25b1n0.hsn.cm.polaris.alcf.anl.gov [DIST INFO] โข HOSTFILE=/var/spool/pbs/aux/2036165.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov โข NHOSTS=1 โข NGPU_PER_HOST=4 โข NGPUS=4 โข DIST_LAUNCH=mpiexec --verbose --envall -n 4 -ppn 4 --hostfile /var/spool/pbs/aux/2036165.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov --cpu-bind depth -d 16 [LAUNCH]: โข To launch across all available GPUs, use: launch launch = mpiexec --verbose --envall -n 4 -ppn 4 --hostfile /var/spool/pbs/aux/2036165.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov --cpu-bind depth -d 16
๐ Setup Python
ezpz_setup_python
:$ ezpz_setup_python No conda_prefix OR virtual_env found in environment... Setting up conda... Lmod is automatically replacing "nvhpc/23.9" with "gcc-native/12.3". Lmod is automatically replacing "PrgEnv-nvhpc/8.5.0" with "PrgEnv-gnu/8.5.0". Due to MODULEPATH changes, the following have been reloaded: 1) cray-mpich/8.1.28 Found conda at: /soft/applications/conda/2024-04-29/mconda3 No VIRTUAL_ENV found in environment! - Trying to setup from /soft/applications/conda/2024-04-29/mconda3 - Using VENV_DIR=/eagle/argonne_tpc/foremans/projects/argonne-lcf/tmp/2024-07-23-221253/Megatron-DeepSpeed/venvs/2024-04-29 - Creating a new virtual env on top of 2024-04-29 in /eagle/argonne_tpc/foremans/projects/argonne-lcf/tmp/2024-07-23-221253/Megatron-DeepSpeed/venvs/2024-04-29 [python] Using /eagle/argonne_tpc/foremans/projects/argonne-lcf/tmp/2024-07-23-221253/Megatron-DeepSpeed/venvs/2024-04-29/bin/python3 $ which python3 /eagle/argonne_tpc/foremans/projects/argonne-lcf/tmp/2024-07-23-221253/Megatron-DeepSpeed/venvs/2024-04-29/bin/python3 $ python3 -m pip install -e deps/ezpz --require-virtualenv Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com Obtaining file:///lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/tmp/2024-07-23-221253/Megatron-DeepSpeed/deps/ezpz Installing build dependencies ... done Checking if build backend supports build_editable ... done Getting requirements to build editable ... done Installing backend dependencies ... done Preparing editable metadata (pyproject.toml) ... done
๐ Test Setup
-
Output
[2024-07-23 22:21:37.972869][INFO][__init__:156] - Setting logging level to 'INFO' on 'RANK == 0' [2024-07-23 22:21:37.975224][INFO][__init__:157] - Setting logging level to 'CRITICAL' on all others 'RANK != 0' [2024-07-23 22:21:37.975718][INFO][__init__:160] - To disable this behavior, and log from ALL ranks (not recommended), set: 'export LOG_FROM_ALL_RANKS=1' in your environment, and re-run. [2024-07-23 22:21:39.790899][INFO][dist:358] - [device='cuda'][rank=1/3][local_rank=1/3][node=0/0] [2024-07-23 22:21:39.790850][INFO][dist:358] - [device='cuda'][rank=2/3][local_rank=2/3][node=0/0] [2024-07-23 22:21:39.791749][INFO][dist:358] - [device='cuda'][rank=3/3][local_rank=3/3][node=0/0] [2024-07-23 22:21:39.797666][INFO][dist:95] - [dist_info]: โข DEVICE=cuda โข DEVICE_ID=cuda:0 โข DISTRIBUTED_BACKEND=nccl โข GPUS_PER_NODE=4 โข HOSTS=['x3006c0s25b1n0.hsn.cm.polaris.alcf.anl.gov'] โข HOSTFILE=/var/spool/pbs/aux/2036165.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov โข HOSTNAME=x3006c0s25b1n0.hsn.cm.polaris.alcf.anl.gov โข LOCAL_RANK=0 โข MACHINE=Polaris โข NUM_NODES=1 โข NGPUS=4 โข NGPUS_AVAILABLE=4 โข NODE_ID=0 โข RANK=0 โข SCHEDULER=PBS โข WORLD_SIZE_TOTAL=4 โข WORLD_SIZE_IN_USE=4 โข LAUNCH_CMD=mpiexec --verbose --envall -n 4 -ppn 4 --hostfile /var/spool/pbs/aux/2036165.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov --cpu-bind depth -d 16 [2024-07-23 22:21:39.800519][INFO][dist:725] - [0/4] Using device='cuda' with backend='DDP' + 'nccl' for distributed training. [2024-07-23 22:21:39.805001][INFO][dist:358] - [device='cuda'][rank=0/3][local_rank=0/3][node=0/0] [2024-07-23 22:21:39.805513][WARNING][dist:364] - Using [4 / 4] available "cuda" devices !! [2024-07-23 22:21:39.806121][INFO][dist:95] - [timers_import]: โข os=1.062639057636261e-06 โข logging=4.0046870708465576e-07 โข typing=2.7157366275787354e-06 โข pathlib=1.2516975402832031e-06 โข ezpz=6.30505383014679e-07 โข torch=2.555549144744873e-06 โข torch_ddp=2.4745240807533264e-06 โข wandb=6.44102692604065e-05 โข total=7.550138980150223e-05 [2024-07-23 22:21:39.807221][INFO][dist:95] - [CONFIG]: โข warmup=0 โข log_freq=1 โข batch_size=64 โข input_size=128 โข output_size=128 โข dtype=torch.float32 โข device=cuda โข world_size=4 โข train_iters=100 [2024-07-23 22:21:41.373173][INFO][test_dist:183] - model=Network( (layers): Sequential( (0): Linear(in_features=128, out_features=1024, bias=True) (1): Linear(in_features=1024, out_features=512, bias=True) (2): Linear(in_features=512, out_features=256, bias=True) (3): Linear(in_features=256, out_features=128, bias=True) (4): Linear(in_features=128, out_features=128, bias=True) ) ) [2024-07-23 22:21:43.625040][INFO][test_dist:274] - iter=1, loss=2039.91, sps=2.252e+04, dt=0.00284196, dtf=0.0009801, dtb=0.001862 [2024-07-23 22:21:43.628643][INFO][test_dist:274] - iter=2, loss=1424.54, sps=3.272e+04, dt=0.00195628, dtf=0.0005183, dtb=0.001438 [2024-07-23 22:21:43.631833][INFO][test_dist:274] - iter=3, loss=1159.38, sps=3.331e+04, dt=0.00192147, dtf=0.0006139, dtb=0.001308 [2024-07-23 22:21:43.634991][INFO][test_dist:274] - iter=4, loss=935.58, sps=3.343e+04, dt=0.00191451, dtf=0.0006113, dtb=0.001303 [2024-07-23 22:21:43.638092][INFO][test_dist:274] - iter=5, loss=851.468, sps=3.483e+04, dt=0.00183759, dtf=0.0005938, dtb=0.001244 [2024-07-23 22:21:43.641232][INFO][test_dist:274] - iter=6, loss=785.109, sps=3.409e+04, dt=0.00187757, dtf=0.0005972, dtb=0.00128 [2024-07-23 22:21:43.644367][INFO][test_dist:274] - iter=7, loss=772.966, sps=3.417e+04, dt=0.00187292, dtf=0.0005868, dtb=0.001286 [2024-07-23 22:21:43.647507][INFO][test_dist:274] - iter=8, loss=727.854, sps=3.411e+04, dt=0.00187638, dtf=0.0005814, dtb=0.001295 [2024-07-23 22:21:43.650561][INFO][test_dist:274] - iter=9, loss=725.773, sps=3.546e+04, dt=0.00180485, dtf=0.0005975, dtb=0.001207 [2024-07-23 22:21:43.653520][INFO][test_dist:274] - iter=10, loss=720.374, sps=3.785e+04, dt=0.00169078, dtf=0.0006108, dtb=0.00108 [2024-07-23 22:21:43.656564][INFO][test_dist:274] - iter=11, loss=717.926, sps=3.602e+04, dt=0.00177678, dtf=0.0005694, dtb=0.001207 [2024-07-23 22:21:43.659555][INFO][test_dist:274] - iter=12, loss=692.535, sps=3.682e+04, dt=0.00173814, dtf=0.0005557, dtb=0.001182 [2024-07-23 22:21:43.662596][INFO][test_dist:274] - iter=13, loss=679.509, sps=3.67e+04, dt=0.00174377, dtf=0.0005531, dtb=0.001191 [2024-07-23 22:21:43.665598][INFO][test_dist:274] - iter=14, loss=674.778, sps=3.727e+04, dt=0.00171741, dtf=0.0005496, dtb=0.001168 [2024-07-23 22:21:43.668593][INFO][test_dist:274] - iter=15, loss=673.873, sps=3.666e+04, dt=0.00174556, dtf=0.0005708, dtb=0.001175 [2024-07-23 22:21:43.671589][INFO][test_dist:274] - iter=16, loss=667.283, sps=3.694e+04, dt=0.00173238, dtf=0.0005453, dtb=0.001187 [2024-07-23 22:21:43.674599][INFO][test_dist:274] - iter=17, loss=660.292, sps=3.646e+04, dt=0.00175558, dtf=0.0005538, dtb=0.001202 [2024-07-23 22:21:43.677592][INFO][test_dist:274] - iter=18, loss=660.664, sps=3.696e+04, dt=0.00173169, dtf=0.0005441, dtb=0.001188 [2024-07-23 22:21:43.680559][INFO][test_dist:274] - iter=19, loss=676.161, sps=3.709e+04, dt=0.00172556, dtf=0.0005668, dtb=0.001159 [2024-07-23 22:21:43.683539][INFO][test_dist:274] - iter=20, loss=665.099, sps=3.702e+04, dt=0.0017287, dtf=0.0005281, dtb=0.001201 [2024-07-23 22:21:43.686527][INFO][test_dist:274] - iter=21, loss=626.671, sps=3.7e+04, dt=0.00172989, dtf=0.0005279, dtb=0.001202 [2024-07-23 22:21:43.689518][INFO][test_dist:274] - iter=22, loss=632.127, sps=3.702e+04, dt=0.00172883, dtf=0.0005085, dtb=0.00122 [2024-07-23 22:21:43.692469][INFO][test_dist:274] - iter=23, loss=657.324, sps=3.755e+04, dt=0.00170436, dtf=0.0005164, dtb=0.001188 [2024-07-23 22:21:43.695563][INFO][test_dist:274] - iter=24, loss=617.646, sps=3.558e+04, dt=0.00179856, dtf=0.0005767, dtb=0.001222 [2024-07-23 22:21:43.698537][INFO][test_dist:274] - iter=25, loss=618.284, sps=3.705e+04, dt=0.00172744, dtf=0.0005522, dtb=0.001175 [2024-07-23 22:21:43.701410][INFO][test_dist:274] - iter=26, loss=615.418, sps=3.961e+04, dt=0.00161577, dtf=0.0005298, dtb=0.001086 [2024-07-23 22:21:43.704427][INFO][test_dist:274] - iter=27, loss=599.058, sps=3.648e+04, dt=0.00175461, dtf=0.0005156, dtb=0.001239 [2024-07-23 22:21:43.707374][INFO][test_dist:274] - iter=28, loss=621.717, sps=3.778e+04, dt=0.00169387, dtf=0.0004899, dtb=0.001204 [2024-07-23 22:21:43.710390][INFO][test_dist:274] - iter=29, loss=597.588, sps=3.623e+04, dt=0.00176654, dtf=0.0005663, dtb=0.0012 [2024-07-23 22:21:43.713386][INFO][test_dist:274] - iter=30, loss=598.102, sps=3.71e+04, dt=0.00172484, dtf=0.0005497, dtb=0.001175 [2024-07-23 22:21:43.716530][INFO][test_dist:274] - iter=31, loss=586.188, sps=3.357e+04, dt=0.00190664, dtf=0.0005618, dtb=0.001345 [2024-07-23 22:21:43.719525][INFO][test_dist:274] - iter=32, loss=591.646, sps=3.672e+04, dt=0.00174293, dtf=0.000561, dtb=0.001182 [2024-07-23 22:21:43.722513][INFO][test_dist:274] - iter=33, loss=574.161, sps=3.668e+04, dt=0.00174487, dtf=0.0005502, dtb=0.001195 [2024-07-23 22:21:43.725524][INFO][test_dist:274] - iter=34, loss=586.41, sps=3.707e+04, dt=0.00172628, dtf=0.0005552, dtb=0.001171 [2024-07-23 22:21:43.728594][INFO][test_dist:274] - iter=35, loss=574.43, sps=3.605e+04, dt=0.00177526, dtf=0.000576, dtb=0.001199 [2024-07-23 22:21:43.731615][INFO][test_dist:274] - iter=36, loss=552.77, sps=3.642e+04, dt=0.00175741, dtf=0.0005588, dtb=0.001199 [2024-07-23 22:21:43.734574][INFO][test_dist:274] - iter=37, loss=567.612, sps=3.748e+04, dt=0.00170768, dtf=0.0005318, dtb=0.001176 [2024-07-23 22:21:43.737564][INFO][test_dist:274] - iter=38, loss=561.004, sps=3.706e+04, dt=0.00172686, dtf=0.0005489, dtb=0.001178 [2024-07-23 22:21:43.740578][INFO][test_dist:274] - iter=39, loss=555.718, sps=3.645e+04, dt=0.00175567, dtf=0.0005662, dtb=0.001189 [2024-07-23 22:21:43.743565][INFO][test_dist:274] - iter=40, loss=543.661, sps=3.708e+04, dt=0.00172613, dtf=0.0005363, dtb=0.00119 [2024-07-23 22:21:43.746561][INFO][test_dist:274] - iter=41, loss=537.186, sps=3.691e+04, dt=0.00173373, dtf=0.0005346, dtb=0.001199 [2024-07-23 22:21:43.749446][INFO][test_dist:274] - iter=42, loss=545.877, sps=3.998e+04, dt=0.00160083, dtf=0.000533, dtb=0.001068 [2024-07-23 22:21:43.752446][INFO][test_dist:274] - iter=43, loss=546.533, sps=3.681e+04, dt=0.00173875, dtf=0.0005124, dtb=0.001226 [2024-07-23 22:21:43.755384][INFO][test_dist:274] - iter=44, loss=545.989, sps=3.796e+04, dt=0.001686, dtf=0.0005054, dtb=0.001181 [2024-07-23 22:21:43.758389][INFO][test_dist:274] - iter=45, loss=531.344, sps=3.667e+04, dt=0.00174516, dtf=0.0005569, dtb=0.001188 [2024-07-23 22:21:43.761470][INFO][test_dist:274] - iter=46, loss=515.415, sps=3.69e+04, dt=0.00173432, dtf=0.000551, dtb=0.001183 [2024-07-23 22:21:43.764494][INFO][test_dist:274] - iter=47, loss=523.498, sps=3.634e+04, dt=0.00176121, dtf=0.0005524, dtb=0.001209 [2024-07-23 22:21:43.767522][INFO][test_dist:274] - iter=48, loss=515.942, sps=3.625e+04, dt=0.00176562, dtf=0.0005655, dtb=0.0012 [2024-07-23 22:21:43.770555][INFO][test_dist:274] - iter=49, loss=527.433, sps=3.62e+04, dt=0.00176783, dtf=0.0005579, dtb=0.00121 [2024-07-23 22:21:43.773467][INFO][test_dist:274] - iter=50, loss=520.038, sps=3.938e+04, dt=0.00162521, dtf=0.0005579, dtb=0.001067 [2024-07-23 22:21:43.776470][INFO][test_dist:274] - iter=51, loss=507.743, sps=3.68e+04, dt=0.00173934, dtf=0.0005378, dtb=0.001202 [2024-07-23 22:21:43.779466][INFO][test_dist:274] - iter=52, loss=505.372, sps=3.694e+04, dt=0.00173268, dtf=0.0005321, dtb=0.001201 [2024-07-23 22:21:43.782434][INFO][test_dist:274] - iter=53, loss=505.824, sps=3.736e+04, dt=0.00171324, dtf=0.0005403, dtb=0.001173 [2024-07-23 22:21:43.785426][INFO][test_dist:274] - iter=54, loss=498.697, sps=3.751e+04, dt=0.00170619, dtf=0.0005259, dtb=0.00118 [2024-07-23 22:21:43.788396][INFO][test_dist:274] - iter=55, loss=492.434, sps=3.719e+04, dt=0.00172085, dtf=0.0005036, dtb=0.001217 [2024-07-23 22:21:43.791354][INFO][test_dist:274] - iter=56, loss=486.032, sps=3.754e+04, dt=0.00170497, dtf=0.0005077, dtb=0.001197 [2024-07-23 22:21:43.794333][INFO][test_dist:274] - iter=57, loss=487.687, sps=3.803e+04, dt=0.00168299, dtf=0.0005009, dtb=0.001182 [2024-07-23 22:21:43.797238][INFO][test_dist:274] - iter=58, loss=481.011, sps=3.929e+04, dt=0.00162898, dtf=0.0005554, dtb=0.001074 [2024-07-23 22:21:43.800237][INFO][test_dist:274] - iter=59, loss=478.058, sps=3.692e+04, dt=0.00173365, dtf=0.0005374, dtb=0.001196 [2024-07-23 22:21:43.803250][INFO][test_dist:274] - iter=60, loss=476.983, sps=3.666e+04, dt=0.00174587, dtf=0.0005318, dtb=0.001214 [2024-07-23 22:21:43.806222][INFO][test_dist:274] - iter=61, loss=468.415, sps=3.716e+04, dt=0.00172234, dtf=0.0005256, dtb=0.001197 [2024-07-23 22:21:43.809230][INFO][test_dist:274] - iter=62, loss=461.661, sps=3.727e+04, dt=0.00171737, dtf=0.0005219, dtb=0.001195 [2024-07-23 22:21:43.812204][INFO][test_dist:274] - iter=63, loss=465.746, sps=3.688e+04, dt=0.00173519, dtf=0.0005067, dtb=0.001228 [2024-07-23 22:21:43.815192][INFO][test_dist:274] - iter=64, loss=470.95, sps=3.724e+04, dt=0.00171855, dtf=0.0004994, dtb=0.001219 [2024-07-23 22:21:43.818155][INFO][test_dist:274] - iter=65, loss=463.301, sps=3.774e+04, dt=0.00169586, dtf=0.0005053, dtb=0.001191 [2024-07-23 22:21:43.821161][INFO][test_dist:274] - iter=66, loss=450.195, sps=3.68e+04, dt=0.00173904, dtf=0.0005626, dtb=0.001176 [2024-07-23 22:21:43.824143][INFO][test_dist:274] - iter=67, loss=449.097, sps=3.662e+04, dt=0.00174746, dtf=0.0005578, dtb=0.00119 [2024-07-23 22:21:43.827103][INFO][test_dist:274] - iter=68, loss=447.465, sps=3.778e+04, dt=0.00169412, dtf=0.0005488, dtb=0.001145 [2024-07-23 22:21:43.830071][INFO][test_dist:274] - iter=69, loss=444.676, sps=3.835e+04, dt=0.00166873, dtf=0.0005467, dtb=0.001122 [2024-07-23 22:21:43.833030][INFO][test_dist:274] - iter=70, loss=429.532, sps=3.83e+04, dt=0.00167122, dtf=0.0005362, dtb=0.001135 [2024-07-23 22:21:43.836024][INFO][test_dist:274] - iter=71, loss=437.085, sps=3.711e+04, dt=0.00172438, dtf=0.0005086, dtb=0.001216 [2024-07-23 22:21:43.839009][INFO][test_dist:274] - iter=72, loss=436.272, sps=3.71e+04, dt=0.00172525, dtf=0.0005177, dtb=0.001208 [2024-07-23 22:21:43.841920][INFO][test_dist:274] - iter=73, loss=430.464, sps=3.893e+04, dt=0.00164403, dtf=0.0004874, dtb=0.001157 [2024-07-23 22:21:43.844806][INFO][test_dist:274] - iter=74, loss=426.483, sps=3.904e+04, dt=0.0016393, dtf=0.000449, dtb=0.00119 [2024-07-23 22:21:43.847771][INFO][test_dist:274] - iter=75, loss=413.371, sps=3.75e+04, dt=0.0017066, dtf=0.0005185, dtb=0.001188 [2024-07-23 22:21:43.850712][INFO][test_dist:274] - iter=76, loss=421.381, sps=3.77e+04, dt=0.00169769, dtf=0.000506, dtb=0.001192 [2024-07-23 22:21:43.853587][INFO][test_dist:274] - iter=77, loss=415.112, sps=3.988e+04, dt=0.0016047, dtf=0.000537, dtb=0.001068 [2024-07-23 22:21:43.856557][INFO][test_dist:274] - iter=78, loss=413.084, sps=3.729e+04, dt=0.0017161, dtf=0.0005459, dtb=0.00117 [2024-07-23 22:21:43.859518][INFO][test_dist:274] - iter=79, loss=412.671, sps=3.761e+04, dt=0.00170149, dtf=0.0005066, dtb=0.001195 [2024-07-23 22:21:43.862469][INFO][test_dist:274] - iter=80, loss=408.688, sps=3.776e+04, dt=0.00169481, dtf=0.0005446, dtb=0.00115 [2024-07-23 22:21:43.865521][INFO][test_dist:274] - iter=81, loss=400.914, sps=3.674e+04, dt=0.00174196, dtf=0.0005528, dtb=0.001189 [2024-07-23 22:21:43.868536][INFO][test_dist:274] - iter=82, loss=389.823, sps=3.655e+04, dt=0.00175112, dtf=0.000574, dtb=0.001177 [2024-07-23 22:21:43.871531][INFO][test_dist:274] - iter=83, loss=399.073, sps=3.686e+04, dt=0.00173618, dtf=0.0005504, dtb=0.001186 [2024-07-23 22:21:43.874511][INFO][test_dist:274] - iter=84, loss=385.773, sps=3.725e+04, dt=0.00171814, dtf=0.0005499, dtb=0.001168 [2024-07-23 22:21:43.877492][INFO][test_dist:274] - iter=85, loss=400.61, sps=3.739e+04, dt=0.00171182, dtf=0.000546, dtb=0.001166 [2024-07-23 22:21:43.880505][INFO][test_dist:274] - iter=86, loss=389.813, sps=3.673e+04, dt=0.00174226, dtf=0.0005734, dtb=0.001169 [2024-07-23 22:21:43.883515][INFO][test_dist:274] - iter=87, loss=385.995, sps=3.694e+04, dt=0.00173256, dtf=0.0005296, dtb=0.001203 [2024-07-23 22:21:43.886470][INFO][test_dist:274] - iter=88, loss=379.115, sps=3.774e+04, dt=0.00169591, dtf=0.0005467, dtb=0.001149 [2024-07-23 22:21:43.889422][INFO][test_dist:274] - iter=89, loss=378.738, sps=3.798e+04, dt=0.00168494, dtf=0.0005276, dtb=0.001157 [2024-07-23 22:21:43.892414][INFO][test_dist:274] - iter=90, loss=365.054, sps=3.675e+04, dt=0.00174164, dtf=0.000513, dtb=0.001229 [2024-07-23 22:21:43.895367][INFO][test_dist:274] - iter=91, loss=380.372, sps=3.772e+04, dt=0.00169654, dtf=0.000495, dtb=0.001201 [2024-07-23 22:21:43.898322][INFO][test_dist:274] - iter=92, loss=377.233, sps=3.852e+04, dt=0.00166155, dtf=0.000539, dtb=0.001123 [2024-07-23 22:21:43.901288][INFO][test_dist:274] - iter=93, loss=366.226, sps=3.788e+04, dt=0.00168959, dtf=0.0005446, dtb=0.001145 [2024-07-23 22:21:43.904284][INFO][test_dist:274] - iter=94, loss=366.221, sps=3.69e+04, dt=0.00173462, dtf=0.0005535, dtb=0.001181 [2024-07-23 22:21:43.907289][INFO][test_dist:274] - iter=95, loss=366.673, sps=3.662e+04, dt=0.00174759, dtf=0.0005328, dtb=0.001215 [2024-07-23 22:21:43.910260][INFO][test_dist:274] - iter=96, loss=362.985, sps=3.716e+04, dt=0.00172234, dtf=0.0005436, dtb=0.001179 [2024-07-23 22:21:43.913277][INFO][test_dist:274] - iter=97, loss=349.768, sps=3.668e+04, dt=0.00174469, dtf=0.000529, dtb=0.001216 [2024-07-23 22:21:43.916293][INFO][test_dist:274] - iter=98, loss=363.521, sps=3.675e+04, dt=0.0017416, dtf=0.0005412, dtb=0.0012 [2024-07-23 22:21:43.919280][INFO][test_dist:274] - iter=99, loss=345.533, sps=3.717e+04, dt=0.00172205, dtf=0.0005134, dtb=0.001209 train/dt [2024-07-23-222143] โ 0.00284โคโ โ โ 0.00264โค โ โ 0.00243โค โ 0.00222โค โ โ 0.00201โค โโ โ โโโโโ โ 0.00181โค โ โ โ โ โโ โโโโโโ โโ โ โโโโโโโโโโโโโ โ โโ โ โ โ โโโ โโ โ โ โโ โ โ โ โ โ โ โ โ โโโโ โโโโ โโโ โโโโโ โโ โโโโโ โ โ 0.00160โค โ โ โ โ โ โ โโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโฌโ 1.0 25.5 50.0 74.5 99.0 train/dt iter [2024-07-23 22:21:43.943086][INFO][plot:156] - Appending plot to: /lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/tmp/2024-07-23-221253/Megatron-DeepSpeed/test-dist-plots/train/dt.txt text saved in /lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/tmp/2024-07-23-221253/Megatron-DeepSpeed/test-dist-plots/train/dt.txt train/dtf [2024-07-23-222143] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 0.00098โคโ โ โ โ โ โ 0.00089โค โ โ โ โ โ 0.00080โค โ โ โ 0.00071โค โ โ โ โ โ 0.00063โค โ โ โโโ โ โ โ โโ โ โ โ โ โ โ โ โ โ โ โ 0.00054โค โโ โโ โ โโโ โโ โโ โโโโโโโ โ โโโ โโ โ โโ โ โโโโ โโ โโ โโโโ โ โ โโ โ โโ โโโ โ โโ โโ โ โโโ โโโ โ โ โ โ โ โ โ 0.00045โค โ โ โโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโฌโ 1.0 25.5 50.0 74.5 99.0 train/dtf iter [2024-07-23 22:21:43.952631][INFO][plot:156] - Appending plot to: /lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/tmp/2024-07-23-221253/Megatron-DeepSpeed/test-dist-plots/train/dtf.txt text saved in /lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/tmp/2024-07-23-221253/Megatron-DeepSpeed/test-dist-plots/train/dtf.txt train/dtb [2024-07-23-222143] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 0.00186โคโ โ โ โ โ โ 0.00173โค โ โ โ โ โ 0.00160โค โ โ โ 0.00146โค โ โโ โ โ โ 0.00133โค โ โ โ โโ โโ โ โ โ โ โ 0.00120โค โโ โ โโโโ โโ โโโ โโ โโ โ โ โโโโโ โโ โ โ โ โ โโโ โ โโโโโโ โโ โโโ โโโ โโ โโ โ โโ โโโโ โโโโโโโ โโ โ โ โโ โโ โ 0.00107โค โ โ โ โ โ โ โ โโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโฌโ 1.0 25.5 50.0 74.5 99.0 train/dtb iter [2024-07-23 22:21:43.962230][INFO][plot:156] - Appending plot to: /lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/tmp/2024-07-23-221253/Megatron-DeepSpeed/test-dist-plots/train/dtb.txt text saved in /lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/tmp/2024-07-23-221253/Megatron-DeepSpeed/test-dist-plots/train/dtb.txt train/loss [2024-07-23-222143] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 2039.9โคโ โ โ โ โ โ 1757.5โค โ โ โ โ โ 1475.1โคโ โ โ โ 1192.7โค โ โ โ โ โ โ 910.3โค โ โ โ โ โ โ โโโโโ โ 627.9โค โโโโโโโโโโโโ โ โ โโโโโโโโโโโโโโโโโโ โ โ โโโโโโโโโโโโโโโโโโโ โ 345.5โค โโโโโโโโโโโโโโโโโโโ โโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโฌโ 1.0 25.5 50.0 74.5 99.0 train/loss iter [2024-07-23 22:21:44.011096][INFO][plot:156] - Appending plot to: /lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/tmp/2024-07-23-221253/Megatron-DeepSpeed/test-dist-plots/train/loss.txt text saved in /lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/tmp/2024-07-23-221253/Megatron-DeepSpeed/test-dist-plots/train/loss.txt train/iter [2024-07-23-222144] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 99.0โค โโโโโ โ โโโโ โ โ โโโโโ โ 82.7โค โโโโ โ โ โโโโ โ โ โโโโโ โ 66.3โค โโโโ โ โ โโโโโ โ 50.0โค โโโโ โ โ โโโโ โ โ โโโโโ โ 33.7โค โโโโ โ โ โโโโโ โ โ โโโโ โ 17.3โค โโโโ โ โ โโโโโ โ โ โโโโ โ 1.0โคโโโโ โ โโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโฌโ 1.0 25.5 50.0 74.5 99.0 train/iter iter [2024-07-23 22:21:44.021040][INFO][plot:156] - Appending plot to: /lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/tmp/2024-07-23-221253/Megatron-DeepSpeed/test-dist-plots/train/iter.txt text saved in /lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/tmp/2024-07-23-221253/Megatron-DeepSpeed/test-dist-plots/train/iter.txt train/sps [2024-07-23-222144] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 39979.2โค โ โ โ โ โ โ โ โ โ โ โ โ โ โ โ โ โ โ โโโโ โโโ โ โโโโโ โโ โโโ โ โ 37069.3โค โ โโโโโโ โโ โโโ โโโโโโโโโโ โ โ โโโโ โ โโ โโโ โโ โ โโโโโ โ โโ โ โ โ โ โ โ โ โ 34159.4โค โโโโโ โ โ โโ โ 31249.4โค โ โ โ โ โ 28339.5โค โ โ โ โ โ 25429.6โค โ โ โ โ โ 22519.7โคโ โ โโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโฌโ 1.0 25.5 50.0 74.5 99.0 train/sps iter [2024-07-23 22:21:44.030585][INFO][plot:156] - Appending plot to: /lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/tmp/2024-07-23-221253/Megatron-DeepSpeed/test-dist-plots/train/sps.txt text saved in /lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/tmp/2024-07-23-221253/Megatron-DeepSpeed/test-dist-plots/train/sps.txt
PyInstrument Profile:
Recorded: 22:21:41 Samples: 2223 Duration: 2.668 CPU time: 2.406 v4.6.2 Program: /lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/tmp/2024-07-23-221253/Megatron-DeepSpeed/deps/ezpz/src/ezpz/test_dist.py 2.668 <module> ezpz/test_dist.py:1 โโ 2.667 main ezpz/test_dist.py:217 โโ 2.106 build_model_and_optimizer ezpz/test_dist.py:171 โ โโ 2.092 Adam.__init__ torch/optim/adam.py:15 โ [147 frames hidden] torch, transformers, jax, huggingface... โโ 0.183 _forward_step ezpz/test_dist.py:231 โ โโ 0.137 DistributedDataParallel._wrapped_call_impl torch/nn/modules/module.py:1528 โ โ [6 frames hidden] torch โ โ 0.123 Network._call_impl torch/nn/modules/module.py:1534 โ โ โโ 0.123 Network.forward ezpz/test_dist.py:164 โ โ โโ 0.123 Sequential._wrapped_call_impl torch/nn/modules/module.py:1528 โ โ [7 frames hidden] torch, <built-in> โ โโ 0.046 calc_loss ezpz/test_dist.py:168 โโ 0.164 _backward_step ezpz/test_dist.py:236 โ โโ 0.103 wrapper torch/optim/optimizer.py:374 โ โ [5 frames hidden] torch โ โโ 0.060 Tensor.backward torch/_tensor.py:466 โ [4 frames hidden] torch, <built-in> โโ 0.113 tplot_dict ezpz/plot.py:136 โ โโ 0.082 show plotext/_core.py:292 โ [5 frames hidden] plotext โโ 0.099 Logger.info logging/__init__.py:1479 [6 frames hidden] logging, rich 0.099 RichHandler.emit rich/logging.py:126 โโ 0.099 Console.print ezpz/log/console.py:79 โโ 0.099 Console.print rich/console.py:1624 [5 frames hidden] rich [2024-07-23 22:21:44.231519][INFO][profile:115] - Saving pyinstrument profile output to: /lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/tmp/2024-07-23-221253/Megatron-DeepSpeed/ezpz_pyinstrument_profiles [2024-07-23 22:21:44.232054][INFO][profile:123] - PyInstrument profile saved (as html) to: /lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/tmp/2024-07-23-221253/Megatron-DeepSpeed/ezpz_pyinstrument_profiles/pyinstrument-profile-2024-07-23-222144.html [2024-07-23 22:21:44.232619][INFO][profile:131] - PyInstrument profile saved (as text) to: /lus/eagle/projects/argonne_tpc/foremans/projects/argonne-lcf/tmp/2024-07-23-221253/Megatron-DeepSpeed/ezpz_pyinstrument_profiles/pyinstrument-profile-2024-07-23-222144.txt [2024-07-23 22:21:44.761876][INFO][profile:143] - Finished with pyinstrument profiler. Took: 2.66778s [2024-07-23 22:21:44.762534][INFO][test_dist:318] - [0] runtime=6.785542s ezpz-test-dist.log lines 216-359/359 (END)
Deprecated:
Download and source
ezpz/bin/utils.sh
:Use
ezpz_setup_python
to:#[๐][12:45:49 PM][foremans@x3006c0s13b0n0][~/tmp/foremans/2024-07-15-124441] $ PBS_O_WORKDIR=$(pwd) ezpz_utils && ezpz_setup_python && ezpz_setup_alcf Using WORKING_DIR: /home/foremans/tmp/foremans/2024-07-15-124441 No conda_prefix OR virtual_env found in environment... Setting up conda... Lmod is automatically replacing "nvhpc/23.9" with "gcc-native/12.3". Lmod is automatically replacing "PrgEnv-nvhpc/8.5.0" with "PrgEnv-gnu/8.5.0". Due to MODULEPATH changes, the following have been reloaded: 1) cray-mpich/8.1.28 Found conda at: /soft/applications/conda/2024-04-29/mconda3 No VIRTUAL_ENV found in environment! - Trying to setup from /soft/applications/conda/2024-04-29/mconda3 - Using VENV_DIR=/home/foremans/tmp/foremans/2024-07-15-124441/venvs/2024-04-29 - Creating a new virtual env on top of 2024-04-29 in /home/foremans/tmp/foremans/2024-07-15-124441/venvs/2024-04-29 [python] Using /home/foremans/tmp/foremans/2024-07-15-124441/venvs/2024-04-29/bin/python3 [ezpz/bin/utils.sh] [2024-07-15-124600] โข USER=foremans โข MACHINE=polaris โข HOST=x3006c0s13b0n0 [ezpz_setup_host] โข Using hostfile: /var/spool/pbs/aux/2021158.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov โข Found in environment: โข HOSTFILE: /var/spool/pbs/aux/2021158.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov โข Writing PBS vars to: /home/foremans/.pbsenv [ezpz_save_pbs_env] โข Setting: โข HOSTFILE: /var/spool/pbs/aux/2021158.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov โข JOBENV_FILE: /home/foremans/.pbsenv [HOSTS] โข [host:0] - x3006c0s13b0n0.hsn.cm.polaris.alcf.anl.gov [DIST INFO] โข HOSTFILE=/var/spool/pbs/aux/2021158.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov โข NHOSTS=1 โข NGPU_PER_HOST=4 โข NGPUS=4 โข DIST_LAUNCH=mpiexec --verbose --envall -n 4 -ppn 4 --hostfile /var/spool/pbs/aux/2021158.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov --cpu-bind depth -d 16 [LAUNCH]: โข To launch across all available GPUs, use: launch launch = mpiexec --verbose --envall -n 4 -ppn 4 --hostfile /var/spool/pbs/aux/2021158.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov --cpu-bind depth -d 16
Install
ezpz
into the virtual environment from 2.output
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com Obtaining ezpz from git+https://github.com/saforem2/ezpz#egg=ezpz Cloning https://github.com/saforem2/ezpz to ./venvs/2024-04-29/src/ezpz Running command git clone --filter=blob:none --quiet https://github.com/saforem2/ezpz /home/foremans/tmp/foremans/2024-07-15-124441/venvs/2024-04-29/src/ezpz Resolved https://github.com/saforem2/ezpz to commit d8fabca03038db55a1dc490f801581e980f93a25 Installing build dependencies ... done Checking if build backend supports build_editable ... done Getting requirements to build editable ... done Installing backend dependencies ... done Preparing editable metadata (pyproject.toml) ... done Requirement already satisfied: ambivalent in /home/foremans/.local/polaris/conda/2024-04-29/lib/python3.11/site-packages (from ezpz) (0.0.1) Requirement already satisfied: hydra-colorlog in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (1.2.0) Requirement already satisfied: hydra-core in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (1.3.2) Requirement already satisfied: ipython in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (8.24.0) Requirement already satisfied: jax in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (0.4.26) Requirement already satisfied: jaxlib in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (0.4.26+cuda12.cudnn89) Requirement already satisfied: jaxtyping in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (0.2.28) Requirement already satisfied: joblib in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (1.4.0) Requirement already satisfied: ml-dtypes in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (0.3.2) Requirement already satisfied: mpi4py in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (3.1.6) Requirement already satisfied: omegaconf in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (2.3.0) Requirement already satisfied: plotext in /home/foremans/.local/polaris/conda/2024-04-29/lib/python3.11/site-packages (from ezpz) (5.2.8) Collecting pyinstrument (from ezpz) Downloading pyinstrument-4.6.2-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (21 kB) Requirement already satisfied: rich in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (13.7.1) Requirement already satisfied: seaborn in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (0.13.2) Requirement already satisfied: sentencepiece in /home/foremans/.local/polaris/conda/2024-04-29/lib/python3.11/site-packages (from ezpz) (0.2.0) Requirement already satisfied: sh in /home/foremans/.local/polaris/conda/2024-04-29/lib/python3.11/site-packages (from ezpz) (2.0.6) Requirement already satisfied: tensorboard in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (2.16.2) Requirement already satisfied: torch in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (2.3.0) Requirement already satisfied: tqdm in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (4.65.0) Requirement already satisfied: wandb in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (0.16.6) Requirement already satisfied: xarray in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (2024.3.0) Requirement already satisfied: colormaps in /home/foremans/.local/polaris/conda/2024-04-29/lib/python3.11/site-packages (from ambivalent->ezpz) (0.4.1) Requirement already satisfied: matplotlib in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ambivalent->ezpz) (3.8.4) Requirement already satisfied: requests in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ambivalent->ezpz) (2.31.0) Requirement already satisfied: colorlog in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from hydra-colorlog->ezpz) (6.8.2) Requirement already satisfied: antlr4-python3-runtime==4.9.* in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from hydra-core->ezpz) (4.9.3) Requirement already satisfied: packaging in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from hydra-core->ezpz) (24.0) Requirement already satisfied: PyYAML>=5.1.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from omegaconf->ezpz) (6.0.1) Requirement already satisfied: decorator in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ipython->ezpz) (5.1.1) Requirement already satisfied: jedi>=0.16 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ipython->ezpz) (0.19.1) Requirement already satisfied: matplotlib-inline in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ipython->ezpz) (0.1.7) Requirement already satisfied: prompt-toolkit<3.1.0,>=3.0.41 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ipython->ezpz) (3.0.43) Requirement already satisfied: pygments>=2.4.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ipython->ezpz) (2.17.2) Requirement already satisfied: stack-data in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ipython->ezpz) (0.6.3) Requirement already satisfied: traitlets>=5.13.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ipython->ezpz) (5.14.3) Requirement already satisfied: typing-extensions>=4.6 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ipython->ezpz) (4.11.0) Requirement already satisfied: pexpect>4.3 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ipython->ezpz) (4.9.0) Requirement already satisfied: numpy>=1.22 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from jax->ezpz) (1.26.4) Requirement already satisfied: opt-einsum in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from jax->ezpz) (3.3.0) Requirement already satisfied: scipy>=1.9 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from jax->ezpz) (1.13.0) Requirement already satisfied: typeguard==2.13.3 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from jaxtyping->ezpz) (2.13.3) Requirement already satisfied: markdown-it-py>=2.2.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from rich->ezpz) (3.0.0) Requirement already satisfied: pandas>=1.2 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from seaborn->ezpz) (2.2.2) Requirement already satisfied: absl-py>=0.4 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from tensorboard->ezpz) (2.1.0) Requirement already satisfied: grpcio>=1.48.2 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from tensorboard->ezpz) (1.62.2) Requirement already satisfied: markdown>=2.6.8 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from tensorboard->ezpz) (3.6) Requirement already satisfied: protobuf!=4.24.0,>=3.19.6 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from tensorboard->ezpz) (3.20.3) Requirement already satisfied: setuptools>=41.0.0 in ./venvs/2024-04-29/lib/python3.11/site-packages (from tensorboard->ezpz) (65.5.0) Requirement already satisfied: six>1.9 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from tensorboard->ezpz) (1.16.0) Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from tensorboard->ezpz) (0.7.2) Requirement already satisfied: werkzeug>=1.0.1 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from tensorboard->ezpz) (3.0.2) Requirement already satisfied: filelock in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from torch->ezpz) (3.13.1) Requirement already satisfied: sympy in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from torch->ezpz) (1.12) Requirement already satisfied: networkx in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from torch->ezpz) (3.3) Requirement already satisfied: jinja2 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from torch->ezpz) (3.0.3) Requirement already satisfied: fsspec in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from torch->ezpz) (2024.3.1) Requirement already satisfied: Click!=8.0.0,>=7.1 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from wandb->ezpz) (8.1.7) Requirement already satisfied: GitPython!=3.1.29,>=1.0.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from wandb->ezpz) (3.1.43) Requirement already satisfied: psutil>=5.0.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from wandb->ezpz) (5.9.8) Requirement already satisfied: sentry-sdk>=1.0.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from wandb->ezpz) (2.0.1) Requirement already satisfied: docker-pycreds>=0.4.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from wandb->ezpz) (0.4.0) Requirement already satisfied: setproctitle in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from wandb->ezpz) (1.3.3) Requirement already satisfied: appdirs>=1.4.3 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from wandb->ezpz) (1.4.4) Requirement already satisfied: gitdb<5,>=4.0.1 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from GitPython!=3.1.29,>=1.0.0->wandb->ezpz) (4.0.11) Requirement already satisfied: parso<0.9.0,>=0.8.3 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from jedi>=0.16->ipython->ezpz) (0.8.4) Requirement already satisfied: mdurl~=0.1 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from markdown-it-py>=2.2.0->rich->ezpz) (0.1.2) Requirement already satisfied: contourpy>=1.0.1 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from matplotlib->ambivalent->ezpz) (1.2.1) Requirement already satisfied: cycler>=0.10 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from matplotlib->ambivalent->ezpz) (0.12.1) Requirement already satisfied: fonttools>=4.22.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from matplotlib->ambivalent->ezpz) (4.51.0) Requirement already satisfied: kiwisolver>=1.3.1 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from matplotlib->ambivalent->ezpz) (1.4.5) Requirement already satisfied: pillow>=8 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from matplotlib->ambivalent->ezpz) (10.3.0) Requirement already satisfied: pyparsing>=2.3.1 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from matplotlib->ambivalent->ezpz) (3.1.2) Requirement already satisfied: python-dateutil>=2.7 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from matplotlib->ambivalent->ezpz) (2.9.0.post0) Requirement already satisfied: pytz>=2020.1 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from pandas>=1.2->seaborn->ezpz) (2024.1) Requirement already satisfied: tzdata>=2022.7 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from pandas>=1.2->seaborn->ezpz) (2024.1) Requirement already satisfied: ptyprocess>=0.5 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from pexpect>4.3->ipython->ezpz) (0.7.0) Requirement already satisfied: wcwidth in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from prompt-toolkit<3.1.0,>=3.0.41->ipython->ezpz) (0.2.13) Requirement already satisfied: charset-normalizer<4,>=2 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from requests->ambivalent->ezpz) (2.0.4) Requirement already satisfied: idna<4,>=2.5 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from requests->ambivalent->ezpz) (3.4) Requirement already satisfied: urllib3<3,>=1.21.1 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from requests->ambivalent->ezpz) (2.1.0) Requirement already satisfied: certifi>=2017.4.17 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from requests->ambivalent->ezpz) (2024.2.2) Requirement already satisfied: MarkupSafe>=2.1.1 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from werkzeug>=1.0.1->tensorboard->ezpz) (2.1.3) Requirement already satisfied: executing>=1.2.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from stack-data->ipython->ezpz) (2.0.1) Requirement already satisfied: asttokens>=2.1.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from stack-data->ipython->ezpz) (2.4.1) Requirement already satisfied: pure-eval in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from stack-data->ipython->ezpz) (0.2.2) Requirement already satisfied: mpmath>=0.19 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from sympy->torch->ezpz) (1.3.0) Requirement already satisfied: smmap<6,>=3.0.1 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from gitdb<5,>=4.0.1->GitPython!=3.1.29,>=1.0.0->wandb->ezpz) (5.0.1) Downloading pyinstrument-4.6.2-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (104 kB) โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 104.9/104.9 kB 3.1 MB/s eta 0:00:00 Building wheels for collected packages: ezpz Building editable for ezpz (pyproject.toml) ... done Created wheel for ezpz: filename=ezpz-0.1-py3-none-any.whl size=10104 sha256=f73fbc552c6192f2d1575c08528267c1c70bd4ed2eebab011c692b4cf66fd9cb Stored in directory: /tmp/pip-ephem-wheel-cache-6xb0tqk8/wheels/b3/57/90/f3324177d75cbc607a034b5b8e66d5b3d35dcf087967430718 Successfully built ezpz Installing collected packages: pyinstrument, ezpz Attempting uninstall: ezpz Found existing installation: ezpz 0.1 Not uninstalling ezpz at /home/foremans/.local/polaris/conda/2024-04-29/lib/python3.11/site-packages, outside environment /home/foremans/tmp/foremans/2024-07-15-124441/venvs/2024-04-29 Cant uninstall 'ezpz'. No files were found to uninstall. Successfully installed ezpz pyinstrument-4.6.2 [notice] A new release of pip is available: 24.0 -> 24.1.2 [notice] To update, run: pip install --upgrade pip 9.63s user 1.44s system 60% cpu 18.301s total
#[๐][12:44:34 PM][foremans@x3006c0s13b0n0][~/tmp]
$ dname=$USER/$(tstamp) ; mkdir -p $dname && cd $dname
#[๐][12:44:45 PM][foremans@x3006c0s13b0n0][~/tmp/foremans/2024-07-15-124441]
; ezpz_utils() { fp=$(mktemp) && curl -Ls https://raw.githubusercontent.com/saforem2/ezpz/main/src/ezpz/bin/utils.sh > $fp && source $fp || exit }
#[๐][12:44:47 PM][foremans@x3006c0s13b0n0][~/tmp/foremans/2024-07-15-124441]
$ PBS_O_WORKDIR=$(pwd) ezpz_utils
Using WORKING_DIR: /home/foremans/tmp/foremans/2024-07-15-124441
#[๐][12:45:49 PM][foremans@x3006c0s13b0n0][~/tmp/foremans/2024-07-15-124441]
$ PBS_O_WORKDIR=$(pwd) ezpz_utils && ezpz_setup_python && ezpz_setup_alcf
Using WORKING_DIR: /home/foremans/tmp/foremans/2024-07-15-124441
No conda_prefix OR virtual_env found in environment...
Setting up conda...
Lmod is automatically replacing "nvhpc/23.9" with "gcc-native/12.3".
Lmod is automatically replacing "PrgEnv-nvhpc/8.5.0" with "PrgEnv-gnu/8.5.0".
Due to MODULEPATH changes, the following have been reloaded:
1) cray-mpich/8.1.28
Found conda at: /soft/applications/conda/2024-04-29/mconda3
No VIRTUAL_ENV found in environment!
- Trying to setup from /soft/applications/conda/2024-04-29/mconda3
- Using VENV_DIR=/home/foremans/tmp/foremans/2024-07-15-124441/venvs/2024-04-29
- Creating a new virtual env on top of 2024-04-29 in /home/foremans/tmp/foremans/2024-07-15-124441/venvs/2024-04-29
[python] Using /home/foremans/tmp/foremans/2024-07-15-124441/venvs/2024-04-29/bin/python3
[ezpz/bin/utils.sh]
[2024-07-15-124600]
โข USER=foremans
โข MACHINE=polaris
โข HOST=x3006c0s13b0n0
[ezpz_setup_host]
โข Using hostfile: /var/spool/pbs/aux/2021158.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
โข Found in environment:
โข HOSTFILE: /var/spool/pbs/aux/2021158.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
โข Writing PBS vars to: /home/foremans/.pbsenv
[ezpz_save_pbs_env]
โข Setting:
โข HOSTFILE: /var/spool/pbs/aux/2021158.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
โข JOBENV_FILE: /home/foremans/.pbsenv
alias LAUNCH='mpiexec --verbose --envall -n 4 -ppn 4 --hostfile /var/spool/pbs/aux/2021158.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov --cpu-bind depth -d 16'
[HOSTS]
โข [host:0] - x3006c0s13b0n0.hsn.cm.polaris.alcf.anl.gov
[DIST INFO]
โข HOSTFILE=/var/spool/pbs/aux/2021158.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
โข NHOSTS=1
โข NGPU_PER_HOST=4
โข NGPUS=4
โข DIST_LAUNCH=mpiexec --verbose --envall -n 4 -ppn 4 --hostfile /var/spool/pbs/aux/2021158.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov --cpu-bind depth -d 16
[LAUNCH]:
โข To launch across all available GPUs, use: launch
launch = mpiexec --verbose --envall -n 4 -ppn 4 --hostfile /var/spool/pbs/aux/2021158.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov --cpu-bind depth -d 16
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Obtaining ezpz from git+https://github.com/saforem2/ezpz#egg=ezpz
Cloning https://github.com/saforem2/ezpz to ./venvs/2024-04-29/src/ezpz
Running command git clone --filter=blob:none --quiet https://github.com/saforem2/ezpz /home/foremans/tmp/foremans/2024-07-15-124441/venvs/2024-04-29/src/ezpz
Resolved https://github.com/saforem2/ezpz to commit d8fabca03038db55a1dc490f801581e980f93a25
Installing build dependencies ... done
Checking if build backend supports build_editable ... done
Getting requirements to build editable ... done
Installing backend dependencies ... done
Preparing editable metadata (pyproject.toml) ... done
Requirement already satisfied: ambivalent in /home/foremans/.local/polaris/conda/2024-04-29/lib/python3.11/site-packages (from ezpz) (0.0.1)
Requirement already satisfied: hydra-colorlog in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (1.2.0)
Requirement already satisfied: hydra-core in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (1.3.2)
Requirement already satisfied: ipython in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (8.24.0)
Requirement already satisfied: jax in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (0.4.26)
Requirement already satisfied: jaxlib in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (0.4.26+cuda12.cudnn89)
Requirement already satisfied: jaxtyping in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (0.2.28)
Requirement already satisfied: joblib in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (1.4.0)
Requirement already satisfied: ml-dtypes in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (0.3.2)
Requirement already satisfied: mpi4py in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (3.1.6)
Requirement already satisfied: omegaconf in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (2.3.0)
Requirement already satisfied: plotext in /home/foremans/.local/polaris/conda/2024-04-29/lib/python3.11/site-packages (from ezpz) (5.2.8)
Collecting pyinstrument (from ezpz)
Downloading pyinstrument-4.6.2-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (21 kB)
Requirement already satisfied: rich in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (13.7.1)
Requirement already satisfied: seaborn in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (0.13.2)
Requirement already satisfied: sentencepiece in /home/foremans/.local/polaris/conda/2024-04-29/lib/python3.11/site-packages (from ezpz) (0.2.0)
Requirement already satisfied: sh in /home/foremans/.local/polaris/conda/2024-04-29/lib/python3.11/site-packages (from ezpz) (2.0.6)
Requirement already satisfied: tensorboard in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (2.16.2)
Requirement already satisfied: torch in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (2.3.0)
Requirement already satisfied: tqdm in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (4.65.0)
Requirement already satisfied: wandb in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (0.16.6)
Requirement already satisfied: xarray in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ezpz) (2024.3.0)
Requirement already satisfied: colormaps in /home/foremans/.local/polaris/conda/2024-04-29/lib/python3.11/site-packages (from ambivalent->ezpz) (0.4.1)
Requirement already satisfied: matplotlib in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ambivalent->ezpz) (3.8.4)
Requirement already satisfied: requests in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ambivalent->ezpz) (2.31.0)
Requirement already satisfied: colorlog in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from hydra-colorlog->ezpz) (6.8.2)
Requirement already satisfied: antlr4-python3-runtime==4.9.* in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from hydra-core->ezpz) (4.9.3)
Requirement already satisfied: packaging in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from hydra-core->ezpz) (24.0)
Requirement already satisfied: PyYAML>=5.1.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from omegaconf->ezpz) (6.0.1)
Requirement already satisfied: decorator in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ipython->ezpz) (5.1.1)
Requirement already satisfied: jedi>=0.16 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ipython->ezpz) (0.19.1)
Requirement already satisfied: matplotlib-inline in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ipython->ezpz) (0.1.7)
Requirement already satisfied: prompt-toolkit<3.1.0,>=3.0.41 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ipython->ezpz) (3.0.43)
Requirement already satisfied: pygments>=2.4.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ipython->ezpz) (2.17.2)
Requirement already satisfied: stack-data in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ipython->ezpz) (0.6.3)
Requirement already satisfied: traitlets>=5.13.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ipython->ezpz) (5.14.3)
Requirement already satisfied: typing-extensions>=4.6 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ipython->ezpz) (4.11.0)
Requirement already satisfied: pexpect>4.3 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from ipython->ezpz) (4.9.0)
Requirement already satisfied: numpy>=1.22 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from jax->ezpz) (1.26.4)
Requirement already satisfied: opt-einsum in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from jax->ezpz) (3.3.0)
Requirement already satisfied: scipy>=1.9 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from jax->ezpz) (1.13.0)
Requirement already satisfied: typeguard==2.13.3 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from jaxtyping->ezpz) (2.13.3)
Requirement already satisfied: markdown-it-py>=2.2.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from rich->ezpz) (3.0.0)
Requirement already satisfied: pandas>=1.2 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from seaborn->ezpz) (2.2.2)
Requirement already satisfied: absl-py>=0.4 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from tensorboard->ezpz) (2.1.0)
Requirement already satisfied: grpcio>=1.48.2 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from tensorboard->ezpz) (1.62.2)
Requirement already satisfied: markdown>=2.6.8 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from tensorboard->ezpz) (3.6)
Requirement already satisfied: protobuf!=4.24.0,>=3.19.6 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from tensorboard->ezpz) (3.20.3)
Requirement already satisfied: setuptools>=41.0.0 in ./venvs/2024-04-29/lib/python3.11/site-packages (from tensorboard->ezpz) (65.5.0)
Requirement already satisfied: six>1.9 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from tensorboard->ezpz) (1.16.0)
Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from tensorboard->ezpz) (0.7.2)
Requirement already satisfied: werkzeug>=1.0.1 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from tensorboard->ezpz) (3.0.2)
Requirement already satisfied: filelock in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from torch->ezpz) (3.13.1)
Requirement already satisfied: sympy in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from torch->ezpz) (1.12)
Requirement already satisfied: networkx in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from torch->ezpz) (3.3)
Requirement already satisfied: jinja2 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from torch->ezpz) (3.0.3)
Requirement already satisfied: fsspec in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from torch->ezpz) (2024.3.1)
Requirement already satisfied: Click!=8.0.0,>=7.1 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from wandb->ezpz) (8.1.7)
Requirement already satisfied: GitPython!=3.1.29,>=1.0.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from wandb->ezpz) (3.1.43)
Requirement already satisfied: psutil>=5.0.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from wandb->ezpz) (5.9.8)
Requirement already satisfied: sentry-sdk>=1.0.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from wandb->ezpz) (2.0.1)
Requirement already satisfied: docker-pycreds>=0.4.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from wandb->ezpz) (0.4.0)
Requirement already satisfied: setproctitle in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from wandb->ezpz) (1.3.3)
Requirement already satisfied: appdirs>=1.4.3 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from wandb->ezpz) (1.4.4)
Requirement already satisfied: gitdb<5,>=4.0.1 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from GitPython!=3.1.29,>=1.0.0->wandb->ezpz) (4.0.11)
Requirement already satisfied: parso<0.9.0,>=0.8.3 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from jedi>=0.16->ipython->ezpz) (0.8.4)
Requirement already satisfied: mdurl~=0.1 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from markdown-it-py>=2.2.0->rich->ezpz) (0.1.2)
Requirement already satisfied: contourpy>=1.0.1 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from matplotlib->ambivalent->ezpz) (1.2.1)
Requirement already satisfied: cycler>=0.10 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from matplotlib->ambivalent->ezpz) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from matplotlib->ambivalent->ezpz) (4.51.0)
Requirement already satisfied: kiwisolver>=1.3.1 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from matplotlib->ambivalent->ezpz) (1.4.5)
Requirement already satisfied: pillow>=8 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from matplotlib->ambivalent->ezpz) (10.3.0)
Requirement already satisfied: pyparsing>=2.3.1 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from matplotlib->ambivalent->ezpz) (3.1.2)
Requirement already satisfied: python-dateutil>=2.7 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from matplotlib->ambivalent->ezpz) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from pandas>=1.2->seaborn->ezpz) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from pandas>=1.2->seaborn->ezpz) (2024.1)
Requirement already satisfied: ptyprocess>=0.5 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from pexpect>4.3->ipython->ezpz) (0.7.0)
Requirement already satisfied: wcwidth in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from prompt-toolkit<3.1.0,>=3.0.41->ipython->ezpz) (0.2.13)
Requirement already satisfied: charset-normalizer<4,>=2 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from requests->ambivalent->ezpz) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from requests->ambivalent->ezpz) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from requests->ambivalent->ezpz) (2.1.0)
Requirement already satisfied: certifi>=2017.4.17 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from requests->ambivalent->ezpz) (2024.2.2)
Requirement already satisfied: MarkupSafe>=2.1.1 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from werkzeug>=1.0.1->tensorboard->ezpz) (2.1.3)
Requirement already satisfied: executing>=1.2.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from stack-data->ipython->ezpz) (2.0.1)
Requirement already satisfied: asttokens>=2.1.0 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from stack-data->ipython->ezpz) (2.4.1)
Requirement already satisfied: pure-eval in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from stack-data->ipython->ezpz) (0.2.2)
Requirement already satisfied: mpmath>=0.19 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from sympy->torch->ezpz) (1.3.0)
Requirement already satisfied: smmap<6,>=3.0.1 in /soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages (from gitdb<5,>=4.0.1->GitPython!=3.1.29,>=1.0.0->wandb->ezpz) (5.0.1)
Downloading pyinstrument-4.6.2-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (104 kB)
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 104.9/104.9 kB 3.1 MB/s eta 0:00:00
Building wheels for collected packages: ezpz
Building editable for ezpz (pyproject.toml) ... done
Created wheel for ezpz: filename=ezpz-0.1-py3-none-any.whl size=10104 sha256=f73fbc552c6192f2d1575c08528267c1c70bd4ed2eebab011c692b4cf66fd9cb
Stored in directory: /tmp/pip-ephem-wheel-cache-6xb0tqk8/wheels/b3/57/90/f3324177d75cbc607a034b5b8e66d5b3d35dcf087967430718
Successfully built ezpz
Installing collected packages: pyinstrument, ezpz
Attempting uninstall: ezpz
Found existing installation: ezpz 0.1
Not uninstalling ezpz at /home/foremans/.local/polaris/conda/2024-04-29/lib/python3.11/site-packages, outside environment /home/foremans/tmp/foremans/2024-07-15-124441/venvs/2024-04-29
Cant uninstall 'ezpz'. No files were found to uninstall.
Successfully installed ezpz pyinstrument-4.6.2
[notice] A new release of pip is available: 24.0 -> 24.1.2
[notice] To update, run: pip install --upgrade pip
9.63s user 1.44s system 60% cpu 18.301s total
Connected to tcp://x3006c0s13b0n0.hsn.cm.polaris.alcf.anl.gov:7919
Found executable /home/foremans/tmp/foremans/2024-07-15-124441/venvs/2024-04-29/bin/python3
Launching application 5ccc89be-4289-49f5-8b4e-64104021b3c5
[2024-07-15 12:46:24.601903][INFO][__init__:156] - Setting logging level to 'INFO' on 'RANK == 0'
[2024-07-15 12:46:24.604160][INFO][__init__:157] - Setting logging level to 'CRITICAL' on all others 'RANK != 0'
[2024-07-15 12:46:24.604624][INFO][__init__:160] - To disable this behavior, and log from ALL ranks (not recommended), set: 'export LOG_FROM_ALL_RANKS=1' in your environment, and re-run.
wandb: WARNING require() unsupported requirement: core
wandb: ERROR Supported wandb.require() features can be found at: https://wandb.me/library-require
wandb: WARNING require() unsupported requirement: core
wandb: ERROR Supported wandb.require() features can be found at: https://wandb.me/library-require
wandb: WARNING require() unsupported requirement: core
wandb: ERROR Supported wandb.require() features can be found at: https://wandb.me/library-require
wandb: WARNING require() unsupported requirement: core
wandb: ERROR Supported wandb.require() features can be found at: https://wandb.me/library-require
[2024-07-15 12:46:26.437667][INFO][dist:358] - [device='cuda'][rank=1/3][local_rank=1/3][node=0/0]
[2024-07-15 12:46:26.437716][INFO][dist:358] - [device='cuda'][rank=3/3][local_rank=3/3][node=0/0]
[2024-07-15 12:46:26.438619][INFO][dist:358] - [device='cuda'][rank=2/3][local_rank=2/3][node=0/0]
[2024-07-15 12:46:26.444402][INFO][dist:95] -
[dist_info]:
โข DEVICE=cuda
โข DEVICE_ID=cuda:0
โข DISTRIBUTED_BACKEND=nccl
โข GPUS_PER_NODE=4
โข HOSTS=['x3006c0s13b0n0.hsn.cm.polaris.alcf.anl.gov']
โข HOSTFILE=/var/spool/pbs/aux/2021158.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
โข HOSTNAME=x3006c0s13b0n0.hsn.cm.polaris.alcf.anl.gov
โข LOCAL_RANK=0
โข MACHINE=Polaris
โข NUM_NODES=1
โข NGPUS=4
โข NGPUS_AVAILABLE=4
โข NODE_ID=0
โข RANK=0
โข SCHEDULER=PBS
โข WORLD_SIZE_TOTAL=4
โข WORLD_SIZE_IN_USE=4
โข LAUNCH_CMD=mpiexec --verbose --envall -n 4 -ppn 4 --hostfile /var/spool/pbs/aux/2021158.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov --cpu-bind depth -d 16
[2024-07-15 12:46:26.447254][INFO][dist:725] - [0/4] Using device='cuda' with backend='DDP' + 'nccl' for distributed training.
[2024-07-15 12:46:26.451604][INFO][dist:358] - [device='cuda'][rank=0/3][local_rank=0/3][node=0/0]
[2024-07-15 12:46:26.452120][WARNING][dist:364] - Using [4 / 4] available "cuda" devices !!
[2024-07-15 12:46:26.452676][INFO][dist:95] -
[timers_import]:
โข os=1.1026859283447266e-06
โข logging=4.507601261138916e-07
โข typing=2.9457733035087585e-06
โข pathlib=1.2619420886039734e-06
โข ezpz=6.109476089477539e-07
โข torch=3.5976991057395935e-06
โข torch_ddp=2.3636966943740845e-06
โข wandb=6.36400654911995e-05
โข total=7.597357034683228e-05
[2024-07-15 12:46:26.453718][INFO][dist:95] -
[CONFIG]:
โข warmup=0
โข log_freq=1
โข batch_size=64
โข input_size=128
โข output_size=128
โข dtype=torch.float32
โข device=cuda
โข world_size=4
โข train_iters=100
[2024-07-15 12:46:28.048558][INFO][test_dist:183] - model=Network(
(layers): Sequential(
(0): Linear(in_features=128, out_features=1024, bias=True)
(1): Linear(in_features=1024, out_features=512, bias=True)
(2): Linear(in_features=512, out_features=256, bias=True)
(3): Linear(in_features=256, out_features=128, bias=True)
(4): Linear(in_features=128, out_features=128, bias=True)
)
)
[2024-07-15 12:46:30.303706][INFO][test_dist:274] - iter=1, loss=1977.36, sps=1.829e+04, dt=0.00349948, dtf=0.00097, dtb=0.00253
[2024-07-15 12:46:30.307445][INFO][test_dist:274] - iter=2, loss=1521.61, sps=3.179e+04, dt=0.00201315, dtf=0.0005266, dtb=0.001487
[2024-07-15 12:46:30.310480][INFO][test_dist:274] - iter=3, loss=1101.45, sps=3.636e+04, dt=0.00176022, dtf=0.0005615, dtb=0.001199
[2024-07-15 12:46:30.313460][INFO][test_dist:274] - iter=4, loss=907.963, sps=3.725e+04, dt=0.00171805, dtf=0.0005335, dtb=0.001185
[2024-07-15 12:46:30.316456][INFO][test_dist:274] - iter=5, loss=842.827, sps=3.698e+04, dt=0.0017307, dtf=0.0005462, dtb=0.001184
[2024-07-15 12:46:30.319577][INFO][test_dist:274] - iter=6, loss=781.918, sps=3.443e+04, dt=0.00185873, dtf=0.0006062, dtb=0.001253
[2024-07-15 12:46:30.322847][INFO][test_dist:274] - iter=7, loss=778.317, sps=3.253e+04, dt=0.00196762, dtf=0.0005706, dtb=0.001397
[2024-07-15 12:46:30.325813][INFO][test_dist:274] - iter=8, loss=746.629, sps=3.746e+04, dt=0.00170848, dtf=0.0005085, dtb=0.0012
[2024-07-15 12:46:30.328695][INFO][test_dist:274] - iter=9, loss=736.385, sps=3.913e+04, dt=0.00163572, dtf=0.0004885, dtb=0.001147
[2024-07-15 12:46:30.331672][INFO][test_dist:274] - iter=10, loss=733.067, sps=3.713e+04, dt=0.00172381, dtf=0.0004742, dtb=0.00125
[2024-07-15 12:46:30.334841][INFO][test_dist:274] - iter=11, loss=729.301, sps=3.342e+04, dt=0.00191528, dtf=0.0003869, dtb=0.001528
[2024-07-15 12:46:30.338105][INFO][test_dist:274] - iter=12, loss=701.243, sps=3.198e+04, dt=0.00200144, dtf=0.0004697, dtb=0.001532
[2024-07-15 12:46:30.340953][INFO][test_dist:274] - iter=13, loss=681.563, sps=4.088e+04, dt=0.00156543, dtf=0.0004341, dtb=0.001131
[2024-07-15 12:46:30.343843][INFO][test_dist:274] - iter=14, loss=685.259, sps=4.053e+04, dt=0.00157891, dtf=0.0004456, dtb=0.001133
[2024-07-15 12:46:30.346744][INFO][test_dist:274] - iter=15, loss=678.99, sps=3.935e+04, dt=0.00162651, dtf=0.0004026, dtb=0.001224
[2024-07-15 12:46:30.349674][INFO][test_dist:274] - iter=16, loss=672.654, sps=3.801e+04, dt=0.00168398, dtf=0.0004565, dtb=0.001227
[2024-07-15 12:46:30.352400][INFO][test_dist:274] - iter=17, loss=660.552, sps=4.319e+04, dt=0.00148184, dtf=0.0003963, dtb=0.001085
[2024-07-15 12:46:30.355521][INFO][test_dist:274] - iter=18, loss=640.175, sps=3.408e+04, dt=0.00187814, dtf=0.0005312, dtb=0.001347
[2024-07-15 12:46:30.358530][INFO][test_dist:274] - iter=19, loss=645.41, sps=3.65e+04, dt=0.00175325, dtf=0.0004639, dtb=0.001289
[2024-07-15 12:46:30.361515][INFO][test_dist:274] - iter=20, loss=649.203, sps=3.675e+04, dt=0.00174155, dtf=0.0004799, dtb=0.001262
[2024-07-15 12:46:30.364256][INFO][test_dist:274] - iter=21, loss=633.891, sps=4.277e+04, dt=0.00149629, dtf=0.000416, dtb=0.00108
[2024-07-15 12:46:30.367135][INFO][test_dist:274] - iter=22, loss=623.202, sps=3.892e+04, dt=0.00164444, dtf=0.0004509, dtb=0.001194
[2024-07-15 12:46:30.370090][INFO][test_dist:274] - iter=23, loss=623.178, sps=3.771e+04, dt=0.00169695, dtf=0.0004252, dtb=0.001272
[2024-07-15 12:46:30.373119][INFO][test_dist:274] - iter=24, loss=626.489, sps=3.69e+04, dt=0.00173444, dtf=0.0004551, dtb=0.001279
[2024-07-15 12:46:30.375951][INFO][test_dist:274] - iter=25, loss=636.674, sps=4.089e+04, dt=0.0015651, dtf=0.0004223, dtb=0.001143
[2024-07-15 12:46:30.378913][INFO][test_dist:274] - iter=26, loss=639.64, sps=3.758e+04, dt=0.00170305, dtf=0.0004532, dtb=0.00125
[2024-07-15 12:46:30.381808][INFO][test_dist:274] - iter=27, loss=605.015, sps=3.874e+04, dt=0.00165192, dtf=0.0004257, dtb=0.001226
[2024-07-15 12:46:30.384573][INFO][test_dist:274] - iter=28, loss=603.894, sps=4.244e+04, dt=0.00150807, dtf=0.0004296, dtb=0.001078
[2024-07-15 12:46:30.387388][INFO][test_dist:274] - iter=29, loss=619.885, sps=4.03e+04, dt=0.00158808, dtf=0.0004196, dtb=0.001168
[2024-07-15 12:46:30.390148][INFO][test_dist:274] - iter=30, loss=589.771, sps=4.21e+04, dt=0.0015203, dtf=0.0004438, dtb=0.001076
[2024-07-15 12:46:30.392838][INFO][test_dist:274] - iter=31, loss=595.523, sps=4.381e+04, dt=0.001461, dtf=0.0004153, dtb=0.001046
[2024-07-15 12:46:30.395622][INFO][test_dist:274] - iter=32, loss=605.537, sps=4.104e+04, dt=0.00155956, dtf=0.0004367, dtb=0.001123
[2024-07-15 12:46:30.398489][INFO][test_dist:274] - iter=33, loss=586.025, sps=3.913e+04, dt=0.00163565, dtf=0.0003743, dtb=0.001261
[2024-07-15 12:46:30.401355][INFO][test_dist:274] - iter=34, loss=577.14, sps=3.877e+04, dt=0.0016506, dtf=0.0004581, dtb=0.001192
[2024-07-15 12:46:30.404092][INFO][test_dist:274] - iter=35, loss=568.886, sps=4.383e+04, dt=0.00146019, dtf=0.0003966, dtb=0.001064
[2024-07-15 12:46:30.406843][INFO][test_dist:274] - iter=36, loss=567.26, sps=4.228e+04, dt=0.00151377, dtf=0.0004537, dtb=0.00106
[2024-07-15 12:46:30.409591][INFO][test_dist:274] - iter=37, loss=574.633, sps=4.226e+04, dt=0.00151457, dtf=0.0003845, dtb=0.00113
[2024-07-15 12:46:30.412347][INFO][test_dist:274] - iter=38, loss=557.928, sps=4.212e+04, dt=0.00151945, dtf=0.000457, dtb=0.001062
[2024-07-15 12:46:30.415130][INFO][test_dist:274] - iter=39, loss=558.186, sps=4.135e+04, dt=0.00154767, dtf=0.0003975, dtb=0.00115
[2024-07-15 12:46:30.417880][INFO][test_dist:274] - iter=40, loss=553.377, sps=4.233e+04, dt=0.00151185, dtf=0.0004556, dtb=0.001056
[2024-07-15 12:46:30.420583][INFO][test_dist:274] - iter=41, loss=542.918, sps=4.394e+04, dt=0.00145645, dtf=0.0003928, dtb=0.001064
[2024-07-15 12:46:30.423489][INFO][test_dist:274] - iter=42, loss=547.64, sps=3.827e+04, dt=0.00167242, dtf=0.0004762, dtb=0.001196
[2024-07-15 12:46:30.426243][INFO][test_dist:274] - iter=43, loss=546.106, sps=4.22e+04, dt=0.00151648, dtf=0.0004665, dtb=0.00105
[2024-07-15 12:46:30.428998][INFO][test_dist:274] - iter=44, loss=535.946, sps=4.209e+04, dt=0.0015204, dtf=0.0004679, dtb=0.001053
[2024-07-15 12:46:30.431749][INFO][test_dist:274] - iter=45, loss=534.731, sps=4.324e+04, dt=0.00148002, dtf=0.0003821, dtb=0.001098
[2024-07-15 12:46:30.434709][INFO][test_dist:274] - iter=46, loss=520.207, sps=3.74e+04, dt=0.00171109, dtf=0.0004486, dtb=0.001263
[2024-07-15 12:46:30.437524][INFO][test_dist:274] - iter=47, loss=527.301, sps=4.191e+04, dt=0.00152713, dtf=0.0004943, dtb=0.001033
[2024-07-15 12:46:30.440277][INFO][test_dist:274] - iter=48, loss=516.108, sps=4.279e+04, dt=0.00149561, dtf=0.0004418, dtb=0.001054
[2024-07-15 12:46:30.443029][INFO][test_dist:274] - iter=49, loss=516.086, sps=4.231e+04, dt=0.00151262, dtf=0.0003899, dtb=0.001123
[2024-07-15 12:46:30.445895][INFO][test_dist:274] - iter=50, loss=508.945, sps=3.922e+04, dt=0.00163198, dtf=0.0004333, dtb=0.001199
[2024-07-15 12:46:30.448667][INFO][test_dist:274] - iter=51, loss=513.235, sps=4.163e+04, dt=0.00153721, dtf=0.0004863, dtb=0.001051
[2024-07-15 12:46:30.451581][INFO][test_dist:274] - iter=52, loss=511.549, sps=3.822e+04, dt=0.00167444, dtf=0.0004698, dtb=0.001205
[2024-07-15 12:46:30.454412][INFO][test_dist:274] - iter=53, loss=510.211, sps=4.042e+04, dt=0.00158328, dtf=0.0004193, dtb=0.001164
[2024-07-15 12:46:30.457220][INFO][test_dist:274] - iter=54, loss=492.431, sps=4.151e+04, dt=0.00154189, dtf=0.0004324, dtb=0.001109
[2024-07-15 12:46:30.460039][INFO][test_dist:274] - iter=55, loss=499.074, sps=4.121e+04, dt=0.00155284, dtf=0.0005013, dtb=0.001052
[2024-07-15 12:46:30.462836][INFO][test_dist:274] - iter=56, loss=490.718, sps=4.108e+04, dt=0.0015581, dtf=0.0004388, dtb=0.001119
[2024-07-15 12:46:30.465621][INFO][test_dist:274] - iter=57, loss=493.223, sps=4.157e+04, dt=0.00153946, dtf=0.0003867, dtb=0.001153
[2024-07-15 12:46:30.468390][INFO][test_dist:274] - iter=58, loss=486.601, sps=4.256e+04, dt=0.00150368, dtf=0.0004462, dtb=0.001057
[2024-07-15 12:46:30.471148][INFO][test_dist:274] - iter=59, loss=473.609, sps=4.326e+04, dt=0.00147947, dtf=0.0004221, dtb=0.001057
[2024-07-15 12:46:30.473847][INFO][test_dist:274] - iter=60, loss=478.577, sps=4.378e+04, dt=0.00146188, dtf=0.0004284, dtb=0.001033
[2024-07-15 12:46:30.476547][INFO][test_dist:274] - iter=61, loss=475.991, sps=4.387e+04, dt=0.00145878, dtf=0.0003923, dtb=0.001066
[2024-07-15 12:46:30.479447][INFO][test_dist:274] - iter=62, loss=471.526, sps=3.859e+04, dt=0.00165854, dtf=0.0004523, dtb=0.001206
[2024-07-15 12:46:30.482180][INFO][test_dist:274] - iter=63, loss=460.986, sps=4.277e+04, dt=0.00149626, dtf=0.0003914, dtb=0.001105
[2024-07-15 12:46:30.484857][INFO][test_dist:274] - iter=64, loss=464.409, sps=4.404e+04, dt=0.00145311, dtf=0.0004247, dtb=0.001028
[2024-07-15 12:46:30.487652][INFO][test_dist:274] - iter=65, loss=458.023, sps=4.104e+04, dt=0.00155961, dtf=0.0003934, dtb=0.001166
[2024-07-15 12:46:30.490474][INFO][test_dist:274] - iter=66, loss=456.718, sps=4.013e+04, dt=0.00159499, dtf=0.000416, dtb=0.001179
[2024-07-15 12:46:30.493187][INFO][test_dist:274] - iter=67, loss=451.993, sps=4.296e+04, dt=0.00148977, dtf=0.000395, dtb=0.001095
[2024-07-15 12:46:30.495894][INFO][test_dist:274] - iter=68, loss=454.66, sps=4.4e+04, dt=0.0014545, dtf=0.000425, dtb=0.001029
[2024-07-15 12:46:30.498664][INFO][test_dist:274] - iter=69, loss=451.727, sps=4.17e+04, dt=0.00153459, dtf=0.0003837, dtb=0.001151
[2024-07-15 12:46:30.501502][INFO][test_dist:274] - iter=70, loss=440.922, sps=4.015e+04, dt=0.00159391, dtf=0.0004272, dtb=0.001167
[2024-07-15 12:46:30.504271][INFO][test_dist:274] - iter=71, loss=442.788, sps=4.322e+04, dt=0.00148083, dtf=0.0004178, dtb=0.001063
[2024-07-15 12:46:30.507000][INFO][test_dist:274] - iter=72, loss=439.069, sps=4.307e+04, dt=0.00148594, dtf=0.0004285, dtb=0.001057
[2024-07-15 12:46:30.509755][INFO][test_dist:274] - iter=73, loss=430.236, sps=4.211e+04, dt=0.00151976, dtf=0.0003829, dtb=0.001137
[2024-07-15 12:46:30.512494][INFO][test_dist:274] - iter=74, loss=428.951, sps=4.318e+04, dt=0.00148219, dtf=0.0004357, dtb=0.001046
[2024-07-15 12:46:30.515244][INFO][test_dist:274] - iter=75, loss=430.417, sps=4.253e+04, dt=0.00150487, dtf=0.0004357, dtb=0.001069
[2024-07-15 12:46:30.517921][INFO][test_dist:274] - iter=76, loss=416.647, sps=4.401e+04, dt=0.00145412, dtf=0.0004374, dtb=0.001017
[2024-07-15 12:46:30.520610][INFO][test_dist:274] - iter=77, loss=422.518, sps=4.468e+04, dt=0.0014323, dtf=0.0003941, dtb=0.001038
[2024-07-15 12:46:30.523343][INFO][test_dist:274] - iter=78, loss=412.028, sps=4.272e+04, dt=0.00149821, dtf=0.0004521, dtb=0.001046
[2024-07-15 12:46:30.526016][INFO][test_dist:274] - iter=79, loss=406.225, sps=4.473e+04, dt=0.00143094, dtf=0.0003893, dtb=0.001042
[2024-07-15 12:46:30.528730][INFO][test_dist:274] - iter=80, loss=402.887, sps=4.382e+04, dt=0.00146036, dtf=0.0004402, dtb=0.00102
[2024-07-15 12:46:30.531536][INFO][test_dist:274] - iter=81, loss=397.311, sps=4.069e+04, dt=0.0015729, dtf=0.000404, dtb=0.001169
[2024-07-15 12:46:30.534242][INFO][test_dist:274] - iter=82, loss=411.916, sps=4.39e+04, dt=0.00145795, dtf=0.0004351, dtb=0.001023
[2024-07-15 12:46:30.537005][INFO][test_dist:274] - iter=83, loss=402.795, sps=4.366e+04, dt=0.00146599, dtf=0.0004252, dtb=0.001041
[2024-07-15 12:46:30.539805][INFO][test_dist:274] - iter=84, loss=391.05, sps=4.1e+04, dt=0.00156095, dtf=0.0004323, dtb=0.001129
[2024-07-15 12:46:30.542590][INFO][test_dist:274] - iter=85, loss=383.782, sps=4.118e+04, dt=0.00155416, dtf=0.000388, dtb=0.001166
[2024-07-15 12:46:30.545290][INFO][test_dist:274] - iter=86, loss=399.543, sps=4.396e+04, dt=0.00145595, dtf=0.0004339, dtb=0.001022
[2024-07-15 12:46:30.547992][INFO][test_dist:274] - iter=87, loss=379.003, sps=4.456e+04, dt=0.00143613, dtf=0.0004131, dtb=0.001023
[2024-07-15 12:46:30.550837][INFO][test_dist:274] - iter=88, loss=372.048, sps=3.998e+04, dt=0.00160092, dtf=0.0004375, dtb=0.001163
[2024-07-15 12:46:30.553641][INFO][test_dist:274] - iter=89, loss=376.187, sps=4.137e+04, dt=0.0015471, dtf=0.0004142, dtb=0.001133
[2024-07-15 12:46:30.556354][INFO][test_dist:274] - iter=90, loss=372.281, sps=4.408e+04, dt=0.00145186, dtf=0.0004239, dtb=0.001028
[2024-07-15 12:46:30.559101][INFO][test_dist:274] - iter=91, loss=370.701, sps=4.252e+04, dt=0.00150523, dtf=0.0004545, dtb=0.001051
[2024-07-15 12:46:30.561884][INFO][test_dist:274] - iter=92, loss=356.074, sps=4.191e+04, dt=0.00152712, dtf=0.0004291, dtb=0.001098
[2024-07-15 12:46:30.564556][INFO][test_dist:274] - iter=93, loss=360.663, sps=4.468e+04, dt=0.00143241, dtf=0.0003938, dtb=0.001039
[2024-07-15 12:46:30.567287][INFO][test_dist:274] - iter=94, loss=374.599, sps=4.296e+04, dt=0.0014897, dtf=0.0004539, dtb=0.001036
[2024-07-15 12:46:30.570088][INFO][test_dist:274] - iter=95, loss=364.476, sps=4.253e+04, dt=0.00150469, dtf=0.0004189, dtb=0.001086
[2024-07-15 12:46:30.572767][INFO][test_dist:274] - iter=96, loss=358.992, sps=4.415e+04, dt=0.00144967, dtf=0.0004295, dtb=0.00102
[2024-07-15 12:46:30.575567][INFO][test_dist:274] - iter=97, loss=354.959, sps=4.092e+04, dt=0.0015641, dtf=0.0004054, dtb=0.001159
[2024-07-15 12:46:30.578287][INFO][test_dist:274] - iter=98, loss=345.644, sps=4.374e+04, dt=0.00146311, dtf=0.0004197, dtb=0.001043
[2024-07-15 12:46:30.580996][INFO][test_dist:274] - iter=99, loss=348.209, sps=4.391e+04, dt=0.00145758, dtf=0.0004263, dtb=0.001031
train/dt [2024-07-15-124630]
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
0.00350โคโ โ
โ โ
โ โ
0.00315โค โ
โ โ
โ โ
0.00281โค โ
โ โ
0.00247โค โ
โ โ
โ โ
0.00212โค โ
โโ โ โ
โ โ โ โ โ
0.00178โค โ โ โ
โ โโ โโ โ โ โโ โ โ โ โ โ โ โ
โ โ โโโ โ โโโโ โ โ โ โโโโโโ โโ โ โ โ โโ โโ โ โ โ
0.00143โค โ โ โ โ โโโ โโโโ โ โโโโโ โโ โโโโโโโโโโ โโ โ โโโ โโ
โโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโฌโ
1.0 25.5 50.0 74.5 99.0
train/dt iter
[2024-07-15 12:46:30.627053][INFO][plot:156] - Appending plot to: /home/foremans/tmp/foremans/2024-07-15-124441/test-dist-plots/train/dt.txt
text saved in /home/foremans/tmp/foremans/2024-07-15-124441/test-dist-plots/train/dt.txt
train/dtf [2024-07-15-124630]
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
0.00097โคโ โ
โ โ
โ โ
0.00087โค โ
โ โ
โ โ
0.00077โค โ
โ โ
0.00067โค โ
โ โ
โ โ โ
0.00057โค โ โ
โ โ โ โ
โโ โ โ โ โ
0.00047โค โ โ โ โ โ โ โ โ โ
โ โโ โ โ โโ โ โโโ โโ โ โ โ โ โ โ โ โโโโ โโ โโ โ โ โ
โ โ โ โโโโโโ โ โโ โโ โ โโโโโ โ โ โโโโ โโโโโ
0.00037โค โ โ โโ โโ โ โ โ โ โโ โโ โ โ โ โ โ โ โ
โโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโฌโ
1.0 25.5 50.0 74.5 99.0
train/dtf iter
[2024-07-15 12:46:30.638406][INFO][plot:156] - Appending plot to: /home/foremans/tmp/foremans/2024-07-15-124441/test-dist-plots/train/dtf.txt
text saved in /home/foremans/tmp/foremans/2024-07-15-124441/test-dist-plots/train/dtf.txt
train/dtb [2024-07-15-124630]
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
0.00253โคโ โ
โ โ
โ โ
0.00228โค โ
โ โ
โ โ
0.00203โค โ
โ โ
0.00177โค โ
โ โ
โ โ
0.00152โค โโ โ
โโ โ
โ โ โ โ
0.00127โค โ โ โ โ โ
โ โโโ โโ โโ โ โ โ โ โ โโ โ โ โ โ
โ โ โโ โ โ โ โโ โ โ โโโโ โ โโ โ โ โโ โโ โ โ โ โ
0.00102โค โ โ โ โโโ โโโโ โโ โ โ โโโ โ โ โโโโโโโโโโ โโ โ โโโ โโ
โโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโฌโ
1.0 25.5 50.0 74.5 99.0
train/dtb iter
[2024-07-15 12:46:30.648950][INFO][plot:156] - Appending plot to: /home/foremans/tmp/foremans/2024-07-15-124441/test-dist-plots/train/dtb.txt
text saved in /home/foremans/tmp/foremans/2024-07-15-124441/test-dist-plots/train/dtb.txt
train/loss [2024-07-15-124630]
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1977.4โคโ โ
โ โ
โ โ
1705.4โค โ
โ โ
โโ โ
1433.5โค โ
โ โ
โ โ
1161.5โค โ โ
โ โ
889.5โค โ โ
โ โ โ
โ โโโโโโ โ
617.6โค โโโโโโโโโโโโ โ โ
โ โโโโโโโโโโโโโโโโโโ โ
โ โโโโโโโโโโโโโโโโโโ โ
345.6โค โโโโโโโโโโโโโโโโโ
โโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโฌโ
1.0 25.5 50.0 74.5 99.0
train/loss iter
[2024-07-15 12:46:30.702717][INFO][plot:156] - Appending plot to: /home/foremans/tmp/foremans/2024-07-15-124441/test-dist-plots/train/loss.txt
text saved in /home/foremans/tmp/foremans/2024-07-15-124441/test-dist-plots/train/loss.txt
train/iter [2024-07-15-124630]
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
99.0โค โโโโโ
โ โโโโ โ
โ โโโโโ โ
82.7โค โโโโ โ
โ โโโโ โ
โ โโโโโ โ
66.3โค โโโโ โ
โ โโโโโ โ
50.0โค โโโโ โ
โ โโโโ โ
โ โโโโโ โ
33.7โค โโโโ โ
โ โโโโโ โ
โ โโโโ โ
17.3โค โโโโ โ
โ โโโโโ โ
โ โโโโ โ
1.0โคโโโโ โ
โโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโฌโ
1.0 25.5 50.0 74.5 99.0
train/iter iter
[2024-07-15 12:46:30.714042][INFO][plot:156] - Appending plot to: /home/foremans/tmp/foremans/2024-07-15-124441/test-dist-plots/train/iter.txt
text saved in /home/foremans/tmp/foremans/2024-07-15-124441/test-dist-plots/train/iter.txt
train/sps [2024-07-15-124630]
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
44725.9โค โ โ โ โโ โ โ โ โโโโ โโ โ โ โ โโ
โ โ โ โ โ โโ โ โโ โ โ โ โ โโโโ โ โ โ โ
โ โ โ โ โ โ โ โโโ โ โ โ โโ โ โ โ โ
40319.7โค โ โโ โ โ โ โ โ โ โ โ
โ โ โโ โ โ โ โ โ โ
โ โโโ โโ โ โ โ โ
35913.4โค โ
โ โ โ โ โ
31507.2โคโ โ โ โ
โ โ
โ โ
27100.9โค โ
โ โ
โ โ
22694.7โค โ
โ โ
โ โ
18288.4โคโ โ
โโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโฌโ
1.0 25.5 50.0 74.5 99.0
train/sps iter
[2024-07-15 12:46:30.724919][INFO][plot:156] - Appending plot to: /home/foremans/tmp/foremans/2024-07-15-124441/test-dist-plots/train/sps.txt
text saved in /home/foremans/tmp/foremans/2024-07-15-124441/test-dist-plots/train/sps.txt
_ ._ __/__ _ _ _ _ _/_ Recorded: 12:46:28 Samples: 2182
/_//_/// /_\ / //_// / //_'/ // Duration: 2.688 CPU time: 2.356
/ _/ v4.6.2
Program: /home/foremans/tmp/foremans/2024-07-15-124441/venvs/2024-04-29/src/ezpz/src/ezpz/test_dist.py
2.688 <module> ezpz/test_dist.py:1
โโ 2.687 main ezpz/test_dist.py:217
โโ 2.104 build_model_and_optimizer ezpz/test_dist.py:171
โ โโ 2.089 Adam.__init__ torch/optim/adam.py:15
โ [142 frames hidden] torch, transformers, jax, huggingface...
โโ 0.199 _backward_step ezpz/test_dist.py:236
โ โโ 0.104 Tensor.backward torch/_tensor.py:466
โ โ [4 frames hidden] torch, <built-in>
โ โโ 0.094 wrapper torch/optim/optimizer.py:374
โ [6 frames hidden] torch, <built-in>
โโ 0.145 tplot_dict ezpz/plot.py:136
โ โโ 0.085 show plotext/_core.py:292
โ โ [5 frames hidden] plotext
โ โโ 0.031 <module> plotext/__init__.py:1
โโ 0.136 _forward_step ezpz/test_dist.py:231
โ โโ 0.084 DistributedDataParallel._wrapped_call_impl torch/nn/modules/module.py:1528
โ โ [6 frames hidden] torch
โ โ 0.071 Network._call_impl torch/nn/modules/module.py:1534
โ โ โโ 0.071 Network.forward ezpz/test_dist.py:164
โ โ โโ 0.071 Sequential._wrapped_call_impl torch/nn/modules/module.py:1528
โ โ [7 frames hidden] torch, <built-in>
โ โโ 0.052 calc_loss ezpz/test_dist.py:168
โโ 0.100 Logger.info logging/__init__.py:1479
[6 frames hidden] logging, rich
0.100 RichHandler.emit rich/logging.py:126
โโ 0.098 Console.print ezpz/log/console.py:79
โโ 0.098 Console.print rich/console.py:1624
[4 frames hidden] rich
[2024-07-15 12:46:30.921061][INFO][profile:115] - Saving pyinstrument profile output to: /home/foremans/tmp/foremans/2024-07-15-124441/ezpz_pyinstrument_profiles
[2024-07-15 12:46:30.921555][INFO][profile:123] - PyInstrument profile saved (as html) to: /home/foremans/tmp/foremans/2024-07-15-124441/ezpz_pyinstrument_profiles/pyinstrument-profile-2024-07-15-124630.html
[2024-07-15 12:46:30.922034][INFO][profile:131] - PyInstrument profile saved (as text) to: /home/foremans/tmp/foremans/2024-07-15-124441/ezpz_pyinstrument_profiles/pyinstrument-profile-2024-07-15-124630.txt
[2024-07-15 12:46:31.432464][INFO][profile:143] - Finished with pyinstrument profiler. Took: 2.68764s
[2024-07-15 12:46:31.433122][INFO][test_dist:318] - [0] runtime=6.820802s
Application 5ccc89be resources: utime=21s stime=21s maxrss=1383056KB inblock=8080 oublock=3456 minflt=659443 majflt=896 nvcsw=192077 nivcsw=672014
Footnotes
Any of {Aurora, Polaris, Sophia, Sunspot, Sirius}โฉ๏ธ
Note the
--require-virtualenv
isnโt strictly required, but I highly recommend to always try and work within a virtual environment, when possible.โฉ๏ธdeepspeed
,DDP
only supportpytorch
โฉ๏ธIf necessary, otherwise activate if already existsโฉ๏ธ
Note that the virtual environment will be created at
./venvs/${CONDA_NAME}
, where${CONDA_NAME}
will match the prefix of the activeconda
environmentโฉ๏ธ
Citation
@article{foreman2024,
author = {Foreman, Sam},
title = {๐ Ezpz at {ALCF}},
date = {2024-08-23},
url = {https://samforeman.me/posts/ezpz-at-alcf},
langid = {en}
}