Want to compute: $y = \sum_{i} x_i W_i = x_0 W_0 + x_1 W_1 + x_2 W_2$,
where each GPU has only its own portion of the full weights, as shown below:
flowchart LR
subgraph X0["`GPU0`"]
direction LR
a("`W₀`")
end
subgraph X1["`GPU1`"]
direction LR
b("`W₁`")
end
subgraph X2["`GPU2`"]
direction LR
c("`W₂`")
end
t0("`x₀`")-->X0
X0 -->|"`x₀ W₀`"|X1
X1 -->|"`x₀ W₀ <br>+ x₁ W₁`"|X2
t1("`x₁`") --> X1
t2("`x₂`") --> X2
Figure 15: Tensor (Model) Parallelism
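As a minimal sketch of the same computation with torch.distributed (assuming a process group is already initialized with one rank per GPU, and that `x_i`, `W_i` are this rank's input slice and weight shard): the figure passes partial sums from GPU to GPU, while the sketch below forms the same sum with a single all-reduce.

```python
import torch
import torch.distributed as dist

def row_parallel_matmul(x_i: torch.Tensor, W_i: torch.Tensor) -> torch.Tensor:
    """Each rank holds one weight shard W_i and the matching input slice x_i.

    Every rank computes its partial product x_i @ W_i locally; summing the
    partials across ranks yields y = sum_i x_i W_i on every GPU.
    """
    partial = x_i @ W_i                              # local contribution x_i W_i
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)   # y = x_0 W_0 + x_1 W_1 + x_2 W_2
    return partial
```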
In Tensor Parallelism, each GPU processes only a slice of a tensor, and the full tensor is aggregated only for operations that require the whole thing.
The main building block of any transformer is a fully connected layer (nn.Linear) followed by a nonlinear activation (GeLU):
Y = GeLU(XA), where X and Y are the input and output, and A is the weight matrix.
If we look at the computation in matrix form, it's easy to see how the matrix multiplication can be split between multiple GPUs:
Figure 16: Tensor Parallel GEMM. This information is based on (the much more in-depth) TP Overview by @anton-l
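A minimal sketch of the column-parallel split from the figure (again assuming an initialized process group, equal-sized shards, and that every rank sees the full input X; `A_shard` is this rank's column block of A): since GeLU is applied element-wise, each GPU can compute GeLU(X A_i) on its own columns independently, and the shards are simply concatenated at the end.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def column_parallel_gelu(X: torch.Tensor, A_shard: torch.Tensor) -> torch.Tensor:
    """Y = GeLU(X A) with A = [A_1 | A_2 | ... | A_p] split column-wise.

    GeLU acts element-wise, so each rank applies it to its own output
    columns independently; an all-gather then rebuilds the full Y.
    (Assumes every rank's shard has the same shape.)
    """
    Y_local = F.gelu(X @ A_shard)                                        # GeLU(X A_i), purely local
    shards = [torch.empty_like(Y_local) for _ in range(dist.get_world_size())]
    dist.all_gather(shards, Y_local)                                     # collect Y_i from every rank
    return torch.cat(shards, dim=-1)                                     # Y = [Y_1, ..., Y_p]
```

In Megatron-style tensor parallelism the gather is usually deferred: the following row-parallel layer consumes the sharded output directly, so the communication collapses into a single all-reduce per block.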
Setup Python
$ export PBS_O_WORKDIR=$(pwd) && source deps/ezpz/src/ezpz/bin/utils.sh
Using WORKING_DIR: /eagle/argonne_tpc/foremans/tmp/2024-10-26-094746
$ ezpz_setup_python
No conda_prefix OR virtual_env found in environment...
Setting up conda...
Lmod is automatically replacing "nvhpc/23.9" with "gcc-native/12.3".
Lmod is automatically replacing "PrgEnv-nvhpc/8.5.0" with "PrgEnv-gnu/8.5.0".
Due to MODULEPATH changes, the following have been reloaded:
  1) cray-mpich/8.1.28
Found conda at: /soft/applications/conda/2024-04-29/mconda3
No VIRTUAL_ENV found in environment!
  - Trying to setup from /soft/applications/conda/2024-04-29/mconda3
  - Using VENV_DIR=/eagle/argonne_tpc/foremans/tmp/2024-10-26-094746/venvs/2024-04-29
  - Creating a new virtual env on top of 2024-04-29 in /eagle/argonne_tpc/foremans/tmp/2024-10-26-094746/venvs/2024-04-29
[python] Using /eagle/argonne_tpc/foremans/tmp/2024-10-26-094746/venvs/2024-04-29/bin/python3
Setup Job
$ ezpz_setup_job
[ezpz/bin/utils.sh]
  • USER=foremans
  • MACHINE=polaris
  • HOST=x3205c0s25b0n0
  • TSTAMP=2024-10-26-094841
[ezpz_get_pbs_env]: Caught 0 arguments
  • hostfile: /var/spool/pbs/aux/3061463.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
  • jobenv_file: /home/foremans/.pbsenv
[ezpz_setup_host_pbs]
  • Using hostfile: /var/spool/pbs/aux/3061463.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
  • Found in environment:
    • HOSTFILE: /var/spool/pbs/aux/3061463.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
  • Writing PBS vars to: /home/foremans/.pbsenv
[ezpz_save_pbs_env]
  • Setting:
    • HOSTFILE: /var/spool/pbs/aux/3061463.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
    • JOBENV_FILE: /home/foremans/.pbsenv
[HOSTS]
  • [host:0] - x3205c0s25b0n0.hsn.cm.polaris.alcf.anl.gov
  • [host:1] - x3205c0s25b1n0.hsn.cm.polaris.alcf.anl.gov
[DIST INFO]
  • NGPUS=8
  • NHOSTS=2
  • NGPU_PER_HOST=4
  • HOSTFILE=/var/spool/pbs/aux/3061463.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
  • DIST_LAUNCH=mpiexec --verbose --envall -n 8 -ppn 4 --hostfile /var/spool/pbs/aux/3061463.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov --cpu-bind depth -d 8
[LAUNCH]:
  • To launch across all available GPUs, use: launch
    launch = mpiexec --verbose --envall -n 8 -ppn 4 --hostfile /var/spool/pbs/aux/3061463.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov --cpu-bind depth -d 8
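With the job environment in place, the `launch` alias runs a command across all of the allocated GPUs. As a quick sanity check, something like the following works (`check_ranks.py` is a hypothetical script, not part of the repo, and assumes mpi4py is available in the environment):

```python
# check_ranks.py -- hypothetical sanity check; run with: launch python3 check_ranks.py
# Assumes mpi4py is available in the active environment.
import socket
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"Hello from rank {comm.Get_rank()} / {comm.Get_size()} on {socket.gethostname()}")
```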
Figure 28: The simplest, fastest repository for training / finetuning GPT-based models.
Prepare Data
$ python3 wordplay/data/shakespeare_char/prepare.py
Using HF_DATASETS_CACHE=/home/foremans/tmp/polaris-talk/2024-07-17-073327/wordplay/data/shakespeare_char/.cache/huggingface
length of dataset in characters: 1,115,394
all the unique characters: !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size: 65
train has 1,003,854 tokens
val has 111,540 tokens
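Roughly, what prepare.py is doing (a simplified sketch with illustrative paths, not the exact script): build a character-level vocabulary from the unique characters, encode the text to integer ids, and write ~90/10 train / val splits as binary files.

```python
import numpy as np

# Simplified sketch of character-level data prep; paths are illustrative.
with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

chars = sorted(set(text))                        # 65 unique characters for Shakespeare
stoi = {ch: i for i, ch in enumerate(chars)}     # char -> integer id

n = len(text)
train_ids = np.array([stoi[c] for c in text[: int(n * 0.9)]], dtype=np.uint16)
val_ids = np.array([stoi[c] for c in text[int(n * 0.9):]], dtype=np.uint16)

train_ids.tofile("train.bin")                    # later read back with np.memmap
val_ids.tofile("val.bin")
```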
Launch Training (DDP)
$ launch python3 -m wordplay \
    train.backend=DDP \
    train.eval_interval=100 \
    data=shakespeare \
    train.dtype=bf16 \
    model.batch_size=64 \
    model.block_size=1024 \
    train.max_iters=1000 \
    train.log_interval=10 \
    train.compile=false \
    | tee wordplay-gpt2-DDP.log
[2024-07-17 07:42:11.746540][INFO][__init__:156] - Setting logging level to 'INFO' on 'RANK == 0'
[2024-07-17 07:42:11.748763][INFO][__init__:157] - Setting logging level to 'CRITICAL' on all others 'RANK != 0'
[2024-07-17 07:42:11.749453][INFO][__init__:160] - To disable this behavior, and log from ALL ranks (not recommended), set: 'export LOG_FROM_ALL_RANKS=1' in your environment, and re-run.
[2024-07-17 07:42:11.772718][INFO][configs:81] - Setting HF_DATASETS_CACHE to /home/foremans/tmp/polaris-talk/2024-07-17-073327/wordplay/.cache/huggingface/datasets
[2024-07-17 07:42:15.341532][INFO][dist:358] - [device='cuda'][rank=2/3][local_rank=2/3][node=0/0]
[2024-07-17 07:42:15.342381][INFO][dist:358] - [device='cuda'][rank=1/3][local_rank=1/3][node=0/0]
[2024-07-17 07:42:15.342430][INFO][dist:358] - [device='cuda'][rank=3/3][local_rank=3/3][node=0/0]
[2024-07-17 07:42:15.348657][INFO][dist:95] - [dist_info]:
  • DEVICE=cuda
  • DEVICE_ID=cuda:0
  • DISTRIBUTED_BACKEND=nccl
  • GPUS_PER_NODE=4
  • HOSTS=['x3101c0s13b0n0.hsn.cm.polaris.alcf.anl.gov']
  • HOSTFILE=/var/spool/pbs/aux/2024084.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
  • HOSTNAME=x3101c0s13b0n0.hsn.cm.polaris.alcf.anl.gov
  • LOCAL_RANK=0
  • MACHINE=Polaris
  • NUM_NODES=1
  • NGPUS=4
  • NGPUS_AVAILABLE=4
  • NODE_ID=0
  • RANK=0
  • SCHEDULER=PBS
  • WORLD_SIZE_TOTAL=4
  • WORLD_SIZE_IN_USE=4
  • LAUNCH_CMD=mpiexec --verbose --envall -n 4 -ppn 4 --hostfile /var/spool/pbs/aux/2024084.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov --cpu-bind depth -d 16
[2024-07-17 07:42:15.351446][INFO][dist:725] - [0/4] Using device='cuda' with backend='DDP' + 'nccl' for distributed training.
[2024-07-17 07:42:15.356169][INFO][dist:358] - [device='cuda'][rank=0/3][local_rank=0/3][node=0/0]
[2024-07-17 07:42:15.356692][WARNING][dist:364] - Using [4 / 4] available "cuda" devices !!
[2024-07-17 07:42:15.359571][INFO][configs:317] - Loading val from /home/foremans/tmp/polaris-talk/2024-07-17-073327/wordplay/data/shakespeare_char/val.bin
[2024-07-17 07:42:15.360138][INFO][configs:317] - Loading train from /home/foremans/tmp/polaris-talk/2024-07-17-073327/wordplay/data/shakespeare_char/train.bin
[2024-07-17 07:42:15.361154][INFO][configs:442] - Tokens per iteration: 262,144
[2024-07-17 07:42:15.361574][INFO][configs:465] - Using self.ptdtype=torch.float16 on self.device_type='cuda'
[2024-07-17 07:42:15.362002][INFO][configs:471] - Initializing a new model from scratch
[2024-07-17 07:42:15.362529][INFO][dist:874] - Setting up wandb from rank: 0
[2024-07-17 07:42:15.362896][INFO][dist:875] - Using: WB PROJECT: WordPlay
[2024-07-17 07:42:16.451786][INFO][dist:905] - W&B RUN: [still-frog-17](https://wandb.ai/aurora_gpt/WordPlay/runs/6by9vpcj)
[2024-07-17 07:42:16.464106][INFO][dist:312] - Updating wandb.run: still-frog-17 config with "DIST_INFO"
[2024-07-17 07:42:16.469424][INFO][dist:938] - Running on machine='Polaris'
[2024-07-17 07:42:16.471151][WARNING][__main__:89] - {
  "train": {"framework": "pytorch", "backend": "DDP", "device": null, "seed": null, "port": null, "ds_config_path": null, "precision": null, "ngpus": null, "use_wandb": true, "eval_interval": 100, "log_interval": 10, "eval_iters": 200, "eval_only": false, "always_save_checkpoint": false, "init_from": "scratch", "wandb_project": "WordPlay", "max_iters": 1000, "warmup_iters": 100, "dtype": "bf16", "compile": false},
  "model": {"n_layer": 12, "n_head": 12, "n_embd": 768, "batch_size": 64, "block_size": 1024, "activation": "gelu", "dropout": 0.0, "bias": false, "vocab_size": 65},
  "data": {"dataset": "shakespeare_char", "out_dir": "out-shakespeare-char", "root_path": null},
  "optimizer": {"gas": 1, "name": "AdamW", "learning_rate": 0.0006, "weight_decay": 0.1, "beta1": 0.9, "beta2": 0.95, "grad_clip": 1.0, "decay_lr": true, "lr_decay_iters": 600000, "min_lr": 6e-05}
}
[2024-07-17 07:42:16.474305][WARNING][__main__:90] - Output dir: /home/foremans/tmp/polaris-talk/outputs/runs/pytorch/DDP/2024-07-17/07-42-13
[2024-07-17 07:42:16.474922][INFO][trainer:246] - Initializing a new model from scratch
[2024-07-17 07:42:17.258904][INFO][model:255] - number of parameters: 85.00M
[2024-07-17 07:42:17.290004][INFO][trainer:264] - Model size: num_params=85003776
[2024-07-17 07:42:17.292626][INFO][model:445] - num decayed parameter tensors: 50, with 85,771,008 parameters
[2024-07-17 07:42:17.293296][INFO][model:449] - num non-decayed parameter tensors: 25, with 19,200 parameters
[2024-07-17 07:42:17.515324][CRITICAL][trainer:316] - "devid='cuda:1'"
[2024-07-17 07:42:17.515340][CRITICAL][trainer:316] - "devid='cuda:2'"
[2024-07-17 07:42:17.515465][CRITICAL][trainer:316] - "devid='cuda:3'"
[2024-07-17 07:42:18.431814][INFO][model:465] - using fused AdamW: True
[2024-07-17 07:42:18.432620][CRITICAL][trainer:316] - "devid='cuda:0'"
[2024-07-17 07:42:19.951020][INFO][trainer:356] - • self.model=GPT((transformer): ModuleDict((wte): Embedding(65, 768) (wpe): Embedding(1024, 768) (drop): Dropout(p=0.0, inplace=False) (h): ModuleList((0-11): 12 x Block((ln_1): LayerNorm() (attn): CausalSelfAttention((c_attn): Linear(in_features=768, out_features=2304, bias=False) (c_proj): Linear(in_features=768, out_features=768, bias=False) (attn_dropout): Dropout(p=0.0, inplace=False) (resid_dropout): Dropout(p=0.0, inplace=False)) (ln_2): LayerNorm() (mlp): MLP((c_fc): Linear(in_features=768, out_features=3072, bias=False) (act_fn): GELU(approximate='none') (c_proj): Linear(in_features=3072, out_features=768, bias=False) (dropout): Dropout(p=0.0, inplace=False)))) (ln_f): LayerNorm()) (lm_head): Linear(in_features=768, out_features=65, bias=False))
[2024-07-17 07:42:19.955340][INFO][trainer:357] - • self.grad_scaler=<torch.cuda.amp.grad_scaler.GradScaler object at 0x145a38f0f090>
[2024-07-17 07:42:19.956897][INFO][trainer:358] - • self.model_engine=DistributedDataParallel((module): GPT((transformer): ModuleDict((wte): Embedding(65, 768) (wpe): Embedding(1024, 768) (drop): Dropout(p=0.0, inplace=False) (h): ModuleList((0-11): 12 x Block((ln_1): LayerNorm() (attn): CausalSelfAttention((c_attn): Linear(in_features=768, out_features=2304, bias=False) (c_proj): Linear(in_features=768, out_features=768, bias=False) (attn_dropout): Dropout(p=0.0, inplace=False) (resid_dropout): Dropout(p=0.0, inplace=False)) (ln_2): LayerNorm() (mlp): MLP((c_fc): Linear(in_features=768, out_features=3072, bias=False) (act_fn): GELU(approximate='none') (c_proj): Linear(in_features=3072, out_features=768, bias=False) (dropout): Dropout(p=0.0, inplace=False)))) (ln_f): LayerNorm()) (lm_head): Linear(in_features=768, out_features=65, bias=False)))
[2024-07-17 07:42:19.961066][INFO][trainer:359] - • self.optimizer=AdamW (Parameter Group 0 amsgrad: False betas: (0.9, 0.95) capturable: False differentiable: False eps: 1e-08 foreach: None fused: True lr: 0.0006 maximize: False weight_decay: 0.1 Parameter Group 1 amsgrad: False betas: (0.9, 0.95) capturable: False differentiable: False eps: 1e-08 foreach: None fused: True lr: 0.0006 maximize: False weight_decay: 0.0)
[2024-07-17 07:42:19.988827][INFO][trainer:802] - Startup time: 6.7125
Training Legend
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ abbr        ┃ desc                           ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ step        │ Current training iteration     │
│ loss        │ Loss value                     │
│ dt          │ Elapsed time per training step │
│ dtf         │ Elapsed time per forward step  │
│ dtb         │ Elapsed time per backward step │
│ sps         │ Samples per second             │
│ sps_per_gpu │ Samples per second (per GPU)   │
│ tps         │ Tokens per second              │
│ tps_per_gpu │ Tokens per second (per GPU)    │
│ mfu         │ Model flops utilization        │
│ train_loss  │ Training loss value            │
│ val_loss    │ Validation loss value          │
└─────────────┴────────────────────────────────┘
[2024-07-17 07:42:21.451865][INFO][trainer:820] - ['prompt']: 'What is an LLM?'
[2024-07-17 07:42:21.452667][INFO][trainer:824] - ['response']: What is an LLM?eelEl'$nltPwBSWal,;PWw bbu'HiyP'FWwF &AhW:ygrn kk-''KFlMwnlEfflkc,elpWaWtgml$Pgglhllw lglhFllzczPAFHpeAAPPSltgkrWPPhlEMgcrN ggPWt-WPSSzHSkkrzzk.FFrtSSkgMll&gFXr,hghaueaVPW-pHFF-gg,,,FF,,kbApgg gg'aWWzzkk'a'CggHl$bGeA,FFk,,SF;UF,,aZ;gglee$,k.US&kg:S,,zVzzc
[2024-07-17 07:43:01.573073][INFO][trainer:885] - step=10 loss=3.154310 dt=0.282833 dtf=0.005247 dtb=0.011417 sps=14.142633 sps_per_gpu=3.535658 tps=926851.609409 tps_per_gpu=231712.902352 mfu=46.288281 train_loss=4.125778 val_loss=4.128809
[2024-07-17 07:43:04.402750][INFO][trainer:885] - step=20 loss=2.660851 dt=0.306263 dtf=0.005233 dtb=0.011419 sps=13.060678 sps_per_gpu=3.265170 tps=855944.613638 tps_per_gpu=213986.153409 mfu=45.934162 train_loss=4.125778 val_loss=4.128809
[2024-07-17 07:43:07.237507][INFO][trainer:885] - step=30 loss=2.543283 dt=0.283021 dtf=0.005238 dtb=0.011245 sps=14.133211 sps_per_gpu=3.533303 tps=926234.088226 tps_per_gpu=231558.522057 mfu=45.966490 train_loss=4.125778 val_loss=4.128809
[2024-07-17 07:43:10.077248][INFO][trainer:885] - step=40 loss=2.503963 dt=0.285001 dtf=0.005213 dtb=0.011471 sps=14.035061 sps_per_gpu=3.508765 tps=919801.749941 tps_per_gpu=229950.437485 mfu=45.963461 train_loss=4.125778 val_loss=4.128809
[2024-07-17 07:43:12.917039][INFO][trainer:885] - step=50 loss=2.477469 dt=0.283532 dtf=0.005166 dtb=0.011294 sps=14.107763 sps_per_gpu=3.526941 tps=924566.380009 tps_per_gpu=231141.595002 mfu=45.984530 train_loss=4.125778 val_loss=4.128809
[2024-07-17 07:43:15.760749][INFO][trainer:885] - step=60 loss=2.471083 dt=0.284630 dtf=0.005140 dtb=0.011224 sps=14.053326 sps_per_gpu=3.513332 tps=920998.786204 tps_per_gpu=230249.696551 mfu=45.985675 train_loss=4.125778 val_loss=4.128809
[2024-07-17 07:43:18.602785][INFO][trainer:885] - step=70 loss=2.458894 dt=0.283926 dtf=0.005219 dtb=0.010383 sps=14.088155 sps_per_gpu=3.522039 tps=923281.352698 tps_per_gpu=230820.338174 mfu=45.998106 train_loss=4.125778 val_loss=4.128809
[2024-07-17 07:43:21.451433][INFO][trainer:885] - step=80 loss=2.489088 dt=0.285537 dtf=0.005183 dtb=0.011373 sps=14.008683 sps_per_gpu=3.502171 tps=918073.060430 tps_per_gpu=229518.265108 mfu=45.983282 train_loss=4.125778 val_loss=4.128809
[2024-07-17 07:43:24.302241][INFO][trainer:885] - step=90 loss=2.471990 dt=0.300767 dtf=0.005445 dtb=0.010290 sps=13.299337 sps_per_gpu=3.324834 tps=871585.359388 tps_per_gpu=217896.339847 mfu=45.737774 train_loss=4.125778 val_loss=4.128809
[2024-07-17 07:43:27.153275][INFO][trainer:885] - step=100 loss=2.445556 dt=0.285869 dtf=0.005182 dtb=0.011251 sps=13.992403 sps_per_gpu=3.498101 tps=917006.151328 tps_per_gpu=229251.537832 mfu=45.743655 train_loss=4.125778 val_loss=4.128809
[2024-07-17 07:43:28.182553][INFO][trainer:820] - ['prompt']: 'What is an LLM?'
[2024-07-17 07:43:28.183179][INFO][trainer:824] - ['response']: What is an LLM?Goupay my winghimithell bls ger t bon sinthard ht omind be,And lereind h py balithand frd oforondof wimon me hageas thinero mand,Thacanes,An frift ghik med d herthecke ntore thack couthen ale, t thit ang d m t h chy me fache ag, wit my hathan glat ng
[2024-07-17 07:44:06.025837][INFO][trainer:760] - Saving checkpoint to: /home/foremans/tmp/polaris-talk/outputs/runs/pytorch/DDP/2024-07-17/07-42-13
[2024-07-17 07:44:06.026607][INFO][trainer:761] - Saving model to: /home/foremans/tmp/polaris-talk/outputs/runs/pytorch/DDP/2024-07-17/07-42-13/model.pth
[2024-07-17 07:44:07.682968][INFO][configs:141] - Appending /home/foremans/tmp/polaris-talk/outputs/runs/pytorch/DDP/2024-07-17/07-42-13 to /home/foremans/tmp/polaris-talk/2024-07-17-073327/wordplay/src/ckpts/checkpoints.log
[2024-07-17 07:44:10.519506][INFO][trainer:885] - step=110 loss=2.433923 dt=0.285038 dtf=0.005757 dtb=0.011762 sps=14.033209 sps_per_gpu=3.508302 tps=919680.367894 tps_per_gpu=229920.091974 mfu=45.762304 train_loss=2.439494 val_loss=2.478951
[2024-07-17 07:44:13.362148][INFO][trainer:885] - step=120 loss=2.429014 dt=0.284445 dtf=0.005222 dtb=0.011486 sps=14.062460 sps_per_gpu=3.515615 tps=921597.361532 tps_per_gpu=230399.340383 mfu=45.788661 train_loss=2.439494 val_loss=2.478951
[2024-07-17 07:44:16.210694][INFO][trainer:885] - step=130 loss=2.402059 dt=0.285559 dtf=0.005199 dtb=0.011765 sps=14.007633 sps_per_gpu=3.501908 tps=918004.211586 tps_per_gpu=229501.052897 mfu=45.794438 train_loss=2.439494 val_loss=2.478951
[2024-07-17 07:44:19.061546][INFO][trainer:885] - step=140 loss=2.374062 dt=0.285476 dtf=0.005239 dtb=0.011453 sps=14.011662 sps_per_gpu=3.502916 tps=918268.297093 tps_per_gpu=229567.074273 mfu=45.800956 train_loss=2.439494 val_loss=2.478951
[2024-07-17 07:44:21.917283][INFO][trainer:885] - step=150 loss=2.365385 dt=0.285846 dtf=0.005125 dtb=0.011320 sps=13.993568 sps_per_gpu=3.498392 tps=917082.475791 tps_per_gpu=229270.618948 mfu=45.800900 train_loss=2.439494 val_loss=2.478951
[2024-07-17 07:44:24.771924][INFO][trainer:885] - step=160 loss=2.317337 dt=0.280788 dtf=0.005173 dtb=0.011249 sps=14.245602 sps_per_gpu=3.561401 tps=933599.792506 tps_per_gpu=233399.948127 mfu=45.883340 train_loss=2.439494 val_loss=2.478951
[2024-07-17 07:44:27.626812][INFO][trainer:885] - step=170 loss=2.256231 dt=0.284973 dtf=0.005141 dtb=0.011299 sps=14.036416 sps_per_gpu=3.509104 tps=919890.544506 tps_per_gpu=229972.636126 mfu=45.889069 train_loss=2.439494 val_loss=2.478951
[2024-07-17 07:44:30.480952][INFO][trainer:885] - step=180 loss=2.216419 dt=0.286555 dtf=0.005180 dtb=0.011402 sps=13.958906 sps_per_gpu=3.489726 tps=914810.852170 tps_per_gpu=228702.713043 mfu=45.868857 train_loss=2.439494 val_loss=2.478951
[2024-07-17 07:44:33.337342][INFO][trainer:885] - step=190 loss=2.145123 dt=0.291456 dtf=0.005409 dtb=0.019347 sps=13.724205 sps_per_gpu=3.431051 tps=899429.467247 tps_per_gpu=224857.366812 mfu=45.773849 train_loss=2.439494 val_loss=2.478951
[2024-07-17 07:44:36.194584][INFO][trainer:885] - step=200 loss=2.068149 dt=0.285703 dtf=0.005153 dtb=0.011286 sps=14.000555 sps_per_gpu=3.500139 tps=917540.393411 tps_per_gpu=229385.098353 mfu=45.778791 train_loss=2.439494 val_loss=2.478951
[2024-07-17 07:44:37.224149][INFO][trainer:820] - ['prompt']: 'What is an LLM?'
[2024-07-17 07:44:37.224745][INFO][trainer:824] - ['response']: What is an LLM?LORTESS LA:No, sighappat selace? don downd sourciceans note cancen up sof liondThis and my man, werame, of re theeThise not will I on land brond sul me a fingore?FLER:Tisint your not nare lame o igen,-to brorst.SamERS:Sin:I'l hell she lor hen w
[2024-07-17 07:45:14.409129][INFO][trainer:760] - Saving checkpoint to: /home/foremans/tmp/polaris-talk/outputs/runs/pytorch/DDP/2024-07-17/07-42-13
[2024-07-17 07:45:14.409820][INFO][trainer:761] - Saving model to: /home/foremans/tmp/polaris-talk/outputs/runs/pytorch/DDP/2024-07-17/07-42-13/model.pth
[2024-07-17 07:45:16.366935][INFO][configs:141] - Appending /home/foremans/tmp/polaris-talk/outputs/runs/pytorch/DDP/2024-07-17/07-42-13 to /home/foremans/tmp/polaris-talk/2024-07-17-073327/wordplay/src/ckpts/checkpoints.log
[2024-07-17 07:45:19.245061][INFO][trainer:885] - step=210 loss=1.982169 dt=0.283305 dtf=0.005223 dtb=0.011284 sps=14.119042 sps_per_gpu=3.529760 tps=925305.515083 tps_per_gpu=231326.378771 mfu=45.822019 train_loss=2.045786 val_loss=2.148510
[2024-07-17 07:45:22.092430][INFO][trainer:885] - step=220 loss=1.897731 dt=0.284759 dtf=0.005217 dtb=0.011187 sps=14.046945 sps_per_gpu=3.511736 tps=920580.608106 tps_per_gpu=230145.152026 mfu=45.837327 train_loss=2.045786 val_loss=2.148510
[2024-07-17 07:45:24.942639][INFO][trainer:885] - step=230 loss=1.817213 dt=0.285266 dtf=0.005208 dtb=0.011446 sps=14.022003 sps_per_gpu=3.505501 tps=918945.985503 tps_per_gpu=229736.496376 mfu=45.842940 train_loss=2.045786 val_loss=2.148510
[2024-07-17 07:45:27.797910][INFO][trainer:885] - step=240 loss=1.779287 dt=0.285465 dtf=0.005189 dtb=0.011220 sps=14.012250 sps_per_gpu=3.503062 tps=918306.793546 tps_per_gpu=229576.698387 mfu=45.844800 train_loss=2.045786 val_loss=2.148510
[2024-07-17 07:45:30.653597][INFO][trainer:885] - step=250 loss=1.704220 dt=0.289284 dtf=0.005471 dtb=0.010346 sps=13.827253 sps_per_gpu=3.456813 tps=906182.836379 tps_per_gpu=226545.709095 mfu=45.785926 train_loss=2.045786 val_loss=2.148510
[2024-07-17 07:45:33.512769][INFO][trainer:885] - step=260 loss=1.671318 dt=0.287679 dtf=0.005125 dtb=0.011250 sps=13.904380 sps_per_gpu=3.476095 tps=911237.442617 tps_per_gpu=227809.360654 mfu=45.758182 train_loss=2.045786 val_loss=2.148510
[2024-07-17 07:45:36.373461][INFO][trainer:885] - step=270 loss=1.650952 dt=0.298661 dtf=0.005118 dtb=0.011520 sps=13.393107 sps_per_gpu=3.348277 tps=877730.651421 tps_per_gpu=219432.662855 mfu=45.565875 train_loss=2.045786 val_loss=2.148510
[2024-07-17 07:45:39.236930][INFO][trainer:885] - step=280 loss=1.573242 dt=0.285970 dtf=0.005171 dtb=0.011290 sps=13.987477 sps_per_gpu=3.496869 tps=916683.279847 tps_per_gpu=229170.819962 mfu=45.587333 train_loss=2.045786 val_loss=2.148510
[2024-07-17 07:45:42.100605][INFO][trainer:885] - step=290 loss=1.533265 dt=0.286487 dtf=0.005432 dtb=0.011288 sps=13.962259 sps_per_gpu=3.490565 tps=915030.617828 tps_per_gpu=228757.654457 mfu=45.598392 train_loss=2.045786 val_loss=2.148510
[2024-07-17 07:45:44.964424][INFO][trainer:885] - step=300 loss=1.492064 dt=0.288480 dtf=0.005355 dtb=0.011480 sps=13.865774 sps_per_gpu=3.466443 tps=908707.340870 tps_per_gpu=227176.835218 mfu=45.576766 train_loss=2.045786 val_loss=2.148510
[2024-07-17 07:45:45.995833][INFO][trainer:820] - ['prompt']: 'What is an LLM?'
[2024-07-17 07:45:45.996497][INFO][trainer:824] - ['response']: What is an LLM?RICHMORD:Char stire? how in those are name the range hone.GLOUCESTER:Nay, in lond's time the palt are worder moreThat wilt in the purpose be a peyAnd thou thine onter hands, and the which broth.ELBOWINCA:At lie my lord with the me an arms be a s
[2024-07-17 07:46:23.549987][INFO][trainer:760] - Saving checkpoint to: /home/foremans/tmp/polaris-talk/outputs/runs/pytorch/DDP/2024-07-17/07-42-13
[2024-07-17 07:46:23.550696][INFO][trainer:761] - Saving model to: /home/foremans/tmp/polaris-talk/outputs/runs/pytorch/DDP/2024-07-17/07-42-13/model.pth
[2024-07-17 07:46:25.496559][INFO][configs:141] - Appending /home/foremans/tmp/polaris-talk/outputs/runs/pytorch/DDP/2024-07-17/07-42-13 to /home/foremans/tmp/polaris-talk/2024-07-17-073327/wordplay/src/ckpts/checkpoints.log
[2024-07-17 07:46:28.374854][INFO][trainer:885] - step=310 loss=1.444200 dt=0.299907 dtf=0.005333 dtb=0.010637 sps=13.337481 sps_per_gpu=3.334370 tps=874085.133345 tps_per_gpu=218521.283336 mfu=45.384395 train_loss=1.495372 val_loss=1.713714
[2024-07-17 07:46:31.223079][INFO][trainer:885] - step=320 loss=1.429350 dt=0.285238 dtf=0.005245 dtb=0.011485 sps=14.023353 sps_per_gpu=3.505838 tps=919034.479880 tps_per_gpu=229758.619970 mfu=45.435743 train_loss=1.495372 val_loss=1.713714
[2024-07-17 07:46:34.074957][INFO][trainer:885] - step=330 loss=1.362220 dt=0.285027 dtf=0.005165 dtb=0.011407 sps=14.033736 sps_per_gpu=3.508434 tps=919714.904826 tps_per_gpu=229928.726207 mfu=45.485355 train_loss=1.495372 val_loss=1.713714
[2024-07-17 07:46:36.929464][INFO][trainer:885] - step=340 loss=1.350888 dt=0.284436 dtf=0.005199 dtb=0.011287 sps=14.062893 sps_per_gpu=3.515723 tps=921625.744709 tps_per_gpu=230406.436177 mfu=45.539549 train_loss=1.495372 val_loss=1.713714
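For reference, train.backend=DDP wraps the model in PyTorch's DistributedDataParallel. Stripped of the logging and checkpointing shown above, the core pattern looks roughly like this (a minimal sketch, not the wordplay trainer itself; `get_batch` and a model that returns (logits, loss) are assumed, and the rendezvous / LOCAL_RANK environment variables are assumed to be provided by the launcher):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_ddp(model: torch.nn.Module, get_batch, max_iters: int = 1000):
    # Rendezvous variables (RANK, WORLD_SIZE, MASTER_ADDR, ...) are assumed
    # to be set by the launcher (ezpz / mpiexec handle this above).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = DDP(model.to(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.1)

    for step in range(max_iters):
        x, y = get_batch("train")              # (batch_size, block_size) token ids on this GPU
        with torch.autocast("cuda", dtype=torch.bfloat16):
            _, loss = model(x, y)              # model returns (logits, loss), nanoGPT-style
        loss.backward()                        # DDP all-reduces gradients during backward
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # grad_clip=1.0 from the config
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

    dist.destroy_process_group()
```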
This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.