# 🍹 BlendCorpus + TorchTitan @ ALCF
Training large language models with BlendCorpus and TorchTitan on supercomputers at the Argonne Leadership Computing Facility.
## 📌 Source Repositories
Things are changing quickly, so to avoid confusion, here are the exact branches used for this demo:
## 🏃‍♂️ Running
Clone repo:
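A minimal sketch, assuming the `auroraGPT-ANL` fork and the `saforem2/blendcorpus` branch that appear in the session logs below; substitute the exact repo and branch from the list above if they differ:

```bash
# NOTE: org/URL and branch are inferred from the session logs below -- treat as assumptions
git clone https://github.com/auroraGPT-ANL/torchtitan
cd torchtitan
git checkout saforem2/blendcorpus
```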
Setup env:
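These are the same two commands captured in the session below: pull in the ezpz shell utilities, then let `ezpz_setup_env` activate the conda module and venv and work out the `mpiexec` launch command from the active PBS job:

```bash
source <(curl -L https://bit.ly/ezpz-utils)
ezpz_setup_env
```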
2025-09-11, on Aurora @ ALCF:
Output:

```
; ssh x4112c1s0b0n0

#[🐍 aurora_nre_models_frameworks-2025.2.0]
#[/f/A/A/E/A/t/a/torchtitan][🌱 saforem2/blendcorpus][📝🤷✓]
#[09/11/25 @ 14:08:35][x4112c1s0b0n0]
; source <(curl -L https://bit.ly/ezpz-utils)

#[🐍 aurora_nre_models_frameworks-2025.2.0]
#[/f/A/A/E/A/t/a/torchtitan][🌱 saforem2/blendcorpus][📝🤷✓]
#[09/11/25 @ 14:08:37][x4112c1s0b0n0]
; ezpz_setup_env
[2025-09-11-140838][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2720] Detected PBS scheduler environment.
[2025-09-11-140838][W][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2756] Current working directory does not match PBS_O_WORKDIR! This may cause issues with the job submission.
[2025-09-11-140838][W][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2757] PBS_O_WORKDIR /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/zhenghh04/torchtitan
[2025-09-11-140838][W][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2758] WORKING_DIR /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/auroraGPT-ANL/torchtitan
[2025-09-11-140838][W][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2759] Exporting PBS_O_WORKDIR=WORKING_DIR=/lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/auroraGPT-ANL/torchtitan and continuing...
[2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2486] running [ezpz_setup_env]...
[2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1298] [PYTHON]
[2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1327] - Conda active, conda=/opt/aurora/25.190.0/frameworks/aurora_nre_models_frameworks-2025.2.0...
[2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1328] - No virtual_env found in environment
[2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1142] - Found python root at /opt/aurora/25.190.0/frameworks/aurora_nre_models_frameworks-2025.2.0
[2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1157] - No VIRTUAL_ENV found in environment!
[2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1160] - Looking for venv in venvs/aurora/torchtitan-aurora_nre_models_frameworks-2025.2.0...
[2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1182] - Activating existing venv in VENV_DIR=venvs/torchtitan-aurora_nre_models_frameworks-2025.2.0
[2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1184] - Found /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/auroraGPT-ANL/torchtitan/venvs/aurora/torchtitan-aurora_nre_models_frameworks-2025.2.0/bin/activate
[2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1353] - Using python from: /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/auroraGPT-ANL/torchtitan/venvs/aurora/torchtitan-aurora_nre_models_frameworks-2025.2.0/bin/python3
[2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2335] [JOB]
[2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2336] - Setting up env for foremans
[2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2337] - Detected pbs scheduler
[2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2338] - Machine: aurora
[2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2339] - Hostname: x4112c1s0b0n0
[2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2249] - PBS_JOBID=7559761.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov to calculate:
    - num_hosts: 2
    - num_cores_per_host: 208
    - num_cpus_per_host: 104
    - num_gpus_per_host: 12
    - depth: 8
    - num_gpus: 24
[2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1754] [HOSTS] - ezpz_print_hosts
[2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1756] - Detected PBS Scheduler
[2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1774] [HOSTS]
[2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1775] - HOSTFILE=/var/spool/pbs/aux/7559761.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
[2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1776] - NHOSTS=2
[2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1777] - HOSTS:
[2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1780] - [host:0] - x4112c1s0b0n0.hsn.cm.aurora.alcf.anl.gov
[2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1780] - [host:1] - x4112c1s1b0n0.hsn.cm.aurora.alcf.anl.gov
[2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1941] [DIST_INFO]
[2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1942] - HOSTFILE=/var/spool/pbs/aux/7559761.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
[2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1943] - NHOSTS=2
[2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1944] - NGPU_PER_HOST=12
[2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1945] - NGPUS=24
[2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1947] [LAUNCH]
[2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1948] - To launch across all available GPUs, use: 'launch'
[2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1949] launch = mpiexec --verbose --envall -n 24 -ppn 12 --hostfile /var/spool/pbs/aux/7559761.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
[2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1950] - Run 'which launch' to ensure that the alias is set correctly
[2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2495] [✓] Finished [ezpz_setup_env] took: 0h:00m:04s
```
Install dependencies. From inside your clone of torchtitan, install each of the following (see the sketch after this list):

- 🍋 ezpz
- 🍹 BlendCorpus
- 🔥 TorchTitan
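A hedged sketch of the installs; the pinned sources should come from the branches listed under Source Repositories, and the repo URLs here are assumptions:

```bash
# From inside your torchtitan clone, with the venv from `ezpz_setup_env` active.
# NOTE: repo URLs are assumptions; install from the branches pinned above.
python3 -m pip install "git+https://github.com/saforem2/ezpz"              # 🍋 ezpz
python3 -m pip install "git+https://github.com/auroraGPT-ANL/blendcorpus"  # 🍹 BlendCorpus
python3 -m pip install -e .                                                # 🔥 TorchTitan
```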
Download artifacts for the model you want to train (see the layout sketch below):

- AuroraGPT-2B
- AuroraGPT-7B
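The download commands are elided here, but the train configs and the logs below pin down where the artifacts must land (`hf_assets_path = "./assets/hf/<flavor>"`); the `mkdir` below is only a hypothetical scaffold:

```bash
# Expected layout, from `hf_assets_path` in the train configs and the tokenizer logs:
#   assets/hf/AuroraGPT-2B/tokenizer.model
#   assets/hf/AuroraGPT-7B/tokenizer.model   # SentencePiece model, vocab size 32000
mkdir -p assets/hf/AuroraGPT-2B assets/hf/AuroraGPT-7B
```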
Launch:
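For example, to train the 7B model (this is the exact command captured in the output below; the 2B model works the same way, pointed at its own config):

```bash
ezpz-launch -m torchtitan.experiments.blendcorpus.train \
    --job.config_file torchtitan/experiments/blendcorpus/train_configs/auroraGPT_7B.toml
```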
Output:

```
#[09/12/25 @ 11:33:56][x4117c4s2b0n0]
; ezpz-launch -m torchtitan.experiments.blendcorpus.train --job.config_file torchtitan/experiments/blendcorpus/train_configs/auroraGPT_7B.toml
[2025-09-12-113711][I][-zsh:91] Using torchtitan/experiments/blendcorpus/train_configs/auroraGPT_7B.toml
[2025-09-12-113711][I][-zsh:91] Logs will be saved to: logs/auroraGPT_7B-2025-09-12-113711.log
[W912 11:37:15.852098275 OperatorEntry.cpp:219] Warning: Warning only once for all operators, other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::geometric_(Tensor(a!) self, float p, *, Generator? generator=None) -> Tensor(a!)
    registered at /lus/tegu/projects/datasets/software/wheelforge/repositories/pytorch_2p8_rel_07_18_2025/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /lus/tegu/projects/datasets/software/wheelforge/repositories/pytorch_2p8_rel_07_18_2025/pytorch/aten/src/ATen/VmapModeRegistrations.cpp:37
       new kernel: registered at /lus/tegu/projects/datasets/software/wheelforge/repositories/ipex_2.8.10_xpu_rel_08_18_2025/intel-extension-for-pytorch/build/Release/csrc/gpu/csrc/gpu/xpu/ATen/RegisterXPU_0.cpp:172 (function operator())
[2025-09-12 11:37:16,879] [INFO] [real_accelerator.py:260:get_accelerator] Setting ds_accelerator to xpu (auto detect)
/opt/aurora/25.190.0/frameworks/aurora_nre_models_frameworks-2025.2.0/lib/python3.10/site-packages/neural_compressor/utils/utility.py:44: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  from pkg_resources import parse_version
[2025-09-12 11:37:30,939] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False
[2025-09-12 11:37:43,263749][I][ezpz/__init__:266:<module>] Setting logging level to 'INFO' on 'RANK == 0'
[2025-09-12 11:37:43,266470][I][ezpz/__init__:267:<module>] Setting logging level to 'CRITICAL' on all others 'RANK != 0'
[2025-09-12 11:37:43,273704][I][ezpz/launch:340:launch] ----[🍋 ezpz.launch][started][2025-09-12-113743]----
[2025-09-12 11:37:47,537879][I][ezpz/launch:345:launch] Job ID: 7591191
[2025-09-12 11:37:47,538702][I][ezpz/launch:346:launch] nodelist: ['x4117c4s2b0n0', 'x4117c4s6b0n0']
[2025-09-12 11:37:47,539093][I][ezpz/launch:347:launch] hostfile: /var/spool/pbs/aux/7591191.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
[2025-09-12 11:37:47,540277][I][ezpz/pbs:188:get_pbs_launch_cmd] ✅ Using [24/24] GPUs [2 hosts] x [12 GPU/host]
[2025-09-12 11:37:47,541233][I][ezpz/launch:316:build_executable] Building command to execute by piecing together:
[2025-09-12 11:37:47,541638][I][ezpz/launch:317:build_executable] (1.) launch_cmd: mpiexec --verbose --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/7591191.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
[2025-09-12 11:37:47,542413][I][ezpz/launch:318:build_executable] (2.) cmd_to_launch: /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/auroraGPT-ANL/torchtitan/venvs/aurora/torchtitan-aurora_nre_models_frameworks-2025.2.0/bin/python3 -m torchtitan.experiments.blendcorpus.train --job.config_file torchtitan/experiments/blendcorpus/train_configs/auroraGPT_7B.toml
[2025-09-12 11:37:47,543429][I][ezpz/launch:360:launch] Took: 4.27 seconds to build command.
[2025-09-12 11:37:47,543810][I][ezpz/launch:363:launch] Executing: mpiexec --verbose --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/7591191.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/auroraGPT-ANL/torchtitan/venvs/aurora/torchtitan-aurora_nre_models_frameworks-2025.2.0/bin/python3 -m torchtitan.experiments.blendcorpus.train --job.config_file torchtitan/experiments/blendcorpus/train_configs/auroraGPT_7B.toml
[2025-09-12 11:37:47,545251][I][ezpz/launch:179:get_aurora_filters] Filtering for Aurora-specific messages. To view list of filters, run with EZPZ_LOG_LEVEL=DEBUG
[2025-09-12 11:37:47,545756][I][ezpz/launch:370:launch] Execution started @ 2025-09-12-113747...
[2025-09-12 11:37:47,546182][I][ezpz/launch:371:launch] ----[🍋 ezpz.launch][stop][2025-09-12-113747]----
[2025-09-12 11:37:47,546634][I][ezpz/launch:99:run_command] Caught 24 filters
[2025-09-12 11:37:47,547002][I][ezpz/launch:100:run_command] Running command: mpiexec --verbose --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/7591191.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/auroraGPT-ANL/torchtitan/venvs/aurora/torchtitan-aurora_nre_models_frameworks-2025.2.0/bin/python3 -m torchtitan.experiments.blendcorpus.train --job.config_file torchtitan/experiments/blendcorpus/train_configs/auroraGPT_7B.toml
Disabling local launch: multi-node application
Connected to tcp://x4117c4s2b0n0.hsn.cm.aurora.alcf.anl.gov:7919
Launching application 422e0368-f389-4475-8131-3de313723140
cpubind:list x4117c4s2b0n0 pid 35392 rank 0 0: mask 0x1c
cpubind:list x4117c4s2b0n0 pid 35393 rank 1 1: mask 0x1c00
cpubind:list x4117c4s2b0n0 pid 35394 rank 2 2: mask 0x1c0000
cpubind:list x4117c4s2b0n0 pid 35395 rank 3 3: mask 0x1c000000
cpubind:list x4117c4s2b0n0 pid 35396 rank 4 4: mask 0x1c00000000
cpubind:list x4117c4s2b0n0 pid 35397 rank 5 5: mask 0x1c0000000000
cpubind:list x4117c4s2b0n0 pid 35398 rank 6 6: mask 0x1c0000000000000
cpubind:list x4117c4s2b0n0 pid 35399 rank 7 7: mask 0x1c000000000000000
cpubind:list x4117c4s2b0n0 pid 35400 rank 8 8: mask 0x1c00000000000000000
cpubind:list x4117c4s2b0n0 pid 35401 rank 9 9: mask 0x1c0000000000000000000
cpubind:list x4117c4s2b0n0 pid 35402 rank 10 10: mask 0x1c000000000000000000000
cpubind:list x4117c4s2b0n0 pid 35403 rank 11 11: mask 0x1c00000000000000000000000
Application 422e0368-f389-4475-8131-3de313723140 started execution
cpubind:list x4117c4s6b0n0 pid 111063 rank 12 0: mask 0x1c
cpubind:list x4117c4s6b0n0 pid 111064 rank 13 1: mask 0x1c00
cpubind:list x4117c4s6b0n0 pid 111065 rank 14 2: mask 0x1c0000
cpubind:list x4117c4s6b0n0 pid 111066 rank 15 3: mask 0x1c000000
cpubind:list x4117c4s6b0n0 pid 111067 rank 16 4: mask 0x1c00000000
cpubind:list x4117c4s6b0n0 pid 111068 rank 17 5: mask 0x1c0000000000
cpubind:list x4117c4s6b0n0 pid 111069 rank 18 6: mask 0x1c0000000000000
cpubind:list x4117c4s6b0n0 pid 111070 rank 19 7: mask 0x1c000000000000000
cpubind:list x4117c4s6b0n0 pid 111071 rank 20 8: mask 0x1c00000000000000000
cpubind:list x4117c4s6b0n0 pid 111072 rank 21 9: mask 0x1c0000000000000000000
cpubind:list x4117c4s6b0n0 pid 111073 rank 22 10: mask 0x1c000000000000000000000
cpubind:list x4117c4s6b0n0 pid 111074 rank 23 11: mask 0x1c00000000000000000000000
operator: aten::geometric_(Tensor(a!) self, float p, *, Generator? generator=None) -> Tensor(a!)
  registered at /lus/tegu/projects/datasets/software/wheelforge/repositories/pytorch_2p8_rel_07_18_2025/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
# [...repeated...]
from pkg_resources import parse_version
# [...repeated...]: TODO: Add this to the list of filters in ezpz
from pkg_resources import parse_version
[2025-09-12 11:38:05,512552][I][ezpz/__init__:266:<module>] Setting logging level to 'INFO' on 'RANK == 0'
[2025-09-12 11:38:05,515164][I][ezpz/__init__:267:<module>] Setting logging level to 'CRITICAL' on all others 'RANK != 0'
[2025-09-12 11:38:07,293955][I][ezpz/dist:1181:setup_torch_distributed] Using fw='ddp' with torch_{device,backend}= {xpu, xccl}
[2025-09-12 11:38:07,295126][I][ezpz/dist:1039:setup_torch_DDP] Caught MASTER_PORT=44635 from environment!
[2025-09-12 11:38:07,295968][I][ezpz/dist:1055:setup_torch_DDP] Using torch.distributed.init_process_group with
  - master_addr='x4117c4s2b0n0.hsn.cm.aurora.alcf.anl.gov'
  - master_port='44635'
  - world_size=24
  - rank=0
  - local_rank=0
  - timeout=datetime.timedelta(seconds=3600)
  - backend='xccl'
[2025-09-12 11:38:07,297280][I][ezpz/dist:772:init_process_group] Calling torch.distributed.init_process_group_with: rank=0 world_size=24 backend=xccl
[2025-09-12 11:38:21,344380][I][ezpz/pbs:188:get_pbs_launch_cmd] ✅ Using [24/24] GPUs [2 hosts] x [12 GPU/host]
[2025-09-12 11:38:21,346401][I][ezpz/dist:450:print_dist_setup] [device='xpu'][rank=0/23][local_rank=0/11][node=0/1]
[2025-09-12 11:38:21,347018][W][utils/_logger:68:warning] Using [24 / 24] available "xpu" devices !!
2025:09:12-11:38:21:(35392) |CCL_WARN| value of CCL_LOG_LEVEL changed to be error (default:warn)
[2025-09-12 11:38:22,154201][I][ezpz/dist:1401:setup_torch] Using device='xpu' with backend='xccl' + 'xccl' for distributed training.
[2025-09-12 11:38:22,155050][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 0/23]
[2025-09-12 11:38:22,154185][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 6/23]
[2025-09-12 11:38:22,154353][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 7/23]
[2025-09-12 11:38:22,154299][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 9/23]
[2025-09-12 11:38:22,154355][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][11/23]
[2025-09-12 11:38:22,154185][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 1/23]
[2025-09-12 11:38:22,154184][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 5/23]
[2025-09-12 11:38:22,154350][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 8/23]
[2025-09-12 11:38:22,154312][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][12/23]
[2025-09-12 11:38:22,154495][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][10/23]
[2025-09-12 11:38:22,154339][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 4/23]
[2025-09-12 11:38:22,154312][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][13/23]
[2025-09-12 11:38:22,154312][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][15/23]
[2025-09-12 11:38:22,154312][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][16/23]
[2025-09-12 11:38:22,154379][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][17/23]
[2025-09-12 11:38:22,154398][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 2/23]
[2025-09-12 11:38:22,154319][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][18/23]
[2025-09-12 11:38:22,154284][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][19/23]
[2025-09-12 11:38:22,154325][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][20/23]
[2025-09-12 11:38:22,154382][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][22/23]
[2025-09-12 11:38:22,154502][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][14/23]
[2025-09-12 11:38:22,154391][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][21/23]
[2025-09-12 11:38:22,154451][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][23/23]
[2025-09-12 11:38:22,154411][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 3/23]
[2025-09-12 11:38:22,694566][I][blendcorpus/train:85:__init__] Starting job: AuroraGPT-7B Training
[2025-09-12 11:38:22,695590][I][blendcorpus/train:93:__init__] Running with args: {
  "activation_checkpoint": {"early_stop": false, "mode": "none", "per_op_sac_force_recompute_mm_shapes_by_fqns": ["moe.router.gate"], "selective_ac_option": "op"},
  "blendcorpus": {"append_eod": true, "blend_sample_in_corpus": false, "data_cache_path": "./.cache/data/auroraGPT-7B/olmo-mix-1124/", "data_file_list": null, "dataloader_type": "single", "eod_token_id": 2, "micro_batch_size": null, "num_workers": 2, "provide_attention_mask": false, "seq_length": null, "shuffle": true, "shuffle_sample_in_corpus": true, "split": "98,1,1"},
  "checkpoint": {"async_mode": "disabled", "create_seed_checkpoint": false, "enable": false, "enable_first_step_checkpoint": false, "exclude_from_loading": [], "export_dtype": "float32", "folder": "checkpoint", "initial_load_in_hf": false, "initial_load_model_only": true, "initial_load_path": null, "interval": 10, "keep_latest_k": 10, "last_save_in_hf": false, "last_save_model_only": false, "load_step": -1},
  "comm": {"init_timeout_seconds": 300, "save_traces_folder": "comm_traces", "trace_buf_size": 20000, "train_timeout_seconds": 100},
  "compile": {"components": ["model", "loss"], "enable": true},
  "experimental": {"custom_args_module": "torchtitan.experiments.blendcorpus.job_config", "custom_import": ""},
  "fault_tolerance": {"enable": false, "group_size": 0, "min_replica_size": 1, "process_group": "gloo", "process_group_timeout_ms": 10000, "replica_id": 0, "semi_sync_method": null},
  "float8": {"emulate": false, "enable_fsdp_float8_all_gather": false, "filter_fqns": ["output"], "moe_fqns_prototype": [], "precompute_float8_dynamic_scale_for_fsdp": false, "recipe_name": null},
  "job": {"config_file": "torchtitan/experiments/blendcorpus/train_configs/auroraGPT_7B.toml", "description": "AuroraGPT-7B Training", "dump_folder": "./outputs/AuroraGPT-7B", "print_args": true, "use_for_integration_test": true},
  "lr_scheduler": {"decay_ratio": 0.8, "decay_type": "linear", "min_lr_factor": 0.0, "warmup_steps": 2},
  "memory_estimation": {"disable_fake_mode": false, "enable": false},
  "metrics": {"disable_color_printing": false, "enable_tensorboard": true, "enable_wandb": true, "log_freq": 1, "save_for_all_ranks": false, "save_tb_folder": "tb"},
  "model": {"converters": [], "flavor": "AuroraGPT-7B", "hf_assets_path": "./assets/hf/AuroraGPT-7B", "name": "blendcorpus", "print_after_conversion": false, "tokenizer_backend": "sptoken", "tokenizer_path": null},
  "mx": {"filter_fqns": ["output"], "moe_fqns_prototype": [], "mxfp8_dim1_cast_kernel_choice": "triton", "recipe_name": "mxfp8_cublas"},
  "optimizer": {"beta1": 0.9, "beta2": 0.95, "early_step_in_backward": false, "eps": 1e-08, "implementation": "fused", "lr": 0.0002, "name": "AdamW", "weight_decay": 0.1},
  "parallelism": {"context_parallel_degree": 1, "context_parallel_rotate_method": "allgather", "data_parallel_replicate_degree": 1, "data_parallel_shard_degree": -1, "disable_loss_parallel": false, "enable_async_tensor_parallel": false, "enable_compiled_autograd": false, "expert_parallel_degree": 1, "expert_tensor_parallel_degree": 1, "fsdp_reshard_after_forward": "default", "module_fqns_per_model_part": null, "pipeline_parallel_degree": 1, "pipeline_parallel_first_stage_less_layers": 1, "pipeline_parallel_last_stage_less_layers": 1, "pipeline_parallel_layers_per_stage": null, "pipeline_parallel_microbatch_size": 1, "pipeline_parallel_schedule": "1F1B", "pipeline_parallel_schedule_csv": "", "pipeline_parallel_split_points": [], "tensor_parallel_degree": 1},
  "profiling": {"enable_memory_snapshot": false, "enable_profiling": false, "profile_freq": 10, "save_memory_snapshot_folder": "memory_snapshot", "save_traces_folder": "profile_trace"},
  "training": {"dataset": "blendcorpus", "dataset_path": "/flare/Aurora_deployment/AuroraGPT/datasets/dolma/dolma_v1_7_file_list_mini.txt", "deterministic": false, "enable_cpu_offload": false, "gc_debug": false, "gc_freq": 50, "global_batch_size": -1, "local_batch_size": 1, "max_norm": 1.0, "mixed_precision_param": "bfloat16", "mixed_precision_reduce": "float32", "seed": null, "seq_len": 4096, "steps": 1000},
  "validation": {"dataset": "c4_validation", "dataset_path": null, "enable": false, "freq": 5, "local_batch_size": 8, "seq_len": 2048, "steps": 10}
}
Number of ranks per node: 12
Is initialized already
[2025-09-12 11:38:22,781763][I][distributed/parallel_dims:158:_build_mesh_without_ep] Building 1-D device mesh with ['dp_shard'], [24]
[2025-09-12 11:38:22,783219][I][tools/utils:65:collect] [GC] Initial GC collection 0.00 seconds
[Tokenizer] Using backend: sptoken (SentencePiece)
# [...repeated...]
[2025-09-12 11:38:22,795599][I][dataset/sptoken:75:build_sentencepiece_tokenizer] [SPTokenizer] Using model path: ./assets/hf/AuroraGPT-7B
[2025-09-12 11:38:22,806079][I][dataset/sptoken:36:__init__] [SPTokenizer] Loaded model: ./assets/hf/AuroraGPT-7B/tokenizer.model, vocab size: 32000
[INFO][2025-09-12 11:38:22.811010] Reading data from /flare/Aurora_deployment/AuroraGPT/datasets/dolma/dolma_v1_7_file_list_mini.txt
[INFO][2025-09-12 11:38:22.811281] Number of datasets: 9
[INFO][2025-09-12 11:38:22.811427] Global batch size: 24
[INFO][2025-09-12 11:38:22.811559] Training iterations: 1000
[INFO][2025-09-12 11:38:22.811682] Evaluation iterations: 0
[INFO][2025-09-12 11:38:22.811805] Total number of training samples: 24000
[INFO][2025-09-12 11:38:22.811932] Total number of evaluation samples: 0
[INFO][2025-09-12 11:38:22.812052] Total number of testing samples: 0
[2025-09-12 11:38:23,388839][I][data/gpt_dataset:263:_cache_indices] > loading algebraic corpus dataset index: ./.cache/data/auroraGPT-7B/olmo-mix-1124/eaaf4239bf399fe90985648264fc597b_index.npy
[2025-09-12 11:38:23,400289][I][data/gpt_dataset:270:_cache_indices] > loading algebraic corpus dataset sample index: ./.cache/data/auroraGPT-7B/olmo-mix-1124/eaaf4239bf399fe90985648264fc597b_sample_index.npy
[2025-09-12 11:38:23,401313][I][data/gpt_dataset:277:_cache_indices] > finished loading in 0.01251498400233686 seconds
[2025-09-12 11:38:23,402526][I][data/gpt_dataset:291:__init__] [BuildCorpusDataset] Caught args.shuffle_sample_in_corpus=True across 19984 samples
[2025-09-12 11:38:23,498032][I][data/gpt_dataset:263:_cache_indices] > loading arxiv corpus dataset index: ./.cache/data/auroraGPT-7B/olmo-mix-1124/9958f3591de484302ca17a2f1feeafaf_index.npy
[2025-09-12 11:38:23,502674][I][data/gpt_dataset:270:_cache_indices] > loading arxiv corpus dataset sample index: ./.cache/data/auroraGPT-7B/olmo-mix-1124/9958f3591de484302ca17a2f1feeafaf_sample_index.npy
[2025-09-12 11:38:23,506868][I][data/gpt_dataset:277:_cache_indices] > finished loading in 0.008856782980728894 seconds
[2025-09-12 11:38:23,507665][I][data/gpt_dataset:291:__init__] [BuildCorpusDataset] Caught args.shuffle_sample_in_corpus=True across 4140 samples
[2025-09-12 11:38:23,520625][I][data/blendable_dataset:131:__init__] > loading blendable dataset index: ./.cache/data/auroraGPT-7B/olmo-mix-1124/212cf8a136a93b7bc9cd575fa4c82f21_index.npy
[2025-09-12 11:38:23,527379][I][data/blendable_dataset:134:__init__] > loading blendable dataset sample index: ./.cache/data/auroraGPT-7B/olmo-mix-1124/212cf8a136a93b7bc9cd575fa4c82f21_sample_index.npy
[2025-09-12 11:38:23,532038][I][data/blendable_dataset:139:__init__] > finished loading in 0.011423073010519147 seconds
[2025-09-12 11:38:23,543427][I][data/blendable_dataset:152:__init__] > size of blendable dataset: 24124 samples
[2025-09-12 11:38:23,544235][I][blendcorpus/train:177:__init__] Using BlendCorpus dataloader.
[2025-09-12 11:38:23,544713][I][blendcorpus/train:185:__init__] Building blendcorpus AuroraGPT-7B with TransformerModelArgs(_enforced='This field is used to enforce all fields have defaults.', dim=4096, n_layers=32, n_heads=32, n_kv_heads=8, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, rope_theta=10000, max_seq_len=4096, depth_init=True, use_flex_attn=False, attn_mask_type='causal', eos_id=0)
wandb: Currently logged in as: foremans (aurora_gpt) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.21.3
wandb: Run data is saved locally in ./outputs/AuroraGPT-7B/tb/20250912-1138/wandb/run-20250912_113823-qzle9mdw
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run snowy-sunset-14
wandb: View project at https://wandb.ai/aurora_gpt/torchtitan
wandb: View run at https://wandb.ai/aurora_gpt/torchtitan/runs/qzle9mdw
[2025-09-12 11:38:24,703889][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
# [...repeated...]
[2025-09-12 11:38:25,005897][I][components/metrics:155:__init__] WandB logging enabled
[2025-09-12 11:38:25,012474][I][components/metrics:124:__init__] TensorBoard logging enabled. Logs will be saved at ./outputs/AuroraGPT-7B/tb/20250912-1138
[2025-09-12 11:38:25,017569][I][components/metrics:101:build_device_memory_monitor] XPU capacity: Intel(R) Data Center GPU Max 1550 with 63.98GiB memory
[2025-09-12 11:38:25,149299][I][blendcorpus/train:212:__init__] Model blendcorpus AuroraGPT-7B size: 5,933,109,248 total parameters
[2025-09-12 11:38:25,150242][I][components/loss:28:build_cross_entropy_loss] Compiling the loss function with torch.compile
[2025-09-12 11:38:25,190998][I][infra/parallelize:357:apply_compile] Compiling each TransformerBlock with torch.compile
[2025-09-12 11:38:25,271332][I][infra/parallelize:122:parallelize_llama] Applied FSDP to the model
[2025-09-12 11:38:25,770147][I][blendcorpus/train:290:__init__] Peak FLOPS used for computing MFU: 2.982e+14
[2025-09-12 11:38:25,771316][I][blendcorpus/train:292:__init__] XPU memory usage for model: 1.04GiB(1.63%)
[2025-09-12 11:38:25,774216][I][distributed/utils:225:maybe_enable_amp] Mixed precision training is handled by fully_shard
[2025-09-12 11:38:25,774808][I][blendcorpus/train:381:__init__] Trainer is initialized with local batch size 1, global batch size 24, gradient accumulation steps 1, sequence length 4096, total steps 1000 (warmup 2)
[2025-09-12 11:38:25,775505][I][blendcorpus/train:695:<module>] Using SDPBackend.FLASH_ATTENTION backend for SDPA
[2025-09-12 11:38:25,776216][I][blendcorpus/train:569:train] BlendCorpus dataloader advanced to consumed =0 samples (step={self.step}).
[2025-09-12 11:38:25,776915][I][blendcorpus/train:581:train] Training starts at step 1.
[2025-09-12 11:39:11,844905][I][components/metrics:442:log] step:  1  loss: 10.8919  grad_norm:  5.7773  memory: 21.74GiB(33.98%)  tps: 88  tflops: 3.62  mfu: 1.21%
[2025-09-12 11:39:11,847254][I][distributed/utils:299:set_pg_timeouts] Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[2025-09-12 11:39:13,996720][I][components/metrics:442:log] step:  2  loss: 15.4482  grad_norm: 95.7768  memory: 23.63GiB(36.93%)  tps: 1,906  tflops: 78.63  mfu: 26.37%
[2025-09-12 11:39:16,148721][I][components/metrics:442:log] step:  3  loss: 18.1145  grad_norm: 177.2544  memory: 23.63GiB(36.93%)  tps: 1,905  tflops: 78.60  mfu: 26.36%
[2025-09-12 11:39:18,293594][I][components/metrics:442:log] step:  4  loss: 12.2966  grad_norm: 47.6269  memory: 23.63GiB(36.93%)  tps: 1,912  tflops: 78.86  mfu: 26.45%
[2025-09-12 11:39:20,423330][I][components/metrics:442:log] step:  5  loss: 12.4196  grad_norm: 55.3153  memory: 23.63GiB(36.93%)  tps: 1,925  tflops: 79.42  mfu: 26.63%
[2025-09-12 11:39:22,550981][I][components/metrics:442:log] step:  6  loss: 10.8771  grad_norm:  5.3124  memory: 23.63GiB(36.93%)  tps: 1,927  tflops: 79.50  mfu: 26.66%
[2025-09-12 11:39:24,670689][I][components/metrics:442:log] step:  7  loss: 10.9488  grad_norm: 41.6404  memory: 23.63GiB(36.93%)  tps: 1,934  tflops: 79.80  mfu: 26.76%
[2025-09-12 11:39:26,791101][I][components/metrics:442:log] step:  8  loss:  9.9818  grad_norm: 18.3422  memory: 23.63GiB(36.93%)  tps: 1,934  tflops: 79.77  mfu: 26.75%
[2025-09-12 11:39:28,911059][I][components/metrics:442:log] step:  9  loss:  9.0792  grad_norm:  9.5251  memory: 23.63GiB(36.93%)  tps: 1,934  tflops: 79.79  mfu: 26.76%
[2025-09-12 11:39:31,025851][I][components/metrics:442:log] step: 10  loss:  8.4230  grad_norm:  4.9722  memory: 23.63GiB(36.93%)  tps: 1,939  tflops: 79.98  mfu: 26.82%
[2025-09-12 11:39:33,138436][I][components/metrics:442:log] step: 11  loss:  8.0111  grad_norm:  4.7603  memory: 23.63GiB(36.93%)  tps: 1,941  tflops: 80.07  mfu: 26.85%
[2025-09-12 11:39:35,250642][I][components/metrics:442:log] step: 12  loss:  7.8059  grad_norm:  9.0702  memory: 23.63GiB(36.93%)  tps: 1,941  tflops: 80.08  mfu: 26.85%
[2025-09-12 11:39:37,361018][I][components/metrics:442:log] step: 13  loss:  7.3035  grad_norm:  5.1540  memory: 23.63GiB(36.93%)  tps: 1,943  tflops: 80.15  mfu: 26.88%
[2025-09-12 11:39:39,472014][I][components/metrics:442:log] step: 14  loss:  7.1419  grad_norm:  4.1700  memory: 23.63GiB(36.93%)  tps: 1,942  tflops: 80.13  mfu: 26.87%
[2025-09-12 11:39:41,584217][I][components/metrics:442:log] step: 15  loss:  6.9347  grad_norm:  4.9882  memory: 23.63GiB(36.93%)  tps: 1,941  tflops: 80.08  mfu: 26.86%
[2025-09-12 11:39:43,690898][I][components/metrics:442:log] step: 16  loss:  7.3633  grad_norm: 31.0589  memory: 23.63GiB(36.93%)  tps: 1,946  tflops: 80.29  mfu: 26.93%
[2025-09-12 11:39:45,799715][I][components/metrics:442:log] step: 17  loss:  7.1793  grad_norm: 13.7271  memory: 23.63GiB(36.93%)  tps: 1,944  tflops: 80.21  mfu: 26.90%
[2025-09-12 11:39:47,907438][I][components/metrics:442:log] step: 18  loss:  7.2268  grad_norm: 10.9098  memory: 23.63GiB(36.93%)  tps: 1,945  tflops: 80.25  mfu: 26.91%
[2025-09-12 11:39:50,018253][I][components/metrics:442:log] step: 19  loss:  6.9895  grad_norm:  6.6582  memory: 23.63GiB(36.93%)  tps: 1,942  tflops: 80.13  mfu: 26.87%
[2025-09-12 11:39:52,127309][I][components/metrics:442:log] step: 20  loss:  6.7515  grad_norm:  3.5633  memory: 23.63GiB(36.93%)  tps: 1,944  tflops: 80.20  mfu: 26.90%
[2025-09-12 11:39:54,237784][I][components/metrics:442:log] step: 21  loss:  6.7755  grad_norm:  3.6999  memory: 23.63GiB(36.93%)  tps: 1,943  tflops: 80.15  mfu: 26.88%
[2025-09-12 11:39:56,348825][I][components/metrics:442:log] step: 22  loss:  6.9412  grad_norm:  3.5428  memory: 23.63GiB(36.93%)  tps: 1,942  tflops: 80.13  mfu: 26.87%
[2025-09-12 11:39:58,460931][I][components/metrics:442:log] step: 23  loss:  6.8696  grad_norm:  2.8968  memory: 23.63GiB(36.93%)  tps: 1,941  tflops: 80.08  mfu: 26.86%
[2025-09-12 11:40:00,572489][I][components/metrics:442:log] step: 24  loss:  6.6327  grad_norm:  5.1677  memory: 23.63GiB(36.93%)  tps: 1,942  tflops: 80.11  mfu: 26.86%
[2025-09-12 11:40:02,683070][I][components/metrics:442:log] step: 25  loss:  6.7134  grad_norm:  3.7672  memory: 23.63GiB(36.93%)  tps: 1,943  tflops: 80.14  mfu: 26.88%
[2025-09-12 11:40:04,793520][I][components/metrics:442:log] step: 26  loss:  6.5521  grad_norm:  3.4081  memory: 23.63GiB(36.93%)  tps: 1,943  tflops: 80.15  mfu: 26.88%
[2025-09-12 11:40:06,906933][I][components/metrics:442:log] step: 27  loss:  6.6118  grad_norm:  2.8971  memory: 23.63GiB(36.93%)  tps: 1,940  tflops: 80.04  mfu: 26.84%
[2025-09-12 11:40:09,019771][I][components/metrics:442:log] step: 28  loss:  6.7229  grad_norm:  2.6085  memory: 23.63GiB(36.93%)  tps: 1,941  tflops: 80.06  mfu: 26.85%
[2025-09-12 11:40:11,135250][I][components/metrics:442:log] step: 29  loss:  6.5777  grad_norm:  2.8184  memory: 23.63GiB(36.93%)  tps: 1,938  tflops: 79.96  mfu: 26.81%
[2025-09-12 11:40:13,249416][I][components/metrics:442:log] step: 30  loss:  6.5954  grad_norm:  2.7959  memory: 23.63GiB(36.93%)  tps: 1,939  tflops: 80.00  mfu: 26.83%
[2025-09-12 11:40:15,364869][I][components/metrics:442:log] step: 31  loss:  6.4546  grad_norm:  3.2096  memory: 23.63GiB(36.93%)  tps: 1,938  tflops: 79.96  mfu: 26.82%
[2025-09-12 11:40:17,476265][I][components/metrics:442:log] step: 32  loss:  6.6677  grad_norm:  2.1374  memory: 23.63GiB(36.93%)  tps: 1,942  tflops: 80.11  mfu: 26.87%
[2025-09-12 11:40:19,590038][I][components/metrics:442:log] step: 33  loss:  6.5451  grad_norm:  2.0738  memory: 23.63GiB(36.93%)  tps: 1,940  tflops: 80.02  mfu: 26.84%
[2025-09-12 11:40:21,706964][I][components/metrics:442:log] step: 34  loss:  6.7087  grad_norm:  2.5267  memory: 23.63GiB(36.93%)  tps: 1,937  tflops: 79.91  mfu: 26.80%
[2025-09-12 11:40:23,826393][I][components/metrics:442:log] step: 35  loss:  6.3955  grad_norm:  1.9991  memory: 23.63GiB(36.93%)  tps: 1,935  tflops: 79.81  mfu: 26.76%
[2025-09-12 11:40:25,943121][I][components/metrics:442:log] step: 36  loss:  6.4686  grad_norm:  1.5817  memory: 23.63GiB(36.93%)  tps: 1,937  tflops: 79.91  mfu: 26.80%
[2025-09-12 11:40:28,062842][I][components/metrics:442:log] step: 37  loss:  6.3481  grad_norm:  2.6166  memory: 23.63GiB(36.93%)  tps: 1,934  tflops: 79.79  mfu: 26.76%
[2025-09-12 11:40:30,184717][I][components/metrics:442:log] step: 38  loss:  6.4443  grad_norm:  2.5323  memory: 23.63GiB(36.93%)  tps: 1,932  tflops: 79.71  mfu: 26.73%
[2025-09-12 11:40:32,305122][I][components/metrics:442:log] step: 39  loss:  6.2732  grad_norm:  2.1087  memory: 23.63GiB(36.93%)  tps: 1,934  tflops: 79.77  mfu: 26.75%
[2025-09-12 11:40:34,431400][I][components/metrics:442:log] step: 40  loss:  6.1638  grad_norm:  1.6096  memory: 23.63GiB(36.93%)  tps: 1,928  tflops: 79.55  mfu: 26.68%
[2025-09-12 11:40:36,558993][I][components/metrics:442:log] step: 41  loss:  6.2434  grad_norm:  2.1429  memory: 23.63GiB(36.93%)  tps: 1,927  tflops: 79.50  mfu: 26.66%
[2025-09-12 11:40:38,684159][I][components/metrics:442:log] step: 42  loss:  6.2472  grad_norm:  1.9758  memory: 23.63GiB(36.93%)  tps: 1,929  tflops: 79.59  mfu: 26.69%
[2025-09-12 11:40:40,811350][I][components/metrics:442:log] step: 43  loss:  6.0686  grad_norm:  2.0387  memory: 23.63GiB(36.93%)  tps: 1,927  tflops: 79.52  mfu: 26.67%
[2025-09-12 11:40:42,942820][I][components/metrics:442:log] step: 44  loss:  6.0512  grad_norm:  1.7659  memory: 23.63GiB(36.93%)  tps: 1,924  tflops: 79.36  mfu: 26.61%
[2025-09-12 11:40:45,071924][I][components/metrics:442:log] step: 45  loss:  5.9693  grad_norm:  3.0356  memory: 23.63GiB(36.93%)  tps: 1,926  tflops: 79.44  mfu: 26.64%
[2025-09-12 11:40:47,202347][I][components/metrics:442:log] step: 46  loss:  6.1370  grad_norm:  2.2346  memory: 23.63GiB(36.93%)  tps: 1,924  tflops: 79.39  mfu: 26.62%
[2025-09-12 11:40:49,335707][I][components/metrics:442:log] step: 47  loss:  6.0951  grad_norm:  2.2721  memory: 23.63GiB(36.93%)  tps: 1,922  tflops: 79.29  mfu: 26.59%
[2025-09-12 11:40:51,472182][I][components/metrics:442:log] step: 48  loss:  6.1080  grad_norm:  2.3427  memory: 23.63GiB(36.93%)  tps: 1,919  tflops: 79.17  mfu: 26.55%
[2025-09-12 11:40:53,607441][I][components/metrics:442:log] step: 49  loss:  5.8213  grad_norm:  2.4015  memory: 23.63GiB(36.93%)  tps: 1,920  tflops: 79.22  mfu: 26.57%
[2025-09-12 11:40:53,644423][I][tools/utils:65:collect] [GC] Performing periodical GC collection 0.04 seconds
[2025-09-12 11:40:55,782338][I][components/metrics:442:log] step: 50  loss:  6.0710  grad_norm:  2.2237  memory: 23.63GiB(36.93%)  tps: 1,885  tflops: 77.77  mfu: 26.08%
[2025-09-12 11:40:57,921332][I][components/metrics:442:log] step: 51  loss:  5.6129  grad_norm:  1.8282  memory: 23.63GiB(36.93%)  tps: 1,917  tflops: 79.07  mfu: 26.52%
[2025-09-12 11:41:00,060512][I][components/metrics:442:log] step: 52  loss:  5.8381  grad_norm:  2.2276  memory: 23.63GiB(36.93%)  tps: 1,917  tflops: 79.07  mfu: 26.52%
[2025-09-12 11:41:02,201596][I][components/metrics:442:log] step: 53  loss:  5.5789  grad_norm:  1.8904  memory: 23.63GiB(36.93%)  tps: 1,915  tflops: 79.00  mfu: 26.49%
[2025-09-12 11:41:04,338853][I][components/metrics:442:log] step: 54  loss:  5.5972  grad_norm:  1.9285  memory: 23.63GiB(36.93%)  tps: 1,918  tflops: 79.14  mfu: 26.54%
[2025-09-12 11:41:06,483940][I][components/metrics:442:log] step: 55  loss:  5.5264  grad_norm:  2.1031  memory: 23.63GiB(36.93%)  tps: 1,911  tflops: 78.86  mfu: 26.45%
[2025-09-12 11:41:08,626486][I][components/metrics:442:log] step: 56  loss:  5.6756  grad_norm:  1.8958  memory: 23.63GiB(36.93%)  tps: 1,914  tflops: 78.95  mfu: 26.48%
[2025-09-12 11:41:10,768986][I][components/metrics:442:log] step: 57  loss:  5.5827  grad_norm:  1.9008  memory: 23.63GiB(36.93%)  tps: 1,914  tflops: 78.95  mfu: 26.48%
[2025-09-12 11:41:12,915983][I][components/metrics:442:log] step: 58  loss:  6.1343  grad_norm:  2.2042  memory: 23.63GiB(36.93%)  tps: 1,910  tflops: 78.78  mfu: 26.42%
[2025-09-12 11:41:15,057467][I][components/metrics:442:log] step: 59  loss:  5.7517  grad_norm:  1.7251  memory: 23.63GiB(36.93%)  tps: 1,914  tflops: 78.98  mfu: 26.49%
[2025-09-12 11:41:17,195890][I][components/metrics:442:log] step: 60  loss:  5.5449  grad_norm:  1.7781  memory: 23.63GiB(36.93%)  tps: 1,917  tflops: 79.10  mfu: 26.53%
[2025-09-12 11:41:19,340106][I][components/metrics:442:log] step: 61  loss:  5.5037  grad_norm:  1.8137  memory: 23.63GiB(36.93%)  tps: 1,912  tflops: 78.88  mfu: 26.45%
[2025-09-12 11:41:21,479998][I][components/metrics:442:log] step: 62  loss:  5.5703  grad_norm:  2.2754  memory: 23.63GiB(36.93%)  tps: 1,916  tflops: 79.04  mfu: 26.51%
[2025-09-12 11:41:23,619646][I][components/metrics:442:log] step: 63  loss:  5.3396  grad_norm:  1.9820  memory: 23.63GiB(36.93%)  tps: 1,916  tflops: 79.06  mfu: 26.51%
[2025-09-12 11:41:25,758931][I][components/metrics:442:log] step: 64  loss:  5.2862  grad_norm:  2.1926  memory: 23.63GiB(36.93%)  tps: 1,917  tflops: 79.07  mfu: 26.52%
[2025-09-12 11:41:27,902443][I][components/metrics:442:log] step: 65  loss:  5.3883  grad_norm:  1.8266  memory: 23.63GiB(36.93%)  tps: 1,913  tflops: 78.91  mfu: 26.46%
[2025-09-12 11:41:30,047189][I][components/metrics:442:log] step: 66  loss:  5.3715  grad_norm:  1.8546  memory: 23.63GiB(36.93%)  tps: 1,912  tflops: 78.86  mfu: 26.45%
[2025-09-12 11:41:32,191202][I][components/metrics:442:log] step: 67  loss:  5.3473  grad_norm:  1.8945  memory: 23.63GiB(36.93%)  tps: 1,912  tflops: 78.88  mfu: 26.45%
[2025-09-12 11:41:34,336648][I][components/metrics:442:log] step: 68  loss:  5.4083  grad_norm:  1.6982  memory: 23.63GiB(36.93%)  tps: 1,911  tflops: 78.83  mfu: 26.44%
[2025-09-12 11:41:36,480695][I][components/metrics:442:log] step: 69  loss:  5.2105  grad_norm:  1.5840  memory: 23.63GiB(36.93%)  tps: 1,912  tflops: 78.89  mfu: 26.45%
[2025-09-12 11:41:38,625671][I][components/metrics:442:log] step: 70  loss:  5.2483  grad_norm:  1.8750  memory: 23.63GiB(36.93%)  tps: 1,911  tflops: 78.85  mfu: 26.44%
[2025-09-12 11:41:40,772186][I][components/metrics:442:log] step: 71  loss:  5.1239  grad_norm:  1.9717  memory: 23.63GiB(36.93%)  tps: 1,910  tflops: 78.80  mfu: 26.43%
[2025-09-12 11:41:42,918729][I][components/metrics:442:log] step: 72  loss:  5.3355  grad_norm:  1.8882  memory: 23.63GiB(36.93%)  tps: 1,910  tflops: 78.79  mfu: 26.42%
[2025-09-12 11:41:45,066384][I][components/metrics:442:log] step: 73  loss:  5.0560  grad_norm:  1.6971  memory: 23.63GiB(36.93%)  tps: 1,909  tflops: 78.76  mfu: 26.41%
[2025-09-12 11:41:47,209176][I][components/metrics:442:log] step: 74  loss:  5.0859  grad_norm:  2.6819  memory: 23.63GiB(36.93%)  tps: 1,913  tflops: 78.93  mfu: 26.47%
[2025-09-12 11:41:49,355442][I][components/metrics:442:log] step: 75  loss:  5.2856  grad_norm:  1.8572  memory: 23.63GiB(36.93%)  tps: 1,910  tflops: 78.81  mfu: 26.43%
[2025-09-12 11:41:51,499099][I][components/metrics:442:log] step: 76  loss:  5.2415  grad_norm:  1.4722  memory: 23.63GiB(36.93%)  tps: 1,913  tflops: 78.90  mfu: 26.46%
[2025-09-12 11:41:53,642872][I][components/metrics:442:log] step: 77  loss:  5.1465  grad_norm:  1.6991  memory: 23.63GiB(36.93%)  tps: 1,913  tflops: 78.90  mfu: 26.46%
[2025-09-12 11:41:55,790222][I][components/metrics:442:log] step: 78  loss:  4.9042  grad_norm:  2.5348  memory: 23.63GiB(36.93%)  tps: 1,909  tflops: 78.77  mfu: 26.41%
[2025-09-12 11:41:57,938398][I][components/metrics:442:log] step: 79  loss:  5.1845  grad_norm:  2.1790  memory: 23.63GiB(36.93%)  tps: 1,908  tflops: 78.73  mfu: 26.40%
[2025-09-12 11:42:00,085052][I][components/metrics:442:log] step: 80  loss:  5.0380  grad_norm:  1.8122  memory: 23.63GiB(36.93%)  tps: 1,910  tflops: 78.79  mfu: 26.42%
[2025-09-12 11:42:02,229187][I][components/metrics:442:log] step: 81  loss:  5.1028  grad_norm:  2.3178  memory: 23.63GiB(36.93%)  tps: 1,912  tflops: 78.89  mfu: 26.46%
[2025-09-12 11:42:04,376585][I][components/metrics:442:log] step: 82  loss:  4.9639  grad_norm:  1.7682  memory: 23.63GiB(36.93%)  tps: 1,909  tflops: 78.77  mfu: 26.42%
[2025-09-12 11:42:06,522266][I][components/metrics:442:log] step: 83  loss:  5.1079  grad_norm:  2.0751  memory: 23.63GiB(36.93%)  tps: 1,911  tflops: 78.83  mfu: 26.44%
[2025-09-12 11:42:08,668032][I][components/metrics:442:log] step: 84  loss:  5.0744  grad_norm:  1.4189  memory: 23.63GiB(36.93%)  tps: 1,911  tflops: 78.82  mfu: 26.43%
```
## Citation
BibTeX citation:
```bibtex
@online{foreman2025,
  author = {Foreman, Sam},
  title = {🍹 {BlendCorpus} + {TorchTitan} @ {ALCF}},
  date = {2025-09-12},
  url = {https://samforeman.me/posts/2025/09/12/},
  langid = {en}
}
```
For attribution, please cite this work as:
Foreman, Sam. 2025. “🍹 BlendCorpus + TorchTitan @ ALCF.” September 12, 2025. https://samforeman.me/posts/2025/09/12/.