🍹 BlendCorpus + TorchTitan @ ALCF

Training large language models with BlendCorpus and TorchTitan on supercomputers at the Argonne Leadership Computing Facility.
Author: Sam Foreman
Published: September 12, 2025
Modified: October 7, 2025

📌 Source Repositories

Things are changing quickly, so to avoid confusion, here are the exact branches used for this demo:

  • auroraGPT-ANL/torchtitan @ saforem2/blendcorpus
  • saforem2/blendcorpus @ reorg-imports
  • saforem2/ezpz

🏃‍♂️ Running

  • Clone repo:

    git clone https://github.com/auroraGPT-ANL/torchtitan
    cd torchtitan
    git checkout saforem2/blendcorpus
  • Setup env:

    source <(curl -L https://bit.ly/ezpz-utils)
    ezpz_setup_env
    • 2025-09-11, on Aurora @ ALCF:

      output:

      ; ssh x4112c1s0b0n0
      
      #[🐍 aurora_nre_models_frameworks-2025.2.0]
      #[/f/A/A/E/A/t/a/torchtitan][🌱 saforem2/blendcorpus][📝🤷✓] 
      #[09/11/25 @ 14:08:35][x4112c1s0b0n0]
      ; source <(curl -L https://bit.ly/ezpz-utils)
      
      #[🐍 aurora_nre_models_frameworks-2025.2.0]
      #[/f/A/A/E/A/t/a/torchtitan][🌱 saforem2/blendcorpus][📝🤷✓] 
      #[09/11/25 @ 14:08:37][x4112c1s0b0n0]
      ; ezpz_setup_env                                                                                                                                                                      
      [2025-09-11-140838][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2720] Detected PBS scheduler environment.
      [2025-09-11-140838][W][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2756] Current working directory does not match PBS_O_WORKDIR! This may cause issues with the job submission.
      [2025-09-11-140838][W][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2757] PBS_O_WORKDIR /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/zhenghh04/torchtitan
      [2025-09-11-140838][W][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2758] WORKING_DIR /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/auroraGPT-ANL/torchtitan
      [2025-09-11-140838][W][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2759] Exporting PBS_O_WORKDIR=WORKING_DIR=/lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/auroraGPT-ANL/torchtitan and continuing...
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2486] running [ezpz_setup_env]...
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1298] [PYTHON]
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1327]   - Conda active, conda=/opt/aurora/25.190.0/frameworks/aurora_nre_models_frameworks-2025.2.0...
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1328]   - No virtual_env found in environment
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1142]   - Found python root at /opt/aurora/25.190.0/frameworks/aurora_nre_models_frameworks-2025.2.0
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1157]   - No VIRTUAL_ENV found in environment!
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1160]   - Looking for venv in venvs/aurora/torchtitan-aurora_nre_models_frameworks-2025.2.0...
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1182]   - Activating existing venv in VENV_DIR=venvs/torchtitan-aurora_nre_models_frameworks-2025.2.0
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1184]   - Found /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/auroraGPT-ANL/torchtitan/venvs/aurora/torchtitan-aurora_nre_models_frameworks-2025.2.0/bin/activate
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1353]   - Using python from: /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/auroraGPT-ANL/torchtitan/venvs/aurora/torchtitan-aurora_nre_models_frameworks-2025.2.0/bin/python3
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2335] [JOB]
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2336]   - Setting up env for foremans
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2337]   - Detected pbs scheduler
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2338]   - Machine: aurora
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2339]   - Hostname: x4112c1s0b0n0
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2249]   - PBS_JOBID=7559761.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
          to calculate:
            - num_hosts: 2
            - num_cores_per_host: 208
            - num_cpus_per_host: 104
            - num_gpus_per_host: 12
            - depth: 8
            - num_gpus: 24
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1754] [HOSTS] - ezpz_print_hosts
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1756]   - Detected PBS Scheduler
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1774] [HOSTS]
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1775]   - HOSTFILE=/var/spool/pbs/aux/7559761.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1776]   - NHOSTS=2
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1777]   - HOSTS:
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1780]     - [host:0] - x4112c1s0b0n0.hsn.cm.aurora.alcf.anl.gov
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1780]     - [host:1] - x4112c1s1b0n0.hsn.cm.aurora.alcf.anl.gov
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1941] [DIST_INFO]
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1942]   - HOSTFILE=/var/spool/pbs/aux/7559761.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1943]   - NHOSTS=2
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1944]   - NGPU_PER_HOST=12
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1945]   - NGPUS=24
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1947] [LAUNCH]
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1948]   - To launch across all available GPUs, use: 'launch'
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1949]     launch = mpiexec --verbose --envall -n 24 -ppn 12 --hostfile /var/spool/pbs/aux/7559761.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1950]   - Run 'which launch' to ensure that the alias is set correctly
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2495] [] Finished [ezpz_setup_env]
      took: 0h:00m:04s
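
    A quick sanity check after `ezpz_setup_env` (a minimal sketch; it assumes the `launch` alias and the `NHOSTS` / `NGPUS` values reported above are exported into your shell, so adjust if they are not):

      # hedged sanity check; names taken from the ezpz output above
      which launch                                          # should resolve to the mpiexec wrapper
      echo "NHOSTS=${NHOSTS:-unset} NGPUS=${NGPUS:-unset}"  # expect 2 hosts, 24 GPUs here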
  • Install dependencies. From inside your clone of torchtitan:

    • 🍋 ezpz:

      # uv not required, but useful!
      # to download: curl -LsSf https://astral.sh/uv/install.sh | sh
      uv pip install "git+https://github.com/saforem2/ezpz"
    • 🍹 BlendCorpus:

      git clone https://github.com/saforem2/blendcorpus deps/blendcorpus
      cd deps/blendcorpus
      git checkout reorg-imports
      uv pip install -e "."
    • 🔥 TorchTitan:

      # (BlendCorpus can alternatively be installed directly from git:
      #   python3 -m pip install "git+https://github.com/saforem2/blendcorpus@saforem2/reorg-imports")
      # from inside auroraGPT-ANL/torchtitan @ saforem2/blendcorpus:
      python3 -m pip install -e "."
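
    To confirm everything resolved, a quick import check from the same environment (module names assumed from the repository names above; adjust if the packages differ):

      python3 -c "import ezpz; print(ezpz.__file__)"
      python3 -c "import blendcorpus; print(blendcorpus.__file__)"
      python3 -c "import torchtitan; print(torchtitan.__file__)"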
  • Download Artifacts:

    • AuroraGPT-2B:

      python3 scripts/download_hf_assets.py --repo_id google/gemma-7b --assets tokenizer
      mkdir assets/hf/AuroraGPT-2B
      cp assets/hf/gemma-7b/tokenizer.model assets/hf/AuroraGPT-2B
    • AuroraGPT-7B:

      python3 scripts/download_hf_assets.py --repo_id meta-llama/llama-2-7b-hf --assets tokenizer
      mkdir assets/hf/AuroraGPT-7B
      cp assets/hf/llama-2-7b-hf/tokenizer.model assets/hf/AuroraGPT-7B
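
    To verify the copied tokenizer loads correctly, a small SentencePiece check (a sketch, assuming the `sentencepiece` package is importable in the active venv):

      # expect a vocab size of 32000 for the Llama-2 tokenizer,
      # matching the value reported in the training log below
      python3 -c "import sentencepiece as spm; sp = spm.SentencePieceProcessor(model_file='assets/hf/AuroraGPT-7B/tokenizer.model'); print(sp.vocab_size())"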
  • Launch:

    ; ezpz-launch -m torchtitan.experiments.blendcorpus.train --job.config_file torchtitan/experiments/blendcorpus/train_configs/auroraGPT_7B.toml
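
    Since TorchTitan parses dotted flags (the same form as `--job.config_file`), individual TOML keys can typically also be overridden on the command line. For example, a hypothetical short smoke test before committing to the full run:

      # hedged sketch: cap the run at 10 steps instead of the configured 1000
      ezpz-launch -m torchtitan.experiments.blendcorpus.train \
        --job.config_file torchtitan/experiments/blendcorpus/train_configs/auroraGPT_7B.toml \
        --training.steps 10

    The output below is from the full (1000-step) launch: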

    output:

    #[09/12/25 @ 11:33:56][x4117c4s2b0n0]
    ; ezpz-launch -m torchtitan.experiments.blendcorpus.train --job.config_file torchtitan/experiments/blendcorpus/train_configs/auroraGPT_7B.toml
    
    [2025-09-12-113711][I][-zsh:91] Using torchtitan/experiments/blendcorpus/train_configs/auroraGPT_7B.toml
    [2025-09-12-113711][I][-zsh:91] Logs will be saved to: logs/auroraGPT_7B-2025-09-12-113711.log
    [W912 11:37:15.852098275 OperatorEntry.cpp:219] Warning: Warning only once for all operators,  other operators may also be overridden.
      Overriding a previously registered kernel for the same operator and the same dispatch key
      operator: aten::geometric_(Tensor(a!) self, float p, *, Generator? generator=None) -> Tensor(a!)
        registered at /lus/tegu/projects/datasets/software/wheelforge/repositories/pytorch_2p8_rel_07_18_2025/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
      dispatch key: XPU
      previous kernel: registered at /lus/tegu/projects/datasets/software/wheelforge/repositories/pytorch_2p8_rel_07_18_2025/pytorch/aten/src/ATen/VmapModeRegistrations.cpp:37
          new kernel: registered at /lus/tegu/projects/datasets/software/wheelforge/repositories/ipex_2.8.10_xpu_rel_08_18_2025/intel-extension-for-pytorch/build/Release/csrc/gpu/csrc/gpu/xpu/ATen/RegisterXPU_0.cpp:172 (function operator())
    [2025-09-12 11:37:16,879] [INFO] [real_accelerator.py:260:get_accelerator] Setting ds_accelerator to xpu (auto detect)
    /opt/aurora/25.190.0/frameworks/aurora_nre_models_frameworks-2025.2.0/lib/python3.10/site-packages/neural_compressor/utils/utility.py:44: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
      from pkg_resources import parse_version
    [2025-09-12 11:37:30,939] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False
    [2025-09-12 11:37:43,263749][I][ezpz/__init__:266:<module>] Setting logging level to 'INFO' on 'RANK == 0'
    [2025-09-12 11:37:43,266470][I][ezpz/__init__:267:<module>] Setting logging level to 'CRITICAL' on all others 'RANK != 0'
    
    
    [2025-09-12 11:37:43,273704][I][ezpz/launch:340:launch] ----[🍋 ezpz.launch][started][2025-09-12-113743]----
    [2025-09-12 11:37:47,537879][I][ezpz/launch:345:launch] Job ID: 7591191
    [2025-09-12 11:37:47,538702][I][ezpz/launch:346:launch] nodelist: ['x4117c4s2b0n0', 'x4117c4s6b0n0']
    [2025-09-12 11:37:47,539093][I][ezpz/launch:347:launch] hostfile: /var/spool/pbs/aux/7591191.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
    [2025-09-12 11:37:47,540277][I][ezpz/pbs:188:get_pbs_launch_cmd] ✅ Using [24/24] GPUs [2 hosts] x [12 GPU/host]
    [2025-09-12 11:37:47,541233][I][ezpz/launch:316:build_executable] Building command to execute by piecing together:
    [2025-09-12 11:37:47,541638][I][ezpz/launch:317:build_executable] (1.) launch_cmd: mpiexec --verbose --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/7591191.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
    [2025-09-12 11:37:47,542413][I][ezpz/launch:318:build_executable] (2.) cmd_to_launch: /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/auroraGPT-ANL/torchtitan/venvs/aurora/torchtitan-aurora_nre_models_frameworks-2025.2.0/bin/python3 -m torchtitan.experiments.blendcorpus.train --job.config_file torchtitan/experiments/blendcorpus/train_configs/auroraGPT_7B.toml
    [2025-09-12 11:37:47,543429][I][ezpz/launch:360:launch] Took: 4.27 seconds to build command.
    [2025-09-12 11:37:47,543810][I][ezpz/launch:363:launch] Executing:
    mpiexec
      --verbose
      --envall
      --np=24
      --ppn=12
      --hostfile=/var/spool/pbs/aux/7591191.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
      --no-vni
      --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
      /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/auroraGPT-ANL/torchtitan/venvs/aurora/torchtitan-aurora_nre_models_frameworks-2025.2.0/bin/python3
      -m
      torchtitan.experiments.blendcorpus.train
      --job.config_file
      torchtitan/experiments/blendcorpus/train_configs/auroraGPT_7B.toml
    [2025-09-12 11:37:47,545251][I][ezpz/launch:179:get_aurora_filters] Filtering for Aurora-specific messages. To view list of filters, run with EZPZ_LOG_LEVEL=DEBUG
    [2025-09-12 11:37:47,545756][I][ezpz/launch:370:launch] Execution started @ 2025-09-12-113747...
    [2025-09-12 11:37:47,546182][I][ezpz/launch:371:launch] ----[🍋 ezpz.launch][stop][2025-09-12-113747]----
    [2025-09-12 11:37:47,546634][I][ezpz/launch:99:run_command] Caught 24 filters
    [2025-09-12 11:37:47,547002][I][ezpz/launch:100:run_command] Running command:
    mpiexec --verbose --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/7591191.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/auroraGPT-ANL/torchtitan/venvs/aurora/torchtitan-aurora_nre_models_frameworks-2025.2.0/bin/python3 -m torchtitan.experiments.blendcorpus.train --job.config_file torchtitan/experiments/blendcorpus/train_configs/auroraGPT_7B.toml
    Disabling local launch: multi-node application
    Connected to tcp://x4117c4s2b0n0.hsn.cm.aurora.alcf.anl.gov:7919
    Launching application 422e0368-f389-4475-8131-3de313723140
    cpubind:list x4117c4s2b0n0 pid 35392 rank 0 0: mask 0x1c
    cpubind:list x4117c4s2b0n0 pid 35393 rank 1 1: mask 0x1c00
    cpubind:list x4117c4s2b0n0 pid 35394 rank 2 2: mask 0x1c0000
    cpubind:list x4117c4s2b0n0 pid 35395 rank 3 3: mask 0x1c000000
    cpubind:list x4117c4s2b0n0 pid 35396 rank 4 4: mask 0x1c00000000
    cpubind:list x4117c4s2b0n0 pid 35397 rank 5 5: mask 0x1c0000000000
    cpubind:list x4117c4s2b0n0 pid 35398 rank 6 6: mask 0x1c0000000000000
    cpubind:list x4117c4s2b0n0 pid 35399 rank 7 7: mask 0x1c000000000000000
    cpubind:list x4117c4s2b0n0 pid 35400 rank 8 8: mask 0x1c00000000000000000
    cpubind:list x4117c4s2b0n0 pid 35401 rank 9 9: mask 0x1c0000000000000000000
    cpubind:list x4117c4s2b0n0 pid 35402 rank 10 10: mask 0x1c000000000000000000000
    cpubind:list x4117c4s2b0n0 pid 35403 rank 11 11: mask 0x1c00000000000000000000000
    Application 422e0368-f389-4475-8131-3de313723140 started execution
    cpubind:list x4117c4s6b0n0 pid 111063 rank 12 0: mask 0x1c
    cpubind:list x4117c4s6b0n0 pid 111064 rank 13 1: mask 0x1c00
    cpubind:list x4117c4s6b0n0 pid 111065 rank 14 2: mask 0x1c0000
    cpubind:list x4117c4s6b0n0 pid 111066 rank 15 3: mask 0x1c000000
    cpubind:list x4117c4s6b0n0 pid 111067 rank 16 4: mask 0x1c00000000
    cpubind:list x4117c4s6b0n0 pid 111068 rank 17 5: mask 0x1c0000000000
    cpubind:list x4117c4s6b0n0 pid 111069 rank 18 6: mask 0x1c0000000000000
    cpubind:list x4117c4s6b0n0 pid 111070 rank 19 7: mask 0x1c000000000000000
    cpubind:list x4117c4s6b0n0 pid 111071 rank 20 8: mask 0x1c00000000000000000
    cpubind:list x4117c4s6b0n0 pid 111072 rank 21 9: mask 0x1c0000000000000000000
    cpubind:list x4117c4s6b0n0 pid 111073 rank 22 10: mask 0x1c000000000000000000000
    cpubind:list x4117c4s6b0n0 pid 111074 rank 23 11: mask 0x1c00000000000000000000000
      operator: aten::geometric_(Tensor(a!) self, float p, *, Generator? generator=None) -> Tensor(a!)
        registered at /lus/tegu/projects/datasets/software/wheelforge/repositories/pytorch_2p8_rel_07_18_2025/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
    # [...operator warning repeated for each of the remaining ranks...]
      from pkg_resources import parse_version
    # [...repeated...]: TODO: Add this to the list of filters in ezpz
      from pkg_resources import parse_version
    [2025-09-12 11:38:05,512552][I][ezpz/__init__:266:<module>] Setting logging level to 'INFO' on 'RANK == 0'
    [2025-09-12 11:38:05,515164][I][ezpz/__init__:267:<module>] Setting logging level to 'CRITICAL' on all others 'RANK != 0'
    [2025-09-12 11:38:07,293955][I][ezpz/dist:1181:setup_torch_distributed] Using fw='ddp' with torch_{device,backend}= {xpu, xccl}
    [2025-09-12 11:38:07,295126][I][ezpz/dist:1039:setup_torch_DDP] Caught MASTER_PORT=44635 from environment!
    [2025-09-12 11:38:07,295968][I][ezpz/dist:1055:setup_torch_DDP] Using torch.distributed.init_process_group with
    - master_addr='x4117c4s2b0n0.hsn.cm.aurora.alcf.anl.gov'
    - master_port='44635'
    - world_size=24
    - rank=0
    - local_rank=0
    - timeout=datetime.timedelta(seconds=3600)
    - backend='xccl'
    [2025-09-12 11:38:07,297280][I][ezpz/dist:772:init_process_group] Calling torch.distributed.init_process_group_with: rank=0 world_size=24 backend=xccl
    [2025-09-12 11:38:21,344380][I][ezpz/pbs:188:get_pbs_launch_cmd] ✅ Using [24/24] GPUs [2 hosts] x [12 GPU/host]
    [2025-09-12 11:38:21,346401][I][ezpz/dist:450:print_dist_setup] [device='xpu'][rank=0/23][local_rank=0/11][node=0/1]
    [2025-09-12 11:38:21,347018][W][utils/_logger:68:warning] Using [24 / 24] available "xpu" devices !!
    2025:09:12-11:38:21:(35392) |CCL_WARN| value of CCL_LOG_LEVEL changed to be error (default:warn)
    [2025-09-12 11:38:22,154201][I][ezpz/dist:1401:setup_torch] Using device='xpu' with backend='xccl' + 'xccl' for distributed training.
    [2025-09-12 11:38:22,155050][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 0/23]
    [2025-09-12 11:38:22,154185][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 6/23]
    [2025-09-12 11:38:22,154353][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 7/23]
    [2025-09-12 11:38:22,154299][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 9/23]
    [2025-09-12 11:38:22,154355][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][11/23]
    [2025-09-12 11:38:22,154185][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 1/23]
    [2025-09-12 11:38:22,154184][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 5/23]
    [2025-09-12 11:38:22,154350][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 8/23]
    [2025-09-12 11:38:22,154312][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][12/23]
    [2025-09-12 11:38:22,154495][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][10/23]
    [2025-09-12 11:38:22,154339][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 4/23]
    [2025-09-12 11:38:22,154312][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][13/23]
    [2025-09-12 11:38:22,154312][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][15/23]
    [2025-09-12 11:38:22,154312][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][16/23]
    [2025-09-12 11:38:22,154379][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][17/23]
    [2025-09-12 11:38:22,154398][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 2/23]
    [2025-09-12 11:38:22,154319][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][18/23]
    [2025-09-12 11:38:22,154284][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][19/23]
    [2025-09-12 11:38:22,154325][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][20/23]
    [2025-09-12 11:38:22,154382][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][22/23]
    [2025-09-12 11:38:22,154502][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][14/23]
    [2025-09-12 11:38:22,154391][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][21/23]
    [2025-09-12 11:38:22,154451][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][23/23]
    [2025-09-12 11:38:22,154411][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 3/23]
    [2025-09-12 11:38:22,694566][I][blendcorpus/train:85:__init__] Starting job: AuroraGPT-7B Training
    [2025-09-12 11:38:22,695590][I][blendcorpus/train:93:__init__] Running with args: {
      "activation_checkpoint": {
        "early_stop": false,
        "mode": "none",
        "per_op_sac_force_recompute_mm_shapes_by_fqns": [
          "moe.router.gate"
        ],
        "selective_ac_option": "op"
      },
      "blendcorpus": {
        "append_eod": true,
        "blend_sample_in_corpus": false,
        "data_cache_path": "./.cache/data/auroraGPT-7B/olmo-mix-1124/",
        "data_file_list": null,
        "dataloader_type": "single",
        "eod_token_id": 2,
        "micro_batch_size": null,
        "num_workers": 2,
        "provide_attention_mask": false,
        "seq_length": null,
        "shuffle": true,
        "shuffle_sample_in_corpus": true,
        "split": "98,1,1"
      },
      "checkpoint": {
        "async_mode": "disabled",
        "create_seed_checkpoint": false,
        "enable": false,
        "enable_first_step_checkpoint": false,
        "exclude_from_loading": [],
        "export_dtype": "float32",
        "folder": "checkpoint",
        "initial_load_in_hf": false,
        "initial_load_model_only": true,
        "initial_load_path": null,
        "interval": 10,
        "keep_latest_k": 10,
        "last_save_in_hf": false,
        "last_save_model_only": false,
        "load_step": -1
      },
      "comm": {
        "init_timeout_seconds": 300,
        "save_traces_folder": "comm_traces",
        "trace_buf_size": 20000,
        "train_timeout_seconds": 100
      },
      "compile": {
        "components": [
          "model",
          "loss"
        ],
        "enable": true
      },
      "experimental": {
        "custom_args_module": "torchtitan.experiments.blendcorpus.job_config",
        "custom_import": ""
      },
      "fault_tolerance": {
        "enable": false,
        "group_size": 0,
        "min_replica_size": 1,
        "process_group": "gloo",
        "process_group_timeout_ms": 10000,
        "replica_id": 0,
        "semi_sync_method": null
      },
      "float8": {
        "emulate": false,
        "enable_fsdp_float8_all_gather": false,
        "filter_fqns": [
          "output"
        ],
        "moe_fqns_prototype": [],
        "precompute_float8_dynamic_scale_for_fsdp": false,
        "recipe_name": null
      },
      "job": {
        "config_file": "torchtitan/experiments/blendcorpus/train_configs/auroraGPT_7B.toml",
        "description": "AuroraGPT-7B Training",
        "dump_folder": "./outputs/AuroraGPT-7B",
        "print_args": true,
        "use_for_integration_test": true
      },
      "lr_scheduler": {
        "decay_ratio": 0.8,
        "decay_type": "linear",
        "min_lr_factor": 0.0,
        "warmup_steps": 2
      },
      "memory_estimation": {
        "disable_fake_mode": false,
        "enable": false
      },
      "metrics": {
        "disable_color_printing": false,
        "enable_tensorboard": true,
        "enable_wandb": true,
        "log_freq": 1,
        "save_for_all_ranks": false,
        "save_tb_folder": "tb"
      },
      "model": {
        "converters": [],
        "flavor": "AuroraGPT-7B",
        "hf_assets_path": "./assets/hf/AuroraGPT-7B",
        "name": "blendcorpus",
        "print_after_conversion": false,
        "tokenizer_backend": "sptoken",
        "tokenizer_path": null
      },
      "mx": {
        "filter_fqns": [
          "output"
        ],
        "moe_fqns_prototype": [],
        "mxfp8_dim1_cast_kernel_choice": "triton",
        "recipe_name": "mxfp8_cublas"
      },
      "optimizer": {
        "beta1": 0.9,
        "beta2": 0.95,
        "early_step_in_backward": false,
        "eps": 1e-08,
        "implementation": "fused",
        "lr": 0.0002,
        "name": "AdamW",
        "weight_decay": 0.1
      },
      "parallelism": {
        "context_parallel_degree": 1,
        "context_parallel_rotate_method": "allgather",
        "data_parallel_replicate_degree": 1,
        "data_parallel_shard_degree": -1,
        "disable_loss_parallel": false,
        "enable_async_tensor_parallel": false,
        "enable_compiled_autograd": false,
        "expert_parallel_degree": 1,
        "expert_tensor_parallel_degree": 1,
        "fsdp_reshard_after_forward": "default",
        "module_fqns_per_model_part": null,
        "pipeline_parallel_degree": 1,
        "pipeline_parallel_first_stage_less_layers": 1,
        "pipeline_parallel_last_stage_less_layers": 1,
        "pipeline_parallel_layers_per_stage": null,
        "pipeline_parallel_microbatch_size": 1,
        "pipeline_parallel_schedule": "1F1B",
        "pipeline_parallel_schedule_csv": "",
        "pipeline_parallel_split_points": [],
        "tensor_parallel_degree": 1
      },
      "profiling": {
        "enable_memory_snapshot": false,
        "enable_profiling": false,
        "profile_freq": 10,
        "save_memory_snapshot_folder": "memory_snapshot",
        "save_traces_folder": "profile_trace"
      },
      "training": {
        "dataset": "blendcorpus",
        "dataset_path": "/flare/Aurora_deployment/AuroraGPT/datasets/dolma/dolma_v1_7_file_list_mini.txt",
        "deterministic": false,
        "enable_cpu_offload": false,
        "gc_debug": false,
        "gc_freq": 50,
        "global_batch_size": -1,
        "local_batch_size": 1,
        "max_norm": 1.0,
        "mixed_precision_param": "bfloat16",
        "mixed_precision_reduce": "float32",
        "seed": null,
        "seq_len": 4096,
        "steps": 1000
      },
      "validation": {
        "dataset": "c4_validation",
        "dataset_path": null,
        "enable": false,
        "freq": 5,
        "local_batch_size": 8,
        "seq_len": 2048,
        "steps": 10
      }
    }
    Number of ranks per node: 12
    Is initialized already
    [2025-09-12 11:38:22,781763][I][distributed/parallel_dims:158:_build_mesh_without_ep] Building 1-D device mesh with ['dp_shard'], [24]
    [2025-09-12 11:38:22,783219][I][tools/utils:65:collect] [GC] Initial GC collection 0.00 seconds
    [Tokenizer] Using backend: sptoken (SentencePiece)
    # [...repeated on each of the 24 ranks...]
    [2025-09-12 11:38:22,795599][I][dataset/sptoken:75:build_sentencepiece_tokenizer] [SPTokenizer] Using model path: ./assets/hf/AuroraGPT-7B
    [2025-09-12 11:38:22,806079][I][dataset/sptoken:36:__init__] [SPTokenizer] Loaded model: ./assets/hf/AuroraGPT-7B/tokenizer.model, vocab size: 32000
    [INFO][2025-09-12 11:38:22.811010] Reading data from /flare/Aurora_deployment/AuroraGPT/datasets/dolma/dolma_v1_7_file_list_mini.txt
    [INFO][2025-09-12 11:38:22.811281] Number of datasets: 9
    [INFO][2025-09-12 11:38:22.811427] Global batch size: 24
    [INFO][2025-09-12 11:38:22.811559] Training iterations: 1000
    [INFO][2025-09-12 11:38:22.811682] Evaluation iterations: 0
    [INFO][2025-09-12 11:38:22.811805] Total number of training samples: 24000
    [INFO][2025-09-12 11:38:22.811932] Total number of evaluation samples: 0
    [INFO][2025-09-12 11:38:22.812052] Total number of testing samples: 0
    [2025-09-12 11:38:23,388839][I][data/gpt_dataset:263:_cache_indices] > loading algebraic corpus dataset index: ./.cache/data/auroraGPT-7B/olmo-mix-1124/eaaf4239bf399fe90985648264fc597b_index.npy
    [2025-09-12 11:38:23,400289][I][data/gpt_dataset:270:_cache_indices] > loading algebraic corpus dataset sample index: ./.cache/data/auroraGPT-7B/olmo-mix-1124/eaaf4239bf399fe90985648264fc597b_sample_index.npy
    [2025-09-12 11:38:23,401313][I][data/gpt_dataset:277:_cache_indices] > finished loading in 0.01251498400233686 seconds
    [2025-09-12 11:38:23,402526][I][data/gpt_dataset:291:__init__] [BuildCorpusDataset] Caught args.shuffle_sample_in_corpus=True across 19984 samples
    [2025-09-12 11:38:23,498032][I][data/gpt_dataset:263:_cache_indices] > loading arxiv corpus dataset index: ./.cache/data/auroraGPT-7B/olmo-mix-1124/9958f3591de484302ca17a2f1feeafaf_index.npy
    [2025-09-12 11:38:23,502674][I][data/gpt_dataset:270:_cache_indices] > loading arxiv corpus dataset sample index: ./.cache/data/auroraGPT-7B/olmo-mix-1124/9958f3591de484302ca17a2f1feeafaf_sample_index.npy
    [2025-09-12 11:38:23,506868][I][data/gpt_dataset:277:_cache_indices] > finished loading in 0.008856782980728894 seconds
    [2025-09-12 11:38:23,507665][I][data/gpt_dataset:291:__init__] [BuildCorpusDataset] Caught args.shuffle_sample_in_corpus=True across 4140 samples
    [2025-09-12 11:38:23,520625][I][data/blendable_dataset:131:__init__] > loading blendable dataset index: ./.cache/data/auroraGPT-7B/olmo-mix-1124/212cf8a136a93b7bc9cd575fa4c82f21_index.npy
    [2025-09-12 11:38:23,527379][I][data/blendable_dataset:134:__init__] > loading blendable dataset sample index: ./.cache/data/auroraGPT-7B/olmo-mix-1124/212cf8a136a93b7bc9cd575fa4c82f21_sample_index.npy
    [2025-09-12 11:38:23,532038][I][data/blendable_dataset:139:__init__] > finished loading in 0.011423073010519147 seconds
    [2025-09-12 11:38:23,543427][I][data/blendable_dataset:152:__init__] > size of blendable dataset: 24124 samples
    [2025-09-12 11:38:23,544235][I][blendcorpus/train:177:__init__] Using BlendCorpus dataloader.
    [2025-09-12 11:38:23,544713][I][blendcorpus/train:185:__init__] Building blendcorpus AuroraGPT-7B with TransformerModelArgs(_enforced='This field is used to enforce all fields have defaults.', dim=4096, n_layers=32, n_heads=32, n_kv_heads=8, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, rope_theta=10000, max_seq_len=4096, depth_init=True, use_flex_attn=False, attn_mask_type='causal', eos_id=0)
    wandb: Currently logged in as: foremans (aurora_gpt) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
    wandb: Tracking run with wandb version 0.21.3
    wandb: Run data is saved locally in ./outputs/AuroraGPT-7B/tb/20250912-1138/wandb/run-20250912_113823-qzle9mdw
    wandb: Run `wandb offline` to turn off syncing.
    wandb: Syncing run snowy-sunset-14
    wandb:  View project at https://wandb.ai/aurora_gpt/torchtitan
    wandb:  View run at https://wandb.ai/aurora_gpt/torchtitan/runs/qzle9mdw
    [2025-09-12 11:38:24,703889][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    # [...safetensors.index.json warning repeated on each rank...]
    [2025-09-12 11:38:25,005897][I][components/metrics:155:__init__] WandB logging enabled
    [2025-09-12 11:38:25,012474][I][components/metrics:124:__init__] TensorBoard logging enabled. Logs will be saved at ./outputs/AuroraGPT-7B/tb/20250912-1138
    [2025-09-12 11:38:25,017569][I][components/metrics:101:build_device_memory_monitor] XPU capacity: Intel(R) Data Center GPU Max 1550 with 63.98GiB memory
    # [...safetensors.index.json warning repeated...]
    [2025-09-12 11:38:25,149299][I][blendcorpus/train:212:__init__] Model blendcorpus AuroraGPT-7B size: 5,933,109,248 total parameters
    [2025-09-12 11:38:25,150242][I][components/loss:28:build_cross_entropy_loss] Compiling the loss function with torch.compile
    [2025-09-12 11:38:25,190998][I][infra/parallelize:357:apply_compile] Compiling each TransformerBlock with torch.compile
    # [...safetensors.index.json warning repeated...]
    [2025-09-12 11:38:25,271332][I][infra/parallelize:122:parallelize_llama] Applied FSDP to the model
    # [...safetensors.index.json warning repeated...]
    [2025-09-12 11:38:25,770147][I][blendcorpus/train:290:__init__] Peak FLOPS used for computing MFU: 2.982e+14
    [2025-09-12 11:38:25,771316][I][blendcorpus/train:292:__init__] XPU memory usage for model: 1.04GiB(1.63%)
    [2025-09-12 11:38:25,773314][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:25,774216][I][distributed/utils:225:maybe_enable_amp] Mixed precision training is handled by fully_shard
    [2025-09-12 11:38:25,774808][I][blendcorpus/train:381:__init__] Trainer is initialized with local batch size 1, global batch size 24, gradient accumulation steps 1, sequence length 4096, total steps 1000 (warmup 2)
    [2025-09-12 11:38:25,775505][I][blendcorpus/train:695:<module>] Using SDPBackend.FLASH_ATTENTION backend for SDPA
    [2025-09-12 11:38:25,776216][I][blendcorpus/train:569:train] BlendCorpus dataloader advanced to consumed =0 samples (step={self.step}).
    [2025-09-12 11:38:25,776915][I][blendcorpus/train:581:train] Training starts at step 1.
    [2025-09-12 11:39:11,844905][I][components/metrics:442:log] step:  1  loss: 10.8919  grad_norm:  5.7773  memory: 21.74GiB(33.98%)  tps: 88  tflops: 3.62  mfu: 1.21%
    [2025-09-12 11:39:11,847254][I][distributed/utils:299:set_pg_timeouts] Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
    [2025-09-12 11:39:13,996720][I][components/metrics:442:log] step:  2  loss: 15.4482  grad_norm: 95.7768  memory: 23.63GiB(36.93%)  tps: 1,906  tflops: 78.63  mfu: 26.37%
    [2025-09-12 11:39:16,148721][I][components/metrics:442:log] step:  3  loss: 18.1145  grad_norm: 177.2544  memory: 23.63GiB(36.93%)  tps: 1,905  tflops: 78.60  mfu: 26.36%
    [2025-09-12 11:39:18,293594][I][components/metrics:442:log] step:  4  loss: 12.2966  grad_norm: 47.6269  memory: 23.63GiB(36.93%)  tps: 1,912  tflops: 78.86  mfu: 26.45%
    [2025-09-12 11:39:20,423330][I][components/metrics:442:log] step:  5  loss: 12.4196  grad_norm: 55.3153  memory: 23.63GiB(36.93%)  tps: 1,925  tflops: 79.42  mfu: 26.63%
    [2025-09-12 11:39:22,550981][I][components/metrics:442:log] step:  6  loss: 10.8771  grad_norm:  5.3124  memory: 23.63GiB(36.93%)  tps: 1,927  tflops: 79.50  mfu: 26.66%
    [2025-09-12 11:39:24,670689][I][components/metrics:442:log] step:  7  loss: 10.9488  grad_norm: 41.6404  memory: 23.63GiB(36.93%)  tps: 1,934  tflops: 79.80  mfu: 26.76%
    [2025-09-12 11:39:26,791101][I][components/metrics:442:log] step:  8  loss:  9.9818  grad_norm: 18.3422  memory: 23.63GiB(36.93%)  tps: 1,934  tflops: 79.77  mfu: 26.75%
    [2025-09-12 11:39:28,911059][I][components/metrics:442:log] step:  9  loss:  9.0792  grad_norm:  9.5251  memory: 23.63GiB(36.93%)  tps: 1,934  tflops: 79.79  mfu: 26.76%
    [2025-09-12 11:39:31,025851][I][components/metrics:442:log] step: 10  loss:  8.4230  grad_norm:  4.9722  memory: 23.63GiB(36.93%)  tps: 1,939  tflops: 79.98  mfu: 26.82%
    [2025-09-12 11:39:33,138436][I][components/metrics:442:log] step: 11  loss:  8.0111  grad_norm:  4.7603  memory: 23.63GiB(36.93%)  tps: 1,941  tflops: 80.07  mfu: 26.85%
    [2025-09-12 11:39:35,250642][I][components/metrics:442:log] step: 12  loss:  7.8059  grad_norm:  9.0702  memory: 23.63GiB(36.93%)  tps: 1,941  tflops: 80.08  mfu: 26.85%
    [2025-09-12 11:39:37,361018][I][components/metrics:442:log] step: 13  loss:  7.3035  grad_norm:  5.1540  memory: 23.63GiB(36.93%)  tps: 1,943  tflops: 80.15  mfu: 26.88%
    [2025-09-12 11:39:39,472014][I][components/metrics:442:log] step: 14  loss:  7.1419  grad_norm:  4.1700  memory: 23.63GiB(36.93%)  tps: 1,942  tflops: 80.13  mfu: 26.87%
    [2025-09-12 11:39:41,584217][I][components/metrics:442:log] step: 15  loss:  6.9347  grad_norm:  4.9882  memory: 23.63GiB(36.93%)  tps: 1,941  tflops: 80.08  mfu: 26.86%
    [2025-09-12 11:39:43,690898][I][components/metrics:442:log] step: 16  loss:  7.3633  grad_norm: 31.0589  memory: 23.63GiB(36.93%)  tps: 1,946  tflops: 80.29  mfu: 26.93%
    [2025-09-12 11:39:45,799715][I][components/metrics:442:log] step: 17  loss:  7.1793  grad_norm: 13.7271  memory: 23.63GiB(36.93%)  tps: 1,944  tflops: 80.21  mfu: 26.90%
    [2025-09-12 11:39:47,907438][I][components/metrics:442:log] step: 18  loss:  7.2268  grad_norm: 10.9098  memory: 23.63GiB(36.93%)  tps: 1,945  tflops: 80.25  mfu: 26.91%
    [2025-09-12 11:39:50,018253][I][components/metrics:442:log] step: 19  loss:  6.9895  grad_norm:  6.6582  memory: 23.63GiB(36.93%)  tps: 1,942  tflops: 80.13  mfu: 26.87%
    [2025-09-12 11:39:52,127309][I][components/metrics:442:log] step: 20  loss:  6.7515  grad_norm:  3.5633  memory: 23.63GiB(36.93%)  tps: 1,944  tflops: 80.20  mfu: 26.90%
    [2025-09-12 11:39:54,237784][I][components/metrics:442:log] step: 21  loss:  6.7755  grad_norm:  3.6999  memory: 23.63GiB(36.93%)  tps: 1,943  tflops: 80.15  mfu: 26.88%
    [2025-09-12 11:39:56,348825][I][components/metrics:442:log] step: 22  loss:  6.9412  grad_norm:  3.5428  memory: 23.63GiB(36.93%)  tps: 1,942  tflops: 80.13  mfu: 26.87%
    [2025-09-12 11:39:58,460931][I][components/metrics:442:log] step: 23  loss:  6.8696  grad_norm:  2.8968  memory: 23.63GiB(36.93%)  tps: 1,941  tflops: 80.08  mfu: 26.86%
    [2025-09-12 11:40:00,572489][I][components/metrics:442:log] step: 24  loss:  6.6327  grad_norm:  5.1677  memory: 23.63GiB(36.93%)  tps: 1,942  tflops: 80.11  mfu: 26.86%
    [2025-09-12 11:40:02,683070][I][components/metrics:442:log] step: 25  loss:  6.7134  grad_norm:  3.7672  memory: 23.63GiB(36.93%)  tps: 1,943  tflops: 80.14  mfu: 26.88%
    [2025-09-12 11:40:04,793520][I][components/metrics:442:log] step: 26  loss:  6.5521  grad_norm:  3.4081  memory: 23.63GiB(36.93%)  tps: 1,943  tflops: 80.15  mfu: 26.88%
    [2025-09-12 11:40:06,906933][I][components/metrics:442:log] step: 27  loss:  6.6118  grad_norm:  2.8971  memory: 23.63GiB(36.93%)  tps: 1,940  tflops: 80.04  mfu: 26.84%
    [2025-09-12 11:40:09,019771][I][components/metrics:442:log] step: 28  loss:  6.7229  grad_norm:  2.6085  memory: 23.63GiB(36.93%)  tps: 1,941  tflops: 80.06  mfu: 26.85%
    [2025-09-12 11:40:11,135250][I][components/metrics:442:log] step: 29  loss:  6.5777  grad_norm:  2.8184  memory: 23.63GiB(36.93%)  tps: 1,938  tflops: 79.96  mfu: 26.81%
    [2025-09-12 11:40:13,249416][I][components/metrics:442:log] step: 30  loss:  6.5954  grad_norm:  2.7959  memory: 23.63GiB(36.93%)  tps: 1,939  tflops: 80.00  mfu: 26.83%
    [2025-09-12 11:40:15,364869][I][components/metrics:442:log] step: 31  loss:  6.4546  grad_norm:  3.2096  memory: 23.63GiB(36.93%)  tps: 1,938  tflops: 79.96  mfu: 26.82%
    [2025-09-12 11:40:17,476265][I][components/metrics:442:log] step: 32  loss:  6.6677  grad_norm:  2.1374  memory: 23.63GiB(36.93%)  tps: 1,942  tflops: 80.11  mfu: 26.87%
    [2025-09-12 11:40:19,590038][I][components/metrics:442:log] step: 33  loss:  6.5451  grad_norm:  2.0738  memory: 23.63GiB(36.93%)  tps: 1,940  tflops: 80.02  mfu: 26.84%
    [2025-09-12 11:40:21,706964][I][components/metrics:442:log] step: 34  loss:  6.7087  grad_norm:  2.5267  memory: 23.63GiB(36.93%)  tps: 1,937  tflops: 79.91  mfu: 26.80%
    [2025-09-12 11:40:23,826393][I][components/metrics:442:log] step: 35  loss:  6.3955  grad_norm:  1.9991  memory: 23.63GiB(36.93%)  tps: 1,935  tflops: 79.81  mfu: 26.76%
    [2025-09-12 11:40:25,943121][I][components/metrics:442:log] step: 36  loss:  6.4686  grad_norm:  1.5817  memory: 23.63GiB(36.93%)  tps: 1,937  tflops: 79.91  mfu: 26.80%
    [2025-09-12 11:40:28,062842][I][components/metrics:442:log] step: 37  loss:  6.3481  grad_norm:  2.6166  memory: 23.63GiB(36.93%)  tps: 1,934  tflops: 79.79  mfu: 26.76%
    [2025-09-12 11:40:30,184717][I][components/metrics:442:log] step: 38  loss:  6.4443  grad_norm:  2.5323  memory: 23.63GiB(36.93%)  tps: 1,932  tflops: 79.71  mfu: 26.73%
    [2025-09-12 11:40:32,305122][I][components/metrics:442:log] step: 39  loss:  6.2732  grad_norm:  2.1087  memory: 23.63GiB(36.93%)  tps: 1,934  tflops: 79.77  mfu: 26.75%
    [2025-09-12 11:40:34,431400][I][components/metrics:442:log] step: 40  loss:  6.1638  grad_norm:  1.6096  memory: 23.63GiB(36.93%)  tps: 1,928  tflops: 79.55  mfu: 26.68%
    [2025-09-12 11:40:36,558993][I][components/metrics:442:log] step: 41  loss:  6.2434  grad_norm:  2.1429  memory: 23.63GiB(36.93%)  tps: 1,927  tflops: 79.50  mfu: 26.66%
    [2025-09-12 11:40:38,684159][I][components/metrics:442:log] step: 42  loss:  6.2472  grad_norm:  1.9758  memory: 23.63GiB(36.93%)  tps: 1,929  tflops: 79.59  mfu: 26.69%
    [2025-09-12 11:40:40,811350][I][components/metrics:442:log] step: 43  loss:  6.0686  grad_norm:  2.0387  memory: 23.63GiB(36.93%)  tps: 1,927  tflops: 79.52  mfu: 26.67%
    [2025-09-12 11:40:42,942820][I][components/metrics:442:log] step: 44  loss:  6.0512  grad_norm:  1.7659  memory: 23.63GiB(36.93%)  tps: 1,924  tflops: 79.36  mfu: 26.61%
    [2025-09-12 11:40:45,071924][I][components/metrics:442:log] step: 45  loss:  5.9693  grad_norm:  3.0356  memory: 23.63GiB(36.93%)  tps: 1,926  tflops: 79.44  mfu: 26.64%
    [2025-09-12 11:40:47,202347][I][components/metrics:442:log] step: 46  loss:  6.1370  grad_norm:  2.2346  memory: 23.63GiB(36.93%)  tps: 1,924  tflops: 79.39  mfu: 26.62%
    [2025-09-12 11:40:49,335707][I][components/metrics:442:log] step: 47  loss:  6.0951  grad_norm:  2.2721  memory: 23.63GiB(36.93%)  tps: 1,922  tflops: 79.29  mfu: 26.59%
    [2025-09-12 11:40:51,472182][I][components/metrics:442:log] step: 48  loss:  6.1080  grad_norm:  2.3427  memory: 23.63GiB(36.93%)  tps: 1,919  tflops: 79.17  mfu: 26.55%
    [2025-09-12 11:40:53,607441][I][components/metrics:442:log] step: 49  loss:  5.8213  grad_norm:  2.4015  memory: 23.63GiB(36.93%)  tps: 1,920  tflops: 79.22  mfu: 26.57%
    [2025-09-12 11:40:53,644423][I][tools/utils:65:collect] [GC] Performing periodical GC collection 0.04 seconds
    [2025-09-12 11:40:55,782338][I][components/metrics:442:log] step: 50  loss:  6.0710  grad_norm:  2.2237  memory: 23.63GiB(36.93%)  tps: 1,885  tflops: 77.77  mfu: 26.08%
    [2025-09-12 11:40:57,921332][I][components/metrics:442:log] step: 51  loss:  5.6129  grad_norm:  1.8282  memory: 23.63GiB(36.93%)  tps: 1,917  tflops: 79.07  mfu: 26.52%
    [2025-09-12 11:41:00,060512][I][components/metrics:442:log] step: 52  loss:  5.8381  grad_norm:  2.2276  memory: 23.63GiB(36.93%)  tps: 1,917  tflops: 79.07  mfu: 26.52%
    [2025-09-12 11:41:02,201596][I][components/metrics:442:log] step: 53  loss:  5.5789  grad_norm:  1.8904  memory: 23.63GiB(36.93%)  tps: 1,915  tflops: 79.00  mfu: 26.49%
    [2025-09-12 11:41:04,338853][I][components/metrics:442:log] step: 54  loss:  5.5972  grad_norm:  1.9285  memory: 23.63GiB(36.93%)  tps: 1,918  tflops: 79.14  mfu: 26.54%
    [2025-09-12 11:41:06,483940][I][components/metrics:442:log] step: 55  loss:  5.5264  grad_norm:  2.1031  memory: 23.63GiB(36.93%)  tps: 1,911  tflops: 78.86  mfu: 26.45%
    [2025-09-12 11:41:08,626486][I][components/metrics:442:log] step: 56  loss:  5.6756  grad_norm:  1.8958  memory: 23.63GiB(36.93%)  tps: 1,914  tflops: 78.95  mfu: 26.48%
    [2025-09-12 11:41:10,768986][I][components/metrics:442:log] step: 57  loss:  5.5827  grad_norm:  1.9008  memory: 23.63GiB(36.93%)  tps: 1,914  tflops: 78.95  mfu: 26.48%
    [2025-09-12 11:41:12,915983][I][components/metrics:442:log] step: 58  loss:  6.1343  grad_norm:  2.2042  memory: 23.63GiB(36.93%)  tps: 1,910  tflops: 78.78  mfu: 26.42%
    [2025-09-12 11:41:15,057467][I][components/metrics:442:log] step: 59  loss:  5.7517  grad_norm:  1.7251  memory: 23.63GiB(36.93%)  tps: 1,914  tflops: 78.98  mfu: 26.49%
    [2025-09-12 11:41:17,195890][I][components/metrics:442:log] step: 60  loss:  5.5449  grad_norm:  1.7781  memory: 23.63GiB(36.93%)  tps: 1,917  tflops: 79.10  mfu: 26.53%
    [2025-09-12 11:41:19,340106][I][components/metrics:442:log] step: 61  loss:  5.5037  grad_norm:  1.8137  memory: 23.63GiB(36.93%)  tps: 1,912  tflops: 78.88  mfu: 26.45%
    [2025-09-12 11:41:21,479998][I][components/metrics:442:log] step: 62  loss:  5.5703  grad_norm:  2.2754  memory: 23.63GiB(36.93%)  tps: 1,916  tflops: 79.04  mfu: 26.51%
    [2025-09-12 11:41:23,619646][I][components/metrics:442:log] step: 63  loss:  5.3396  grad_norm:  1.9820  memory: 23.63GiB(36.93%)  tps: 1,916  tflops: 79.06  mfu: 26.51%
    [2025-09-12 11:41:25,758931][I][components/metrics:442:log] step: 64  loss:  5.2862  grad_norm:  2.1926  memory: 23.63GiB(36.93%)  tps: 1,917  tflops: 79.07  mfu: 26.52%
    [2025-09-12 11:41:27,902443][I][components/metrics:442:log] step: 65  loss:  5.3883  grad_norm:  1.8266  memory: 23.63GiB(36.93%)  tps: 1,913  tflops: 78.91  mfu: 26.46%
    [2025-09-12 11:41:30,047189][I][components/metrics:442:log] step: 66  loss:  5.3715  grad_norm:  1.8546  memory: 23.63GiB(36.93%)  tps: 1,912  tflops: 78.86  mfu: 26.45%
    [2025-09-12 11:41:32,191202][I][components/metrics:442:log] step: 67  loss:  5.3473  grad_norm:  1.8945  memory: 23.63GiB(36.93%)  tps: 1,912  tflops: 78.88  mfu: 26.45%
    [2025-09-12 11:41:34,336648][I][components/metrics:442:log] step: 68  loss:  5.4083  grad_norm:  1.6982  memory: 23.63GiB(36.93%)  tps: 1,911  tflops: 78.83  mfu: 26.44%
    [2025-09-12 11:41:36,480695][I][components/metrics:442:log] step: 69  loss:  5.2105  grad_norm:  1.5840  memory: 23.63GiB(36.93%)  tps: 1,912  tflops: 78.89  mfu: 26.45%
    [2025-09-12 11:41:38,625671][I][components/metrics:442:log] step: 70  loss:  5.2483  grad_norm:  1.8750  memory: 23.63GiB(36.93%)  tps: 1,911  tflops: 78.85  mfu: 26.44%
    [2025-09-12 11:41:40,772186][I][components/metrics:442:log] step: 71  loss:  5.1239  grad_norm:  1.9717  memory: 23.63GiB(36.93%)  tps: 1,910  tflops: 78.80  mfu: 26.43%
    [2025-09-12 11:41:42,918729][I][components/metrics:442:log] step: 72  loss:  5.3355  grad_norm:  1.8882  memory: 23.63GiB(36.93%)  tps: 1,910  tflops: 78.79  mfu: 26.42%
    [2025-09-12 11:41:45,066384][I][components/metrics:442:log] step: 73  loss:  5.0560  grad_norm:  1.6971  memory: 23.63GiB(36.93%)  tps: 1,909  tflops: 78.76  mfu: 26.41%
    [2025-09-12 11:41:47,209176][I][components/metrics:442:log] step: 74  loss:  5.0859  grad_norm:  2.6819  memory: 23.63GiB(36.93%)  tps: 1,913  tflops: 78.93  mfu: 26.47%
    [2025-09-12 11:41:49,355442][I][components/metrics:442:log] step: 75  loss:  5.2856  grad_norm:  1.8572  memory: 23.63GiB(36.93%)  tps: 1,910  tflops: 78.81  mfu: 26.43%
    [2025-09-12 11:41:51,499099][I][components/metrics:442:log] step: 76  loss:  5.2415  grad_norm:  1.4722  memory: 23.63GiB(36.93%)  tps: 1,913  tflops: 78.90  mfu: 26.46%
    [2025-09-12 11:41:53,642872][I][components/metrics:442:log] step: 77  loss:  5.1465  grad_norm:  1.6991  memory: 23.63GiB(36.93%)  tps: 1,913  tflops: 78.90  mfu: 26.46%
    [2025-09-12 11:41:55,790222][I][components/metrics:442:log] step: 78  loss:  4.9042  grad_norm:  2.5348  memory: 23.63GiB(36.93%)  tps: 1,909  tflops: 78.77  mfu: 26.41%
    [2025-09-12 11:41:57,938398][I][components/metrics:442:log] step: 79  loss:  5.1845  grad_norm:  2.1790  memory: 23.63GiB(36.93%)  tps: 1,908  tflops: 78.73  mfu: 26.40%
    [2025-09-12 11:42:00,085052][I][components/metrics:442:log] step: 80  loss:  5.0380  grad_norm:  1.8122  memory: 23.63GiB(36.93%)  tps: 1,910  tflops: 78.79  mfu: 26.42%
    [2025-09-12 11:42:02,229187][I][components/metrics:442:log] step: 81  loss:  5.1028  grad_norm:  2.3178  memory: 23.63GiB(36.93%)  tps: 1,912  tflops: 78.89  mfu: 26.46%
    [2025-09-12 11:42:04,376585][I][components/metrics:442:log] step: 82  loss:  4.9639  grad_norm:  1.7682  memory: 23.63GiB(36.93%)  tps: 1,909  tflops: 78.77  mfu: 26.42%
    [2025-09-12 11:42:06,522266][I][components/metrics:442:log] step: 83  loss:  5.1079  grad_norm:  2.0751  memory: 23.63GiB(36.93%)  tps: 1,911  tflops: 78.83  mfu: 26.44%
    [2025-09-12 11:42:08,668032][I][components/metrics:442:log] step: 84  loss:  5.0744  grad_norm:  1.4189  memory: 23.63GiB(36.93%)  tps: 1,911  tflops: 78.82  mfu: 26.43%
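
A quick sanity check on the reported numbers: with 24 XPUs, local batch size 1, and sequence length 4096, each step processes 24 × 4096 = 98,304 tokens; at ≈ 2.11 s/step that is ≈ 1,940 tokens/s per device, matching the tps column above. The mfu column is just the achieved TFLOP/s divided by the peak used for computing MFU (2.982e+14 FLOP/s, from the log):

    # worked example using the step-13 reading from the log above
    python3 -c "print(f'{80.15e12 / 2.982e14:.2%}')"  # ≈ 26.88%, matching the log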

Footnotes

  1. Submitted PR #2

Citation

BibTeX citation:
@online{foreman2025,
  author = {Foreman, Sam},
  title = {🍹 {BlendCorpus} + {TorchTitan} @ {ALCF}},
  date = {2025-09-12},
  url = {https://samforeman.me/posts/2025/09/12/},
  langid = {en}
}
For attribution, please cite this work as:
Foreman, Sam. 2025. “🍹 BlendCorpus + TorchTitan @ ALCF.” September 12, 2025. https://samforeman.me/posts/2025/09/12/.