đŸč BlendCorpus + TorchTitan @ ALCF

Categories: pytorch, Aurora, ALCF, torchtitan, blendcorpus
Training large language models with BlendCorpus and TorchTitan on supercomputers at the Argonne Leadership Computing Facility.
Published: September 12, 2025

Modified: October 7, 2025

📌 Source Repositories

Things are changing quickly, so to avoid confusion, here are the exact repositories and branches used for this demo:

  ‱ auroraGPT-ANL/torchtitan @ saforem2/blendcorpus
  ‱ saforem2/blendcorpus @ reorg-imports
  ‱ saforem2/ezpz

đŸƒâ€â™‚ïž Running

  • Clone repo:

    git clone https://github.com/auroraGPT-ANL/torchtitan
    cd torchtitan
    git checkout saforem2/blendcorpus
  • Setup env:

    source <(curl -L https://bit.ly/ezpz-utils)
    ezpz_setup_env
    • 2025-09-11, on Aurora @ ALCF:

      output:

      ; ssh x4112c1s0b0n0
      
      #[🐍 aurora_nre_models_frameworks-2025.2.0]
      #[/f/A/A/E/A/t/a/torchtitan][đŸŒ± saforem2/blendcorpus][đŸ“đŸ€·âœ“] 
      #[09/11/25 @ 14:08:35][x4112c1s0b0n0]
      ; source <(curl -L https://bit.ly/ezpz-utils)
      
      #[🐍 aurora_nre_models_frameworks-2025.2.0]
      #[/f/A/A/E/A/t/a/torchtitan][đŸŒ± saforem2/blendcorpus][đŸ“đŸ€·âœ“] 
      #[09/11/25 @ 14:08:37][x4112c1s0b0n0]
      ; ezpz_setup_env                                                                                                                                                                      
      [2025-09-11-140838][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2720] Detected PBS scheduler environment.
      [2025-09-11-140838][W][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2756] Current working directory does not match PBS_O_WORKDIR! This may cause issues with the job submission.
      [2025-09-11-140838][W][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2757] PBS_O_WORKDIR /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/zhenghh04/torchtitan
      [2025-09-11-140838][W][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2758] WORKING_DIR /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/auroraGPT-ANL/torchtitan
      [2025-09-11-140838][W][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2759] Exporting PBS_O_WORKDIR=WORKING_DIR=/lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/auroraGPT-ANL/torchtitan and continuing...
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2486] running [ezpz_setup_env]...
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1298] [PYTHON]
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1327]   - Conda active, conda=/opt/aurora/25.190.0/frameworks/aurora_nre_models_frameworks-2025.2.0...
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1328]   - No virtual_env found in environment
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1142]   - Found python root at /opt/aurora/25.190.0/frameworks/aurora_nre_models_frameworks-2025.2.0
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1157]   - No VIRTUAL_ENV found in environment!
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1160]   - Looking for venv in venvs/aurora/torchtitan-aurora_nre_models_frameworks-2025.2.0...
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1182]   - Activating existing venv in VENV_DIR=venvs/torchtitan-aurora_nre_models_frameworks-2025.2.0
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1184]   - Found /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/auroraGPT-ANL/torchtitan/venvs/aurora/torchtitan-aurora_nre_models_frameworks-2025.2.0/bin/activate
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1353]   - Using python from: /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/auroraGPT-ANL/torchtitan/venvs/aurora/torchtitan-aurora_nre_models_frameworks-2025.2.0/bin/python3
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2335] [JOB]
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2336]   - Setting up env for foremans
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2337]   - Detected pbs scheduler
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2338]   - Machine: aurora
      [2025-09-11-140839][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2339]   - Hostname: x4112c1s0b0n0
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2249]   - PBS_JOBID=7559761.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
          to calculate:
            - num_hosts: 2
            - num_cores_per_host: 208
            - num_cpus_per_host: 104
            - num_gpus_per_host: 12
            - depth: 8
            - num_gpus: 24
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1754] [HOSTS] - ezpz_print_hosts
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1756]   - Detected PBS Scheduler
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1774] [HOSTS]
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1775]   - HOSTFILE=/var/spool/pbs/aux/7559761.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1776]   - NHOSTS=2
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1777]   - HOSTS:
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1780]     - [host:0] - x4112c1s0b0n0.hsn.cm.aurora.alcf.anl.gov
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1780]     - [host:1] - x4112c1s1b0n0.hsn.cm.aurora.alcf.anl.gov
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1941] [DIST_INFO]
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1942]   - HOSTFILE=/var/spool/pbs/aux/7559761.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1943]   - NHOSTS=2
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1944]   - NGPU_PER_HOST=12
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1945]   - NGPUS=24
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1947] [LAUNCH]
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1948]   - To launch across all available GPUs, use: 'launch'
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1949]     launch = mpiexec --verbose --envall -n 24 -ppn 12 --hostfile /var/spool/pbs/aux/7559761.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:1950]   - Run 'which launch' to ensure that the alias is set correctly
      [2025-09-11-140842][I][/home/foremans/ezpz/src/ezpz/bin/utils.sh:2495] [✓] Finished [ezpz_setup_env]
      took: 0h:00m:04s
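The geometry numbers ezpz prints (num_cpus_per_host, depth, num_gpus) and the hex CPU masks that mpiexec reports at launch can be reproduced with a few lines of arithmetic. This is a sketch inferred from the logged values, not code from ezpz itself:

```python
# Sketch of the job geometry ezpz derives (inferred from the logged values).
# Aurora nodes report 208 hardware threads = 104 physical cores per host.
def job_geometry(num_hosts: int, num_cores_per_host: int, num_gpus_per_host: int) -> dict:
    num_cpus_per_host = num_cores_per_host // 2           # physical cores (2 HW threads each)
    return {
        "num_cpus_per_host": num_cpus_per_host,
        "depth": num_cpus_per_host // num_gpus_per_host,  # CPUs available per rank
        "num_gpus": num_hosts * num_gpus_per_host,
    }

def core_mask(first: int, last: int) -> int:
    """Bitmask for an inclusive core range like the '2-4' entries in --cpu-bind=list."""
    return sum(1 << c for c in range(first, last + 1))

print(job_geometry(2, 208, 12))  # {'num_cpus_per_host': 104, 'depth': 8, 'num_gpus': 24}
print(hex(core_mask(2, 4)))      # 0x1c   -> the mask mpiexec reports for local rank 0
print(hex(core_mask(10, 12)))    # 0x1c00 -> local rank 1
```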
  • Install dependencies. From inside your clone of torchtitan:

    • 🍋 ezpz:

      # uv not required, but useful!
      # to download: curl -LsSf https://astral.sh/uv/install.sh | sh
      uv pip install "git+https://github.com/saforem2/ezpz"
    • đŸč BlendCorpus:

      git clone https://github.com/saforem2/blendcorpus deps/blendcorpus
      cd deps/blendcorpus
      git checkout reorg-imports
      uv pip install -e "."
    • đŸ”„ TorchTitan:

      python3 -m pip install "git+https://github.com/saforem2/blendcorpus@saforem2/reorg-imports"
      # from inside auroraGPT-ANL/torchtitan @ saforem2/blendcorpus
      python3 -m pip install -e "."
  • Download Artifacts:

    • AuroraGPT-2B:

      python3 scripts/download_hf_assets.py --repo_id google/gemma-7b --assets tokenizer
      mkdir -p assets/hf/AuroraGPT-2B
      cp assets/hf/gemma-7b/tokenizer.model assets/hf/AuroraGPT-2B
    • AuroraGPT-7B:

      python3 scripts/download_hf_assets.py --repo_id meta-llama/llama-2-7b-hf --assets tokenizer
      mkdir -p assets/hf/AuroraGPT-7B
      cp assets/hf/llama-2-7b-hf/tokenizer.model assets/hf/AuroraGPT-7B
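The mkdir/cp steps above can be wrapped in a small helper. `stage_tokenizer` is hypothetical (not part of torchtitan), shown only to make the expected layout, `assets/hf/<flavor>/tokenizer.model`, explicit:

```python
# Hypothetical helper mirroring the mkdir + cp steps above: stage a downloaded
# tokenizer.model under the per-model assets directory (assets/hf/<flavor>/).
from pathlib import Path
import shutil

def stage_tokenizer(src_dir: str, flavor: str, assets_root: str = "assets/hf") -> Path:
    dst = Path(assets_root) / flavor
    dst.mkdir(parents=True, exist_ok=True)  # mkdir -p
    return Path(shutil.copy(Path(src_dir) / "tokenizer.model", dst))

# e.g. stage_tokenizer("assets/hf/gemma-7b", "AuroraGPT-2B")
```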
  • Launch:

    ; ezpz-launch -m torchtitan.experiments.blendcorpus.train --job.config_file torchtitan/experiments/blendcorpus/train_configs/auroraGPT_7B.toml

    output:

    #[09/12/25 @ 11:33:56][x4117c4s2b0n0]
    ; ezpz-launch -m torchtitan.experiments.blendcorpus.train --job.config_file torchtitan/experiments/blendcorpus/train_configs/auroraGPT_7B.toml
    
    [2025-09-12-113711][I][-zsh:91] Using torchtitan/experiments/blendcorpus/train_configs/auroraGPT_7B.toml
    [2025-09-12-113711][I][-zsh:91] Logs will be saved to: logs/auroraGPT_7B-2025-09-12-113711.log
    [W912 11:37:15.852098275 OperatorEntry.cpp:219] Warning: Warning only once for all operators,  other operators may also be overridden.
      Overriding a previously registered kernel for the same operator and the same dispatch key
      operator: aten::geometric_(Tensor(a!) self, float p, *, Generator? generator=None) -> Tensor(a!)
        registered at /lus/tegu/projects/datasets/software/wheelforge/repositories/pytorch_2p8_rel_07_18_2025/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
      dispatch key: XPU
      previous kernel: registered at /lus/tegu/projects/datasets/software/wheelforge/repositories/pytorch_2p8_rel_07_18_2025/pytorch/aten/src/ATen/VmapModeRegistrations.cpp:37
          new kernel: registered at /lus/tegu/projects/datasets/software/wheelforge/repositories/ipex_2.8.10_xpu_rel_08_18_2025/intel-extension-for-pytorch/build/Release/csrc/gpu/csrc/gpu/xpu/ATen/RegisterXPU_0.cpp:172 (function operator())
    [2025-09-12 11:37:16,879] [INFO] [real_accelerator.py:260:get_accelerator] Setting ds_accelerator to xpu (auto detect)
    /opt/aurora/25.190.0/frameworks/aurora_nre_models_frameworks-2025.2.0/lib/python3.10/site-packages/neural_compressor/utils/utility.py:44: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
      from pkg_resources import parse_version
    [2025-09-12 11:37:30,939] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False
    [2025-09-12 11:37:43,263749][I][ezpz/__init__:266:<module>] Setting logging level to 'INFO' on 'RANK == 0'
    [2025-09-12 11:37:43,266470][I][ezpz/__init__:267:<module>] Setting logging level to 'CRITICAL' on all others 'RANK != 0'
    
    
    [2025-09-12 11:37:43,273704][I][ezpz/launch:340:launch] ----[🍋 ezpz.launch][started][2025-09-12-113743]----
    [2025-09-12 11:37:47,537879][I][ezpz/launch:345:launch] Job ID: 7591191
    [2025-09-12 11:37:47,538702][I][ezpz/launch:346:launch] nodelist: ['x4117c4s2b0n0', 'x4117c4s6b0n0']
    [2025-09-12 11:37:47,539093][I][ezpz/launch:347:launch] hostfile: /var/spool/pbs/aux/7591191.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
    [2025-09-12 11:37:47,540277][I][ezpz/pbs:188:get_pbs_launch_cmd] ✅ Using [24/24] GPUs [2 hosts] x [12 GPU/host]
    [2025-09-12 11:37:47,541233][I][ezpz/launch:316:build_executable] Building command to execute by piecing together:
    [2025-09-12 11:37:47,541638][I][ezpz/launch:317:build_executable] (1.) launch_cmd: mpiexec --verbose --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/7591191.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
    [2025-09-12 11:37:47,542413][I][ezpz/launch:318:build_executable] (2.) cmd_to_launch: /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/auroraGPT-ANL/torchtitan/venvs/aurora/torchtitan-aurora_nre_models_frameworks-2025.2.0/bin/python3 -m torchtitan.experiments.blendcorpus.train --job.config_file torchtitan/experiments/blendcorpus/train_configs/auroraGPT_7B.toml
    [2025-09-12 11:37:47,543429][I][ezpz/launch:360:launch] Took: 4.27 seconds to build command.
    [2025-09-12 11:37:47,543810][I][ezpz/launch:363:launch] Executing:
    mpiexec
      --verbose
      --envall
      --np=24
      --ppn=12
      --hostfile=/var/spool/pbs/aux/7591191.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
      --no-vni
      --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
      /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/auroraGPT-ANL/torchtitan/venvs/aurora/torchtitan-aurora_nre_models_frameworks-2025.2.0/bin/python3
      -m
      torchtitan.experiments.blendcorpus.train
      --job.config_file
      torchtitan/experiments/blendcorpus/train_configs/auroraGPT_7B.toml
    [2025-09-12 11:37:47,545251][I][ezpz/launch:179:get_aurora_filters] Filtering for Aurora-specific messages. To view list of filters, run with EZPZ_LOG_LEVEL=DEBUG
    [2025-09-12 11:37:47,545756][I][ezpz/launch:370:launch] Execution started @ 2025-09-12-113747...
    [2025-09-12 11:37:47,546182][I][ezpz/launch:371:launch] ----[🍋 ezpz.launch][stop][2025-09-12-113747]----
    [2025-09-12 11:37:47,546634][I][ezpz/launch:99:run_command] Caught 24 filters
    [2025-09-12 11:37:47,547002][I][ezpz/launch:100:run_command] Running command:
    mpiexec --verbose --envall --np=24 --ppn=12 --hostfile=/var/spool/pbs/aux/7591191.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 /lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/auroraGPT-ANL/torchtitan/venvs/aurora/torchtitan-aurora_nre_models_frameworks-2025.2.0/bin/python3 -m torchtitan.experiments.blendcorpus.train --job.config_file torchtitan/experiments/blendcorpus/train_configs/auroraGPT_7B.toml
    Disabling local launch: multi-node application
    Connected to tcp://x4117c4s2b0n0.hsn.cm.aurora.alcf.anl.gov:7919
    Launching application 422e0368-f389-4475-8131-3de313723140
    cpubind:list x4117c4s2b0n0 pid 35392 rank 0 0: mask 0x1c
    cpubind:list x4117c4s2b0n0 pid 35393 rank 1 1: mask 0x1c00
    cpubind:list x4117c4s2b0n0 pid 35394 rank 2 2: mask 0x1c0000
    cpubind:list x4117c4s2b0n0 pid 35395 rank 3 3: mask 0x1c000000
    cpubind:list x4117c4s2b0n0 pid 35396 rank 4 4: mask 0x1c00000000
    cpubind:list x4117c4s2b0n0 pid 35397 rank 5 5: mask 0x1c0000000000
    cpubind:list x4117c4s2b0n0 pid 35398 rank 6 6: mask 0x1c0000000000000
    cpubind:list x4117c4s2b0n0 pid 35399 rank 7 7: mask 0x1c000000000000000
    cpubind:list x4117c4s2b0n0 pid 35400 rank 8 8: mask 0x1c00000000000000000
    cpubind:list x4117c4s2b0n0 pid 35401 rank 9 9: mask 0x1c0000000000000000000
    cpubind:list x4117c4s2b0n0 pid 35402 rank 10 10: mask 0x1c000000000000000000000
    cpubind:list x4117c4s2b0n0 pid 35403 rank 11 11: mask 0x1c00000000000000000000000
    Application 422e0368-f389-4475-8131-3de313723140 started execution
    cpubind:list x4117c4s6b0n0 pid 111063 rank 12 0: mask 0x1c
    cpubind:list x4117c4s6b0n0 pid 111064 rank 13 1: mask 0x1c00
    cpubind:list x4117c4s6b0n0 pid 111065 rank 14 2: mask 0x1c0000
    cpubind:list x4117c4s6b0n0 pid 111066 rank 15 3: mask 0x1c000000
    cpubind:list x4117c4s6b0n0 pid 111067 rank 16 4: mask 0x1c00000000
    cpubind:list x4117c4s6b0n0 pid 111068 rank 17 5: mask 0x1c0000000000
    cpubind:list x4117c4s6b0n0 pid 111069 rank 18 6: mask 0x1c0000000000000
    cpubind:list x4117c4s6b0n0 pid 111070 rank 19 7: mask 0x1c000000000000000
    cpubind:list x4117c4s6b0n0 pid 111071 rank 20 8: mask 0x1c00000000000000000
    cpubind:list x4117c4s6b0n0 pid 111072 rank 21 9: mask 0x1c0000000000000000000
    cpubind:list x4117c4s6b0n0 pid 111073 rank 22 10: mask 0x1c000000000000000000000
    cpubind:list x4117c4s6b0n0 pid 111074 rank 23 11: mask 0x1c00000000000000000000000
    # [...repeated...]: same operator-override warning from each remaining rank
      from pkg_resources import parse_version
    # [...repeated...]: TODO: Add this to the list of filters in ezpz
      from pkg_resources import parse_version
    [2025-09-12 11:38:05,512552][I][ezpz/__init__:266:<module>] Setting logging level to 'INFO' on 'RANK == 0'
    [2025-09-12 11:38:05,515164][I][ezpz/__init__:267:<module>] Setting logging level to 'CRITICAL' on all others 'RANK != 0'
    [2025-09-12 11:38:07,293955][I][ezpz/dist:1181:setup_torch_distributed] Using fw='ddp' with torch_{device,backend}= {xpu, xccl}
    [2025-09-12 11:38:07,295126][I][ezpz/dist:1039:setup_torch_DDP] Caught MASTER_PORT=44635 from environment!
    [2025-09-12 11:38:07,295968][I][ezpz/dist:1055:setup_torch_DDP] Using torch.distributed.init_process_group with
    - master_addr='x4117c4s2b0n0.hsn.cm.aurora.alcf.anl.gov'
    - master_port='44635'
    - world_size=24
    - rank=0
    - local_rank=0
    - timeout=datetime.timedelta(seconds=3600)
    - backend='xccl'
    [2025-09-12 11:38:07,297280][I][ezpz/dist:772:init_process_group] Calling torch.distributed.init_process_group_with: rank=0 world_size=24 backend=xccl
    [2025-09-12 11:38:21,344380][I][ezpz/pbs:188:get_pbs_launch_cmd] ✅ Using [24/24] GPUs [2 hosts] x [12 GPU/host]
    [2025-09-12 11:38:21,346401][I][ezpz/dist:450:print_dist_setup] [device='xpu'][rank=0/23][local_rank=0/11][node=0/1]
    [2025-09-12 11:38:21,347018][W][utils/_logger:68:warning] Using [24 / 24] available "xpu" devices !!
    2025:09:12-11:38:21:(35392) |CCL_WARN| value of CCL_LOG_LEVEL changed to be error (default:warn)
    [2025-09-12 11:38:22,154201][I][ezpz/dist:1401:setup_torch] Using device='xpu' with backend='xccl' + 'xccl' for distributed training.
    [2025-09-12 11:38:22,155050][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 0/23]
    [2025-09-12 11:38:22,154185][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 6/23]
    [2025-09-12 11:38:22,154353][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 7/23]
    [2025-09-12 11:38:22,154299][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 9/23]
    [2025-09-12 11:38:22,154355][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][11/23]
    [2025-09-12 11:38:22,154185][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 1/23]
    [2025-09-12 11:38:22,154184][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 5/23]
    [2025-09-12 11:38:22,154350][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 8/23]
    [2025-09-12 11:38:22,154312][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][12/23]
    [2025-09-12 11:38:22,154495][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][10/23]
    [2025-09-12 11:38:22,154339][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 4/23]
    [2025-09-12 11:38:22,154312][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][13/23]
    [2025-09-12 11:38:22,154312][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][15/23]
    [2025-09-12 11:38:22,154312][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][16/23]
    [2025-09-12 11:38:22,154379][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][17/23]
    [2025-09-12 11:38:22,154398][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 2/23]
    [2025-09-12 11:38:22,154319][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][18/23]
    [2025-09-12 11:38:22,154284][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][19/23]
    [2025-09-12 11:38:22,154325][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][20/23]
    [2025-09-12 11:38:22,154382][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][22/23]
    [2025-09-12 11:38:22,154502][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][14/23]
    [2025-09-12 11:38:22,154391][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][21/23]
    [2025-09-12 11:38:22,154451][I][ezpz/dist:1448:setup_torch] ['x4117c4s6b0n0'][23/23]
    [2025-09-12 11:38:22,154411][I][ezpz/dist:1448:setup_torch] ['x4117c4s2b0n0'][ 3/23]
    [2025-09-12 11:38:22,694566][I][blendcorpus/train:85:__init__] Starting job: AuroraGPT-7B Training
    [2025-09-12 11:38:22,695590][I][blendcorpus/train:93:__init__] Running with args: {
      "activation_checkpoint": {
        "early_stop": false,
        "mode": "none",
        "per_op_sac_force_recompute_mm_shapes_by_fqns": [
          "moe.router.gate"
        ],
        "selective_ac_option": "op"
      },
      "blendcorpus": {
        "append_eod": true,
        "blend_sample_in_corpus": false,
        "data_cache_path": "./.cache/data/auroraGPT-7B/olmo-mix-1124/",
        "data_file_list": null,
        "dataloader_type": "single",
        "eod_token_id": 2,
        "micro_batch_size": null,
        "num_workers": 2,
        "provide_attention_mask": false,
        "seq_length": null,
        "shuffle": true,
        "shuffle_sample_in_corpus": true,
        "split": "98,1,1"
      },
      "checkpoint": {
        "async_mode": "disabled",
        "create_seed_checkpoint": false,
        "enable": false,
        "enable_first_step_checkpoint": false,
        "exclude_from_loading": [],
        "export_dtype": "float32",
        "folder": "checkpoint",
        "initial_load_in_hf": false,
        "initial_load_model_only": true,
        "initial_load_path": null,
        "interval": 10,
        "keep_latest_k": 10,
        "last_save_in_hf": false,
        "last_save_model_only": false,
        "load_step": -1
      },
      "comm": {
        "init_timeout_seconds": 300,
        "save_traces_folder": "comm_traces",
        "trace_buf_size": 20000,
        "train_timeout_seconds": 100
      },
      "compile": {
        "components": [
          "model",
          "loss"
        ],
        "enable": true
      },
      "experimental": {
        "custom_args_module": "torchtitan.experiments.blendcorpus.job_config",
        "custom_import": ""
      },
      "fault_tolerance": {
        "enable": false,
        "group_size": 0,
        "min_replica_size": 1,
        "process_group": "gloo",
        "process_group_timeout_ms": 10000,
        "replica_id": 0,
        "semi_sync_method": null
      },
      "float8": {
        "emulate": false,
        "enable_fsdp_float8_all_gather": false,
        "filter_fqns": [
          "output"
        ],
        "moe_fqns_prototype": [],
        "precompute_float8_dynamic_scale_for_fsdp": false,
        "recipe_name": null
      },
      "job": {
        "config_file": "torchtitan/experiments/blendcorpus/train_configs/auroraGPT_7B.toml",
        "description": "AuroraGPT-7B Training",
        "dump_folder": "./outputs/AuroraGPT-7B",
        "print_args": true,
        "use_for_integration_test": true
      },
      "lr_scheduler": {
        "decay_ratio": 0.8,
        "decay_type": "linear",
        "min_lr_factor": 0.0,
        "warmup_steps": 2
      },
      "memory_estimation": {
        "disable_fake_mode": false,
        "enable": false
      },
      "metrics": {
        "disable_color_printing": false,
        "enable_tensorboard": true,
        "enable_wandb": true,
        "log_freq": 1,
        "save_for_all_ranks": false,
        "save_tb_folder": "tb"
      },
      "model": {
        "converters": [],
        "flavor": "AuroraGPT-7B",
        "hf_assets_path": "./assets/hf/AuroraGPT-7B",
        "name": "blendcorpus",
        "print_after_conversion": false,
        "tokenizer_backend": "sptoken",
        "tokenizer_path": null
      },
      "mx": {
        "filter_fqns": [
          "output"
        ],
        "moe_fqns_prototype": [],
        "mxfp8_dim1_cast_kernel_choice": "triton",
        "recipe_name": "mxfp8_cublas"
      },
      "optimizer": {
        "beta1": 0.9,
        "beta2": 0.95,
        "early_step_in_backward": false,
        "eps": 1e-08,
        "implementation": "fused",
        "lr": 0.0002,
        "name": "AdamW",
        "weight_decay": 0.1
      },
      "parallelism": {
        "context_parallel_degree": 1,
        "context_parallel_rotate_method": "allgather",
        "data_parallel_replicate_degree": 1,
        "data_parallel_shard_degree": -1,
        "disable_loss_parallel": false,
        "enable_async_tensor_parallel": false,
        "enable_compiled_autograd": false,
        "expert_parallel_degree": 1,
        "expert_tensor_parallel_degree": 1,
        "fsdp_reshard_after_forward": "default",
        "module_fqns_per_model_part": null,
        "pipeline_parallel_degree": 1,
        "pipeline_parallel_first_stage_less_layers": 1,
        "pipeline_parallel_last_stage_less_layers": 1,
        "pipeline_parallel_layers_per_stage": null,
        "pipeline_parallel_microbatch_size": 1,
        "pipeline_parallel_schedule": "1F1B",
        "pipeline_parallel_schedule_csv": "",
        "pipeline_parallel_split_points": [],
        "tensor_parallel_degree": 1
      },
      "profiling": {
        "enable_memory_snapshot": false,
        "enable_profiling": false,
        "profile_freq": 10,
        "save_memory_snapshot_folder": "memory_snapshot",
        "save_traces_folder": "profile_trace"
      },
      "training": {
        "dataset": "blendcorpus",
        "dataset_path": "/flare/Aurora_deployment/AuroraGPT/datasets/dolma/dolma_v1_7_file_list_mini.txt",
        "deterministic": false,
        "enable_cpu_offload": false,
        "gc_debug": false,
        "gc_freq": 50,
        "global_batch_size": -1,
        "local_batch_size": 1,
        "max_norm": 1.0,
        "mixed_precision_param": "bfloat16",
        "mixed_precision_reduce": "float32",
        "seed": null,
        "seq_len": 4096,
        "steps": 1000
      },
      "validation": {
        "dataset": "c4_validation",
        "dataset_path": null,
        "enable": false,
        "freq": 5,
        "local_batch_size": 8,
        "seq_len": 2048,
        "steps": 10
      }
    }
    Number of ranks per node: 12
    Is initialized already
    [2025-09-12 11:38:22,781763][I][distributed/parallel_dims:158:_build_mesh_without_ep] Building 1-D device mesh with ['dp_shard'], [24]
    [2025-09-12 11:38:22,783219][I][tools/utils:65:collect] [GC] Initial GC collection 0.00 seconds
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [2025-09-12 11:38:22,795599][I][dataset/sptoken:75:build_sentencepiece_tokenizer] [SPTokenizer] Using model path: ./assets/hf/AuroraGPT-7B
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [Tokenizer] Using backend: sptoken (SentencePiece)
    [2025-09-12 11:38:22,806079][I][dataset/sptoken:36:__init__] [SPTokenizer] Loaded model: ./assets/hf/AuroraGPT-7B/tokenizer.model, vocab size: 32000
    [INFO][2025-09-12 11:38:22.811010] Reading data from /flare/Aurora_deployment/AuroraGPT/datasets/dolma/dolma_v1_7_file_list_mini.txt
    [INFO][2025-09-12 11:38:22.811281] Number of datasets: 9
    [INFO][2025-09-12 11:38:22.811427] Global batch size: 24
    [INFO][2025-09-12 11:38:22.811559] Training iterations: 1000
    [INFO][2025-09-12 11:38:22.811682] Evaluation iterations: 0
    [INFO][2025-09-12 11:38:22.811805] Total number of training samples: 24000
    [INFO][2025-09-12 11:38:22.811932] Total number of evaluation samples: 0
    [INFO][2025-09-12 11:38:22.812052] Total number of testing samples: 0
    [2025-09-12 11:38:23,388839][I][data/gpt_dataset:263:_cache_indices] > loading algebraic corpus dataset index: ./.cache/data/auroraGPT-7B/olmo-mix-1124/eaaf4239bf399fe90985648264fc597b_index.npy
    [2025-09-12 11:38:23,400289][I][data/gpt_dataset:270:_cache_indices] > loading algebraic corpus dataset sample index: ./.cache/data/auroraGPT-7B/olmo-mix-1124/eaaf4239bf399fe90985648264fc597b_sample_index.npy
    [2025-09-12 11:38:23,401313][I][data/gpt_dataset:277:_cache_indices] > finished loading in 0.01251498400233686 seconds
    [2025-09-12 11:38:23,402526][I][data/gpt_dataset:291:__init__] [BuildCorpusDataset] Caught args.shuffle_sample_in_corpus=True across 19984 samples
    [2025-09-12 11:38:23,498032][I][data/gpt_dataset:263:_cache_indices] > loading arxiv corpus dataset index: ./.cache/data/auroraGPT-7B/olmo-mix-1124/9958f3591de484302ca17a2f1feeafaf_index.npy
    [2025-09-12 11:38:23,502674][I][data/gpt_dataset:270:_cache_indices] > loading arxiv corpus dataset sample index: ./.cache/data/auroraGPT-7B/olmo-mix-1124/9958f3591de484302ca17a2f1feeafaf_sample_index.npy
    [2025-09-12 11:38:23,506868][I][data/gpt_dataset:277:_cache_indices] > finished loading in 0.008856782980728894 seconds
    [2025-09-12 11:38:23,507665][I][data/gpt_dataset:291:__init__] [BuildCorpusDataset] Caught args.shuffle_sample_in_corpus=True across 4140 samples
    [2025-09-12 11:38:23,520625][I][data/blendable_dataset:131:__init__] > loading blendable dataset index: ./.cache/data/auroraGPT-7B/olmo-mix-1124/212cf8a136a93b7bc9cd575fa4c82f21_index.npy
    [2025-09-12 11:38:23,527379][I][data/blendable_dataset:134:__init__] > loading blendable dataset sample index: ./.cache/data/auroraGPT-7B/olmo-mix-1124/212cf8a136a93b7bc9cd575fa4c82f21_sample_index.npy
    [2025-09-12 11:38:23,532038][I][data/blendable_dataset:139:__init__] > finished loading in 0.011423073010519147 seconds
    [2025-09-12 11:38:23,543427][I][data/blendable_dataset:152:__init__] > size of blendable dataset: 24124 samples
    [2025-09-12 11:38:23,544235][I][blendcorpus/train:177:__init__] Using BlendCorpus dataloader.
    [2025-09-12 11:38:23,544713][I][blendcorpus/train:185:__init__] Building blendcorpus AuroraGPT-7B with TransformerModelArgs(_enforced='This field is used to enforce all fields have defaults.', dim=4096, n_layers=32, n_heads=32, n_kv_heads=8, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, rope_theta=10000, max_seq_len=4096, depth_init=True, use_flex_attn=False, attn_mask_type='causal', eos_id=0)
    wandb: Currently logged in as: foremans (aurora_gpt) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
    wandb: Tracking run with wandb version 0.21.3
    wandb: Run data is saved locally in ./outputs/AuroraGPT-7B/tb/20250912-1138/wandb/run-20250912_113823-qzle9mdw
    wandb: Run `wandb offline` to turn off syncing.
    wandb: Syncing run snowy-sunset-14
    wandb:  View project at https://wandb.ai/aurora_gpt/torchtitan
    wandb:  View run at https://wandb.ai/aurora_gpt/torchtitan/runs/qzle9mdw
    [2025-09-12 11:38:24,703889][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:24,750784][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:24,776002][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:24,848549][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:24,864781][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:24,964752][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:25,005897][I][components/metrics:155:__init__] WandB logging enabled
    [2025-09-12 11:38:25,012474][I][components/metrics:124:__init__] TensorBoard logging enabled. Logs will be saved at ./outputs/AuroraGPT-7B/tb/20250912-1138
    [2025-09-12 11:38:25,017569][I][components/metrics:101:build_device_memory_monitor] XPU capacity: Intel(R) Data Center GPU Max 1550 with 63.98GiB memory
    [2025-09-12 11:38:25,044275][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:25,093814][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:25,126732][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:25,146946][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:25,146998][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:25,147707][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:25,148411][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:25,149299][I][blendcorpus/train:212:__init__] Model blendcorpus AuroraGPT-7B size: 5,933,109,248 total parameters
    [2025-09-12 11:38:25,150242][I][components/loss:28:build_cross_entropy_loss] Compiling the loss function with torch.compile
    [2025-09-12 11:38:25,190998][I][infra/parallelize:357:apply_compile] Compiling each TransformerBlock with torch.compile
    [2025-09-12 11:38:25,220312][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:25,245422][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:25,262857][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:25,271332][I][infra/parallelize:122:parallelize_llama] Applied FSDP to the model
    [2025-09-12 11:38:25,289336][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:25,296201][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:25,296289][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:25,298835][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:25,299750][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:25,299754][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:25,299888][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:25,770147][I][blendcorpus/train:290:__init__] Peak FLOPS used for computing MFU: 2.982e+14
    [2025-09-12 11:38:25,771316][I][blendcorpus/train:292:__init__] XPU memory usage for model: 1.04GiB(1.63%)
    [2025-09-12 11:38:25,773314][W][protocols/state_dict_adapter:76:__init__] model.safetensors.index.json not found at hf_assets_path: ./assets/hf/AuroraGPT-7B/model.safetensors.index.json. Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2025-09-12 11:38:25,774216][I][distributed/utils:225:maybe_enable_amp] Mixed precision training is handled by fully_shard
    [2025-09-12 11:38:25,774808][I][blendcorpus/train:381:__init__] Trainer is initialized with local batch size 1, global batch size 24, gradient accumulation steps 1, sequence length 4096, total steps 1000 (warmup 2)
    [2025-09-12 11:38:25,775505][I][blendcorpus/train:695:<module>] Using SDPBackend.FLASH_ATTENTION backend for SDPA
    [2025-09-12 11:38:25,776216][I][blendcorpus/train:569:train] BlendCorpus dataloader advanced to consumed =0 samples (step={self.step}).
    [2025-09-12 11:38:25,776915][I][blendcorpus/train:581:train] Training starts at step 1.
    [2025-09-12 11:39:11,844905][I][components/metrics:442:log] step:  1  loss: 10.8919  grad_norm:  5.7773  memory: 21.74GiB(33.98%)  tps: 88  tflops: 3.62  mfu: 1.21%
    [2025-09-12 11:39:11,847254][I][distributed/utils:299:set_pg_timeouts] Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
    [2025-09-12 11:39:13,996720][I][components/metrics:442:log] step:  2  loss: 15.4482  grad_norm: 95.7768  memory: 23.63GiB(36.93%)  tps: 1,906  tflops: 78.63  mfu: 26.37%
    [2025-09-12 11:39:16,148721][I][components/metrics:442:log] step:  3  loss: 18.1145  grad_norm: 177.2544  memory: 23.63GiB(36.93%)  tps: 1,905  tflops: 78.60  mfu: 26.36%
    [2025-09-12 11:39:18,293594][I][components/metrics:442:log] step:  4  loss: 12.2966  grad_norm: 47.6269  memory: 23.63GiB(36.93%)  tps: 1,912  tflops: 78.86  mfu: 26.45%
    [2025-09-12 11:39:20,423330][I][components/metrics:442:log] step:  5  loss: 12.4196  grad_norm: 55.3153  memory: 23.63GiB(36.93%)  tps: 1,925  tflops: 79.42  mfu: 26.63%
    [2025-09-12 11:39:22,550981][I][components/metrics:442:log] step:  6  loss: 10.8771  grad_norm:  5.3124  memory: 23.63GiB(36.93%)  tps: 1,927  tflops: 79.50  mfu: 26.66%
    [2025-09-12 11:39:24,670689][I][components/metrics:442:log] step:  7  loss: 10.9488  grad_norm: 41.6404  memory: 23.63GiB(36.93%)  tps: 1,934  tflops: 79.80  mfu: 26.76%
    [2025-09-12 11:39:26,791101][I][components/metrics:442:log] step:  8  loss:  9.9818  grad_norm: 18.3422  memory: 23.63GiB(36.93%)  tps: 1,934  tflops: 79.77  mfu: 26.75%
    [2025-09-12 11:39:28,911059][I][components/metrics:442:log] step:  9  loss:  9.0792  grad_norm:  9.5251  memory: 23.63GiB(36.93%)  tps: 1,934  tflops: 79.79  mfu: 26.76%
    [2025-09-12 11:39:31,025851][I][components/metrics:442:log] step: 10  loss:  8.4230  grad_norm:  4.9722  memory: 23.63GiB(36.93%)  tps: 1,939  tflops: 79.98  mfu: 26.82%
    [2025-09-12 11:39:33,138436][I][components/metrics:442:log] step: 11  loss:  8.0111  grad_norm:  4.7603  memory: 23.63GiB(36.93%)  tps: 1,941  tflops: 80.07  mfu: 26.85%
    [2025-09-12 11:39:35,250642][I][components/metrics:442:log] step: 12  loss:  7.8059  grad_norm:  9.0702  memory: 23.63GiB(36.93%)  tps: 1,941  tflops: 80.08  mfu: 26.85%
    [2025-09-12 11:39:37,361018][I][components/metrics:442:log] step: 13  loss:  7.3035  grad_norm:  5.1540  memory: 23.63GiB(36.93%)  tps: 1,943  tflops: 80.15  mfu: 26.88%
    [2025-09-12 11:39:39,472014][I][components/metrics:442:log] step: 14  loss:  7.1419  grad_norm:  4.1700  memory: 23.63GiB(36.93%)  tps: 1,942  tflops: 80.13  mfu: 26.87%
    [2025-09-12 11:39:41,584217][I][components/metrics:442:log] step: 15  loss:  6.9347  grad_norm:  4.9882  memory: 23.63GiB(36.93%)  tps: 1,941  tflops: 80.08  mfu: 26.86%
    [2025-09-12 11:39:43,690898][I][components/metrics:442:log] step: 16  loss:  7.3633  grad_norm: 31.0589  memory: 23.63GiB(36.93%)  tps: 1,946  tflops: 80.29  mfu: 26.93%
    [2025-09-12 11:39:45,799715][I][components/metrics:442:log] step: 17  loss:  7.1793  grad_norm: 13.7271  memory: 23.63GiB(36.93%)  tps: 1,944  tflops: 80.21  mfu: 26.90%
    [2025-09-12 11:39:47,907438][I][components/metrics:442:log] step: 18  loss:  7.2268  grad_norm: 10.9098  memory: 23.63GiB(36.93%)  tps: 1,945  tflops: 80.25  mfu: 26.91%
    [2025-09-12 11:39:50,018253][I][components/metrics:442:log] step: 19  loss:  6.9895  grad_norm:  6.6582  memory: 23.63GiB(36.93%)  tps: 1,942  tflops: 80.13  mfu: 26.87%
    [2025-09-12 11:39:52,127309][I][components/metrics:442:log] step: 20  loss:  6.7515  grad_norm:  3.5633  memory: 23.63GiB(36.93%)  tps: 1,944  tflops: 80.20  mfu: 26.90%
    [2025-09-12 11:39:54,237784][I][components/metrics:442:log] step: 21  loss:  6.7755  grad_norm:  3.6999  memory: 23.63GiB(36.93%)  tps: 1,943  tflops: 80.15  mfu: 26.88%
    [2025-09-12 11:39:56,348825][I][components/metrics:442:log] step: 22  loss:  6.9412  grad_norm:  3.5428  memory: 23.63GiB(36.93%)  tps: 1,942  tflops: 80.13  mfu: 26.87%
    [2025-09-12 11:39:58,460931][I][components/metrics:442:log] step: 23  loss:  6.8696  grad_norm:  2.8968  memory: 23.63GiB(36.93%)  tps: 1,941  tflops: 80.08  mfu: 26.86%
    [2025-09-12 11:40:00,572489][I][components/metrics:442:log] step: 24  loss:  6.6327  grad_norm:  5.1677  memory: 23.63GiB(36.93%)  tps: 1,942  tflops: 80.11  mfu: 26.86%
    [2025-09-12 11:40:02,683070][I][components/metrics:442:log] step: 25  loss:  6.7134  grad_norm:  3.7672  memory: 23.63GiB(36.93%)  tps: 1,943  tflops: 80.14  mfu: 26.88%
    [2025-09-12 11:40:04,793520][I][components/metrics:442:log] step: 26  loss:  6.5521  grad_norm:  3.4081  memory: 23.63GiB(36.93%)  tps: 1,943  tflops: 80.15  mfu: 26.88%
    [2025-09-12 11:40:06,906933][I][components/metrics:442:log] step: 27  loss:  6.6118  grad_norm:  2.8971  memory: 23.63GiB(36.93%)  tps: 1,940  tflops: 80.04  mfu: 26.84%
    [2025-09-12 11:40:09,019771][I][components/metrics:442:log] step: 28  loss:  6.7229  grad_norm:  2.6085  memory: 23.63GiB(36.93%)  tps: 1,941  tflops: 80.06  mfu: 26.85%
    [2025-09-12 11:40:11,135250][I][components/metrics:442:log] step: 29  loss:  6.5777  grad_norm:  2.8184  memory: 23.63GiB(36.93%)  tps: 1,938  tflops: 79.96  mfu: 26.81%
    [2025-09-12 11:40:13,249416][I][components/metrics:442:log] step: 30  loss:  6.5954  grad_norm:  2.7959  memory: 23.63GiB(36.93%)  tps: 1,939  tflops: 80.00  mfu: 26.83%
    [2025-09-12 11:40:15,364869][I][components/metrics:442:log] step: 31  loss:  6.4546  grad_norm:  3.2096  memory: 23.63GiB(36.93%)  tps: 1,938  tflops: 79.96  mfu: 26.82%
    [2025-09-12 11:40:17,476265][I][components/metrics:442:log] step: 32  loss:  6.6677  grad_norm:  2.1374  memory: 23.63GiB(36.93%)  tps: 1,942  tflops: 80.11  mfu: 26.87%
    [2025-09-12 11:40:19,590038][I][components/metrics:442:log] step: 33  loss:  6.5451  grad_norm:  2.0738  memory: 23.63GiB(36.93%)  tps: 1,940  tflops: 80.02  mfu: 26.84%
    [2025-09-12 11:40:21,706964][I][components/metrics:442:log] step: 34  loss:  6.7087  grad_norm:  2.5267  memory: 23.63GiB(36.93%)  tps: 1,937  tflops: 79.91  mfu: 26.80%
    [2025-09-12 11:40:23,826393][I][components/metrics:442:log] step: 35  loss:  6.3955  grad_norm:  1.9991  memory: 23.63GiB(36.93%)  tps: 1,935  tflops: 79.81  mfu: 26.76%
    [2025-09-12 11:40:25,943121][I][components/metrics:442:log] step: 36  loss:  6.4686  grad_norm:  1.5817  memory: 23.63GiB(36.93%)  tps: 1,937  tflops: 79.91  mfu: 26.80%
    [2025-09-12 11:40:28,062842][I][components/metrics:442:log] step: 37  loss:  6.3481  grad_norm:  2.6166  memory: 23.63GiB(36.93%)  tps: 1,934  tflops: 79.79  mfu: 26.76%
    [2025-09-12 11:40:30,184717][I][components/metrics:442:log] step: 38  loss:  6.4443  grad_norm:  2.5323  memory: 23.63GiB(36.93%)  tps: 1,932  tflops: 79.71  mfu: 26.73%
    [2025-09-12 11:40:32,305122][I][components/metrics:442:log] step: 39  loss:  6.2732  grad_norm:  2.1087  memory: 23.63GiB(36.93%)  tps: 1,934  tflops: 79.77  mfu: 26.75%
    [2025-09-12 11:40:34,431400][I][components/metrics:442:log] step: 40  loss:  6.1638  grad_norm:  1.6096  memory: 23.63GiB(36.93%)  tps: 1,928  tflops: 79.55  mfu: 26.68%
    [2025-09-12 11:40:36,558993][I][components/metrics:442:log] step: 41  loss:  6.2434  grad_norm:  2.1429  memory: 23.63GiB(36.93%)  tps: 1,927  tflops: 79.50  mfu: 26.66%
    [2025-09-12 11:40:38,684159][I][components/metrics:442:log] step: 42  loss:  6.2472  grad_norm:  1.9758  memory: 23.63GiB(36.93%)  tps: 1,929  tflops: 79.59  mfu: 26.69%
    [2025-09-12 11:40:40,811350][I][components/metrics:442:log] step: 43  loss:  6.0686  grad_norm:  2.0387  memory: 23.63GiB(36.93%)  tps: 1,927  tflops: 79.52  mfu: 26.67%
    [2025-09-12 11:40:42,942820][I][components/metrics:442:log] step: 44  loss:  6.0512  grad_norm:  1.7659  memory: 23.63GiB(36.93%)  tps: 1,924  tflops: 79.36  mfu: 26.61%
    [2025-09-12 11:40:45,071924][I][components/metrics:442:log] step: 45  loss:  5.9693  grad_norm:  3.0356  memory: 23.63GiB(36.93%)  tps: 1,926  tflops: 79.44  mfu: 26.64%
    [2025-09-12 11:40:47,202347][I][components/metrics:442:log] step: 46  loss:  6.1370  grad_norm:  2.2346  memory: 23.63GiB(36.93%)  tps: 1,924  tflops: 79.39  mfu: 26.62%
    [2025-09-12 11:40:49,335707][I][components/metrics:442:log] step: 47  loss:  6.0951  grad_norm:  2.2721  memory: 23.63GiB(36.93%)  tps: 1,922  tflops: 79.29  mfu: 26.59%
    [2025-09-12 11:40:51,472182][I][components/metrics:442:log] step: 48  loss:  6.1080  grad_norm:  2.3427  memory: 23.63GiB(36.93%)  tps: 1,919  tflops: 79.17  mfu: 26.55%
    [2025-09-12 11:40:53,607441][I][components/metrics:442:log] step: 49  loss:  5.8213  grad_norm:  2.4015  memory: 23.63GiB(36.93%)  tps: 1,920  tflops: 79.22  mfu: 26.57%
    [2025-09-12 11:40:53,644423][I][tools/utils:65:collect] [GC] Performing periodical GC collection 0.04 seconds
    [2025-09-12 11:40:55,782338][I][components/metrics:442:log] step: 50  loss:  6.0710  grad_norm:  2.2237  memory: 23.63GiB(36.93%)  tps: 1,885  tflops: 77.77  mfu: 26.08%
    [2025-09-12 11:40:57,921332][I][components/metrics:442:log] step: 51  loss:  5.6129  grad_norm:  1.8282  memory: 23.63GiB(36.93%)  tps: 1,917  tflops: 79.07  mfu: 26.52%
    [2025-09-12 11:41:00,060512][I][components/metrics:442:log] step: 52  loss:  5.8381  grad_norm:  2.2276  memory: 23.63GiB(36.93%)  tps: 1,917  tflops: 79.07  mfu: 26.52%
    [2025-09-12 11:41:02,201596][I][components/metrics:442:log] step: 53  loss:  5.5789  grad_norm:  1.8904  memory: 23.63GiB(36.93%)  tps: 1,915  tflops: 79.00  mfu: 26.49%
    [2025-09-12 11:41:04,338853][I][components/metrics:442:log] step: 54  loss:  5.5972  grad_norm:  1.9285  memory: 23.63GiB(36.93%)  tps: 1,918  tflops: 79.14  mfu: 26.54%
    [2025-09-12 11:41:06,483940][I][components/metrics:442:log] step: 55  loss:  5.5264  grad_norm:  2.1031  memory: 23.63GiB(36.93%)  tps: 1,911  tflops: 78.86  mfu: 26.45%
    [2025-09-12 11:41:08,626486][I][components/metrics:442:log] step: 56  loss:  5.6756  grad_norm:  1.8958  memory: 23.63GiB(36.93%)  tps: 1,914  tflops: 78.95  mfu: 26.48%
    [2025-09-12 11:41:10,768986][I][components/metrics:442:log] step: 57  loss:  5.5827  grad_norm:  1.9008  memory: 23.63GiB(36.93%)  tps: 1,914  tflops: 78.95  mfu: 26.48%
    [2025-09-12 11:41:12,915983][I][components/metrics:442:log] step: 58  loss:  6.1343  grad_norm:  2.2042  memory: 23.63GiB(36.93%)  tps: 1,910  tflops: 78.78  mfu: 26.42%
    [2025-09-12 11:41:15,057467][I][components/metrics:442:log] step: 59  loss:  5.7517  grad_norm:  1.7251  memory: 23.63GiB(36.93%)  tps: 1,914  tflops: 78.98  mfu: 26.49%
    [2025-09-12 11:41:17,195890][I][components/metrics:442:log] step: 60  loss:  5.5449  grad_norm:  1.7781  memory: 23.63GiB(36.93%)  tps: 1,917  tflops: 79.10  mfu: 26.53%
    [2025-09-12 11:41:19,340106][I][components/metrics:442:log] step: 61  loss:  5.5037  grad_norm:  1.8137  memory: 23.63GiB(36.93%)  tps: 1,912  tflops: 78.88  mfu: 26.45%
    [2025-09-12 11:41:21,479998][I][components/metrics:442:log] step: 62  loss:  5.5703  grad_norm:  2.2754  memory: 23.63GiB(36.93%)  tps: 1,916  tflops: 79.04  mfu: 26.51%
    [2025-09-12 11:41:23,619646][I][components/metrics:442:log] step: 63  loss:  5.3396  grad_norm:  1.9820  memory: 23.63GiB(36.93%)  tps: 1,916  tflops: 79.06  mfu: 26.51%
    [2025-09-12 11:41:25,758931][I][components/metrics:442:log] step: 64  loss:  5.2862  grad_norm:  2.1926  memory: 23.63GiB(36.93%)  tps: 1,917  tflops: 79.07  mfu: 26.52%
    [2025-09-12 11:41:27,902443][I][components/metrics:442:log] step: 65  loss:  5.3883  grad_norm:  1.8266  memory: 23.63GiB(36.93%)  tps: 1,913  tflops: 78.91  mfu: 26.46%
    [2025-09-12 11:41:30,047189][I][components/metrics:442:log] step: 66  loss:  5.3715  grad_norm:  1.8546  memory: 23.63GiB(36.93%)  tps: 1,912  tflops: 78.86  mfu: 26.45%
    [2025-09-12 11:41:32,191202][I][components/metrics:442:log] step: 67  loss:  5.3473  grad_norm:  1.8945  memory: 23.63GiB(36.93%)  tps: 1,912  tflops: 78.88  mfu: 26.45%
    [2025-09-12 11:41:34,336648][I][components/metrics:442:log] step: 68  loss:  5.4083  grad_norm:  1.6982  memory: 23.63GiB(36.93%)  tps: 1,911  tflops: 78.83  mfu: 26.44%
    [2025-09-12 11:41:36,480695][I][components/metrics:442:log] step: 69  loss:  5.2105  grad_norm:  1.5840  memory: 23.63GiB(36.93%)  tps: 1,912  tflops: 78.89  mfu: 26.45%
    [2025-09-12 11:41:38,625671][I][components/metrics:442:log] step: 70  loss:  5.2483  grad_norm:  1.8750  memory: 23.63GiB(36.93%)  tps: 1,911  tflops: 78.85  mfu: 26.44%
    [2025-09-12 11:41:40,772186][I][components/metrics:442:log] step: 71  loss:  5.1239  grad_norm:  1.9717  memory: 23.63GiB(36.93%)  tps: 1,910  tflops: 78.80  mfu: 26.43%
    [2025-09-12 11:41:42,918729][I][components/metrics:442:log] step: 72  loss:  5.3355  grad_norm:  1.8882  memory: 23.63GiB(36.93%)  tps: 1,910  tflops: 78.79  mfu: 26.42%
    [2025-09-12 11:41:45,066384][I][components/metrics:442:log] step: 73  loss:  5.0560  grad_norm:  1.6971  memory: 23.63GiB(36.93%)  tps: 1,909  tflops: 78.76  mfu: 26.41%
    [2025-09-12 11:41:47,209176][I][components/metrics:442:log] step: 74  loss:  5.0859  grad_norm:  2.6819  memory: 23.63GiB(36.93%)  tps: 1,913  tflops: 78.93  mfu: 26.47%
    [2025-09-12 11:41:49,355442][I][components/metrics:442:log] step: 75  loss:  5.2856  grad_norm:  1.8572  memory: 23.63GiB(36.93%)  tps: 1,910  tflops: 78.81  mfu: 26.43%
    [2025-09-12 11:41:51,499099][I][components/metrics:442:log] step: 76  loss:  5.2415  grad_norm:  1.4722  memory: 23.63GiB(36.93%)  tps: 1,913  tflops: 78.90  mfu: 26.46%
    [2025-09-12 11:41:53,642872][I][components/metrics:442:log] step: 77  loss:  5.1465  grad_norm:  1.6991  memory: 23.63GiB(36.93%)  tps: 1,913  tflops: 78.90  mfu: 26.46%
    [2025-09-12 11:41:55,790222][I][components/metrics:442:log] step: 78  loss:  4.9042  grad_norm:  2.5348  memory: 23.63GiB(36.93%)  tps: 1,909  tflops: 78.77  mfu: 26.41%
    [2025-09-12 11:41:57,938398][I][components/metrics:442:log] step: 79  loss:  5.1845  grad_norm:  2.1790  memory: 23.63GiB(36.93%)  tps: 1,908  tflops: 78.73  mfu: 26.40%
    [2025-09-12 11:42:00,085052][I][components/metrics:442:log] step: 80  loss:  5.0380  grad_norm:  1.8122  memory: 23.63GiB(36.93%)  tps: 1,910  tflops: 78.79  mfu: 26.42%
    [2025-09-12 11:42:02,229187][I][components/metrics:442:log] step: 81  loss:  5.1028  grad_norm:  2.3178  memory: 23.63GiB(36.93%)  tps: 1,912  tflops: 78.89  mfu: 26.46%
    [2025-09-12 11:42:04,376585][I][components/metrics:442:log] step: 82  loss:  4.9639  grad_norm:  1.7682  memory: 23.63GiB(36.93%)  tps: 1,909  tflops: 78.77  mfu: 26.42%
    [2025-09-12 11:42:06,522266][I][components/metrics:442:log] step: 83  loss:  5.1079  grad_norm:  2.0751  memory: 23.63GiB(36.93%)  tps: 1,911  tflops: 78.83  mfu: 26.44%
    [2025-09-12 11:42:08,668032][I][components/metrics:442:log] step: 84  loss:  5.0744  grad_norm:  1.4189  memory: 23.63GiB(36.93%)  tps: 1,911  tflops: 78.82  mfu: 26.43%

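The throughput numbers in the log above are internally consistent and easy to check by hand. A minimal sketch, assuming the values reported in the run config and log (24 ranks total, local batch size 1, sequence length 4096, and the per-XPU peak of 2.982e+14 FLOPS that torchtitan uses for MFU):

```python
# Sanity-check the throughput numbers reported in the training log.
# All inputs below are copied from the run config / log, not measured here.

ranks = 24                 # 2 nodes x 12 ranks/node
local_batch_size = 1       # training.local_batch_size
seq_len = 4096             # training.seq_len
peak_flops = 2.982e14      # per-XPU peak FLOPS used for MFU in the log

# global_batch_size is -1 in the config, so it is inferred from the world size
global_batch_size = ranks * local_batch_size    # 24, matching the log
tokens_per_step = global_batch_size * seq_len   # 98,304 tokens per step

# MFU is achieved TFLOPS over peak FLOPS; compare against step 2 of the log
tflops_step2 = 78.63e12
mfu_pct = 100 * tflops_step2 / peak_flops

print(f"global batch size: {global_batch_size}")
print(f"tokens/step:       {tokens_per_step}")
print(f"MFU at step 2:     {mfu_pct:.2f}%")   # ~26.37%, matching the log
```

The reported `tps` (~1,900) is per-rank tokens/second: one 4096-token sample per rank divided by the ~2.15 s step time.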
Footnotes

  1. Submitted PR #2

Citation

BibTeX citation:
@online{foreman2025,
  author = {Foreman, Sam},
  title = {đŸč {BlendCorpus} + {TorchTitan} @ {ALCF}},
  date = {2025-09-12},
  url = {https://samforeman.me/posts/2025/09/12/},
  langid = {en}
}
For attribution, please cite this work as:
Foreman, Sam. 2025. “đŸč BlendCorpus + TorchTitan @ ALCF.” September 12, 2025. https://samforeman.me/posts/2025/09/12/.