🧊 Cooling Down Checkpoints: Best Practices for Model Evaluation

Effective strategies for evaluating machine learning models from intermediate checkpoints, ensuring reliable performance measurements.
Author: Sam Foreman

Published: November 12, 2025
Modified: November 17, 2025

📉 Simple Experiment to Compare Validation Loss

Figure: Cool Down Comparison

☃️ Cooling Down

  • 256 nodes of Aurora:

    • Cooled down over the last 10% of training
    • Explicit command:

      ROPE_THETA=50000 \
        GRAD_ACC_STEPS=2 \
        MICRO_BATCH=1 \
        USE_ACTIVATION_CHECKPOINTING=0 \
        ZERO_STAGE=0 \
        TRAIN_TOKENS=4673780159710 \
        OPT=sophiag \
        DATA_FILE_LIST=ALCF/data-lists/aurora/olmo-mix-1124.txt \
        LR_DECAY_STYLE=constant \
        LOAD=cooldown-checkpoints/sophiag-global-step-73500/global_step73500 \
        bash train_alcf.sh \
          --no-load-lr-state \
          --lr_constant_plus_cooldown \
          --lr_constant_plus_cooldown_frac 0.10
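The `--lr_constant_plus_cooldown` flags above hold the learning rate constant and then decay it over the final fraction of training (here 10%). A minimal sketch of such a schedule, assuming linear decay to zero (the actual shape used by `train_alcf.sh` may differ):

```python
def lr_constant_plus_cooldown(step: int, total_steps: int, base_lr: float,
                              cooldown_frac: float = 0.10) -> float:
    """Constant LR, then decay to zero over the final `cooldown_frac` of training.

    Linear decay is an assumption here; other cooldown shapes
    (e.g. 1-sqrt) are also common.
    """
    cooldown_start = int(total_steps * (1.0 - cooldown_frac))
    if step < cooldown_start:
        return base_lr
    # Fraction of the cooldown window already consumed, in [0, 1].
    progress = (step - cooldown_start) / (total_steps - cooldown_start)
    return base_lr * (1.0 - progress)
```

For example, with `total_steps=1000` and `base_lr=3e-4`, the learning rate stays at 3e-4 through step 899 and falls linearly to zero by step 1000.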

♻️ Convert to Universal (Optional)

To load the checkpoint under a different parallelism layout, convert the DeepSpeed ZeRO checkpoint to the universal format. (TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1 works around newer PyTorch releases defaulting torch.load to weights_only=True, which rejects the pickled optimizer state in DeepSpeed checkpoints.)

TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1 python3 ALCF/ds_to_universal.py \
    --input_folder test_rollback/global_step136000 \
    --output_folder test_rollback/global_step136000_universal
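Once converted, the universal checkpoint can be loaded under a different model/data-parallel layout. A hypothetical resume command, assuming `train_alcf.sh` forwards extra arguments to Megatron-DeepSpeed (whose `--universal-checkpoint` flag enables loading this format):

```shell
# Hypothetical: resume from the converted universal checkpoint.
# Assumes train_alcf.sh forwards --universal-checkpoint through
# to Megatron-DeepSpeed.
LOAD=test_rollback/global_step136000_universal \
  bash train_alcf.sh --universal-checkpoint
```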

📄 W&B Report

Citation

BibTeX citation:
@online{foreman2025,
  author = {Foreman, Sam},
  title = {🧊 {Cooling} {Down} {Checkpoints:} {Best} {Practices} for
    {Model} {Evaluation}},
  date = {2025-11-12},
  url = {https://samforeman.me/posts/2025/11/12/},
  langid = {en}
}
For attribution, please cite this work as:
Foreman, Sam. 2025. “🧊 Cooling Down Checkpoints: Best Practices for Model Evaluation.” November 12, 2025. https://samforeman.me/posts/2025/11/12/.