🧊 Cooling Down Checkpoints: Best Practices for Model Evaluation
Strategies for reliably evaluating machine-learning models from intermediate training checkpoints, and why cooling down the learning rate first matters.
📉 Simple Experiment to Compare Validation Loss
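The setup: take an intermediate checkpoint from a long constant-learning-rate run, continue it briefly with the learning rate cooled down, and compare validation loss against the same checkpoint continued without the cooldown.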
☃️ Cooling Down
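"Cooling down" means annealing the learning rate from its constant value toward zero over the final stretch of training rather than stopping abruptly. A checkpoint taken mid-run at a high constant learning rate tends to understate model quality, and a short cooldown recovers much of that gap before evaluation.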
Run on 256 nodes of Aurora, cooled down over the last 10% of training:

- W&B Run: volcanic-blaze-4312
Explicit command:
```bash
ROPE_THETA=50000 \
    GRAD_ACC_STEPS=2 \
    MICRO_BATCH=1 \
    USE_ACTIVATION_CHECKPOINTING=0 \
    ZERO_STAGE=0 \
    TRAIN_TOKENS=4673780159710 \
    OPT=sophiag \
    DATA_FILE_LIST=ALCF/data-lists/aurora/olmo-mix-1124.txt \
    LR_DECAY_STYLE=constant \
    LOAD=cooldown-checkpoints/sophiag-global-step-73500/global_step73500 \
    bash train_alcf.sh \
    --no-load-lr-state \
    --lr_constant_plus_cooldown \
    --lr_constant_plus_cooldown_frac 0.10
```
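Here `LR_DECAY_STYLE=constant` keeps the base learning-rate schedule flat, `--lr_constant_plus_cooldown` with `--lr_constant_plus_cooldown_frac 0.10` appends a cooldown over the final 10% of the `TRAIN_TOKENS` budget, and `--no-load-lr-state` skips restoring the saved scheduler state so the new schedule takes effect from the loaded step.

For the baseline side of the comparison, the same checkpoint can be continued with the cooldown flags simply dropped. A sketch, assuming `train_alcf.sh` logs validation loss to W&B either way:

```bash
# Baseline: continue from the same checkpoint with the learning rate
# held constant, giving the cooled-down run above a direct comparison.
ROPE_THETA=50000 \
    GRAD_ACC_STEPS=2 \
    MICRO_BATCH=1 \
    USE_ACTIVATION_CHECKPOINTING=0 \
    ZERO_STAGE=0 \
    TRAIN_TOKENS=4673780159710 \
    OPT=sophiag \
    DATA_FILE_LIST=ALCF/data-lists/aurora/olmo-mix-1124.txt \
    LR_DECAY_STYLE=constant \
    LOAD=cooldown-checkpoints/sophiag-global-step-73500/global_step73500 \
    bash train_alcf.sh
```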
♻️ Convert to Universal (Optional)
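A checkpoint saved by DeepSpeed is sharded across the ranks that wrote it, which makes reloading it for evaluation under a different node count or parallelism layout awkward. DeepSpeed ships a converter, `ds_to_universal.py`, that merges the shards into a layout-agnostic "universal" checkpoint. A minimal sketch, assuming a local DeepSpeed checkout and illustrative folder names:

```bash
# Merge the per-rank shards into a parallelism-agnostic "universal"
# checkpoint that can be reloaded under a different world size.
python3 deepspeed/checkpoint/ds_to_universal.py \
    --input_folder cooldown-checkpoints/sophiag-global-step-73500/global_step73500 \
    --output_folder cooldown-checkpoints/sophiag-global-step-73500/global_step73500_universal
```

The converted directory can then be loaded with Megatron-DeepSpeed's `--universal-checkpoint` flag, independent of how many ranks produced the original shards.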
📄 W&B Report
Citation
BibTeX citation:
```bibtex
@online{foreman2025,
  author = {Foreman, Sam},
  title = {🧊 {Cooling} {Down} {Checkpoints:} {Best} {Practices} for
    {Model} {Evaluation}},
  date = {2025-11-12},
  url = {https://samforeman.me/posts/2025/11/12/},
  langid = {en}
}
```
For attribution, please cite this work as:
Foreman, Sam. 2025. “🧊 Cooling Down Checkpoints: Best Practices
for Model Evaluation.” November 12, 2025. https://samforeman.me/posts/2025/11/12/.
