feat: Add validation loss tracking, early stopping, and checkpoint cleanup #2633
NotNANtoN wants to merge 6 commits into huggingface:main
Conversation
This PR adds the ability to track validation loss during training.

Features:
- `validation_fraction` config option to split episodes into train/val sets
- Validation loss computed using inference (`select_action`) for model-agnostic metrics
- L1 and L2 loss metrics logged to wandb under the `val/` prefix
- Early stopping based on validation loss or eval success rate
- `keep_last_n_checkpoints` option to automatically clean up old checkpoints

The validation uses a separate dataset copy without augmentations for clean evaluation. Using `select_action` for inference-based validation makes it policy-agnostic. Backward compatible: defaults maintain existing behavior (no validation split).

Config options:
- `validation_fraction`: 0.0-1.0 (default 0.0, no validation)
- `early_stopping.enable`: bool (default False)
- `early_stopping.patience_steps`: int (default 10000)
- `early_stopping.monitor`: 'val_loss' or 'eval_success'
- `keep_last_n_checkpoints`: int (default 0, keep all)
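As an illustration of the `keep_last_n_checkpoints` cleanup and the symlink-safety mentioned in this PR, here is a minimal sketch. This is not the PR's actual implementation: the `prune_checkpoints` name, the flat `checkpoints/` directory layout, and the assumption that checkpoint directories sort chronologically by name are all assumptions made for the example.

```python
import shutil
from pathlib import Path

def prune_checkpoints(checkpoints_dir: Path, keep_last_n: int) -> list[Path]:
    """Delete all but the newest `keep_last_n` checkpoint directories.

    The `last` symlink is resolved first so that its target is never
    removed, even if it points at an old checkpoint. keep_last_n <= 0
    keeps everything (mirroring the PR's default of 0 = keep all).
    """
    if keep_last_n <= 0:
        return []
    last_link = checkpoints_dir / "last"
    protected = last_link.resolve() if last_link.is_symlink() else None
    # Assumes directory names (e.g. zero-padded step counts) sort chronologically.
    ckpts = sorted(p for p in checkpoints_dir.iterdir()
                   if p.is_dir() and not p.is_symlink())
    to_delete = [p for p in ckpts[:-keep_last_n] if p.resolve() != protected]
    for p in to_delete:
        shutil.rmtree(p)
    return to_delete
```

For example, with checkpoints `000100`, `000200`, `000300` and `last -> 000100`, pruning with `keep_last_n=1` removes only `000200`: `000300` is kept as the newest, and `000100` survives because the symlink target is protected.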
Thanks for the contribution! Can you solve the conflicts?

Thanks! Conflicts are solved now.

Hello @NotNANtoN, thanks again for this contribution. Would it be too much to ask to split this PR into smaller ones? There are a lot of different -unrelated- things happening in this one, which makes the review extensive. I would suggest opening a PR with only the early stopping, for example; once that one is merged we can proceed with the 2 other features added here 😄

Thanks for the feedback! Happy to split this up. Just to confirm: since early stopping requires a validation loss to monitor, should I keep validation tracking + early stopping together in one PR, and split out checkpoint cleanup (`keep_last_n_checkpoints` + logic) as a separate PR? Or did you have a different split in mind?
Summary
Adds a robust validation pipeline to `lerobot-train`. This allows monitoring generalization during training using a separate subset of episodes, enabling early stopping and automatic disk-space management via checkpoint cleanup.

Features
- `validation_fraction` config option to split dataset episodes into train/val sets.
- `early_stopping.shuffle_episodes` (default: `True`) to ensure the validation set is a representative cross-section of the whole dataset, avoiding bias from sequential data collection.
- Validation loss is computed via the policy's `select_action` method. This provides L1 and L2 losses that are comparable across different architectures.
- Early stopping: training halts when the monitored metric (`val_loss` or `eval_success`) stops improving after a set patience period.
- `keep_last_n_checkpoints` automatically prunes old checkpoints to save disk space. The logic is robust, resolving the `last` symlink to ensure the target of the symlink is never deleted.

Design Decisions

- `select_action` vs `forward`: We use `select_action` for validation to provide a "real-world" metric. This captures the model's performance as it would behave in deployment, including any inference-time processing or ensembling.
- `batch_size=1` for validation is a proactive design choice. While not all current policies use internal state, many inference implementations (including those with temporal queuing or history tracking) are designed for single-stream execution. This ensures the validation framework is compatible with the widest range of policies.

Config Options
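The options and defaults described in this PR could be mirrored by a config structure along the following lines. This is a hypothetical sketch for illustration only: the class names `EarlyStoppingConfig` and `ValidationConfig` are invented here, and lerobot's actual config dataclasses may be organized differently.

```python
from dataclasses import dataclass, field

@dataclass
class EarlyStoppingConfig:
    # Disabled by default, preserving existing training behavior.
    enable: bool = False
    # Steps without improvement before training halts.
    patience_steps: int = 10000
    # Metric to watch: "val_loss" or "eval_success".
    monitor: str = "val_loss"

@dataclass
class ValidationConfig:
    # Fraction of episodes held out for validation; 0.0 disables the split.
    validation_fraction: float = 0.0
    # 0 keeps all checkpoints (the backward-compatible default).
    keep_last_n_checkpoints: int = 0
    early_stopping: EarlyStoppingConfig = field(default_factory=EarlyStoppingConfig)
```

With these defaults, instantiating `ValidationConfig()` leaves training unchanged: no validation split, no early stopping, and all checkpoints retained.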
Testing
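The patience-based early stopping described above can be exercised in isolation with a small tracker like the following. This is a hypothetical sketch of the behavior, not the PR's code; the `EarlyStopper` name and its API are assumptions.

```python
class EarlyStopper:
    """Signal a stop when the monitored metric stops improving for `patience_steps`."""

    def __init__(self, patience_steps: int, mode: str = "min"):
        self.patience_steps = patience_steps
        # "min" for a loss like val_loss, "max" for a rate like eval_success.
        self.mode = mode
        self.best = None
        self.steps_since_improvement = 0

    def update(self, value: float, step_delta: int = 1) -> bool:
        """Record a new metric value; return True when training should stop."""
        improved = (
            self.best is None
            or (self.mode == "min" and value < self.best)
            or (self.mode == "max" and value > self.best)
        )
        if improved:
            self.best = value
            self.steps_since_improvement = 0
        else:
            self.steps_since_improvement += step_delta
        return self.steps_since_improvement >= self.patience_steps
```

For example, with `patience_steps=2` in "min" mode, two consecutive validation evaluations without a new best loss are enough to trigger the stop signal.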