
feat: Add validation loss tracking, early stopping, and checkpoint cleanup#2633

Open
NotNANtoN wants to merge 6 commits into huggingface:main from githubnemo:validation_loss

Conversation


@NotNANtoN NotNANtoN commented Dec 12, 2025

Summary

Adds a robust validation pipeline to lerobot-train. This enables monitoring generalization during training on a held-out subset of episodes, and supports early stopping and automatic disk-space management via checkpoint cleanup.

Features

  • Validation Split: validation_fraction config option to split dataset episodes into train/val sets.
  • Random Episode Shuffling: Includes early_stopping.shuffle_episodes (default: True) to ensure the validation set is a representative cross-section of the whole dataset, avoiding bias from sequential data collection.
  • Model-Agnostic Validation Loss: Computed using the policy's select_action method. This provides L1 and L2 losses that are comparable across different architectures.
  • Early Stopping: Automatically stop training if a monitored metric (val_loss or eval_success) stops improving after a set patience period.
  • Checkpoint Management: keep_last_n_checkpoints automatically prunes old checkpoints to save disk space. The logic resolves the last symlink first, so the symlink's target is never deleted.
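As a rough sketch, the episode split with shuffling described above might look like the following (the function name and signature are illustrative, not the PR's actual code):

```python
# Illustrative sketch of the train/val episode split with optional shuffling.
# `split_episodes` and its parameters are assumptions, not lerobot's real API.
import random

def split_episodes(num_episodes, validation_fraction, shuffle=True, seed=42):
    """Return (train_episodes, val_episodes) as lists of episode indices."""
    episodes = list(range(num_episodes))
    if shuffle:
        # Seeded shuffle keeps the split reproducible across resumed runs while
        # avoiding bias from sequential data collection order.
        random.Random(seed).shuffle(episodes)
    n_val = int(num_episodes * validation_fraction)
    return episodes[n_val:], episodes[:n_val]
```

With shuffling disabled the first `n_val` episodes become the validation set, which is exactly the sequential-collection bias the `shuffle_episodes` default guards against.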

Design Decisions

  • select_action vs forward: We use select_action for validation to provide a "real-world" metric. This captures the model's performance as it would behave in deployment, including any inference-time processing or ensembling.
  • Inference Compatibility: Forcing batch_size=1 for validation is a proactive design choice. While not all current policies use internal state, many inference implementations (including those with temporal queuing or history tracking) are designed for single-stream execution. This ensures the validation framework is compatible with the widest range of policies.
  • Clean Validation Data: The validation dataset is instantiated with image transforms disabled to provide a consistent, unaugmented baseline for performance monitoring.
  • Minimal Impact: All features are opt-in. Default behavior (fraction = 0.0) remains identical to the current main branch.
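The select_action-based validation loop could be sketched as follows; `policy.select_action`, the batch layout, and the `val/`-prefixed metric names are assumptions standing in for lerobot's real interfaces:

```python
# Hedged sketch of model-agnostic validation: run the deployment inference
# path and compare predicted actions to ground truth with L1/L2 losses.
import torch

@torch.no_grad()
def validation_losses(policy, val_loader):
    """Average L1/L2 between predicted and ground-truth actions."""
    policy.eval()
    l1_sum = l2_sum = n = 0.0
    for batch in val_loader:  # batch_size=1 for inference compatibility
        pred = policy.select_action(batch)  # same code path as deployment
        target = batch["action"]
        l1_sum += torch.nn.functional.l1_loss(pred, target).item()
        l2_sum += torch.nn.functional.mse_loss(pred, target).item()
        n += 1
    return {"val/l1_loss": l1_sum / n, "val/l2_loss": l2_sum / n}
```

Because only `select_action` is assumed, the same loop works for any policy regardless of its internal training loss.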

Config Options

```python
validation_fraction: float = 0.0              # 0.1 = 10% of episodes for validation
early_stopping.enable: bool = False           # Toggle early stopping
early_stopping.patience_steps: int = 10000    # Steps to wait for improvement
early_stopping.monitor: str = "val_loss"      # "val_loss" or "eval_success"
early_stopping.shuffle_episodes: bool = True  # Shuffle before split (recommended)
keep_last_n_checkpoints: int = 0              # 0 = keep all, N = keep only latest N
```
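A minimal sketch of how the keep_last_n_checkpoints pruning might work, based on the description above (the `prune_checkpoints` name and the step-named directories with a `last` symlink are assumptions):

```python
# Hypothetical sketch of checkpoint pruning: keep the newest N step-named
# directories, but never delete whatever the `last` symlink points at.
import shutil
from pathlib import Path

def prune_checkpoints(checkpoints_dir: Path, keep_last_n: int) -> list:
    """Delete all but the newest `keep_last_n` checkpoint dirs; return removed."""
    if keep_last_n <= 0:  # 0 means keep everything
        return []
    last_link = checkpoints_dir / "last"
    protected = last_link.resolve() if last_link.is_symlink() else None
    # Step numbers are zero-padded, so lexicographic sort == chronological.
    ckpts = sorted(
        p for p in checkpoints_dir.iterdir() if p.is_dir() and p.name != "last"
    )
    removed = []
    for p in ckpts[:-keep_last_n]:
        if protected is not None and p.resolve() == protected:
            continue  # resolve the symlink first; never delete its target
        shutil.rmtree(p)
        removed.append(p)
    return removed
```

Resolving the symlink before deleting is the safety property the PR description emphasizes: even if `last` points at an old checkpoint, resuming still works after pruning.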

Testing

  • Verified with ACT (model-agnostic L1/L2 monitoring)
  • Verified with SmolVLA (VRAM efficiency with batch size 1)
  • Verified Early Stopping triggers correctly on val_loss
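The patience-based early-stopping check verified above can be sketched like this (class and attribute names are hypothetical; the PR's actual implementation may differ):

```python
# Minimal sketch of step-based patience: stop once the monitored metric has
# not improved for `patience_steps` training steps.
class EarlyStopping:
    def __init__(self, patience_steps=10_000, mode="min"):
        self.patience_steps = patience_steps
        self.mode = mode  # "min" for val_loss, "max" for eval_success
        self.best = None
        self.best_step = 0

    def should_stop(self, step, value):
        improved = (
            self.best is None
            or (self.mode == "min" and value < self.best)
            or (self.mode == "max" and value > self.best)
        )
        if improved:
            self.best, self.best_step = value, step
        return step - self.best_step >= self.patience_steps
```

Using steps rather than validation rounds for patience means the setting is independent of how often validation runs.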

@github-actions github-actions bot added the configuration Problems with configuration files or settings label Dec 24, 2025
@imstevenpmwork imstevenpmwork self-assigned this Jan 13, 2026
@imstevenpmwork
Collaborator

Thanks for the contribution!

Can you solve the conflicts?

@imstevenpmwork imstevenpmwork self-requested a review January 13, 2026 15:54
@NotNANtoN
Author

> Thanks for the contribution!
>
> Can you solve the conflicts?

Thanks! Conflicts are solved now.

@imstevenpmwork
Collaborator

Hello @NotNANtoN, thanks again for this contribution. Would it be too much to ask to split this PR into smaller ones? There are a lot of different, unrelated things happening in this one, which makes the review extensive. I would suggest opening a PR with only the early stopping, for example; once that one is merged we can proceed with the two other features added here 😄

@NotNANtoN
Author

Thanks for the feedback! Happy to split this up. Just to confirm, since early stopping requires validation loss to monitor, should I keep validation tracking + early stopping together in one PR, and split out checkpoint cleanup (keep_last_n_checkpoints + logic) as a separate PR? Or did you have a different split in mind?
