Fix: lerobot-dataset-edit merge with custom root paths locally #2739

Open
riochuong wants to merge 3 commits into huggingface:main from riochuong:fix/dataset-merge-locally-with-custom-root

Conversation

@riochuong riochuong commented Jan 1, 2026

Title

fix(scripts): handle custom root paths in dataset merge operation

CREDIT:

The fix and tests were written with assistance from Claude 4.5 Sonnet. I verified the changes and ran the tests locally to make sure they work as expected.

Type / Scope

  • Type: Bug
  • Scope: scripts/lerobot_edit_dataset, datasets/dataset_tools

Summary / Motivation

The lerobot-dataset-edit --operation.type merge command failed when users specified custom --root paths for merging local datasets. The handle_merge() function was passing the root directory directly to LeRobotDataset() without appending the individual dataset's repo_id, causing the loader to look for metadata files in the wrong location (e.g., /path/to/datasets/meta/info.json instead of /path/to/datasets/dataset1/meta/info.json).

This fix lets users merge locally stored datasets that are organized in custom directories, a common workflow when working with self-collected robotic datasets before uploading them to the Hugging Face Hub.
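The path mismatch described in this section can be reproduced without lerobot installed. The following is a minimal sketch that assumes only the on-disk layout named in the summary (`<root>/<repo_id>/meta/info.json`); `info_path` and `make_fake_dataset` are hypothetical helpers written for this illustration and are not part of lerobot.

```python
import json
import tempfile
from pathlib import Path


def info_path(dataset_root: Path) -> Path:
    """Metadata file location relative to a dataset directory (layout taken from the summary)."""
    return dataset_root / "meta" / "info.json"


def make_fake_dataset(root: Path, repo_id: str) -> None:
    """Hypothetical stand-in for a real dataset: only creates an empty meta/info.json."""
    path = info_path(root / repo_id)
    path.parent.mkdir(parents=True)
    path.write_text(json.dumps({}))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    repo_ids = ["dataset1", "dataset2"]
    for repo_id in repo_ids:
        make_fake_dataset(root, repo_id)

    # Before the fix: every dataset was opened at the shared root,
    # so the loader looked for <root>/meta/info.json.
    buggy_found = [info_path(root).exists() for _ in repo_ids]

    # After the fix: each dataset is opened at <root>/<repo_id>.
    fixed_found = [info_path(root / repo_id).exists() for repo_id in repo_ids]

print(buggy_found)  # [False, False]
print(fixed_found)  # [True, True]
```

The sketch only checks where the metadata file would be looked up; it does not exercise the real loader or the merge itself.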

Related issues

  • Fixes: N/A (discovered during local dataset management workflow)
  • Related: N/A

What changed

Code changes:

  • src/lerobot/scripts/lerobot_edit_dataset.py (lines 248-250): Updated handle_merge() to construct full dataset paths by appending repo_id to custom root when provided:

    Before (buggy):

    datasets = [LeRobotDataset(repo_id, root=cfg.root) for repo_id in cfg.operation.repo_ids]

    After (fixed):

    datasets = [
        LeRobotDataset(repo_id, root=Path(cfg.root) / repo_id if cfg.root else None)
        for repo_id in cfg.operation.repo_ids
    ]

Test additions:

  • tests/datasets/test_dataset_tools.py: Added 3 comprehensive test functions:

    • test_handle_merge_with_custom_root() - Validates the bug fix with custom root paths
    • test_handle_merge_without_custom_root() - Ensures default behavior still works
    • test_handle_merge_custom_root_preserves_metadata() - Verifies metadata preservation during merge

Breaking changes: None. This is a pure bug fix that maintains backward compatibility.

How was this tested

Tests added:

  • test_handle_merge_with_custom_root - Creates two datasets in a custom root directory, merges them, and verifies the merged dataset is created in the correct location with proper episode counts.
  • test_handle_merge_without_custom_root - Tests that the default behavior (no custom root) continues to work correctly.
  • test_handle_merge_custom_root_preserves_metadata - Ensures that dataset metadata (FPS, features, episode counts, frame counts) is correctly preserved after merging with custom roots.

Manual testing:

  • Successfully merged two local datasets with custom root paths using:
    lerobot-dataset-edit \
        --repo_id merged_dataset \
        --root /path/to/datasets \
        --operation.type merge \
        --operation.repo_ids "['dataset1', 'dataset2']"

Test results:

    $ pytest tests/datasets/test_dataset_tools.py -k merge -v

All 7 merge tests pass (3 new + 4 existing).

How to run locally (reviewer)

Run all merge-related tests:

    pytest tests/datasets/test_dataset_tools.py -k merge -v

Run only the new tests:

    pytest tests/datasets/test_dataset_tools.py::test_handle_merge_with_custom_root -v
    pytest tests/datasets/test_dataset_tools.py::test_handle_merge_without_custom_root -v
    pytest tests/datasets/test_dataset_tools.py::test_handle_merge_custom_root_preserves_metadata -v

Manual test with local datasets:

    # Create two test datasets in a custom directory
    lerobot-record --some-config  # or use existing datasets

    # Try merging with custom root (this would have failed before the fix)
    lerobot-dataset-edit \
        --repo_id test_merged \
        --root /path/to/your/datasets \
        --operation.type merge \
        --operation.repo_ids "['dataset1', 'dataset2']"

Checklist (required before merge)

  • Linting/formatting run (pre-commit run -a)
  • All tests pass locally (pytest)
  • Documentation updated (N/A - internal bug fix, no user-facing API changes)
  • CI is green

Reviewer notes

Focus areas:

  • Lines 248-250 in lerobot_edit_dataset.py: Verify that the path construction logic correctly handles both the None and custom root cases
  • Test coverage: The three new tests cover the bug scenario (custom root), default behavior (no root), and metadata preservation
  • Backward compatibility: Existing 4 merge tests still pass, confirming no regressions

Design note:
The fix follows the pattern already established by LeRobotDataset.__init__() which accepts either:

  • root=None → uses HF_LEROBOT_HOME / repo_id
  • root=Path(...) → uses the provided path as-is

By constructing Path(cfg.root) / repo_id before passing to LeRobotDataset(), we ensure the dataset loader receives the complete path to each individual dataset directory.
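The two cases in this design note can be sketched as a small standalone function. This is an illustration only: `resolve_dataset_root` is a hypothetical name that does not exist in lerobot, and the PR itself inlines the same expression inside the list comprehension rather than using a helper.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_dataset_root(root: Optional[Union[str, Path]], repo_id: str) -> Optional[Path]:
    """Return the per-dataset directory to pass as `root`, or None to keep the default.

    Mirrors the expression from the fix:
        Path(cfg.root) / repo_id if cfg.root else None
    """
    if not root:
        # None (or an empty string) means "no custom root"; per the design note,
        # the loader then falls back to HF_LEROBOT_HOME / repo_id on its own.
        return None
    return Path(root) / repo_id


# Custom root: each dataset gets its own subdirectory under the shared root.
custom = [resolve_dataset_root("/path/to/datasets", r) for r in ["dataset1", "dataset2"]]

# No custom root: None is passed through unchanged.
default = resolve_dataset_root(None, "dataset1")
```

Because the check is a truthiness test, as in the fix, an empty-string root is treated the same as None.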

Edge cases covered:

  • ✅ Custom root with multiple datasets
  • ✅ No custom root (default HF_LEROBOT_HOME behavior)
  • ✅ Metadata preservation (FPS, features, episode/frame counts)
  • ✅ Different numbers of episodes per dataset

@github-actions github-actions bot added the tests label (Problems with test coverage, failures, or improvements to testing) Jan 1, 2026
# by appending the repo_id to the root. When root is None, LeRobotDataset
# automatically uses HF_LEROBOT_HOME / repo_id.
datasets = [
    LeRobotDataset(repo_id, root=Path(cfg.root) / repo_id if cfg.root else None)
Contributor

As defined in LeRobotDataset, datasets will be stored under root/repo_id, so we need to standardize the dataset location in this way.

related to #2316

Author

@riochuong riochuong Jan 8, 2026

Does that mean that, for merge to work correctly on a custom local folder, the only way is to use root=None and move the data to the default HF_LEROBOT_HOME (not too bad, but I need to remember to set this in the venv)?
