
fix(dataset): use revision-safe Hub cache for downloaded datasets#3229

Closed
AdilZouitine wants to merge 3 commits into huggingface:main from AdilZouitine:fix/caching-dataset

Contributor

@AdilZouitine AdilZouitine commented Mar 27, 2026

What & Why

LeRobotDataset currently calls snapshot_download(local_dir=self.root) for every
Hub fetch, which flattens all revisions into a single directory tree under
$HF_LEROBOT_HOME/{repo_id}. When two processes (or two training jobs) read
different revisions of the same dataset concurrently, one overwrites the other's
files, corrupting both.

Fixes #3224
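
The overwrite described above can be reproduced without the Hub at all. The following is a minimal sketch, not LeRobot code: `REVISIONS`, `materialize`, `shared_dir_read`, and `per_revision_read` are hypothetical names, and the file contents are placeholders. It only illustrates why a single directory shared by two revisions is unsafe, while one directory per revision is not.

```python
import tempfile
from pathlib import Path

# Two toy "revisions" of the same dataset file (contents are placeholders).
REVISIONS = {
    "main": {"data.parquet": "columns: action, state, test"},
    "old": {"data.parquet": "columns: action, state"},
}


def materialize(revision: str, dest: Path) -> Path:
    """Write every file of `revision` into `dest`, overwriting existing files."""
    dest.mkdir(parents=True, exist_ok=True)
    for name, content in REVISIONS[revision].items():
        (dest / name).write_text(content)
    return dest


def shared_dir_read(base: Path) -> str:
    """Job A loads 'main', then job B loads 'old' into the same directory."""
    root_a = materialize("main", base / "repo")
    materialize("old", base / "repo")  # job B overwrites job A's file
    return (root_a / "data.parquet").read_text()


def per_revision_read(base: Path) -> str:
    """Same sequence, but each revision gets its own directory."""
    root_a = materialize("main", base / "snapshots" / "main")
    materialize("old", base / "snapshots" / "old")
    return (root_a / "data.parquet").read_text()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print("shared dir  :", shared_dir_read(Path(tmp) / "a"))
        print("per revision:", per_revision_read(Path(tmp) / "b"))
```

In the shared-directory case, job A ends up reading the file that job B wrote; with one directory per revision, job A still sees its own content.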

How

When root=None (the default for all Hub-downloaded datasets), downloads now use
the Hub's native content-addressable blob store:

snapshot_download(repo_id, cache_dir=HF_LEROBOT_HUB_CACHE, revision=...)

instead of the previous:

snapshot_download(repo_id, local_dir=self.root, revision=...)

Each revision resolves to its own immutable snapshot directory under
$HF_LEROBOT_HOME/hub/datasets--{org}--{name}/snapshots/{commit}/, so two
concurrent readers of different revisions never overwrite each other's files.
Snapshot entries are symlinks into a shared blob store, so unchanged files are
stored once and an extra revision only adds the files that differ (~93 MB in
total for the 30-episode test dataset below).
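
As a rough illustration of the layout, the snapshot directory for a given commit can be computed from the repo id. This is a sketch under assumptions: `expected_snapshot_dir` is a hypothetical helper, and it takes an already-resolved commit hash (a branch name such as main is resolved to a commit by the Hub client, which this helper does not do).

```python
from pathlib import Path


def expected_snapshot_dir(hub_cache: Path, repo_id: str, commit: str) -> Path:
    """Return the snapshot folder for a dataset repo at a given commit hash.

    Follows the pattern quoted in the description:
    <hub_cache>/datasets--{org}--{name}/snapshots/{commit}
    """
    repo_folder = "datasets--" + repo_id.replace("/", "--")
    return hub_cache / repo_folder / "snapshots" / commit


if __name__ == "__main__":
    print(
        expected_snapshot_dir(
            Path("/cache/hub"),  # placeholder for the HF_LEROBOT_HUB_CACHE value
            "AdilZtn/pick_and_place_intermediate_30_ep",
            "b59010db93eb6cc3cf06ef2f7cae1bbe62b726d9",
        )
    )
```

Two different commits therefore map to two different directories, which is what keeps concurrent readers apart.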

When root is explicitly provided (local datasets, recording, etc.), the
existing local_dir materialization is preserved — no behavioral change.
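
The routing between the two modes can be sketched as a small function that builds the keyword arguments for snapshot_download. This is an illustration only, not the code in this PR: `snapshot_download_kwargs` is a hypothetical helper, and passing repo_type="dataset" is an assumption about how the call is made.

```python
from pathlib import Path
from typing import Any, Optional


def snapshot_download_kwargs(
    repo_id: str,
    revision: Optional[str],
    requested_root: Optional[Path],
    hub_cache: Path,
) -> dict[str, Any]:
    """Choose between the shared Hub cache and an explicit local directory."""
    kwargs: dict[str, Any] = {
        "repo_id": repo_id,
        "repo_type": "dataset",  # assumption: datasets are fetched as dataset repos
        "revision": revision,
    }
    if requested_root is None:
        # No explicit root: use the revision-safe snapshot cache.
        kwargs["cache_dir"] = hub_cache
    else:
        # Explicit root: keep materializing files directly into that folder.
        kwargs["local_dir"] = requested_root
    return kwargs


if __name__ == "__main__":
    print(snapshot_download_kwargs("org/name", "main", None, Path("/cache/hub")))
    print(snapshot_download_kwargs("org/name", "main", Path("/data/ds"), Path("/cache/hub")))
```

A caller would then run `snapshot_download(**kwargs)` from huggingface_hub, which returns the local folder holding the downloaded files.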

A lightweight heuristic (_has_legacy_hub_download_metadata) detects old
local_dir mirrors (they contain .cache/huggingface/download/) and
transparently migrates to the snapshot cache on the next load.
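
A minimal sketch of such a check, assuming it only tests for the presence of that directory (the actual `_has_legacy_hub_download_metadata` in this PR may do more; the un-prefixed function name below is a stand-in):

```python
import tempfile
from pathlib import Path


def has_legacy_hub_download_metadata(root: Path) -> bool:
    """Return True if `root` looks like an old snapshot_download(local_dir=...) mirror."""
    return (root / ".cache" / "huggingface" / "download").is_dir()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        print(has_legacy_hub_download_metadata(root))  # fresh folder, no marker
        (root / ".cache" / "huggingface" / "download").mkdir(parents=True)
        print(has_legacy_hub_download_metadata(root))  # marker directory present
```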

Files changed:

| File | Summary |
| --- | --- |
| configs/default.py | Updated docstring for root field |
| datasets/dataset_metadata.py | Route _pull_from_repo through cache_dir when root=None; add legacy-mirror detection |
| datasets/lerobot_dataset.py | Route _download through cache_dir when root=None; propagate resolved root to reader/meta |
| datasets/streaming_dataset.py | Mirror the same cache_dir routing for the streaming path |
| utils/constants.py | Add HF_LEROBOT_HUB_CACHE constant |
| tests/datasets/test_lerobot_dataset.py | 3 new tests covering snapshot isolation, cache-dir routing, and legacy bypass |

Testing

Manual concurrent-read test (reproducer from the issue):

# Terminal 1 — reads "main" (has 'test' column)
python experiment_concurent_read.py main

# Terminal 2 — reads an older commit (no 'test' column)
python experiment_concurent_read.py b59010db93eb6cc3cf06ef2f7cae1bbe62b726d9

Before this fix, both terminals eventually report *** CHANGED *** as one process
overwrites the other's parquet files. After this fix, each terminal reads from its
own snapshot directory and the columns never change:

# Terminal 1 (main): always True
[09:39:13] [main] has 'test' column=True

# Terminal 2 (old commit): always False
[09:39:26] [b59010db93eb] has 'test' column=False

The cache stays compact: unchanged files are stored once in the Hub blob store, and snapshot entries are symlinks into it:

❯ du -sh datasets--AdilZtn--pick_and_place_intermediate_30_ep
 93M    datasets--AdilZtn--pick_and_place_intermediate_30_ep
❯ du -sh datasets--AdilZtn--pick_and_place_intermediate_30_ep/snapshots/
  0B    datasets--AdilZtn--pick_and_place_intermediate_30_ep/snapshots/
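
The 0B reading for snapshots/ is what one would expect when snapshot entries are links rather than copies. The toy layout below is a sketch, not the real cache: `build_toy_cache` and `stored_bytes` are hypothetical helpers, the blob names and sizes are made up, and it assumes a platform where symlinks can be created.

```python
import os
import tempfile
from pathlib import Path


def build_toy_cache(cache: Path) -> None:
    """Create blobs/ plus two snapshots whose entries are symlinks to the blobs."""
    blobs = cache / "blobs"
    blobs.mkdir(parents=True)
    (blobs / "aaa").write_bytes(b"x" * 1000)  # file present in both revisions
    (blobs / "bbb").write_bytes(b"y" * 10)    # file present only in rev2
    layout = {
        "rev1": {"data.parquet": "aaa"},
        "rev2": {"data.parquet": "aaa", "extra.json": "bbb"},
    }
    for rev, files in layout.items():
        snap = cache / "snapshots" / rev
        snap.mkdir(parents=True)
        for name, blob in files.items():
            os.symlink(os.path.join("..", "..", "blobs", blob), snap / name)


def stored_bytes(root: Path) -> int:
    """Sum the sizes of regular files under `root`, skipping symlinks."""
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = Path(dirpath) / name
            if not path.is_symlink():
                total += path.stat().st_size
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "datasets--org--name"
        build_toy_cache(cache)
        print("whole repo folder:", stored_bytes(cache))
        print("snapshots only   :", stored_bytes(cache / "snapshots"))
```

The shared file is counted once for the whole folder, and the snapshots subtree contributes no file data of its own, while each snapshot still reads as a complete copy of its revision.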

Unit tests:

pytest -sx tests/datasets/test_lerobot_dataset.py::test_metadata_without_root_uses_hub_cache_snapshot_download
pytest -sx tests/datasets/test_lerobot_dataset.py::test_without_root_reads_different_revisions_from_distinct_snapshot_roots
pytest -sx tests/datasets/test_lerobot_dataset.py::test_metadata_without_root_ignores_legacy_local_dir_cache

Notes for reviewers

  • The core change is small: _pull_from_repo and _download branch on
    self._requested_root is None to choose between cache_dir and local_dir.
    Everything else is docstrings, constants, and tests.
  • The legacy-mirror heuristic checks for
    root / ".cache" / "huggingface" / "download" — this directory is created by
    snapshot_download(local_dir=...) but never by user-created datasets.
  • streaming_dataset.py follows the same pattern for consistency.
  • The _requested_root attribute distinguishes "user explicitly set root" from
    "we derived root from the default".

To run the three new tests in one invocation:

pytest -sx tests/datasets/test_lerobot_dataset.py -k "hub_cache or legacy or distinct_snapshot"
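
The reason for a separate _requested_root attribute, as mentioned in the reviewer notes, is that root always ends up holding some usable path, so root alone cannot say whether the caller chose it. The class below is a hedged sketch of that idea: `DatasetRoots`, `default_home`, `uses_hub_cache`, and `adopt_snapshot` are hypothetical names, not the PR's API.

```python
from pathlib import Path
from typing import Optional


class DatasetRoots:
    """Hypothetical holder for the requested root and the effective root."""

    def __init__(self, repo_id: str, default_home: Path, root: Optional[Path] = None):
        # What the caller asked for; None means "no explicit root".
        self._requested_root = Path(root) if root is not None else None
        # The effective root is never None, so it cannot encode the caller's intent.
        if self._requested_root is not None:
            self.root = self._requested_root
        else:
            self.root = default_home / repo_id

    @property
    def uses_hub_cache(self) -> bool:
        """True when downloads should go through the snapshot cache."""
        return self._requested_root is None

    def adopt_snapshot(self, snapshot_dir: Path) -> None:
        """After a cache download, point `root` at the resolved snapshot folder."""
        if self.uses_hub_cache:
            self.root = Path(snapshot_dir)


if __name__ == "__main__":
    roots = DatasetRoots("org/name", default_home=Path("/home/lerobot"))
    roots.adopt_snapshot(Path("/cache/hub/datasets--org--name/snapshots/abc"))
    print(roots.uses_hub_cache, roots.root)
```

An explicitly provided root is left untouched by `adopt_snapshot`, which matches the "no behavioral change" promise for local datasets.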

…uce hub cache support

- Updated DatasetConfig and LeRobotDatasetMetadata to clarify root directory behavior and introduce a dedicated hub cache for downloads.
- Refactored LeRobotDataset and StreamingLeRobotDataset to utilize the new hub cache and improve directory management.
- Added tests to ensure correct behavior when using the hub cache and handling different revisions without a specified root directory.
@AdilZouitine AdilZouitine marked this pull request as ready for review March 27, 2026 08:52
@github-actions github-actions bot added the dataset, tests, and configuration labels Mar 27, 2026
@imstevenpmwork imstevenpmwork self-requested a review March 27, 2026 10:14
@imstevenpmwork imstevenpmwork self-assigned this Mar 27, 2026
- Updated LeRobotDataset to store the requested root path separately from the actual root path.
- Adjusted metadata loading to use the requested root, enhancing clarity and consistency in directory management.
@imstevenpmwork
Collaborator

imstevenpmwork commented Mar 27, 2026

Great issue description and follow-up PR, thanks!
I superseded this PR with #3233 to handle a few small details and avoid the back-and-forth.
However, we need to wait for #3231 before the CI checks turn green.



Development

Successfully merging this pull request may close these issues.

LeRobotDataset is not revision-safe on shared storage
