fix(dataset): use revision-safe Hub cache for downloaded datasets #3229
Closed
AdilZouitine wants to merge 3 commits into huggingface:main
…uce hub cache support
- Updated DatasetConfig and LeRobotDatasetMetadata to clarify root directory behavior and introduce a dedicated hub cache for downloads.
- Refactored LeRobotDataset and StreamingLeRobotDataset to utilize the new hub cache and improve directory management.
- Added tests to ensure correct behavior when using the hub cache and handling different revisions without a specified root directory.
- Updated LeRobotDataset to store the requested root path separately from the actual root path.
- Adjusted metadata loading to use the requested root, enhancing clarity and consistency in directory management.
What & Why
LeRobotDataset currently calls snapshot_download(local_dir=self.root) for every Hub fetch, which flattens all revisions into a single directory tree under $HF_LEROBOT_HOME/{repo_id}. When two processes (or two training jobs) read different revisions of the same dataset concurrently, one overwrites the other's files, corrupting both.
Fixes #3224
How
When root=None (the default for all Hub-downloaded datasets), downloads now use the Hub's native content-addressable blob store, passing cache_dir to snapshot_download instead of the previous local_dir=self.root.
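The choice between the two download modes can be sketched as follows. This is a minimal illustration, not the PR's code: the helper name build_snapshot_kwargs and its parameters are hypothetical, and only the keyword names (repo_id, repo_type, revision, cache_dir, local_dir) correspond to real huggingface_hub.snapshot_download arguments.

```python
from pathlib import Path
from typing import Any, Optional


def build_snapshot_kwargs(
    repo_id: str,
    revision: Optional[str],
    requested_root: Optional[Path],
    hub_cache: Path,
) -> dict[str, Any]:
    """Pick the snapshot_download keyword arguments for a dataset fetch.

    Hypothetical helper: `requested_root` stands in for the root the user
    passed (None when they relied on the default), and `hub_cache` stands in
    for the dedicated Hub cache directory.
    """
    kwargs: dict[str, Any] = {
        "repo_id": repo_id,
        "repo_type": "dataset",
        "revision": revision,
    }
    if requested_root is None:
        # Default path: use the revision-aware Hub cache layout.
        kwargs["cache_dir"] = hub_cache
    else:
        # Explicit root: keep materializing files directly into that folder.
        kwargs["local_dir"] = requested_root
    return kwargs
```

The returned dictionary would then be passed as snapshot_download(**kwargs); the helper itself performs no network access, which keeps the branching easy to unit-test.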
Each revision resolves to its own immutable snapshot directory under $HF_LEROBOT_HOME/hub/datasets--{org}--{name}/snapshots/{commit}/, so two concurrent readers can never step on each other. Snapshot entries are symlinks into a shared blob store, so disk usage stays flat (~93 MB for a 30-episode dataset regardless of how many revisions are checked out).
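To make the per-revision layout concrete, the path pattern above can be computed as in this small sketch. The function name snapshot_dir is hypothetical; the folder naming follows the pattern quoted in the description.

```python
from pathlib import Path


def snapshot_dir(hub_cache: Path, repo_id: str, commit: str) -> Path:
    """Return the snapshot folder for one commit of a dataset repo.

    Hypothetical helper mirroring the layout
    <hub_cache>/datasets--{org}--{name}/snapshots/{commit}.
    """
    # "org/name" becomes "datasets--org--name".
    repo_folder = "--".join(["datasets", *repo_id.split("/")])
    return hub_cache / repo_folder / "snapshots" / commit


# Two revisions of the same dataset resolve to two different directories.
rev_a = snapshot_dir(Path("hub"), "org/name", "aaa111")
rev_b = snapshot_dir(Path("hub"), "org/name", "bbb222")
print(rev_a)
print(rev_b)
```

Because the commit hash is part of the path, a reader pinned to one revision never shares a directory with a reader pinned to another.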
When root is explicitly provided (local datasets, recording, etc.), the existing local_dir materialization is preserved, with no behavioral change.
A lightweight heuristic (_has_legacy_hub_download_metadata) detects old local_dir mirrors (they contain .cache/huggingface/download/) and transparently migrates to the snapshot cache on the next load.
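A check of this kind might look like the sketch below. It is an assumption about the shape of the heuristic, based only on the directory named in the description; the real _has_legacy_hub_download_metadata may differ.

```python
from pathlib import Path


def has_legacy_hub_download_metadata(root: Path) -> bool:
    """Return True if `root` looks like an old local_dir mirror.

    Illustrative stand-in: it only checks for the download-metadata folder
    that the description says such mirrors contain.
    """
    return (root / ".cache" / "huggingface" / "download").is_dir()
```

A dataset directory created locally (for example by recording) would not contain that folder, so it would be left untouched.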
Files changed:
- configs/default.py: root field
- datasets/dataset_metadata.py: route _pull_from_repo through cache_dir when root=None; add legacy-mirror detection
- datasets/lerobot_dataset.py: route _download through cache_dir when root=None; propagate resolved root to reader/meta
- datasets/streaming_dataset.py: cache_dir routing for the streaming path
- utils/constants.py: HF_LEROBOT_HUB_CACHE constant
- tests/datasets/test_lerobot_dataset.py
Testing
Manual concurrent-read test, using the reproducer from the issue.
Before this fix both terminals eventually report *** CHANGED *** as one process overwrites the other's parquet files. After this fix each terminal reads from its own snapshot directory and the columns never change.
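For readers without the issue at hand, a watcher in the same spirit can be sketched as below. This is not the issue's reproducer: the functions fingerprint and watch are hypothetical, and the sketch hashes parquet file contents rather than inspecting columns, to avoid extra dependencies.

```python
import hashlib
import time
from pathlib import Path


def fingerprint(root: Path, pattern: str = "*.parquet") -> str:
    """Hash the relative paths and contents of all matching files under root."""
    digest = hashlib.sha256()
    for path in sorted(root.rglob(pattern)):
        digest.update(str(path.relative_to(root)).encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def watch(root: Path, checks: int = 5, interval_s: float = 1.0) -> bool:
    """Poll `root` and report whether its parquet files changed underneath us."""
    baseline = fingerprint(root)
    for _ in range(checks):
        time.sleep(interval_s)
        if fingerprint(root) != baseline:
            print("*** CHANGED ***")
            return True
    return False
```

Running one watcher per terminal, each pointed at the directory a different revision was loaded from, should flag a change only when both revisions share a directory.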
The cache stays compact; the Hub blob store deduplicates unchanged files via symlinks.
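One way to check this locally is to measure the cache while counting each underlying file only once. The sketch below is an assumption-laden illustration (the function name cache_size_bytes is hypothetical); it skips symlinks and deduplicates by inode, so it gives the same answer whether shared files appear as symlinks or as hard links.

```python
import os
from pathlib import Path


def cache_size_bytes(cache_root: Path) -> int:
    """Sum file sizes under cache_root, counting each real file once."""
    seen: set[tuple[int, int]] = set()
    total = 0
    # os.walk does not descend into symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(cache_root):
        for name in filenames:
            path = Path(dirpath) / name
            if path.is_symlink():
                # A symlink only points at a blob that is counted elsewhere.
                continue
            stat = path.stat()
            key = (stat.st_dev, stat.st_ino)
            if key in seen:
                # Another hard link to a file that was already counted.
                continue
            seen.add(key)
            total += stat.st_size
    return total
```

If the result barely grows after checking out an additional revision, the unchanged files are being shared rather than copied.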
Unit tests:
pytest -sx tests/datasets/test_lerobot_dataset.py -k "hub_cache or legacy or distinct_snapshot"
Notes for reviewers
- _pull_from_repo and _download branch on self._requested_root is None to choose between cache_dir and local_dir. Everything else is docstrings, constants, and tests.
- Legacy detection checks for root / ".cache" / "huggingface" / "download"; this directory is created by snapshot_download(local_dir=...) but never by user-created datasets.
- streaming_dataset.py follows the same pattern for consistency.
- The _requested_root attribute distinguishes "user explicitly set root" from "we derived root from the default".