Fixes aggregation of image datasets #2717
Pull request overview
This PR fixes a bug in image dataset aggregation where HuggingFace Image() feature types were being lost during the aggregation process, causing images to be stored with generic struct schemas instead. The fix ensures proper preservation of image schemas by passing feature metadata through the aggregation pipeline.
Key changes:
- Modified `aggregate_data()` and `append_or_create_parquet_file()` to retrieve and pass the HuggingFace features schema for image datasets
- Added special handling for reading and writing parquet files containing images using `datasets.Dataset.from_parquet()` to preserve image format
- Added comprehensive test coverage with `test_aggregate_image_datasets()` to verify schema preservation and data integrity
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/datasets/test_aggregate.py | Added comprehensive test for image dataset aggregation including schema validation and data integrity checks |
| src/lerobot/datasets/aggregate.py | Updated aggregation logic to retrieve and pass HuggingFace features schema when processing image datasets, ensuring proper Image() type preservation in parquet files |
imstevenpmwork left a comment
Thanks for the contribution.
I wonder if the Copilot suggestions make sense. It seems that we could call `datasets.Dataset.from_parquet()` with the features if we already have them available. WDYT?
Some other notes for the future (unrelated to this PR):
- The `aggregate` hot path seems to switch frequently between pandas DataFrames, parquet files, and `dataset` representations, which is probably not ideal.
- I wonder if we should do this also for non-image datasets.
Thanks again!
* fix: use features when aggregating image based datasets
* add: test asserting for data type
* add: features param to writing dataset

Co-authored-by: Steven Palma <imstevenpmwork@ieee.org>
Title
Fixes aggregation of image datasets
Type / Scope
`aggregate_datasets`. Also affects `lerobot-edit-dataset`
Summary / Motivation
`aggregate_datasets` loses Image feature schema for image datasets #2715. Also, adds tests to ensure this edge case is properly covered.
Related issues
`aggregate_datasets` loses Image feature schema for image datasets #2715
What changed
How was this tested
How to run locally (reviewer)
Run the relevant tests:
Run these tests to confirm no breaking changes on closely related parts of the library.
Checklist (required before merge)
- `pre-commit run -a`
- `pytest` (run test_aggregate.py, test_dataset_tools.py)
Reviewer notes