feat(dataviewer): enhance HDF5 video handling and nested dataset support #180
Merged
…lob storage handling
- add path utilities for converting dataset IDs to blob prefixes
- enhance dataset service to manage nested datasets up to 5 levels
- update local storage adapter to resolve nested paths correctly
- add tests for nested dataset discovery and path resolution

🗂️ - Generated by Copilot

- add tests for video loading overlay behavior before and after 200ms delay
- remove unused displayCanvasRef from playback card tests
- refactor media controller to remove video frame cache usage
- clean up video sync tests by removing frame cache related assertions
- delete obsolete video frame cache test file

🎥 - Generated by Copilot
- add support for on_generated callbacks in video generation queue
- implement safe video path checks to prevent path traversal
- update Dockerfile to include ffmpeg for HDF5 video generation
- suppress verbose Azure SDK logging in main application

🔒 - Generated by Copilot
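A minimal sketch of the kind of path check this commit refers to. The PR description names `realpath()` + `startswith()` for label storage; that the video path check works the same way is an assumption, and `is_within_base` is a hypothetical helper name, not a function from the PR.

```python
import os


def is_within_base(base_dir: str, candidate: str) -> bool:
    """Return True only if `candidate` resolves to a location inside `base_dir`.

    Hypothetical helper: both paths are resolved with realpath() so that
    `..` segments and symlinks are collapsed before the prefix comparison.
    """
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, candidate))
    # Compare against `base + os.sep` so that a sibling such as
    # `/srv/data-evil` does not pass a bare startswith("/srv/data") check.
    return target == base or target.startswith(base + os.sep)
```

Appending the separator before the prefix comparison is the detail that is easy to miss: a plain `startswith(base)` would accept any sibling directory whose name merely begins with the base directory's name.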
…ed queue
- remove video prefetch scheduling and related methods
- streamline video path retrieval and generation
- enhance HDF5 dataset syncing to include cached videos

🔧 - Generated by Copilot
Dependency Review: ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found. Scanned files: none.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files (coverage diff, main vs. #180):

| Metric | main | #180 |
|---|---|---|
| Coverage | 37.47% | 37.47% |
| Files | 43 | 43 |
| Lines | 6135 | 6135 |
| Branches | 497 | 497 |
| Hits | 2299 | 2299 |
| Misses | 3826 | 3826 |
| Partials | 10 | 10 |
- replace newline characters in dataset_id and camera name
- ensure proper validation of path parameters

🔒 - Generated by Copilot

…aram for dataset_id validation
refactor(export): update dataset_id validation to use path_string_param
refactor(detection): replace validated_dataset_id with path_string_param for dataset_id validation
refactor(joint_config): use path_string_param for dataset_id validation
refactor(labels): replace validated_dataset_id with path_string_param for dataset_id validation
refactor(detection_service): remove unnecessary string conversion for model_name and confidence
test(validation): add tests for path_string_param and query_csv_ints_param
test(api): add tests for request model sanitization and query parameters

🔧 - Generated by Copilot
🔧 - Generated by Copilot
🔧 - Generated by Copilot
…on processing
- remove unnecessary float conversion for frame indices
- ensure logging displays frame indices as integers

🔧 - Generated by Copilot
WilliamBerryiii approved these changes on Mar 12, 2026
- sanitize dataset_id in various logging statements to remove newlines and carriage returns
- ensure consistent logging format across dataset operations and detection service

🔒 - Generated by Copilot
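The log sanitization this commit describes can be sketched roughly as follows. `sanitize_for_log`, `log_dataset_access`, and the logger name are hypothetical names chosen for illustration; the commit message only states that newlines and carriage returns are removed from `dataset_id` before logging.

```python
import logging

# Hypothetical logger name, used only for this example.
logger = logging.getLogger("dataviewer.example")


def sanitize_for_log(value: str) -> str:
    """Strip CR and LF so a user-supplied value cannot start a new log line."""
    return value.replace("\r", "").replace("\n", "")


def log_dataset_access(dataset_id: str) -> str:
    """Log a dataset access using the sanitized ID and return that ID."""
    safe_id = sanitize_for_log(dataset_id)
    # Pass the value as a logging argument rather than pre-formatting it.
    logger.info("Loading dataset %s", safe_id)
    return safe_id
```

Removing the line breaks keeps a crafted `dataset_id` from injecting what looks like a separate, forged log entry.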
akhanattentive pushed a commit that referenced this pull request on Mar 12, 2026
…ort (#180)

# feat(dataviewer): enhance HDF5 video handling and nested dataset support

This PR strengthens the dataviewer's HDF5 capabilities across video generation, blob storage integration, and dataset organization. It introduces on-demand video generation from HDF5 image data, extends Azure Blob Storage support to HDF5 datasets, enables hierarchical dataset IDs up to 5 levels deep, and simplifies the frontend video playback architecture by removing the persistent frame cache in favor of native HTML5 video rendering.

> The frame caching system served its purpose during initial development but added complexity that wasn't justified as HDF5 video generation matured. Native `<video>` element playback is simpler and more reliable for the generated MP4 files.

## Description

### HDF5 Video Generation

The HDF5 format handler gained full video generation capabilities with an on-demand caching strategy. Videos are generated using **ffmpeg** (H.264, `ultrafast` preset, `yuv420p` pixel format) with an **OpenCV fallback** when ffmpeg is unavailable. Generated videos are cached at `meta/videos/{camera}/episode_{NNNNNN}.mp4` and uploaded to blob storage when configured.

- Added `_generate_video()` and `_generate_video_cv2()` to *hdf5_handler.py* with graceful degradation between backends
- Implemented `get_video_path()` with cache-first lookup and synchronous on-demand generation
- Added `load_single_frame()` in *hdf5_loader.py* using h5py slice indexing for memory-efficient frame extraction
- Installed **ffmpeg** system dependency in the backend *Dockerfile*

### Blob Storage HDF5 Integration

Extended the blob dataset provider to discover, sync, and serve HDF5 datasets alongside existing LeRobot support.

- Added `scan_all_dataset_ids()` for single-pass discovery returning both LeRobot and HDF5 dataset types
- Implemented `sync_hdf5_dataset_to_local()` for metadata and placeholder sync, and `sync_hdf5_episode_to_local()` for on-demand episode downloads
- Enhanced video path resolution with `_build_video_path_candidates()` supporting template-based, index-based, and fallback scan strategies
- Switched from `list_blobs()` to `list_blob_names()` for faster blob enumeration
- Added `upload_video()` for persisting locally generated videos back to blob storage

### Nested Dataset ID Support

Enabled hierarchical dataset organization using `--` separators (e.g., `project--recordings--session_1`) mapping to `/`-separated paths in storage.

- Added *paths.py* utility with `dataset_id_to_blob_prefix()` centralized across storage adapters
- Updated `_scan_directory()` with recursive traversal and `_validate_dataset_id()` enforcing max 5 segments
- Refactored dataset grouping from `split("--", 1)[0]` to `"--".join(split("--")[:-1])` for correct multi-level grouping
- Updated *DataviewerShellHeader.tsx* to display group keys with forward slashes instead of dashes

### Label Storage Abstraction

Introduced a `LabelStorage` protocol in *labels.py* with **LocalLabelStorage** and **BlobLabelStorage** implementations, enabling label persistence in both local filesystem and Azure Blob Storage backends. Both implementations include path traversal protection via `realpath()` + `startswith()` validation.

### Frontend Playback Simplification

Removed the persistent video frame caching system and simplified the annotation workspace playback architecture.

- Deleted *useVideoFrameCache.ts* (165 lines) and its test suite (244 lines)
- Removed RVFC-based canvas rendering, `displayCanvasRef`, and cache state management from *useAnnotationWorkspaceVideoSync.ts* (~150 lines)
- Replaced canvas-based video rendering with direct HTML5 `<video>` element in *AnnotationWorkspacePlaybackCard.tsx*
- Added 200ms debounced loading overlay to prevent flicker on quick video loads

### Configuration and Environment

- Updated *.gitignore* dataset path from `src/dataviewer/datasets/` to root-level `datasets/`
- Added *.env.azure.example* template for Azure Blob Storage development setup
- Enhanced *.env.example* with Azure CLI prerequisites and permission documentation
- Added `HMI_DATA_PATH` environment variable to the backend dev script in *package.json*
- Added Azure SDK logging suppression and lifespan-based temp directory cleanup in *main.py*

## Type of Change

- [ ] 🐛 Bug fix (non-breaking change fixing an issue)
- [x] ✨ New feature (non-breaking change adding functionality)
- [x] 💥 Breaking change (fix or feature causing existing functionality to change)
- [ ] 📚 Documentation update
- [ ] 🏗️ Infrastructure change (Terraform/IaC)
- [x] ♻️ Refactoring (no functional changes)

## Component(s) Affected

- [ ] `deploy/000-prerequisites` - Azure subscription setup
- [ ] `deploy/001-iac` - Terraform infrastructure
- [ ] `deploy/002-setup` - OSMO control plane / Helm
- [ ] `deploy/004-workflow` - Training workflows
- [ ] `src/training` - Python training scripts
- [ ] `docs/` - Documentation

## Testing Performed

- [ ] Terraform `plan` reviewed (no unexpected changes)
- [ ] Terraform `apply` tested in dev environment
- [ ] Training scripts tested locally with Isaac Sim
- [ ] OSMO workflow submitted successfully
- [ ] Smoke tests passed (`smoke_test_azure.py`)

## Documentation Impact

- [x] No documentation changes needed
- [ ] Documentation updated in this PR
- [ ] Documentation issue filed

## Bug Fix Checklist

*Complete this section for bug fix PRs. Skip for other contribution types.*

- [ ] Linked to issue being fixed
- [ ] Regression test included, OR
- [ ] Justification for no regression test:

## Checklist

- [x] My code follows the [project conventions](copilot-instructions.md)
- [x] Commit messages follow [conventional commit format](instructions/commit-message.instructions.md)
- [x] I have performed a self-review
- [x] Documentation impact assessed above
- [x] No new linting warnings introduced
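The nested dataset ID rules described in this commit message (`--` separators, `/`-separated storage paths, at most 5 segments, grouping by everything but the last segment) can be sketched as below. The function name `dataset_id_to_blob_prefix` comes from the PR, but its body here is an assumption; `validate_dataset_id` and `group_key` are hypothetical stand-ins, and details such as the rejected characters and the absence of a trailing slash are guesses rather than the PR's actual behavior.

```python
SEPARATOR = "--"
MAX_SEGMENTS = 5  # the PR states hierarchical IDs up to 5 levels deep


def validate_dataset_id(dataset_id: str) -> list[str]:
    """Split an ID into segments and reject IDs that are too deep or unsafe.

    Hypothetical stand-in for the PR's `_validate_dataset_id()`.
    """
    segments = dataset_id.split(SEPARATOR)
    if len(segments) > MAX_SEGMENTS:
        raise ValueError(f"dataset ID has more than {MAX_SEGMENTS} segments")
    for segment in segments:
        # Assumed rule: no empty segments, no dot segments, no path separators.
        if not segment or segment in {".", ".."} or "/" in segment or "\\" in segment:
            raise ValueError(f"invalid dataset ID segment: {segment!r}")
    return segments


def dataset_id_to_blob_prefix(dataset_id: str) -> str:
    """Map `a--b--c` to `a/b/c` (assumed: no trailing slash)."""
    return "/".join(validate_dataset_id(dataset_id))


def group_key(dataset_id: str) -> str:
    """Group by every segment except the last, as in the refactored expression."""
    return SEPARATOR.join(dataset_id.split(SEPARATOR)[:-1])
```

For `project--recordings--session_1`, the earlier `split("--", 1)[0]` grouping yields `project`, which collapses every nested level into the top-level group, while the refactored expression yields `project--recordings`.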
akhanattentive pushed a commit that referenced this pull request on Mar 16, 2026
akhanattentive pushed a commit that referenced this pull request on Mar 16, 2026
akhanattentive pushed a commit that referenced this pull request on Mar 16, 2026