feat(dataviewer): enhance HDF5 video handling and nested dataset support by agreaves-ms · Pull Request #180 · microsoft/physical-ai-toolchain

agreaves-ms · 2026-03-12T19:34:54Z

This PR strengthens the dataviewer's HDF5 capabilities across video generation, blob storage integration, and dataset organization. It introduces on-demand video generation from HDF5 image data, extends Azure Blob Storage support to HDF5 datasets, enables hierarchical dataset IDs up to 5 levels deep, and simplifies the frontend video playback architecture by removing the persistent frame cache in favor of native HTML5 video rendering.

The frame caching system served its purpose during initial development but added complexity that wasn't justified as HDF5 video generation matured. Native <video> element playback is simpler and more reliable for the generated MP4 files.

Description

HDF5 Video Generation

The HDF5 format handler gained full video generation capabilities with an on-demand caching strategy. Videos are generated using ffmpeg (H.264, ultrafast preset, yuv420p pixel format) with an OpenCV fallback when ffmpeg is unavailable. Generated videos are cached at meta/videos/{camera}/episode_{NNNNNN}.mp4 and uploaded to blob storage when configured.

Added _generate_video() and _generate_video_cv2() to hdf5_handler.py with graceful degradation between backends
Implemented get_video_path() with cache-first lookup and synchronous on-demand generation
Added load_single_frame() in hdf5_loader.py using h5py slice indexing for memory-efficient frame extraction
Installed ffmpeg system dependency in the backend Dockerfile
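
The cache-first lookup and the ffmpeg settings listed above can be sketched roughly as follows. This is an illustration, not the code in hdf5_handler.py: the six-digit zero padding is inferred from the `episode_{NNNNNN}` pattern, feeding raw RGB frames over stdin and the `libx264` encoder are assumptions, and `cached_video_path`, `ffmpeg_command`, `pick_backend`, and the `generate` callback are hypothetical names.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def cached_video_path(dataset_root: Path, camera: str, episode: int) -> Path:
    """Build meta/videos/{camera}/episode_{NNNNNN}.mp4 (6-digit padding assumed)."""
    return dataset_root / "meta" / "videos" / camera / f"episode_{episode:06d}.mp4"


def ffmpeg_command(width: int, height: int, fps: int, out_path: Path) -> list[str]:
    """Arguments for encoding raw RGB frames piped on stdin to H.264 (assumed input format)."""
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "rgb24",
        "-s", f"{width}x{height}", "-r", str(fps),
        "-i", "-",
        "-c:v", "libx264", "-preset", "ultrafast", "-pix_fmt", "yuv420p",
        str(out_path),
    ]


def pick_backend() -> str:
    """Prefer ffmpeg when it is on PATH, otherwise fall back to OpenCV."""
    return "ffmpeg" if shutil.which("ffmpeg") else "cv2"


def get_video_path(
    dataset_root: Path,
    camera: str,
    episode: int,
    generate: Callable[[Path], None],
) -> Path:
    """Return the cached video if present, otherwise generate it synchronously."""
    path = cached_video_path(dataset_root, camera, episode)
    if path.exists():
        return path  # cache hit: no generation
    path.parent.mkdir(parents=True, exist_ok=True)
    generate(path)  # hypothetical callback wrapping the chosen backend
    return path
```

Note that libx264 with the `yuv420p` pixel format needs even frame dimensions, so odd-sized frames would have to be padded or cropped before encoding.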

Blob Storage HDF5 Integration

Extended the blob dataset provider to discover, sync, and serve HDF5 datasets alongside existing LeRobot support.

Added scan_all_dataset_ids() for single-pass discovery returning both LeRobot and HDF5 dataset types
Implemented sync_hdf5_dataset_to_local() for metadata and placeholder sync, and sync_hdf5_episode_to_local() for on-demand episode downloads
Enhanced video path resolution with _build_video_path_candidates() supporting template-based, index-based, and fallback scan strategies
Switched from list_blobs() to list_blob_names() for faster blob enumeration
Added upload_video() for persisting locally generated videos back to blob storage
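
As a rough illustration of single-pass discovery, the sketch below classifies dataset prefixes from a flat iterable of blob names, such as the one `list_blob_names()` yields. The detection rules are assumptions made for the example, not taken from this PR: a `meta/info.json` blob marks a LeRobot dataset, and `.hdf5`/`.h5` files sitting directly under a prefix mark an HDF5 dataset. `classify_datasets` is a hypothetical name.

```python
from typing import Dict, Iterable

HDF5_SUFFIXES = (".hdf5", ".h5")   # assumed episode file extensions
LEROBOT_MARKER = "meta/info.json"  # assumed LeRobot marker file


def classify_datasets(blob_names: Iterable[str]) -> Dict[str, str]:
    """Map dataset IDs to "lerobot" or "hdf5" in one pass over the blob names."""
    found: Dict[str, str] = {}
    for name in blob_names:
        if name.endswith("/" + LEROBOT_MARKER):
            prefix = name[: -len(LEROBOT_MARKER)].rstrip("/")
            kind = "lerobot"
        elif name.endswith(HDF5_SUFFIXES):
            prefix = name.rpartition("/")[0]
            kind = "hdf5"
        else:
            continue
        if not prefix:
            continue  # ignore files at the container root
        dataset_id = prefix.replace("/", "--")
        if kind == "lerobot":
            found[dataset_id] = kind          # the marker file wins
        else:
            found.setdefault(dataset_id, kind)
    return found
```

Because only names are inspected, no blob properties need to be fetched during enumeration.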

Nested Dataset ID Support

Enabled hierarchical dataset organization using -- separators (e.g., project--recordings--session_1) mapping to /-separated paths in storage.

Added paths.py utility with dataset_id_to_blob_prefix() centralized across storage adapters
Updated _scan_directory() with recursive traversal and _validate_dataset_id() enforcing max 5 segments
Refactored dataset grouping from split("--", 1)[0] to "--".join(split("--")[:-1]) for correct multi-level grouping
Updated DataviewerShellHeader.tsx to display group keys with forward slashes instead of dashes
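
A minimal sketch of the ID handling described above. The function bodies are assumptions written for illustration (the real `dataset_id_to_blob_prefix()` and `_validate_dataset_id()` may differ, for example in whether the prefix carries a trailing slash); `group_key` and `display_group` are hypothetical helper names.

```python
MAX_SEGMENTS = 5
SEPARATOR = "--"


def validate_dataset_id(dataset_id: str) -> list[str]:
    """Split an ID into segments, rejecting more than 5 levels or unsafe segments."""
    segments = dataset_id.split(SEPARATOR)
    if len(segments) > MAX_SEGMENTS:
        raise ValueError(f"dataset ID has more than {MAX_SEGMENTS} segments")
    for seg in segments:
        if seg in ("", ".", "..") or "/" in seg or "\\" in seg:
            raise ValueError(f"invalid dataset ID segment: {seg!r}")
    return segments


def dataset_id_to_blob_prefix(dataset_id: str) -> str:
    """Map project--recordings--session_1 to project/recordings/session_1."""
    return "/".join(validate_dataset_id(dataset_id))


def group_key(dataset_id: str) -> str:
    """Group by every segment except the last one (the new grouping rule)."""
    return SEPARATOR.join(dataset_id.split(SEPARATOR)[:-1])


def display_group(key: str) -> str:
    """Render a group key with forward slashes for the UI."""
    return key.replace(SEPARATOR, "/")
```

For `project--recordings--session_1`, the previous `split("--", 1)[0]` rule yields `project`, while the new rule yields `project--recordings`, which is displayed as `project/recordings`. An ID with a single segment gets an empty group key.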

Label Storage Abstraction

Introduced a LabelStorage protocol in labels.py with LocalLabelStorage and BlobLabelStorage implementations, enabling label persistence in both local filesystem and Azure Blob Storage backends. Both implementations include path traversal protection via realpath() + startswith() validation.
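
The shape of the abstraction might look roughly like the sketch below. The method names `load`/`save`, the `labels.json` file name, and the constructor argument are assumptions for illustration; only the protocol and class names and the `realpath()` + `startswith()` check come from this PR.

```python
import json
import os
import tempfile
from typing import Any, Protocol


class LabelStorage(Protocol):
    """Structural interface shared by the local and blob backends (assumed methods)."""

    def load(self, dataset_id: str) -> dict[str, Any]: ...

    def save(self, dataset_id: str, labels: dict[str, Any]) -> None: ...


class LocalLabelStorage:
    def __init__(self, base_dir: str) -> None:
        self._base = os.path.realpath(base_dir)

    def _resolve(self, dataset_id: str) -> str:
        # Resolve symlinks and ".." first, then require the result to stay under the base.
        candidate = os.path.realpath(
            os.path.join(self._base, dataset_id, "labels.json")
        )
        if not candidate.startswith(self._base + os.sep):
            raise ValueError("path escapes the label directory")
        return candidate

    def load(self, dataset_id: str) -> dict[str, Any]:
        path = self._resolve(dataset_id)
        if not os.path.exists(path):
            return {}
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)

    def save(self, dataset_id: str, labels: dict[str, Any]) -> None:
        path = self._resolve(dataset_id)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(labels, fh)
```

Comparing against the base path plus a trailing separator keeps a sibling directory such as `labels_other` from passing a bare prefix check against `labels`.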

Frontend Playback Simplification

Removed the persistent video frame caching system and simplified the annotation workspace playback architecture.

Deleted useVideoFrameCache.ts (165 lines) and its test suite (244 lines)
Removed RVFC-based canvas rendering, displayCanvasRef, and cache state management from useAnnotationWorkspaceVideoSync.ts (~150 lines)
Replaced canvas-based video rendering with direct HTML5 <video> element in AnnotationWorkspacePlaybackCard.tsx
Added 200ms debounced loading overlay to prevent flicker on quick video loads

Configuration and Environment

Updated .gitignore dataset path from src/dataviewer/datasets/ to root-level datasets/
Added .env.azure.example template for Azure Blob Storage development setup
Enhanced .env.example with Azure CLI prerequisites and permission documentation
Added HMI_DATA_PATH environment variable to the backend dev script in package.json
Added Azure SDK logging suppression and lifespan-based temp directory cleanup in main.py
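
The last item could be sketched as below, assuming an ASGI framework that accepts an async context manager as its lifespan hook (FastAPI and Starlette do; the framework used by main.py is not stated here). The `dataviewer-` prefix, the `tmp_dir` attribute, and the `main()` demo are hypothetical.

```python
import asyncio
import logging
import os
import shutil
import tempfile
from contextlib import asynccontextmanager
from types import SimpleNamespace
from typing import Any, AsyncIterator


def quiet_azure_logging(level: int = logging.WARNING) -> None:
    """Raise the level of the top-level "azure" logger; child loggers inherit it."""
    logging.getLogger("azure").setLevel(level)


@asynccontextmanager
async def lifespan(app: Any) -> AsyncIterator[None]:
    """Create a temp directory at startup and remove it at shutdown."""
    quiet_azure_logging()
    app.tmp_dir = tempfile.mkdtemp(prefix="dataviewer-")  # hypothetical attribute
    try:
        yield
    finally:
        shutil.rmtree(app.tmp_dir, ignore_errors=True)


async def main() -> str:
    """Stand-in for the framework driving startup and shutdown."""
    app = SimpleNamespace()
    async with lifespan(app):
        assert os.path.isdir(app.tmp_dir)
    return app.tmp_dir
```

Tying cleanup to the lifespan means the directory is removed on shutdown even if request handling raised an error.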

Type of Change

[ ] 🐛 Bug fix (non-breaking change fixing an issue)
[x] ✨ New feature (non-breaking change adding functionality)
[x] 💥 Breaking change (fix or feature causing existing functionality to change)
[ ] 📚 Documentation update
[ ] 🏗️ Infrastructure change (Terraform/IaC)
[x] ♻️ Refactoring (no functional changes)

Component(s) Affected

[ ] deploy/000-prerequisites - Azure subscription setup
[ ] deploy/001-iac - Terraform infrastructure
[ ] deploy/002-setup - OSMO control plane / Helm
[ ] deploy/004-workflow - Training workflows
[ ] src/training - Python training scripts
[ ] docs/ - Documentation

Testing Performed

[ ] Terraform plan reviewed (no unexpected changes)
[ ] Terraform apply tested in dev environment
[ ] Training scripts tested locally with Isaac Sim
[ ] OSMO workflow submitted successfully
[ ] Smoke tests passed (smoke_test_azure.py)

Documentation Impact

[x] No documentation changes needed
[ ] Documentation updated in this PR
[ ] Documentation issue filed

Bug Fix Checklist

Complete this section for bug fix PRs. Skip for other contribution types.

[ ] Linked to issue being fixed
[ ] Regression test included, OR
[ ] Justification for no regression test:

Checklist

[x] My code follows the project conventions
[x] Commit messages follow conventional commit format
[x] I have performed a self-review
[x] Documentation impact assessed above
[x] No new linting warnings introduced

…lob storage handling
- add path utilities for converting dataset IDs to blob prefixes
- enhance dataset service to manage nested datasets up to 5 levels
- update local storage adapter to resolve nested paths correctly
- add tests for nested dataset discovery and path resolution
🗂️ - Generated by Copilot

- add tests for video loading overlay behavior before and after 200ms delay
- remove unused displayCanvasRef from playback card tests
- refactor media controller to remove video frame cache usage
- clean up video sync tests by removing frame cache related assertions
- delete obsolete video frame cache test file
🎥 - Generated by Copilot

- add support for on_generated callbacks in video generation queue
- implement safe video path checks to prevent path traversal
- update Dockerfile to include ffmpeg for HDF5 video generation
- suppress verbose Azure SDK logging in main application
🔒 - Generated by Copilot

github-actions · 2026-03-12T19:35:24Z

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

codecov-commenter · 2026-03-12T19:35:57Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 37.47%. Comparing base (745321e) to head (a0e2e1c).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #180   +/-   ##
=======================================
  Coverage   37.47%   37.47%           
=======================================
  Files          43       43           
  Lines        6135     6135           
  Branches      497      497           
=======================================
  Hits         2299     2299           
  Misses       3826     3826           
  Partials       10       10

Flag	Coverage Δ
pester	`84.80% <ø> (ø)`
pytest	`6.89% <ø> (ø)`


…aram for dataset_id validation
refactor(export): update dataset_id validation to use path_string_param
refactor(detection): replace validated_dataset_id with path_string_param for dataset_id validation
refactor(joint_config): use path_string_param for dataset_id validation
refactor(labels): replace validated_dataset_id with path_string_param for dataset_id validation
refactor(detection_service): remove unnecessary string conversion for model_name and confidence
test(validation): add tests for path_string_param and query_csv_ints_param
test(api): add tests for request model sanitization and query parameters
🔧 - Generated by Copilot



agreaves-ms added 4 commits March 12, 2026 12:28

refactor(dataviewer): simplify video generation logic and remove unused queue

f26f102

- remove video prefetch scheduling and related methods
- streamline video path retrieval and generation
- enhance HDF5 dataset syncing to include cached videos
🔧 - Generated by Copilot

agreaves-ms requested a review from a team as a code owner March 12, 2026 19:34

github-advanced-security AI found potential problems Mar 12, 2026

fix(validation): sanitize dataset_id and camera name parameters

d23afaa

- replace newline characters in dataset_id and camera name
- ensure proper validation of path parameters
🔒 - Generated by Copilot

github-advanced-security AI found potential problems Mar 12, 2026

Comment thread src/dataviewer/backend/src/api/services/detection_service.py Fixed

Comment thread src/dataviewer/backend/src/api/services/detection_service.py Fixed

agreaves-ms added 3 commits March 12, 2026 14:03

fix(detection_service): sanitize confidence and model name parameters

87a13d8

fix(detection_service): correct frame index logging to preserve type

85c440e

🔧 - Generated by Copilot

github-advanced-security AI found potential problems Mar 12, 2026

Comment thread src/dataviewer/backend/src/api/services/detection_service.py Fixed

agreaves-ms added 2 commits March 12, 2026 14:57

fix(detection_service): convert frame indices to float for processing

80be3a6

🔧 - Generated by Copilot

fix(detection_service): preserve integer frame indices during detection processing

b884ef4

- remove unnecessary float conversion for frame indices
- ensure logging displays frame indices as integers
🔧 - Generated by Copilot

WilliamBerryiii approved these changes Mar 12, 2026

fix(logging): sanitize dataset_id and model_name in log messages

a0e2e1c

- sanitize dataset_id in various logging statements to remove newlines and carriage returns
- ensure consistent logging format across dataset operations and detection service
🔒 - Generated by Copilot

agreaves-ms merged commit f1b6139 into main Mar 12, 2026
19 checks passed

agreaves-ms deleted the feat/dataviewer-hdf5-fixes branch March 12, 2026 23:03

agreaves-ms merged 11 commits into main from feat/dataviewer-hdf5-fixes


4 participants
