
feat(dataviewer): enhance HDF5 video handling and nested dataset support #180

Merged
agreaves-ms merged 11 commits into main from feat/dataviewer-hdf5-fixes on Mar 12, 2026

@agreaves-ms
Contributor

feat(dataviewer): enhance HDF5 video handling and nested dataset support

This PR strengthens the dataviewer's HDF5 capabilities across video generation, blob storage integration, and dataset organization. It introduces on-demand video generation from HDF5 image data, extends Azure Blob Storage support to HDF5 datasets, enables hierarchical dataset IDs up to 5 levels deep, and simplifies the frontend video playback architecture by removing the persistent frame cache in favor of native HTML5 video rendering.

The frame caching system served its purpose during initial development but added complexity that wasn't justified as HDF5 video generation matured. Native <video> element playback is simpler and more reliable for the generated MP4 files.

Description

HDF5 Video Generation

The HDF5 format handler gained full video generation capabilities with an on-demand caching strategy. Videos are generated using ffmpeg (H.264, ultrafast preset, yuv420p pixel format) with an OpenCV fallback when ffmpeg is unavailable. Generated videos are cached at meta/videos/{camera}/episode_{NNNNNN}.mp4 and uploaded to blob storage when configured.

  • Added _generate_video() and _generate_video_cv2() to hdf5_handler.py with graceful degradation between backends
  • Implemented get_video_path() with cache-first lookup and synchronous on-demand generation
  • Added load_single_frame() in hdf5_loader.py using h5py slice indexing for memory-efficient frame extraction
  • Installed ffmpeg system dependency in the backend Dockerfile
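
The cache-first flow and encoder settings described above can be sketched roughly as follows. This is an illustration only, not the code in hdf5_handler.py: the helper names (`cached_video_path`, `build_ffmpeg_command`, `pick_backend`, `get_or_generate_video`) are hypothetical, the six-digit zero padding is inferred from the `episode_{NNNNNN}` pattern, and feeding raw RGB frames over stdin is an assumption about how frames reach ffmpeg.

```python
import shutil
from pathlib import Path
from typing import Callable


def cached_video_path(dataset_root: Path, camera: str, episode_index: int) -> Path:
    # Assumes NNNNNN means a six-digit, zero-padded episode index.
    return dataset_root / "meta" / "videos" / camera / f"episode_{episode_index:06d}.mp4"


def build_ffmpeg_command(width: int, height: int, fps: int, output: Path) -> list[str]:
    # Raw RGB frames are read from stdin and encoded as H.264 with the
    # ultrafast preset and yuv420p pixel format named in the PR description.
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "rgb24",
        "-s", f"{width}x{height}", "-r", str(fps),
        "-i", "-",
        "-c:v", "libx264", "-preset", "ultrafast",
        "-pix_fmt", "yuv420p",
        str(output),
    ]


def pick_backend() -> str:
    # Prefer ffmpeg when it is on PATH, otherwise fall back to OpenCV.
    return "ffmpeg" if shutil.which("ffmpeg") else "opencv"


def get_or_generate_video(
    dataset_root: Path,
    camera: str,
    episode_index: int,
    generate: Callable[[Path], None],
) -> Path:
    # Cache-first lookup: only call the (synchronous) generator on a miss.
    path = cached_video_path(dataset_root, camera, episode_index)
    if path.is_file():
        return path
    path.parent.mkdir(parents=True, exist_ok=True)
    generate(path)
    return path
```

One practical detail worth keeping in mind with these settings: libx264 with `yuv420p` requires even frame width and height, so odd-sized camera images would need padding or scaling before encoding.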

Blob Storage HDF5 Integration

Extended the blob dataset provider to discover, sync, and serve HDF5 datasets alongside existing LeRobot support.

  • Added scan_all_dataset_ids() for single-pass discovery returning both LeRobot and HDF5 dataset types
  • Implemented sync_hdf5_dataset_to_local() for metadata and placeholder sync, and sync_hdf5_episode_to_local() for on-demand episode downloads
  • Enhanced video path resolution with _build_video_path_candidates() supporting template-based, index-based, and fallback scan strategies
  • Switched from list_blobs() to list_blob_names() for faster blob enumeration
  • Added upload_video() for persisting locally generated videos back to blob storage
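
As a rough illustration of the three resolution strategies listed above (template-based, index-based, fallback scan), the ordering could look like the sketch below. The function name `candidate_video_paths`, the template placeholders, and the matching rule used in the fallback scan are all assumptions for illustration; the actual `_build_video_path_candidates()` may differ.

```python
from typing import Iterable, Optional


def candidate_video_paths(
    camera: str,
    episode_index: int,
    blob_names: Iterable[str],
    template: Optional[str] = None,
) -> list[str]:
    # Assumed file naming, mirroring the cache layout from the PR description.
    filename = f"episode_{episode_index:06d}.mp4"
    candidates: list[str] = []

    # 1. Template-based: a format string, if the dataset metadata provides one.
    if template:
        candidates.append(template.format(camera=camera, episode_index=episode_index))

    # 2. Index-based: the conventional location derived from the episode index.
    candidates.append(f"meta/videos/{camera}/{filename}")

    # 3. Fallback scan: any listed blob under a matching camera folder.
    candidates.extend(
        name
        for name in blob_names
        if name.endswith("/" + filename) and f"/{camera}/" in "/" + name
    )

    # Drop duplicates while keeping priority order.
    return list(dict.fromkeys(candidates))
```

A caller would then try each candidate in order and stop at the first one that exists in storage.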

Nested Dataset ID Support

Enabled hierarchical dataset organization using -- separators (e.g., project--recordings--session_1) mapping to /-separated paths in storage.

  • Added paths.py utility with dataset_id_to_blob_prefix() centralized across storage adapters
  • Updated _scan_directory() with recursive traversal and _validate_dataset_id() enforcing max 5 segments
  • Refactored dataset grouping from split("--", 1)[0] to "--".join(split("--")[:-1]) for correct multi-level grouping
  • Updated DataviewerShellHeader.tsx to display group keys with forward slashes instead of dashes
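
The ID-to-path mapping and the grouping change above can be illustrated with a short sketch. The function names here are hypothetical stand-ins for `dataset_id_to_blob_prefix()` and `_validate_dataset_id()`, and the per-segment checks beyond the 5-segment limit are assumptions, not confirmed behavior.

```python
SEPARATOR = "--"
MAX_SEGMENTS = 5


def validate_dataset_id(dataset_id: str) -> list[str]:
    # Split on "--" and enforce the maximum nesting depth from the PR.
    segments = dataset_id.split(SEPARATOR)
    if len(segments) > MAX_SEGMENTS:
        raise ValueError(f"dataset ID has more than {MAX_SEGMENTS} segments")
    # Assumed extra checks: no empty, dot, or slash-containing segments.
    for segment in segments:
        if not segment or segment in {".", ".."} or "/" in segment or "\\" in segment:
            raise ValueError(f"invalid dataset ID segment: {segment!r}")
    return segments


def dataset_id_to_prefix(dataset_id: str) -> str:
    # "project--recordings--session_1" -> "project/recordings/session_1"
    return "/".join(validate_dataset_id(dataset_id))


def old_group_key(dataset_id: str) -> str:
    # Previous behavior: only the first segment was used as the group.
    return dataset_id.split(SEPARATOR, 1)[0]


def new_group_key(dataset_id: str) -> str:
    # New behavior: everything except the last segment forms the group.
    return SEPARATOR.join(dataset_id.split(SEPARATOR)[:-1])
```

For `project--recordings--session_1`, the old expression yields `project` while the new one yields `project--recordings`, which the header can then display as `project/recordings`. Note that for an ID with no separator the new expression yields an empty string; how ungrouped IDs are handled is not stated in the PR description.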

Label Storage Abstraction

Introduced a LabelStorage protocol in labels.py with LocalLabelStorage and BlobLabelStorage implementations, enabling label persistence in both local filesystem and Azure Blob Storage backends. Both implementations include path traversal protection via realpath() + startswith() validation.
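
A minimal sketch of the `realpath()` + `startswith()` check for the local filesystem case, assuming a hypothetical helper named `resolve_within`. One detail the sketch makes explicit is the trailing separator on the base path: a bare `startswith(base)` would also accept a sibling directory such as `/data/labels_other` when the base is `/data/labels`. Whether the PR's implementation handles this the same way is not stated.

```python
import os


def resolve_within(base_dir: str, relative_path: str) -> str:
    # Resolve symlinks and ".." components before comparing prefixes.
    base = os.path.realpath(base_dir)
    candidate = os.path.realpath(os.path.join(base, relative_path))
    # Compare against base + separator so sibling directories that merely
    # share the base name as a prefix are rejected.
    if candidate != base and not candidate.startswith(base + os.sep):
        raise ValueError(f"path escapes storage root: {relative_path!r}")
    return candidate
```

With this check, `resolve_within(root, "ds/labels.json")` returns a path under `root`, while `resolve_within(root, "../secrets")` or an absolute path outside `root` raises `ValueError`.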

Frontend Playback Simplification

Removed the persistent video frame caching system and simplified the annotation workspace playback architecture.

  • Deleted useVideoFrameCache.ts (165 lines) and its test suite (244 lines)
  • Removed RVFC-based canvas rendering, displayCanvasRef, and cache state management from useAnnotationWorkspaceVideoSync.ts (~150 lines)
  • Replaced canvas-based video rendering with direct HTML5 <video> element in AnnotationWorkspacePlaybackCard.tsx
  • Added 200ms debounced loading overlay to prevent flicker on quick video loads

Configuration and Environment

  • Updated .gitignore dataset path from src/dataviewer/datasets/ to root-level datasets/
  • Added .env.azure.example template for Azure Blob Storage development setup
  • Enhanced .env.example with Azure CLI prerequisites and permission documentation
  • Added HMI_DATA_PATH environment variable to the backend dev script in package.json
  • Added Azure SDK logging suppression and lifespan-based temp directory cleanup in main.py
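
The last item can be sketched as below, assuming an ASGI framework with a FastAPI/Starlette-style `lifespan` hook. The function names, the temp directory prefix, and the choice of `WARNING` as the log level are assumptions; only the general pattern (quiet the Azure SDK loggers, clean up a temp directory on shutdown) comes from the PR description.

```python
import logging
import shutil
import tempfile
from contextlib import asynccontextmanager
from pathlib import Path


def quiet_azure_logging(level: int = logging.WARNING) -> None:
    # The Azure SDK logs HTTP request/response details under these loggers.
    for name in ("azure", "azure.core.pipeline.policies.http_logging_policy"):
        logging.getLogger(name).setLevel(level)


@asynccontextmanager
async def lifespan(app):
    # Startup: create a scratch directory for synced or generated files.
    temp_dir = Path(tempfile.mkdtemp(prefix="dataviewer-"))
    try:
        # The yielded mapping is exposed as lifespan state by Starlette-style apps.
        yield {"temp_dir": temp_dir}
    finally:
        # Shutdown: remove the scratch directory even if the app failed.
        shutil.rmtree(temp_dir, ignore_errors=True)
```

In a FastAPI application this would typically be wired in with `FastAPI(lifespan=lifespan)`, with `quiet_azure_logging()` called once at import or startup time.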

Type of Change

  • [ ] 🐛 Bug fix (non-breaking change fixing an issue)
  • [x] ✨ New feature (non-breaking change adding functionality)
  • [x] 💥 Breaking change (fix or feature causing existing functionality to change)
  • [ ] 📚 Documentation update
  • [ ] 🏗️ Infrastructure change (Terraform/IaC)
  • [x] ♻️ Refactoring (no functional changes)

Component(s) Affected

  • [ ] deploy/000-prerequisites - Azure subscription setup
  • [ ] deploy/001-iac - Terraform infrastructure
  • [ ] deploy/002-setup - OSMO control plane / Helm
  • [ ] deploy/004-workflow - Training workflows
  • [ ] src/training - Python training scripts
  • [ ] docs/ - Documentation

Testing Performed

  • [ ] Terraform plan reviewed (no unexpected changes)
  • [ ] Terraform apply tested in dev environment
  • [ ] Training scripts tested locally with Isaac Sim
  • [ ] OSMO workflow submitted successfully
  • [ ] Smoke tests passed (smoke_test_azure.py)

Documentation Impact

  • [x] No documentation changes needed
  • [ ] Documentation updated in this PR
  • [ ] Documentation issue filed

Bug Fix Checklist

Complete this section for bug fix PRs. Skip for other contribution types.

  • [ ] Linked to issue being fixed
  • [ ] Regression test included, OR
  • [ ] Justification for no regression test:

Checklist

…lob storage handling

- add path utilities for converting dataset IDs to blob prefixes
- enhance dataset service to manage nested datasets up to 5 levels
- update local storage adapter to resolve nested paths correctly
- add tests for nested dataset discovery and path resolution

🗂️ - Generated by Copilot
- add tests for video loading overlay behavior before and after 200ms delay
- remove unused displayCanvasRef from playback card tests
- refactor media controller to remove video frame cache usage
- clean up video sync tests by removing frame cache related assertions
- delete obsolete video frame cache test file

🎥 - Generated by Copilot
- add support for on_generated callbacks in video generation queue
- implement safe video path checks to prevent path traversal
- update Dockerfile to include ffmpeg for HDF5 video generation
- suppress verbose Azure SDK logging in main application

🔒 - Generated by Copilot
…ed queue

- remove video prefetch scheduling and related methods
- streamline video path retrieval and generation
- enhance HDF5 dataset syncing to include cached videos

🔧 - Generated by Copilot
agreaves-ms requested a review from a team as a code owner on March 12, 2026 19:34
@github-actions
Contributor

github-actions Bot commented Mar 12, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@codecov-commenter

codecov-commenter commented Mar 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 37.47%. Comparing base (745321e) to head (a0e2e1c).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #180   +/-   ##
=======================================
  Coverage   37.47%   37.47%           
=======================================
  Files          43       43           
  Lines        6135     6135           
  Branches      497      497           
=======================================
  Hits         2299     2299           
  Misses       3826     3826           
  Partials       10       10           
Flag     Coverage Δ
pester   84.80% <ø> (ø)
pytest    6.89% <ø> (ø)

Comment thread src/dataviewer/backend/src/api/routers/labels.py Fixed
Comment thread src/dataviewer/backend/src/api/routers/labels.py Fixed
Comment thread src/dataviewer/backend/src/api/main.py Fixed
Comment thread src/dataviewer/backend/src/api/routers/labels.py Fixed
Comment thread src/dataviewer/backend/src/api/routers/labels.py Fixed
Comment thread src/dataviewer/backend/src/api/services/dataset_service/service.py Fixed
Comment thread src/dataviewer/backend/src/api/storage/blob_dataset.py Fixed
Comment thread src/dataviewer/backend/src/api/storage/blob_dataset.py Fixed
- replace newline characters in dataset_id and camera name
- ensure proper validation of path parameters

🔒 - Generated by Copilot
Comment thread src/dataviewer/backend/src/api/services/detection_service.py Fixed
Comment thread src/dataviewer/backend/src/api/services/detection_service.py Fixed
…aram for dataset_id validation

refactor(export): update dataset_id validation to use path_string_param

refactor(detection): replace validated_dataset_id with path_string_param for dataset_id validation

refactor(joint_config): use path_string_param for dataset_id validation

refactor(labels): replace validated_dataset_id with path_string_param for dataset_id validation

refactor(detection_service): remove unnecessary string conversion for model_name and confidence

test(validation): add tests for path_string_param and query_csv_ints_param

test(api): add tests for request model sanitization and query parameters

🔧 - Generated by Copilot
Comment thread src/dataviewer/backend/src/api/services/detection_service.py Fixed
…on processing

- remove unnecessary float conversion for frame indices
- ensure logging displays frame indices as integers

🔧 - Generated by Copilot
- sanitize dataset_id in various logging statements to remove newlines and carriage returns
- ensure consistent logging format across dataset operations and detection service

🔒 - Generated by Copilot
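
The log sanitization described in these commits guards against log injection, where a user-supplied value containing line breaks could forge extra log entries. A minimal sketch, assuming a hypothetical helper named `sanitize_for_log` (the PR's actual helper and call sites are not shown here):

```python
import logging

logger = logging.getLogger("dataviewer.example")


def sanitize_for_log(value: object) -> str:
    # Strip carriage returns and newlines so a value cannot start a new log line.
    return str(value).replace("\r", "").replace("\n", "")


def log_dataset_access(dataset_id: str, camera: str) -> None:
    # Sanitize user-controlled values at the logging call site.
    logger.info(
        "Loading dataset %s (camera %s)",
        sanitize_for_log(dataset_id),
        sanitize_for_log(camera),
    )
```

Sanitizing only at the logging boundary keeps the original value intact for validation, which still has to reject malformed IDs separately.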
agreaves-ms merged commit f1b6139 into main on Mar 12, 2026
19 checks passed
agreaves-ms deleted the feat/dataviewer-hdf5-fixes branch on March 12, 2026 23:03
akhanattentive pushed a commit that referenced this pull request Mar 12, 2026
…ort (#180)

# feat(dataviewer): enhance HDF5 video handling and nested dataset support

This PR strengthens the dataviewer's HDF5 capabilities across video
generation, blob storage integration, and dataset organization. It
introduces on-demand video generation from HDF5 image data, extends
Azure Blob Storage support to HDF5 datasets, enables hierarchical
dataset IDs up to 5 levels deep, and simplifies the frontend video
playback architecture by removing the persistent frame cache in favor of
native HTML5 video rendering.

> The frame caching system served its purpose during initial development
but added complexity that wasn't justified as HDF5 video generation
matured. Native `<video>` element playback is simpler and more reliable
for the generated MP4 files.

## Description

### HDF5 Video Generation

The HDF5 format handler gained full video generation capabilities with
an on-demand caching strategy. Videos are generated using **ffmpeg**
(H.264, `ultrafast` preset, `yuv420p` pixel format) with an **OpenCV
fallback** when ffmpeg is unavailable. Generated videos are cached at
`meta/videos/{camera}/episode_{NNNNNN}.mp4` and uploaded to blob storage
when configured.

- Added `_generate_video()` and `_generate_video_cv2()` to
*hdf5_handler.py* with graceful degradation between backends
- Implemented `get_video_path()` with cache-first lookup and synchronous
on-demand generation
- Added `load_single_frame()` in *hdf5_loader.py* using h5py slice
indexing for memory-efficient frame extraction
- Installed **ffmpeg** system dependency in the backend *Dockerfile*

### Blob Storage HDF5 Integration

Extended the blob dataset provider to discover, sync, and serve HDF5
datasets alongside existing LeRobot support.

- Added `scan_all_dataset_ids()` for single-pass discovery returning
both LeRobot and HDF5 dataset types
- Implemented `sync_hdf5_dataset_to_local()` for metadata and
placeholder sync, and `sync_hdf5_episode_to_local()` for on-demand
episode downloads
- Enhanced video path resolution with `_build_video_path_candidates()`
supporting template-based, index-based, and fallback scan strategies
- Switched from `list_blobs()` to `list_blob_names()` for faster blob
enumeration
- Added `upload_video()` for persisting locally generated videos back to
blob storage

### Nested Dataset ID Support

Enabled hierarchical dataset organization using `--` separators (e.g.,
`project--recordings--session_1`) mapping to `/`-separated paths in
storage.

- Added *paths.py* utility with `dataset_id_to_blob_prefix()`
centralized across storage adapters
- Updated `_scan_directory()` with recursive traversal and
`_validate_dataset_id()` enforcing max 5 segments
- Refactored dataset grouping from `split("--", 1)[0]` to
`"--".join(split("--")[:-1])` for correct multi-level grouping
- Updated *DataviewerShellHeader.tsx* to display group keys with forward
slashes instead of dashes

### Label Storage Abstraction

Introduced a `LabelStorage` protocol in *labels.py* with
**LocalLabelStorage** and **BlobLabelStorage** implementations, enabling
label persistence in both local filesystem and Azure Blob Storage
backends. Both implementations include path traversal protection via
`realpath()` + `startswith()` validation.

### Frontend Playback Simplification

Removed the persistent video frame caching system and simplified the
annotation workspace playback architecture.

- Deleted *useVideoFrameCache.ts* (165 lines) and its test suite (244
lines)
- Removed RVFC-based canvas rendering, `displayCanvasRef`, and cache
state management from *useAnnotationWorkspaceVideoSync.ts* (~150 lines)
- Replaced canvas-based video rendering with direct HTML5 `<video>`
element in *AnnotationWorkspacePlaybackCard.tsx*
- Added 200ms debounced loading overlay to prevent flicker on quick
video loads

### Configuration and Environment

- Updated *.gitignore* dataset path from `src/dataviewer/datasets/` to
root-level `datasets/`
- Added *.env.azure.example* template for Azure Blob Storage development
setup
- Enhanced *.env.example* with Azure CLI prerequisites and permission
documentation
- Added `HMI_DATA_PATH` environment variable to the backend dev script
in *package.json*
- Added Azure SDK logging suppression and lifespan-based temp directory
cleanup in *main.py*

## Type of Change

- [ ] 🐛 Bug fix (non-breaking change fixing an issue)
- [x] ✨ New feature (non-breaking change adding functionality)
- [x] 💥 Breaking change (fix or feature causing existing functionality
to change)
- [ ] 📚 Documentation update
- [ ] 🏗️ Infrastructure change (Terraform/IaC)
- [x] ♻️ Refactoring (no functional changes)

## Component(s) Affected

- [ ] `deploy/000-prerequisites` - Azure subscription setup
- [ ] `deploy/001-iac` - Terraform infrastructure
- [ ] `deploy/002-setup` - OSMO control plane / Helm
- [ ] `deploy/004-workflow` - Training workflows
- [ ] `src/training` - Python training scripts
- [ ] `docs/` - Documentation

## Testing Performed

- [ ] Terraform `plan` reviewed (no unexpected changes)
- [ ] Terraform `apply` tested in dev environment
- [ ] Training scripts tested locally with Isaac Sim
- [ ] OSMO workflow submitted successfully
- [ ] Smoke tests passed (`smoke_test_azure.py`)

## Documentation Impact

- [x] No documentation changes needed
- [ ] Documentation updated in this PR
- [ ] Documentation issue filed

## Bug Fix Checklist

*Complete this section for bug fix PRs. Skip for other contribution
types.*

- [ ] Linked to issue being fixed
- [ ] Regression test included, OR
- [ ] Justification for no regression test:

## Checklist

- [x] My code follows the [project conventions](copilot-instructions.md)
- [x] Commit messages follow [conventional commit
format](instructions/commit-message.instructions.md)
- [x] I have performed a self-review
- [x] Documentation impact assessed above
- [x] No new linting warnings introduced
akhanattentive pushed a commit that referenced this pull request Mar 16, 2026
…ort (#180)

akhanattentive pushed a commit that referenced this pull request Mar 16, 2026
…ort (#180)

akhanattentive pushed a commit that referenced this pull request Mar 16, 2026
…ort (#180)

4 participants