Initial support for video #1151

Merged
merged 22 commits into from
Sep 21, 2023

@pzelasko pzelasko commented Sep 16, 2023

This PR implements some features related to video dataloading support.

TL;DR

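from lhotse import Recording

# Load a whole video file: returns the video frames and the matching audio.
recording = Recording.from_file("example.mp4")
video, audio = recording.load_video()

# Cuts work the same way: load only the first second of video + audio.
cut = recording.to_cut().truncate(duration=1.0)
video, audio = cut.load_video()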

More or less detailed list of changes:

  • Recording
    • Recording.from_file() supports video files (so does lhotse.info())
    • Recording.load_video() method that loads video + audio (and keeps them in sync duration-wise)
    • the video tensor has shape (num_frames, color, height, width), dtype uint8, and RGB channel order; we can change this later if another layout makes more sense
    • support only a single video stream for now (no multi-video files)
    • metadata about videos is in recording.video
    • check if recording.has_video
    • supports dynamic video resolution scaling with recording = recording.with_video_resolution()
    • loading video is not supported for recordings with speed or tempo perturbation applied
  • Cut
    • Supports all cut types (Mono, Multi, Mixed)
    • Mixed video cuts only support padding and appending with other video cuts, not mixing (mixing an audio-only cut into a video cut is OK, give or take some edge cases)
    • cut.has_video and cut.video (metadata)
  • Dataloading
    • collate_video method that pads with black frames and packs video examples into a 5-D tensor (a batch dimension on top of the four video dimensions)
    • UnsupervisedAudioVideoDataset as a basic example for working with videos, returns video + audio mini-batches
  • First supported audio-video corpus
    • Grid audio-visual speech corpus
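
To illustrate what "keeps them in sync duration-wise" can mean in practice, here is a minimal sketch that trims both streams to a shared duration aligned to whole video frames. It is not the code from this PR: the function name `sync_lengths` and the trimming policy (cut to the shorter stream, then round the audio to the resulting frame boundary) are assumptions made for illustration only.

```python
import math
from typing import Tuple


def sync_lengths(
    num_frames: int, fps: float, num_samples: int, sampling_rate: int
) -> Tuple[int, int, float]:
    """Hypothetical helper: return (frames, samples, duration) after trimming
    video and audio to a common duration that ends on a video frame boundary."""
    # The shared duration is bounded by the shorter of the two streams.
    shared = min(num_frames / fps, num_samples / sampling_rate)
    # Keep only whole video frames; the epsilon absorbs floating point error.
    keep_frames = min(num_frames, math.floor(shared * fps + 1e-6))
    duration = keep_frames / fps
    # Trim the audio to the same duration.
    keep_samples = min(num_samples, round(duration * sampling_rate))
    return keep_frames, keep_samples, duration


# Audio is 10 ms longer than the video: the extra audio samples are dropped.
frames, samples, duration = sync_lengths(75, 25.0, 48160, 16000)
```

Aligning to the video frame grid is one possible choice; trimming to the audio sample grid instead would leave a partial video frame unaccounted for.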

Audio-only workflows are practically unaffected by these changes. Loading videos through a torch DataLoader generally works without issues. That said, the current code likely won't handle every video format, and some standard Lhotse operations may not support video data yet. It also only works with recent PyTorch versions (the video test is skipped for PyTorch < 2.0).
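
For readers who want a feel for the black-frame padding mentioned in the dataloading bullet, below is a rough, self-contained sketch of the idea. It is not the `collate_video` implementation from this PR: the name `collate_video_sketch`, its signature, and the choice to pad on the right and return the original lengths are all assumptions. It only relies on the (num_frames, color, height, width) uint8 layout described above.

```python
from typing import List, Tuple

import torch


def collate_video_sketch(videos: List[torch.Tensor]) -> Tuple[torch.Tensor, torch.Tensor]:
    """Hypothetical helper: pad (T, C, H, W) uint8 videos with black frames
    and stack them into one (B, T_max, C, H, W) tensor."""
    assert len(videos) > 0, "need at least one video"
    c, h, w = videos[0].shape[1:]
    for v in videos:
        # All examples must share dtype, channel count, and resolution.
        assert v.dtype == torch.uint8 and v.shape[1:] == (c, h, w)
    lens = torch.tensor([v.shape[0] for v in videos], dtype=torch.int32)
    # Zeros are black frames in RGB uint8, so the padding is already in place.
    batch = torch.zeros((len(videos), int(lens.max()), c, h, w), dtype=torch.uint8)
    for i, v in enumerate(videos):
        batch[i, : v.shape[0]] = v
    return batch, lens


# Two dummy clips of 3 and 5 frames at 4x5 resolution.
short_clip = torch.full((3, 3, 4, 5), 255, dtype=torch.uint8)
long_clip = torch.full((5, 3, 4, 5), 128, dtype=torch.uint8)
batch, lens = collate_video_sketch([short_clip, long_clip])
```

Returning the unpadded lengths alongside the batch lets downstream code mask out the black frames; whether the real method exposes them in this form is not stated here.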

@pzelasko pzelasko added this to the v1.17 milestone Sep 16, 2023
@pzelasko pzelasko merged commit 7b60f86 into master Sep 21, 2023
@pzelasko pzelasko deleted the feature/video-support branch September 21, 2023 21:07
flyingleafe pushed a commit to flyingleafe/lhotse that referenced this pull request Oct 11, 2023
* Tutorial materials in main readme page

* Initial crude video support in AudioSource and Recording

* Add downsized test fixture video

* Support for loading video + audio at the same time

* Enforce consistent video and audio duration

* Support for changing video resolution

* Basic video support for most cut types

* Support for padded video MixedCuts

* Enforce audio duration and video duration to be consistent when creating Recording, solving appending/padding issues

* Add missing assertion

* Stricter tests for padding and appending video cuts

* Minimal set of utilities for PyTorch video dataloading

* Grid audio-visual speech corpus recipe + support videos with missing num frames in their header

* Skip video test for PyTorch < 2.0

* Fix issue with torchaudio.info usage