-
Notifications
You must be signed in to change notification settings - Fork 295
fix: fix youtube video reading #5126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This PR fixes broken YouTube video reading functionality in Daft by completely rewriting the approach from streaming to downloading. The original implementation attempted to stream YouTube videos directly by extracting URLs and using HTTP responses with PyAV, but this was fundamentally flawed because YouTube uses complex adaptive streaming protocols (m3u8/HLS) that can't be handled as simple HTTP streams.
The new implementation introduces a more robust download-first approach using yt-dlp
. A new context manager _open_youtube_file()
downloads YouTube videos to temporary files, processes them locally, then cleans up the temporary files afterward. The implementation includes a custom format selector that chooses the best matching video resolution based on the requested image_height
and image_width
parameters.
Additionally, the PR removes generic exception handling in the frame decoding loop that was previously masking real decoding errors, improving debuggability. The documentation has been updated with a working example that processes multiple YouTube URLs simultaneously and includes a comprehensive output table showing the expected DataFrame structure.
This change integrates with Daft's existing video processing pipeline by maintaining the same _read_video_frames()
interface while switching the underlying YouTube handling mechanism from streaming to temporary file processing.
Confidence score: 4/5
- This PR addresses a broken core functionality with a well-tested approach, making it relatively safe to merge
- Score reflects solid implementation but potential concerns around disk space usage and the removal of error handling
- Pay close attention to the video frame processing logic and temporary file cleanup in
daft/io/av/_read_video_frames.py
2 files reviewed, 4 comments
# will handle for us. We cannot reliable pass a file-like HTTP response | ||
# to PyAV; hence why this will download to a tempfile then open the file. | ||
|
||
temp_file = next(tempfile._get_candidate_names()) # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Using next(tempfile._get_candidate_names())
is risky as it only generates a filename without creating the file, potentially leading to race conditions or file conflicts
else: | ||
fp, fs, _ = _infer_filesystem(self.path, io_config=self.io_config) | ||
return fs.open_input_file(fp) | ||
|
||
@contextmanager | ||
def _open_youtube_file(self) -> Any: | ||
import yt_dlp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Import should be moved to the top of the file
Context Used: Rule - Import statements should be placed at the top of the file rather than inline within functions or methods. (link)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want lazy imports for this particular feature. Good try g.
try: | ||
with yt_dlp.YoutubeDL(params) as ydl: | ||
ydl.download([self.path]) | ||
yield open(temp_file, mode="rb") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: File handle is not closed explicitly in the context manager - should use with open()
or ensure proper cleanup
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
## Changes Made YouTube streaming was not working, and the example in the docs was broken. This is a dumber but more robust approach to downloading and processing youtube videos. ## Related Issues n/a ## Checklist - [x] Documented in API Docs (if applicable) - [x] Documented in User Guide (if applicable) - [x] If adding a new documentation page, doc is added to `docs/mkdocs.yml` navigation - [x] Documentation builds and is formatted properly (tag @/ccmao1130 for docs review) --------- Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Changes Made
YouTube streaming was not working, and the example in the docs was broken. This is a dumber but more robust approach to downloading and processing youtube videos.
Related Issues
n/a
Checklist
docs/mkdocs.yml
navigation