@rchowell rchowell commented Sep 3, 2025

Changes Made

YouTube streaming was not working, and the example in the docs was broken. This is a dumber but more robust approach to downloading and processing YouTube videos.

Related Issues

n/a

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

@github-actions github-actions bot added the fix label Sep 3, 2025
@greptile-apps greptile-apps bot left a comment

Greptile Summary

This PR fixes broken YouTube video reading in Daft by replacing the streaming approach with a download-first one. The original implementation extracted a media URL and handed the HTTP response to PyAV, which did not work because YouTube serves video through adaptive streaming protocols (m3u8/HLS) that cannot be treated as a single plain HTTP stream.

The new implementation introduces a more robust download-first approach using yt-dlp. A new context manager _open_youtube_file() downloads YouTube videos to temporary files, processes them locally, then cleans up the temporary files afterward. The implementation includes a custom format selector that chooses the best matching video resolution based on the requested image_height and image_width parameters.
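The format selection described above can be sketched roughly as follows. This is not the PR's code, which is not shown here: `pick_closest_format` and `make_format_selector` are hypothetical names, and the "closest resolution" rule is an assumption about what "best matching" means. The sketch relies on yt-dlp accepting a callable for its `format` option, which it calls with a context dict containing a `formats` list.

```python
from __future__ import annotations

from typing import Any, Callable, Iterator, Optional


def pick_closest_format(
    formats: list[dict[str, Any]],
    image_height: Optional[int] = None,
    image_width: Optional[int] = None,
) -> dict[str, Any]:
    """Return the video format whose resolution is closest to the request."""
    # Keep only entries that carry video and report a resolution.
    candidates = [
        f
        for f in formats
        if f.get("vcodec") not in (None, "none") and f.get("height") and f.get("width")
    ]
    if not candidates:
        raise ValueError("no video formats with a known resolution")

    # With no requested size, fall back to the largest frame area.
    if image_height is None and image_width is None:
        return max(candidates, key=lambda f: f["height"] * f["width"])

    def distance(f: dict[str, Any]) -> int:
        # Only the dimensions the caller asked for contribute to the distance.
        dh = abs(f["height"] - image_height) if image_height is not None else 0
        dw = abs(f["width"] - image_width) if image_width is not None else 0
        return dh + dw

    return min(candidates, key=distance)


def make_format_selector(
    image_height: Optional[int], image_width: Optional[int]
) -> Callable[[dict[str, Any]], Iterator[dict[str, Any]]]:
    """Build a callable suitable for yt-dlp's `format` option (assumed usage)."""

    def selector(ctx: dict[str, Any]) -> Iterator[dict[str, Any]]:
        yield pick_closest_format(ctx["formats"], image_height, image_width)

    return selector
```

Under these assumptions the selector would be passed as `params["format"] = make_format_selector(height, width)` before constructing `yt_dlp.YoutubeDL(params)`. Picking a format near the requested size keeps the temporary download small when frames will be resized anyway.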

Additionally, the PR removes generic exception handling in the frame decoding loop that was previously masking real decoding errors, improving debuggability. The documentation has been updated with a working example that processes multiple YouTube URLs simultaneously and includes a comprehensive output table showing the expected DataFrame structure.
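To illustrate why a generic `except` in a decode loop hides problems, here is a minimal sketch. It does not use PyAV and is not the loop from the PR; `decode` is a hypothetical stand-in for whatever turns a packet into a frame.

```python
from typing import Callable, Iterable, List


def decode_all_swallowing(packets: Iterable[str], decode: Callable[[str], str]) -> List[str]:
    """Old-style loop: any failure is silently skipped."""
    frames = []
    for packet in packets:
        try:
            frames.append(decode(packet))
        except Exception:
            # The caller only sees fewer frames, never the reason why.
            continue
    return frames


def decode_all_strict(packets: Iterable[str], decode: Callable[[str], str]) -> List[str]:
    """New-style loop: decoding errors propagate to the caller."""
    return [decode(packet) for packet in packets]


def fake_decode(packet: str) -> str:
    # Hypothetical decoder that fails on a corrupt packet.
    if packet == "corrupt":
        raise ValueError("cannot decode packet")
    return packet.upper()
```

With the packets `["a", "corrupt", "b"]`, the first function returns two frames and no error, while the second raises the `ValueError` so the failure is visible.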

This change integrates with Daft's existing video processing pipeline by maintaining the same _read_video_frames() interface while switching the underlying YouTube handling mechanism from streaming to temporary file processing.

Confidence score: 4/5

  • This PR fixes broken core functionality with a well-tested approach, making it relatively safe to merge
  • The score reflects a solid implementation, with some concern about disk space usage and the removal of error handling
  • Pay close attention to the video frame processing logic and temporary file cleanup in daft/io/av/_read_video_frames.py

2 files reviewed, 4 comments


# will handle for us. We cannot reliably pass a file-like HTTP response
# to PyAV, so this downloads to a tempfile and then opens the file.

temp_file = next(tempfile._get_candidate_names()) # type: ignore

logic: Using next(tempfile._get_candidate_names()) is risky as it only generates a filename without creating the file, potentially leading to race conditions or file conflicts
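One way to address this concern without the private `tempfile._get_candidate_names()` helper is to create a private temporary directory and place the download inside it. This is a sketch, not what was merged: `private_download_path` is a hypothetical helper, and it assumes the downloader only needs a path that does not exist yet (for yt-dlp, that path would presumably be passed through the `outtmpl` option).

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def private_download_path(filename: str = "video") -> Iterator[str]:
    """Yield a not-yet-existing file path inside a freshly created directory."""
    # TemporaryDirectory creates the directory atomically with a unique name,
    # so no other process can race us for the file name inside it.
    with tempfile.TemporaryDirectory(prefix="yt-") as tmp_dir:
        yield os.path.join(tmp_dir, filename)
    # On exit the directory and anything downloaded into it are removed.
```

The file itself is still created by the downloader, but the uniqueness guarantee now comes from the directory, which exists from the start.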

else:
fp, fs, _ = _infer_filesystem(self.path, io_config=self.io_config)
return fs.open_input_file(fp)

@contextmanager
def _open_youtube_file(self) -> Any:
    import yt_dlp

style: Import should be moved to the top of the file

Context Used: Rule - Import statements should be placed at the top of the file rather than inline within functions or methods. (link)

Contributor Author


We want lazy imports for this particular feature. Good try g.

try:
with yt_dlp.YoutubeDL(params) as ydl:
ydl.download([self.path])
yield open(temp_file, mode="rb")

logic: File handle is not closed explicitly in the context manager - should use with open() or ensure proper cleanup
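A sketch of what the suggested cleanup could look like: the handle is opened in a `with` block inside the generator, and the temporary file is removed in `finally`. `open_then_delete` is a hypothetical name, and the download step is left out so the example stays self-contained.

```python
import os
from contextlib import contextmanager, suppress
from typing import BinaryIO, Iterator


@contextmanager
def open_then_delete(path: str) -> Iterator[BinaryIO]:
    """Yield a binary handle for `path`, then close it and delete the file."""
    try:
        # The inner `with` closes the handle when the caller's block ends,
        # including when the caller raises.
        with open(path, "rb") as handle:
            yield handle
    finally:
        # Runs after the handle is closed; ignore a file that is already gone.
        with suppress(FileNotFoundError):
            os.remove(path)
```

Closing before deleting matters on platforms that refuse to remove a file that is still open.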

rchowell and others added 3 commits September 3, 2025 14:14
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@rchowell rchowell merged commit 6f2f360 into main Sep 3, 2025
29 checks passed
@rchowell rchowell deleted the rchowell/yt-fix branch September 3, 2025 23:03
venkateshdb pushed a commit to venkateshdb/Daft that referenced this pull request Sep 6, 2025