Revert "[Frontend] Remove librosa from audio dependency" (#37058) #37785
zhewenl wants to merge 1 commit into vllm-project:main
Code Review
This pull request reverts a previous change that removed librosa, which caused a CI failure. The revert correctly re-introduces librosa and removes resampy. It also includes several improvements to audio handling and tests. My review identifies a critical bug in the audio chunking logic that could lead to a TypeError and provides a suggestion to improve the robustness of the audio loading mechanism to better handle various video formats.
    do_split_audio = (
        self.asr_config.allow_audio_chunking
        and duration > self.asr_config.max_audio_clip_s
    )
The check that self.asr_config.max_audio_clip_s is not None has been removed. If self.asr_config.allow_audio_chunking is True and self.asr_config.max_audio_clip_s is None, the comparison duration > self.asr_config.max_audio_clip_s raises a TypeError, because Python 3 does not order a number against None. The check should be restored to prevent this crash.
Suggested change:
    - do_split_audio = (
    -     self.asr_config.allow_audio_chunking
    -     and duration > self.asr_config.max_audio_clip_s
    - )
    + do_split_audio = (
    +     self.asr_config.allow_audio_chunking
    +     and self.asr_config.max_audio_clip_s is not None
    +     and duration > self.asr_config.max_audio_clip_s
    + )
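The failure mode described in this comment can be reproduced in isolation. The sketch below is illustrative only: AsrConfig here is a hypothetical stand-in that models just the two fields named in the review, and the two helper functions are not part of the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AsrConfig:
    # Hypothetical stand-in for the real config object; only the two
    # fields discussed in the review comment are modeled.
    allow_audio_chunking: bool = True
    max_audio_clip_s: Optional[float] = None


def should_split_unguarded(cfg: AsrConfig, duration: float) -> bool:
    # Mirrors the reverted code: compares against max_audio_clip_s
    # without checking for None first.
    return cfg.allow_audio_chunking and duration > cfg.max_audio_clip_s


def should_split_guarded(cfg: AsrConfig, duration: float) -> bool:
    # Mirrors the suggested change: the None check short-circuits the
    # comparison, so no TypeError can be raised.
    return (
        cfg.allow_audio_chunking
        and cfg.max_audio_clip_s is not None
        and duration > cfg.max_audio_clip_s
    )
```

With chunking enabled and no clip limit set, the unguarded helper raises TypeError while the guarded one simply returns False.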
      def load_bytes(self, data: bytes) -> tuple[npt.NDArray, float]:
    -     return load_audio(BytesIO(data), sr=None)
    +     if is_video(data):
    +         return extract_audio_from_video_bytes(data)
    +     return librosa.load(BytesIO(data), sr=None)
The is_video() check is not exhaustive and might fail for some video formats (e.g., .mkv, .mov). This would cause librosa.load to be called on video data, which may fail. A more robust approach would be to use a try...except block to handle librosa failures and then fall back to extract_audio_from_video_bytes, similar to the logic in speech_to_text.py. This would make the audio loading more resilient to different media formats.
    def load_bytes(self, data: bytes) -> tuple[npt.NDArray, float]:
        try:
            return librosa.load(BytesIO(data), sr=None)
        except soundfile.LibsndfileError:
            # Fallback to pyav for video containers or other formats
            # that soundfile doesn't handle from a buffer.
            return extract_audio_from_video_bytes(data)
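The try-then-fall-back shape suggested above can be sketched without the real decoders. Everything in this block is an assumption made for illustration: DecodeError stands in for soundfile.LibsndfileError, and the two stub functions stand in for librosa.load and extract_audio_from_video_bytes. It shows only the control flow, not actual audio decoding.

```python
from io import BytesIO
from typing import Callable, List, Tuple

Audio = Tuple[List[float], float]  # (samples, sample rate)


class DecodeError(Exception):
    """Hypothetical stand-in for soundfile.LibsndfileError."""


def load_with_fallback(
    data: bytes,
    primary: Callable[[BytesIO], Audio],
    fallback: Callable[[bytes], Audio],
) -> Audio:
    # Try the audio decoder first; only a decode failure triggers the
    # fallback, so unrelated errors still propagate to the caller.
    try:
        return primary(BytesIO(data))
    except DecodeError:
        return fallback(data)


def stub_audio_loader(buf: BytesIO) -> Audio:
    # Stub: pretends that only RIFF/WAV payloads are decodable.
    if not buf.read(4) == b"RIFF":
        raise DecodeError("unrecognized audio format")
    return ([0.0, 0.0], 16000.0)


def stub_video_extractor(data: bytes) -> Audio:
    # Stub: pretends to pull an audio track out of a video container.
    return ([0.0], 44100.0)
```

Catching only the decoder's own exception type keeps the fallback narrow: a format that no magic-byte check anticipated still reaches the video path, while a genuine bug elsewhere is not silently swallowed.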
This can be closed I think :)
This pull request has merge conflicts that must be resolved before it can be
Revert of #37058
This reverts commit c7f98b4 (PR #37058).
Reason
CI build #57431 detected 1 new failure linked to this PR:
test_online_audio_in_video_interleaved fails with "Error opening <_io.BytesIO object>: (Garbled error message from libsndfile)" after librosa was removed from audio dependencies.
Auto-generated
This revert PR was auto-generated by the CI failure analyzer.