[Bugfix][Frontend] Fix audio transcription for MP4, M4A, and WebM formats#35109
Conversation
Code Review
The pull request introduces a robust solution for handling various audio formats (MP4, M4A, WebM) in the speech-to-text transcription endpoint. The core change involves a new _decode_audio_bytes_ffmpeg function that leverages os.memfd_create and ffmpeg for in-memory decoding, avoiding disk I/O and permission issues. This is integrated into a _load_audio_bytes helper that attempts a fast librosa.load path first, falling back to the ffmpeg method if necessary. This approach directly addresses the root cause of previous failures with container formats and improves the overall reliability of the audio transcription service. The changes are well-documented and include a comprehensive test plan and results, demonstrating the effectiveness of the fix.
| except Exception: | ||
| pass |
Catching a generic Exception is too broad and can mask unexpected errors, making debugging difficult. It's better to catch specific exceptions that librosa.load is known to raise when it fails to decode certain formats, such as soundfile.LibsndfileError or audioread.exceptions.NoBackendError if audioread were directly involved. If the exact exceptions are not known, consider logging the exception type and message before falling back, or catching a more specific base class if one exists for audio decoding failures.
| except Exception: | |
| pass | |
| except (soundfile.LibsndfileError, audioread.exceptions.NoBackendError) as e: | |
| logger.debug("Librosa BytesIO decode failed: %s", e) |
Pull request overview
Fixes the OpenAI-compatible speech-to-text preprocessing path so MP4/M4A/WebM container uploads can be decoded (via an ffmpeg fallback) instead of failing when librosa.load(BytesIO(...)) can’t infer the format.
Changes:
- Added _decode_audio_bytes_ffmpeg() to decode audio bytes to mono float32 PCM using ffmpeg with an in-memory FD.
- Added _load_audio_bytes() to try librosa.load(BytesIO(...)) first and fall back to ffmpeg on failure.
- Updated _preprocess_speech_to_text() to use _load_audio_bytes() instead of a direct BytesIO + librosa.load call.
| sr = int(sr) | ||
| fd = os.memfd_create("vllm_audio") | ||
| try: | ||
| os.write(fd, audio_data) |
os.write(fd, audio_data) is not guaranteed to write all bytes in one call. If it performs a partial write, ffmpeg will see a truncated container and decoding may fail or produce incorrect audio. Consider writing in a loop (or use os.fdopen(fd, 'wb', closefd=False) and .write()/.flush()) and verify the full length is written before running ffmpeg.
| os.write(fd, audio_data) | |
| # Ensure the full audio_data buffer is written to the memfd. | |
| total_written = 0 | |
| data_len = len(audio_data) | |
| while total_written < data_len: | |
| written = os.write(fd, audio_data[total_written:]) | |
| if written == 0: | |
| raise RuntimeError("Failed to write complete audio data to memfd") | |
| total_written += written |
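A minimal sketch of the write-until-done loop suggested above, assuming a plain POSIX file descriptor. The helper name write_all is hypothetical and not part of the PR; wrapping the payload in a memoryview is one way to avoid re-copying the remaining bytes on every iteration, which slicing a bytes object would do.

```python
import os


def write_all(fd: int, data: bytes) -> int:
    """Write every byte of `data` to `fd`, looping over partial writes.

    `write_all` is a hypothetical helper name used for illustration only.
    """
    view = memoryview(data)  # slicing a memoryview does not copy the payload
    total_written = 0
    while total_written < len(view):
        written = os.write(fd, view[total_written:])
        if written == 0:
            raise RuntimeError("Failed to write complete audio data to fd")
        total_written += written
    return total_written
```

The return value equals len(data) on success, so a caller can assert on it before handing the descriptor to a decoder.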
| result = subprocess.run( | ||
| cmd, | ||
| stdout=subprocess.PIPE, | ||
| stderr=subprocess.PIPE, | ||
| pass_fds=(fd,), # inherit only this fd | ||
| ) | ||
| finally: | ||
| os.close(fd) | ||
|
|
||
| if result.returncode != 0: | ||
| raise RuntimeError( | ||
| "ffmpeg failed to decode audio: " | ||
| + result.stderr.decode("utf-8", errors="replace").strip() | ||
| ) | ||
|
|
||
| y = np.frombuffer(result.stdout, dtype=np.float32) | ||
| if y.size == 0: | ||
| raise RuntimeError( |
_decode_audio_bytes_ffmpeg raises RuntimeError on decode failure / empty output, but _create_speech_to_text only catches ValueError from preprocessing. This means ffmpeg failures will likely bubble up as 500s instead of a structured OpenAI error response. Suggest raising ValueError (or VLLMValidationError) from this helper (possibly wrapping FileNotFoundError/ffmpeg stderr) or expanding the preprocessing except to handle these failure types explicitly.
| result = subprocess.run( | |
| cmd, | |
| stdout=subprocess.PIPE, | |
| stderr=subprocess.PIPE, | |
| pass_fds=(fd,), # inherit only this fd | |
| ) | |
| finally: | |
| os.close(fd) | |
| if result.returncode != 0: | |
| raise RuntimeError( | |
| "ffmpeg failed to decode audio: " | |
| + result.stderr.decode("utf-8", errors="replace").strip() | |
| ) | |
| y = np.frombuffer(result.stdout, dtype=np.float32) | |
| if y.size == 0: | |
| raise RuntimeError( | |
| try: | |
| result = subprocess.run( | |
| cmd, | |
| stdout=subprocess.PIPE, | |
| stderr=subprocess.PIPE, | |
| pass_fds=(fd,), # inherit only this fd | |
| ) | |
| except OSError as exc: | |
| # ffmpeg binary not found or failed to execute | |
| raise VLLMValidationError( | |
| f"ffmpeg invocation failed while decoding audio: {exc}" | |
| ) from exc | |
| finally: | |
| os.close(fd) | |
| if result.returncode != 0: | |
| raise VLLMValidationError( | |
| "ffmpeg failed to decode audio: " | |
| + result.stderr.decode("utf-8", errors="replace").strip() | |
| ) | |
| y = np.frombuffer(result.stdout, dtype=np.float32) | |
| if y.size == 0: | |
| raise VLLMValidationError( |
| sr = int(sr) | ||
| fd = os.memfd_create("vllm_audio") | ||
| try: | ||
| os.write(fd, audio_data) | ||
| os.lseek(fd, 0, os.SEEK_SET) | ||
|
|
||
| cmd = [ | ||
| "ffmpeg", | ||
| "-hide_banner", | ||
| "-loglevel", | ||
| "error", | ||
| "-i", | ||
| f"/proc/self/fd/{fd}", | ||
| "-vn", # discard video |
This ffmpeg path is Linux-specific (os.memfd_create and /proc/self/fd/<N>). vLLM supports macOS (see setup.py / CI smoke tests), where memfd_create and /proc are unavailable; MP4/M4A/WebM uploads would still fail there (likely with AttributeError). Consider adding an OS check and a portable fallback (e.g., NamedTemporaryFile(suffix=...) or SpooledTemporaryFile) with clear error messaging when neither option is available.
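One way the OS check and portable fallback could look, as a rough sketch rather than the PR's implementation. The helper name audio_bytes_as_path and its suffix parameter are hypothetical; the sketch assumes that a named temporary file is an acceptable fallback wherever os.memfd_create or /proc/self/fd is missing.

```python
import contextlib
import os
import tempfile
from collections.abc import Iterator


@contextlib.contextmanager
def audio_bytes_as_path(audio_data: bytes, suffix: str = "") -> Iterator[str]:
    """Yield a filesystem path from which `audio_data` can be read.

    Hypothetical helper: uses an anonymous in-memory file where
    os.memfd_create and /proc/self/fd exist (Linux), and a named
    temporary file everywhere else.
    """
    if hasattr(os, "memfd_create") and os.path.isdir("/proc/self/fd"):
        fd = os.memfd_create("vllm_audio")
        try:
            view = memoryview(audio_data)
            written = 0
            while written < len(view):  # os.write may write only part of the buffer
                written += os.write(fd, view[written:])
            yield f"/proc/self/fd/{fd}"
        finally:
            os.close(fd)
    else:
        # Portable fallback (e.g. macOS): touches disk, removed on exit.
        tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
        try:
            with tmp:
                tmp.write(audio_data)
            yield tmp.name
        finally:
            os.unlink(tmp.name)
```

On the memfd branch the yielded path is only meaningful to a child process if the descriptor is inherited under the same number, for example via pass_fds as in the hunk above.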
| try: | ||
| with io.BytesIO(audio_data) as buf: | ||
| return librosa.load(buf, sr=sr) # type: ignore[return-value] | ||
| except Exception: | ||
| pass | ||
|
|
The PR description mentions “narrower exception handling”, but _load_audio_bytes currently uses a broad except Exception: and silently discards the error. This both contradicts the description and makes it hard to diagnose why the fast path failed. Consider catching the expected decode exceptions and logging the exception details at DEBUG (with exc_info=True) before falling back.
| # Decode audio bytes. For container formats (MP4, M4A, WebM) that | ||
| # soundfile cannot detect from a BytesIO stream, _load_audio_bytes | ||
| # transparently falls back to ffmpeg via an in-memory fd. | ||
| # NOTE resample to model SR here for efficiency. This is also a | ||
| # pre-requisite for chunking, as it assumes Whisper SR. | ||
| y, sr = _load_audio_bytes(audio_data, sr=self.asr_config.sample_rate) |
This change introduces a new ffmpeg-based fallback path for container formats (MP4/M4A/WebM), but the existing transcription tests appear to cover only WAV-like inputs. Adding an automated test that exercises the fallback (and validates it returns audio of the expected duration) would prevent regressions and ensure CI covers the previously broken formats.
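A sketch of the kind of duration check such a test could make. Everything here is illustrative: make_wav_bytes, decode_wav_stdlib, and check_duration are hypothetical names, and a stdlib WAV decoder stands in for the PR's loader so the example runs without ffmpeg or audio fixtures. A real regression test would pass MP4/M4A/WebM fixture bytes and the actual loader to the same check.

```python
import io
import math
import struct
import wave
from typing import Callable, Sequence


def make_wav_bytes(duration_s: float = 1.0, sr: int = 16000) -> bytes:
    """Build a mono 16-bit sine-wave WAV in memory (test fixture stand-in)."""
    n = int(duration_s * sr)
    samples = (int(0.3 * 32767 * math.sin(2 * math.pi * 440.0 * i / sr)) for i in range(n))
    frames = struct.pack(f"<{n}h", *samples)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sr)
        w.writeframes(frames)
    return buf.getvalue()


def decode_wav_stdlib(audio_bytes: bytes) -> tuple[Sequence[int], int]:
    """Stand-in decoder returning (samples, sample_rate), like the loader under test."""
    with wave.open(io.BytesIO(audio_bytes), "rb") as w:
        sr = w.getframerate()
        n = w.getnframes()
        raw = w.readframes(n)
    return struct.unpack(f"<{n}h", raw), sr


def check_duration(
    decode: Callable[[bytes], tuple[Sequence[int], int]],
    audio_bytes: bytes,
    expected_s: float,
    tol_s: float = 0.05,
) -> float:
    """Decode `audio_bytes` and fail if the duration is off by more than `tol_s`."""
    samples, sr = decode(audio_bytes)
    actual_s = len(samples) / sr
    if abs(actual_s - expected_s) > tol_s:
        raise AssertionError(f"expected ~{expected_s}s of audio, got {actual_s}s")
    return actual_s
```

A tolerance is used because lossy codecs typically add or trim a few milliseconds of padding, so an exact sample-count match would be too strict for the container formats.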
Hi @seanmamasde, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
NickLucche
left a comment
Thanks a lot for the detailed breakdown and for contributing to vLLM @seanmamasde !
My only concern is the one I commented about, reporting it here to broaden discussion.
I am somewhat worried about the latency overhead we're introducing here by spawning a separate process at the API level to call ffmpeg. On one side, I understand that a generic fallback like this for all audio types can enhance flexibility. On the other, I wouldn't want to penalize vLLM's perceived latency for an operation that could be carried out in front of vLLM itself.
This may call for at least an optional flag that the user has to explicitly set to opt in and acknowledge the suboptimal conversion (i.e., make this feature optional).
Alternatively, we should consider whether an in-process conversion solution could be adopted here.
Finally, can you provide more info about the mp4 file used for testing (feel free to reach out on slack), so I can add them to our set?
cc @alex-jw-brooks may also be interested
| def _load_audio_bytes( | ||
| audio_data: bytes, | ||
| sr: int | float, | ||
| ) -> tuple[np.ndarray, int]: |
can we move these two new functions into a new utils.py file here in the same submodule?
| with io.BytesIO(audio_data) as buf: | ||
| return librosa.load(buf, sr=sr) # type: ignore[return-value] | ||
| except Exception: | ||
| pass |
we can move the rest of the code here instead of passing.
Also, could you check whether the libsndfile exception introduced in #34715 can be used in place of the generic Exception catch-all trap?
Done. now utils.py catches sf.LibsndfileError with exc.code in _BAD_SF_CODE
| result = subprocess.run( | ||
| cmd, | ||
| capture_output=True, | ||
| pass_fds=(fd,), # inherit only this fd | ||
| ) |
I am also somewhat worried about the latency overhead we're introducing here in spawning a separate process at the API level.
On one side, I understand that a generic fallback like this for all audio types can enhance flexibility.
On the other, I wouldn't want to penalize vLLM's perceived latency for an operation that could be carried out in front of vLLM itself.
This may call for at least an optional flag that the user has to explicitly set to opt in and acknowledge the suboptimal conversion (i.e., make this feature optional).
Alternatively, we should consider whether an in-process conversion solution could be adopted here.
it's now done via torchaudio.load, which is in-process, so I guess no flag needed?
Audio is generated from a short LibriSpeech (test-clean) speech clip (public domain / LibriVox-derived), downloaded as WAV, then trimmed/resampled to mono 16 kHz and converted to wav, flac, mp3, mpga, ogg, mp4, mpeg, m4a, and webm with ffmpeg. LibriSpeech sample mirror: https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav
alex-jw-brooks
left a comment
Thanks for this, looks good! Some small suggestions
| try: | ||
| import soundfile as sf | ||
| except ImportError: | ||
| sf = None # type: ignore[assignment] |
Could you refactor this to also use a PlaceholderModule for soundfile?
| try: | ||
| with io.BytesIO(audio_data) as buf: | ||
| return librosa.load(buf, sr=sr) # type: ignore[return-value] | ||
| except Exception as exc: |
I think it would be better to avoid catching the exception generically like this and handle it more explicitly - For example, if we use the soundfile placeholder, I think we can just catch soundfile.LibsndfileError, inspect the code, and add a debug log + return decode_audio_bytes_torchaudio(audio_data, sr) if it's a _BAD_SF_CODES?
Using the placeholder would also be clearer for failure cases here, because soundfile is an explicitly listed optional dep of vLLM for audio too, so it'll raise "Please install vllm[audio] for audio support" if soundfile.LibsndfileError is invalid because it's not installed.
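The control flow described in this comment could be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: FormatError is a hypothetical stand-in for soundfile.LibsndfileError (assumed to expose a numeric code attribute), the decoders are passed in as plain callables so the example runs without audio libraries, and the code set {1, 3, 4} is taken from the PR description.

```python
import logging
from typing import Any, Callable, Collection

logger = logging.getLogger(__name__)


class FormatError(Exception):
    """Hypothetical stand-in for soundfile.LibsndfileError (carries a numeric code)."""

    def __init__(self, code: int, message: str) -> None:
        super().__init__(message)
        self.code = code


# Codes treated as "format not detected"; values taken from the PR description.
BAD_FORMAT_CODES = frozenset({1, 3, 4})


def load_with_fallback(
    data: bytes,
    primary: Callable[[bytes], Any],
    fallback: Callable[[bytes], Any],
    bad_codes: Collection[int] = BAD_FORMAT_CODES,
) -> Any:
    """Try the fast decoder; fall back only on known format-detection errors."""
    try:
        return primary(data)
    except FormatError as exc:
        if exc.code not in bad_codes:
            raise  # unrelated failures keep their original traceback
        logger.debug(
            "Primary decoder could not detect the format (code %s); falling back",
            exc.code,
            exc_info=True,
        )
        return fallback(data)
```

Re-raising anything outside the known code set keeps genuine I/O or system errors visible instead of masking them behind a second decode attempt.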
Hi @alex-jw-brooks, I have made the changes you suggested. Can you take a look when you have time? Huge thanks.
alex-jw-brooks
left a comment
Nice, thanks! LGTM - @NickLucche will need to take one more look to merge I think :)
NickLucche
left a comment
Looks good, thanks for the work @seanmamasde @alex-jw-brooks !
Just a comment on where the torchcodec dependency should live (and also torchaudio imo, although that may be work for a separate PR)
requirements/cuda.txt
Outdated
| ray[cgraph]>=2.48.0 | ||
| torch==2.10.0 | ||
| torchaudio==2.10.0 | ||
| torchcodec==0.10.0 # Required by torchaudio>=2.9 for audio decoding (MP4/M4A/WebM) |
I am not sure why torchaudio appears in every requirements file but not in common.txt? @DarkLight1337
Regardless, I think we should add torchcodec to the vllm[audio] extras
Line 1054 in 755356b
yeah I should've just added it under the audio section. I removed all the occurrences in the requirements files and put it in setup.py -> audio[] instead. it's fixed now!
Head branch was pushed to by a user without write access
…mats
Add torchaudio-based fallback decoding for container formats that librosa/soundfile (libsndfile) cannot handle. When librosa.load() fails with a LibsndfileError on unsupported formats, fall back to torchaudio.load() which uses torchcodec/FFmpeg for decoding.
- Add utils.py with load_audio_bytes() and decode_audio_bytes_torchaudio()
- Narrow exception handling to catch sf.LibsndfileError specifically
- Use PlaceholderModule for soundfile import
- Add torchcodec to vllm[audio] extras in setup.py
Signed-off-by: seanmamasde <seanmamasde@gmail.com>
| "soundfile", | ||
| "mistral_common[audio]", | ||
| "av", | ||
| "torchcodec", |
I'm a bit worried that torchcodec will break audio support on GB200 + aarch64 CPU, because it only distributes x86_64 manylinux wheels (https://pypi.org/project/torchcodec/#files).
I opened #37061 to revert this PR and use pyav for video fallback instead.
I actually investigated this a bit back:
| lib | in-process? | mp4/m4a/webm using bytesio | new dep |
| ------------- | ----------- | ----------------------------------------------------- | ------- |
| ffmpeg-python | no | using pipe, but still subprocess | yes |
| pydub | no | using pipe, but still subprocess | yes |
| soundfile | yes | libsndfile doesn't support mp4/m4a/webm | no |
| PyAV (av) | yes | av.open(BytesIO(...)) should work | yes |
| torchaudio | yes | torchaudio.load(BytesIO(...), format=...) should work | no |
At the time of implementation, torchaudio seemed like the best bet since it doesn't introduce extra deps and is an in-process conversion (as opposed to tempfile, subprocess w/ ffmpeg). But it seems that starting with torchaudio v2.9.0 it uses torchcodec for torchaudio.save() and torchaudio.load().
…mats (vllm-project#35109) Signed-off-by: seanmamasde <seanmamasde@gmail.com> Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
…mats (vllm-project#35109) Signed-off-by: seanmamasde <seanmamasde@gmail.com>
Purpose
Fix /v1/audio/transcriptions (and /v1/audio/translations) to correctly handle MP4, M4A, and WebM audio uploads. These three container formats are listed as supported in both the OpenAI API specification and the vLLM documentation, yet they have been broken since the transcription endpoint was first introduced.
Fixes #16335
Fixes #26808
Fixes #18385
This should supersede #18477 (stale and only addressed WebM). This PR addresses all three broken formats and incorporates the reviewer feedback from #18477 (no tempfile, narrower exception handling, debug logging on the fallback path).
Cause
_preprocess_speech_to_text() wraps the uploaded bytes in a BytesIO and passes them to librosa.load(). Under the hood, librosa delegates to soundfile (libsndfile), which auto-detects the codec from the stream. This works for self-describing formats like WAV, FLAC, MP3, and OGG because their headers contain enough information for libsndfile to identify them.
MP4 (AAC), M4A (AAC), and WebM (Opus/Vorbis) are container formats: they use ISOBMFF or Matroska containers whose detection in libsndfile relies on a filename extension hint that BytesIO objects cannot provide. When libsndfile fails, librosa is supposed to fall back to audioread (which shells out to ffmpeg), but audioread also cannot handle BytesIO objects because ffmpeg needs a seekable file path.
The result is "Error opening <_io.BytesIO object>: Format not recognised.", shown as HTTP 500 (v0.13) or HTTP 200 with an error body (v0.15+).
Critically, librosa.load(filepath_string) works perfectly for all nine documented formats. The bug is exclusively in the BytesIO code path.
Changes
- load_audio_bytes() in vllm/entrypoints/openai/speech_to_text/utils.py first tries librosa.load(BytesIO(...)) (soundfile backend) and, on known libsndfile format detection failures (soundfile.LibsndfileError codes {1, 3, 4}), falls back to an in-process decode via torchaudio.load(BytesIO(...)) (torchcodec).
- The direct librosa.load call in _preprocess_speech_to_text() is replaced with a call to load_audio_bytes().
- Added torchcodec to vLLM requirements, since torchaudio>=2.9 uses it for decoding (optional in vllm[audio]).
Some more details
- Avoids spawning an ffmpeg subprocess at request time, as previous commits did (addressing the latency concern raised in review), while still supporting MP4/M4A/WebM container formats.
- Current code tries BytesIO first and only falls back on failure. If a future libsndfile version adds native MP4 support (support matrix), the fast path will automatically start working for those formats.
- The fallback path logs at DEBUG level, as suggested in the review of #18477 ([Bugfix][Frontend] support webm with audioread fallback).
- torchaudio is already a vLLM dependency. Since torchaudio>=2.9 uses TorchCodec under the hood for decoding, torchcodec is added as a dependency.
Tests
Tested all 9 formats documented by the OpenAI API against a live vLLM server running openai/whisper-large-v3-turbo on an NVIDIA A30.
Test audio: short LibriSpeech test-clean sample (public domain / LibriVox-derived), converted from a WAV source to all formats via ffmpeg.
Test Result
Before patch (baseline)
Tested on vLLM v0.13.0 and v0.15.1 (unpatched). The three container formats (mp4, m4a, webm) each failed with:
"Error opening <_io.BytesIO object>: Format not recognised."
After patch
All formats now pass. The 3 previously broken formats (mp4, m4a, webm) now work via the in-process torchaudio fallback. The other six formats continue to use the fast BytesIO/librosa path.