
[Bugfix][Frontend] Fix audio transcription for MP4, M4A, and WebM formats #35109

Merged
vllm-bot merged 1 commit into vllm-project:main from seanmamasde:fix/audio-transcription-mp4-m4a-webm
Mar 14, 2026

Conversation

@seanmamasde
Contributor

@seanmamasde seanmamasde commented Feb 23, 2026

Purpose

Fix /v1/audio/transcriptions (and /v1/audio/translations) to correctly handle MP4, M4A, and WebM audio uploads. These three container formats are listed as supported in both the OpenAI API specification and the vLLM documentation, yet they have been broken since the transcription endpoint was first introduced.

Fixes #16335
Fixes #26808
Fixes #18385

This should supersede #18477 (stale, and it only addressed WebM). This PR addresses all three broken formats and incorporates the reviewer feedback from #18477 (no tempfile, narrower exception handling, debug logging on the fallback path).

Cause

_preprocess_speech_to_text() wraps the uploaded bytes in a BytesIO and passes them to librosa.load(). Under the hood, librosa delegates to soundfile (libsndfile), which auto-detects the codec from the stream. This works for self-describing formats like WAV, FLAC, MP3, and OGG because their headers contain enough information for libsndfile to identify them.

MP4 (AAC) and M4A (AAC) use the ISOBMFF container, and WebM (Opus/Vorbis) uses the Matroska container. Detecting these containers in libsndfile relies on a filename extension hint, which a BytesIO object cannot provide. When libsndfile fails, librosa is supposed to fall back to audioread (which shells out to ffmpeg), but audioread cannot handle BytesIO objects either, because ffmpeg needs a seekable file path.

The result is "Error opening <_io.BytesIO object>: Format not recognised.", shown as HTTP 500 (v0.13) or HTTP 200 with an error body (v0.15+).

Critically, librosa.load(filepath_string) works perfectly for all nine documented formats. The bug is exclusively in the BytesIO code path.
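
As a side illustration of why the missing filename matters less than it seems: the container type can usually be recovered from the leading bytes of the upload. The sketch below is not part of this PR (the PR lets the decoders decide). The helper name `sniff_audio_container` is hypothetical, and the checks cover only the common magic numbers.

```python
from typing import Optional


def sniff_audio_container(data: bytes) -> Optional[str]:
    """Guess a container name from leading magic bytes (hypothetical helper)."""
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:4] == b"fLaC":
        return "flac"
    if data[:4] == b"OggS":
        return "ogg"
    # ISOBMFF (MP4/M4A): a 4-byte box size followed by the box type "ftyp".
    if len(data) >= 8 and data[4:8] == b"ftyp":
        return "mp4"
    # Matroska and WebM share the same EBML header magic.
    if data[:4] == b"\x1a\x45\xdf\xa3":
        return "webm"
    # MP3: an ID3v2 tag, or an MPEG audio frame sync (11 set bits).
    if data[:3] == b"ID3":
        return "mp3"
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "mp3"
    return None
```

A result like this could, for example, be used to choose a format hint for a decoder that accepts one; whether that is needed depends on the decoder.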

Changes

  • Added load_audio_bytes() in vllm/entrypoints/openai/speech_to_text/utils.py. It first tries librosa.load(BytesIO(...)) (soundfile backend) and, on known libsndfile format detection failures (soundfile.LibsndfileError codes {1, 3, 4}), falls back to an in-process decode via torchaudio.load(BytesIO(...)) (torchcodec).

  • The direct BytesIO + librosa.load() call in _preprocess_speech_to_text() is replaced with a call to load_audio_bytes().

  • Added torchcodec as an optional dependency under vllm[audio], since torchaudio>=2.9 uses it for decoding.
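
The control flow described in the list above can be sketched without any audio libraries. This is a minimal illustration, not the code merged in this PR: `DecodeError` is a hypothetical stand-in for soundfile.LibsndfileError, `load_with_fallback` and the toy decoders are invented names, and only the code set {1, 3, 4} comes from the description.

```python
import io
import logging
from typing import Callable, List, Tuple

logger = logging.getLogger(__name__)

# Error codes the PR description treats as format detection failures.
BAD_SF_CODES = {1, 3, 4}


class DecodeError(Exception):
    """Hypothetical stand-in for soundfile.LibsndfileError (carries a code)."""

    def __init__(self, code: int, message: str = "") -> None:
        super().__init__(message)
        self.code = code


Audio = Tuple[List[float], int]
Decoder = Callable[[io.BytesIO, int], Audio]


def load_with_fallback(
    data: bytes, sr: int, primary: Decoder, fallback: Decoder
) -> Audio:
    """Try the fast decoder first; fall back only on known detection failures."""
    try:
        with io.BytesIO(data) as buf:
            return primary(buf, sr)
    except DecodeError as exc:
        if exc.code not in BAD_SF_CODES:
            raise  # unrelated failure: do not mask it
        logger.debug("Primary decode failed (code %s); using fallback", exc.code)
    with io.BytesIO(data) as buf:
        return fallback(buf, sr)


# Toy decoders for demonstration only; they return placeholder samples.
def toy_primary(buf: io.BytesIO, sr: int) -> Audio:
    if buf.read(4) != b"RIFF":
        raise DecodeError(1, "Format not recognised.")
    return [0.0], sr


def toy_fallback(buf: io.BytesIO, sr: int) -> Audio:
    return [1.0], sr
```

Re-raising on codes outside the set keeps genuine I/O or system errors visible instead of routing them to the slower path.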

Some more details

  • Unlike earlier commits in this PR, the current code does not spawn an ffmpeg subprocess at request time (addressing the latency concern raised in review), while still supporting the MP4/M4A/WebM container formats.

  • Current code tries BytesIO first and only falls back on failure. If a future libsndfile version adds native MP4 support (support matrix), the fast path will automatically start working for those formats.

  • The fallback path logs at DEBUG level, as suggested in the review of #18477 ([Bugfix][Frontend] support webm with audioread fallback).

  • torchaudio is already a vLLM dependency. Since torchaudio>=2.9 uses TorchCodec under the hood for decoding, torchcodec is added as a dependency.

Test Plan

Tested all 9 formats documented by the OpenAI API against a live vLLM server running openai/whisper-large-v3-turbo on an NVIDIA A30.

Test audio: short LibriSpeech test-clean sample (public domain / LibriVox-derived), converted from a WAV source to all formats via ffmpeg.

# Server startup
python -m vllm.entrypoints.openai.api_server \
  --model openai/whisper-large-v3-turbo \
  --max-model-len 448 --dtype auto

# Test each format
for fmt in wav flac mp3 mpga ogg mp4 mpeg m4a webm; do
  curl -s -w "\n[%{http_code}]" \
    -F "file=@test.${fmt}" \
    -F "model=openai/whisper-large-v3-turbo" \
    http://localhost:8000/v1/audio/transcriptions
done

Test Result

Before patch (baseline)

Tested on vLLM v0.13.0 and v0.15.1 (unpatched):

| Format | v0.13.0 | v0.15.1             | Response                                                     |
| ------ | ------- | ------------------- | ------------------------------------------------------------ |
| wav    | 200 OK  | 200 OK              | Transcription text                                           |
| flac   | 200 OK  | 200 OK              | Transcription text                                           |
| mp3    | 200 OK  | 200 OK              | Transcription text                                           |
| mpga   | 200 OK  | 200 OK              | Transcription text                                           |
| ogg    | 200 OK  | 200 OK              | Transcription text                                           |
| mpeg   | 200 OK  | 200 OK              | Transcription text                                           |
| mp4    | 500     | 200 (error in body) | "Error opening <_io.BytesIO object>: Format not recognised." |
| m4a    | 500     | 200 (error in body) | "Error opening <_io.BytesIO object>: Format not recognised." |
| webm   | 500     | 200 (error in body) | "Error opening <_io.BytesIO object>: Format not recognised." |
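
Because v0.15.1 returns HTTP 200 with an error in the body, a test script that only checks the status code would report the broken formats as passing. A minimal sketch of a stricter check follows. It assumes the success payload is JSON with a non-empty `text` field (as in the OpenAI transcription response) and treats any payload with an `error` key as a failure; the exact shape of vLLM's error body is an assumption here, and `transcription_succeeded` is a hypothetical name.

```python
import json


def transcription_succeeded(status_code: int, body: str) -> bool:
    """Return True only for HTTP 200 responses that carry transcription text."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    # Assumption: error responses are JSON objects containing an "error" key.
    if not isinstance(payload, dict) or "error" in payload:
        return False
    text = payload.get("text")
    return isinstance(text, str) and bool(text.strip())
```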

After patch

All formats now pass. The 3 previously broken formats (mp4, m4a, webm) now work via the in-process torchaudio fallback. The other six formats continue to use the fast BytesIO/librosa path.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Copilot AI review requested due to automatic review settings February 23, 2026 15:43
@mergify mergify bot added frontend bug Something isn't working labels Feb 23, 2026
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request introduces a robust solution for handling various audio formats (MP4, M4A, WebM) in the speech-to-text transcription endpoint. The core change involves a new _decode_audio_bytes_ffmpeg function that leverages os.memfd_create and ffmpeg for in-memory decoding, avoiding disk I/O and permission issues. This is integrated into a _load_audio_bytes helper that attempts a fast librosa.load path first, falling back to the ffmpeg method if necessary. This approach directly addresses the root cause of previous failures with container formats and improves the overall reliability of the audio transcription service. The changes are well-documented and include a comprehensive test plan and results, demonstrating the effectiveness of the fix.

Comment on lines +158 to +159
except Exception:
pass
Contributor

high

Catching a generic Exception is too broad and can mask unexpected errors, making debugging difficult. It's better to catch specific exceptions that librosa.load is known to raise when it fails to decode certain formats, such as soundfile.LibsndfileError or audioread.exceptions.NoBackendError if audioread were directly involved. If the exact exceptions are not known, consider logging the exception type and message before falling back, or catching a more specific base class if one exists for audio decoding failures.

Suggested change
-     except Exception:
-         pass
+     except (soundfile.LibsndfileError, audioread.exceptions.NoBackendError) as e:
+         logger.debug("Librosa BytesIO decode failed: %s", e)

Contributor

Copilot AI left a comment

Pull request overview

Fixes the OpenAI-compatible speech-to-text preprocessing path so MP4/M4A/WebM container uploads can be decoded (via an ffmpeg fallback) instead of failing when librosa.load(BytesIO(...)) can’t infer the format.

Changes:

  • Added _decode_audio_bytes_ffmpeg() to decode audio bytes to mono float32 PCM using ffmpeg with an in-memory FD.
  • Added _load_audio_bytes() to try librosa.load(BytesIO(...)) first and fall back to ffmpeg on failure.
  • Updated _preprocess_speech_to_text() to use _load_audio_bytes() instead of a direct BytesIO + librosa.load call.


sr = int(sr)
fd = os.memfd_create("vllm_audio")
try:
os.write(fd, audio_data)
Copilot AI Feb 23, 2026

os.write(fd, audio_data) is not guaranteed to write all bytes in one call. If it performs a partial write, ffmpeg will see a truncated container and decoding may fail or produce incorrect audio. Consider writing in a loop (or use os.fdopen(fd, 'wb', closefd=False) and .write()/.flush()) and verify the full length is written before running ffmpeg.

Suggested change
- os.write(fd, audio_data)
+ # Ensure the full audio_data buffer is written to the memfd.
+ total_written = 0
+ data_len = len(audio_data)
+ while total_written < data_len:
+     written = os.write(fd, audio_data[total_written:])
+     if written == 0:
+         raise RuntimeError("Failed to write complete audio data to memfd")
+     total_written += written
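
A variant of the loop suggested above, as a self-contained sketch: it slices a memoryview so that retries after a partial write do not copy the remaining bytes. The helper name `write_all` is hypothetical, and the demonstration writes to a temporary file rather than a memfd so it runs on any platform.

```python
import os
import tempfile


def write_all(fd: int, data: bytes) -> None:
    """Write every byte of data to fd, retrying after partial writes."""
    view = memoryview(data)
    total = 0
    while total < len(view):
        written = os.write(fd, view[total:])
        if written == 0:
            raise RuntimeError("Failed to write complete data to file descriptor")
        total += written


# Demonstration: the file size should match the payload length afterwards.
with tempfile.TemporaryFile() as tmp:
    payload = b"\x00\x01" * 50_000
    write_all(tmp.fileno(), payload)
    size_on_disk = os.fstat(tmp.fileno()).st_size
```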

Comment on lines +118 to +135
result = subprocess.run(
cmd,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
pass_fds=(fd,), # inherit only this fd
)
finally:
os.close(fd)

if result.returncode != 0:
raise RuntimeError(
"ffmpeg failed to decode audio: "
+ result.stderr.decode("utf-8", errors="replace").strip()
)

y = np.frombuffer(result.stdout, dtype=np.float32)
if y.size == 0:
raise RuntimeError(
Copilot AI Feb 23, 2026

_decode_audio_bytes_ffmpeg raises RuntimeError on decode failure / empty output, but _create_speech_to_text only catches ValueError from preprocessing. This means ffmpeg failures will likely bubble up as 500s instead of a structured OpenAI error response. Suggest raising ValueError (or VLLMValidationError) from this helper (possibly wrapping FileNotFoundError/ffmpeg stderr) or expanding the preprocessing except to handle these failure types explicitly.

Suggested change
-     result = subprocess.run(
-         cmd,
-         stdout=subprocess.PIPE,
-         stderr=subprocess.PIPE,
-         pass_fds=(fd,),  # inherit only this fd
-     )
- finally:
-     os.close(fd)
- if result.returncode != 0:
-     raise RuntimeError(
-         "ffmpeg failed to decode audio: "
-         + result.stderr.decode("utf-8", errors="replace").strip()
-     )
- y = np.frombuffer(result.stdout, dtype=np.float32)
- if y.size == 0:
-     raise RuntimeError(
+     try:
+         result = subprocess.run(
+             cmd,
+             stdout=subprocess.PIPE,
+             stderr=subprocess.PIPE,
+             pass_fds=(fd,),  # inherit only this fd
+         )
+     except OSError as exc:
+         # ffmpeg binary not found or failed to execute
+         raise VLLMValidationError(
+             f"ffmpeg invocation failed while decoding audio: {exc}"
+         ) from exc
+ finally:
+     os.close(fd)
+ if result.returncode != 0:
+     raise VLLMValidationError(
+         "ffmpeg failed to decode audio: "
+         + result.stderr.decode("utf-8", errors="replace").strip()
+     )
+ y = np.frombuffer(result.stdout, dtype=np.float32)
+ if y.size == 0:
+     raise VLLMValidationError(

Comment on lines +96 to +109
sr = int(sr)
fd = os.memfd_create("vllm_audio")
try:
os.write(fd, audio_data)
os.lseek(fd, 0, os.SEEK_SET)

cmd = [
"ffmpeg",
"-hide_banner",
"-loglevel",
"error",
"-i",
f"/proc/self/fd/{fd}",
"-vn", # discard video
Copilot AI Feb 23, 2026

This ffmpeg path is Linux-specific (os.memfd_create and /proc/self/fd/<N>). vLLM supports macOS (see setup.py / CI smoke tests), where memfd_create and /proc are unavailable; MP4/M4A/WebM uploads would still fail there (likely with AttributeError). Consider adding an OS check and a portable fallback (e.g., NamedTemporaryFile(suffix=...) or SpooledTemporaryFile) with clear error messaging when neither option is available.
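
The OS check suggested here could look roughly like the sketch below. It is illustrative only (the final version of this PR dropped the ffmpeg path in favor of torchaudio), and `audio_bytes_as_path` is a hypothetical helper: it exposes the bytes through a memfd where available and otherwise through a named temporary file that is removed on exit.

```python
import contextlib
import os
import tempfile
from typing import Iterator


@contextlib.contextmanager
def audio_bytes_as_path(data: bytes, suffix: str = "") -> Iterator[str]:
    """Yield a filesystem path whose contents are data (hypothetical helper)."""
    if hasattr(os, "memfd_create") and os.path.isdir("/proc/self/fd"):
        # Linux: keep the bytes in memory. A child process would additionally
        # need the descriptor passed to it (e.g. via pass_fds).
        fd = os.memfd_create("audio_sketch")
        try:
            with os.fdopen(fd, "wb", closefd=False) as f:
                f.write(data)
            os.lseek(fd, 0, os.SEEK_SET)
            yield f"/proc/self/fd/{fd}"
        finally:
            os.close(fd)
    else:
        # Portable fallback: a named temp file, deleted when the block exits.
        tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
        try:
            with tmp:
                tmp.write(data)
            yield tmp.name
        finally:
            os.unlink(tmp.name)
```

The temporary-file branch touches disk, which is the trade-off the earlier review of #18477 wanted to avoid; it is shown here only as the portable option.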

Comment on lines +155 to +160
try:
with io.BytesIO(audio_data) as buf:
return librosa.load(buf, sr=sr) # type: ignore[return-value]
except Exception:
pass

Copilot AI Feb 23, 2026

The PR description mentions “narrower exception handling”, but _load_audio_bytes currently uses a broad except Exception: and silently discards the error. This both contradicts the description and makes it hard to diagnose why the fast path failed. Consider catching the expected decode exceptions and logging the exception details at DEBUG (with exc_info=True) before falling back.

Comment on lines +404 to +409
# Decode audio bytes. For container formats (MP4, M4A, WebM) that
# soundfile cannot detect from a BytesIO stream, _load_audio_bytes
# transparently falls back to ffmpeg via an in-memory fd.
# NOTE resample to model SR here for efficiency. This is also a
# pre-requisite for chunking, as it assumes Whisper SR.
y, sr = _load_audio_bytes(audio_data, sr=self.asr_config.sample_rate)
Copilot AI Feb 23, 2026

This change introduces a new ffmpeg-based fallback path for container formats (MP4/M4A/WebM), but the existing transcription tests appear to cover only WAV-like inputs. Adding an automated test that exercises the fallback (and validates it returns audio of the expected duration) would prevent regressions and ensure CI covers the previously broken formats.

@mergify

mergify bot commented Feb 23, 2026

Hi @seanmamasde, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@seanmamasde seanmamasde force-pushed the fix/audio-transcription-mp4-m4a-webm branch 2 times, most recently from 20c55cb to b878842 Compare February 23, 2026 16:04
Collaborator

@NickLucche NickLucche left a comment

Thanks a lot for the detailed breakdown and for contributing to vLLM @seanmamasde !

My only concern is the one I commented about, reporting it here to broaden discussion.

I am somewhat worried about the latency overhead we're introducing here by spawning a separate process at the API level to call ffmpeg. On one side, I understand that a generic fallback like this for all audio types can enhance flexibility. On the other, I wouldn't want to penalize vLLM's perceived latency for an operation that could be carried out in front of vLLM itself.

This may call at least for an optional flag which the user has to explicitly set to opt in and acknowledge the suboptimal conversion (i.e., make this feature optional).

Alternatively, we should consider whether an in-process conversion solution could be adopted here.

Finally, can you provide more info about the mp4 file used for testing (feel free to reach out on slack), so I can add them to our set?

cc @alex-jw-brooks may also be interested

Comment on lines +140 to +143
def _load_audio_bytes(
audio_data: bytes,
sr: int | float,
) -> tuple[np.ndarray, int]:
Collaborator

can we move these two new functions into a new utils.py file here in the same submodule?

Contributor Author

It's moved now! Check out 62f5ce5

with io.BytesIO(audio_data) as buf:
return librosa.load(buf, sr=sr) # type: ignore[return-value]
except Exception:
pass
Collaborator

We can move the rest of the code here instead of passing.
Also, could you check whether the soundfile exception handling introduced in #34715 can be used in place of the generic Exception catch-all?

Contributor Author

Done. utils.py now catches sf.LibsndfileError and checks whether exc.code is in _BAD_SF_CODE

Comment on lines +118 to +122
result = subprocess.run(
cmd,
capture_output=True,
pass_fds=(fd,), # inherit only this fd
)
Collaborator

I am also somewhat worried about the latency overhead we're introducing here by spawning a separate process at the API level.
On one side, I understand that a generic fallback like this for all audio types can enhance flexibility.
On the other, I wouldn't want to penalize vLLM's perceived latency for an operation that could be carried out in front of vLLM itself.

This may call at least for an optional flag which the user has to explicitly set to opt in and acknowledge the suboptimal conversion (i.e., make this feature optional).

Alternatively, we should consider whether an in-process conversion solution could be adopted here.

Contributor Author

It's now done via torchaudio.load, which is in-process, so I guess no flag is needed?

@seanmamasde seanmamasde force-pushed the fix/audio-transcription-mp4-m4a-webm branch 2 times, most recently from 4794f78 to 1258128 Compare February 24, 2026 13:01
@seanmamasde seanmamasde force-pushed the fix/audio-transcription-mp4-m4a-webm branch from 259fca6 to bb8dbce Compare February 24, 2026 13:36
@seanmamasde seanmamasde force-pushed the fix/audio-transcription-mp4-m4a-webm branch from bb8dbce to 914066d Compare February 24, 2026 13:39
@seanmamasde seanmamasde requested a review from tjtanaa as a code owner February 24, 2026 13:39
@mergify mergify bot added rocm Related to AMD ROCm cpu Related to CPU backends labels Feb 24, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Feb 24, 2026
@seanmamasde seanmamasde force-pushed the fix/audio-transcription-mp4-m4a-webm branch from 914066d to 38e8209 Compare February 24, 2026 13:58
@seanmamasde
Contributor Author

seanmamasde commented Feb 24, 2026

> Thanks a lot for the detailed breakdown and for contributing to vLLM @seanmamasde !
>
> My only concern is the one I commented about, reporting it here to broaden discussion.
>
> I am somewhat worried about the latency overhead we're introducing here by spawning a separate process at the API level to call ffmpeg. On one side, I understand that a generic fallback like this for all audio types can enhance flexibility. On the other, I wouldn't want to penalize vLLM's perceived latency for an operation that could be carried out in front of vLLM itself.
>
> This may call at least for an optional flag which the user has to explicitly set to opt in and acknowledge the suboptimal conversion (i.e., make this feature optional).
>
> Alternatively, we should consider whether an in-process conversion solution could be adopted here.
>
> Finally, can you provide more info about the mp4 file used for testing (feel free to reach out on slack), so I can add them to our set?
>
> cc @alex-jw-brooks may also be interested

Audio is generated from a short LibriSpeech (test-clean) speech clip (public domain / LibriVox-derived), downloaded as WAV and then trimmed/resampled to mono 16 kHz and converted to wav, flac, mp3, mpga, ogg, mp4, mpeg, m4a, webm w/ ffmpeg.

LibriSpeech sample mirror: https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav

Contributor

@alex-jw-brooks alex-jw-brooks left a comment

Thanks for this, looks good! Some small suggestions

try:
import soundfile as sf
except ImportError:
sf = None # type: ignore[assignment]
Contributor

Could you refactor this to also use a PlaceholderModule for soundfile?

Contributor Author

done!

try:
with io.BytesIO(audio_data) as buf:
return librosa.load(buf, sr=sr) # type: ignore[return-value]
except Exception as exc:
Contributor

I think it would be better to avoid catching the exception generically like this and handle it more explicitly. For example, if we use the soundfile placeholder, I think we can just catch soundfile.LibsndfileError, inspect the code, and, if it is one of the _BAD_SF_CODES, add a debug log and return decode_audio_bytes_torchaudio(audio_data, sr)?

Using the placeholder would also make failure cases clearer here: soundfile is an explicitly listed optional dependency of vLLM for audio, so if it is not installed, referencing soundfile.LibsndfileError will raise "Please install vllm[audio] for audio support".
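
The placeholder idea can be illustrated with a small stand-in. This sketch is not vLLM's PlaceholderModule and makes no claim about its API; `MissingModule` and `optional_import` are hypothetical names. The point is that attribute access on the stand-in, including inside an `except mod.SomeError:` clause, raises an ImportError that names the extra to install.

```python
import importlib
from typing import Any


class MissingModule:
    """Hypothetical stand-in for an optional dependency that is not installed."""

    def __init__(self, name: str, extra: str) -> None:
        self._name = name
        self._extra = extra

    def __getattr__(self, attr: str) -> Any:
        # Only reached for attributes that are not set on the instance.
        raise ImportError(
            f"{self._name}.{attr} is unavailable; please install {self._extra}"
        )


def optional_import(name: str, extra: str) -> Any:
    """Import name, or return a MissingModule that fails lazily on use."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return MissingModule(name, extra)
```

Failing lazily keeps module import cheap for users who never touch audio, while still producing an actionable message at the first real use.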

Contributor Author

done!

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Mar 3, 2026
@seanmamasde seanmamasde force-pushed the fix/audio-transcription-mp4-m4a-webm branch from 38e8209 to 0cf7845 Compare March 4, 2026 14:59
@mergify mergify bot removed the needs-rebase label Mar 4, 2026
@seanmamasde
Contributor Author

Hi @alex-jw-brooks I have made the changes you suggested. Can you take a look when you have time? Huge thanks.

Contributor

@alex-jw-brooks alex-jw-brooks left a comment

Nice, thanks! LGTM - @NickLucche will need to take one more look to merge I think :)

Collaborator

@NickLucche NickLucche left a comment

Looks good, thanks for the work @seanmamasde @alex-jw-brooks !
Just a comment on where the torchcodec dependency should live (and also torchaudio imo, although that may be work for a separate PR)

ray[cgraph]>=2.48.0
torch==2.10.0
torchaudio==2.10.0
torchcodec==0.10.0 # Required by torchaudio>=2.9 for audio decoding (MP4/M4A/WebM)
Collaborator

I am not sure why torchaudio appears in every requirements file but not in common.txt? @DarkLight1337

Regardless, I think we should add torchcodec to the vllm[audio] extras

vllm/setup.py

Line 1054 in 755356b

"audio": [

Contributor Author

@seanmamasde seanmamasde Mar 7, 2026

Yeah, I should've just added it under the audio section. I removed all the occurrences in the requirements.txt files and put it in setup.py -> audio[] instead. It's fixed now!

@seanmamasde seanmamasde force-pushed the fix/audio-transcription-mp4-m4a-webm branch from 0cf7845 to 9657424 Compare March 7, 2026 09:32
@NickLucche NickLucche added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 9, 2026
Collaborator

@NickLucche NickLucche left a comment

LGTM

@github-project-automation github-project-automation bot moved this from In review to Ready in NVIDIA Mar 9, 2026
@NickLucche NickLucche enabled auto-merge (squash) March 9, 2026 08:21
auto-merge was automatically disabled March 9, 2026 08:23

Head branch was pushed to by a user without write access

@seanmamasde seanmamasde force-pushed the fix/audio-transcription-mp4-m4a-webm branch 5 times, most recently from cb8b2be to c6363fb Compare March 12, 2026 06:45
…mats

Add torchaudio-based fallback decoding for container formats that
librosa/soundfile (libsndfile) cannot handle. When librosa.load()
fails with a LibsndfileError on unsupported formats, fall back to
torchaudio.load() which uses torchcodec/FFmpeg for decoding.

- Add utils.py with load_audio_bytes() and decode_audio_bytes_torchaudio()
- Narrow exception handling to catch sf.LibsndfileError specifically
- Use PlaceholderModule for soundfile import
- Add torchcodec to vllm[audio] extras in setup.py

Signed-off-by: seanmamasde <seanmamasde@gmail.com>
@seanmamasde seanmamasde force-pushed the fix/audio-transcription-mp4-m4a-webm branch from c6363fb to 277488c Compare March 14, 2026 10:19
@vllm-bot vllm-bot merged commit 84868e4 into vllm-project:main Mar 14, 2026
126 of 128 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Mar 14, 2026
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Mar 14, 2026
"soundfile",
"mistral_common[audio]",
"av",
"torchcodec",
Member

@Isotr0py Isotr0py Mar 14, 2026

I'm a bit worried that torchcodec will break audio support on GB200 + aarch64 CPU, because it only distributes x86_64 manylinux wheels (https://pypi.org/project/torchcodec/#files).
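
One way to surface this kind of platform gap early is to probe which optional decoder packages are importable before relying on them. The sketch below is an illustration rather than anything from this PR or #37061; `first_available_backend` is a hypothetical name. It only checks that a top-level package can be located, not that its native libraries load.

```python
import importlib.util
from typing import Optional, Sequence


def first_available_backend(candidates: Sequence[str]) -> Optional[str]:
    """Return the first top-level package name that can be located, if any."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


# Example: prefer torchcodec, then PyAV ("av"); None if neither is installed.
backend = first_available_backend(["torchcodec", "av"])
```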

I opened #37061 to revert this PR and use pyav for video fallback instead.

Contributor Author

@seanmamasde seanmamasde Mar 15, 2026

I actually investigated this a while back:

| lib           | in-process? | mp4/m4a/webm using bytesio                            | new dep |
| ------------- | ----------- | ----------------------------------------------------- | ------- |
| ffmpeg-python | no          | using pipe, but still subprocess                      | yes     |
| pydub         | no          | using pipe, but still subprocess                      | yes     |
| soundfile     | yes         | libsndfile doesn't support mp4/m4a/webm               | no      |
| PyAV (av)     | yes         | av.open(BytesIO(...)) should work                     | yes     |
| torchaudio    | yes         | torchaudio.load(BytesIO(...), format=...) should work | no      |

At the time of implementation, torchaudio seemed like the best bet since it doesn't introduce extra deps and is an in-process conversion (as opposed to tempfile or subprocess w/ ffmpeg). But it seems that starting with torchaudio v2.9.0 it uses torchcodec for torchaudio.save() and torchaudio.load().


Labels

bug Something isn't working ci/build cpu Related to CPU backends frontend nvidia ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm

Projects

Status: Done
Status: Done

6 participants