[Feature] WebSocket streaming audio input for ASR#22821
Closed
SammLSH wants to merge 1 commit into
Closed
Conversation
Contributor
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Implements M1 of the RFC in #22474.
PR #22089 shipped chunked streaming output for Qwen3-ASR via
POST /v1/audio/transcriptions?stream=true(SSE over an HTTP upload), which assumes the entire audio file is known up-front. Real-time use cases (live captioning, voice assistants, meeting transcription) need the opposite direction: the server accepts audio as it arrives and pushes partial transcripts back as the speaker talks. This PR adds that path as a new WebSocket endpoint, reusing the existing chunked inference state machine andTranscriptionAdapterfrom PR #22181.Modifications
New WebSocket endpoint
WS /v1/audio/transcriptions/stream(registered inhttp_server.py)session.start(JSON) → binary PCM16 frames →session.end(JSON)session.started/transcript.delta(per word) /transcript.final/errorsession.startaccepts anaudio_formatfield for forward compatibility. Currently onlypcm16_16k_monois supported; other values returninvalid_audio_format. Additional formats can be added without a protocol break.--asr-max-buffer-secondsCLI flag (default 60s). If accumulated server-side audio exceeds the cap, the server sends abuffer_overflowerror and closes the socket. Below the cap, the single-task coroutine alternates receive and inference, so while a chunk is inferring the client experiences standard TCP-level backpressure (ws.sendblocks on a full socket buffer). No silent drop.session.endis therefore serialized after any in-flight chunk.Note on HTTP SSE path: switching
get_prefix_text()fromconfirmed_texttoemitted_text(seestreaming_asr.pyrow below) also incidentally fixes a latent prompt-prefix continuation issue in the HTTP SSE path from #22089, whereconfirmed_textcould roll back mid-sentence and cause the model to re-emit from scratch on long English audio. Regression covered by the existing 8 HTTP tests.Architecture
The HTTP SSE and WebSocket paths both route through a shared inference driver, keeping streaming state in
StreamingASRStateat the adapter layer rather than lifting it into the transport layer.WebSocket session lifecycle
Inside
serving_transcription_websocket.py, one coroutine alternates receive and inference. Each PCM batch ofchunk_size_bytestriggers exactly one inference pass through the shared driver.Files touched
WS assertions use
_assert_close_to_ref(WER ≤ 0.15)againstEXPECTED_TRANSCRIPTS(a dict of canonical transcripts captured from one-shot non-streaming inference)._wernormalizes case/punctuation and falls back to character-level comparison for CJK.Manual (single audio, step-by-step)
Launch the server:
HTTP non-streaming (baseline / ground truth):
HTTP SSE streaming:
WebSocket streaming — save as
wsasr.pyand runpython wsasr.py audio.wav [language]. Sends PCM in 0.5s frames at wall-clock realtime pacing while concurrently readingtranscript.deltaevents; each line is prefixed witht=<sec>from session start so you can see deltas arriving while audio is still uploading.Error-path verification
Manually tested all 8 error codes against a running server:
invalid_json,invalid_payload,invalid_state× 3 variants,invalid_audio_format,unknown_message,buffer_overflow,unsupported_model,internal_error. All return the documented error event and close the socket.Speed Tests and Profiling
No impact on inference speed — this PR is a thin WebSocket transport layer on top of unchanged chunked inference. Per-chunk latency is bound by the existing
chunk_size_sec(2s) + model inference time (~0.5–1.5s on H100 for Qwen3-ASR-0.6B). No new CUDA kernels, no new memory patterns, no scheduler changes.Checklist
test/manual/models/test_qwen3_asr.py)Review and Merge Process
Related
TranscriptionAdapterrefactor, merged)cc @JustinTong0323 @AgainstEntropy
## MotivationImplements M1 of the RFC in #22474.
PR #22089 shipped chunked streaming output for Qwen3-ASR via
POST /v1/audio/transcriptions?stream=true(SSE over an HTTP upload), which assumes the entire audio file is known up-front. Real-time use cases (live captioning, voice assistants, meeting transcription) need the opposite direction: the server accepts audio as it arrives and pushes partial transcripts back as the speaker talks. This PR adds that path as a new WebSocket endpoint, reusing the existing chunked inference state machine andTranscriptionAdapterfrom PR #22181.Modifications
New WebSocket endpoint
WS /v1/audio/transcriptions/stream(registered inhttp_server.py)session.start(JSON) → binary PCM16 frames →session.end(JSON)session.started/transcript.delta(per word) /transcript.final/errorsession.startaccepts anaudio_formatfield for forward compatibility. Currently onlypcm16_16k_monois supported; other values returninvalid_audio_format. Additional formats can be added without a protocol break.--asr-max-buffer-secondsCLI flag (default 60s). If accumulated server-side audio exceeds the cap, the server sends abuffer_overflowerror and closes the socket. Below the cap, the single-task coroutine alternates receive and inference, so while a chunk is inferring the client experiences standard TCP-level backpressure (ws.sendblocks on a full socket buffer). No silent drop.session.endis therefore serialized after any in-flight chunk.Note on HTTP SSE path: switching
get_prefix_text()fromconfirmed_texttoemitted_text(seestreaming_asr.pyrow below) also incidentally fixes a latent prompt-prefix continuation issue in the HTTP SSE path from #22089, whereconfirmed_textcould roll back mid-sentence and cause the model to re-emit from scratch on long English audio. Regression covered by the existing 8 HTTP tests.Architecture
The HTTP SSE and WebSocket paths both route through a shared inference driver, keeping streaming state in
StreamingASRStateat the adapter layer rather than lifting it into the transport layer.flowchart TB HTTP["HTTP<br/>(file upload)"] WS["WebSocket<br/>(live PCM frames)"] subgraph serving["OpenAIServingTranscription"] SSE["_generate_chunked_asr_stream"] WSH["handle_websocket<br/>(delegator)"] end PAC["process_asr_chunk<br/>(shared inference driver)"] subgraph state["StreamingASRState"] CT["confirmed_text<br/>chunk-local rollback, delta diff basis"] ET["emitted_text<br/>monotonic accumulator, prompt prefix source"] UF["update() / finalize()"] end HTTP --> SSE WS --> WSH SSE --> PAC WSH --> PAC PAC --> stateWebSocket session lifecycle
Inside
serving_transcription_websocket.py, one coroutine alternates receive and inference. Each PCM batch ofchunk_size_bytestriggers exactly one inference pass through the shared driver.sequenceDiagram autonumber participant C as Client participant H as WS handler participant I as process_asr_chunk participant S as StreamingASRState C->>H: session.start (JSON) H->>H: _init_session<br/>accept + adapter capability check H-->>C: session.started loop per chunk_size_bytes of new audio C->>H: binary PCM16 frames H->>H: _handle_audio_frame<br/>accumulate into pcm_buffer H->>I: _run_inference(_pcm_to_wav(buffer)) I->>S: update() → delta I-->>H: delta str H-->>C: transcript.delta (per word) end C->>H: session.end (JSON) H->>I: _run_inference(is_last=True) I->>S: finalize() → tail I-->>H: tail delta H-->>C: transcript.final H->>C: close socketFiles touched
python/sglang/srt/entrypoints/http_server.pyWS /v1/audio/transcriptions/streamroutepython/sglang/srt/entrypoints/openai/serving_transcription.pyprocess_asr_chunkintostreaming_asr.pyso HTTP and WS share the inference driver; addhandle_websocketdelegatorpython/sglang/srt/entrypoints/openai/serving_transcription_websocket.py_pcm_to_wavadapter for protocol-fixed PCM16/16kHz/mono,transcript.final.textconstructionpython/sglang/srt/entrypoints/openai/streaming_asr.pyStreamingASRStatewith anemitted_textaccumulator used as the prompt prefix inget_prefix_text()(previouslyconfirmed_text); also extractprocess_asr_chunkas the shared HTTP/WS inference driver and add a_normalize_whitespacehelper for batched-inference punctuation jitterpython/sglang/srt/entrypoints/websocket_base.pyWebSocketSessionBaseminimal mixin (accept / send_json / safe_close) so future WS endpoints can reuse itpython/sglang/srt/server_args.pyasr_max_buffer_seconds: int = 60+ CLI flagtest/manual/models/test_qwen3_asr.pyEXPECTED_TRANSCRIPTSreference dict,_werLevenshtein helper, WS assertions via_assert_close_to_ref(WER ≤ 0.15)Accuracy Tests
Manual end-to-end verification against HTTP non-streaming (one-shot) ground truth across 7 audio fixtures × 3 paths (HTTP JSON / HTTP SSE / WebSocket):
Oh yeah, yeah. He wasn't even that big when I started listening to him. But and his solo music...for other people.Uh huh.prefix hallucination (vocal-fry intro, chunked-inference artifact)甚至出现交易几乎停滞的情况。I have a dream that one day this nation will rise up and live out the true meaning of its creed.He hoped there would be stew for dinner—turnips and carrots and bruised potatoes and fat mutton pieces—to be ladled out in thick peppered flour-fatted sauce.—→:y en las ramas medio sumergidas revoloteaban algunos pájaros de químico y legendario plumajeY ... plumaje.मिर्ची में कितने विभिन्न प्रजातियाँ हैंI know kung fu.WER threshold rationale: WS assertions use
WER ≤ 0.15. This tolerates the chunked-inference boundary artifacts inherited from #22089 (e.g. theUh huh.prefix on the EN clip) while still catching real regressions — the non-streaming ground-truth WER on these fixtures is ≤0.02, so the 0.15 threshold leaves plenty of headroom.5 consecutive stability runs of the 18-test suite passed (all 18 tests green each run). Median wall-clock 52.4s on a single H100.
Streaming verification: on the 15s EN audio in realtime-pacing mode (client sends PCM at 0.5s/frame = wall-clock rate), ~30 out of 37
transcript.deltaevents arrive at the client before the client sendssession.end, confirming true incremental server push (not batch-at-end).M1 completion checklist (from RFC)
/v1/audio/transcriptions/streamsession.start/ binary PCM /session.end⇄session.started/transcript.delta/transcript.final/errorinvalid_json,invalid_payload,invalid_state,invalid_audio_format,unknown_message,buffer_overflow,unsupported_model,internal_error--asr-max-buffer-secondsbackpressureStreamingASRState,process_asr_chunk,TranscriptionAdapterGenerateReqInput.stream=True, forced-alignment timestamps): out of scopeKnown limitations
Listed here so reviewers don't have to re-discover them:
"Uh huh."prefix hallucination because the first 2s chunk sees only vocal fry / silence, and the append-only delta protocol can't retract it later. WER ~0.057.transcript.deltaevent becausestr.split()can't word-tokenize them. Final transcript is still correct. TODO already inStreamingASRStatedocstring.mm_utils.py:_adjust_embedding_length. Upstream bug, unrelated to this PR.嗯哼。). Short-clip test uses 3s MP3 to avoid this.Test plan
Tests live under
test/manual/because they require downloading theQwen/Qwen3-ASR-0.6Bcheckpoint (~1.2GB) and a GPU. They are runnable locally with a single command and complete in ~52s on one H100.Automated (single command)
cd /path/to/sglang python test/manual/models/test_qwen3_asr.pyThe file uses
popen_launch_serverto spin up its ownsglang.launch_server, runs 18 tests, and tears the server down.Test breakdown:
WS assertions use
_assert_close_to_ref(WER ≤ 0.15)againstEXPECTED_TRANSCRIPTS(a dict of canonical transcripts captured from one-shot non-streaming inference)._wernormalizes case/punctuation and falls back to character-level comparison for CJK.Manual (single audio, step-by-step)
Launch the server:
HTTP non-streaming (baseline / ground truth):
curl -s http://127.0.0.1:30000/v1/audio/transcriptions \ -F file=@audio.wav -F model=qwen3-asr | jq -r .textHTTP SSE streaming:
WebSocket streaming — save as
wsasr.pyand runpython wsasr.py audio.wav [language]. Sends PCM in 0.5s frames at wall-clock realtime pacing while concurrently readingtranscript.deltaevents; each line is prefixed witht=<sec>from session start so you can see deltas arriving while audio is still uploading.Error-path verification
Manually tested all 8 error codes against a running server:
invalid_json,invalid_payload,invalid_state× 3 variants,invalid_audio_format,unknown_message,buffer_overflow,unsupported_model,internal_error. All return the documented error event and close the socket.Speed Tests and Profiling
No impact on inference speed — this PR is a thin WebSocket transport layer on top of unchanged chunked inference. Per-chunk latency is bound by the existing
chunk_size_sec(2s) + model inference time (~0.5–1.5s on H100 for Qwen3-ASR-0.6B). No new CUDA kernels, no new memory patterns, no scheduler changes.Checklist
test/manual/models/test_qwen3_asr.py)