Feat/realtime asr input slicing #26767

Closed
SammLSH wants to merge 2 commits into sgl-project:main from SammLSH:feat/realtime-asr-input-slicing

SammLSH commented May 30, 2026

Motivation

This PR implements the M2 milestone of RFC #22474 (bounding long-form cost) via input slicing, rather than the RadixCache prefix-caching the RFC sketched (see "Relationship to RFC M2").

The WS /v1/realtime path from #22848 re-sends the entire accumulated PCM buffer to the model on every chunk, so per-chunk work grows linearly with audio length and total prefill is quadratic in session length. On long-form audio this causes:

  1. Unbounded per-chunk latency. Worst single inference on a 300 s session is 399 ms (cumulative) vs 121 ms (sliced); end-to-end wall improves 14–78 % across 30–900 s, and realtime-paced per-chunk inference drops 40–42 % on both 0.6B and 1.7B.
  2. Over-emission on repetitive audio. The cumulative delta stream balloons ~35× beyond gold (EN180: 75,035 vs ~2,160 chars) — model-agnostic (same on 0.6B and 1.7B), since the pathology is in the call pattern, not the model. Slicing stays near gold.
  3. WS keepalive timeouts. At 600/900 s the cumulative path trips the OpenAI SDK's 20 s ping budget (1011 keepalive ping timeout); bounded inference never threatens it.
  4. Encoder activation growth. audio_tower activations grow with length and risk OOM; slicing caps per-call audio regardless of session length.

Slicing keeps per-call audio at one chunk plus a short left overlap, independent of session length, so per-chunk work stays flat.

A small cleanup ships alongside: the realtime path no longer round-trips PCM16 through a WAV byte stream just for load_audio to decode it back to a float ndarray.

Defaults referenced below: chunk_size_sec = 2 s; slicing engages after 8 chunks (ceil(min_audio_sec / chunk_size_sec), ~16 s); steady-state per-call audio is one chunk + a 2 s left overlap (~4 s).
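
To make the linear-per-chunk, quadratic-in-total claim concrete, here is a back-of-the-envelope sketch that counts seconds of audio fed to the model under each call pattern, using the defaults above. It is a simplified model, not code from this PR: `audio_fed_per_call` is a hypothetical helper, RadixCache reuse, token counts, and decode cost are ignored, and the treatment of the first gated call (still fed the full buffer) is an assumption taken from the slicing description.

```python
import math

CHUNK_SEC = 2.0       # chunk_size_sec default from the PR
MIN_AUDIO_SEC = 16.0  # slicing gate: ceil(16 / 2) = 8 chunks
OVERLAP_SEC = 2.0     # left overlap default from the PR


def audio_fed_per_call(session_sec: float, sliced: bool) -> list:
    """Seconds of audio sent to the model on each chunk (simplified model)."""
    n_chunks = math.ceil(session_sec / CHUNK_SEC)
    gate = math.ceil(MIN_AUDIO_SEC / CHUNK_SEC)
    fed = []
    for i in range(1, n_chunks + 1):  # i = number of chunks buffered so far
        # Assumption: calls up to and including the first gated call (i = gate + 1)
        # still receive the full buffer; later calls receive chunk + overlap.
        if sliced and i > gate + 1:
            fed.append(CHUNK_SEC + OVERLAP_SEC)
        else:
            fed.append(i * CHUNK_SEC)
    return fed


for seconds in (60, 300):
    cumulative = sum(audio_fed_per_call(seconds, sliced=False))
    sliced = sum(audio_fed_per_call(seconds, sliced=True))
    print(f"{seconds:>4} s session: cumulative {cumulative:.0f} s fed, sliced {sliced:.0f} s fed")
```

Under these assumptions a 300 s session feeds 22,650 s of audio in total on the cumulative path versus 654 s when sliced. The cumulative total grows with the square of the chunk count, while the sliced total grows linearly once the gate is passed.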

Realtime ASR Roadmap

```mermaid
flowchart LR
    M1["M1: Functional Realtime ASR<br/>- WS /v1/realtime<br/>- session.update / append / commit / clear<br/>- partial and completed transcript events<br/>- shared ASR streaming driver<br/><br/><b>Usable realtime ASR</b>"]
    G1["Remaining after M1<br/>Full accumulated audio is still reprocessed<br/>Per-chunk cost grows with session length"]
    M2["M2: Long-Form Cost Bounding<br/>this PR<br/>- input slicing after stability gate<br/>- chunk + left-overlap audio window<br/>- output-side dedupe<br/>- direct PCM16 to float samples<br/>- bounded encoder + prefill work<br/><br/><b>Affordable long-form realtime ASR</b>"]
    G2["Remaining after M2<br/>Not exact prefix caching<br/>Not incremental encoder<br/>Dedupe remains heuristic<br/>PCM buffer compaction is future work"]
    M3["M3: Stateful Streaming ASR<br/>future<br/>- incremental audio encoder state<br/>- rolling audio buffer<br/>- token/timestamp-level alignment<br/>- stable-prefix commit<br/>- draft/final separation<br/>- session-aware scheduling<br/><br/><b>Stateful and robust realtime ASR</b>"]
    Target["Target: True Realtime Streaming ASR<br/>- low TTFT and stable per-chunk latency<br/>- bounded compute and memory<br/>- no full cumulative reprocessing<br/>- fewer heuristic correctness tradeoffs<br/>- extensible across ASR / omni models"]

    M1 --> G1 --> M2 --> G2 --> M3 --> Target
```

M1 makes realtime ASR usable, M2 makes long-form realtime ASR affordable, and M3 moves the system toward truly stateful realtime streaming.

Relationship to RFC M2

RFC #22474 sketches M2 as cross-chunk prefix caching via RadixCache: keep the full audio context, let RadixCache match the shared token prefix, and run LLM prefill only on the new tail tokens. This PR does not take that route. It addresses M2's goal — eliminating the quadratic long-form cost — with a different mechanism: input slicing.

| Dimension | RFC M2 (RadixCache prefix caching) | This PR (input slicing) |
|---|---|---|
| Mechanism | Retain full audio context; RadixCache matches shared token prefix; prefill only the new tail tokens | Discard old audio; feed only ~4 s (chunk + overlap); restore continuity via acoustic overlap + output-side dedupe |
| What it bounds | LLM prefill only | Encoder and LLM prefill |
| Encoder cost | Still re-runs on the full buffer every chunk (RFC's own stated limitation) → long audio still risks OOM | Bounded at ~4 s; does not grow |
| Cache key | Needs a content-aware multimodal key (flagged non-trivial in RFC review) | Untouched |
| Correctness | Exact (same tokens, only cached) | Approximate (dedupe heuristic) |

Why slicing was chosen over the RadixCache route:

  • It bounds the encoder, which RadixCache does not. The RFC explicitly notes the RadixCache approach reduces only LLM prefill re-work while the encoder still re-runs on the full accumulated buffer every chunk — leaving the OOM risk in place. Slicing bounds the encoder too, closing that gap. The flat per-chunk latency at 300 s (80 ms mean for slicing, encoder + prefill both bounded) is the observable consequence.
  • It does not touch the multimodal cache key. Review on the RFC confirmed the current key hashes the whole audio feature tensor, so a content-aware key would be a non-trivial change to the multimodal processor and radix cache. Slicing avoids that surface entirely.
  • The cost is exactness → approximation, and it is gated. Slicing trades exact per-token caching for a dedupe heuristic. The 8-chunk threshold (with Qwen3-ASR's chunk_size_sec=2 s and min_audio_sec=16 s, ceil(16/2)=8) confines that approximation to long-form audio (≥ ~16 s); short audio rides the original cumulative path end-to-end and stays byte-equal to HTTP SSE.

This PR implements the M2 long-form cost-bounding milestone via input slicing. It differs from the initial RadixCache prefix-caching sketch in the RFC, but targets the same M2 goal: removing the quadratic long-form reprocessing cost. Exact RadixCache prefix reuse and token-level streaming / alignment (M3) remain valid future work; slicing and prefix caching could even compose later (slice the audio and cache the bounded tail's prefix).

Modifications

Runtime changes primarily touch realtime/session.py, streaming_asr.py, utils/common.py, and the transcription adapter config — transcription_adapters/base.py ships a base realtime_slicing_config that keeps slicing off by default (new adapters must opt in with enabled=True), and transcription_adapters/qwen3_asr.py overrides it with the Qwen3-ASR-tuned values (left_overlap_ms=2000, min_audio_sec=16.0). This PR also adds a small CPU unit-test suite for the helper functions.
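
As a rough illustration of the opt-in shape described above, the sketch below shows a base adapter that keeps slicing off and a Qwen3-ASR adapter that overrides it. The class names, the use of a property, and the exact set of keys in the base config are assumptions for illustration; only `enabled`, `left_overlap_ms=2000`, and `min_audio_sec=16.0` come from this description.

```python
from typing import Any, Dict


class BaseAdapterSketch:
    """Hypothetical stand-in for the base transcription adapter."""

    @property
    def realtime_slicing_config(self) -> Dict[str, Any]:
        # Slicing stays off unless an adapter explicitly opts in.
        return {"enabled": False}


class Qwen3ASRAdapterSketch(BaseAdapterSketch):
    """Hypothetical stand-in for the Qwen3-ASR adapter override."""

    @property
    def realtime_slicing_config(self) -> Dict[str, Any]:
        # Values quoted in the PR description for Qwen3-ASR.
        return {"enabled": True, "left_overlap_ms": 2000, "min_audio_sec": 16.0}


for adapter in (BaseAdapterSketch(), Qwen3ASRAdapterSketch()):
    print(type(adapter).__name__, adapter.realtime_slicing_config)
```

Keeping the default at `enabled=False` means an adapter for a different ASR model inherits the cumulative behavior until someone tunes the overlap and gate for it.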

The primary change slices pending audio once committed text rolls forward; one small cleanup ships alongside it (skipping the PCM → WAV → ndarray round trip). Both touch _run_inference, so they sit in the same diff. The unit test verifies bit-equality of the new direct conversion against the legacy sf.write → sf.read fallback path; with the ndarray passthrough in load_audio, the realtime PCM path bypasses file / byte decoding entirely. The cleanup is mechanically independent of slicing; it is bundled here only to avoid landing a second PR that touches the same function immediately after this one.

HTTP and HTTP SSE transcription endpoints are unchanged on the wire, except that the StreamingASRState.update() reconciliation step (shared with the HTTP SSE chunked-streaming path) drops a buggy str.startswith fast path — see the Reconciliation fix note below for the scenario and behavior delta.

PCM/WAV cleanup. PCM16 bytes are converted directly to float samples via np.frombuffer(int16) / 32768.0, which matches soundfile.read's default normalization, so the float values reaching the encoder are bit-equal to the legacy sf.write → sf.read fallback path covered by the unit test. load_audio gains an early-return passthrough when its argument is already an ndarray, which is what the new realtime path passes in.
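
A minimal sketch of the direct conversion and the passthrough, assuming little-endian mono PCM16 input. The function names here are illustrative (the PR's helper is `_pcm_to_float_samples`, whose exact signature is not shown in this description), and `load_audio_sketch` only models the early return, not the real `load_audio`.

```python
import numpy as np


def pcm16_to_float_samples(pcm: bytes) -> np.ndarray:
    """Decode little-endian signed 16-bit PCM into float64 samples in [-1.0, 1.0)."""
    # int16 divided by a Python float promotes to float64.
    return np.frombuffer(pcm, dtype="<i2") / 32768.0


def load_audio_sketch(audio):
    """Hypothetical passthrough: already-decoded arrays skip file/byte decoding."""
    if isinstance(audio, np.ndarray):
        return audio
    raise NotImplementedError("path / bytes decoding is out of scope for this sketch")


pcm = np.array([0, 16384, -32768, 32767], dtype="<i2").tobytes()
samples = pcm16_to_float_samples(pcm)
print(samples, samples.dtype)
```

Dividing by 32768 rather than 32767 maps the most negative int16 value to exactly -1.0, which is why the largest positive sample lands just below 1.0.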

Reconciliation fix. StreamingASRState.update() previously short-circuited the per-chunk delta computation with confirmed_text.startswith(old_confirmed) and a [len(old):] slice. Because startswith is character-level, when the model extended a confirmed trailing word ("world" → "worldly") the string prefix still matched and it emitted the mid-word fragment "ly" rather than the corrected word "worldly". The two-line char-level fast path is deleted so the word-level common-prefix scan below it (which already handled revisions, and which finalize() already uses) runs unconditionally, with no new branches. A differential check over 20k randomized transcript evolutions confirms the two paths diverge only on this mid-word case, where the word-level scan is correct. Two new pure-CPU unit tests in test_streaming_asr.py lock it: a mid-word-extension case (asserts "worldly", not "ly") and a clean-append case (guards that the common path is unchanged). The bug originated in #22089 (the original Qwen3-ASR chunked-streaming PR); this PR fixes it in place because the slicing path also funnels output through update(), so leaving it would let mid-word fragments leak through the slicing path too.
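
The difference between the two paths can be reproduced in a few lines. This is a simplified sketch, not the actual `StreamingASRState.update()` body: both functions are hypothetical stand-ins that take the previously confirmed text and the newly confirmed text and return the delta to emit.

```python
def word_level_delta(old: str, new: str) -> str:
    """Emit every word after the longest common word-level prefix."""
    old_words, new_words = old.split(), new.split()
    common = 0
    for a, b in zip(old_words, new_words):
        if a != b:
            break
        common += 1
    return " ".join(new_words[common:])


def char_level_delta(old: str, new: str) -> str:
    """The removed fast path: a character-level prefix check before the word scan."""
    if new.startswith(old):
        return new[len(old):]
    return word_level_delta(old, new)


old, new = "hello world", "hello worldly"
print(repr(char_level_delta(old, new)))  # mid-word fragment
print(repr(word_level_delta(old, new)))  # whole corrected word
```

On the mid-word case the character-level path returns `'ly'` and the word-level scan returns `'worldly'`. On a clean append such as `"hello world"` to `"hello world again"` both paths carry the same new word (the character-level slice merely keeps the leading space).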

Slicing. Once the chunked-streaming StreamingASRState has stable emitted text, is past the K-token holdback gate, and has accumulated at least eight chunks (≈ 16 s) of cumulative context, the model runs on pcm_buffer[committed_audio_until_bytes - left_overlap_bytes:] (the uncommitted tail plus a 2 s left overlap) instead of re-sending the full cumulative buffer. The prompt stays at adapter.prompt_template; emitted_text is not injected as a continuation prefix. In our measurements the retained acoustic overlap plus output-side dedupe is a stronger signal than text-prefix injection and avoids re-priming the model on its own output. A word-level dedupe (with a CJK char-level fallback) drops the resulting duplicate transcription before it reaches StreamingASRState. Note: the FIRST inference past the gate (chunk index = 8 with default config) still feeds the full accumulated buffer because committed_audio_until_bytes = 0 at that point makes slice_start = 0. After that first gated transition call, steady-state per-call audio is one chunk plus the left overlap, ~4 s with the Qwen3-ASR defaults.
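
The slice arithmetic can be sketched as follows, assuming 16 kHz mono PCM16 (2 bytes per sample). The helper names are hypothetical, and clamping the start at 0 is an assumption inferred from the note that `committed_audio_until_bytes = 0` yields `slice_start = 0`; the PR's own `_slice_pcm_from` may validate differently.

```python
BYTES_PER_SAMPLE = 2  # PCM16 mono


def left_overlap_bytes(left_overlap_ms: int, sample_rate: int) -> int:
    """Convert the overlap from milliseconds to a sample-aligned byte count."""
    return left_overlap_ms * sample_rate // 1000 * BYTES_PER_SAMPLE


def slice_pending(pcm_buffer: bytes, committed_until_bytes: int, overlap_bytes: int) -> bytes:
    """Return the uncommitted tail plus the left overlap (clamped at buffer start)."""
    slice_start = max(0, committed_until_bytes - overlap_bytes)
    return pcm_buffer[slice_start:]


rate = 16000
overlap = left_overlap_bytes(2000, rate)           # 2 s of overlap
buffer_18s = bytes(18 * rate * BYTES_PER_SAMPLE)   # 18 s of silence
committed_16s = 16 * rate * BYTES_PER_SAMPLE       # first 16 s already committed

tail = slice_pending(buffer_18s, committed_16s, overlap)
print(len(tail) / (rate * BYTES_PER_SAMPLE), "seconds fed")
```

With 16 s committed out of an 18 s buffer, the slice covers one new 2 s chunk plus the 2 s overlap, which is the ~4 s steady state described above. With nothing committed yet, the same call returns the whole buffer.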

The slicing trigger gates on state.get_prefix_text() returning non-empty AND state.chunk_index >= slicing_min_chunk_index (derived as math.ceil(adapter.realtime_slicing_config['min_audio_sec'] / chunk_size_sec), =8 for Qwen3-ASR's min_audio_sec=16 s / chunk_size_sec=2 s), so pre-K-chunk, CJK single-word-fallback, and short audio overall stay on the cumulative behavior.
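
Expressed as code, the trigger reduces to a small predicate. This is a sketch under the conditions stated above; `should_slice` and its parameter names are hypothetical, and the real check lives inside the realtime session code.

```python
import math


def slicing_min_chunk_index(min_audio_sec: float, chunk_size_sec: float) -> int:
    """Number of chunks that must accumulate before slicing may engage."""
    return math.ceil(min_audio_sec / chunk_size_sec)


def should_slice(prefix_text: str, chunk_index: int, min_chunk_index: int, enabled: bool = True) -> bool:
    """Slice only when opted in, text has been confirmed, and the gate is reached."""
    return enabled and bool(prefix_text) and chunk_index >= min_chunk_index


gate = slicing_min_chunk_index(min_audio_sec=16.0, chunk_size_sec=2.0)
print(gate)
print(should_slice("one day", chunk_index=8, min_chunk_index=gate))
print(should_slice("", chunk_index=12, min_chunk_index=gate))
```

An empty prefix keeps the session on the cumulative path even when the chunk count is past the gate, which is how the pre-K-chunk and CJK single-word-fallback cases stay unsliced.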

Left overlap (2 s). Sized to Qwen3-ASR's unfixed_token_num = 5 rollback window (K = 5 tokens ≈ 2 s of English audio at the model's effective token rate). In our long-form English tests 3 s produced more duplicate-word leaks through the dedupe heuristic, so this PR uses 2 s as a conservative default. Smaller values risk dropping audio the previous inference left unconfirmed.

Min-chunk gate (8 chunks ≈ 16 s). Empirical. Above this threshold the model's per-chunk transcription on a sliced 4 s tail matches what it produces on the equivalent cumulative buffer closely enough that the word-level dedupe matches cleanly. Below it (Hindi 4 s, Spanish 7 s, MLK 13 s were the failure cases we observed), the bare-prompt slice produces a different word sequence than cumulative, the dedupe over-matches, and genuine new content is dropped. Below the gate slicing simply doesn't engage — the path falls back to cumulative end-to-end, so short audio is unaffected at the transcript level.

Model-specificity (important for the model-agnostic adapter layer). Both the 8-chunk threshold (with Qwen3-ASR's chunk_size_sec=2 s and min_audio_sec=16 s, ceil(16/2)=8) and the 2 s left overlap (from adapter.realtime_slicing_config['left_overlap_ms']) were tuned against Qwen3-ASR's unfixed_token_num = 5 on seven fixtures, yet the WS path is built on the model-agnostic TranscriptionAdapter. The overlap is in principle derivable from the adapter's rollback window rather than hard-coded (a follow-up could expose it on the adapter); the 8-chunk gate is documented in the _AudioState docstring as Qwen3-ASR-tuned and likely needs re-tuning for other ASR models.

Accuracy Tests

Short audio: no regression vs #22848 + gate fix

No regression on the 7-fixture multilingual three-path consistency from #22848's review round 2 (HTTP non-stream / HTTP SSE / WS realtime byte-equal where they were before, same WER profile, same delta counts). The new 8-chunk threshold (with Qwen3-ASR's chunk_size_sec=2 s and min_audio_sec=16 s, ceil(16/2)=8) additionally fixed pre-gate WS-path word drops on three short fixtures: MLK 13 s (lost ~8 words), Spanish 6.6 s (lost medio sumergidas), Hindi 4.1 s (lost में कितने) — all three are now byte-equal to HTTP SSE / HTTP non-stream. Direct cumulative-vs-slicing byte-equality across short fixtures is in the "Multilingual sanity" table below.

Long-form audio (slicing engaged past chunk 8)

On varied long-form English (TED talks: Robert Gupta and Daniel Kahneman from the distil-whisper/tedlium-long-form set), final transcripts on the two paths are broadly consistent — no truncation, no hallucination divergence. The cumulative path emits slightly more deltas than slicing (e.g. 767 vs 698 at 300 s) because cumulative inference produces more intermediate revisions. Per-chunk casing and punctuation can differ because the LLM re-tokenizes around chunk boundaries; the underlying word sequence agrees. TED 30 s sample:

  • cumulative: "One day, Los Angeles Times columnist Steve Lopez was walking along the streets of downtown Los Angeles..."
  • slicing: "One day, Los Angeles Times columnist Steve Lopez was walking along the streets of downtown Los Angeles..."

StreamingASRState.update() is shared with the HTTP SSE path and was designed for cumulative transcripts; the slicing path feeds it deduped tail-only output. This works because the wire transcript is built from item.emitted_deltas, not from state.full_transcript (which under slicing only contains the last deduped tail). The behavior is documented in _AudioState's docstring.

Worst case: repetitive long-form content

The TED-talk result above is the typical case; it does not generalize to repetitive long-form audio. On a tiled-repetitive English fixture (the same short English clip repeated to fill 180 s / 240 s), the cumulative path's WS delta stream balloons far beyond gold, while the slicing path stays bounded:

| fixture | gold chars | cumulative chars | slicing chars | cumulative over-emit | slicing vs gold |
|---|---|---|---|---|---|
| EN180 | ~2,160 | 75,035 | 1,920 | ~35× | −11 % |
| EN240 | ~2,880 | 106,531 | 2,524 | ~37× | −12 % |

The table shows two points:

  • Cumulative is badly broken here (~35–37× over-emission), and the pathology is model-agnostic: EN180 cumulative output is 75,035 chars on both Qwen3-ASR-0.6B and Qwen3-ASR-1.7B, because it lives in the call pattern (re-feeding cumulative audio), not the model.
  • Slicing is bounded but slightly under gold (~11–12 %), not exactly at gold. This is the dedupe heuristic over-matching genuine repeats — the same boundary-aligned real-word-repetition trade-off documented in [Feature] WebSocket streaming audio input for ASR #22848, surfacing on long repetitive audio where dedupe is doing the heavy lifting. Slicing is dramatically better than cumulative's 35× blow-up, but it is a trade-off, not a perfect transcript. See "Where this PR does not help" below.
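
The over-match on genuine repeats follows directly from how a suffix/prefix dedupe works. The sketch below is a simplified illustration, not the PR's `dedupe_overlap`: it assumes a word-level match with lowercase and punctuation-stripping normalization, and it omits the CJK char-level fallback. The function name and `max_words` window are hypothetical.

```python
import re


def _norm(word: str) -> str:
    # Lowercase and strip punctuation (including dashes) so "Fox," matches "fox".
    return re.sub(r"[^\w]", "", word.lower())


def dedupe_overlap_sketch(emitted: str, candidate: str, max_words: int = 16) -> str:
    """Drop the longest candidate prefix that repeats the emitted suffix."""
    tail = [_norm(w) for w in emitted.split()][-max_words:]
    cand = candidate.split()
    cand_norm = [_norm(w) for w in cand]
    for k in range(min(len(tail), len(cand)), 0, -1):
        if tail[-k:] == cand_norm[:k]:
            return " ".join(cand[k:])
    return candidate


# Overlap audio re-transcribed: the duplicate words are removed as intended.
print(dedupe_overlap_sketch("the quick brown fox", "Brown fox, jumps over"))
# Genuinely repeated speech looks identical to overlap and is also removed.
print(dedupe_overlap_sketch("no no no", "no no yes"))
```

In the second call the speaker really did say "no no" again, but the heuristic cannot tell that apart from re-transcribed overlap, so only "yes" survives. On tiled-repetitive fixtures this is the mechanism behind slicing landing below gold.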

Unit coverage added in this PR

test/registered/unit/entrypoints/openai/test_streaming_asr.py — 14 cases, registered for CI (base-a-test-cpu, est_time=3), pure unittest, no GPU:

  • process_asr_chunk integration (3 cases, mock generate_request + real StreamingASRState + dedupe — the same style as test_serving_transcription.py): cumulative path injects prompt_template + get_prefix_text() and runs no dedupe; slicing path uses the bare prompt and dedupe trims the overlapping leading word before state ingests it; the is_last path dedupes before finalize(). These cover the M2 prompt-override + dedupe×update() two-stage — the PR's main correctness surface.
  • Slicing-enable guard (3 cases, RealtimeConnection.__init__, no GPU): overlap within the unfixed-chunk window engages slicing; overlap exceeding it falls back to cumulative; enabled=False never slices.
  • StreamingASRState.update() reconciliation (2 cases): mid-word extension emits the whole corrected word (regression guard for the removed startswith fast path); clean append emits only the new word.
  • Dedupe rules (4 cases): full-candidate match returns empty; em-dash + case normalization; CJK char-level fallback; long-history suffix-only invariant (the tail-only perf optimization depends on this).
  • PCM / slice helpers (2 cases): _pcm_to_float_samples bit-equal to the legacy sf.write → sf.read round trip; _slice_pcm_from out-of-bounds start raises (validation contract).

Scope of CI coverage (so reviewers don't over-read it): the dedupe × update() interaction and the slicing-enable config guard are CPU-covered above. What remains GPU/manual-only is the runtime slicing engagement inside RealtimeConnection (the 8-chunk gate firing mid-stream, the first-gated-call full-buffer edge, chunk-boundary flush); those are exercised by the manual suite. (Trivial happy-path / empty-input helper asserts that just restated Python primitives were dropped in favor of the integration cases.)

Existing coverage from #22848 (the 14-case manual integration suite and the 71-case v2 unit suite) is expected to continue passing.

Speed Tests and Profiling

All measurements on a single H100 with Qwen3-ASR-0.6B unless noted (the 1.7B subsection uses Qwen3-ASR-1.7B). WS realtime is driven through the official OpenAI Python SDK (openai==2.6.1, AsyncOpenAI().realtime.connect(...)) so the numbers reflect what a stock SDK client sees — the realtime-API compatibility shipped in #22848 is why WS exists and is exercised that way. HTTP and HTTP SSE are plain REST endpoints driven by requests/curl; SDK use is unrelated to those paths. Audio is pushed faster than realtime unless explicitly noted as realtime-paced. The cumulative (pre-slicing) path is reconstructed by swapping the two changed files to upstream a95b4e2e0's _run_inference; both modes ran on the same server process within minutes of each other.

End-to-end wall, real long-form English (via SDK WS)

Robert Gupta TED talk prefixes (15 s – 340 s) plus Daniel Kahneman for 600 s / 900 s, from the distil-whisper/tedlium-long-form HuggingFace dataset (a long-form subset of the TED-LIUM 3 corpus). Σnew is total prefill new-tokens across the session; last_new is the final inference's new-token count (a proxy for worst-case per-call cost).

| audio | cumul. wall | sliced wall | wall save | cumul. Σnew | sliced Σnew | cumul. last_new | sliced last_new |
|---|---|---|---|---|---|---|---|
| 15 s | 1.50 s | 1.12 s | 25% | 1,095 | 1,095 | 230 | 230 |
| 30 s | 1.51 s | 1.29 s | 14% | 2,846 | 838 | 469 | 58 |
| 60 s | 3.00 s | 2.52 s | 16% | 11,001 | 885 | 965 | 58 |
| 120 s | 6.17 s | 5.11 s | 17% | 44,627 | 1,770 | 1,958 | 58 |
| 240 s | 14.78 s | 10.07 s | 32% | 175,982 | 3,540 | 3,888 | 58 |
| 300 s | 19.49 s | 12.01 s | 38% | 144,997 | 1,860 | 4,831 | 58 |
| 340 s | 23.57 s | 14.31 s | 39% | 126,561 | 1,310 | 5,463 | 58 |
| 600 s ¹ | 77.24 s | 26.68 s | 65% | 1,444,362 | 17,378 | 1,388 | 58 |
| 900 s ¹ | 171.37 s | 38.23 s | 78% | 3,241,918 | 9,000 | 6,150 | 58 |

At lengths where both modes complete, slicing is 14 – 78 % faster end-to-end. Per-chunk new-tokens stays at 58 from the ninth chunk onward (the first eight stay on the cumulative path under the gate); cumulative's per-chunk new-tokens grows to 6,150 at 900 s. Read Σnew for the quadratic trend, not last_new: last_new reflects only the final chunk plus RadixCache state, and cumulative Σnew is non-monotonic across the nested-prefix fixtures because RadixCache reuses prefixes across sequential runs.

Per-chunk model-call distribution

Same fixtures, identical timing instrumentation patched into both code paths (time.perf_counter() brackets around tokenizer_manager.generate_request, parsed from server logs per chunk). generate_us is the encoder + LLM prefill + LLM decode + IPC black box, ~96 – 99 % of total wall in both modes. (Measured with a raw-websockets driver; per-call cost is server-side and independent of client transport. The probe instrumentation was removed from the tree post-bench; the aggregate sum_t_infer numbers in the cross-model subsection are directionally consistent.)

| audio | mode | n_chunks | mean | median | stdev | min | max |
|---|---|---|---|---|---|---|---|
| 30 s | cumul. | 15 | 97 ms | 99 ms | 19 ms | 74 ms | 127 ms |
| 30 s | sliced | 15 | 79 ms | 78 ms | 8 ms | 66 ms | 94 ms |
| 60 s | cumul. | 30 | 91 ms | 93 ms | 26 ms | 55 ms | 131 ms |
| 60 s | sliced | 30 | 80 ms | 80 ms | 13 ms | 58 ms | 109 ms |
| 120 s | cumul. | 60 | 99 ms | 105 ms | 28 ms | 56 ms | 143 ms |
| 120 s | sliced | 60 | 82 ms | 80 ms | 11 ms | 59 ms | 109 ms |
| 300 s | cumul. | 150 | 137 ms | 143 ms | 54 ms | 56 ms | 399 ms |
| 300 s | sliced | 150 | 80 ms | 80 ms | 12 ms | 57 ms | 121 ms |

Slicing's mean and tail are essentially flat across audio length (79 – 82 ms mean, ≤ 121 ms worst chunk) because slicing's audio input is bounded at ~4 s and its prompt is the bare template. Cumulative's mean rises from 91 – 99 ms at 30 – 120 s to 137 ms at 300 s, and its worst single inference at 300 s is 399 ms vs slicing's 121 ms (3.3× wider on the tail). At 600 s+ the cumulative path stays continuously busy long enough that the SDK keepalive budget is exhausted; the slicing path keeps every inference bounded so this never happens.

The physical mechanism is per-call audio input size. On long realtime-paced sessions on 0.6B, mean audio fed per inference is 6.56 s (slicing) vs 35.47 s (cumulative); the max diverges further (slicing 18 s = the 16 s engagement threshold + 2 s overlap, vs cumulative 240 s on a 240 s session — at the tail cumulative feeds ~30× more audio). The per-chunk wall numbers are the symptom; bounded vs unbounded input size is the cause.

PCM-to-model-input prep (_pcm_to_wav cumulative vs _pcm_to_float_samples slicing) takes < 1 ms / chunk in both modes: the PCM/WAV skip is a real but small win (~1 % wall on long fixtures), worth shipping for cleanliness more than performance.

Cross-model wall-time: 0.6B and 1.7B

Probe-measured per-chunk inference totals (sum_t_infer, summed across all inference calls in a session set) under realtime pacing on long-form English. The 1.7B matrix is 53 sessions / 536 inference calls; the 0.6B matrix is the corresponding sweep on the smaller model.

| model | cumulative Σ infer | slicing Σ infer | reduction |
|---|---|---|---|
| Qwen3-ASR-0.6B | 461.0 s | 276.2 s | -40.1% |
| Qwen3-ASR-1.7B | 181.8 s | 105.4 s | -42.0% |

The 1.7B reduction (-42 %) is within 2 points of the 0.6B reduction (-40 %), so the wall-time win generalizes across a 3×-larger model — broader evidence than the 0.6B-only TED sweep above. Note the relationship to the push-fast numbers earlier: under realtime pacing the end-to-end wall floor is just the audio duration (RTF ≈ 1.01 – 1.02 on both branches at 180 / 240 s), so the two paths look identical at the SDK wall level; the savings re-emerge as per-chunk inference time, which is exactly the sum_t_infer reduction above. Mean audio fed per inference on 1.7B is 7.94 s slicing vs 23.97 s cumulative, mirroring the 0.6B input-size gap. Push the same audio faster than realtime and the per-chunk inference saving turns directly into SDK wall saving — the 14 – 78 % push-fast numbers and the 40 – 42 % realtime-paced inference numbers are two views of the same compute reduction.

Multilingual sanity — byte-equal transcripts

Short fixtures across ZH / HI / ES / EN at 4 – 13 s, both modes via SDK WS. All five stay under the eight-chunk slicing gate, so the slicing path runs cumulative end-to-end and emits the same delta count and final transcript as the cumulative reference:

| fixture | lang | audio | cumul. wall | sliced wall | cumul. n_deltas | sliced n_deltas | transcripts |
|---|---|---|---|---|---|---|---|
| zh_4s | ZH | 4.2 s | 0.59 s | 0.58 s | 1 | 1 | byte-equal |
| hi_4s | HI | 4.1 s | 0.37 s | 0.37 s | 6 | 6 | byte-equal |
| es_7s | ES | 6.6 s | 0.43 s | 0.35 s | 14 | 14 | byte-equal |
| libri_10s | EN | 10.4 s | 0.53 s | 0.48 s | 27 | 27 | byte-equal |
| mlk_13s | EN | 13.0 s | 0.54 s | 0.48 s | 21 | 21 | byte-equal |

Verified by direct string comparison of the final transcript field on the WS conversation.item.input_audio_transcription.completed event.

HTTP SSE vs WebSocket realtime (same streaming UX)

The HTTP SSE chunked-transcription endpoint and the WS realtime endpoint share process_asr_chunk but differ in transport and per-chunk audio shape. HTTP SSE accumulates the full audio and re-encodes WAV bytes per chunk (the cumulative path), so it serves as a same-streaming-UX reference for what the realtime path would look like without this fix. Both driven via the OpenAI SDK on the PR branch.

| audio | HTTP SSE wall | WS realtime wall | WS vs SSE |
|---|---|---|---|
| 30 s | 1.43 s | 1.33 s | −7 % |
| 60 s | 2.83 s | 2.54 s | −10 % |
| 120 s | 6.30 s | 5.17 s | −18 % |
| 300 s | 22.82 s | 12.52 s | −45 % |

Time-to-first-content is comparable at short audio (HTTP SSE 0.10 – 0.45 s vs WS 0.15 – 0.60 s for 30 – 120 s audio); WS overtakes at 300 s (1.54 s vs 2.34 s) because the cumulative path's first chunk pays more prefill once the buffer is large.

Known limits

Regimes where slicing provides no benefit (or a known trade-off), so reviewers and operators can set expectations.

  1. Short audio (< 8 chunks, ≈ 16 s). Slicing never engages under the gate, so per-chunk compute, wall, and final transcripts are byte-equal to cumulative. The multilingual sanity table is the direct evidence.
  2. Genuinely repetitive long-form content (dedupe over-match). On heavily repeated audio the dedupe heuristic over-matches genuine repeats and slicing emits ~11 – 12 % under gold (EN180 / EN240 above). This is far better than cumulative's ~35× over-emission, but it is a correctness trade-off, not a perfect transcript — the same boundary-aligned repetition limitation noted in [Feature] WebSocket streaming audio input for ASR #22848. Forced-alignment timestamps (M3) would be the principled fix.
  3. Short-audio high concurrency (c = 64, ~15 s audio, 0.6B). Both branches drop exactly 32 of 64 sessions and reach the same throughput (1.91 sess/s); TTFD p50/p95 within ~15 ms, e2e and RTF identical. The bottleneck here is not per-chunk inference compute (likely connection limits or thread pool), so bounded per-chunk audio does not push the saturation point higher. Up to c = 32 on both 0.6B and 1.7B the two paths are latency-equivalent (TTFD / e2e within ±50 ms) at 100 % success — no regression, no measurable win.
  4. Pure conversational use. Sessions that stay short by design (turn-based, < 16 s per turn) ride the cumulative path end-to-end and see no compute change. This PR is specifically a long-form fix.
  5. Session PCM buffer compaction. This PR bounds per-call model input and encoder / prefill work, but does not compact pcm_buffer into a rolling buffer. Session memory remains controlled by --asr-max-buffer-seconds. Rolling-buffer compaction is a follow-up if multi-session memory pressure becomes relevant.

How to send requests

Server launch:

```bash
python -m sglang.launch_server \
  --model-path Qwen/Qwen3-ASR-0.6B \
  --served-model-name qwen3-asr \
  --trust-remote-code \
  --host 127.0.0.1 --port 30000
```

HTTP non-streaming — standard /v1/audio/transcriptions REST endpoint, any HTTP client works:

```bash
curl -s -X POST http://127.0.0.1:30000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=qwen3-asr \
  -F language=en
```

HTTP SSE chunked — same endpoint with stream=true. sglang emits the legacy chat-style transcription.chunk shape with content under choices[0].delta.content:

```bash
curl -N -X POST http://127.0.0.1:30000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=qwen3-asr \
  -F stream=true
```

WebSocket realtime — driven through the OpenAI Python SDK to exercise the realtime-API compatibility shipped in #22848. The SDK's typed namespaced helpers on conn.session / conn.input_audio_buffer are not defined for the transcription session shape, so events flow through conn.send({...}) raw dicts:

```python
import asyncio
import base64

import numpy as np
import soundfile as sf
from openai import AsyncOpenAI


async def transcribe(path: str, language: str = "en") -> str:
    # Load as float32, downmix to mono, and resample to 16 kHz if the
    # source rate is not one of the accepted rates.
    data, sr = sf.read(path, dtype="float32")
    if data.ndim > 1:
        data = data.mean(axis=1)
    if sr not in (16000, 24000, 48000):
        n = int(len(data) / sr * 16000)
        data = np.interp(np.linspace(0, len(data) - 1, n), np.arange(len(data)), data)
        sr = 16000
    pcm = (data * 32767).astype(np.int16).tobytes()

    client = AsyncOpenAI(
        base_url="http://127.0.0.1:30000/v1",
        websocket_base_url="ws://127.0.0.1:30000/v1",
        api_key="x",
    )
    async with client.realtime.connect(model="qwen3-asr") as conn:
        # Configure a transcription session with raw dict events.
        await conn.send({
            "type": "session.update",
            "session": {
                "type": "transcription",
                "audio": {"input": {
                    "format": {"type": "audio/pcm", "rate": sr},
                    "transcription": {"model": "qwen3-asr", "language": language},
                    "noise_reduction": None, "turn_detection": None,
                }},
            },
        })
        # Wait for the server to acknowledge the session config.
        async for evt in conn:
            if getattr(evt, "type", None) == "session.updated":
                break
        chunk_bytes = sr  # 0.5 s of int16 PCM @ sr Hz
        for off in range(0, len(pcm), chunk_bytes):
            await conn.send({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(pcm[off:off + chunk_bytes]).decode(),
            })
        await conn.send({"type": "input_audio_buffer.commit"})
        # The completed event carries the final transcript.
        async for evt in conn:
            if getattr(evt, "type", None) == "conversation.item.input_audio_transcription.completed":
                return getattr(evt, "transcript", "")


if __name__ == "__main__":
    print(asyncio.run(transcribe("audio.wav")))
```

Reproducing the numbers

Wall time comes from the client (time.perf_counter() around the SDK call). #new-token and #cached-token per inference are parsed from the sglang scheduler log, which emits one line per Prefill batch:

```text
[YYYY-MM-DD HH:MM:SS] Prefill batch, #new-seq: 1, #new-token: 58, #cached-token: 1024, ...
```

WS sessions are bounded by lines containing WebSocket /v1/realtime…[accepted] (start; the URL carries a ?model= query string when accessed via the SDK) and connection closed (end); per-fixture aggregation is a regex over those boundaries.
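
A minimal sketch of the per-session aggregation, assuming the log lines have already been cut down to one WS session and that they follow the Prefill batch format shown above. `summarize_prefill` and the field names in its result are hypothetical; the actual benchmark scripts are not part of this description.

```python
import re

# Matches the "#new-token" and "#cached-token" counters on a Prefill batch line.
PREFILL_RE = re.compile(r"Prefill batch.*?#new-token:\s*(\d+).*?#cached-token:\s*(\d+)")


def summarize_prefill(lines):
    """Aggregate per-inference token counts for one session's log lines."""
    new_tokens, cached_tokens = [], []
    for line in lines:
        match = PREFILL_RE.search(line)
        if match:
            new_tokens.append(int(match.group(1)))
            cached_tokens.append(int(match.group(2)))
    return {
        "n_prefill": len(new_tokens),
        "sum_new": sum(new_tokens),                        # the table's Σnew
        "last_new": new_tokens[-1] if new_tokens else 0,   # the table's last_new
        "sum_cached": sum(cached_tokens),
    }


sample_log = [
    "[2026-05-30 10:00:00] Prefill batch, #new-seq: 1, #new-token: 230, #cached-token: 0, ...",
    "[2026-05-30 10:00:02] Decode batch, #running-req: 1, ...",
    "[2026-05-30 10:00:04] Prefill batch, #new-seq: 1, #new-token: 58, #cached-token: 1024, ...",
]
print(summarize_prefill(sample_log))
```

The sample lines are made up for illustration; only the Prefill batch field layout is taken from the log format above.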

Long fixtures (≥ 600 s) on the cumulative path close with 1011 under the OpenAI SDK because the SDK's underlying websockets client uses ping_interval=20 by default and we found no public knob to override it without subclassing the SDK transport. The slicing path is unaffected because each per-chunk inference completes well under one second.

Checklist

  • Format your code according to the Format code with pre-commit. (pre-commit hooks pass: black-jupyter, isort, ruff, codespell, ast, EOL, whitespace.)
  • Add unit tests according to the Run and add unit tests. (14-case unit suite at test/registered/unit/entrypoints/openai/test_streaming_asr.py registered for CI (base-a-test-cpu, est_time=3), pure unittest.CustomTestCase, no GPU, no server: 3 mock-driven process_asr_chunk integration cases (cumulative prefix-injection / sliced bare-prompt + dedupe / is_last dedupe→finalize), 3 slicing-enable guard cases (RealtimeConnection.__init__: within-window enables, over-window falls back, opt-out disables), 2 StreamingASRState.update() reconciliation cases (mid-word extension, clean append), 4 dedupe-rule cases (full-overlap, em-dash+case, CJK fallback, long-history suffix-only), 2 PCM/slice helper cases (_pcm_to_float_samples bit-equal-to-soundfile, _slice_pcm_from out-of-bounds raises). The runtime byte-threshold slicing engagement inside RealtimeConnection is exercised by the manual GPU suite. The existing 14-case manual integration suite and 71-case v2 unit suite from [Feature] WebSocket streaming audio input for ASR #22848 continue to cover the WS / HTTP / SSE wire paths.)
  • Update documentation according to Write documentations. (In-tree docstrings on _AudioState, the adapter's realtime_slicing_config['left_overlap_ms'] field, _run_inference, process_asr_chunk, dedupe_overlap, and _dedupe_norm cover the slicing heuristic, the bare-prompt choice, the overlap sizing rationale, the Qwen3-ASR-tuned gate, and the dedupe contract.)
  • Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed. (See Accuracy Tests and Speed Tests and Profiling above.)
  • Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process. Get approvals from CODEOWNERS and other reviewers. Trigger CI tests with comments or contact authorized users to do so — common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci. After green CI and required approvals, ask Merge Oncalls or someone with Write permission to merge.

Related

  • Builds on PR [Feature] WebSocket streaming audio input for ASR #22848 (initial WebSocket realtime ASR, merged).
  • Implements the M2 long-form cost-bounding milestone of RFC [RFC]: Real-Time Streaming Audio Input for ASR Models #22474 via input slicing. See the "Relationship to RFC M2" section for the mechanism difference vs the initial RadixCache prefix-caching sketch. Exact RadixCache prefix reuse and token-level streaming / alignment (M3) remain valid future work.
  • The PCM/WAV round-trip skip is a small independent cleanup bundled here because it touches the same _run_inference function.

Footnotes

  1. The 600/900 s cumulative rows were measured with a raw websockets client (ping_interval=None); the OpenAI SDK's default 20 s keepalive trips on the cumulative path at those lengths (1011 keepalive ping timeout). Sliced numbers match SDK-driven within ~1 %.

PCM16 from the WebSocket path was being encoded into WAV bytes only for
load_audio to decode it back into a float ndarray. Convert directly to
float samples (1/32768 normalization matches soundfile.read default for
signed 16-bit, so the float values are bit-equal to the old path), and
teach load_audio to accept a pre-decoded ndarray as a no-op passthrough.

ASR/cache semantics unchanged — this only removes the WAV adapter layer.
A future optimization could maintain decoded samples incrementally to
avoid re-converting the cumulative PCM buffer on every chunk.
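
A minimal sketch of the direct conversion described above, assuming little-endian mono PCM16 input and a float32 result. The function name and the output dtype are assumptions for illustration; the in-tree helper is `_pcm_to_float_samples`, whose exact signature is not shown here.

```python
import numpy as np


def pcm16_to_float_samples(pcm: bytes) -> np.ndarray:
    """Convert raw little-endian PCM16 bytes to floats in [-1.0, 1.0)."""
    if len(pcm) % 2 != 0:
        # Each sample is two bytes, so an odd length means a torn sample.
        raise ValueError("PCM16 buffer must have an even byte length")
    samples = np.frombuffer(pcm, dtype="<i2")
    # Dividing by 32768 maps the int16 range onto [-1.0, 1.0); every
    # int16 value divided by a power of two is exactly representable.
    return samples.astype(np.float32) / np.float32(32768.0)
```

Because the divisor is a power of two, no rounding occurs, which is what makes a bit-equality check against a decoder using the same 1/32768 scaling feasible.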
@SammLSH SammLSH force-pushed the feat/realtime-asr-input-slicing branch from cbde4df to 4ca2e43 Compare May 30, 2026 06:59
…forward

Replace the WS /v1/realtime cumulative inference path (re-send the whole
PCM buffer on every chunk) with input slicing once a committed-text
prefix exists. Once StreamingASRState has stable emitted text, is past
the K-token holdback gate, and has accumulated at least eight chunks
(~16 s) of cumulative context, the model runs on
``pcm_buffer[committed_audio_until_bytes - left_overlap_bytes:]``, i.e.
the uncommitted audio plus a 2 s left overlap, instead of the full
buffer. The prompt stays at
``adapter.prompt_template`` — emitted_text is not injected as a
continuation prefix; the retained acoustic overlap plus a word-level
dedupe (with CJK char-level fallback) takes its place.

The first gated call still starts at offset 0 because
committed_audio_until_bytes is initialized to 0; only chunk 9 onward
is bounded to overlap + new chunk.
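
A minimal sketch of the slice-start arithmetic, assuming 16 kHz mono PCM16 (32 bytes per millisecond). The sample rate, the clamp at offset 0, the sample alignment, and both function names are assumptions for illustration, not the in-tree ``_slice_pcm_from``.

```python
SAMPLE_RATE_HZ = 16_000  # assumption: 16 kHz mono input
BYTES_PER_SAMPLE = 2     # PCM16


def slice_start_bytes(committed_audio_until_bytes: int, left_overlap_ms: int) -> int:
    """Byte offset where the sliced model input begins."""
    overlap_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * left_overlap_ms // 1000
    # Clamp at 0 so that a committed offset of 0 yields the whole buffer.
    start = max(0, committed_audio_until_bytes - overlap_bytes)
    # Keep the offset on a sample boundary.
    return start - (start % BYTES_PER_SAMPLE)


def slice_pcm_from(pcm_buffer: bytearray, start: int) -> bytes:
    """Copy only the tail of the buffer; the memoryview slice itself is zero-copy."""
    if start < 0 or start > len(pcm_buffer):
        raise ValueError(f"start {start} outside buffer of {len(pcm_buffer)} bytes")
    return bytes(memoryview(pcm_buffer)[start:])
```

Under these assumptions a 2 s overlap is 64,000 bytes, so a steady-state call on overlap plus one 2 s chunk copies 128,000 bytes regardless of session length.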

Performance (TED-LIUM long-form sweep on Qwen3-ASR-0.6B, H100):

  audio  cumul wall  sliced wall  save
   30 s     1.51 s      1.29 s   14 %
   60 s     3.00 s      2.52 s   16 %
  120 s     6.17 s      5.11 s   17 %
  240 s    14.78 s     10.07 s   32 %
  300 s    19.49 s     12.01 s   38 %
  600 s    77.24 s     26.68 s   65 %
  900 s   171.37 s     38.23 s   78 %

Per-chunk model-call wall stays flat at ~80 ms mean / ~121 ms max
across the sweep instead of growing to 137 ms mean / 399 ms max in
the cumulative path at 300 s. Realtime-paced sum of per-chunk
inference wall drops 40-42 % on both 0.6B and 1.7B Qwen3-ASR.
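
To make the flat-versus-growing behavior concrete, here is a rough cost model that counts only the seconds of audio fed to the model per session. It is a simplification under stated assumptions: it uses the defaults quoted in this PR, treats every chunk past the gate as overlap plus one chunk (ignoring that the first gated call still starts at offset 0), and the function name is hypothetical.

```python
import math


def audio_seconds_processed(
    session_sec: float,
    chunk_sec: float = 2.0,
    min_audio_sec: float = 16.0,
    overlap_sec: float = 2.0,
    sliced: bool = True,
) -> float:
    """Total seconds of audio passed to the model over one session."""
    n_chunks = math.ceil(session_sec / chunk_sec)
    gate = math.ceil(min_audio_sec / chunk_sec)  # slicing_min_chunk_index
    total = 0.0
    for i in range(1, n_chunks + 1):
        if sliced and i > gate:
            total += overlap_sec + chunk_sec  # bounded call
        else:
            total += i * chunk_sec            # cumulative call
    return total
```

For a 300 s session this gives 22,650 s of audio on the cumulative path versus 640 s when sliced. The measured wall-clock savings above are much smaller than that ratio, so audio length is clearly not the only cost; treat the model as an intuition aid only.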

Implementation:
- ``adapter.realtime_slicing_config`` returns left_overlap_ms (default
  2000) and min_audio_sec (default 16.0); slicing_min_chunk_index is
  derived as ceil(min_audio_sec / chunk_size_sec).
- ``_slice_pcm_from`` snapshots the bytearray via memoryview so the
  per-chunk copy is slice-sized instead of full-buffer + slice
  (~7.7 MB -> ~128 KB at 240 s when slicing engaged).
- ``dedupe_overlap`` normalizes only the tail of committed_text bounded
  by len(candidate_words), so dedupe cost does not grow with session
  length.
- ``process_asr_chunk`` gains ``prompt: Optional[str]`` and
  ``dedupe_against: Optional[str]`` kwargs; the realtime path uses them,
  the HTTP / HTTP SSE path keeps existing behavior via defaults.
- ``load_audio`` annotation widened from ``str`` to
  ``Union[str, bytes, np.ndarray]`` to match the existing isinstance
  branches; not exposed through any Pydantic schema path.
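
A minimal sketch of the word-level overlap dedupe with a tail bounded by the candidate length. The normalization rule, the longest-match search, and the function names are assumptions for illustration; the CJK char-level fallback is omitted, and this is not the in-tree ``dedupe_overlap``.

```python
import re


def _norm(word: str) -> str:
    # Assumed normalization: lowercase and drop non-word characters.
    return re.sub(r"[^\w]", "", word.lower())


def dedupe_overlap_sketch(committed_text: str, candidate: str) -> str:
    """Drop the leading words of candidate that repeat the end of committed_text."""
    cand_words = candidate.split()
    if not cand_words:
        return candidate
    n = len(cand_words)
    # Only the last n words of history can overlap an n-word candidate,
    # so split from the right and keep at most n words.
    tail = committed_text.rsplit(None, n)[-n:]
    for k in range(min(len(tail), n), 0, -1):
        if [_norm(w) for w in tail[-k:]] == [_norm(w) for w in cand_words[:k]]:
            return " ".join(cand_words[k:])
    return candidate  # no overlap found
```

Searching from the longest possible overlap downward prefers removing the full repeated span over a shorter coincidental match.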

Tests: 21-case CI unit suite at
test/registered/unit/entrypoints/openai/test_streaming_asr.py covering
dedupe_overlap (word + CJK + suffix-only-history invariant),
_pcm_to_float_samples (normalization + soundfile-round-trip
bit-equality + odd-length raises), and _slice_pcm_from validation.
@SammLSH SammLSH force-pushed the feat/realtime-asr-input-slicing branch from 4ca2e43 to af872ce Compare May 30, 2026 07:11
@SammLSH SammLSH closed this May 30, 2026
@SammLSH SammLSH deleted the feat/realtime-asr-input-slicing branch May 30, 2026 15:18