Feat/realtime asr input slicing #26767
Closed
SammLSH wants to merge 2 commits into
PCM16 from the WebSocket path was being encoded into WAV bytes only for load_audio to decode it back into a float ndarray. Convert directly to float samples (1/32768 normalization matches soundfile.read default for signed 16-bit, so the float values are bit-equal to the old path), and teach load_audio to accept a pre-decoded ndarray as a no-op passthrough. ASR/cache semantics unchanged — this only removes the WAV adapter layer. A future optimization could maintain decoded samples incrementally to avoid re-converting the cumulative PCM buffer on every chunk.
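The direct conversion described above can be sketched as follows. This is a minimal illustration, not the PR's code: `pcm16_to_float` and `load_audio_sketch` are hypothetical names, little-endian input and a float64 result are assumptions, and only the ndarray branch of the loader is shown.

```python
import numpy as np


def pcm16_to_float(pcm: bytes) -> np.ndarray:
    """Convert little-endian signed 16-bit PCM bytes to floats in [-1.0, 1.0)."""
    if len(pcm) % 2 != 0:
        # A trailing half-sample cannot be decoded as int16.
        raise ValueError("PCM16 buffer length must be a multiple of 2 bytes")
    samples = np.frombuffer(pcm, dtype="<i2")
    # Dividing by 32768 maps -32768 to -1.0 and 32767 to just under 1.0.
    # float64 is an assumption made for this sketch.
    return samples.astype(np.float64) / 32768.0


def load_audio_sketch(audio):
    """Hypothetical stand-in for a loader with an ndarray passthrough."""
    if isinstance(audio, np.ndarray):
        # Already decoded: return the same object, no copy and no re-decoding.
        return audio
    raise NotImplementedError("path / bytes decoding is out of scope for this sketch")
```

Because the passthrough returns the same array object, the realtime path in this sketch never touches a WAV encoder or decoder.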
…forward

Replace the WS /v1/realtime cumulative inference path (re-send the whole PCM buffer on every chunk) with input slicing once a committed-text prefix exists.

Once StreamingASRState has stable emitted text, is past the K-token holdback gate, and has accumulated at least eight chunks (~16 s) of cumulative context, the model runs on ``pcm_buffer[committed_audio_until_bytes - left_overlap_bytes:]`` plus a 2 s left overlap instead of the full buffer. The prompt stays at ``adapter.prompt_template``; emitted_text is not injected as a continuation prefix. The retained acoustic overlap plus a word-level dedupe (with CJK char-level fallback) takes its place. The first gated call still starts at offset 0 because committed_audio_until_bytes is initialized to 0; only chunk 9 onward is bounded to overlap + new chunk.

Performance (TED-LIUM long-form sweep on Qwen3-ASR-0.6B, H100):

    audio    cumul wall    sliced wall    save
    30 s     1.51 s        1.29 s         14 %
    60 s     3.00 s        2.52 s         16 %
    120 s    6.17 s        5.11 s         17 %
    240 s    14.78 s       10.07 s        32 %
    300 s    19.49 s       12.01 s        38 %
    600 s    77.24 s       26.68 s        65 %
    900 s    171.37 s      38.23 s        78 %

Per-chunk model-call wall stays flat at ~80 ms mean / ~121 ms max across the sweep instead of growing to 137 ms mean / 399 ms max in the cumulative path at 300 s. Realtime-paced sum of per-chunk inference wall drops 40-42 % on both 0.6B and 1.7B Qwen3-ASR.

Implementation:
- ``adapter.realtime_slicing_config`` returns left_overlap_ms (default 2000) and min_audio_sec (default 16.0); slicing_min_chunk_index is derived as ceil(min_audio_sec / chunk_size_sec).
- ``_slice_pcm_from`` snapshots the bytearray via memoryview so the per-chunk copy is slice-sized instead of full-buffer + slice (~7.7 MB -> ~128 KB at 240 s when slicing engaged).
- ``dedupe_overlap`` normalizes only the tail of committed_text bounded by len(candidate_words), so dedupe cost does not grow with session length.
- ``process_asr_chunk`` gains ``prompt: Optional[str]`` and ``dedupe_against: Optional[str]`` kwargs; the realtime path uses them, the HTTP / HTTP SSE path keeps existing behavior via defaults.
- ``load_audio`` annotation widened from ``str`` to ``Union[str, bytes, np.ndarray]`` to match the existing isinstance branches; not exposed through any Pydantic schema path.

Tests: 21-case CI unit suite at test/registered/unit/entrypoints/openai/test_streaming_asr.py covering dedupe_overlap (word + CJK + suffix-only-history invariant), _pcm_to_float_samples (normalization + soundfile-round-trip bit-equality + odd-length raises), and _slice_pcm_from validation.
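The gate and slice arithmetic in this commit message can be sketched as below. It is an illustration under stated assumptions, not the PR's implementation: every function name is hypothetical, 16 kHz mono PCM16 is assumed (which is consistent with the quoted ~7.7 MB at 240 s), and clamping the slice start at 0 is an assumed way of reproducing the "first gated call starts at offset 0" behavior.

```python
import math

SAMPLE_RATE_HZ = 16_000  # assumption: 16 kHz mono input
BYTES_PER_SAMPLE = 2     # PCM16


def min_chunk_index(min_audio_sec: float, chunk_size_sec: float) -> int:
    """ceil(min_audio_sec / chunk_size_sec), e.g. ceil(16 / 2) = 8."""
    return math.ceil(min_audio_sec / chunk_size_sec)


def ms_to_bytes(ms: int) -> int:
    """Byte length of `ms` milliseconds of PCM16 audio (always an even number)."""
    return SAMPLE_RATE_HZ * ms // 1000 * BYTES_PER_SAMPLE


def slice_pcm_from(buf: bytearray, start: int) -> bytes:
    """Copy buf[start:] only; the memoryview avoids an extra full-buffer copy."""
    if start < 0 or start > len(buf) or start % BYTES_PER_SAMPLE:
        raise ValueError(f"invalid slice start: {start}")
    return bytes(memoryview(buf)[start:])


def select_audio(buf, committed_until_bytes, *, chunk_index, has_prefix,
                 left_overlap_ms=2000, min_audio_sec=16.0, chunk_size_sec=2.0):
    """Return the PCM bytes to feed the model for this chunk."""
    gate = min_chunk_index(min_audio_sec, chunk_size_sec)
    if not has_prefix or chunk_index < gate:
        # Below the gate, or no stable text yet: cumulative path, full buffer.
        return bytes(buf)
    # Assumed clamp: with committed_until_bytes == 0 the start is 0, so the
    # first gated call still sees the whole buffer.
    start = max(0, committed_until_bytes - ms_to_bytes(left_overlap_ms))
    return slice_pcm_from(buf, start)
```

With a 240 s buffer (7,680,000 bytes) and audio committed up to 238 s, the slice covers 236 s to 240 s, which is 128,000 bytes: the 2 s overlap plus the new 2 s chunk.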
Motivation
This PR implements the M2 milestone of RFC #22474 (bounding long-form cost) via input slicing, rather than the RadixCache prefix-caching the RFC sketched (see "Relationship to RFC M2").
The WS `/v1/realtime` path from #22848 re-sends the entire accumulated PCM buffer to the model on every chunk, so per-chunk work grows linearly with audio length and total prefill is quadratic in session length. On long-form audio this causes:

- […] `1011 keepalive ping timeout`); bounded inference never threatens it.
- `audio_tower` activations grow with length and risk OOM; slicing caps per-call audio regardless of session length.

Slicing keeps per-call audio at one chunk plus a short left overlap, independent of session length, so per-chunk work stays flat.
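The linear-per-chunk, quadratic-in-total claim can be checked with a small idealized model. This is only a sketch: `total_audio_fed` is a hypothetical helper, the 2 s chunk, 2 s overlap and 8-chunk gate are the defaults quoted in this description, and the model ignores the stability conditions and the first gated full-buffer call, so its output is not a measurement.

```python
import math


def total_audio_fed(session_sec, *, chunk_sec=2.0, overlap_sec=2.0,
                    gate_chunks=8, sliced=False):
    """Seconds of audio passed to the model over a whole session (idealized)."""
    n_calls = math.ceil(session_sec / chunk_sec)
    total = 0.0
    for i in range(1, n_calls + 1):
        if sliced and i > gate_chunks:
            # Bounded window: one new chunk plus the left overlap.
            total += chunk_sec + overlap_sec
        else:
            # Cumulative path: the whole buffer accumulated so far.
            total += i * chunk_sec
    return total


cumulative = total_audio_fed(900)            # 2 * (1 + 2 + ... + 450) seconds
bounded = total_audio_fed(900, sliced=True)  # 72 s below the gate + 442 * 4 s
print(cumulative, bounded)                   # prints 202950.0 1840.0
```

In this model, doubling the session length roughly quadruples the cumulative total but only about doubles the sliced total.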
A small cleanup ships alongside: the realtime path no longer round-trips PCM16 through a WAV byte stream just for `load_audio` to decode it back to a float ndarray.

Defaults referenced below: `chunk_size_sec = 2 s`; slicing engages after 8 chunks (`ceil(min_audio_sec / chunk_size_sec)`, ~16 s); steady-state per-call audio is one chunk + a 2 s left overlap (~4 s).

Realtime ASR Roadmap

    flowchart LR
        M1["M1: Functional Realtime ASR<br/>- WS /v1/realtime<br/>- session.update / append / commit / clear<br/>- partial and completed transcript events<br/>- shared ASR streaming driver<br/><br/><b>Usable realtime ASR</b>"]
        G1["Remaining after M1<br/>Full accumulated audio is still reprocessed<br/>Per-chunk cost grows with session length"]
        M2["M2: Long-Form Cost Bounding<br/>this PR<br/>- input slicing after stability gate<br/>- chunk + left-overlap audio window<br/>- output-side dedupe<br/>- direct PCM16 to float samples<br/>- bounded encoder + prefill work<br/><br/><b>Affordable long-form realtime ASR</b>"]
        G2["Remaining after M2<br/>Not exact prefix caching<br/>Not incremental encoder<br/>Dedupe remains heuristic<br/>PCM buffer compaction is future work"]
        M3["M3: Stateful Streaming ASR<br/>future<br/>- incremental audio encoder state<br/>- rolling audio buffer<br/>- token/timestamp-level alignment<br/>- stable-prefix commit<br/>- draft/final separation<br/>- session-aware scheduling<br/><br/><b>Stateful and robust realtime ASR</b>"]
        Target["Target: True Realtime Streaming ASR<br/>- low TTFT and stable per-chunk latency<br/>- bounded compute and memory<br/>- no full cumulative reprocessing<br/>- fewer heuristic correctness tradeoffs<br/>- extensible across ASR / omni models"]
        M1 --> G1 --> M2 --> G2 --> M3 --> Target

M1 makes realtime ASR usable, M2 makes long-form realtime ASR affordable, and M3 moves the system toward truly stateful realtime streaming.
Relationship to RFC M2
RFC #22474 sketches M2 as cross-chunk prefix caching via RadixCache: keep the full audio context, let RadixCache match the shared token prefix, and run LLM prefill only on the new tail tokens. This PR does not take that route. It addresses M2's goal — eliminating the quadratic long-form cost — with a different mechanism: input slicing.
Why slicing was chosen over the RadixCache route:
- […] `chunk_size_sec=2 s` and `min_audio_sec=16 s`, `ceil(16/2)=8`) confines that approximation to long-form audio (≥ ~16 s); short audio rides the original cumulative path end-to-end and stays byte-equal to HTTP SSE.

This PR implements the M2 long-form cost-bounding milestone via input slicing. It differs from the initial RadixCache prefix-caching sketch in the RFC, but targets the same M2 goal: removing the quadratic long-form reprocessing cost. Exact RadixCache prefix reuse and token-level streaming / alignment (M3) remain valid future work; slicing and prefix caching could even compose later (slice the audio and cache the bounded tail's prefix).
Modifications
Runtime changes primarily touch `realtime/session.py`, `streaming_asr.py`, `utils/common.py`, and the transcription adapter config: `transcription_adapters/base.py` ships a base `realtime_slicing_config` that keeps slicing off by default (new adapters must opt in with `enabled=True`), and `transcription_adapters/qwen3_asr.py` overrides it with the Qwen3-ASR-tuned values (`left_overlap_ms=2000`, `min_audio_sec=16.0`). This PR also adds a small CPU unit-test suite for the helper functions.

The primary change slices pending audio once committed text rolls forward; one small cleanup ships alongside it (the PCM → WAV → ndarray round-trip skip). Both touch `_run_inference`, so they sit in the same diff. The unit test verifies bit-equality of the new direct conversion against the legacy `sf.write` → `sf.read` fallback path; with the ndarray passthrough in `load_audio`, the realtime PCM path bypasses file / byte decoding entirely. The cleanup is mechanically independent of slicing; it is bundled here only to avoid landing a second PR that touches the same function immediately after this one.

HTTP and HTTP SSE transcription endpoints are unchanged on the wire, except that the `StreamingASRState.update()` reconciliation step (shared with the HTTP SSE chunked-streaming path) drops a buggy `str.startswith` fast path; see the Reconciliation fix note below for the scenario and behavior delta.

PCM/WAV cleanup. PCM16 bytes are converted directly to float samples via `np.frombuffer(int16) / 32768.0`, which matches `soundfile.read`'s default normalization, so the float values reaching the encoder are bit-equal to the legacy `sf.write` → `sf.read` fallback path covered by the unit test. `load_audio` gains an early-return passthrough when its argument is already an ndarray, which is what the new realtime path passes in.

Reconciliation fix. `StreamingASRState.update()` previously short-circuited the per-chunk delta computation with `confirmed_text.startswith(old_confirmed)` and a `[len(old):]` slice. Because `startswith` is character-level, when the model extended a confirmed trailing word (`"world"` → `"worldly"`) the string prefix still matched and it emitted the mid-word fragment `"ly"` rather than the corrected word `"worldly"`. The two-line char-level fast path is deleted so the word-level common-prefix scan below it (which already handled revisions, and which `finalize()` already uses) runs unconditionally, with no new branches. A differential check over 20k randomized transcript evolutions confirms the two paths diverge only on this mid-word case, where the word-level scan is correct. Two new pure-CPU unit tests in `test_streaming_asr.py` lock it: a mid-word-extension case (asserts `"worldly"`, not `"ly"`) and a clean-append case (guards that the common path is unchanged). The bug originated in #22089 (the original Qwen3-ASR chunked-streaming PR); this PR fixes it in place because the slicing path also funnels output through `update()`, so leaving it would let mid-word fragments leak through the slicing path too.

Slicing. Once the chunked-streaming `StreamingASRState` has stable emitted text, is past the K-token holdback gate, and has accumulated at least eight chunks (≈ 16 s) of cumulative context, the model runs on `pcm_buffer[committed_audio_until_bytes - left_overlap_bytes:]` (the uncommitted audio plus a 2 s left overlap) instead of re-sending the full cumulative buffer. The prompt stays at `adapter.prompt_template`; `emitted_text` is not injected as a continuation prefix. In our measurements the retained acoustic overlap plus output-side dedupe is a stronger signal than text-prefix injection and avoids re-priming the model on its own output. A word-level dedupe (with a CJK char-level fallback) drops the resulting duplicate transcription before it reaches `StreamingASRState`. Note: the first inference past the gate (chunk index = 8 with default config) still feeds the full accumulated buffer because `committed_audio_until_bytes = 0` at that point makes `slice_start = 0`. After that first gated transition call, steady-state per-call audio is one chunk plus the left overlap, ~4 s with the Qwen3-ASR defaults.

The slicing trigger gates on `state.get_prefix_text()` returning non-empty AND `state.chunk_index >= slicing_min_chunk_index` (derived as `math.ceil(adapter.realtime_slicing_config['min_audio_sec'] / chunk_size_sec)`, = 8 for Qwen3-ASR's `min_audio_sec=16 s` / `chunk_size_sec=2 s`), so pre-K-chunk, CJK single-word-fallback, and short audio overall stay on the cumulative behavior.

Left overlap (2 s). Sized to Qwen3-ASR's `unfixed_token_num = 5` rollback window (K = 5 tokens ≈ 2 s of English audio at the model's effective token rate). In our long-form English tests 3 s produced more duplicate-word leaks through the dedupe heuristic, so this PR uses 2 s as a conservative default. Smaller values risk dropping audio the previous inference left unconfirmed.

Min-chunk gate (8 chunks ≈ 16 s). Empirical. Above this threshold the model's per-chunk transcription on a sliced 4 s tail matches what it produces on the equivalent cumulative buffer closely enough that the word-level dedupe matches cleanly. Below it (Hindi 4 s, Spanish 7 s, MLK 13 s were the failure cases we observed), the bare-prompt slice produces a different word sequence than cumulative, the dedupe over-matches, and genuine new content is dropped. Below the gate slicing simply doesn't engage: the path falls back to cumulative end-to-end, so short audio is unaffected at the transcript level.
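The reconciliation fix described earlier in this section can be illustrated with a small sketch. Both functions are hypothetical reconstructions from the prose above rather than the code in `StreamingASRState.update()`: `legacy_fast_path_delta` mimics the removed character-level shortcut, and `word_delta` mimics a word-level common-prefix scan, assuming words are whitespace-separated.

```python
def word_delta(old_confirmed: str, new_confirmed: str) -> str:
    """Emit the words of new_confirmed that follow the shared word-level prefix."""
    old_words = old_confirmed.split()
    new_words = new_confirmed.split()
    common = 0
    for old_word, new_word in zip(old_words, new_words):
        if old_word != new_word:
            break
        common += 1
    # Everything after the last word that matches in both transcripts.
    return " ".join(new_words[common:])


def legacy_fast_path_delta(old_confirmed: str, new_confirmed: str) -> str:
    """Reconstruction of the removed shortcut: character-level prefix check."""
    if new_confirmed.startswith(old_confirmed):
        # Character-level slice: can cut through the middle of a word.
        return new_confirmed[len(old_confirmed):]
    return word_delta(old_confirmed, new_confirmed)


print(legacy_fast_path_delta("hello world", "hello worldly"))  # prints ly
print(word_delta("hello world", "hello worldly"))              # prints worldly
print(word_delta("hello world", "hello world again"))          # prints again
```

The first call shows the mid-word fragment the fast path produced; the word-level scan re-emits the whole corrected word and leaves the clean-append case unchanged.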
Accuracy Tests
Short audio: no regression vs #22848 + gate fix
No regression on the 7-fixture multilingual three-path consistency from #22848's review round 2 (HTTP non-stream / HTTP SSE / WS realtime byte-equal where they were before, same WER profile, same delta counts). The new 8-chunk threshold (with Qwen3-ASR's `chunk_size_sec=2 s` and `min_audio_sec=16 s`, `ceil(16/2)=8`) additionally fixed pre-gate WS-path word drops on three short fixtures: MLK 13 s (lost ~8 words), Spanish 6.6 s (lost `medio sumergidas`), Hindi 4.1 s (lost `में कितने`). All three are now byte-equal to HTTP SSE / HTTP non-stream. Direct cumulative-vs-slicing byte-equality across short fixtures is in the "Multilingual sanity" table below.

Long-form audio (slicing engaged past chunk 8)

On varied long-form English (TED talks: Robert Gupta and Daniel Kahneman from the `distil-whisper/tedlium-long-form` set), final transcripts on the two paths are broadly consistent: no truncation, no hallucination divergence. The cumulative path emits slightly more deltas than slicing (e.g. 767 vs 698 at 300 s) because cumulative inference produces more intermediate revisions. Per-chunk casing and punctuation can differ because the LLM re-tokenizes around chunk boundaries; the underlying word sequence agrees. TED 30 s sample:

"One day, Los Angeles Times columnist Steve Lopez was walking along the streets of downtown Los Angeles..."
"One day, Los Angeles Times columnist Steve Lopez was walking along the streets of downtown Los Angeles..."

`StreamingASRState.update()` is shared with the HTTP SSE path and was designed for cumulative transcripts; the slicing path feeds it deduped tail-only output. This works because the wire transcript is built from `item.emitted_deltas`, not from `state.full_transcript` (which under slicing only contains the last deduped tail). The behavior is documented in `_AudioState`'s docstring.

Worst case: repetitive long-form content
The TED-talk result above is the typical case; it does not generalize to repetitive long-form audio. On a tiled-repetitive English fixture (the same short English clip repeated to fill 180 s / 240 s), the cumulative path's WS delta stream balloons far beyond gold, while the slicing path stays bounded:
The table shows two points:
Unit coverage added in this PR
`test/registered/unit/entrypoints/openai/test_streaming_asr.py`: 14 cases, registered for CI (`base-a-test-cpu`, `est_time=3`), pure `unittest`, no GPU:

- `process_asr_chunk` integration (3 cases, mock `generate_request` + real `StreamingASRState` + dedupe, the same style as `test_serving_transcription.py`): cumulative path injects `prompt_template + get_prefix_text()` and runs no dedupe; slicing path uses the bare prompt and dedupe trims the overlapping leading word before `state` ingests it; the `is_last` path dedupes before `finalize()`. These cover the M2 prompt-override + dedupe × `update()` two-stage, the PR's main correctness surface.
- […] `RealtimeConnection.__init__`, no GPU): overlap within the unfixed-chunk window engages slicing; overlap exceeding it falls back to cumulative; `enabled=False` never slices.
- `StreamingASRState.update()` reconciliation (2 cases): mid-word extension emits the whole corrected word (regression guard for the removed `startswith` fast path); clean append emits only the new word.
- `_pcm_to_float_samples` bit-equal to the legacy `sf.write` → `sf.read` round trip; `_slice_pcm_from` out-of-bounds start raises (validation contract).

Scope of CI coverage (so reviewers don't over-read it): the dedupe × `update()` interaction and the slicing-enable config guard are CPU-covered above. What remains GPU/manual-only is the runtime slicing engagement inside `RealtimeConnection` (the 8-chunk gate firing mid-stream, the first-gated-call full-buffer edge, chunk-boundary flush); those are exercised by the manual suite. (Trivial happy-path / empty-input helper asserts that just restated Python primitives were dropped in favor of the integration cases.)

Existing coverage from #22848 expected to continue passing:

- `test/manual/models/test_qwen3_asr.py` (14 tests: HTTP / SSE / WS happy paths, three-concurrent-session, protocol rejects, item lifecycle).

Speed Tests and Profiling
All measurements on a single H100 with Qwen3-ASR-0.6B unless noted (the 1.7B subsection uses Qwen3-ASR-1.7B). WS realtime is driven through the official OpenAI Python SDK (`openai==2.6.1`, `AsyncOpenAI().realtime.connect(...)`) so the numbers reflect what a stock SDK client sees; the realtime-API compatibility shipped in #22848 is why WS exists and is exercised that way. HTTP and HTTP SSE are plain REST endpoints driven by `requests` / `curl`; SDK use is unrelated to those paths. Audio is pushed faster than realtime unless explicitly noted as realtime-paced. The cumulative (pre-slicing) path is reconstructed by swapping the two changed files to upstream `a95b4e2e0`'s `_run_inference`; both modes ran on the same server process within minutes of each other.

End-to-end wall, real long-form English (via SDK WS)

Robert Gupta TED talk prefixes (15 s – 340 s) plus Daniel Kahneman for 600 s / 900 s, from the `distil-whisper/tedlium-long-form` HuggingFace dataset (a long-form subset of the TED-LIUM 3 corpus).

`Σnew` is total prefill new-tokens across the session; `last_new` is the final inference's new-token count (a proxy for worst-case per-call cost).

At lengths where both modes complete, slicing is 14 – 78 % faster end-to-end. Per-chunk new-tokens stays at 58 from the ninth chunk onward (the first eight stay on the cumulative path under the gate); cumulative's per-chunk new-tokens grows to 6,150 at 900 s. Read `Σnew` for the quadratic trend, not `last_new`: `last_new` reflects only the final chunk plus RadixCache state, and cumulative `Σnew` is non-monotonic across the nested-prefix fixtures because RadixCache reuses prefixes across sequential runs.

Per-chunk model-call distribution
Same fixtures, identical timing instrumentation patched into both code paths (`time.perf_counter()` brackets around `tokenizer_manager.generate_request`, parsed from server logs per chunk). `generate_us` is the encoder + LLM prefill + LLM decode + IPC black box, ~96 – 99 % of total wall in both modes. (Measured with a raw-websockets driver; per-call cost is server-side and independent of client transport. The probe instrumentation was removed from the tree post-bench; the aggregate `sum_t_infer` numbers in the cross-model subsection are directionally consistent.)

Slicing's mean and tail are essentially flat across audio length (79 – 82 ms mean, ≤ 121 ms worst chunk) because slicing's audio input is bounded at ~4 s and its prompt is the bare template. Cumulative's mean grows from 91 to 137 ms and its worst single inference at 300 s is 399 ms vs slicing's 121 ms (3.3× wider on the tail). At 600 s+ the cumulative path stays continuously busy long enough that the SDK keepalive budget is exhausted; the slicing path keeps every inference bounded so this never happens.
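A timing bracket of the kind described above might look like the sketch below. The probe itself was removed from the tree, so this is not that code: `timed_call` and `fake_generate` are hypothetical, the real model call is replaced by a short sleep, and the sketch assumes the call can be awaited as a single coroutine.

```python
import asyncio
import statistics
import time


async def timed_call(make_call, samples_ms):
    """Await make_call() and record its wall time in milliseconds."""
    start = time.perf_counter()
    try:
        return await make_call()
    finally:
        # Recorded even if the call raises, so failed chunks are not lost.
        samples_ms.append((time.perf_counter() - start) * 1000.0)


async def fake_generate():
    # Hypothetical stand-in for the model call: sleeps instead of inferring.
    await asyncio.sleep(0.01)
    return "partial transcript"


async def main():
    samples_ms = []
    results = [await timed_call(fake_generate, samples_ms) for _ in range(3)]
    return results, samples_ms


results, samples_ms = asyncio.run(main())
print(f"mean={statistics.mean(samples_ms):.1f} ms  max={max(samples_ms):.1f} ms")
```

Collecting one sample per chunk is what allows both a mean and a worst-chunk figure to be reported for each fixture.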
The physical mechanism is per-call audio input size. On long realtime-paced sessions on 0.6B, mean audio fed per inference is 6.56 s (slicing) vs 35.47 s (cumulative); the max diverges further (slicing 18 s = the 16 s engagement threshold + 2 s overlap, vs cumulative 240 s on a 240 s session — at the tail cumulative feeds ~30× more audio). The per-chunk wall numbers are the symptom; bounded vs unbounded input size is the cause.
PCM-to-model-input prep (`_pcm_to_wav` cumulative vs `_pcm_to_float_samples` slicing) takes < 1 ms / chunk in both modes: the PCM/WAV skip is a real but small win (~1 % wall on long fixtures), worth shipping for cleanliness more than performance.

Cross-model wall-time: 0.6B and 1.7B

Probe-measured per-chunk inference totals (`sum_t_infer`, summed across all inference calls in a session set) under realtime pacing on long-form English. The 1.7B matrix is 53 sessions / 536 inference calls; the 0.6B matrix is the corresponding sweep on the smaller model.

The 1.7B reduction (-42 %) is within 2 points of the 0.6B reduction (-40 %), so the wall-time win generalizes across a 3×-larger model, which is broader evidence than the 0.6B-only TED sweep above. Note the relationship to the push-fast numbers earlier: under realtime pacing the end-to-end wall floor is just the audio duration (RTF ≈ 1.01 – 1.02 on both branches at 180 / 240 s), so the two paths look identical at the SDK wall level; the savings re-emerge as per-chunk inference time, which is exactly the `sum_t_infer` reduction above. Mean audio fed per inference on 1.7B is 7.94 s slicing vs 23.97 s cumulative, mirroring the 0.6B input-size gap. Push the same audio faster than realtime and the per-chunk inference saving turns directly into SDK wall saving: the 14 – 78 % push-fast numbers and the 40 – 42 % realtime-paced inference numbers are two views of the same compute reduction.

Multilingual sanity: byte-equal transcripts
Short fixtures across ZH / HI / ES / EN at 4 – 13 s, both modes via SDK WS. All five stay under the eight-chunk slicing gate, so the slicing path runs cumulative end-to-end and emits the same delta count and final transcript as the cumulative reference:
Verified by direct string comparison of the final `transcript` field on the WS `conversation.item.input_audio_transcription.completed` event.

HTTP SSE vs WebSocket realtime (same streaming UX)

The HTTP SSE chunked-transcription endpoint and the WS realtime endpoint share `process_asr_chunk` but differ in transport and per-chunk audio shape. HTTP SSE accumulates the full audio and re-encodes WAV bytes per chunk (the cumulative path), so it serves as a same-streaming-UX reference for what the realtime path would look like without this fix. Both driven via the OpenAI SDK on the PR branch.

Time-to-first-content is comparable at short audio (HTTP SSE 0.10 – 0.45 s vs WS 0.15 – 0.60 s for 30 – 120 s audio); WS overtakes at 300 s (1.54 s vs 2.34 s) because the cumulative path's first chunk pays more prefill once the buffer is large.
Known limits
Regimes where slicing provides no benefit (or a known trade-off), so reviewers and operators can set expectations.
- […] `pcm_buffer` into a rolling buffer. Session memory remains controlled by `--asr-max-buffer-seconds`. Rolling-buffer compaction is a follow-up if multi-session memory pressure becomes relevant.

How to send requests

Server launch:

HTTP non-streaming: standard `/v1/audio/transcriptions` REST endpoint, any HTTP client works:

HTTP SSE chunked: same endpoint with `stream=true`. sglang emits the legacy chat-style `transcription.chunk` shape with content under `choices[0].delta.content`:

WebSocket realtime: driven through the OpenAI Python SDK to exercise the realtime-API compatibility shipped in #22848. The SDK's typed namespaced helpers on `conn.session` / `conn.input_audio_buffer` are not defined for the transcription session shape, so events flow through `conn.send({...})` raw dicts:

Reproducing the numbers
Wall time comes from the client (`time.perf_counter()` around the SDK call). `#new-token` and `#cached-token` per inference are parsed from the sglang scheduler log, which emits one line per `Prefill batch`:

WS sessions are bounded by lines containing `WebSocket /v1/realtime…[accepted]` (start; the URL carries a `?model=` query string when accessed via the SDK) and `connection closed` (end); per-fixture aggregation is a regex over those boundaries.

Long fixtures (≥ 600 s) on the cumulative path close with `1011` under the OpenAI SDK because the SDK's underlying websockets client uses `ping_interval=20` by default and we found no public knob to override it without subclassing the SDK transport. The slicing path is unaffected because each per-chunk inference completes well under one second.

Checklist
- […] `test/registered/unit/entrypoints/openai/test_streaming_asr.py` registered for CI (`base-a-test-cpu`, `est_time=3`), pure `unittest.CustomTestCase`, no GPU, no server: 3 mock-driven `process_asr_chunk` integration cases (cumulative prefix-injection / sliced bare-prompt + dedupe / `is_last` dedupe → `finalize`), 3 slicing-enable guard cases (`RealtimeConnection.__init__`: within-window enables, over-window falls back, opt-out disables), 2 `StreamingASRState.update()` reconciliation cases (mid-word extension, clean append), 4 dedupe-rule cases (full-overlap, em-dash+case, CJK fallback, long-history suffix-only), 2 PCM/slice helper cases (`_pcm_to_float_samples` bit-equal-to-soundfile, `_slice_pcm_from` out-of-bounds raises). The runtime byte-threshold slicing engagement inside `RealtimeConnection` is exercised by the manual GPU suite. The existing 14-case manual integration suite and 71-case v2 unit suite from "[Feature] WebSocket streaming audio input for ASR" #22848 continue to cover the WS / HTTP / SSE wire paths.)
- […] `_AudioState`, the adapter's `realtime_slicing_config['left_overlap_ms']` field, `_run_inference`, `process_asr_chunk`, `dedupe_overlap`, and `_dedupe_norm` cover the slicing heuristic, the bare-prompt choice, the overlap sizing rationale, the Qwen3-ASR-tuned gate, and the dedupe contract.)

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process. Get approvals from CODEOWNERS and other reviewers. Trigger CI tests with comments or contact authorized users to do so; common commands include `/tag-and-rerun-ci`, `/tag-run-ci-label`, `/rerun-failed-ci`. After green CI and required approvals, ask Merge Oncalls or someone with Write permission to merge.

Related

- […] `_run_inference` function.

Footnotes

The 600/900 s cumulative rows were measured with a raw `websockets` client (`ping_interval=None`); the OpenAI SDK's default 20 s keepalive trips on the cumulative path at those lengths (`1011 keepalive ping timeout`). Sliced numbers match SDK-driven within ~1 %.