[Feature] Realtime ASR: Input Slicing for Long-Running Realtime ASR Sessions #26853
SammLSH wants to merge 10 commits into
PCM16 from the WebSocket path was being encoded into WAV bytes only for load_audio to decode it back into a float ndarray. Convert directly to float samples (1/32768 normalization matches soundfile.read default for signed 16-bit, so the float values are bit-equal to the old path), and teach load_audio to accept a pre-decoded ndarray as a no-op passthrough. ASR/cache semantics unchanged — this only removes the WAV adapter layer. A future optimization could maintain decoded samples incrementally to avoid re-converting the cumulative PCM buffer on every chunk.
…forward
Replace the WS /v1/realtime cumulative inference path (re-send the whole PCM buffer on every chunk) with input slicing once a committed-text prefix exists. Once StreamingASRState has stable emitted text, is past the K-token holdback gate, and has accumulated at least eight chunks (~16 s) of cumulative context, the model runs on ``pcm_buffer[committed_audio_until_bytes - left_overlap_bytes:]`` (the new audio plus a 2 s left overlap) instead of the full buffer. The prompt stays at ``adapter.prompt_template``: emitted_text is not injected as a continuation prefix; the retained acoustic overlap plus a word-level dedupe (with CJK char-level fallback) takes its place. The first gated call still starts at offset 0 because committed_audio_until_bytes is initialized to 0; only chunk 9 onward is bounded to overlap + new chunk.
Performance (TED-LIUM long-form sweep on Qwen3-ASR-0.6B, H100):
| audio | cumul wall | sliced wall | save |
|---|---|---|---|
| 30 s | 1.51 s | 1.29 s | 14 % |
| 60 s | 3.00 s | 2.52 s | 16 % |
| 120 s | 6.17 s | 5.11 s | 17 % |
| 240 s | 14.78 s | 10.07 s | 32 % |
| 300 s | 19.49 s | 12.01 s | 38 % |
| 600 s | 77.24 s | 26.68 s | 65 % |
| 900 s | 171.37 s | 38.23 s | 78 % |
Per-chunk model-call wall stays flat at ~80 ms mean / ~121 ms max across the sweep instead of growing to 137 ms mean / 399 ms max in the cumulative path at 300 s. Realtime-paced sum of per-chunk inference wall drops 40-42 % on both 0.6B and 1.7B Qwen3-ASR.
Implementation:
- ``adapter.realtime_slicing_config`` returns left_overlap_ms (default 2000) and min_audio_sec (default 16.0); slicing_min_chunk_index is derived as ceil(min_audio_sec / chunk_size_sec).
- ``_slice_pcm_from`` snapshots the bytearray via memoryview so the per-chunk copy is slice-sized instead of full-buffer + slice (~7.7 MB -> ~128 KB at 240 s when slicing engaged).
- ``dedupe_overlap`` normalizes only the tail of committed_text bounded by len(candidate_words), so dedupe cost does not grow with session length.
- ``process_asr_chunk`` gains ``prompt: Optional[str]`` and ``dedupe_against: Optional[str]`` kwargs; the realtime path uses them, the HTTP / HTTP SSE path keeps existing behavior via defaults.
- ``load_audio`` annotation widened from ``str`` to ``Union[str, bytes, np.ndarray]`` to match the existing isinstance branches; not exposed through any Pydantic schema path.
Tests: 21-case CI unit suite at test/registered/unit/entrypoints/openai/test_streaming_asr.py covering dedupe_overlap (word + CJK + suffix-only-history invariant), _pcm_to_float_samples (normalization + soundfile-round-trip bit-equality + odd-length raises), and _slice_pcm_from validation.
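The gate and slice-offset arithmetic described in this commit can be sketched as follows. This is a minimal illustration, not the PR's code: the 16 kHz mono sample rate is an assumption (it is consistent with the ~7.7 MB and ~128 KB figures above), and the helper names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 16_000  # assumption: 16 kHz mono input
SAMPLE_WIDTH = 2         # bytes per PCM16 sample
BYTES_PER_SECOND = SAMPLE_RATE_HZ * SAMPLE_WIDTH


def slicing_min_chunk_index(min_audio_sec: float, chunk_size_sec: float) -> int:
    """Number of chunks that must accumulate before slicing may engage."""
    return math.ceil(min_audio_sec / chunk_size_sec)


def overlap_bytes(left_overlap_ms: int) -> int:
    """Left overlap in bytes, rounded down to a whole PCM16 sample."""
    raw = left_overlap_ms * BYTES_PER_SECOND // 1000
    return raw // SAMPLE_WIDTH * SAMPLE_WIDTH


def slice_start(prev_end_bytes: int, left_overlap_ms: int) -> int:
    """Byte offset where the next sliced call starts (never negative)."""
    return max(0, prev_end_bytes - overlap_bytes(left_overlap_ms))


gate = slicing_min_chunk_index(16.0, 2.0)           # Qwen3-ASR defaults
first = slice_start(0, 2000)                        # first gated call: full buffer
steady = slice_start(120 * BYTES_PER_SECOND, 2000)  # previous call ended at 120 s
```

With these values the gate is 8 chunks, the first gated call starts at offset 0, and a steady-state call on a 122 s buffer covers 128,000 bytes (4 s of audio).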
After a stability gate (8 chunks / ~16s for Qwen3-ASR), the realtime WebSocket
path runs inference on a bounded audio tail (the new chunk + a 2s left overlap)
instead of the full cumulative PCM buffer, with output-side dedupe. Slicing is
opt-in per adapter: the base config keeps it off; Qwen3-ASR enables it. Short
audio and non-opting adapters keep the cumulative path unchanged.
Also in this changeset:
- StreamingASRState.update(): drop the char-level startswith fast path that
emitted mid-word fragments ("world" -> "worldly" emitted "ly"); the word-level
common-prefix scan now runs unconditionally (matching finalize()).
- Convert PCM16 to float directly, skipping the PCM -> WAV -> ndarray round-trip;
load_audio accepts a pre-decoded ndarray.
- Add a 14-case CPU unit suite (process_asr_chunk integration, slicing-enable
guard, update() reconciliation, dedupe rules, PCM/slice helpers).
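The difference between the removed char-level fast path and the word-level scan from the first bullet can be sketched as below. Both functions are simplified stand-ins with hypothetical names; the real `update()` also applies a hold-back that is omitted here.

```python
def char_level_delta(previous: str, current: str) -> str:
    """Old shortcut (simplified): emit whatever follows the previous text."""
    if current.startswith(previous):
        return current[len(previous):]
    return current


def word_level_delta(previous: str, current: str) -> str:
    """Word-level common-prefix scan: emit every word after the shared prefix."""
    prev_words = previous.split()
    cur_words = current.split()
    common = 0
    for old, new in zip(prev_words, cur_words):
        if old != new:
            break
        common += 1
    return " ".join(cur_words[common:])


fragment = char_level_delta("hello world", "hello worldly")  # mid-word fragment
whole = word_level_delta("hello world", "hello worldly")     # corrected word
```

Here `fragment` is `"ly"` while `whole` is `"worldly"`, which is the case the bullet describes.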
Refines the M2 input-slicing output dedupe and its test suite:
- Dedupe normalization uses NFKC + Unicode category-P edge stripping (Whisper-style) instead of a hand-listed punctuation set.
- Split CJK detection into _is_cjk_no_space (spacing) and _is_cjk_dedupe (dedupe, narrower); use Script_Extensions so the kana marks U+30FC/U+30FB are covered; keep Hangul out (Korean is space-delimited).
- CJK dedupe is boundary-only: compare the leading/trailing CJK runs and never skip interior non-CJK content; require a >=2-glyph overlap for letters, allow 1 for punctuation.
- Fix spurious over-deletion when lone-punctuation tokens normalize to "" and match each other; require a real word in the matched overlap.
- _dedupe_by_word rsplits the committed tail instead of tokenizing the whole growing transcript.
- Rewrite the unit tests as entry-point scenarios through process_asr_chunk plus the slicing-enable guard.
Tighten slicing-path comments to one-liners (base adapter config docstring, session.py slicing_enabled / emitted_deltas / _run_inference, streaming_asr predicate header). No logic change.
CJK never enters the sliced path -- the slicing gate needs confirmed text, which the whitespace word-split in StreamingASRState.update never produces for space-less scripts -- so the CJK char-level dedupe was unreachable for CJK and only added review surface. dedupe_overlap is now word-level only; the spacing predicate (needs_space) and word-level dedupe (incl. the punctuation-overlap fix) stay. CJK-aware dedupe is deferred to M3, where slicing also engages for CJK.
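Why whitespace splitting never yields confirmed text for space-less scripts can be seen in a small sketch. It assumes, for illustration only, that the hold-back keeps the last 5 whitespace-delimited words unconfirmed; `confirmed_words` is a hypothetical helper, not the PR's `update()`.

```python
from typing import List


def confirmed_words(text: str, holdback: int) -> List[str]:
    """Words considered stable: everything except the last `holdback` words."""
    words = text.split()  # whitespace split, as in the word-level scan
    return words[:-holdback] if holdback > 0 else words


english = confirmed_words("the quick brown fox jumps over the lazy dog", 5)
mandarin = confirmed_words("今天天气很好我们去公园散步", 5)
```

The English sentence leaves four confirmed words, while the Mandarin sentence is a single whitespace token and leaves none, so a gate that requires confirmed text stays closed.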
Drop the regex/scx rewrite of the spacing predicate back to the baseline codepoint _is_cjk (removes the new `regex` dependency); add a halfwidth Hangul jamo guard so the function matches its docstring. Fix a stale "token-level" comment in update() (the scan is word-level; token-level rollback is M3) and shorten _dedupe_norm's docstring.
"committed_audio_until_bytes" collided with the OpenAI realtime
input_audio_buffer.commit concept; the field actually marks the PCM
offset the previous sliced inference consumed up to. Rename it across
the field declaration, slice-start arithmetic, anchor update, and the
per-item reset. Also fix a stale test docstring left over from the
CJK-dedupe rollback ("Latin and CJK" -> word-level dedupe).
Code Review
This pull request introduces a realtime ASR slicing path to optimize inference by switching from cumulative buffers to tail slices with left overlap and output deduplication. It updates RealtimeConnection and StreamingASRState to support slicing configuration, adds word-level deduplication, and enables slicing for Qwen3-ASR. Feedback focuses on improving robustness and efficiency, including moving slicing and float conversion inside the try-except block in _run_inference, using float32 instead of float64 for audio samples to reduce memory usage, and ensuring that left_overlap_bytes and slicing offsets are aligned to the 16-bit PCM sample width boundary to prevent audio corruption.
    slicing_opt_in = bool(slicing_cfg.get("enabled", False))
    left_overlap_ms = int(slicing_cfg.get("left_overlap_ms", 0))
    min_audio_sec = float(slicing_cfg.get("min_audio_sec", 0.0))
    left_overlap_bytes = int(left_overlap_ms / 1000 * self.bytes_per_second)
To prevent potential audio misalignment and corruption, left_overlap_bytes should be explicitly aligned to a multiple of _SAMPLE_WIDTH (2 bytes). If left_overlap_bytes is not aligned, slicing the PCM buffer could cut a 16-bit sample in half, leading to static noise or runtime errors during conversion.
    - left_overlap_bytes = int(left_overlap_ms / 1000 * self.bytes_per_second)
    + left_overlap_bytes = int(left_overlap_ms / 1000 * self.bytes_per_second) // _SAMPLE_WIDTH * _SAMPLE_WIDTH
    def _slice_pcm_from(buffer: Union[bytes, bytearray], start: int) -> bytes:
        """Return an immutable ``buffer[start:]`` snapshot with bounds checking."""
        if not (0 <= start <= len(buffer)):
            raise ValueError(f"_slice_pcm_from: start={start} not in [0, {len(buffer)}]")
        return bytes(memoryview(buffer)[start:])
As a defensive programming practice, consider adding a check in _slice_pcm_from to ensure that the start offset is a multiple of _SAMPLE_WIDTH. This guarantees that the sliced buffer is properly aligned to 16-bit PCM boundaries, preventing silent audio corruption or misalignment.
    - def _slice_pcm_from(buffer: Union[bytes, bytearray], start: int) -> bytes:
    -     """Return an immutable ``buffer[start:]`` snapshot with bounds checking."""
    -     if not (0 <= start <= len(buffer)):
    -         raise ValueError(f"_slice_pcm_from: start={start} not in [0, {len(buffer)}]")
    -     return bytes(memoryview(buffer)[start:])
    + def _slice_pcm_from(buffer: Union[bytes, bytearray], start: int) -> bytes:
    +     """Return an immutable ``buffer[start:]`` snapshot with bounds checking."""
    +     if not (0 <= start <= len(buffer)):
    +         raise ValueError(f"_slice_pcm_from: start={start} not in [0, {len(buffer)}]")
    +     if start % _SAMPLE_WIDTH != 0:
    +         raise ValueError(f"_slice_pcm_from: start={start} must be a multiple of {_SAMPLE_WIDTH}")
    +     return bytes(memoryview(buffer)[start:])
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Motivation
This PR implements the M2 milestone of RFC #22474: keeping long-form realtime speech-to-text (ASR) cheap to run.
Background on how the realtime path works today. The WebSocket endpoint `/v1/realtime` (added in #22848) transcribes audio as it streams in: the client sends audio, and the server periodically runs the model on what it has so far and pushes partial text back. Audio is processed in fixed-length pieces called chunks (2 s each for Qwen3-ASR). Today, on every chunk the server re-sends the entire audio accumulated so far (the "cumulative" approach). So per-chunk work grows as the session gets longer, and total work over a session is quadratic in its length.
This PR adds an input slicing path for long sessions. Once enough audio has accumulated (a "stability gate"), each inference runs on only a bounded tail of the audio (the newest chunk plus a short slice of the audio just before it, the "left overlap") instead of the whole buffer. Short audio, and any model that doesn't opt in, keep the old cumulative behavior unchanged.
Terms used throughout:
In our measurements, slicing addresses the main long-form problems:
- Per-chunk latency stops growing. On a 300 s session, the slowest single inference drops from 399 ms (cumulative) to 121 ms (sliced). End-to-end time (total elapsed wall-clock time) improves 14–78% across 30–900 s. Under real-time pacing, per-chunk inference time drops 40–42% on both Qwen3-ASR-0.6B and Qwen3-ASR-1.7B.
- Runaway repeated output. On audio with repeated content, the cumulative path keeps re-transcribing and re-sending old text. For a 180 s repetitive clip (which we call "EN180"), the cumulative output balloons to 75,035 characters versus about 2,160 in the gold transcript. The same 75,035-character output appears on both 0.6B and 1.7B, so it is the cumulative call pattern, not a model-size issue. Slicing stays near the gold length.
- Dropped connections on very long sessions. A WebSocket connection exchanges periodic "keepalive" pings; if one side is too busy to answer in time, the connection is closed. In our 600 s / 900 s runs through the official OpenAI Python client library (the "SDK"), the cumulative path was busy long enough to trip this and closed with `1011 keepalive ping timeout`. The sliced path stayed responsive because each inference is bounded.
- Audio-encoder memory growth. The part of the model that turns audio into features (the "audio encoder", also called the audio tower) sees more audio every chunk on the cumulative path, so its memory grows with session length. We did not hit out-of-memory (OOM) in these runs, but slicing caps the per-call audio and removes the risk.
This PR also removes a small inefficiency: the realtime path used to convert raw 16-bit audio samples (PCM16) into a WAV byte stream just so a shared helper could decode them back into a floating-point array. That round-trip is gone.
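A direct PCM16-to-float conversion of the kind described here might look like the following. This is a sketch under assumptions: the function name is hypothetical, little-endian input is assumed, and float64 is chosen only because it is `soundfile.read`'s default dtype; the PR's helper may differ.

```python
import numpy as np


def pcm16_to_float(pcm: bytes) -> np.ndarray:
    """Convert little-endian signed 16-bit PCM bytes to floats in [-1.0, 1.0)."""
    if len(pcm) % 2 != 0:
        # Half a sample at the end means the byte stream is misaligned.
        raise ValueError("PCM16 payload must have an even number of bytes")
    samples = np.frombuffer(pcm, dtype="<i2")
    # Dividing by 32768 maps the int16 range onto [-1.0, 1.0).
    return samples.astype(np.float64) / 32768.0


raw = np.array([0, 16384, -32768, 32767], dtype="<i2").tobytes()
floats = pcm16_to_float(raw)
```

No WAV header is written or parsed at any point; the bytes are reinterpreted as samples and scaled.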
Defaults for Qwen3-ASR:
`ceil(16 / 2) = 8` chunks
Relationship to RFC M2
RFC #22474 sketches M2 as cross-chunk prefix caching via RadixCache (sglang's cache that reuses computation for inputs sharing a leading prefix): keep the full audio context, let the cache match the shared leading tokens, and run prefill only on the new tail tokens.
This PR does not take that exact route. It targets the same M2 goal — removing the quadratic long-form cost — with input slicing instead.
Why this route:
Exact RadixCache prefix reuse and token-level streaming / alignment remain valid future work, and can compose with slicing later.
Modifications
The main change is in the realtime WebSocket inference path (`_run_inference`). Once the streaming state has stable emitted text and the session has passed the 8-chunk gate, `_run_inference` sends only a tail audio window instead of the full accumulated PCM buffer.
The sliced path:
- keeps the prompt at the bare template (`adapter.prompt_template`), and does not inject the already-emitted text back into the prompt; the retained acoustic overlap is the continuity signal instead;
- slices the audio from `last_sliced_buffer_end_bytes - left_overlap_bytes` (i.e. the audio since the previous slice's end, plus the left overlap);
- applies an output-side dedupe (`dedupe_overlap`) that removes text the overlap caused the model to re-transcribe, before it reaches the streaming state.
The first inference past the gate still starts from offset 0 (the last-sliced-end marker is initialized to 0), so that one call feeds the full buffer; every call after it is the ~4 s steady-state window.
Slicing is opt-in per adapter. The base adapter keeps it off by default; Qwen3-ASR turns it on:
    {"enabled": True, "left_overlap_ms": 2000, "min_audio_sec": 16.0}
Files touched
| File | Change |
|---|---|
| `realtime/session.py` | `_run_inference`; new slicing state fields + opt-in guard; `_pcm_to_float_samples` and `_slice_pcm_from` (replace the PCM→WAV round-trip) |
| `streaming_asr.py` | `dedupe_overlap`; `process_asr_chunk` gains `prompt` / `dedupe_against` args; fixes the `StreamingASRState.update()` reconciliation bug below |
| `transcription_adapters/base.py` | `realtime_slicing_config`, defaulting to `enabled=False` (slicing off) |
| `transcription_adapters/qwen3_asr.py` | |
| `utils/common.py` | `load_audio` accepts a pre-decoded array and returns it directly |
| `test/registered/unit/entrypoints/openai/test_streaming_asr.py` | |
PCM/WAV cleanup
The realtime path used to convert PCM16 → WAV bytes and then call `load_audio`, which decoded the WAV back into float samples. This PR converts PCM16 directly to float samples.
The `/ 32768.0` normalization matches `soundfile.read`'s default 16-bit normalization, so the result is identical (bit-for-bit) to the old path by construction.
`load_audio` now also accepts an already-decoded array and returns it directly, so the realtime path never enters the file/byte decoder at all.
Reconciliation fix
`StreamingASRState.update()` computes which new words to emit each chunk. It previously had a character-level shortcut (a `startswith` fast path).
Because that compares characters, not whole words, it cut mid-word when the model extended a previously-emitted word. For example, when `"world"` became `"worldly"`, it emitted the fragment `"ly"` instead of the corrected word `"worldly"`.
This PR removes the shortcut and always uses the word-level common-prefix scan that already lived below it (and that `finalize()` already uses). A CPU test drives this through `process_asr_chunk`: a mid-word extension (`"world"` → `"worldly"`) must emit `"worldly"`, not `"ly"`. The behavior dates back to the original chunked-streaming path in #22089; it is fixed here because the sliced path also routes its output through `update()`.
When slicing turns on
Slicing runs only when all of these hold:
- the adapter opts in (`realtime_slicing_config["enabled"] == True`);
- the streaming state has confirmed text (`state.get_prefix_text()` is non-empty);
- the session is past the gate (`state.chunk_index >= slicing_min_chunk_index`);
- the chunk-level window check passes (`unfixed_chunk_num × chunk_size`), so some fresh audio always remains for the dedupe to anchor against (otherwise slicing auto-disables, see below). This is a structural check on chunk-sized audio, distinct from the token-level rollback tuning in Model-specific tuning.
For Qwen3-ASR: `chunk_size_sec = 2.0`, `min_audio_sec = 16.0`, `left_overlap_ms = 2000`. So the gate is `ceil(16 / 2) = 8` chunks (≈ 16 s). Below the gate, the path stays cumulative: short audio is unchanged, which avoids the short-input divergence we saw in manual tests.
Model-specific tuning
The 2 s overlap and 16 s gate are tuned for Qwen3-ASR.
Qwen3-ASR may revise its last few output tokens as more audio arrives: its config marks the last 5 tokens as still-revisable (`unfixed_token_num = 5`, a token-level rollback window). The 2 s overlap is an empirical choice: in our fixtures 2 s of audio carried enough context to re-emit those ~5 tokens. This is a tuning assumption (≈5 tokens ≲ 2 s), not something the guard verifies; the guard only enforces the chunk-level window above. The 16 s gate keeps slicing off on short inputs, where sliced output diverged from cumulative output in manual tests. Other chunked ASR models should re-tune these before enabling slicing; this is noted in the adapter docstring.
Accuracy tests
Short audio: no regression
The 7-fixture HTTP / HTTP-SSE / WebSocket consistency checks from #22848 still hold ("fixture" = a test audio sample; SSE = Server-Sent Events, the HTTP streaming format). These fixtures stay below the 8-chunk gate, so the WebSocket path stays cumulative end-to-end.
The gate also fixes word drops that earlier slicing attempts hit on short audio:
`medio sumergidas`
`में कितने`
All three are now identical character-for-character ("byte-equal") to the HTTP paths.
Long-form audio
On long-form English TED talks (from the `distil-whisper/tedlium-long-form` dataset), the cumulative and sliced paths produce broadly matching final transcripts: no truncation, no hallucination divergence in the tested fixtures.
The cumulative path emits more intermediate deltas (incremental updates to the client) because it keeps producing more revisions, e.g. 767 vs 698 deltas at 300 s. The final transcript agrees. (Under slicing, the running state only holds the latest deduped tail, so the wire transcript is rebuilt from the list of deltas already sent, not from that state.)
Repetitive long-form content
The TED results do not cover the worst case. Repetitive audio exposes a different failure. On a tiled English clip ("EN180"/"EN240" = a short English clip repeated to fill 180 s / 240 s), the cumulative path over-emits badly while slicing stays bounded:
("gold chars" = character count of the reference transcript; "over-emit" = how many times larger the output is than gold.) The cumulative output is unusable here. The sliced output is bounded but under-emits by ~11–12%, because the text-level dedupe can over-match genuinely repeated words. That is a known trade-off of this M2 implementation; a principled fix needs token- or timestamp-level alignment (M3).
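The over-match on repeated words can be reproduced with a stripped-down word-level dedupe. This sketch is not the PR's `dedupe_overlap`: it skips normalization and punctuation handling, and the function name is hypothetical. It only shows why a longest suffix/prefix match cannot tell re-transcribed overlap from genuine repetition.

```python
def dedupe_overlap_sketch(committed: str, candidate: str) -> str:
    """Drop the longest candidate prefix that equals a suffix of committed."""
    cand = candidate.split()
    # Only the committed tail can overlap, so bound it by the candidate length.
    tail = committed.split()[-len(cand):] if cand else []
    for k in range(min(len(tail), len(cand)), 0, -1):
        if tail[-k:] == cand[:k]:
            return " ".join(cand[k:])
    return candidate


# Ordinary overlap: "to the store" was re-transcribed from the left overlap.
normal = dedupe_overlap_sketch(
    "we went to the store", "to the store and bought milk"
)

# Repetitive audio (hypothetical): only the first "no" is overlap, the other
# two are new speech, yet the longest match removes all three.
repeated = dedupe_overlap_sketch("no no no no", "no no no")
```

`normal` keeps `"and bought milk"`, while `repeated` comes back empty: two genuinely new words are lost, which is the under-emission pattern described above.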
Unit coverage added in this PR
`test/registered/unit/entrypoints/openai/test_streaming_asr.py`: 8 CPU tests (no GPU), registered for CI under the `base-a-test-cpu` suite (`est_time=3` is the CI time-budget hint, in seconds). Following the existing `test_serving_transcription` / `test_serving_embedding` suites, they drive the real `process_asr_chunk` entry point with a mocked `TokenizerManager` rather than unit-testing helpers in isolation:
- `process_asr_chunk` scenarios (6): cumulative path injects the prompt prefix and runs no dedupe (M1); sliced path uses the bare prompt and dedupes a Latin overlap (M2); a non-overlapping candidate is kept unchanged; the final chunk dedupes then finalizes; a mid-word extension reconciles to the whole word (`"worldly"`, not `"ly"`); an empty model response emits nothing without mutating state.
- Slicing-enable guard: an adapter that does not opt in (`enabled=False`) never slices.
The output dedupe (word-level) is exercised through these entry-point scenarios. Whether slicing actually turns on at runtime (the 8-chunk gate mid-stream, the first-gated-call edge, chunk-boundary behavior) stays in the manual GPU suite, not CI.
Existing coverage from #22848 should keep passing: the manual Qwen3-ASR HTTP / SSE / WebSocket tests, protocol-reject and item-lifecycle tests, the v2 unit suite, and the multilingual three-path byte-equality checks.
Speed tests and profiling
Definitions used in this section:
All numbers in this section were measured on a single H100 GPU with Qwen3-ASR-0.6B unless noted. The WebSocket path is driven through the OpenAI Python SDK (`openai==2.6.1`). Audio is pushed faster than real time unless a row is marked real-time-paced. The "cumulative" baseline is the pre-slicing `_run_inference` on the same server.
End-to-end wall time
TED-talk prefixes from `distil-whisper/tedlium-long-form`. "Total prefill tokens" sums every chunk's prefill in the session; "last-chunk prefill tokens" is just the final chunk (a proxy for worst-case per-call cost). Lower is better for both.
Slicing is 14–78% faster end-to-end here, and its last-chunk prefill flattens at 58 tokens once past the gate while the cumulative path's keeps growing. Two caveats on reading this table:
The quadratic long-form cost is cleanest in wall time and in per-chunk max latency (next section: cumulative per-chunk max grows from 127 ms at 30 s to 399 ms at 300 s) — treat those as the primary evidence; the cumulative total-prefill column is supporting/illustrative.
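The shape of the cost can be checked with a back-of-the-envelope count of audio seconds fed to the model per session. This is a sketch, not a prediction of the measured wall times: it assumes 2 s chunks, a 2 s overlap, and that the first 8 calls (including the first gated one) still see the full buffer.

```python
def audio_seconds_processed(session_sec, chunk_sec=2.0, gate_chunks=8,
                            overlap_sec=2.0):
    """Return (cumulative, sliced) total audio seconds sent to the model."""
    n_chunks = int(session_sec // chunk_sec)
    # Cumulative: call k re-sends all k chunks received so far.
    cumulative = sum(k * chunk_sec for k in range(1, n_chunks + 1))
    sliced = 0.0
    for k in range(1, n_chunks + 1):
        if k <= gate_chunks:
            sliced += k * chunk_sec            # still the full buffer
        else:
            sliced += chunk_sec + overlap_sec  # bounded tail window
    return cumulative, sliced


cumul_300, sliced_300 = audio_seconds_processed(300)
cumul_900, sliced_900 = audio_seconds_processed(900)
```

Under these assumptions a 300 s session feeds 22,650 s of audio on the cumulative path versus 640 s when sliced, and a 900 s session feeds 202,950 s versus 1,840 s: tripling the session length multiplies the cumulative total by about nine but the sliced total by about three.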
Per-chunk model-call distribution
Each row is the distribution of single-chunk inference time (milliseconds) within one session, measured with `time.perf_counter()` around the model call. `n_chunks` = number of model calls; `stdev` = standard deviation; `min` / `max` = fastest/slowest single chunk.
Sliced per-chunk time stays flat (the per-call audio is bounded); cumulative time grows as the accumulated buffer gets longer. The PCM/WAV cleanup is < 1 ms per chunk here; it is a cleanup, not the source of the speedup.
Cross-model total inference time
"Total inference time" = the sum of every per-chunk model-call duration across a whole run (so it captures total GPU work, not just wall time). Lower is better. Runs are real-time-paced on long-form English; the 1.7B run is 53 sessions / 536 model calls.
The two models reduce by nearly the same amount (−40% vs −42%), which means the win comes from bounding the call pattern, not from anything specific to the 0.6B model.
Multilingual short-audio sanity
These fixtures stay below the gate, so both modes take the cumulative path and should match exactly. "deltas" = incremental updates sent to the client.
HTTP SSE vs WebSocket realtime
HTTP SSE (the streaming HTTP endpoint) stays cumulative; the WebSocket path uses this PR's slicing after the gate. Both driven via the OpenAI SDK.
Short audio is close; the gap grows with length because SSE still re-encodes the full cumulative buffer.
Known limits
- The streaming state holds back the last `unfixed_token_num` words (Qwen3-ASR: 5) for the next pass to confirm; the next slice re-covers them through the 2 s left overlap. But the hold-back is counted in tokens while the overlap is measured in time, and `last_sliced_buffer_end_bytes` jumps to the full buffer after each sliced call, so a held word is recovered only if it lies within the last ~2 s of audio. If the last 5 words span more than the overlap (slow/spontaneous speech, or a pause among them), the earliest held word falls before the next slice's start and is dropped, with no later chance to recover. Fast continuous speech (~3 words/s → ~1.7 s for 5 words, e.g. the TED fixtures) stays inside the overlap, so the long-form results above do not surface it. The slicing-enable guard checks a chunk-time window, not this token-span-vs-overlap relationship. The robust fix makes the hold-back time-based (hold the last X s, require overlap ≥ X), which is M3 work, with token/timestamp alignment. (This affects space-delimited languages only; CJK never reaches the sliced path, as the CJK limit below explains, so it cannot lose held words this way.)
- Slicing engages only once `update()` has confirmed text to anchor against (`get_prefix_text()` non-empty). `update()` finds confirmed words by splitting on whitespace, which space-less scripts (Chinese, Japanese) never satisfy, so `emitted_text` stays empty, the slicing gate never opens, and CJK long-form runs the cumulative path for the whole session: no per-call bounding, and the transcript arrives as a single burst at commit instead of incrementally. Measured on this branch (Qwen3-ASR-0.6B): a 48 s Mandarin clip → 1 delta, prefill grew to ~629 tokens, ~21 s wall (cumulative); a 37 s English clip → 107 deltas, a flat 58-token prefill, ~5 s wall (sliced). The CJK transcript is still correct (byte-equal to the full-context HTTP result); slicing simply never engages. Making CJK benefit needs CJK-aware confirmation (character/token-level rollback), which is M3.
- The session still keeps the full PCM buffer (`pcm_buffer`); session memory is still capped by `--asr-max-buffer-seconds`. A rolling buffer is future work.
How to send requests
Server launch:
HTTP non-streaming:
HTTP SSE (streaming):
WebSocket realtime via the OpenAI Python SDK:
Reproducing the numbers
Wall time is measured on the client with `time.perf_counter()` around the SDK call.
Prefill-token counts come from SGLang's scheduler log, one line per batch: `#new-token` is tokens prefilled fresh, `#cached-token` is tokens reused from the prefix cache.
WebSocket sessions are delimited in the log by `WebSocket /v1/realtime ... [accepted]` (start) and `connection closed` (end).
The 600 s / 900 s cumulative rows were measured with a raw `websockets` client and keepalive disabled (`ping_interval=None`), because the OpenAI SDK's default 20 s keepalive interval dropped the cumulative path at those lengths (`1011 keepalive ping timeout`). The sliced path completes through the stock SDK; sliced numbers match within ~1% either way.
Checklist
- `black-jupyter`, `isort`, `ruff`, `codespell`, `ast`, EOL, and whitespace checks pass.
- `test/registered/unit/entrypoints/openai/test_streaming_asr.py` (`base-a-test-cpu`, `est_time=3`): entry-point scenarios through `process_asr_chunk` (cumulative/sliced paths, word-level dedupe, finalize, reconciliation, empty response) plus the slicing-enable guard. Runtime slicing behavior stays in the manual GPU suite.
- `_run_inference`, `process_asr_chunk`, and the dedupe helpers.
Related
`_run_inference` path.
CI States
Latest PR Test (Base): ❌ Run #26718039502
Latest PR Test (Extra): ❌ Run #26718039444
Footnotes
The 600 s / 900 s cumulative rows use a raw `websockets` client with keepalive disabled (`ping_interval=None`). The OpenAI SDK path hit `1011 keepalive ping timeout` on the cumulative path at those lengths. Sliced numbers match the SDK-driven runs within ~1%.