[Feature] Realtime ASR: Input Slicing for Long-Running Realtime ASR Sessions by SammLSH · Pull Request #26853 · sgl-project/sglang

SammLSH · 2026-05-31T15:52:10Z

Motivation

This PR implements the M2 milestone of RFC #22474: keeping long-form realtime speech-to-text (ASR) cheap to run.

Background on how the realtime path works today: the WebSocket endpoint /v1/realtime (added in #22848) transcribes audio as it streams in. The client sends audio, and the server periodically runs the model on what it has so far and pushes partial text back. Audio is processed in fixed-length pieces called chunks (2 s each for Qwen3-ASR). Today, on every chunk the server re-runs the model on all of the audio accumulated so far (the "cumulative" approach). Per-chunk work therefore grows as the session gets longer, and total work over a session is quadratic in its length.

This PR adds an input slicing path for long sessions. Once enough audio has accumulated (a "stability gate"), each inference runs on only a bounded tail of the audio — the newest chunk plus a short slice of the audio just before it (the "left overlap") — instead of the whole buffer. Short audio, and any model that doesn't opt in, keep the old cumulative behavior unchanged.

Terms used throughout:

chunk — a fixed-length piece of audio the server runs the model on (2 s for Qwen3-ASR).
cumulative path — re-send all audio so far on every chunk (the old behavior).
sliced path — send only the recent tail (this PR, for long audio).
prefill — the model encoding its input before it generates output; its cost scales with the number of input tokens.
delta — one incremental piece of transcript text the server pushes to the client as it goes.
gold transcript — the reference / ground-truth transcript we compare against.

In our measurements, slicing addresses the main long-form problems:

Per-chunk latency stops growing. On a 300 s session, the slowest single inference drops from 399 ms (cumulative) to 121 ms (sliced). End-to-end time (total elapsed wall-clock time) improves 14–78% across 30–900 s. Under real-time pacing, per-chunk inference time drops 40–42% on both Qwen3-ASR-0.6B and Qwen3-ASR-1.7B.
Runaway repeated output. On audio with repeated content, the cumulative path keeps re-transcribing and re-sending old text. For a 180 s repetitive clip (which we call "EN180"), the cumulative output balloons to 75,035 characters versus about 2,160 in the gold transcript. The same 75,035-character output appears on both 0.6B and 1.7B, so it is the cumulative call pattern, not a model-size issue. Slicing stays near the gold length.
Dropped connections on very long sessions. A WebSocket connection exchanges periodic "keepalive" pings; if one side is too busy to answer in time, the connection is closed. In our 600 s / 900 s runs through the official OpenAI Python client library (the "SDK"), the cumulative path was busy long enough to trip this and closed with 1011 keepalive ping timeout. The sliced path stayed responsive because each inference is bounded.
Audio-encoder memory growth. The part of the model that turns audio into features (the "audio encoder", also called the audio tower) sees more audio every chunk on the cumulative path, so its memory grows with session length. We did not hit out-of-memory (OOM) in these runs, but slicing caps the per-call audio and removes the risk.

This PR also removes a small inefficiency: the realtime path used to convert raw 16-bit audio samples (PCM16) into a WAV byte stream just so a shared helper could decode them back into a floating-point array. That round-trip is gone.

Defaults for Qwen3-ASR:

chunk length: 2 s
slicing turns on after 16 s of audio, i.e. ceil(16 / 2) = 8 chunks
left overlap: 2 s
steady-state sliced input: one new chunk + 2 s overlap ≈ 4 s
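
To make the quadratic-versus-linear claim concrete, the sketch below counts how many seconds of audio the model is fed over one session under a simplified model. It assumes one model call per chunk, that calls up to and including the gate count (8) run on the full buffer, and that every later call sees exactly one chunk plus the left overlap. The function name is hypothetical, and the output is an illustration of the cost shape, not a measurement.

```python
import math


def audio_seconds_processed(total_sec, chunk_sec=2.0, gate_sec=16.0,
                            overlap_sec=2.0, sliced=True):
    """Total seconds of audio fed to the model over one session (toy model)."""
    n_calls = math.ceil(total_sec / chunk_sec)
    gate_calls = math.ceil(gate_sec / chunk_sec)  # 8 for the Qwen3-ASR defaults
    total = 0.0
    for k in range(1, n_calls + 1):
        if sliced and k > gate_calls:
            # Bounded tail: the new chunk plus the left overlap.
            total += chunk_sec + overlap_sec
        else:
            # Cumulative: everything received so far.
            total += k * chunk_sec
    return total


cumulative = audio_seconds_processed(300, sliced=False)
sliced = audio_seconds_processed(300, sliced=True)
print(f"300 s session: cumulative={cumulative:.0f} s, sliced={sliced:.0f} s")
```

Under these assumptions the cumulative total grows with the square of the number of calls, while the sliced total grows linearly once past the gate.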

Relationship to RFC M2

RFC #22474 sketches M2 as cross-chunk prefix caching via RadixCache (sglang's cache that reuses computation for inputs sharing a leading prefix): keep the full audio context, let the cache match the shared leading tokens, and run prefill only on the new tail tokens.

This PR does not take that exact route. It targets the same M2 goal — removing the quadratic long-form cost — with input slicing instead.

Dimension	RFC sketch: RadixCache prefix caching	This PR: input slicing
Mechanism	Keep the full audio context; cache reuses the shared leading tokens	After the gate, feed only a bounded tail window
What it bounds	LLM prefill only	Audio-encoder input and LLM prefill
Encoder cost	Still re-runs on the full audio buffer	Bounded by chunk + overlap
Cache key	Needs a content-aware audio prefix key (the current key hashes the whole audio tensor, not a prefix)	No cache-key change
Correctness	Exact (same tokens, only cached)	Approximate (acoustic overlap + text dedupe)

Why this route:

RadixCache prefix reuse does not bound audio-encoder work on its own — the encoder still sees the full accumulated buffer unless the audio input is also bounded. Slicing bounds both.
The current multimodal cache key is not a prefix-aware audio key; adding one is a larger change to the cache and multimodal processor.
Slicing is lower-risk for this PR: it keeps the change local to the realtime ASR path and adapter config.
The cost is exactness — slicing relies on acoustic overlap plus output-side dedupe, so it is approximate, not exact prefix reuse.

Exact RadixCache prefix reuse and token-level streaming / alignment remain valid future work, and can compose with slicing later.

Modifications

The main change is in the realtime WebSocket inference path (_run_inference). Once the streaming state has stable emitted text and the session has passed the 8-chunk gate, _run_inference sends only a tail audio window instead of the full accumulated PCM buffer.

The sliced path:

uses the bare prompt template (adapter.prompt_template), and does not inject the already-emitted text back into the prompt — the retained acoustic overlap is the continuity signal instead;
slices the audio buffer at last_sliced_buffer_end_bytes - left_overlap_bytes (i.e. the audio since the previous slice's end, plus the left overlap);
converts PCM16 directly to float samples (no WAV round-trip);
runs an output-side dedupe (dedupe_overlap) that removes text the overlap caused the model to re-transcribe, before it reaches the streaming state.

The first inference past the gate still starts from offset 0 (the last-sliced-end marker is initialized to 0), so that one call feeds the full buffer; every call after it is the ~4 s steady-state window.
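
The slice-start arithmetic above can be sketched as follows. This is a minimal illustration, not the code in session.py: the helper names are hypothetical, a 16 kHz mono PCM16 stream is assumed, and the alignment to a 2-byte sample boundary follows the review feedback rather than anything the description confirms is implemented.

```python
BYTES_PER_SAMPLE = 2  # PCM16: one signed 16-bit integer per sample


def slice_start_bytes(last_end_bytes, left_overlap_ms, sample_rate=16000):
    """Byte offset where the next sliced inference starts reading."""
    overlap_bytes = (sample_rate * left_overlap_ms // 1000) * BYTES_PER_SAMPLE
    start = max(0, last_end_bytes - overlap_bytes)
    # Keep the offset on a sample boundary so samples are never split.
    return start - (start % BYTES_PER_SAMPLE)


def tail_window(pcm_buffer, last_end_bytes, left_overlap_ms=2000,
                sample_rate=16000):
    """Copy only the tail of the buffer (slice-sized, not buffer-sized)."""
    start = slice_start_bytes(last_end_bytes, left_overlap_ms, sample_rate)
    return bytes(memoryview(pcm_buffer)[start:])


bytes_per_sec = 16000 * BYTES_PER_SAMPLE
buf = bytearray(16 * bytes_per_sec)      # 16 s accumulated at the gate
first = tail_window(buf, last_end_bytes=0)
last_end = len(buf)                      # marker moves to the buffer end
buf.extend(bytes(2 * bytes_per_sec))     # one more 2 s chunk arrives
steady = tail_window(buf, last_end_bytes=last_end)
print(len(first) / bytes_per_sec, len(steady) / bytes_per_sec)
```

With the marker initialized to 0, the first call reads the whole 16 s buffer; the next call reads 4 s (2 s overlap plus the 2 s new chunk).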

Slicing is opt-in per adapter. The base adapter keeps it off by default; Qwen3-ASR turns it on:

{"enabled": True, "left_overlap_ms": 2000, "min_audio_sec": 16.0}

Files touched

File	Change
`realtime/session.py`	Tail-window slicing in `_run_inference`; new slicing state fields + opt-in guard; `_pcm_to_float_samples` and `_slice_pcm_from` (replace the PCM→WAV round-trip)
`streaming_asr.py`	New `dedupe_overlap`; `process_asr_chunk` gains `prompt` / `dedupe_against` args; fixes the `StreamingASRState.update()` reconciliation bug below
`transcription_adapters/base.py`	New `realtime_slicing_config`, defaulting to `enabled=False` (slicing off)
`transcription_adapters/qwen3_asr.py`	Opts Qwen3-ASR in (2 s overlap, 16 s min audio)
`utils/common.py`	`load_audio` accepts a pre-decoded array and returns it directly
`test/registered/unit/entrypoints/openai/test_streaming_asr.py`	New 8-case CPU unit suite (entry-point integration + slicing-enable guard)

PCM/WAV cleanup

The realtime path used to convert PCM16 → WAV bytes and then call load_audio, which decoded the WAV back into float samples. This PR converts PCM16 directly:

np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0

/ 32768.0 matches soundfile.read's default 16-bit normalization, so the result is identical (bit-for-bit) to the old path by construction. load_audio now also accepts an already-decoded array and returns it directly, so the realtime path never enters the file/byte decoder at all.
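
A minimal sketch of the direct conversion, wrapped in a helper. The function name is hypothetical, and the even-length check and the explicit little-endian dtype are assumptions added for illustration (`"<i2"` is the same as `np.int16` on little-endian hosts).

```python
import numpy as np


def pcm16_to_float(pcm: bytes) -> np.ndarray:
    """Convert raw little-endian PCM16 bytes to float32 samples in [-1, 1)."""
    if len(pcm) % 2:
        raise ValueError("PCM16 payload must contain an even number of bytes")
    # Dividing by 32768 maps the int16 range [-32768, 32767] to [-1.0, 1.0).
    return np.frombuffer(pcm, dtype="<i2").astype(np.float32) / 32768.0


raw = np.array([-32768, 0, 16384, 32767], dtype="<i2").tobytes()
samples = pcm16_to_float(raw)
print(samples.dtype, samples.tolist())
```

Casting to float32 before the division keeps the result in float32, which halves memory compared with float64 samples.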

Reconciliation fix

StreamingASRState.update() computes which new words to emit each chunk. It previously had a character-level shortcut:

confirmed_text.startswith(old_confirmed)

Because that compares characters, not whole words, it cut mid-word when the model extended a previously-emitted word. For example, when "world" became "worldly", it emitted the fragment "ly" instead of the corrected word "worldly".

This PR removes the shortcut and always uses the word-level common-prefix scan that already lived below it (and that finalize() already uses). A CPU test drives this through process_asr_chunk: a mid-word extension ("world" → "worldly") must emit "worldly", not "ly". The behavior dates back to the original chunked-streaming path in #22089; it is fixed here because the sliced path also routes its output through update().
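
The difference between the two strategies can be sketched in a few lines. Both functions are simplified, hypothetical stand-ins for the logic inside update(), written only to reproduce the "world" → "worldly" case described above.

```python
def emit_word_level(old_confirmed: str, confirmed_text: str) -> str:
    """Emit every word after the longest common word-level prefix."""
    old_words = old_confirmed.split()
    new_words = confirmed_text.split()
    common = 0
    for old, new in zip(old_words, new_words):
        if old != new:
            break
        common += 1
    return " ".join(new_words[common:])


def emit_with_char_shortcut(old_confirmed: str, confirmed_text: str) -> str:
    """Old behavior: a character-level prefix check ahead of the word scan."""
    if confirmed_text.startswith(old_confirmed):
        # Cuts mid-word when the last word was extended.
        return confirmed_text[len(old_confirmed):].strip()
    return emit_word_level(old_confirmed, confirmed_text)


print(emit_with_char_shortcut("hello world", "hello worldly"))  # fragment
print(emit_word_level("hello world", "hello worldly"))          # whole word
```

The word-level scan treats "world" and "worldly" as different words, so the revised word is emitted whole.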

When slicing turns on

Slicing runs only when all of these hold:

the adapter opted in (realtime_slicing_config["enabled"] == True);
there is already-emitted text to anchor the dedupe (state.get_prefix_text() is non-empty);
enough chunks have accumulated (state.chunk_index >= slicing_min_chunk_index);
the left overlap fits inside the unfixed-chunk window (unfixed_chunk_num × chunk_size), so some fresh audio always remains for the dedupe to anchor against (otherwise slicing auto-disables — see below). This is a structural check on chunk-sized audio, distinct from the token-level rollback tuning in Model-specific tuning.

For Qwen3-ASR: chunk_size_sec = 2.0, min_audio_sec = 16.0, left_overlap_ms = 2000. So the gate is ceil(16 / 2) = 8 chunks (≈ 16 s). Below the gate, the path stays cumulative — short audio is unchanged, which avoids the short-input divergence we saw in manual tests.
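
The four conditions and the gate arithmetic can be sketched as a single predicate. This is an illustration under assumptions: the function names and argument layout are hypothetical, `unfixed_chunk_num = 2` is a placeholder value (the description does not state it), and treating "fits inside" as less-than-or-equal is a guess.

```python
import math


def slicing_min_chunk_index(min_audio_sec: float, chunk_size_sec: float) -> int:
    """Number of chunks that must accumulate before slicing may start."""
    return math.ceil(min_audio_sec / chunk_size_sec)


def should_slice(config, prefix_text, chunk_index, chunk_size_sec,
                 unfixed_chunk_num):
    """Return True only when every slicing precondition holds."""
    if not config.get("enabled", False):
        return False  # adapter did not opt in
    if not prefix_text:
        return False  # no emitted text to anchor the dedupe
    gate = slicing_min_chunk_index(config["min_audio_sec"], chunk_size_sec)
    if chunk_index < gate:
        return False  # not enough audio yet
    # Structural check: the overlap must fit in the unfixed-chunk window.
    unfixed_window_ms = unfixed_chunk_num * chunk_size_sec * 1000
    return config["left_overlap_ms"] <= unfixed_window_ms


qwen3_asr = {"enabled": True, "left_overlap_ms": 2000, "min_audio_sec": 16.0}
print(slicing_min_chunk_index(16.0, 2.0))
print(should_slice(qwen3_asr, "hello", 8, 2.0, unfixed_chunk_num=2))
```

With the assumed 4 s unfixed-chunk window, a 2 s overlap passes the guard and an 8 s overlap fails it, which mirrors the on-at-2-s / off-at-8-s unit test.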

Model-specific tuning

The 2 s overlap and 16 s gate are tuned for Qwen3-ASR.

Qwen3-ASR may revise its last few output tokens as more audio arrives — its config marks the last 5 tokens as still-revisable (unfixed_token_num = 5, a token-level rollback window). The 2 s overlap is an empirical choice: in our fixtures 2 s of audio carried enough context to re-emit those ~5 tokens. This is a tuning assumption (≈5 tokens ≲ 2 s), not something the guard verifies — the guard only enforces the chunk-level window above. The 16 s gate keeps slicing off on short inputs, where sliced output diverged from cumulative output in manual tests. Other chunked ASR models should re-tune these before enabling slicing; this is noted in the adapter docstring.

Accuracy tests

Short audio: no regression

The 7-fixture HTTP / HTTP-SSE / WebSocket consistency checks from #22848 still hold ("fixture" = a test audio sample; SSE = Server-Sent Events, the HTTP streaming format). These fixtures stay below the 8-chunk gate, so the WebSocket path stays cumulative end-to-end.

The gate also fixes word drops that earlier slicing attempts hit on short audio:

MLK 13 s: previously lost ~8 words
Spanish 6.6 s: previously lost medio sumergidas
Hindi 4.1 s: previously lost में कितने

All three are now identical character-for-character ("byte-equal") to the HTTP paths.

Long-form audio

On long-form English TED talks (from the distil-whisper/tedlium-long-form dataset), the cumulative and sliced paths produce broadly matching final transcripts — no truncation, no hallucination divergence in the tested fixtures.

The cumulative path emits more intermediate deltas (incremental updates to the client) because it keeps producing more revisions — e.g. 767 vs 698 deltas at 300 s. The final transcript agrees. (Under slicing, the running state only holds the latest deduped tail, so the wire transcript is rebuilt from the list of deltas already sent, not from that state.)

Repetitive long-form content

The TED results do not cover the worst case. Repetitive audio exposes a different failure. On a tiled English clip ("EN180"/"EN240" = a short English clip repeated to fill 180 s / 240 s), the cumulative path over-emits badly while slicing stays bounded:

fixture	gold chars	cumulative chars	sliced chars	cumulative over-emit	sliced vs gold
EN180	~2,160	75,035	1,920	~35×	−11%
EN240	~2,880	106,531	2,524	~37×	−12%

("gold chars" = character count of the reference transcript; "over-emit" = how many times larger the output is than gold.) The cumulative output is unusable here. The sliced output is bounded but under-emits by ~11–12%, because the text-level dedupe can over-match genuinely repeated words. That is a known trade-off of this M2 implementation; a principled fix needs token- or timestamp-level alignment (M3).
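
The over-matching can be reproduced with a small word-level dedupe. The sketch below is not the PR's dedupe_overlap: the function names are hypothetical, and the lowercasing and longest-match policy are assumptions. It drops the longest run of words that ends the committed text and also starts the new candidate, comparing words after NFKC normalization with punctuation stripped from both edges.

```python
import unicodedata


def _norm(word: str) -> str:
    """NFKC-fold, lowercase, and strip punctuation from both edges."""
    w = unicodedata.normalize("NFKC", word).lower()
    start, end = 0, len(w)
    while start < end and unicodedata.category(w[start]).startswith("P"):
        start += 1
    while end > start and unicodedata.category(w[end - 1]).startswith("P"):
        end -= 1
    return w[start:end]


def dedupe_overlap_sketch(committed_text: str, candidate: str) -> str:
    """Remove the candidate's leading words that repeat the committed tail."""
    cand_words = candidate.split()
    if not cand_words:
        return candidate
    # Only the tail of the committed text can overlap the candidate.
    tail = [_norm(w) for w in committed_text.split()[-len(cand_words):]]
    cand = [_norm(w) for w in cand_words]
    for k in range(min(len(tail), len(cand)), 0, -1):
        # Require at least one real word so lone punctuation cannot match.
        if tail[-k:] == cand[:k] and any(cand[:k]):
            return " ".join(cand_words[k:])
    return candidate


print(dedupe_overlap_sketch("we went to the", "to the store today"))
# Repetitive speech: only one "no" was overlap, but three are removed.
print(dedupe_overlap_sketch("no no no", "no no no yes"))
```

On ordinary speech the overlap is removed cleanly. On repeated words, text alone cannot tell re-transcribed overlap from genuinely new repetitions, so the longest match removes too much, which is the under-emission seen on EN180/EN240.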

Unit coverage added in this PR

test/registered/unit/entrypoints/openai/test_streaming_asr.py — 8 CPU tests (no GPU), registered for CI under the base-a-test-cpu suite (est_time=3 is the CI time-budget hint, in seconds). Following the existing test_serving_transcription / test_serving_embedding suites, they drive the real process_asr_chunk entry point with a mocked TokenizerManager rather than unit-testing helpers in isolation:

process_asr_chunk scenarios (6): cumulative path injects the prompt prefix and runs no dedupe (M1); sliced path uses the bare prompt and dedupes a Latin overlap (M2); a non-overlapping candidate is kept unchanged; the final chunk dedupes then finalizes; a mid-word extension reconciles to the whole word ("worldly", not "ly"); an empty model response emits nothing without mutating state.
Slicing-enable guard (2): slicing turns on only when the left overlap fits inside the unfixed-chunk window (on at 2 s, off at 8 s), and an opted-out adapter (enabled=False) never slices.

The output dedupe (word-level) is exercised through these entry-point scenarios. Whether slicing actually turns on at runtime (the 8-chunk gate mid-stream, the first-gated-call edge, chunk-boundary behavior) stays in the manual GPU suite, not CI.

Existing coverage from #22848 should keep passing: the manual Qwen3-ASR HTTP / SSE / WebSocket tests, protocol-reject and item-lifecycle tests, the v2 unit suite, and the multilingual three-path byte-equality checks.

Speed tests and profiling

Definitions used in this section:

wall time — total elapsed real-world time, as the client sees it.
prefill tokens — how many input tokens the model had to encode on a call; more tokens = more compute. Reported as a session total and for the last chunk.
inference time — the time for one model call (audio encode + prefill + decode + inter-process messaging).

All numbers in this section were measured on a single H100 GPU with Qwen3-ASR-0.6B unless noted. The WebSocket path is driven through the OpenAI Python SDK (openai==2.6.1). Audio is pushed faster than real time unless a row is marked real-time-paced. The "cumulative" baseline is the pre-slicing _run_inference on the same server.

End-to-end wall time

TED-talk prefixes from distil-whisper/tedlium-long-form. "Total prefill tokens" sums every chunk's prefill in the session; "last-chunk prefill tokens" is just the final chunk (a proxy for worst-case per-call cost). Lower is better for both.

audio	cumulative wall	sliced wall	wall saved	cumulative total prefill tokens	sliced total prefill tokens	cumulative last-chunk prefill tokens	sliced last-chunk prefill tokens
15 s	1.50 s	1.12 s	25%	1,095	1,095	230	230
30 s	1.51 s	1.29 s	14%	2,846	838	469	58
60 s	3.00 s	2.52 s	16%	11,001	885	965	58
120 s	6.17 s	5.11 s	17%	44,627	1,770	1,958	58
240 s	14.78 s	10.07 s	32%	175,982	3,540	3,888	58
300 s	19.49 s	12.01 s	38%	144,997	1,860	4,831	58
340 s	23.57 s	14.31 s	39%	126,561	1,310	5,463	58
600 s ¹	77.24 s	26.68 s	65%	1,444,362	17,378	1,388	58
900 s ¹	171.37 s	38.23 s	78%	3,241,918	9,000	6,150	58

Slicing is 14–78% faster end-to-end here, and its last-chunk prefill flattens at 58 tokens once past the gate while the cumulative path's keeps growing. Two caveats on reading this table:

The 15 s row is below the 16 s gate, so slicing never turns on — prefill is identical on both sides (1,095 / 230). Its 25% wall difference is single-run warmup/noise, not a slicing effect.
The cumulative total-prefill column is not a clean monotonic curve (e.g. 300 s < 240 s): the TED prefixes differ in speech density, and RadixCache state carries across the sequential runs, so token counts are only a rough trend, not a controlled measurement.

The quadratic long-form cost is cleanest in wall time and in per-chunk max latency (next section: cumulative per-chunk max grows from 127 ms at 30 s to 399 ms at 300 s) — treat those as the primary evidence; the cumulative total-prefill column is supporting/illustrative.

Per-chunk model-call distribution

Each row is the distribution of single-chunk inference time (milliseconds) within one session, measured with time.perf_counter() around the model call. n_chunks = number of model calls; stdev = standard deviation; min/max = fastest/slowest single chunk.

audio	mode	n_chunks	mean	median	stdev	min	max
30 s	cumulative	15	97 ms	99 ms	19 ms	74 ms	127 ms
30 s	sliced	15	79 ms	78 ms	8 ms	66 ms	94 ms
60 s	cumulative	30	91 ms	93 ms	26 ms	55 ms	131 ms
60 s	sliced	30	80 ms	80 ms	13 ms	58 ms	109 ms
120 s	cumulative	60	99 ms	105 ms	28 ms	56 ms	143 ms
120 s	sliced	60	82 ms	80 ms	11 ms	59 ms	109 ms
300 s	cumulative	150	137 ms	143 ms	54 ms	56 ms	399 ms
300 s	sliced	150	80 ms	80 ms	12 ms	57 ms	121 ms

Sliced per-chunk time stays flat (the per-call audio is bounded); cumulative time grows as the accumulated buffer gets longer. The PCM/WAV cleanup is < 1 ms per chunk here — it is a cleanup, not the source of the speedup.
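
A table like the one above can be produced with a small timing wrapper. This is a sketch rather than the benchmark script: `timed_call` and `summarize` are hypothetical helpers, and using the sample standard deviation (`statistics.stdev`) is an assumption, since the description does not say which variant was reported.

```python
import statistics
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000.0


def summarize(durations_ms):
    """Per-session distribution of single-chunk inference times."""
    return {
        "n_chunks": len(durations_ms),
        "mean": statistics.mean(durations_ms),
        "median": statistics.median(durations_ms),
        "stdev": statistics.stdev(durations_ms),  # needs at least 2 samples
        "min": min(durations_ms),
        "max": max(durations_ms),
    }


_, elapsed_ms = timed_call(sum, range(1000))  # stand-in for the model call
stats = summarize([74.0, 99.0, 127.0])
print(stats["mean"], stats["median"], stats["max"])
```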

Cross-model total inference time

"Total inference time" = the sum of every per-chunk model-call duration across a whole run (so it captures total GPU work, not just wall time). Lower is better. Runs are real-time-paced on long-form English; the 1.7B run is 53 sessions / 536 model calls.

model	cumulative total inference time	sliced total inference time	reduction
Qwen3-ASR-0.6B	461.0 s	276.2 s	−40.1%
Qwen3-ASR-1.7B	181.8 s	105.4 s	−42.0%

The two models reduce by nearly the same amount (−40% vs −42%), which suggests the win comes from bounding the call pattern, not from anything specific to the 0.6B model.

Multilingual short-audio sanity

These fixtures stay below the gate, so both modes take the cumulative path and should match exactly. "deltas" = incremental updates sent to the client.

fixture	language	audio	cumulative wall	sliced wall	cumulative deltas	sliced deltas	transcripts
zh_4s	Chinese	4.2 s	0.59 s	0.58 s	1	1	byte-equal
hi_4s	Hindi	4.1 s	0.37 s	0.37 s	6	6	byte-equal
es_7s	Spanish	6.6 s	0.43 s	0.35 s	14	14	byte-equal
libri_10s	English	10.4 s	0.53 s	0.48 s	27	27	byte-equal
mlk_13s	English	13.0 s	0.54 s	0.48 s	21	21	byte-equal

HTTP SSE vs WebSocket realtime

HTTP SSE (the streaming HTTP endpoint) stays cumulative; the WebSocket path uses this PR's slicing after the gate. Both driven via the OpenAI SDK.

audio	HTTP SSE wall	WebSocket wall	WebSocket vs SSE
30 s	1.43 s	1.33 s	−7%
60 s	2.83 s	2.54 s	−10%
120 s	6.30 s	5.17 s	−18%
300 s	22.82 s	12.52 s	−45%

Short audio is close; the gap grows with length because SSE still re-encodes the full cumulative buffer.

Known limits

Short audio under the gate. Slicing does not turn on before ~16 s; short audio stays cumulative (and byte-equal — see the multilingual table).
Repetitive long-form content. Text-level dedupe can over-match genuinely repeated words, so slicing under-emits by ~11–12% on EN180/EN240. Much better than the cumulative ~35–37× over-emission, but still a correctness trade-off.
Held-back words on slow or paused speech. Each sliced inference holds back the last unfixed_token_num words (Qwen3-ASR: 5) for the next pass to confirm; the next slice re-covers them through the 2 s left overlap. But the hold-back is counted in tokens while the overlap is measured in time, and last_sliced_buffer_end_bytes jumps to the end of the full buffer after each sliced call, so a held word is recovered only if it lies within the last ~2 s of audio. If the last 5 words span more than the overlap (slow/spontaneous speech, or a pause among them), the earliest held word falls before the next slice's start and is dropped, with no later chance to recover. Fast continuous speech (~3 words/s → ~1.7 s for 5 words, e.g. the TED fixtures) stays inside the overlap, so the long-form results above do not surface it. The slicing-enable guard checks a chunk-time window, not this token-span-vs-overlap relationship. The robust fix makes the hold-back time-based (hold the last X s, require overlap ≥ X); that is M3 work, together with token/timestamp alignment. (This affects space-delimited languages only; CJK never reaches the sliced path, as described under "CJK gets no slicing benefit yet" below, so it cannot lose held words this way.)
Short-audio high concurrency. In a saturation test (64 concurrent ~15 s sessions on 0.6B) both paths drop 32/64 sessions at similar throughput — the bottleneck there is not per-chunk compute, so this PR does not move it. These sessions are below the gate, so both paths run cumulative; long-session concurrency — where bounding per-call work should let more streams run at once — was not measured and is future work.
CJK gets no slicing benefit yet. Slicing only activates once update() has confirmed text to anchor against (get_prefix_text() non-empty). update() finds confirmed words by splitting on whitespace, which space-less scripts (Chinese, Japanese) never satisfy — so emitted_text stays empty, the slicing gate never opens, and CJK long-form runs the cumulative path for the whole session: no per-call bounding, and the transcript arrives as a single burst at commit instead of incrementally. Measured on this branch (Qwen3-ASR-0.6B): a 48 s Mandarin clip → 1 delta, prefill grew to ~629 tokens, ~21 s wall (cumulative); a 37 s English clip → 107 deltas, a flat 58-token prefill, ~5 s wall (sliced). The CJK transcript is still correct (byte-equal to the full-context HTTP result) — slicing simply never engages. Making CJK benefit needs CJK-aware confirmation (character/token-level rollback) = M3.
Conversational short turns. Turns under the gate keep the old cumulative behavior; this is specifically a long-form fix.
Session memory. This PR bounds per-call model input and encoder/prefill work, but does not compact the stored audio buffer (pcm_buffer); session memory is still capped by --asr-max-buffer-seconds. A rolling buffer is future work.
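
The held-back-words limit above comes down to a comparison between when the held words were spoken and where the next slice starts. The sketch below illustrates that check with hypothetical per-word start times; the current path has no such timestamps (that alignment is the M3 work mentioned above), so this shows the condition, not something the server computes.

```python
def held_words_recoverable(held_word_start_sec, buffer_end_sec,
                           overlap_sec=2.0):
    """True if every held-back word starts inside the next slice's overlap."""
    next_slice_start = buffer_end_sec - overlap_sec
    return all(t >= next_slice_start for t in held_word_start_sec)


# Fast speech: the last 5 words all start within the final 2 s.
fast = [18.4, 18.7, 19.0, 19.4, 19.7]
# Slow speech: the earliest held words start before the overlap begins.
slow = [16.8, 17.5, 18.2, 18.9, 19.6]
print(held_words_recoverable(fast, buffer_end_sec=20.0))
print(held_words_recoverable(slow, buffer_end_sec=20.0))
```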

How to send requests

Server launch:

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-ASR-0.6B \
  --served-model-name qwen3-asr \
  --trust-remote-code \
  --host 127.0.0.1 --port 30000

HTTP non-streaming:

curl -s -X POST http://127.0.0.1:30000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=qwen3-asr \
  -F language=en

HTTP SSE (streaming):

curl -N -X POST http://127.0.0.1:30000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=qwen3-asr \
  -F stream=true

WebSocket realtime via the OpenAI Python SDK:

import asyncio
import base64

import numpy as np
import soundfile as sf
from openai import AsyncOpenAI


async def transcribe(path: str, language: str = "en") -> str:
    data, sr = sf.read(path, dtype="float32")
    if data.ndim > 1:
        data = data.mean(axis=1)

    if sr not in (16000, 24000, 48000):
        n = int(len(data) / sr * 16000)
        data = np.interp(
            np.linspace(0, len(data) - 1, n),
            np.arange(len(data)),
            data,
        )
        sr = 16000

    # Clip first so out-of-range samples saturate instead of wrapping in int16.
    pcm = (np.clip(data, -1.0, 1.0) * 32767).astype(np.int16).tobytes()

    client = AsyncOpenAI(
        base_url="http://127.0.0.1:30000/v1",
        websocket_base_url="ws://127.0.0.1:30000/v1",
        api_key="x",
    )

    async with client.realtime.connect(model="qwen3-asr") as conn:
        await conn.send(
            {
                "type": "session.update",
                "session": {
                    "type": "transcription",
                    "audio": {
                        "input": {
                            "format": {"type": "audio/pcm", "rate": sr},
                            "transcription": {
                                "model": "qwen3-asr",
                                "language": language,
                            },
                            "noise_reduction": None,
                            "turn_detection": None,
                        }
                    },
                },
            }
        )

        async for evt in conn:
            if getattr(evt, "type", None) == "session.updated":
                break

        chunk_bytes = sr  # 0.5 s of int16 PCM at sr Hz
        for off in range(0, len(pcm), chunk_bytes):
            await conn.send(
                {
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(pcm[off : off + chunk_bytes]).decode(),
                }
            )

        await conn.send({"type": "input_audio_buffer.commit"})

        async for evt in conn:
            if (
                getattr(evt, "type", None)
                == "conversation.item.input_audio_transcription.completed"
            ):
                return getattr(evt, "transcript", "")


if __name__ == "__main__":
    print(asyncio.run(transcribe("audio.wav")))

Reproducing the numbers

Wall time is measured on the client with time.perf_counter() around the SDK call.

Prefill-token counts come from SGLang's scheduler log, one line per batch — #new-token is tokens prefilled fresh, #cached-token is tokens reused from the prefix cache:

[YYYY-MM-DD HH:MM:SS] Prefill batch, #new-seq: 1, #new-token: 58, #cached-token: 1024, ...

WebSocket sessions are delimited in the log by WebSocket /v1/realtime ... [accepted] (start) and connection closed (end).
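
A minimal sketch of extracting the per-batch counts from log lines in the format shown above. The regular expression and the function name are illustrative, and summing `#new-token` for the session total (with the last matching line as the last-chunk value) is an assumption about how the table columns were derived.

```python
import re

# Matches the two counters on a "Prefill batch" scheduler log line.
PREFILL_RE = re.compile(
    r"Prefill batch.*?#new-token:\s*(\d+).*?#cached-token:\s*(\d+)"
)


def prefill_stats(log_lines):
    """Aggregate prefill token counts over one session's log lines."""
    new_tokens, cached_tokens = [], []
    for line in log_lines:
        match = PREFILL_RE.search(line)
        if match:
            new_tokens.append(int(match.group(1)))
            cached_tokens.append(int(match.group(2)))
    return {
        "total_new_tokens": sum(new_tokens),
        "last_new_tokens": new_tokens[-1] if new_tokens else 0,
        "total_cached_tokens": sum(cached_tokens),
    }


sample_log = [
    "[2026-05-31 15:00:00] Prefill batch, #new-seq: 1, #new-token: 230, #cached-token: 0, ...",
    "[2026-05-31 15:00:02] Decode batch, #running-req: 1, ...",
    "[2026-05-31 15:00:04] Prefill batch, #new-seq: 1, #new-token: 58, #cached-token: 1024, ...",
]
print(prefill_stats(sample_log))
```

The sample lines are made up for the example; only the `Prefill batch` line shape is taken from the log excerpt above.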

The 600 s / 900 s cumulative rows were measured with a raw websockets client and keepalive disabled (ping_interval=None), because the OpenAI SDK's default 20 s keepalive interval dropped the cumulative path at those lengths (1011 keepalive ping timeout). The sliced path completes through the stock SDK; sliced numbers match within ~1% either way.

Checklist

Format code with pre-commit. black-jupyter, isort, ruff, codespell, ast, EOL, and whitespace checks pass.
Add unit tests. 8-case CPU suite at test/registered/unit/entrypoints/openai/test_streaming_asr.py (base-a-test-cpu, est_time=3): entry-point scenarios through process_asr_chunk (cumulative/sliced paths, word-level dedupe, finalize, reconciliation, empty response) plus the slicing-enable guard. Runtime slicing behavior stays in the manual GPU suite.
Update documentation. Docstrings added/updated for the slicing state, adapter config, _run_inference, process_asr_chunk, and the dedupe helpers.
Provide accuracy and speed results. See the sections above.
Follow SGLang code style.

Builds on [Feature] WebSocket streaming audio input for ASR #22848 (initial WebSocket realtime ASR path).
Implements the M2 long-form cost-bounding milestone of RFC [RFC]: Real-Time Streaming Audio Input for ASR Models #22474 via input slicing.
Exact RadixCache prefix reuse and token-level streaming / alignment remain future work.
The PCM/WAV round-trip removal is a small cleanup bundled here because it touches the same _run_inference path.

CI States

Latest PR Test (Base): ❌ Run #26718039502
Latest PR Test (Extra): ❌ Run #26718039444

¹ The 600 s / 900 s cumulative rows use a raw websockets client with keepalive disabled (ping_interval=None). The OpenAI SDK path hit 1011 keepalive ping timeout on the cumulative path at those lengths. Sliced numbers match the SDK-driven runs within ~1%.

PCM16 from the WebSocket path was being encoded into WAV bytes only for load_audio to decode it back into a float ndarray. Convert directly to float samples (1/32768 normalization matches soundfile.read default for signed 16-bit, so the float values are bit-equal to the old path), and teach load_audio to accept a pre-decoded ndarray as a no-op passthrough. ASR/cache semantics unchanged — this only removes the WAV adapter layer. A future optimization could maintain decoded samples incrementally to avoid re-converting the cumulative PCM buffer on every chunk.

…forward

Replace the WS /v1/realtime cumulative inference path (re-send the whole PCM buffer on every chunk) with input slicing once a committed-text prefix exists.

Once StreamingASRState has stable emitted text, is past the K-token holdback gate, and has accumulated at least eight chunks (~16 s) of cumulative context, the model runs on ``pcm_buffer[committed_audio_until_bytes - left_overlap_bytes:]`` plus a 2 s left overlap instead of the full buffer. The prompt stays at ``adapter.prompt_template``: emitted_text is not injected as a continuation prefix; the retained acoustic overlap plus a word-level dedupe (with CJK char-level fallback) takes its place. The first gated call still starts at offset 0 because committed_audio_until_bytes is initialized to 0; only chunk 9 onward is bounded to overlap + new chunk.

Performance (TED-LIUM long-form sweep on Qwen3-ASR-0.6B, H100):

audio	cumul wall	sliced wall	save
30 s	1.51 s	1.29 s	14 %
60 s	3.00 s	2.52 s	16 %
120 s	6.17 s	5.11 s	17 %
240 s	14.78 s	10.07 s	32 %
300 s	19.49 s	12.01 s	38 %
600 s	77.24 s	26.68 s	65 %
900 s	171.37 s	38.23 s	78 %

Per-chunk model-call wall stays flat at ~80 ms mean / ~121 ms max across the sweep instead of growing to 137 ms mean / 399 ms max in the cumulative path at 300 s. Realtime-paced sum of per-chunk inference wall drops 40-42 % on both 0.6B and 1.7B Qwen3-ASR.

Implementation:

- ``adapter.realtime_slicing_config`` returns left_overlap_ms (default 2000) and min_audio_sec (default 16.0); slicing_min_chunk_index is derived as ceil(min_audio_sec / chunk_size_sec).
- ``_slice_pcm_from`` snapshots the bytearray via memoryview so the per-chunk copy is slice-sized instead of full-buffer + slice (~7.7 MB -> ~128 KB at 240 s when slicing engaged).
- ``dedupe_overlap`` normalizes only the tail of committed_text bounded by len(candidate_words), so dedupe cost does not grow with session length.
- ``process_asr_chunk`` gains ``prompt: Optional[str]`` and ``dedupe_against: Optional[str]`` kwargs; the realtime path uses them, the HTTP / HTTP SSE path keeps existing behavior via defaults.
- ``load_audio`` annotation widened from ``str`` to ``Union[str, bytes, np.ndarray]`` to match the existing isinstance branches; not exposed through any Pydantic schema path.

Tests: 21-case CI unit suite at test/registered/unit/entrypoints/openai/test_streaming_asr.py covering dedupe_overlap (word + CJK + suffix-only-history invariant), _pcm_to_float_samples (normalization + soundfile-round-trip bit-equality + odd-length raises), and _slice_pcm_from validation.

After a stability gate (8 chunks / ~16s for Qwen3-ASR), the realtime WebSocket path runs inference on a bounded audio tail (the new chunk + a 2s left overlap) instead of the full cumulative PCM buffer, with output-side dedupe. Slicing is opt-in per adapter: the base config keeps it off; Qwen3-ASR enables it. Short audio and non-opting adapters keep the cumulative path unchanged.

Also in this changeset:

- StreamingASRState.update(): drop the char-level startswith fast path that emitted mid-word fragments ("world" -> "worldly" emitted "ly"); the word-level common-prefix scan now runs unconditionally (matching finalize()).
- Convert PCM16 to float directly, skipping the PCM -> WAV -> ndarray round-trip; load_audio accepts a pre-decoded ndarray.
- Add a 14-case CPU unit suite (process_asr_chunk integration, slicing-enable guard, update() reconciliation, dedupe rules, PCM/slice helpers).

Refines the M2 input-slicing output dedupe and its test suite:

- Dedupe normalization uses NFKC + Unicode category-P edge stripping (Whisper-style) instead of a hand-listed punctuation set.
- Split CJK detection into _is_cjk_no_space (spacing) and _is_cjk_dedupe (dedupe, narrower); use Script_Extensions so the kana marks U+30FC/U+30FB are covered; keep Hangul out (Korean is space-delimited).
- CJK dedupe is boundary-only: compare the leading/trailing CJK runs and never skip interior non-CJK content; require a >=2-glyph overlap for letters, allow 1 for punctuation.
- Fix spurious over-deletion when lone-punctuation tokens normalize to "" and match each other; require a real word in the matched overlap.
- _dedupe_by_word rsplits the committed tail instead of tokenizing the whole growing transcript.
- Rewrite the unit tests as entry-point scenarios through process_asr_chunk plus the slicing-enable guard.

Tighten slicing-path comments to one-liners (base adapter config docstring, session.py slicing_enabled / emitted_deltas / _run_inference, streaming_asr predicate header). No logic change.

CJK never enters the sliced path -- the slicing gate needs confirmed text, which the whitespace word-split in StreamingASRState.update never produces for space-less scripts -- so the CJK char-level dedupe was unreachable for CJK and only added review surface. dedupe_overlap is now word-level only; the spacing predicate (needs_space) and word-level dedupe (incl. the punctuation-overlap fix) stay. CJK-aware dedupe is deferred to M3, where slicing also engages for CJK.

Drop the regex/scx rewrite of the spacing predicate back to the baseline codepoint _is_cjk (removes the new `regex` dependency); add a halfwidth Hangul jamo guard so the function matches its docstring. Fix a stale "token-level" comment in update() (the scan is word-level; token-level rollback is M3) and shorten _dedupe_norm's docstring.
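To illustrate why the scan in update() is word-level, the sketch below contrasts a character-level startswith delta with a word-level common prefix on the "world" -> "worldly" case from the earlier commit. Both helpers are hypothetical and written only for this comparison; they are not the StreamingASRState implementation.

```python
from typing import List


def char_level_delta(emitted: str, hypothesis: str) -> str:
    """The removed fast path, roughly: emit whatever follows the old text."""
    return hypothesis[len(emitted):] if hypothesis.startswith(emitted) else ""


def common_word_prefix(previous: str, hypothesis: str) -> List[str]:
    """Words that are identical, position by position, in both hypotheses."""
    stable = []
    for old, new in zip(previous.split(), hypothesis.split()):
        if old != new:
            break
        stable.append(old)
    return stable


fragment = char_level_delta("hello world", "hello worldly")
stable_words = common_word_prefix("hello world", "hello worldly")
```

The character-level path would emit the fragment "ly", while the word-level scan treats only "hello" as stable and leaves the changed last word open for revision.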

"committed_audio_until_bytes" collided with the OpenAI realtime input_audio_buffer.commit concept; the field actually marks the PCM offset the previous sliced inference consumed up to. Rename it across the field declaration, slice-start arithmetic, anchor update, and the per-item reset. Also fix a stale test docstring left over from the CJK-dedupe rollback ("Latin and CJK" -> word-level dedupe).

gemini-code-assist

Code Review

This pull request introduces a realtime ASR slicing path to optimize inference by switching from cumulative buffers to tail slices with left overlap and output deduplication. It updates RealtimeConnection and StreamingASRState to support slicing configuration, adds word-level deduplication, and enables slicing for Qwen3-ASR. Feedback focuses on improving robustness and efficiency, including moving slicing and float conversion inside the try-except block in _run_inference, using float32 instead of float64 for audio samples to reduce memory usage, and ensuring that left_overlap_bytes and slicing offsets are aligned to the 16-bit PCM sample width boundary to prevent audio corruption.

gemini-code-assist · 2026-05-31T15:53:54Z

+        slicing_opt_in = bool(slicing_cfg.get("enabled", False))
+        left_overlap_ms = int(slicing_cfg.get("left_overlap_ms", 0))
+        min_audio_sec = float(slicing_cfg.get("min_audio_sec", 0.0))
+        left_overlap_bytes = int(left_overlap_ms / 1000 * self.bytes_per_second)


To prevent potential audio misalignment and corruption, left_overlap_bytes should be explicitly aligned to a multiple of _SAMPLE_WIDTH (2 bytes). If left_overlap_bytes is not aligned, slicing the PCM buffer could cut a 16-bit sample in half, leading to static noise or runtime errors during conversion.

Suggested change

-        left_overlap_bytes = int(left_overlap_ms / 1000 * self.bytes_per_second)
+        left_overlap_bytes = int(left_overlap_ms / 1000 * self.bytes_per_second) // _SAMPLE_WIDTH * _SAMPLE_WIDTH

gemini-code-assist · 2026-05-31T15:53:54Z

+def _slice_pcm_from(buffer: Union[bytes, bytearray], start: int) -> bytes:
+    """Return an immutable ``buffer[start:]`` snapshot with bounds checking."""
+    if not (0 <= start <= len(buffer)):
+        raise ValueError(f"_slice_pcm_from: start={start} not in [0, {len(buffer)}]")
+    return bytes(memoryview(buffer)[start:])


As a defensive programming practice, consider adding a check in _slice_pcm_from to ensure that the start offset is a multiple of _SAMPLE_WIDTH. This guarantees that the sliced buffer is properly aligned to 16-bit PCM boundaries, preventing silent audio corruption or misalignment.

Suggested change

-def _slice_pcm_from(buffer: Union[bytes, bytearray], start: int) -> bytes:
-    """Return an immutable ``buffer[start:]`` snapshot with bounds checking."""
-    if not (0 <= start <= len(buffer)):
-        raise ValueError(f"_slice_pcm_from: start={start} not in [0, {len(buffer)}]")
-    return bytes(memoryview(buffer)[start:])
+def _slice_pcm_from(buffer: Union[bytes, bytearray], start: int) -> bytes:
+    """Return an immutable ``buffer[start:]`` snapshot with bounds checking."""
+    if not (0 <= start <= len(buffer)):
+        raise ValueError(f"_slice_pcm_from: start={start} not in [0, {len(buffer)}]")
+    if start % _SAMPLE_WIDTH != 0:
+        raise ValueError(f"_slice_pcm_from: start={start} must be a multiple of {_SAMPLE_WIDTH}")
+    return bytes(memoryview(buffer)[start:])

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

SammLSH · 2026-05-31T16:11:25Z

cc @AgainstEntropy


SammLSH added 8 commits May 30, 2026 06:49

Trim verbose comments/docstrings in realtime ASR slicing

57fcd0d


SammLSH requested review from CatherineSue, JustinTong0323, ispobock, merrymercy and slin1237 as code owners May 31, 2026 15:52

gemini-code-assist Bot reviewed May 31, 2026


Update python/sglang/srt/entrypoints/openai/realtime/session.py

36b3322

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update python/sglang/srt/entrypoints/openai/realtime/session.py

31dbc97

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

SammLSH changed the title ~~Feat/realtime asr input slicing~~ [Feature] Realtime ASR: bound long-form audio with input slicing May 31, 2026

SammLSH changed the title ~~[Feature] Realtime ASR: bound long-form audio with input slicing~~ [Feature] Realtime ASR: input slicing for long-running sessions Jun 1, 2026

SammLSH changed the title ~~[Feature] Realtime ASR: input slicing for long-running sessions~~ [Feature] Realtime ASR: Input Slicing for Long-Running Realtime ASR Sessions Jun 1, 2026


[Feature] Realtime ASR: Input Slicing for Long-Running Realtime ASR Sessions #26853
SammLSH wants to merge 10 commits into sgl-project:main from SammLSH:feat/realtime-asr-input-slicing


Motivation

Relationship to RFC M2

Modifications

Files touched

PCM/WAV cleanup

Reconciliation fix

When slicing turns on

Model-specific tuning

Accuracy tests

Short audio: no regression

Long-form audio

Repetitive long-form content

Unit coverage added in this PR

Speed tests and profiling

End-to-end wall time

Per-chunk model-call distribution

Cross-model total inference time

Multilingual short-audio sanity

HTTP SSE vs WebSocket realtime

Known limits

How to send requests

Reproducing the numbers

Checklist

Related

CI States

Footnotes
