[Feature] Realtime ASR: Input Slicing for Long-Running Realtime ASR Sessions #26853

Open
SammLSH wants to merge 10 commits into sgl-project:main from SammLSH:feat/realtime-asr-input-slicing

Contributor

@SammLSH SammLSH commented May 31, 2026

Motivation

This PR implements the M2 milestone of RFC #22474: keeping long-form realtime speech-to-text (ASR) cheap to run.

Background on how the realtime path works today. The WebSocket endpoint /v1/realtime (added in #22848) transcribes audio as it streams in: the client sends audio, and the server periodically runs the model on what it has so far and pushes partial text back. Audio is processed in fixed-length pieces called chunks (2 s each for Qwen3-ASR). Today, on every chunk the server re-sends the entire audio accumulated so far — the "cumulative" approach. So per-chunk work grows as the session gets longer, and total work over a session is quadratic in its length.

This PR adds an input slicing path for long sessions. Once enough audio has accumulated (a "stability gate"), each inference runs on only a bounded tail of the audio — the newest chunk plus a short slice of the audio just before it (the "left overlap") — instead of the whole buffer. Short audio, and any model that doesn't opt in, keep the old cumulative behavior unchanged.
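The difference between the two call patterns can be made concrete with a toy count of how many seconds of audio are fed to the model over a session. This is only a sketch: it uses the Qwen3-ASR defaults from this PR, assumes that chunks 1 to 8 still receive the whole buffer and that every later chunk receives chunk + overlap, and ignores a trailing partial chunk. The function name `audio_seconds_fed` is hypothetical, and the result is not a latency prediction.

```python
import math

CHUNK_SEC = 2.0         # Qwen3-ASR chunk length (from this PR)
LEFT_OVERLAP_SEC = 2.0  # left overlap (from this PR)
MIN_AUDIO_SEC = 16.0    # stability gate (from this PR)


def audio_seconds_fed(session_sec: float, sliced: bool) -> float:
    """Total seconds of audio passed to the model over one session (toy model)."""
    n_chunks = math.ceil(session_sec / CHUNK_SEC)
    gate = math.ceil(MIN_AUDIO_SEC / CHUNK_SEC)  # 8 chunks
    total = 0.0
    for k in range(1, n_chunks + 1):
        if sliced and k > gate:
            # Assumption: past the gate, each call sees one chunk plus the overlap.
            total += CHUNK_SEC + LEFT_OVERLAP_SEC
        else:
            # Cumulative behavior: the call sees everything accumulated so far.
            total += k * CHUNK_SEC
    return total


if __name__ == "__main__":
    for sec in (16, 60, 300, 900):
        print(sec, audio_seconds_fed(sec, sliced=False), audio_seconds_fed(sec, sliced=True))
```

Under these assumptions the cumulative total grows roughly with the square of the session length, while the sliced total grows linearly once the gate is passed.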

Terms used throughout:

  • chunk — a fixed-length piece of audio the server runs the model on (2 s for Qwen3-ASR).
  • cumulative path — re-send all audio so far on every chunk (the old behavior).
  • sliced path — send only the recent tail (this PR, for long audio).
  • prefill — the model encoding its input before it generates output; its cost scales with the number of input tokens.
  • delta — one incremental piece of transcript text the server pushes to the client as it goes.
  • gold transcript — the reference / ground-truth transcript we compare against.

In our measurements, slicing addresses the main long-form problems:

  1. Per-chunk latency stops growing. On a 300 s session, the slowest single inference drops from 399 ms (cumulative) to 121 ms (sliced). End-to-end time (total elapsed wall-clock time) improves 14–78% across 30–900 s. Under real-time pacing, per-chunk inference time drops 40–42% on both Qwen3-ASR-0.6B and Qwen3-ASR-1.7B.

  2. Runaway repeated output. On audio with repeated content, the cumulative path keeps re-transcribing and re-sending old text. For a 180 s repetitive clip (which we call "EN180"), the cumulative output balloons to 75,035 characters versus about 2,160 in the gold transcript. The same 75,035-character output appears on both 0.6B and 1.7B, so it is the cumulative call pattern, not a model-size issue. Slicing stays near the gold length.

  3. Dropped connections on very long sessions. A WebSocket connection exchanges periodic "keepalive" pings; if one side is too busy to answer in time, the connection is closed. In our 600 s / 900 s runs through the official OpenAI Python client library (the "SDK"), the cumulative path was busy long enough to trip this and closed with 1011 keepalive ping timeout. The sliced path stayed responsive because each inference is bounded.

  4. Audio-encoder memory growth. The part of the model that turns audio into features (the "audio encoder", also called the audio tower) sees more audio every chunk on the cumulative path, so its memory grows with session length. We did not hit out-of-memory (OOM) in these runs, but slicing caps the per-call audio and removes the risk.

This PR also removes a small inefficiency: the realtime path used to convert raw 16-bit audio samples (PCM16) into a WAV byte stream just so a shared helper could decode them back into a floating-point array. That round-trip is gone.

Defaults for Qwen3-ASR:

  • chunk length: 2 s
  • slicing turns on after 16 s of audio, i.e. ceil(16 / 2) = 8 chunks
  • left overlap: 2 s
  • steady-state sliced input: one new chunk + 2 s overlap ≈ 4 s

Relationship to RFC M2

RFC #22474 sketches M2 as cross-chunk prefix caching via RadixCache (sglang's cache that reuses computation for inputs sharing a leading prefix): keep the full audio context, let the cache match the shared leading tokens, and run prefill only on the new tail tokens.

This PR does not take that exact route. It targets the same M2 goal — removing the quadratic long-form cost — with input slicing instead.

| Dimension | RFC sketch: RadixCache prefix caching | This PR: input slicing |
| --- | --- | --- |
| Mechanism | Keep the full audio context; cache reuses the shared leading tokens | After the gate, feed only a bounded tail window |
| What it bounds | LLM prefill only | Audio-encoder input and LLM prefill |
| Encoder cost | Still re-runs on the full audio buffer | Bounded by chunk + overlap |
| Cache key | Needs a content-aware audio prefix key (the current key hashes the whole audio tensor, not a prefix) | No cache-key change |
| Correctness | Exact (same tokens, only cached) | Approximate (acoustic overlap + text dedupe) |

Why this route:

  • RadixCache prefix reuse does not bound audio-encoder work on its own — the encoder still sees the full accumulated buffer unless the audio input is also bounded. Slicing bounds both.
  • The current multimodal cache key is not a prefix-aware audio key; adding one is a larger change to the cache and multimodal processor.
  • Slicing is lower-risk for this PR: it keeps the change local to the realtime ASR path and adapter config.
  • The cost is exactness — slicing relies on acoustic overlap plus output-side dedupe, so it is approximate, not exact prefix reuse.

Exact RadixCache prefix reuse and token-level streaming / alignment remain valid future work, and can compose with slicing later.

Modifications

The main change is in the realtime WebSocket inference path (_run_inference). Once the streaming state has stable emitted text and the session has passed the 8-chunk gate, _run_inference sends only a tail audio window instead of the full accumulated PCM buffer.

The sliced path:

  • uses the bare prompt template (adapter.prompt_template), and does not inject the already-emitted text back into the prompt — the retained acoustic overlap is the continuity signal instead;
  • slices the audio buffer at last_sliced_buffer_end_bytes - left_overlap_bytes (i.e. the audio since the previous slice's end, plus the left overlap);
  • converts PCM16 directly to float samples (no WAV round-trip);
  • runs an output-side dedupe (dedupe_overlap) that removes text the overlap caused the model to re-transcribe, before it reaches the streaming state.

The first inference past the gate still starts from offset 0 (the last-sliced-end marker is initialized to 0), so that one call feeds the full buffer; every call after it is the ~4 s steady-state window.
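The slice-start arithmetic described above can be sketched as follows. This is an illustration, not the code in realtime/session.py: `slice_tail` and `ms_to_bytes` are hypothetical names, and the 16 kHz mono PCM16 input rate is an assumption.

```python
SAMPLE_RATE_HZ = 16_000  # assumption: 16 kHz mono input
BYTES_PER_SAMPLE = 2     # PCM16


def ms_to_bytes(ms: int) -> int:
    """Convert a duration in milliseconds to a PCM16 byte count."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * ms // 1000


def slice_tail(pcm_buffer: bytearray, last_end_bytes: int, left_overlap_ms: int) -> bytes:
    """Return the audio since the previous slice's end, plus the left overlap."""
    start = max(0, last_end_bytes - ms_to_bytes(left_overlap_ms))
    start -= start % BYTES_PER_SAMPLE  # stay aligned to whole samples
    # Slicing a memoryview copies only the tail, not the whole buffer first.
    return bytes(memoryview(pcm_buffer)[start:])


if __name__ == "__main__":
    buf = bytearray(ms_to_bytes(18_000))                   # 18 s of silence
    print(len(slice_tail(buf, 0, 2000)))                   # marker still 0: full buffer
    print(len(slice_tail(buf, ms_to_bytes(16_000), 2000))) # 2 s new audio + 2 s overlap
```

With the marker at 0 the function returns the whole buffer, which matches the first gated call described above; afterwards the result is the roughly 4 s steady-state window.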

Slicing is opt-in per adapter. The base adapter keeps it off by default; Qwen3-ASR turns it on:

{"enabled": True, "left_overlap_ms": 2000, "min_audio_sec": 16.0}

Files touched

| File | Change |
| --- | --- |
| realtime/session.py | Tail-window slicing in _run_inference; new slicing state fields + opt-in guard; _pcm_to_float_samples and _slice_pcm_from (replace the PCM→WAV round-trip) |
| streaming_asr.py | New dedupe_overlap; process_asr_chunk gains prompt / dedupe_against args; fixes the StreamingASRState.update() reconciliation bug below |
| transcription_adapters/base.py | New realtime_slicing_config, defaulting to enabled=False (slicing off) |
| transcription_adapters/qwen3_asr.py | Opts Qwen3-ASR in (2 s overlap, 16 s min audio) |
| utils/common.py | load_audio accepts a pre-decoded array and returns it directly |
| test/registered/unit/entrypoints/openai/test_streaming_asr.py | New 8-case CPU unit suite (entry-point integration + slicing-enable guard) |

PCM/WAV cleanup

The realtime path used to convert PCM16 → WAV bytes and then call load_audio, which decoded the WAV back into float samples. This PR converts PCM16 directly:

np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0

/ 32768.0 matches soundfile.read's default 16-bit normalization, so the result is identical (bit-for-bit) to the old path by construction. load_audio now also accepts an already-decoded array and returns it directly, so the realtime path never enters the file/byte decoder at all.
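For readers who want to see the scaling without NumPy, here is a pure-Python mirror of the expression above. It is a sketch only: `pcm16_to_floats` is a hypothetical name, little-endian sample order is assumed, and rejecting odd-length payloads is an assumption about how a helper like this might validate its input.

```python
import struct


def pcm16_to_floats(pcm: bytes) -> list[float]:
    """Decode little-endian PCM16 bytes into floats in [-1.0, 1.0)."""
    if len(pcm) % 2:
        # Assumption: a half sample is treated as a caller error.
        raise ValueError("PCM16 payload must have an even number of bytes")
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm)
    # Dividing by 32768 maps int16 min (-32768) to exactly -1.0.
    return [s / 32768.0 for s in samples]


if __name__ == "__main__":
    raw = struct.pack("<4h", 0, 16384, -32768, 32767)
    print(pcm16_to_floats(raw))
```

The largest positive sample maps to 32767 / 32768, slightly below 1.0, which is why the range is half-open.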

Reconciliation fix

StreamingASRState.update() computes which new words to emit each chunk. It previously had a character-level shortcut:

confirmed_text.startswith(old_confirmed)

Because that compares characters, not whole words, it cut mid-word when the model extended a previously-emitted word. For example, when "world" became "worldly", it emitted the fragment "ly" instead of the corrected word "worldly".

This PR removes the shortcut and always uses the word-level common-prefix scan that already lived below it (and that finalize() already uses). A CPU test drives this through process_asr_chunk: a mid-word extension ("world" → "worldly") must emit "worldly", not "ly". The behavior dates back to the original chunked-streaming path in #22089; it is fixed here because the sliced path also routes its output through update().
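The two reconciliation strategies can be compared side by side in a small sketch. The function names are hypothetical and the logic is a simplified stand-in for StreamingASRState.update(), not a copy of it.

```python
def emit_word_level(old_confirmed: str, confirmed: str) -> str:
    """Emit every word after the longest common word-level prefix."""
    old_words = old_confirmed.split()
    new_words = confirmed.split()
    common = 0
    for old, new in zip(old_words, new_words):
        if old != new:
            break
        common += 1
    return " ".join(new_words[common:])


def emit_char_level(old_confirmed: str, confirmed: str) -> str:
    """Old shortcut: if the new text extends the old one, emit the extra characters."""
    if confirmed.startswith(old_confirmed):
        return confirmed[len(old_confirmed):]
    return emit_word_level(old_confirmed, confirmed)


if __name__ == "__main__":
    print(repr(emit_char_level("hello world", "hello worldly")))  # fragment
    print(repr(emit_word_level("hello world", "hello worldly")))  # whole word
```

The character-level version yields the fragment "ly" for this input, while the word-level scan yields the corrected word "worldly".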

When slicing turns on

Slicing runs only when all of these hold:

  • the adapter opted in (realtime_slicing_config["enabled"] == True);
  • there is already-emitted text to anchor the dedupe (state.get_prefix_text() is non-empty);
  • enough chunks have accumulated (state.chunk_index >= slicing_min_chunk_index);
  • the left overlap fits inside the unfixed-chunk window (unfixed_chunk_num × chunk_size), so some fresh audio always remains for the dedupe to anchor against (otherwise slicing auto-disables — see below). This is a structural check on chunk-sized audio, distinct from the token-level rollback tuning in Model-specific tuning.

For Qwen3-ASR: chunk_size_sec = 2.0, min_audio_sec = 16.0, left_overlap_ms = 2000. So the gate is ceil(16 / 2) = 8 chunks (≈ 16 s). Below the gate, the path stays cumulative — short audio is unchanged, which avoids the short-input divergence we saw in manual tests.
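Taken together, the conditions above amount to a predicate along these lines. This is a sketch under assumptions: `SlicingConfig` and `should_slice` are hypothetical names, "fits inside" is read as less-than-or-equal, and the `unfixed_chunk_num` value of 2 in the usage is an illustrative guess, not a value taken from the adapter.

```python
import math
from dataclasses import dataclass


@dataclass
class SlicingConfig:
    enabled: bool = False        # base adapter default: slicing off
    left_overlap_ms: int = 2000
    min_audio_sec: float = 16.0


def should_slice(
    cfg: SlicingConfig,
    prefix_text: str,
    chunk_index: int,
    chunk_size_sec: float,
    unfixed_chunk_num: int,
) -> bool:
    """Return True only when every slicing condition holds."""
    if not cfg.enabled:
        return False
    if not prefix_text:  # no emitted text to anchor the dedupe
        return False
    min_chunk_index = math.ceil(cfg.min_audio_sec / chunk_size_sec)
    if chunk_index < min_chunk_index:
        return False
    # Structural check: the overlap must fit inside the unfixed-chunk window.
    window_ms = unfixed_chunk_num * chunk_size_sec * 1000
    return cfg.left_overlap_ms <= window_ms


if __name__ == "__main__":
    qwen = SlicingConfig(enabled=True)
    print(should_slice(qwen, "hello", 7, 2.0, 2))  # below the 8-chunk gate
    print(should_slice(qwen, "hello", 8, 2.0, 2))  # gate reached
```

In the PR the overlap check disables slicing for the adapter as a whole; folding it into one predicate here is only for readability.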

Model-specific tuning

The 2 s overlap and 16 s gate are tuned for Qwen3-ASR.

Qwen3-ASR may revise its last few output tokens as more audio arrives — its config marks the last 5 tokens as still-revisable (unfixed_token_num = 5, a token-level rollback window). The 2 s overlap is an empirical choice: in our fixtures 2 s of audio carried enough context to re-emit those ~5 tokens. This is a tuning assumption (≈5 tokens ≲ 2 s), not something the guard verifies — the guard only enforces the chunk-level window above. The 16 s gate keeps slicing off on short inputs, where sliced output diverged from cumulative output in manual tests. Other chunked ASR models should re-tune these before enabling slicing; this is noted in the adapter docstring.

Accuracy tests

Short audio: no regression

The 7-fixture HTTP / HTTP-SSE / WebSocket consistency checks from #22848 still hold ("fixture" = a test audio sample; SSE = Server-Sent Events, the HTTP streaming format). These fixtures stay below the 8-chunk gate, so the WebSocket path stays cumulative end-to-end.

The gate also fixes word drops that earlier slicing attempts hit on short audio:

  • MLK 13 s: previously lost ~8 words
  • Spanish 6.6 s: previously lost medio sumergidas
  • Hindi 4.1 s: previously lost में कितने

All three are now identical character-for-character ("byte-equal") to the HTTP paths.

Long-form audio

On long-form English TED talks (from the distil-whisper/tedlium-long-form dataset), the cumulative and sliced paths produce broadly matching final transcripts — no truncation, no hallucination divergence in the tested fixtures.

The cumulative path emits more intermediate deltas (incremental updates to the client) because it keeps producing more revisions — e.g. 767 vs 698 deltas at 300 s. The final transcript agrees. (Under slicing, the running state only holds the latest deduped tail, so the wire transcript is rebuilt from the list of deltas already sent, not from that state.)

Repetitive long-form content

The TED results do not cover the worst case. Repetitive audio exposes a different failure. On a tiled English clip ("EN180"/"EN240" = a short English clip repeated to fill 180 s / 240 s), the cumulative path over-emits badly while slicing stays bounded:

| fixture | gold chars | cumulative chars | sliced chars | cumulative over-emit | sliced vs gold |
| --- | --- | --- | --- | --- | --- |
| EN180 | ~2,160 | 75,035 | 1,920 | ~35× | −11% |
| EN240 | ~2,880 | 106,531 | 2,524 | ~37× | −12% |

("gold chars" = character count of the reference transcript; "over-emit" = how many times larger the output is than gold.) The cumulative output is unusable here. The sliced output is bounded but under-emits by ~11–12%, because the text-level dedupe can over-match genuinely repeated words. That is a known trade-off of this M2 implementation; a principled fix needs token- or timestamp-level alignment (M3).
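A simplified word-level dedupe shows both the intended behavior and why repeated words can be over-matched. This is a sketch, not the PR's dedupe_overlap: the function names are hypothetical, case folding is an added assumption, and only the NFKC normalization, punctuation edge stripping, tail-bounded comparison, and "require a real word" rule are taken from the commit notes.

```python
import unicodedata


def _norm(word: str) -> str:
    """NFKC-normalize, case-fold (assumption), and strip punctuation from both edges."""
    w = unicodedata.normalize("NFKC", word).casefold()
    start, end = 0, len(w)
    while start < end and unicodedata.category(w[start]).startswith("P"):
        start += 1
    while end > start and unicodedata.category(w[end - 1]).startswith("P"):
        end -= 1
    return w[start:end]


def dedupe_overlap_sketch(committed: str, candidate: str) -> str:
    """Drop the longest candidate prefix that repeats the committed tail."""
    cand = candidate.split()
    if not cand:
        return candidate
    # Only the last len(cand) committed words can overlap, so split just the tail.
    tail = committed.rsplit(None, len(cand))[-len(cand):]
    for k in range(min(len(tail), len(cand)), 0, -1):
        left = [_norm(w) for w in tail[-k:]]
        right = [_norm(w) for w in cand[:k]]
        if left == right and any(left):  # punctuation-only matches do not count
            return " ".join(cand[k:])
    return candidate


if __name__ == "__main__":
    print(dedupe_overlap_sketch("the cat sat on the mat", "on the mat. And then it left"))
    print(dedupe_overlap_sketch("go go go", "go go stop"))
```

In the second call the sketch removes both leading "go" tokens. If only one of them came from the acoustic overlap and the other was genuinely spoken again, a real word is lost, which is the kind of under-emission reported for EN180/EN240.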

Unit coverage added in this PR

test/registered/unit/entrypoints/openai/test_streaming_asr.py — 8 CPU tests (no GPU), registered for CI under the base-a-test-cpu suite (est_time=3 is the CI time-budget hint, in seconds). Following the existing test_serving_transcription / test_serving_embedding suites, they drive the real process_asr_chunk entry point with a mocked TokenizerManager rather than unit-testing helpers in isolation:

  • process_asr_chunk scenarios (6): cumulative path injects the prompt prefix and runs no dedupe (M1); sliced path uses the bare prompt and dedupes a Latin overlap (M2); a non-overlapping candidate is kept unchanged; the final chunk dedupes then finalizes; a mid-word extension reconciles to the whole word ("worldly", not "ly"); an empty model response emits nothing without mutating state.
  • Slicing-enable guard (2): slicing turns on only when the left overlap fits inside the unfixed-chunk window (on at 2 s, off at 8 s), and an opted-out adapter (enabled=False) never slices.

The output dedupe (word-level) is exercised through these entry-point scenarios. Whether slicing actually turns on at runtime (the 8-chunk gate mid-stream, the first-gated-call edge, chunk-boundary behavior) stays in the manual GPU suite, not CI.

Existing coverage from #22848 should keep passing: the manual Qwen3-ASR HTTP / SSE / WebSocket tests, protocol-reject and item-lifecycle tests, the v2 unit suite, and the multilingual three-path byte-equality checks.

Speed tests and profiling

Definitions used in this section:

  • wall time — total elapsed real-world time, as the client sees it.
  • prefill tokens — how many input tokens the model had to encode on a call; more tokens = more compute. Reported as a session total and for the last chunk.
  • inference time — the time for one model call (audio encode + prefill + decode + inter-process messaging).

All numbers in this section were measured on a single H100 GPU with Qwen3-ASR-0.6B unless noted. The WebSocket path is driven through the OpenAI Python SDK (openai==2.6.1). Audio is pushed faster than real time unless a row is marked real-time-paced. The "cumulative" baseline is the pre-slicing _run_inference on the same server.

End-to-end wall time

TED-talk prefixes from distil-whisper/tedlium-long-form. "Total prefill tokens" sums every chunk's prefill in the session; "last-chunk prefill tokens" is just the final chunk (a proxy for worst-case per-call cost). Lower is better for both.

| audio | cumulative wall | sliced wall | wall saved | cumulative total prefill tokens | sliced total prefill tokens | cumulative last-chunk prefill tokens | sliced last-chunk prefill tokens |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 15 s | 1.50 s | 1.12 s | 25% | 1,095 | 1,095 | 230 | 230 |
| 30 s | 1.51 s | 1.29 s | 14% | 2,846 | 838 | 469 | 58 |
| 60 s | 3.00 s | 2.52 s | 16% | 11,001 | 885 | 965 | 58 |
| 120 s | 6.17 s | 5.11 s | 17% | 44,627 | 1,770 | 1,958 | 58 |
| 240 s | 14.78 s | 10.07 s | 32% | 175,982 | 3,540 | 3,888 | 58 |
| 300 s | 19.49 s | 12.01 s | 38% | 144,997 | 1,860 | 4,831 | 58 |
| 340 s | 23.57 s | 14.31 s | 39% | 126,561 | 1,310 | 5,463 | 58 |
| 600 s [1] | 77.24 s | 26.68 s | 65% | 1,444,362 | 17,378 | 1,388 | 58 |
| 900 s [1] | 171.37 s | 38.23 s | 78% | 3,241,918 | 9,000 | 6,150 | 58 |

Slicing is 14–78% faster end-to-end here, and its last-chunk prefill flattens at 58 tokens once past the gate while the cumulative path's keeps growing. Two caveats on reading this table:

  • The 15 s row is below the 16 s gate, so slicing never turns on — prefill is identical on both sides (1,095 / 230). Its 25% wall difference is single-run warmup/noise, not a slicing effect.
  • The cumulative total-prefill column is not a clean monotonic curve (e.g. 300 s < 240 s): the TED prefixes differ in speech density, and RadixCache state carries across the sequential runs, so token counts are only a rough trend, not a controlled measurement.

The quadratic long-form cost is cleanest in wall time and in per-chunk max latency (next section: cumulative per-chunk max grows from 127 ms at 30 s to 399 ms at 300 s) — treat those as the primary evidence; the cumulative total-prefill column is supporting/illustrative.

Per-chunk model-call distribution

Each row is the distribution of single-chunk inference time (milliseconds) within one session, measured with time.perf_counter() around the model call. n_chunks = number of model calls; stdev = standard deviation; min/max = fastest/slowest single chunk.

| audio | mode | n_chunks | mean | median | stdev | min | max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 30 s | cumulative | 15 | 97 ms | 99 ms | 19 ms | 74 ms | 127 ms |
| 30 s | sliced | 15 | 79 ms | 78 ms | 8 ms | 66 ms | 94 ms |
| 60 s | cumulative | 30 | 91 ms | 93 ms | 26 ms | 55 ms | 131 ms |
| 60 s | sliced | 30 | 80 ms | 80 ms | 13 ms | 58 ms | 109 ms |
| 120 s | cumulative | 60 | 99 ms | 105 ms | 28 ms | 56 ms | 143 ms |
| 120 s | sliced | 60 | 82 ms | 80 ms | 11 ms | 59 ms | 109 ms |
| 300 s | cumulative | 150 | 137 ms | 143 ms | 54 ms | 56 ms | 399 ms |
| 300 s | sliced | 150 | 80 ms | 80 ms | 12 ms | 57 ms | 121 ms |

Sliced per-chunk time stays flat (the per-call audio is bounded); cumulative time grows as the accumulated buffer gets longer. The PCM/WAV cleanup accounts for < 1 ms per chunk here; it is a cleanup, not the source of the speedup.
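A per-chunk distribution like the one above can be collected with a small harness around the model call. This is a sketch: `time_calls` and `summarize` are hypothetical helpers, `fn` stands in for whatever performs one inference, and using the sample standard deviation is an assumption about how the stdev column was computed.

```python
import statistics
import time


def time_calls(fn, inputs):
    """Call fn once per input and return the per-call durations in milliseconds."""
    durations_ms = []
    for item in inputs:
        t0 = time.perf_counter()
        fn(item)
        durations_ms.append((time.perf_counter() - t0) * 1000.0)
    return durations_ms


def summarize(durations_ms):
    """Summarize durations with the same columns as the table above."""
    return {
        "n_chunks": len(durations_ms),
        "mean": statistics.mean(durations_ms),
        "median": statistics.median(durations_ms),
        # Sample standard deviation (assumption); needs at least two points.
        "stdev": statistics.stdev(durations_ms) if len(durations_ms) > 1 else 0.0,
        "min": min(durations_ms),
        "max": max(durations_ms),
    }


if __name__ == "__main__":
    fake_model_call = lambda chunk: sum(range(10_000))  # placeholder workload
    print(summarize(time_calls(fake_model_call, range(15))))
```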

Cross-model total inference time

"Total inference time" = the sum of every per-chunk model-call duration across a whole run (so it captures total GPU work, not just wall time). Lower is better. Runs are real-time-paced on long-form English; the 1.7B run is 53 sessions / 536 model calls.

| model | cumulative total inference time | sliced total inference time | reduction |
| --- | --- | --- | --- |
| Qwen3-ASR-0.6B | 461.0 s | 276.2 s | −40.1% |
| Qwen3-ASR-1.7B | 181.8 s | 105.4 s | −42.0% |

The two models improve by nearly the same amount (−40% vs −42%), which suggests the win comes from bounding the call pattern rather than from anything specific to the 0.6B model.

Multilingual short-audio sanity

These fixtures stay below the gate, so both modes take the cumulative path and should match exactly. "deltas" = incremental updates sent to the client.

| fixture | language | audio | cumulative wall | sliced wall | cumulative deltas | sliced deltas | transcripts |
| --- | --- | --- | --- | --- | --- | --- | --- |
| zh_4s | Chinese | 4.2 s | 0.59 s | 0.58 s | 1 | 1 | byte-equal |
| hi_4s | Hindi | 4.1 s | 0.37 s | 0.37 s | 6 | 6 | byte-equal |
| es_7s | Spanish | 6.6 s | 0.43 s | 0.35 s | 14 | 14 | byte-equal |
| libri_10s | English | 10.4 s | 0.53 s | 0.48 s | 27 | 27 | byte-equal |
| mlk_13s | English | 13.0 s | 0.54 s | 0.48 s | 21 | 21 | byte-equal |

HTTP SSE vs WebSocket realtime

HTTP SSE (the streaming HTTP endpoint) stays cumulative; the WebSocket path uses this PR's slicing after the gate. Both driven via the OpenAI SDK.

| audio | HTTP SSE wall | WebSocket wall | WebSocket vs SSE |
| --- | --- | --- | --- |
| 30 s | 1.43 s | 1.33 s | −7% |
| 60 s | 2.83 s | 2.54 s | −10% |
| 120 s | 6.30 s | 5.17 s | −18% |
| 300 s | 22.82 s | 12.52 s | −45% |

Short audio is close; the gap grows with length because SSE still re-encodes the full cumulative buffer.

Known limits

  1. Short audio under the gate. Slicing does not turn on before ~16 s; short audio stays cumulative (and byte-equal — see the multilingual table).
  2. Repetitive long-form content. Text-level dedupe can over-match genuinely repeated words, so slicing under-emits by ~11–12% on EN180/EN240. Much better than the cumulative ~35–37× over-emission, but still a correctness trade-off.
  3. Held-back words on slow or paused speech. Each sliced inference holds back the last unfixed_token_num words (Qwen3-ASR: 5) for the next pass to confirm; the next slice re-covers them through the 2 s left overlap. But the hold-back is counted in tokens while the overlap is measured in time, and last_sliced_buffer_end_bytes jumps to the full buffer after each sliced call — so a held word is recovered only if it lies within the last ~2 s of audio. If the last 5 words span more than the overlap (slow/spontaneous speech, or a pause among them), the earliest held word falls before the next slice's start and is dropped, with no later chance to recover. Fast continuous speech (~3 words/s → ~1.7 s for 5 words, e.g. the TED fixtures) stays inside the overlap, so the long-form results above do not surface it. The slicing-enable guard checks a chunk-time window, not this token-span-vs-overlap relationship. The robust fix makes the hold-back time-based (hold the last X s, require overlap ≥ X) — M3, with token/timestamp alignment. (This affects space-delimited languages only; CJK never reaches the sliced path — see limit 5 — so it cannot lose held words this way.)
  4. Short-audio high concurrency. In a saturation test (64 concurrent ~15 s sessions on 0.6B) both paths drop 32/64 sessions at similar throughput — the bottleneck there is not per-chunk compute, so this PR does not move it. These sessions are below the gate, so both paths run cumulative; long-session concurrency — where bounding per-call work should let more streams run at once — was not measured and is future work.
  5. CJK gets no slicing benefit yet. Slicing only activates once update() has confirmed text to anchor against (get_prefix_text() non-empty). update() finds confirmed words by splitting on whitespace, which space-less scripts (Chinese, Japanese) never satisfy — so emitted_text stays empty, the slicing gate never opens, and CJK long-form runs the cumulative path for the whole session: no per-call bounding, and the transcript arrives as a single burst at commit instead of incrementally. Measured on this branch (Qwen3-ASR-0.6B): a 48 s Mandarin clip → 1 delta, prefill grew to ~629 tokens, ~21 s wall (cumulative); a 37 s English clip → 107 deltas, a flat 58-token prefill, ~5 s wall (sliced). The CJK transcript is still correct (byte-equal to the full-context HTTP result) — slicing simply never engages. Making CJK benefit needs CJK-aware confirmation (character/token-level rollback) = M3.
  6. Conversational short turns. Turns under the gate keep the old cumulative behavior; this is specifically a long-form fix.
  7. Session memory. This PR bounds per-call model input and encoder/prefill work, but does not compact the stored audio buffer (pcm_buffer); session memory is still capped by --asr-max-buffer-seconds. A rolling buffer is future work.
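The arithmetic behind limit 3 can be checked with a toy model that assumes a uniform speaking rate plus an optional pause among the held words. The function name and parameters are hypothetical; real speech is not uniform, so this only illustrates the token-span-versus-overlap relationship.

```python
def held_words_covered(
    held_words: int,
    words_per_sec: float,
    overlap_sec: float,
    pause_sec: float = 0.0,
) -> bool:
    """True if the held-back words span no more audio than the left overlap."""
    span_sec = held_words / words_per_sec + pause_sec
    return span_sec <= overlap_sec


if __name__ == "__main__":
    print(held_words_covered(5, 3.0, 2.0))                 # fast speech: ~1.7 s span
    print(held_words_covered(5, 2.0, 2.0))                 # slow speech: 2.5 s span
    print(held_words_covered(5, 3.0, 2.0, pause_sec=0.5))  # fast speech with a pause
```

Only the first case keeps all five held words inside the 2 s overlap; in the other two the earliest held word would fall before the next slice's start.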

How to send requests

Server launch:

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-ASR-0.6B \
  --served-model-name qwen3-asr \
  --trust-remote-code \
  --host 127.0.0.1 --port 30000

HTTP non-streaming:

curl -s -X POST http://127.0.0.1:30000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=qwen3-asr \
  -F language=en

HTTP SSE (streaming):

curl -N -X POST http://127.0.0.1:30000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=qwen3-asr \
  -F stream=true

WebSocket realtime via the OpenAI Python SDK:

import asyncio
import base64

import numpy as np
import soundfile as sf
from openai import AsyncOpenAI


async def transcribe(path: str, language: str = "en") -> str:
    data, sr = sf.read(path, dtype="float32")
    if data.ndim > 1:
        data = data.mean(axis=1)

    if sr not in (16000, 24000, 48000):
        n = int(len(data) / sr * 16000)
        data = np.interp(
            np.linspace(0, len(data) - 1, n),
            np.arange(len(data)),
            data,
        )
        sr = 16000

    pcm = (data * 32767).astype(np.int16).tobytes()

    client = AsyncOpenAI(
        base_url="http://127.0.0.1:30000/v1",
        websocket_base_url="ws://127.0.0.1:30000/v1",
        api_key="x",
    )

    async with client.realtime.connect(model="qwen3-asr") as conn:
        await conn.send(
            {
                "type": "session.update",
                "session": {
                    "type": "transcription",
                    "audio": {
                        "input": {
                            "format": {"type": "audio/pcm", "rate": sr},
                            "transcription": {
                                "model": "qwen3-asr",
                                "language": language,
                            },
                            "noise_reduction": None,
                            "turn_detection": None,
                        }
                    },
                },
            }
        )

        async for evt in conn:
            if getattr(evt, "type", None) == "session.updated":
                break

        chunk_bytes = sr  # 0.5 s of int16 PCM at sr Hz
        for off in range(0, len(pcm), chunk_bytes):
            await conn.send(
                {
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(pcm[off : off + chunk_bytes]).decode(),
                }
            )

        await conn.send({"type": "input_audio_buffer.commit"})

        async for evt in conn:
            if (
                getattr(evt, "type", None)
                == "conversation.item.input_audio_transcription.completed"
            ):
                return getattr(evt, "transcript", "")

    # The connection closed before a completed event arrived.
    return ""


if __name__ == "__main__":
    print(asyncio.run(transcribe("audio.wav")))

Reproducing the numbers

Wall time is measured on the client with time.perf_counter() around the SDK call.

Prefill-token counts come from SGLang's scheduler log, one line per batch — #new-token is tokens prefilled fresh, #cached-token is tokens reused from the prefix cache:

[YYYY-MM-DD HH:MM:SS] Prefill batch, #new-seq: 1, #new-token: 58, #cached-token: 1024, ...

WebSocket sessions are delimited in the log by WebSocket /v1/realtime ... [accepted] (start) and connection closed (end).
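Given log lines in the format shown above, the token columns can be tallied with a short script. This is a sketch: `sum_prefill_tokens` is a hypothetical helper, it assumes the caller has already cut the log down to one session using the delimiters just described, and treating #new-token as the "prefill tokens" reported in the tables is an assumption.

```python
import re

# Matches the two counters in a "Prefill batch" scheduler line.
PREFILL_RE = re.compile(
    r"Prefill batch.*?#new-token:\s*(\d+).*?#cached-token:\s*(\d+)"
)


def sum_prefill_tokens(lines):
    """Return (total new tokens, total cached tokens, new tokens of the last batch)."""
    new_total = 0
    cached_total = 0
    last_new = None
    for line in lines:
        match = PREFILL_RE.search(line)
        if match is None:
            continue  # not a prefill line
        new_tokens, cached_tokens = int(match.group(1)), int(match.group(2))
        new_total += new_tokens
        cached_total += cached_tokens
        last_new = new_tokens
    return new_total, cached_total, last_new


if __name__ == "__main__":
    sample = [
        "[YYYY-MM-DD HH:MM:SS] Prefill batch, #new-seq: 1, #new-token: 58, #cached-token: 1024, ...",
    ]
    print(sum_prefill_tokens(sample))
```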

The 600 s / 900 s cumulative rows were measured with a raw websockets client and keepalive disabled (ping_interval=None), because the OpenAI SDK's default 20 s keepalive interval dropped the cumulative path at those lengths (1011 keepalive ping timeout). The sliced path completes through the stock SDK; sliced numbers match within ~1% either way.

Checklist

  • Format code with pre-commit. black-jupyter, isort, ruff, codespell, ast, EOL, and whitespace checks pass.
  • Add unit tests. 8-case CPU suite at test/registered/unit/entrypoints/openai/test_streaming_asr.py (base-a-test-cpu, est_time=3): entry-point scenarios through process_asr_chunk (cumulative/sliced paths, word-level dedupe, finalize, reconciliation, empty response) plus the slicing-enable guard. Runtime slicing behavior stays in the manual GPU suite.
  • Update documentation. Docstrings added/updated for the slicing state, adapter config, _run_inference, process_asr_chunk, and the dedupe helpers.
  • Provide accuracy and speed results. See the sections above.
  • Follow SGLang code style.

Related


CI States

Latest PR Test (Base): ❌ Run #26718039502
Latest PR Test (Extra): ❌ Run #26718039444

Footnotes

  1. The 600 s / 900 s cumulative rows use a raw websockets client with keepalive disabled (ping_interval=None). The OpenAI SDK path hit 1011 keepalive ping timeout on the cumulative path at those lengths. Sliced numbers match the SDK-driven runs within ~1%.

SammLSH added 8 commits May 30, 2026 06:49
PCM16 from the WebSocket path was being encoded into WAV bytes only for
load_audio to decode it back into a float ndarray. Convert directly to
float samples (1/32768 normalization matches soundfile.read default for
signed 16-bit, so the float values are bit-equal to the old path), and
teach load_audio to accept a pre-decoded ndarray as a no-op passthrough.

ASR/cache semantics unchanged — this only removes the WAV adapter layer.
A future optimization could maintain decoded samples incrementally to
avoid re-converting the cumulative PCM buffer on every chunk.
…forward

Replace the WS /v1/realtime cumulative inference path (re-send the whole
PCM buffer on every chunk) with input slicing once a committed-text
prefix exists. Once StreamingASRState has stable emitted text, is past
the K-token holdback gate, and has accumulated at least eight chunks
(~16 s) of cumulative context, the model runs on
``pcm_buffer[committed_audio_until_bytes - left_overlap_bytes:]`` plus
a 2 s left overlap instead of the full buffer. The prompt stays at
``adapter.prompt_template`` — emitted_text is not injected as a
continuation prefix; the retained acoustic overlap plus a word-level
dedupe (with CJK char-level fallback) takes its place.

The first gated call still starts at offset 0 because
committed_audio_until_bytes is initialized to 0; only chunk 9 onward
is bounded to overlap + new chunk.

Performance (TED-LIUM long-form sweep on Qwen3-ASR-0.6B, H100):

  audio  cumul wall  sliced wall  save
   30 s     1.51 s      1.29 s   14 %
   60 s     3.00 s      2.52 s   16 %
  120 s     6.17 s      5.11 s   17 %
  240 s    14.78 s     10.07 s   32 %
  300 s    19.49 s     12.01 s   38 %
  600 s    77.24 s     26.68 s   65 %
  900 s   171.37 s     38.23 s   78 %

Per-chunk model-call wall stays flat at ~80 ms mean / ~121 ms max
across the sweep instead of growing to 137 ms mean / 399 ms max in
the cumulative path at 300 s. Realtime-paced sum of per-chunk
inference wall drops 40-42 % on both 0.6B and 1.7B Qwen3-ASR.

Implementation:
- ``adapter.realtime_slicing_config`` returns left_overlap_ms (default
  2000) and min_audio_sec (default 16.0); slicing_min_chunk_index is
  derived as ceil(min_audio_sec / chunk_size_sec).
- ``_slice_pcm_from`` snapshots the bytearray via memoryview so the
  per-chunk copy is slice-sized instead of full-buffer + slice
  (~7.7 MB -> ~128 KB at 240 s when slicing engaged).
- ``dedupe_overlap`` normalizes only the tail of committed_text bounded
  by len(candidate_words), so dedupe cost does not grow with session
  length.
- ``process_asr_chunk`` gains ``prompt: Optional[str]`` and
  ``dedupe_against: Optional[str]`` kwargs; the realtime path uses them,
  the HTTP / HTTP SSE path keeps existing behavior via defaults.
- ``load_audio`` annotation widened from ``str`` to
  ``Union[str, bytes, np.ndarray]`` to match the existing isinstance
  branches; not exposed through any Pydantic schema path.

Tests: 21-case CI unit suite at
test/registered/unit/entrypoints/openai/test_streaming_asr.py covering
dedupe_overlap (word + CJK + suffix-only-history invariant),
_pcm_to_float_samples (normalization + soundfile-round-trip
bit-equality + odd-length raises), and _slice_pcm_from validation.

After a stability gate (8 chunks / ~16s for Qwen3-ASR), the realtime WebSocket
path runs inference on a bounded audio tail (the new chunk + a 2s left overlap)
instead of the full cumulative PCM buffer, with output-side dedupe. Slicing is
opt-in per adapter: the base config keeps it off; Qwen3-ASR enables it. Short
audio and non-opting adapters keep the cumulative path unchanged.

Also in this changeset:
- StreamingASRState.update(): drop the char-level startswith fast path that
  emitted mid-word fragments ("world" -> "worldly" emitted "ly"); the word-level
  common-prefix scan now runs unconditionally (matching finalize()).
- Convert PCM16 to float directly, skipping the PCM -> WAV -> ndarray round-trip;
  load_audio accepts a pre-decoded ndarray.
- Add a 14-case CPU unit suite (process_asr_chunk integration, slicing-enable
  guard, update() reconciliation, dedupe rules, PCM/slice helpers).

Refines the M2 input-slicing output dedupe and its test suite:

- Dedupe normalization uses NFKC + Unicode category-P edge stripping
  (Whisper-style) instead of a hand-listed punctuation set.
- Split CJK detection into _is_cjk_no_space (spacing) and _is_cjk_dedupe
  (dedupe, narrower); use Script_Extensions so the kana marks U+30FC/U+30FB
  are covered; keep Hangul out (Korean is space-delimited).
- CJK dedupe is boundary-only: compare the leading/trailing CJK runs and
  never skip interior non-CJK content; require a >=2-glyph overlap for
  letters, allow 1 for punctuation.
- Fix spurious over-deletion when lone-punctuation tokens normalize to ""
  and match each other; require a real word in the matched overlap.
- _dedupe_by_word rsplits the committed tail instead of tokenizing the
  whole growing transcript.
- Rewrite the unit tests as entry-point scenarios through process_asr_chunk
  plus the slicing-enable guard.
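
The normalization in the first bullet can be sketched with the standard `unicodedata` module. The function name `dedupe_norm_sketch` and the use of `casefold()` are assumptions; the commit message only names NFKC and edge stripping of Unicode category-P characters.

```python
import unicodedata


def _is_punct(ch: str) -> bool:
    """True for any Unicode punctuation category (Pc, Pd, Ps, Pe, Pi, Pf, Po)."""
    return unicodedata.category(ch).startswith("P")


def dedupe_norm_sketch(token: str) -> str:
    """NFKC-normalize, case-fold (assumed), and strip punctuation from both edges."""
    text = unicodedata.normalize("NFKC", token).casefold()
    start, end = 0, len(text)
    while start < end and _is_punct(text[start]):
        start += 1
    while end > start and _is_punct(text[end - 1]):
        end -= 1
    # Interior punctuation (e.g. the apostrophe in "don't") is left alone.
    # A token made only of punctuation collapses to "", which is why the
    # overlap check must require at least one real word.
    return text[start:end]
```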
Tighten slicing-path comments to one-liners (base adapter config docstring,
session.py slicing_enabled / emitted_deltas / _run_inference, streaming_asr
predicate header). No logic change.

CJK never enters the sliced path -- the slicing gate needs confirmed text,
which the whitespace word-split in StreamingASRState.update never produces for
space-less scripts -- so the CJK char-level dedupe was unreachable for CJK and
only added review surface. dedupe_overlap is now word-level only; the spacing
predicate (needs_space) and word-level dedupe (incl. the punctuation-overlap
fix) stay. CJK-aware dedupe is deferred to M3, where slicing also engages for
CJK.

Revert the regex/scx rewrite of the spacing predicate to the baseline
codepoint _is_cjk (removes the new `regex` dependency); add a halfwidth Hangul
jamo guard so the function matches its docstring. Fix a stale "token-level"
comment in update() (the scan is word-level; token-level rollback is M3) and
shorten _dedupe_norm's docstring.

"committed_audio_until_bytes" collided with the OpenAI realtime
input_audio_buffer.commit concept; the field actually marks the PCM
offset the previous sliced inference consumed up to. Rename it across
the field declaration, slice-start arithmetic, anchor update, and the
per-item reset. Also fix a stale test docstring left over from the
CJK-dedupe rollback ("Latin and CJK" -> word-level dedupe).

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a realtime ASR slicing path to optimize inference by switching from cumulative buffers to tail slices with left overlap and output deduplication. It updates RealtimeConnection and StreamingASRState to support slicing configuration, adds word-level deduplication, and enables slicing for Qwen3-ASR. Feedback focuses on improving robustness and efficiency, including moving slicing and float conversion inside the try-except block in _run_inference, using float32 instead of float64 for audio samples to reduce memory usage, and ensuring that left_overlap_bytes and slicing offsets are aligned to the 16-bit PCM sample width boundary to prevent audio corruption.
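
The float32 suggestion can be sketched as a direct PCM16-to-float conversion with NumPy. The function name `pcm16_to_float32` is hypothetical, and the little-endian layout and the 32768 scale factor are assumptions rather than details taken from the PR.

```python
import numpy as np


def pcm16_to_float32(pcm: bytes) -> np.ndarray:
    """Convert little-endian 16-bit PCM bytes to float32 samples in [-1.0, 1.0)."""
    if len(pcm) % 2 != 0:
        raise ValueError("PCM16 payload must contain an even number of bytes")
    samples = np.frombuffer(pcm, dtype="<i2")
    # astype(np.float32) keeps 4 bytes per sample instead of float64's 8.
    return samples.astype(np.float32) / 32768.0
```

Dividing a float32 array by a Python float keeps the result in float32, so no intermediate float64 copy of the audio is created.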

Comment thread python/sglang/srt/entrypoints/openai/realtime/session.py Outdated
slicing_opt_in = bool(slicing_cfg.get("enabled", False))
left_overlap_ms = int(slicing_cfg.get("left_overlap_ms", 0))
min_audio_sec = float(slicing_cfg.get("min_audio_sec", 0.0))
left_overlap_bytes = int(left_overlap_ms / 1000 * self.bytes_per_second)

medium

To prevent potential audio misalignment and corruption, left_overlap_bytes should be explicitly aligned to a multiple of _SAMPLE_WIDTH (2 bytes). If left_overlap_bytes is not aligned, slicing the PCM buffer could cut a 16-bit sample in half, leading to static noise or runtime errors during conversion.

Suggested change
-left_overlap_bytes = int(left_overlap_ms / 1000 * self.bytes_per_second)
+left_overlap_bytes = int(left_overlap_ms / 1000 * self.bytes_per_second) // _SAMPLE_WIDTH * _SAMPLE_WIDTH
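
A small worked example of the alignment arithmetic in this suggestion follows. The helper name `aligned_overlap_bytes` and the 44100 bytes/s rate used in the usage note are hypothetical; they are chosen only to show a case where truncation alone yields an odd byte count.

```python
SAMPLE_WIDTH = 2  # bytes per 16-bit PCM sample


def aligned_overlap_bytes(left_overlap_ms: int, bytes_per_second: int) -> int:
    """Convert a millisecond overlap to bytes, floored to a whole-sample boundary."""
    raw = int(left_overlap_ms / 1000 * bytes_per_second)
    # Integer-divide then multiply back: drops a trailing odd byte, if any.
    return raw // SAMPLE_WIDTH * SAMPLE_WIDTH
```

At 44100 bytes/s, a 15 ms overlap truncates to 661 bytes, which would split a sample; the floor step brings it to 660. A 2000 ms overlap at 32000 bytes/s is already aligned and stays at 64000.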

Comment on lines +84 to +88
def _slice_pcm_from(buffer: Union[bytes, bytearray], start: int) -> bytes:
    """Return an immutable ``buffer[start:]`` snapshot with bounds checking."""
    if not (0 <= start <= len(buffer)):
        raise ValueError(f"_slice_pcm_from: start={start} not in [0, {len(buffer)}]")
    return bytes(memoryview(buffer)[start:])

medium

As a defensive programming practice, consider adding a check in _slice_pcm_from to ensure that the start offset is a multiple of _SAMPLE_WIDTH. This guarantees that the sliced buffer is properly aligned to 16-bit PCM boundaries, preventing silent audio corruption or misalignment.

Suggested change
-def _slice_pcm_from(buffer: Union[bytes, bytearray], start: int) -> bytes:
-    """Return an immutable ``buffer[start:]`` snapshot with bounds checking."""
-    if not (0 <= start <= len(buffer)):
-        raise ValueError(f"_slice_pcm_from: start={start} not in [0, {len(buffer)}]")
-    return bytes(memoryview(buffer)[start:])
+def _slice_pcm_from(buffer: Union[bytes, bytearray], start: int) -> bytes:
+    """Return an immutable ``buffer[start:]`` snapshot with bounds checking."""
+    if not (0 <= start <= len(buffer)):
+        raise ValueError(f"_slice_pcm_from: start={start} not in [0, {len(buffer)}]")
+    if start % _SAMPLE_WIDTH != 0:
+        raise ValueError(f"_slice_pcm_from: start={start} must be a multiple of {_SAMPLE_WIDTH}")
+    return bytes(memoryview(buffer)[start:])

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

SammLSH commented May 31, 2026

cc @AgainstEntropy

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
SammLSH changed the title from "Feat/realtime asr input slicing" to "[Feature] Realtime ASR: bound long-form audio with input slicing" on May 31, 2026
SammLSH changed the title from "[Feature] Realtime ASR: bound long-form audio with input slicing" to "[Feature] Realtime ASR: input slicing for long-running sessions" on Jun 1, 2026
SammLSH changed the title from "[Feature] Realtime ASR: input slicing for long-running sessions" to "[Feature] Realtime ASR: Input Slicing for Long-Running Realtime ASR Sessions" on Jun 1, 2026