
[Feature] WebSocket streaming audio input for ASR #22821

Closed
SammLSH wants to merge 1 commit into sgl-project:main from SammLSH:feat/ws-streaming-asr-input

@SammLSH (Contributor) commented Apr 14, 2026

Checklist

  • Format your code according to the Format code with pre-commit (black, isort, ruff, codespell all passing)
  • Add unit tests according to the Run and add unit tests (18 tests in test/manual/models/test_qwen3_asr.py)
  • Documentation: wire protocol documented in the Modifications section
  • Provide accuracy benchmark results: see the "Accuracy Tests" section
  • Follow the SGLang code style guidance

Review and Merge Process

  1. Ping Merge Oncalls to start the process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments.
  4. After green CI and required approvals, merge.

Related

cc @JustinTong0323 @AgainstEntropy

Motivation

Implements M1 of the RFC in #22474.

PR #22089 shipped chunked streaming output for Qwen3-ASR via POST /v1/audio/transcriptions?stream=true (SSE over an HTTP upload), which assumes the entire audio file is known up-front. Real-time use cases (live captioning, voice assistants, meeting transcription) need the opposite direction: the server accepts audio as it arrives and pushes partial transcripts back as the speaker talks. This PR adds that path as a new WebSocket endpoint, reusing the existing chunked inference state machine and TranscriptionAdapter from PR #22181.

Modifications

New WebSocket endpoint

  • Endpoint: WS /v1/audio/transcriptions/stream (registered in http_server.py)
  • Wire protocol (inspired by OpenAI Realtime API conventions, simplified session/delta/final event model):
    • Client → session.start (JSON) → binary PCM16 frames → session.end (JSON)
    • Server → session.started / transcript.delta (per word) / transcript.final / error
  • Audio format (M1): session.start accepts an audio_format field for forward compatibility. Currently only pcm16_16k_mono is supported; other values return invalid_audio_format. Additional formats can be added without a protocol break.
  • Backpressure: --asr-max-buffer-seconds CLI flag (default 60s). If accumulated server-side audio exceeds the cap, the server sends a buffer_overflow error and closes the socket. Below the cap, the single-task coroutine alternates receive and inference, so while a chunk is inferring the client experiences standard TCP-level backpressure (ws.send blocks on a full socket buffer). No silent drop.
  • Concurrency model: single-task — one coroutine alternates receive + inference; session.end is therefore serialized after any in-flight chunk.
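As a rough illustration of the handshake and the buffer cap described above, the sketch below builds a session.start message and works out how many bytes of pcm16_16k_mono audio fit under --asr-max-buffer-seconds. The helper names (build_session_start, check_audio_format, max_buffer_bytes) are hypothetical and are not taken from the PR. Treating a missing audio_format as pcm16_16k_mono is also an assumption. Only the type, audio_format and language fields, the pcm16_16k_mono value, and the invalid_audio_format code come from the description.

```python
import json

SAMPLE_RATE_HZ = 16_000   # pcm16_16k_mono: 16 kHz sample rate
BYTES_PER_SAMPLE = 2      # 16-bit samples, single channel
SUPPORTED_FORMATS = {"pcm16_16k_mono"}


def build_session_start(language=None, audio_format="pcm16_16k_mono"):
    """Serialize a session.start message (hypothetical client helper)."""
    msg = {"type": "session.start", "audio_format": audio_format}
    if language:
        msg["language"] = language
    return json.dumps(msg)


def check_audio_format(raw):
    """Return an error code, or None if the format is accepted.

    Hypothetical server-side check; defaulting a missing field to
    pcm16_16k_mono is an assumption made for this sketch.
    """
    msg = json.loads(raw)
    fmt = msg.get("audio_format", "pcm16_16k_mono")
    return None if fmt in SUPPORTED_FORMATS else "invalid_audio_format"


def max_buffer_bytes(max_buffer_seconds=60):
    """Bytes of PCM16/16kHz/mono audio that fit in the given number of seconds."""
    return max_buffer_seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE
```

With the default 60s cap, this works out to 60 × 16000 × 2 = 1,920,000 bytes of buffered audio before a buffer_overflow would be expected.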

Note on HTTP SSE path: switching get_prefix_text() from confirmed_text to emitted_text (see streaming_asr.py row below) also incidentally fixes a latent prompt-prefix continuation issue in the HTTP SSE path from #22089, where confirmed_text could roll back mid-sentence and cause the model to re-emit from scratch on long English audio. Regression covered by the existing 8 HTTP tests.
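To make the confirmed_text versus emitted_text distinction concrete, here is a toy model. It is not the PR's StreamingASRState and its merge logic is deliberately simplistic: ToyState and its word-count diff are assumptions made for illustration. The point is only that a prefix source which can shrink gives the model a shorter prompt prefix after a rollback, while an append-only accumulator does not.

```python
class ToyState:
    """Toy stand-in for a streaming state; NOT the PR's implementation."""

    def __init__(self):
        self.confirmed_text = ""  # latest hypothesis, may roll back
        self.emitted_text = ""    # append-only, never shrinks

    def update(self, hypothesis):
        """Record a new hypothesis and return only the not-yet-emitted words."""
        self.confirmed_text = hypothesis
        already = len(self.emitted_text.split())
        delta = " ".join(hypothesis.split()[already:])
        if delta:
            self.emitted_text = (self.emitted_text + " " + delta).strip()
        return delta

    def get_prefix_text(self):
        # Using the monotonic accumulator keeps the prompt prefix stable
        # even when the latest hypothesis is shorter than what was sent.
        return self.emitted_text


state = ToyState()
first = state.update("I have a dream")
rolled_back = state.update("I have")          # hypothesis shrinks
resumed = state.update("I have a dream that one day")
```

In this toy run the rollback produces an empty delta and leaves get_prefix_text() unchanged, so the next chunk continues from "I have a dream" rather than from "I have".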

Architecture

The HTTP SSE and WebSocket paths both route through a shared inference driver, keeping streaming state in StreamingASRState at the adapter layer rather than lifting it into the transport layer.

flowchart TB
    HTTP["HTTP<br/>(file upload)"]
    WS["WebSocket<br/>(live PCM frames)"]

    subgraph serving["OpenAIServingTranscription"]
        SSE["_generate_chunked_asr_stream"]
        WSH["handle_websocket<br/>(delegator)"]
    end

    PAC["process_asr_chunk<br/>(shared inference driver)"]

    subgraph state["StreamingASRState"]
        CT["confirmed_text<br/>chunk-local rollback, delta diff basis"]
        ET["emitted_text<br/>monotonic accumulator, prompt prefix source"]
        UF["update() / finalize()"]
    end

    HTTP --> SSE
    WS --> WSH
    SSE --> PAC
    WSH --> PAC
    PAC --> state

WebSocket session lifecycle

Inside serving_transcription_websocket.py, one coroutine alternates receive and inference. Each PCM batch of chunk_size_bytes triggers exactly one inference pass through the shared driver.

sequenceDiagram
    autonumber
    participant C as Client
    participant H as WS handler
    participant I as process_asr_chunk
    participant S as StreamingASRState

    C->>H: session.start (JSON)
    H->>H: _init_session<br/>accept + adapter capability check
    H-->>C: session.started

    loop per chunk_size_bytes of new audio
        C->>H: binary PCM16 frames
        H->>H: _handle_audio_frame<br/>accumulate into pcm_buffer
        H->>I: _run_inference(_pcm_to_wav(buffer))
        I->>S: update() → delta
        I-->>H: delta str
        H-->>C: transcript.delta (per word)
    end

    C->>H: session.end (JSON)
    H->>I: _run_inference(is_last=True)
    I->>S: finalize() → tail
    I-->>H: tail delta
    H-->>C: transcript.final
    H->>C: close socket
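The chunk-trigger step in the loop above can be sketched as follows. ChunkTrigger is a hypothetical name and a simplification: it fires at most once per incoming frame, which is an assumption about behavior the description does not spell out. The 2s chunk size and the PCM16/16kHz/mono format are taken from the text.

```python
class ChunkTrigger:
    """Accumulate PCM frames and signal when a full chunk of new audio exists."""

    def __init__(self, chunk_size_sec=2.0, sample_rate=16_000, bytes_per_sample=2):
        # 2 s * 16000 samples/s * 2 bytes/sample = 64000 bytes per chunk
        self.chunk_size_bytes = int(chunk_size_sec * sample_rate) * bytes_per_sample
        self.pcm_buffer = bytearray()
        self._inferred_up_to = 0  # buffer length at the last inference pass

    def add_frame(self, frame):
        """Append a frame; return True when an inference pass should run."""
        self.pcm_buffer.extend(frame)
        new_bytes = len(self.pcm_buffer) - self._inferred_up_to
        if new_bytes >= self.chunk_size_bytes:
            self._inferred_up_to = len(self.pcm_buffer)
            return True
        return False


trigger = ChunkTrigger()
half_second = b"\x00" * 16_000  # 0.5 s of PCM16 at 16 kHz
fired = [trigger.add_frame(half_second) for _ in range(8)]
```

With 0.5s frames, every fourth frame completes a 2s chunk, so fired is True at positions 4 and 8 only.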

Files touched

| File | Change |
| --- | --- |
| python/sglang/srt/entrypoints/http_server.py | Register WS /v1/audio/transcriptions/stream route |
| python/sglang/srt/entrypoints/openai/serving_transcription.py | Extract process_asr_chunk into streaming_asr.py so HTTP and WS share the inference driver; add handle_websocket delegator |
| python/sglang/srt/entrypoints/openai/serving_transcription_websocket.py | NEW: WS transport: session lifecycle, PCM buffering, chunk-trigger logic, private _pcm_to_wav adapter for protocol-fixed PCM16/16kHz/mono, transcript.final.text construction |
| python/sglang/srt/entrypoints/openai/streaming_asr.py | Extend StreamingASRState with an emitted_text accumulator used as the prompt prefix in get_prefix_text() (previously confirmed_text); also extract process_asr_chunk as the shared HTTP/WS inference driver and add a _normalize_whitespace helper for batched-inference punctuation jitter |
| python/sglang/srt/entrypoints/websocket_base.py | NEW: WebSocketSessionBase minimal mixin (accept / send_json / safe_close) so future WS endpoints can reuse it |
| python/sglang/srt/server_args.py | asr_max_buffer_seconds: int = 60 + CLI flag |
| test/manual/models/test_qwen3_asr.py | 18 tests total (8 HTTP non-stream + 10 WebSocket). EXPECTED_TRANSCRIPTS reference dict, _wer Levenshtein helper, WS assertions via _assert_close_to_ref(WER ≤ 0.15) |
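For readers unfamiliar with the PCM-to-WAV step, a minimal sketch using only the standard library is shown below. It is not the PR's private _pcm_to_wav. The name pcm16_to_wav and its signature are assumptions, and it only covers the protocol-fixed PCM16/16kHz/mono case.

```python
import io
import wave


def pcm16_to_wav(pcm, sample_rate=16_000):
    """Wrap raw little-endian PCM16 mono bytes in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)            # mono
        w.setsampwidth(2)            # 16-bit samples
        w.setframerate(sample_rate)  # 16 kHz by default
        w.writeframes(pcm)
    # wave does not close a file object it did not open itself,
    # so the buffer is still readable here.
    return buf.getvalue()


one_second = b"\x00\x00" * 16_000
wav_bytes = pcm16_to_wav(one_second)
```

The result is the raw audio preceded by a 44-byte RIFF/WAVE header, which lets a file-oriented inference path consume live PCM without further changes.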

Accuracy Tests

Manual end-to-end verification against HTTP non-streaming (one-shot) ground truth across 7 audio fixtures × 3 paths (HTTP JSON / HTTP SSE / WebSocket):

Audio Length HTTP non-stream (truth) HTTP SSE WebSocket Notes
EN (Qwen podcast) 15.05s Oh yeah, yeah. He wasn't even that big when I started listening to him. But and his solo music...for other people. WER ≈ 0.057, Uh huh. prefix hallucination (vocal-fry intro, chunked-inference artifact)
ZH (Qwen news) 4.20s 甚至出现交易几乎停滞的情况。 identical identical byte-match
MLK speech 13.00s I have a dream that one day this nation will rise up and live out the true meaning of its creed. identical identical byte-match
LibriSpeech dummy 10.44s He hoped there would be stew for dinner—turnips and carrots and bruised potatoes and fat mutton pieces—to be ladled out in thick peppered flour-fatted sauce. : same as SSE word-level identical, punctuation drift only
Spanish (LibriSpeech-style) 6.58s y en las ramas medio sumergidas revoloteaban algunos pájaros de químico y legendario plumaje Y ... plumaje. same as SSE case + trailing period only
Hindi 4.13s मिर्ची में कितने विभिन्न प्रजातियाँ हैं identical identical byte-match
MP3 stereo (I know kung fu) 3.98s I know kung fu. identical identical byte-match

WER threshold rationale: WS assertions use WER ≤ 0.15. This tolerates the chunked-inference boundary artifacts inherited from #22089 (e.g. the Uh huh. prefix on the EN clip) while still catching real regressions — the non-streaming ground-truth WER on these fixtures is ≤0.02, so the 0.15 threshold leaves plenty of headroom.

5 consecutive stability runs of the 18-test suite passed (all 18 tests green each run). Median wall-clock 52.4s on a single H100.

Streaming verification: on the 15s EN audio in realtime-pacing mode (client sends PCM at 0.5s/frame = wall-clock rate), ~30 out of 37 transcript.delta events arrive at the client before the client sends session.end, confirming true incremental server push (not batch-at-end).

M1 completion checklist (from RFC)

  • ✅ WebSocket endpoint /v1/audio/transcriptions/stream
  • ✅ Protocol: client sends session.start / binary PCM / session.end; server sends session.started / transcript.delta / transcript.final / error
  • ✅ 8 error codes: invalid_json, invalid_payload, invalid_state, invalid_audio_format, unknown_message, buffer_overflow, unsupported_model, internal_error
  • ✅ Single-task concurrency, --asr-max-buffer-seconds backpressure
  • ✅ No scheduler or engine changes
  • ✅ Reuse StreamingASRState, process_asr_chunk, TranscriptionAdapter
  • M2 (cross-chunk RadixCache prefill reuse): out of scope for this PR
  • M3 (token-level streaming within chunks via GenerateReqInput.stream=True, forced-alignment timestamps): out of scope

Known limitations

Listed here so reviewers don't have to re-discover them:

| Limitation | Origin | Can this be fixed? |
| --- | --- | --- |
| EN 15s audio has a "Uh huh." prefix hallucination because the first 2s chunk sees only vocal fry / silence, and the append-only delta protocol can't retract it later. WER ~0.057. | Inherited from #22089 chunked inference. | Not at the serving layer. Would need either (a) skip-emit-for-noise heuristic on chunk 0, or (b) token-level timestamps (M3). |
| CJK languages (Chinese, Japanese, …) produce a single transcript.delta event because str.split() can't word-tokenize them. Final transcript is still correct. TODO already in StreamingASRState docstring. | Inherited from #22089. | M3: token-level overlap instead of word-level. |
| Boundary-aligned real word repetition (theoretical): if the audio contains a legitimately repeated phrase that aligns exactly with a 2s chunk boundary, the current merge algorithm could skip one repetition. None of the 7 test fixtures trigger this; would show up on songs / chants / a full 8-verse MLK speech. | Trade-off of the current merge algorithm. | M3: forced-alignment timestamp-based merge. This is the fundamental reason the RFC files M3 as a separate milestone. |
| Silence (all-zero PCM) crashes Qwen3-ASR in mm_utils.py:_adjust_embedding_length. Upstream bug, unrelated to this PR. | Upstream sglang bug. | Tracked in #XXXXX. |
| <2s of real speech triggers Qwen3-ASR short-context hallucinations (1s EN clip returns 嗯哼。). Short-clip test uses 3s MP3 to avoid this. | Model limitation. | Not at the serving layer. |
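The CJK limitation is easy to reproduce in isolation. The snippet below only demonstrates how str.split() behaves on two of the fixture transcripts; it assumes nothing about the PR's code beyond the word-splitting mentioned in the limitation itself.

```python
# Whitespace splitting yields one item per word for English text...
en_words = "I know kung fu.".split()

# ...but Chinese text has no spaces, so the whole sentence is one item,
# which is why it would go out as a single transcript.delta event.
zh_words = "甚至出现交易几乎停滞的情况。".split()
```

Here en_words has four entries and zh_words has one.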

Test plan

Tests live under test/manual/ because they require downloading the Qwen/Qwen3-ASR-0.6B checkpoint (~1.2GB) and a GPU. They are runnable locally with a single command and complete in ~52s on one H100.

Automated (single command)

cd /path/to/sglang
python test/manual/models/test_qwen3_asr.py

The file uses popen_launch_server to spin up its own sglang.launch_server, runs 18 tests, and tears the server down.

Test breakdown:

| Category | Count | What it covers |
| --- | --- | --- |
| HTTP non-streaming (ground truth) | 8 | EN, ZH, MLK, LibriSpeech, Spanish, Hindi, MP3 stereo, EN × 3 consistency |
| WebSocket streaming: happy paths | 7 | EN / ZH / Hindi / Spanish / MLK / LibriSpeech / MP3 (fast-mode PCM) |
| WebSocket streaming: edges | 3 | Realtime-pacing EN, 3 concurrent sessions on same audio (state isolation + determinism), 3s short clip |

WS assertions use _assert_close_to_ref(WER ≤ 0.15) against EXPECTED_TRANSCRIPTS (a dict of canonical transcripts captured from one-shot non-streaming inference). _wer normalizes case/punctuation and falls back to character-level comparison for CJK.

Manual (single audio, step-by-step)

Launch the server:

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-ASR-0.6B \
  --served-model-name qwen3-asr \
  --trust-remote-code \
  --host 127.0.0.1 --port 30000

HTTP non-streaming (baseline / ground truth):

curl -s http://127.0.0.1:30000/v1/audio/transcriptions \
  -F file=@audio.wav -F model=qwen3-asr | jq -r .text

HTTP SSE streaming:

curl -Ns http://127.0.0.1:30000/v1/audio/transcriptions \
  -F file=@audio.wav -F model=qwen3-asr -F stream=true

WebSocket streaming: save as wsasr.py and run python wsasr.py audio.wav [language]. The script sends PCM in 0.5s frames at wall-clock realtime pacing while concurrently reading transcript.delta events; each printed line is prefixed with [+<sec>], the time elapsed since session.start was sent, so you can see deltas arriving while audio is still uploading.

import asyncio
import json
import sys
import time

import numpy as np
import soundfile as sf
import websockets


async def main(path, lang=None):
    data, sr = sf.read(path, dtype="float32")
    if data.ndim > 1:
        data = data.mean(axis=1)
    if sr != 16000:
        n = int(len(data) / sr * 16000)
        data = np.interp(np.linspace(0, len(data) - 1, n), np.arange(len(data)), data)
        sr = 16000
    # Clip to [-1, 1] first so out-of-range samples cannot wrap around in int16.
    pcm = (np.clip(data, -1.0, 1.0) * 32767).astype(np.int16).tobytes()

    url = "ws://127.0.0.1:30000/v1/audio/transcriptions/stream"
    async with websockets.connect(url) as ws:
        start = {"type": "session.start"}
        if lang:
            start["language"] = lang

        t0 = time.perf_counter()
        await ws.send(json.dumps(start))

        def stamp():
            return time.perf_counter() - t0

        ack_raw = await ws.recv()
        ack = json.loads(ack_raw)
        print(f"[+{stamp():5.2f}] << {ack_raw}")
        if ack.get("type") != "session.started":
            print(f"[+{stamp():5.2f}] !! expected session.started, got {ack.get('type')}; aborting")
            return

        async def receive_loop():
            try:
                async for raw in ws:
                    print(f"[+{stamp():5.2f}] << {raw}")
                    msg = json.loads(raw)
                    if msg["type"] == "transcript.final":
                        return
                    if msg["type"] == "error":
                        print(f"[+{stamp():5.2f}] !! server error, closing")
                        return
            except websockets.ConnectionClosed as e:
                print(f"[+{stamp():5.2f}] !! connection closed: {e}")

        receiver = asyncio.create_task(receive_loop())

        chunk = int(0.5 * sr) * 2  # 0.5s of int16 PCM
        try:
            for i in range(0, len(pcm), chunk):
                if receiver.done():
                    print(f"[+{stamp():5.2f}] !! receiver ended early, stopping send")
                    break
                await ws.send(pcm[i : i + chunk])
                await asyncio.sleep(0.5)  # realtime pacing; drop for fast mode

            if not receiver.done():
                print(f"[+{stamp():5.2f}] >> session.end")
                await ws.send(json.dumps({"type": "session.end"}))
        except websockets.ConnectionClosed as e:
            print(f"[+{stamp():5.2f}] !! send failed, connection closed: {e}")

        await receiver


if __name__ == "__main__":
    asyncio.run(main(sys.argv[1], sys.argv[2] if len(sys.argv) > 2 else None))

Error-path verification

Manually tested all 8 error codes against a running server: invalid_json, invalid_payload, invalid_state × 3 variants, invalid_audio_format, unknown_message, buffer_overflow, unsupported_model, internal_error. All return the documented error event and close the socket.

Speed Tests and Profiling

No impact on inference speed: this PR is a thin WebSocket transport layer on top of unchanged chunked inference. Per-chunk latency is bounded by the existing chunk_size_sec (2s) plus model inference time (~0.5–1.5s on H100 for Qwen3-ASR-0.6B). No new CUDA kernels, no new memory patterns, no scheduler changes.


@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@SammLSH SammLSH closed this Apr 14, 2026