[WIP] [Full Duplex] Feat: Support Full-Duplex realtime runtime & add MiniCPM-o 4.5 demo #3907
Sy0307 wants to merge 54 commits into
Can we make this model-specific duplex runtime generic and reusable?
The changes look good overall; however, I'm a bit concerned about adding duplex execution logic directly inside the model executor, since it feels too model-specific. Can we explore a more generic approach instead? This could benefit many other models down the line, not just MiniCPM-o 4.5.
Can we build a wrapper that takes the model executor and manages session state and other responsibilities?
ef97974 to dfff075
Signed-off-by: Sy03 <1370724210@qq.com>
… and restore per-segment streaming-input ingestion
Combines two fixes needed for MiniCPM-o 4.5 duplex on vllm-omni 0.22:
* Register the full thinker/talker architecture keys (MiniCPMO45OmniLLMForConditionalGeneration / MiniCPMO45OmniTTSForConditionalGeneration) and add a plain-chat (use_tts_template) tts_bos fallback so non-duplex chat-completions audio works: resolve <|tts_bos|> (151703) directly and bound the spoken region at <|im_end|> (151645).
* Restore per-segment streaming-input ingestion in _update_streaming_input_additional_info: read and accumulate the incoming per-segment model_intermediate_buffer (via streaming_accumulated_keys + torch.cat) instead of only resetting num_processed_tokens. A prior rebase had dropped this, starving duplex audio and producing garbled/doubled output.
Signed-off-by: linyueqian <linyueqian@outlook.com>
Add the continuous-duplex realtime web UI under examples/online_serving/minicpmo/realtime_web/: a browser client that streams mic audio to the duplex endpoint and plays back TTS, with selectable turn-detection (model-driven default, server_vad for other models), a voice picker, an interaction-mode toggle, a light/white theme, and 16 kHz anti-aliased mic capture for clean audio. Signed-off-by: linyueqian <linyueqian@outlook.com>
dfff075 to 47a93f5
…uest/StreamingUpdate imports, typos) Signed-off-by: linyueqian <linyueqian@outlook.com>
47a93f5 to 42f5e9a
… epochs The data-plane stage0 request id was epoch-scoped and barge_in() aborted all stage requests, so every turn/epoch advance started a fresh, context-less KV while the model helper still skipped re-prepending the system context, degenerating multi-turn output into token garbage. Make stage0 a single long-lived resumable request: its id is epoch-independent and barge_in() preserves the stage0 binding (only downstream stages are torn down), so conversation KV/context persists across turns/epochs as the topology already declares (stage0_long_lived_request). Signed-off-by: linyueqian <linyueqian@outlook.com>
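The id scheme described in this commit can be illustrated with a minimal sketch. The class, field names, and id format below are hypothetical and are not taken from the PR; only the two behaviours (an epoch-independent stage0 id, and barge-in that tears down downstream stages only) come from the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class DuplexSession:
    """Hypothetical session record; names and id format are illustrative only."""

    session_id: str
    epoch: int = 0
    # stage index -> request id currently bound to that stage
    bindings: dict = field(default_factory=dict)

    def request_id(self, stage: int) -> str:
        if stage == 0:
            # Epoch-independent: the same resumable request (and its KV) is reused.
            return f"{self.session_id}-stage0"
        return f"{self.session_id}-stage{stage}-epoch{self.epoch}"

    def bind_all(self, num_stages: int) -> None:
        for stage in range(num_stages):
            self.bindings[stage] = self.request_id(stage)

    def barge_in(self) -> list:
        """Advance the epoch and return the request ids that should be aborted."""
        aborted = [rid for stage, rid in self.bindings.items() if stage != 0]
        # Keep only the stage0 binding so conversation context survives the interrupt.
        self.bindings = {s: rid for s, rid in self.bindings.items() if s == 0}
        self.epoch += 1
        return aborted
```

After `barge_in()`, rebinding the stages yields a new id for stage 1 but the unchanged id for stage 0, which is the property the commit relies on for multi-turn context.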
…x sampler Mirror the official MiniCPM-o StreamDecoder.decode listen handling in the data-plane sampler: scale the listen-token logit and optionally force-keep listen only when it ranks within top-k. Defaults (1.0, None) preserve current behavior; tunable via MINICPMO45_LISTEN_PROB_SCALE / MINICPMO45_LISTEN_TOP_K for listen/speak balance. Signed-off-by: linyueqian <linyueqian@outlook.com>
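As a rough illustration of the two knobs, here is a sketch that takes the commit text literally: the listen logit is multiplied by the scale, and "force-keep" is read as "select listen outright when it ranks within top-k". Both readings are assumptions, the function name and list-based logits are hypothetical, and the official StreamDecoder.decode may differ in detail.

```python
def apply_listen_controls(logits, listen_id, scale=1.0, top_k=None):
    """Return (adjusted_logits, force_listen) for one decision step.

    Assumption: `logits` is a plain list of floats indexed by token id.
    """
    adjusted = list(logits)
    # Literal reading of "scale the listen-token logit".
    adjusted[listen_id] = adjusted[listen_id] * scale

    force_listen = False
    if top_k is not None:
        # Rank 0 means the listen token currently has the highest logit.
        rank = sum(1 for value in adjusted if value > adjusted[listen_id])
        force_listen = rank < top_k
    return adjusted, force_listen
```

With the defaults (1.0, None) the logits come back unchanged and nothing is forced, which matches the "preserve current behavior" claim in the commit.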
…ation) Add an auto-response mode (session extra_body.auto_response/full_duplex) that runs per-chunk speak/listen generation continuously, matching the official duplex_generate loop, instead of waiting for an explicit response.create. Each ~chunk_period of streamed audio is emitted to the stage0 stream, and continuous chunks feed the ongoing stream rather than being routed through the discrete-response overlap/barge-in policy (explicit force_barge_in still interrupts). Signed-off-by: linyueqian <linyueqian@outlook.com>
…time web demo Full mode now requests server-side continuous auto-response (extra_body.auto_response) so the model speaks on its own. Turn mode no longer hangs on a model listen decision: response.listen / response.done reset the status. Signed-off-by: linyueqian <linyueqian@outlook.com>
May I know when the target merge date for this PR is? Is it scheduled?
…eaming
Two divergences from the official MiniCPMODuplexInference corrupted audio full-duplex output (degenerate / garbage transcripts vs the coherent official worker on identical input):
- _stage_prefill_embeddings_only re-emitted the assistant turn-open prefix (im_end + im_start assistant + tts_bos) on every audio chunk, re-opening the turn each chunk and producing repeated turn-initial greetings. The official feeds only <unit>+audio per chunk; the turn is opened once at session init and tts_bos/listen/turn_eos are model-generated. Drop the per-chunk prefix.
- _configure_streaming_processor used cnn_redundancy_ms=0, yielding 9 audio embed tokens/chunk vs the official's 10 (official duplex default is 20). This off-by-one misaligned the audio representation the model was trained on. Default to 20, and call processor.reset_streaming() at session init, mirroring the official init_streaming_processor (modeling_minicpmo_unified.py:207).
Verified against the official model on the same input: with these changes vllm's chunk-0 audio embeds match the official to 3 decimals (std 0.4339 vs 0.4336). A downstream LLM-forward/positions issue in the duplex data plane remains under investigation and is not addressed here.
… context budget
Follow-ups to the per-chunk assistant-prefix fix, aligning the MiniCPM-o 4.5
scheduler data-plane path with the official MiniCPMODuplexInference format:
- _stage_prefill_embeddings_only: prepend </unit> for chunks >= 1 so every
unit is closed before the next <unit> opens (official finalize_unit feeds
terminator + </unit>; the scheduler session update discards the sampled
terminator, so only the closure is appended).
- preprocess: always place padding in front of the chunk embeddings. The
appended duplex tokens occupy the tail of the request prompt and the runner
schedules [num_computed_tokens, prompt_len); with the old suffix-split
layout the audio embeds of chunks >= 1 landed outside the scheduled span
and were never forwarded, so generation ran on pad tokens only. Keeping
embeds last also puts the decode position right after the final audio
embedding (official listen/speak decision point). Warn when the worker
produces more embeddings than reserved slots instead of silently truncating.
- orchestrator/duplex: reserve extra scheduler token slots on the first
append for the session context (system prompt + optional ref-audio
embeddings); previously a long reference audio could overflow the chunk-0
budget and truncate the audio tail.
- _prepare_session_context: always emit the official
<|im_start|>system\n{text}\n<|audio_start|>[ref]<|audio_end|><|im_end|>
template, with or without reference audio.
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
… stale duplex tests
- _stage_ref_audio_embeddings: the split stage0 wrapper ports official get_audio_embedding(chunk_length=...) as get_audio_hidden_states, so the ref-audio path always fell into the streaming fallback. That truncated the reference audio to a single streaming chunk (~1 s of a 6 s prompt) and advanced the streaming mel/encoder state at session open, corrupting the first real audio chunk. Use the whole-clip encoder when available.
- duplex_scheduler_token_budget / first-append reserve: MiniCPM-o pools audio to one token per 100 ms, not 20 ms; the old 50 tok/s math reserved ~5x too many slots and filled the KV with hundreds of </unit> pad embeddings (451 of 482 chunk-0 positions measured on the dumped data plane). Use 1600 samples/token and tight margins.
- _prepare_session_context: keep the audio markers conditional on ref audio, matching MiniCPMODuplex.prepare() in the released checkpoint.
- tests: update stale duplex expectations (stage0 epoch-independent request id, barge-in preserving stage0, optional <|audio|> token, per-state session config, ref-audio stub signature, new budget numbers).
Verified by feeding the dumped vLLM chunk-0 embeddings through the official MiniCPMODuplex decoder: it reproduces the same degenerate logits the server samples, while the official pipeline on the same input yields listen logits of +12 vs garbage text at -10, isolating the bug to embed construction.
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
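The budget arithmetic in this commit can be checked with a few lines. The constants come from the commit text (1600 samples per token, one token per 100 ms, which implies 16 kHz input); the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 16_000          # implied by 1600 samples == 100 ms
SAMPLES_PER_AUDIO_TOKEN = 1_600  # pooling factor quoted in the commit
OLD_TOKENS_PER_SECOND = 50       # the earlier 20 ms-per-token assumption


def pooled_audio_tokens(num_samples: int) -> int:
    """Number of whole pooled audio tokens produced by `num_samples` of PCM."""
    return num_samples // SAMPLES_PER_AUDIO_TOKEN


def token_duration_ms() -> float:
    """Duration of audio covered by one pooled token, in milliseconds."""
    return 1000 * SAMPLES_PER_AUDIO_TOKEN / SAMPLE_RATE_HZ


one_second = pooled_audio_tokens(SAMPLE_RATE_HZ)       # 10 tokens
over_reservation = OLD_TOKENS_PER_SECOND / one_second  # 5.0x too many slots
```

A 6 s reference clip therefore needs 60 audio slots rather than the 300 the old math would have reserved.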
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…upt the model
Replaying the dumped vLLM chunk-0 data-plane embeddings through the official MiniCPMODuplex decoder isolates the final divergence: with the leading pad slots stripped the official decoder produces listen=16.25 (official pipeline: 16.38) on identical embeddings, while including the pad run yields the same degenerate logits the server samples. Any run of </unit> pad embeddings in the KV breaks the model, so scheduler slot reservations must match the worker-built embeddings exactly:
- serving (MiniCPMO45PcmAppendBuffer): emit only whole model chunks. The first emission is capped at one chunk (the worker's first unit consumes the official 1035 ms window); commit flushes zero-pad the tail to the chunk boundary (silence, in-distribution) instead of emitting partial chunks.
- serving adapter: trim reference audio to a whole number of pooled frames and precompute the exact session-context token count (shared template via MiniCPMO45DuplexPolicy.session_context_texts + samples/1600 pooling math) into duplex_first_append_context_tokens.
- engine: duplex_scheduler_token_budget returns the exact per-unit slot count (closure + <unit> + 10 audio embeddings) for whole-chunk payloads; duplex_first_append_context_reserve prefers the adapter-precomputed count; the orchestrator subtracts the absent closure slot on the first append.
- worker: _stage_prefill_embeddings_only consumes every complete chunk per append (multi-unit spans) so serving-side multi-chunk payloads stay exact.
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
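A minimal sketch of the two ideas in this commit: a buffer that only ever emits whole chunks (zero-padding the tail on commit) and a slot count that matches the embeddings exactly. The class and function names are hypothetical, the 16000-sample chunk is an assumption derived from 10 tokens x 1600 samples, and the sketch leaves out the first-emission cap, the 1035 ms first-window effect, and the session-context reserve.

```python
CHUNK_SAMPLES = 16_000  # assumption: 10 pooled tokens x 1600 samples per chunk


class WholeChunkBuffer:
    """Hypothetical stand-in for a PCM append buffer that never emits partial chunks."""

    def __init__(self, chunk_samples: int = CHUNK_SAMPLES):
        self.chunk_samples = chunk_samples
        self._pending = []

    def append(self, samples):
        """Buffer samples and return every complete chunk as one payload."""
        self._pending.extend(samples)
        whole = len(self._pending) - len(self._pending) % self.chunk_samples
        payload, self._pending = self._pending[:whole], self._pending[whole:]
        return payload

    def commit(self):
        """Flush the tail, zero-padded (silence) up to the chunk boundary."""
        if not self._pending:
            return []
        pad = self.chunk_samples - len(self._pending)
        payload = self._pending + [0] * pad
        self._pending = []
        return payload


def unit_slots(num_units: int, first_append: bool) -> int:
    """Slots for `num_units` units: closure + <unit> + 10 audio embeddings each."""
    slots = num_units * (1 + 1 + 10)
    # On the first append there is no earlier unit to close.
    return slots - 1 if first_append else slots
```

Because `append` holds back any remainder, the scheduler never has to guess how many embeddings a partial chunk would produce, so no pad slots are needed.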
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…mit whole-chunk multi-payloads The first unit consumes the official ~1035 ms window, so a k-chunk first payload yields k-1 worker units (k>=2); cap-at-one-chunk emission stranded the rest of the committed turn at serving because the commit path flushes once. Emit all whole chunks in one payload instead and model the first-window consumption in the orchestrator's first-append slot budget. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
With model-driven listen/speak restored, an explicit response.create can legitimately resolve to listen and produce no audio, which breaks the Realtime contract that response.create yields a response. Mark response-bound appends with force_speak and suppress only the listen token at the segment's decision step (official listen_prob_scale -> 0 semantics). Per-chunk full-duplex auto-response appends stay fully model-driven, and force_listen keeps precedence. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
A committed turn's tail was stranded: the first unit consumes the official ~1035 ms window, leaving up to a second of speech in the worker buffer with no following chunk to flush it, so the model decided on a partial question (and force_speak then produced an empty reply). On final appends the worker now builds exactly one extra unit - the zero-padded leftover if any, otherwise one full silence unit - matching the official post-turn silence beat at the decision step, and the scheduler budget reserves that unit. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
The official model answers across several units (the microphone keeps streaming silence while the assistant speaks); verified on the released checkpoint, a question is answered as 'I'm sorry, but / I can't answer / that question.' over three consecutive units. Turn-mode gave the model exactly one unit and stopped. While a response is still open after a segment finishes, serving now appends one silence unit at a time (capped) so the reply can complete, and force_speak suppresses the listen token at every step of response-bound segments, mirroring the official mid-turn listen -> tts_bos replacement. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…t the continuation cap Forcing speak on the response-bound segment fires at the question's final unit, where the official model still listens (it answers one silence unit later with real content); the forced decision produced near-empty utterances. Keep response-bound segments model-driven and rely on the silence-continuation units for the official decision cadence, forcing speak only on the last unit before the continuation cap. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…ce unit A model-driven listen on a response-bound segment closed the response immediately, so the continuation units never ran. While continuation budget remains, keep the response open and append the next silence unit as the model's decision point (official cadence: it often listens for a beat before answering). Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
The official duplex format feeds the sampled terminator (listen/chunk_eos/ turn_eos) + </unit> into the KV at every unit boundary, and the model's listen/speak policy conditions on its own past decisions. The scheduler session update discards the segment's final sampled token, so the KV never contained them and the model kept producing empty speak segments. The model sampler now records the terminator per duplex session (via a runner-published row -> session map) and the next append re-injects it ahead of the first unit closure; the scheduler budget reserves one extra slot for appends after the first. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
preprocess ran the duplex embedding path for every scheduling step of the data-plane request, including 1-token decode steps where token_offset is past the prompt: the prompt-embedding slice came up empty and was pad-filled, so every sampled token was forwarded as a </unit> embedding instead of itself. The model saw </unit> right after its own <|speak|> and terminated with empty utterances. Verified by replaying the dumped KV spans through the official decoder, which generates real text from the same state. Decode steps now use the normal embedding lookup of the sampled token ids. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
30 silence units kept turn-mode responses open too long and stalled subsequent turn-taking in the scenario flow; 8 s covers the official reply cadence (listen 1-2 units, speak 2-4 units). Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
The generic stage-0 final-output message accompanies every duplex segment with cumulative thinker text and no audio; gating the empty flush on unit_end_of_turn pushed it into the text-without-audio error branch, flooding auto-respond clients with one error per chunk and displacing real decisions. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Match the official per-chunk contract (exactly one result per audio chunk): duplex stage-0 segment boundaries no longer forward the raw thinker output as a final-output message. Listen decisions already flow via _emit_duplex_model_listen_output and spoken content via the talker stage; the extra message carried cumulative text with no audio and every downstream consumer had to filter it (mis-filtering caused either an error per chunk or silenced decisions). Reverts the converter-side text-without-audio suppression, no longer needed. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
The orchestrator stamps duplex_native_decision=listen/model_listen on model-listen segment outputs, but the converter only inspected completion token tails, which the wrapped listen output does not carry; listen decisions fell through to the text-without-audio error branch. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Every duplex segment ends finished=True by design; exiting the drain after each delivered batch made every subsequent decision wait for the next append to start a fresh drain task, re-adding one chunk of latency per decision. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…spond A stage-1 emission whose cumulative audio sliced to an empty delta can still carry delta text; in auto-respond mode that is normal streaming overlap, not a text-without-audio protocol error. Restores the guard removed in d73682f now that the listen-marker (f94ae50) and persistent drain (e30885d) fixes cover the cases it was blamed for. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
The official worker processes each chunk in one synchronous loop, so nothing can be lost between chunks. Our per-append cancel+restart of the data-plane drain task orphaned any decision arriving in the swap window (chunk 0 delivered, everything after raced). Keep one drain for the session's stable resumable request and skip the restart entirely. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…ROFILE_LOGS Log every boundary of the auto-respond event path to localize where post-chunk-0 decisions are lost: append control result contents ([append-result]), drain task lifecycle ([drain-start], [drain]), pump routing with per-queue depth ([pump]), and collect-side request state census ([collect]). Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…trace Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…chet The model-listen wrapper carries the thinker's hidden states under the 'latent' mm key, which the audio encoder's key fallback treated as a waveform. Encoding it on the chunk-0 listen ratcheted the per-request cumulative audio offset to tokens*hidden_dim fake samples, so every later talker unit sliced to empty and was silently dropped by the auto-respond empty-audio guard: chunk-0 listen arrived, then the session went mute (audio only resurfaced once the real cumulative waveform outgrew the poisoned offset, ~16 units). Compute the native decision first and yield listen results before any audio work so listen batches never touch the offset. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
… batches A talker segment streams several cumulative-audio batches that all carry the same segment text; attaching it per batch re-delivered the text with every audio delta (official results carry per-unit deltas exactly once). Track delivered chars per request, attach only the unseen suffix, and reset at segment end so a genuinely repeated next segment still goes out. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
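The "attach only the unseen suffix" rule in this commit can be sketched as a small per-request tracker. The class and method names are hypothetical; the follow-up commits below revise when the reset happens, so this reflects only the behaviour described here.

```python
class SegmentTextTracker:
    """Hypothetical helper: deliver each segment's text at most once per request."""

    def __init__(self):
        self._sent_chars = {}  # request id -> number of characters already delivered

    def unseen_suffix(self, request_id: str, segment_text: str) -> str:
        """Return the part of `segment_text` not yet delivered for this request."""
        sent = self._sent_chars.get(request_id, 0)
        delta = segment_text[sent:]
        self._sent_chars[request_id] = max(sent, len(segment_text))
        return delta

    def end_segment(self, request_id: str) -> None:
        """Reset at segment end so an identical next segment is delivered again."""
        self._sent_chars.pop(request_id, None)
```

Several cumulative-audio batches that carry the same segment text then produce the text once, followed by empty deltas.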
A segment whose finished batch slices to an empty audio delta never hit the in-branch reset, so the next segment's text was suffix-sliced against the previous segment and lost. Clear the per-request sent-segment text in the output iterator for every finished batch, and compare by content so a genuinely repeated next segment is still delivered. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Browser session trace showed continuation units re-running the talker
with the SAME handed text after finished=True (every engine segment ends
finished, so finished does not partition text). The iterator-level reset
re-attached the text once per continuation segment, duplicating the
transcript ('of your day so far? of your day so far?') and re-opening a
ghost bubble after end of turn. Compare by content only and replace the
stored text when it actually changes; a verbatim-identical consecutive
reply keeps its audio but not a second transcript copy (rare, accepted).
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
… handoff The resumable duplex prompt folds every earlier unit, so a <|tts_bos|> from an already-spoken reply can sit mid-prompt. The unbounded last-bos search re-sliced that stale region on text-less continuation units and re-handed already-spoken text to the talker, which re-synthesized it (official feeds a lone audio_bos for empty units and never re-feeds text). Restrict the search to the final prompt token (this unit's folded decision) and the current segment. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Match the official MiniCPMODuplex talker conditioning, which our per-segment one-shot synthesis violated in five ways that together garbled the reply audio (correct text, mush between islands):
- One TTSStreamingGenerator per spoken turn with carried KV and text_start_pos, fed once per unit; reset only when <|turn_eos|> arrives. Previously every segment synthesized as an independent utterance from scratch.
- text_eos only at turn end (text_finished was True for every segment, stamping utterance-final prosody on each ~1s snippet).
- ~chunk_size codec tokens per unit instead of a 256+ token free-run, so audio length tracks the text again.
- Per-turn token2wav stream: ref-audio caches cloned once per turn with a [4218]*3 silence-token seed (stops ref-voice bleed at onsets) and overlapping pre_lookahead windows advancing chunk_size, flushed with last_chunk only at turn end. Previously each segment re-primed the caches from the ref wav and fed disjoint windows.
- Repetition penalty only for duplex TTS sampling (official constructs but never applies the top-p/top-k warpers).
Handoffs now accumulate in the runner streaming buffer (streaming_accumulated_keys, with list support) and the turn state consumes them by cursor, so a unit arriving mid-synthesis is queued instead of overwritten. llm2tts conditions on mid-unit <|speak|> tokens and includes <|turn_eos|> token+hidden (the trained stop signal), stamping its id in handoff meta for turn-end detection.
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
The first stream() call of each mode (mid-turn 25+pre_lookahead window and the last_chunk tail flush) costs ~20s of one-time compilation, which landed inside the first spoken turn of the first session (two ~20s stalls). Warm both modes against the default ref-audio caches when the profile/dummy run reaches the talker, mirroring the official demo's precompile step. Opt out with MINICPMO45_SKIP_T2W_WARMUP=1. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…ndow A unit that sampled EOS early buffered fewer than window-size codec tokens, yielded no audio, and its chunk produced no client-visible result; the bridge's per-chunk lockstep then waited out its 20s timeout (the two ~20s first-reply stalls). Official pins min=max tokens per mid-turn unit with EOS at -inf; mirror it with a toggleable suppressor re-enabled per unit and lifted at turn end. Also keep the per-unit generate/vocode timing trace behind MINICPMO45_PROFILE_LOGS. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…ity) The official demo forces the first force_listen_count units (default 3) to listen so the model never answers off one second of partial audio; we had no startup guard, so the model spoke at chunk 0/1 with confabulated content. Inject force_listen into the data-plane payload for the first N appends (configurable via extra_body.force_listen_count, 0 disables); the runner already applies it once per segment. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
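The startup guard amounts to stamping a flag on the first N append payloads. A minimal sketch, assuming a dict payload and a zero-based append counter; the function name is hypothetical, and the default of 3 is the official demo's value quoted above, not necessarily this PR's default.

```python
def with_startup_guard(payload: dict, append_index: int, force_listen_count: int = 3) -> dict:
    """Return a copy of `payload` with force_listen set for the first N appends.

    `append_index` is assumed to count appends from 0; force_listen_count=0 disables the guard.
    """
    guarded = dict(payload)
    if append_index < force_listen_count:
        guarded["force_listen"] = True
    return guarded
```

With the default of 3, appends 0 to 2 are forced to listen and append 3 is the first model-driven decision.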
Variable-size last_chunk windows hit a fresh ~20s vocoder compile per new token count, and an empty tail emitted no event for its chunk at all, starving the bridge's per-chunk lockstep into its 20s timeout at reply ends. Pad every tail to chunk_size+pre_lookahead silence tokens (one shape, warmed at startup; trailing silence at reply end is inaudible) and always emit the final waveform batch. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
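Padding every tail to one fixed length is what keeps the vocoder on a single, pre-warmed shape. A sketch under stated assumptions: the silence token id 4218 and chunk_size of 25 are quoted in earlier commits on this page, while the pre_lookahead value of 3 and the function name are hypothetical.

```python
SILENCE_TOKEN = 4218  # silence codec token id quoted in the talker-conditioning commit


def pad_tail(tokens, chunk_size: int = 25, pre_lookahead: int = 3):
    """Pad a turn-final codec-token tail to exactly chunk_size + pre_lookahead tokens."""
    target = chunk_size + pre_lookahead
    if len(tokens) > target:
        # Assumption: a tail never exceeds one window; surface it rather than truncate.
        raise ValueError(f"tail of {len(tokens)} tokens exceeds window of {target}")
    return list(tokens) + [SILENCE_TOKEN] * (target - len(tokens))
```

An empty tail still yields a full window of silence tokens, so the final waveform batch is always produced for that chunk.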
Clients that treat the audio.delta + transcript.delta pair as the per-unit completion signal (the official-demo bridge does) waited 20s at every reply end: the turn-end flush and deduplicated continuation units carry audio but no text, so no transcript followed, the pending unit never flushed, and the per-chunk lockstep timed out. Emit the transcript delta with an empty delta whenever audio was emitted. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…andoff Replaying our exact per-unit conditions through the OFFICIAL talker and vocoder reproduced the garble, exonerating our generator: the conditions themselves were missing ALTERNATING reply segments (decoded condition for 'This is a Chinese TV show called "The Legend of Qin Shi Huang".' was only ' TV show called' + ' of Qin' + '".') — the talker vocalized text it never received. Root cause: the runner's streaming-buffer update is not merge-safe for a resumable stage-1 request (in-place updates merge sub-keys, resume prefills REPLACE the buffer), so runner-side accumulation silently dropped segments and the consumer cursor then ate the head of each replacement. Accumulate in llm2tts instead (per-request bridge state, cleared on epoch reset) and hand the complete ids+hidden history every handoff, making downstream replace semantics lossless; drop the runner-side streaming_accumulated_keys to avoid double-merge. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Two follow-ups to the accumulated-condition handoff: (1) a segment delta
can start with several unit decisions (forced/model listens from chunks
that produced no stage-1 handoff accumulate ahead of the speak), so the
old output[0]!=listen check skipped the ENTIRE first speak segment of
every reply (' This is a Chinese' / ' I think it's' never reached the
talker); skip the leading listen run, then the speak decision. (2) the
talker's consumed-cursor lived in the per-turn state and died at
turn_eos, so the next reply re-read the whole accumulated history and
re-synthesized the previous reply; keep the cursor per request, popped
only when the request finishes.
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
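Both follow-ups can be sketched in a few lines: stripping the leading listen run plus the speak decision from a segment delta, and keeping the talker's consumed cursor per request rather than per turn. Token ids, class names, and function names below are hypothetical.

```python
def strip_leading_decisions(delta, listen_id: int, speak_id: int):
    """Drop the leading run of listen tokens, then one speak token, from a delta."""
    i = 0
    while i < len(delta) and delta[i] == listen_id:
        i += 1
    if i < len(delta) and delta[i] == speak_id:
        i += 1
    return delta[i:]


class RequestCursors:
    """Hypothetical per-request cursor over the accumulated handoff history."""

    def __init__(self):
        self._pos = {}  # request id -> number of history entries already consumed

    def take_new(self, request_id: str, history):
        """Return only the entries added since the last call for this request."""
        start = self._pos.get(request_id, 0)
        self._pos[request_id] = len(history)
        return history[start:]

    def finish(self, request_id: str) -> None:
        """Drop the cursor only when the request itself finishes."""
        self._pos.pop(request_id, None)
```

Because the cursor survives turn_eos, the second reply of a request reads only its own entries instead of replaying the first reply.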
A segment delta's leading decision tokens can also be folded into the
resumable prompt, so prompt_len + delta over-counts them and the
front-aligned hidden indexing truncated each reply's first segment to
its first token (' This [is a Chinese]' lost). The hidden tensor's last
len(delta) rows are the delta's rows; index from the end.
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
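The indexing fix can be shown with plain lists standing in for hidden-state rows. The function name and the numbers in the example are illustrative only.

```python
def delta_hidden_rows(hidden_rows, delta_len: int):
    """Return the rows belonging to the delta: the last `delta_len` rows."""
    if delta_len == 0:
        return []
    return hidden_rows[-delta_len:]


# Example: 6 hidden rows; the delta has 3 tokens, 2 of which were also folded
# into the prompt, so prompt_len (5) + len(delta) (3) over-counts the rows.
rows = list(range(6))
prompt_len, delta_len = 5, 3
front_aligned = rows[prompt_len:prompt_len + delta_len]  # only one row survives
end_aligned = delta_hidden_rows(rows, delta_len)         # all three delta rows
```

Indexing from the end stays correct regardless of how many leading decision tokens were folded into the prompt.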
Not scheduled. @linyueqian and I are working on it.
For anyone who wants to drive this PR from the official browser demo: the adapter we use is the bridge worker (step 2 below).
Quick start:
# 1. vLLM duplex server (2 GPUs)
vllm-omni serve <MiniCPM-o-4_5 path> --omni \
--stage-configs-path vllm_omni/model_executor/stage_configs/minicpmo45_2gpu_streaming.yaml \
--trust-remote-code --port 8099
# 2. Bridge worker
python examples/online_serving/minicpmo/official_demo_bridge_worker.py --port 22500 \
--vllm-ws "ws://localhost:8099/v1/realtime?duplex=1" --model <MiniCPM-o-4_5 path>
# 3. Official demo gateway (from OpenBMB/MiniCPM-o-Demo)
python gateway.py --http --workers localhost:22500
Then open the demo web UI and talk. The session follows the official semantics: the model decides listen/speak per ~1s unit, results carry per-unit delta text and audio, and end of turn fires once per reply.
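For clients that talk to the /v1/realtime?duplex=1 endpoint directly instead of through the bridge, the messages would look roughly like the sketch below. It builds only the JSON payloads; the event type names are the ones listed in this PR's description, while the base64 "audio" field and 16-bit little-endian PCM follow the OpenAI Realtime convention and are an assumption about what this server accepts.

```python
import base64
import json
import struct


def append_event(samples) -> str:
    """Encode 16-bit PCM samples as an input_audio_buffer.append message."""
    pcm = struct.pack(f"<{len(samples)}h", *samples)  # little-endian int16
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),  # assumed field name
    })


def commit_event() -> str:
    """Mark the end of the buffered user audio."""
    return json.dumps({"type": "input_audio_buffer.commit"})


def response_create_event() -> str:
    """Explicitly request a response (turn mode)."""
    return json.dumps({"type": "response.create"})
```

Each string would be sent as one websocket text frame; the transport code is omitted here because it depends on the client library in use.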
The prebuilt MiniCPM-o-Demo frontend schedules playback assuming every per-chunk result carries exactly one second of 24 kHz audio (listens are silence, non-final speak units are left-padded, only the turn-final result is an unpadded tail). Forwarding variable-size generation deltas with no audio on listens made the frontend drift and clip even though the delivered bytes were verbatim-clear. Buffer reply samples in the bridge and emit one paced unit per result. Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
This is expected to be ready before the next major release. I have more time from now on and will review it thoroughly, as it introduces several structural changes designed to serve as an abstraction for easily integrating full-duplex models. From a higher-level perspective, I think the abstraction should avoid being too model-specific, and we should also summarize the architectural trends of recent full-duplex models.
NOTICE: This PR is still WIP
Refs #3745. Stacked on #3642 (tc-mb:Support-MiniCPM-o-4.5).
This PR targets main so it can be opened against the upstream repository. Until #3642 is merged, the diff also includes the base MiniCPM-o 4.5 model-support changes from #3642. The full-duplex/realtime work is in the follow-up signed-off commit 32ab95b2ffdb982fefaa0bbffff5efafdf8a175e.
Purpose
This PR adds a MiniCPM-o 4.5 native full-duplex realtime runtime path for the architecture discussed in #3745.
The base model support comes from #3642. This PR extends that baseline from normal staged MiniCPM-o 4.5 serving into a session-oriented audio streaming path:
The goal is not just a smoke test around the existing chat endpoint. The implementation wires a real audio-in -> Stage0 -> Stage1 -> audio-out loop and covers the core realtime control cases needed by the current demo: streaming input append, model listen/speak decisions, audio response streaming, cancel/barge-in, overlap handling, playback ack, and conversation item emission.
RFC Alignment
Implemented from the #3745 full-duplex direction:
/v1/realtime?duplex=1 maps the main OpenAI Realtime-style events used by the demo, including session.update, input_audio_buffer.append, input_audio_buffer.commit, response.create, response.cancel, response.audio.delta, response.audio.done, response.done, and conversation item events.
Intentional differences or improvements versus the initial RFC sketch:
Technical Changes
Serving and protocol:
/v1/duplex websocket serving for the native duplex protocol.
/v1/realtime?duplex=1 websocket serving for the Realtime-compatible adapter path.
Engine / orchestrator / scheduler:
MiniCPM-o 4.5 runtime:
Examples and configs:
Tests:
Current Verified Behavior
The latest controlled remote E2E covers the important demo path:
/v1/realtime?duplex=1 session creation.
response.audio.delta, response.audio.done, and response.done emission.
Known Boundaries / Not Claimed Complete
This PR is a substantial step toward the #3745 RFC, but it should not be reviewed as a final production-complete full-duplex core. Known remaining work:
resumable/session state should not be interpreted as a full scheduler-owned KV lease with allocation, rollback, migration, and release semantics.
Test Plan
git diff --check.
py_compile for the duplex/realtime serving and MiniCPM-o native runtime files.
/v1/realtime?duplex=1.
Test Result
Local checks:
E2E:
54cf2229f33584ae
returncode=0, ok=true
Key E2E signals:
Server log audit for the E2E run:
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)