[Feature] Support Qwen3 Omni talker mtp batch inference by ZeldaHuang · Pull Request #722 · vllm-project/vllm-omni

ZeldaHuang · 2026-01-09T10:06:21Z

Purpose

Test Plan

Test Result

Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6c5a143c41

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you open a pull request for review, mark a draft as ready, or comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-01-09T10:10:14Z

+                    decode_req_ids.append(req_id)
                    last_talker_hidden, text_step = update_dict.pop("mtp_inputs")
-                    if _cudagraph_mode != CUDAGraphMode.NONE:
-                        self.talker_mtp_input_ids.gpu[:span_len].copy_(req_input_ids)
-                        self.talker_mtp_inputs_embeds.gpu[:span_len].copy_(req_embeds)
-                        self.last_talker_hidden.gpu[:span_len].copy_(last_talker_hidden)
-                        self.text_step.gpu[:span_len].copy_(text_step)
-                        req_input_ids = self.talker_mtp_input_ids.gpu[:span_len]
-                        req_embeds = self.talker_mtp_inputs_embeds.gpu[:span_len]
-                        last_talker_hidden = self.last_talker_hidden.gpu[:span_len]
-                        text_step = self.text_step.gpu[:span_len]
-                    with set_forward_context(
-                        None, self.vllm_config, cudagraph_runtime_mode=_cudagraph_mode, batch_descriptor=batch_desc
-                    ):
-                        req_embeds, code_predictor_codes = self.talker_mtp(
-                            req_input_ids, req_embeds, last_talker_hidden, text_step
-                        )
-                        update_dict["code_predictor_codes"] = code_predictor_codes
+                    decode_slice = slice(len(decode_req_ids), len(decode_req_ids) + 1)


Fix decode buffer indexing off-by-one

The decode buffers are filled using decode_slice = slice(len(decode_req_ids), len(decode_req_ids) + 1) immediately after appending the current request, which makes the first decode request write into index 1 instead of 0. _talker_mtp_forward then reads [:decode_batch_size], so the first row is stale and the last request’s data is never read. In multi-request decode batches this silently corrupts embeddings/codes for at least one request.
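
The ordering problem described above can be reproduced without any GPU code. The following is a minimal sketch, assuming a plain Python list stands in for the per-request decode buffer and a single index stands in for the one-row slice; `MAX_REQS`, `fill_after_append`, and `fill_before_append` are hypothetical names, not part of the PR.

```python
MAX_REQS = 4  # hypothetical buffer capacity, standing in for a max_num_reqs-sized buffer


def fill_after_append(req_ids):
    """Reported pattern: the slot is computed after the request is appended."""
    buf = [None] * MAX_REQS
    decode_req_ids = []
    for req_id in req_ids:
        decode_req_ids.append(req_id)
        slot = len(decode_req_ids)  # already 1 for the first request
        buf[slot] = req_id
    return buf


def fill_before_append(req_ids):
    """One possible fix: compute the slot before appending."""
    buf = [None] * MAX_REQS
    decode_req_ids = []
    for req_id in req_ids:
        slot = len(decode_req_ids)  # 0 for the first request
        decode_req_ids.append(req_id)
        buf[slot] = req_id
    return buf


reqs = ["r0", "r1", "r2"]
# The consumer reads the first len(reqs) rows, as the review describes.
buggy = fill_after_append(reqs)[: len(reqs)]
fixed = fill_before_append(reqs)[: len(reqs)]
print(buggy)  # row 0 is stale and "r2" is never read
print(fixed)
```

In the first variant the row at index 0 keeps whatever it held before, and the last request's row sits just past the region that is read.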

chatgpt-codex-connector · 2026-01-09T10:10:15Z

+            # run talker mtp decode
+            if hasattr(self.model, "talker_mtp"):
+                self._talker_mtp_forward(decode_req_ids, inputs_embeds)       


Guard talker MTP when no decode requests

_talker_mtp_forward is called unconditionally whenever the model has talker_mtp, even if decode_req_ids is empty (e.g., a prefill-only batch where all span_len > 1). This makes decode_batch_size=0, and talker_mtp ultimately calls code_predictor_forward with seq_len=0, which hits torch.cat(all_codes_per_position, dim=2) on an empty list and raises a runtime error. This will crash prefill-only batches; add a guard to skip _talker_mtp_forward when there are no decode requests.
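
A guard of the kind suggested here might look like the sketch below. It is only an illustration: `talker_mtp_forward_stub` is a hypothetical stand-in that fails on empty input in the same spirit as concatenating an empty list, and `run_talker_mtp_if_needed` is an assumed helper name rather than a function from the PR.

```python
def talker_mtp_forward_stub(decode_req_ids):
    """Hypothetical stand-in for the MTP forward pass.

    It builds one chunk per decode request and joins them, and it fails when
    there is nothing to join, loosely mirroring a concatenation of an empty list.
    """
    chunks = [[req_id] for req_id in decode_req_ids]
    if not chunks:
        raise RuntimeError("expected a non-empty list of chunks")
    return [item for chunk in chunks for item in chunk]


def run_talker_mtp_if_needed(model_has_talker_mtp, decode_req_ids):
    """Skip the MTP step for prefill-only batches (no decode requests)."""
    if not model_has_talker_mtp or not decode_req_ids:
        return None
    return talker_mtp_forward_stub(decode_req_ids)


prefill_only = run_talker_mtp_if_needed(True, [])
mixed = run_talker_mtp_if_needed(True, ["r0", "r1"])
```

With the guard in place, a prefill-only batch returns early instead of reaching the failing concatenation.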

yinpeiqi · 2026-01-10T05:19:39Z

+        for idx, req_id in enumerate(decode_req_ids):
+            req_index = self.input_batch.req_ids.index(req_id)
+            start_offset = int(self.query_start_loc.cpu[req_index])
+            inputs_embeds[start_offset : start_offset + 1] = req_embeds[idx : idx + 1]


With this code, can we directly use inputs_embeds = req_embeds and avoid iterating over each request? I can remove the update of "code_predictor_codes" and the "merge update" here. If so, we may be able to place batch_mtp_forward directly inside the model forward.

req_embeds does not include prefill requests, so assigning inputs_embeds = req_embeds directly may be incorrect for a mixed prefill and decode batch.
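
The mixed-batch concern can be illustrated with a small sketch. It assumes plain Python lists in place of tensors and uses strings as placeholder embedding rows; `scatter_decode_rows` is a hypothetical helper that mirrors the loop in the diff above, and the batch layout is made up for the example.

```python
def scatter_decode_rows(inputs_embeds, req_embeds, decode_req_ids,
                        batch_req_ids, query_start_loc):
    """Write each decode request's single row back at its start offset."""
    out = list(inputs_embeds)
    for idx, req_id in enumerate(decode_req_ids):
        req_index = batch_req_ids.index(req_id)
        start_offset = query_start_loc[req_index]
        out[start_offset] = req_embeds[idx]
    return out


# Assumed batch: one prefill request with 3 tokens, then two decode requests.
batch_req_ids = ["p0", "d0", "d1"]
query_start_loc = [0, 3, 4, 5]  # cumulative token offsets per request
inputs_embeds = ["p0_t0", "p0_t1", "p0_t2", "d0_old", "d1_old"]

decode_req_ids = ["d0", "d1"]
req_embeds = ["d0_new", "d1_new"]  # decode rows only, no prefill rows

merged = scatter_decode_rows(
    inputs_embeds, req_embeds, decode_req_ids, batch_req_ids, query_start_loc
)
print(merged)
```

Here `req_embeds` has two rows while the batch has five token positions, so replacing `inputs_embeds` wholesale would drop the prefill rows; scattering by offset keeps them.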

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2147982303

chatgpt-codex-connector · 2026-01-13T03:43:19Z

+                if getattr(self.model, "talker", None) is not None and hasattr(self.model, "talker_mtp"):
                    outputs = self.talker_mtp(
                        self.talker_mtp_input_ids.gpu[:num_tokens_padded],
                        self.talker_mtp_inputs_embeds.gpu[:num_tokens_padded],


Guard talker_mtp call to decode-populated buffers

This call now runs for every batch, but the talker_mtp_* buffers are only populated for decode requests in _prepare_inputs when span_len == 1. For prefill-only batches (or any batch where num_tokens_padded exceeds the number of decode requests), the slice [:num_tokens_padded] pulls stale/uninitialized entries (and can exceed the max_num_reqs-sized buffers), which can drive talker_mtp with invalid token IDs/hidden states and trigger out-of-range embedding errors or random sampling on garbage data. This should be guarded by the actual decode batch size (or removed) and sliced by that size rather than num_tokens_padded.
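
The slicing point can be sketched as follows, again with a plain list standing in for a `max_num_reqs`-sized buffer. `select_mtp_rows` is a hypothetical helper and the numbers are invented for illustration.

```python
def select_mtp_rows(buffer, num_decode_reqs):
    """Return only the rows that were populated for decode requests."""
    if num_decode_reqs == 0:
        return []  # prefill-only batch: nothing to feed to the MTP step
    if num_decode_reqs > len(buffer):
        raise ValueError("more decode requests than buffer rows")
    return buffer[:num_decode_reqs]


# Assumed state: 4-row buffer, only the first two rows written this step.
buffer = ["d0", "d1", "stale", "stale"]
num_tokens_padded = 8  # padded token count, larger than the buffer itself
num_decode_reqs = 2

by_padded_tokens = buffer[:num_tokens_padded]
by_decode_count = select_mtp_rows(buffer, num_decode_reqs)
print(by_padded_tokens)  # includes rows that were never written this step
print(by_decode_count)
```

Slicing by the padded token count sweeps in the unwritten rows, while slicing by the decode request count returns only the populated ones.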

Gaohan123

Could you please add a unit test to protect key methods? Thanks!

ZeldaHuang · 2026-01-14T10:41:59Z

Could you please add a unit test to protect key methods? Thanks!

Done.

Gaohan123

LGTM. Thanks!

launch server && capture graph work

6c5a143

Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>

ZeldaHuang requested a review from hsliuustc0106 as a code owner January 9, 2026 10:06

ZeldaHuang changed the title ~~Support Qwen3 Omni talker mtp batch inference~~ [Feature] Support Qwen3 Omni talker mtp batch inference Jan 9, 2026

ZeldaHuang marked this pull request as draft January 9, 2026 10:06

chatgpt-codex-connector Bot reviewed Jan 9, 2026

yinpeiqi reviewed Jan 10, 2026

ZeldaHuang added 5 commits January 12, 2026 10:49

Merge branch 'main' into support_talker_batch

90ec453

Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>

fix

1aa251a

Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>

fix

025f29f

Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>

fix

9d3a984

Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>

Merge branch 'main' into support_talker_batch

2147982

ZeldaHuang marked this pull request as ready for review January 13, 2026 03:35

chatgpt-codex-connector Bot reviewed Jan 13, 2026

yttasdfghjk approved these changes Jan 14, 2026

hsliuustc0106 added the ready label to trigger buildkite CI label Jan 14, 2026

set max_num_seqs to max_batch_size in load_stage_configs_from_yaml

f3afd1d

Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>

Gaohan123 reviewed Jan 14, 2026

ZeldaHuang added 2 commits January 14, 2026 17:52

add test_omni_gpu_model_runner.py

eb2562e

Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>

update test

3df2f24

Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>

Gaohan123 approved these changes Jan 14, 2026

Gaohan123 merged commit 1444e1f into vllm-project:main Jan 14, 2026
7 checks passed

erfgss pushed a commit to erfgss/vllm-omni that referenced this pull request Jan 19, 2026

[Feature] Support Qwen3 Omni talker mtp batch inference (vllm-project…

2bab2c0

…#722) Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com> Signed-off-by: Chen Yang <2082464740@qq.com>

This was referenced Jan 20, 2026

[Bug]: Diffusion models cannot launch with a stage config #859

Closed

[Bugfix] Diffusion model fails to load when stage config is present #860

Merged

with1015 pushed a commit to with1015/vllm-omni that referenced this pull request Jan 20, 2026

[Feature] Support Qwen3 Omni talker mtp batch inference (vllm-project…

5bc4cdd

…#722) Signed-off-by: ZeldaHuang <hzm414167@alibaba-inc.com>

Bounty-hunter mentioned this pull request Jan 21, 2026

[Bug]: Qwen3-Omni-30B-A3B-Instruct，One request is sent per second, and by the 100th request, the end-to-end latency deteriorates by more than 30 times compared to the first request.(20s -> 671s) JiusiServe/vllm-omni#20

Closed

1 task

[Feature] Support Qwen3 Omni talker mtp batch inference #722
Gaohan123 merged 9 commits into vllm-project:main from ZeldaHuang:support_talker_batch

5 participants
