
[Feature]: Support for multiple embedding types in a single inference call #35829

Merged
noooop merged 20 commits into vllm-project:main from staugust:fuse_dense_sparse_embed
Mar 17, 2026

Conversation

@staugust
Contributor

@staugust staugust commented Mar 3, 2026

Purpose

As proposed in issue #35190, support both sparse and dense embeddings in a single inference call.

Test Plan

./tests/plugins_tests/test_bge_m3_sparse_io_processor_plugins.py

Test Result

All test cases passed. Both sparse and dense embeddings work as expected.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces support for multiple embedding types in a single inference call, which is a significant feature enhancement. The changes involve updating various components to handle a list of pooling tasks instead of a single task. This includes modifications to request mixins, pooling parameters, IO processor logic, and model runner output structures. The test cases have also been updated to cover the new multi-task scenarios.

However, there are critical issues identified in the handling of request queues within the BgeM3SparseEmbeddingsProcessor and a list initialization bug in DispatchPooler that could lead to incorrect behavior or crashes. Additionally, the parameter validation for multiple pooling tasks in PoolingParams needs to be more robust.
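
The robustness concern about validating multiple pooling tasks in PoolingParams can be illustrated with a minimal sketch. The class, field, and task names below are hypothetical stand-ins, not vLLM's actual PoolingParams API:

```python
from dataclasses import dataclass, field

# Illustrative set of task names; vLLM's real supported tasks may differ.
SUPPORTED_TASKS = {"embed", "token_embed", "token_classify"}

@dataclass
class PoolingParams:
    # A request may now carry several pooling tasks instead of one.
    tasks: list[str] = field(default_factory=lambda: ["embed"])

    def __post_init__(self) -> None:
        # Reject empty, unknown, and duplicated task lists up front,
        # so malformed requests fail at construction rather than mid-inference.
        if not self.tasks:
            raise ValueError("at least one pooling task is required")
        unknown = set(self.tasks) - SUPPORTED_TASKS
        if unknown:
            raise ValueError(f"unsupported pooling tasks: {sorted(unknown)}")
        if len(set(self.tasks)) != len(self.tasks):
            raise ValueError("duplicate pooling tasks are not allowed")
```

Validating at construction time keeps the scheduler and model runner free of per-task error handling, which is the kind of robustness the review comment asks for.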

@mergify

mergify bot commented Mar 3, 2026

Hi @staugust, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Member

@DarkLight1337 DarkLight1337 left a comment


Hmm, at the end of the day we are still having multiple tasks in PoolingParams. So IOProcessor cannot really avoid this complexity. If that's the case, then I suggest splitting up the PR into two parts: the first part supports multiple tasks in PoolingParams, while the second part updates the plugin.

@staugust
Contributor Author

staugust commented Mar 3, 2026

@DarkLight1337 ok, shall we update the pooling API to support multiple pooling tasks?

@DarkLight1337
Member

Sure

@staugust force-pushed the fuse_dense_sparse_embed branch from 0741f23 to 0dc2998 on March 3, 2026 at 06:48
@mergify

mergify bot commented Mar 3, 2026

Hi @staugust, the pre-commit checks have failed again; the instructions are the same as in the comment above.

@noooop
Collaborator

noooop commented Mar 3, 2026

I don't think it's worth making such a big change for this feature.

I think it would be better to add one or more combination of tasks, rather than supporting multiple tasks.
For example, adding a task : "embed+token_embed+token_classify"

For this specific requirement, we only need to add one task to output the results of embed + token_embed + token_classify. Do we really need to broadly support multiple tasks?
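
noooop's suggestion, registering one combined task rather than generic multi-task support, could be sketched roughly as follows. This is a hedged illustration only: the pooler functions, the plain-list hidden-state representation, and the dispatch dict are all stand-ins, not vLLM's actual pooler implementation:

```python
# Hidden states are modeled as a list of per-token vectors (list[list[float]]).
def embed_pool(h):
    # Dense sentence embedding: mean over tokens.
    dim = len(h[0])
    return [sum(tok[i] for tok in h) / len(h) for i in range(dim)]

def token_embed_pool(h):
    # Multi-vector retrieval: per-token embeddings, passed through as-is.
    return h

def token_classify_pool(h):
    # Sparse retrieval: one score per token (a trivial stand-in scorer).
    return [sum(tok) for tok in h]

# One extra dispatch entry covers the combined task; no generic
# multi-task machinery is needed elsewhere.
POOLERS = {
    "embed": lambda h: {"embed": embed_pool(h)},
    "embed+token_embed+token_classify": lambda h: {
        "embed": embed_pool(h),
        "token_embed": token_embed_pool(h),
        "token_classify": token_classify_pool(h),
    },
}
```

The appeal of this shape is that the scheduler still sees exactly one task name per request; only the pooler's output grows.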

@mergify

mergify bot commented Mar 3, 2026

Hi @staugust, the pre-commit checks have failed again; the instructions are the same as in the comment above.

@staugust
Contributor Author

staugust commented Mar 3, 2026

@noooop I've modified PoolingTask and the two related PoolingRequest data classes. For sparse and dense embedding, we only need to add 'token_classify+embed'. Either way, core.py, scheduler.py, and gpu_model_runner have to be updated to process the two pooling outputs of one request properly. PTAL, thanks.

@staugust
Contributor Author

staugust commented Mar 5, 2026

Hi @noooop, @DarkLight1337 — just wanted to kindly check if you have any updates or feedback on this PR when you have a moment. Thanks!

@mergify

mergify bot commented Mar 5, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @staugust.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@staugust force-pushed the fuse_dense_sparse_embed branch from 628fe67 to 0b0be64 on March 10, 2026 at 07:35
@mergify

mergify bot commented Mar 10, 2026

Hi @staugust, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
auto-merge was automatically disabled March 17, 2026 07:08

Head branch was pushed to by a user without write access

@staugust force-pushed the fuse_dense_sparse_embed branch from b4003e3 to c19f266 on March 17, 2026 at 07:08
@noooop noooop merged commit 9c7cab5 into vllm-project:main Mar 17, 2026
59 checks passed
zhenwei-intel pushed a commit to zhenwei-intel/vllm that referenced this pull request Mar 17, 2026
… call (vllm-project#35829)

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
… call (vllm-project#35829)

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
andylolu2 pushed a commit to andylolu2/vllm that referenced this pull request Mar 18, 2026
… call (vllm-project#35829)

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
… call (vllm-project#35829)

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
@noooop
Collaborator

noooop commented Mar 19, 2026

@staugust

We are very sorry, but we plan to deprecate support for multi-task (for the v2 runner), meaning users will need to select only one of them at the beginning.

Please modify this bge-m3 plugin to use only the dense & sparse tasks, meaning that for all requests, both the dense and sparse tasks will be computed. When returning results to users, filter out the content they need.

This may increase overhead, but we have no other choice.
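
The compute-everything-then-filter approach described above can be sketched as follows. The function names and pooling math are purely illustrative, not the plugin's actual code:

```python
# Hidden states are modeled as a list of per-token vectors (list[list[float]]).
def run_pooling(hidden_states):
    # Always compute both outputs for every request, as the comment suggests.
    dim = len(hidden_states[0])
    return {
        "dense": [sum(tok[i] for tok in hidden_states) / len(hidden_states)
                  for i in range(dim)],            # mean-pooled embedding
        "sparse": [max(tok) for tok in hidden_states],  # per-token score stand-in
    }

def respond(hidden_states, requested):
    # Filter the full result down to what the user actually asked for.
    full = run_pooling(hidden_states)
    return {k: v for k, v in full.items() if k in requested}
```

The overhead is that the unrequested output is still computed; the benefit is that the runner only ever sees a single task configuration.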

Comment on lines -236 to 243
self.eos_token_id,
"token_classify": token_classify_pooler,
"embed&token_classify": BgeM3Pooler(
token_classify_pooler, embed_pooler
),
Collaborator


Sorry, I'm not 100% sure how you implemented support for dense, sparse, and dense&sparse simultaneously, but you need to change it to only use "embed&token_classify": BgeM3Pooler.
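
A composite pooler along the lines the reviewer describes might look like this minimal sketch; the real BgeM3Pooler signature and return format may well differ:

```python
# Hypothetical composite pooler: always runs both sub-poolers and returns
# a dict keyed by task name, so a single "embed&token_classify" registration
# can serve both outputs.
class BgeM3Pooler:
    def __init__(self, token_classify_pooler, embed_pooler):
        self.token_classify_pooler = token_classify_pooler
        self.embed_pooler = embed_pooler

    def __call__(self, hidden_states):
        return {
            "embed": self.embed_pooler(hidden_states),
            "token_classify": self.token_classify_pooler(hidden_states),
        }
```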

@staugust
Contributor Author

staugust commented Mar 19, 2026

@staugust

We are very sorry, but we plan to deprecate support for multi-task (for the v2 runner), meaning users will need to select only one of them at the beginning.

Please modify this bge-m3 plugin to use only the dense & sparse tasks, meaning that for all requests, both the dense and sparse tasks will be computed. When returning results to users, filter out the content they need.

This may increase overhead, but we have no other choice.

@noooop Can a pooling model still support multiple pooling tasks via the /pooling API, with a different task configured in each pooling request body? Or is the pooling task specified on the command line?

@noooop
Collaborator

noooop commented Mar 19, 2026

The pooling task is specified on the command line.

@noooop
Collaborator

noooop commented Mar 19, 2026

We had some discussions; here is a summary:

Make all supported pooling tasks available simultaneously (multi-task support) vs selecting only one at the start.

  1. Typically, model training focuses on a single task — 99.9% of usage scenarios only require picking one of them at the beginning.

  2. Having all supported pooling tasks ready at once is certainly fancier and gives users everything they want in one go.

  3. In terms of UX, if users have to pick just one at the start, they'd need to manually select token_embed for multivector retrieval. e.g. jina_embeddings_v4 https://github.com/vllm-project/vllm/blob/main/examples/pooling/token_embed/jina_embeddings_v4_offline.py

  4. BGE-M3 supports three tasks: embed for dense retrieval; token_embed for multi-vector retrieval; and token_classify for sparse retrieval.
    This PR ([Feature]: Support for multiple embedding types in a single inference call #35829) updates the BGE-M3 plugin to support dense, sparse, and dense+sparse modes, making it more convenient for BGE-M3 users.

@noooop
Collaborator

noooop commented Mar 19, 2026

If you have any important information, feel free to add.

@staugust
Contributor Author

@noooop I'd like to understand the reasoning behind the v2 runner only supporting a single specific task at model-serving instance startup.
If this is mainly for user experience, my assumption is that users calling the pooling API already know which pooling tasks their model supports, so it may not be necessary to configure the task via the command line.

@noooop
Collaborator

noooop commented Mar 19, 2026

This will simplify the design of the v2 runner; otherwise, supporting all pooling tasks running simultaneously (multi-task support) would make the runner complex.

@staugust
Contributor Author

@noooop ok, I'll update this plugin when v2 runner is ready.

@noooop
Collaborator

noooop commented Mar 19, 2026

It’s because the block v2 feature is being rolled out, so we need to deprecate support for multi-task first. That means you can proceed with implementing this feature now.

@staugust
Contributor Author

ok, I'll open a PR to update the plugin.

fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
… call (vllm-project#35829)

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026
… call (vllm-project#35829)

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Mar 27, 2026
… call (vllm-project#35829)

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026
… call (vllm-project#35829)

Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>

Labels

frontend, ready, v1

3 participants