Merged

Changes from all commits

48 commits
826ac0b
wip: add io_process_plugin for sparse embedding
staugust Feb 9, 2026
1687d0c
fix bugs for offline mode with array
staugust Feb 10, 2026
2e4dc28
update code with Gemini suggestions
staugust Feb 10, 2026
4589b09
update bge_m3_sparse_plugin with simple code to construct sparse embe…
staugust Feb 11, 2026
b2e15fe
add params to determine whether output token_id to token text mapping
staugust Feb 11, 2026
269a8b7
update bge_m3_sparse_plugin
staugust Feb 13, 2026
75027a7
add input param for sparse embedding
staugust Feb 14, 2026
3901619
update interface for io_processor_plugin
staugust Feb 14, 2026
4850b7e
add return
staugust Feb 14, 2026
79946b8
fix bugs in post_process
staugust Feb 14, 2026
9c20e6f
fix bugs in post_process
staugust Feb 14, 2026
d3c2d8d
make plugin compatible with main branch
staugust Feb 14, 2026
07e2633
make plugin compatible with offline mode
staugust Feb 14, 2026
32bda35
update pooling params for online mode
staugust Feb 14, 2026
ac285c3
make code cleaner
staugust Feb 14, 2026
93d754c
use convert_ids_list_to_tokens instead of convert_ids_to_tokens
staugust Feb 14, 2026
e3dce21
use convert_ids_list_to_tokens instead of convert_ids_to_tokens
staugust Feb 14, 2026
5c856cb
pass renderer during io_processor init
staugust Feb 24, 2026
5eb5a33
make get_io_processor compatible with previous io_processor_plugin
staugust Feb 24, 2026
670d31c
add warning msg for io_processor_plugin.__init__ API change
staugust Feb 24, 2026
0782705
remove request parameter in merge_pooling_params
staugust Feb 25, 2026
dc4ec89
fix bugs in call merge_pooling_params
staugust Feb 25, 2026
0739106
update io_processor_plugins.md as abstract class IOProcessor is updated
staugust Feb 25, 2026
a5d518b
remove fallbacks in update io_processor_plugins.md, return correct er…
staugust Feb 25, 2026
49a2cef
Update vllm/plugins/io_processors/__init__.py
staugust Feb 25, 2026
50db326
fix testcase for loading wrong io_processor plugin
staugust Feb 25, 2026
ef9065d
Merge branch 'main' into bge-m3-sparse-plugin
staugust Feb 25, 2026
29378ab
add e2e test case for bge_m3_sparse_plugin
staugust Feb 26, 2026
ea6e9c1
fix bugs in passing hf_overrides
staugust Feb 26, 2026
759314b
fix bugs in construct prompts for offline mode
staugust Feb 26, 2026
15d6cf5
fix bugs in construct prompts for multi inputs in offline mode
staugust Feb 26, 2026
d67346f
update verify logic for bge_m3_sparse_plugin
staugust Feb 26, 2026
cb01d53
fix bugs in get pooler_output
staugust Feb 26, 2026
a16d521
fix bugs in offline testcase
staugust Feb 26, 2026
bee36a2
check embed result
staugust Feb 26, 2026
15f54ba
fix bugs in check offline mode result
staugust Feb 26, 2026
7efcc16
check token is None for return_tokens=False
staugust Feb 26, 2026
a29968f
make _check_sparse_embedding compatible for both online serving and o…
staugust Feb 26, 2026
4a18af7
fix test online
staugust Feb 26, 2026
a07bed7
fix verify logic for online mode
staugust Feb 26, 2026
43b1f54
update online test case
staugust Feb 26, 2026
dd77e52
rename test file for bge_m3_sparse_plugin
staugust Feb 26, 2026
add2882
Merge branch 'main' into bge-m3-sparse-plugin
staugust Feb 27, 2026
58ebcc7
add bge_m3_sparse io processor plugin test into .buildkite
staugust Feb 27, 2026
166cef8
fix pre-commit check
staugust Feb 27, 2026
1f4f969
check sparse-embedding weight using loose equality
staugust Feb 27, 2026
dfb3663
Merge branch 'main' into bge-m3-sparse-plugin
staugust Feb 28, 2026
3a96af3
Merge branch 'main' into bge-m3-sparse-plugin
staugust Feb 28, 2026
10 changes: 9 additions & 1 deletion .buildkite/test-amd.yaml
@@ -1371,6 +1371,10 @@ steps:
- pip install -e ./plugins/prithvi_io_processor_plugin
- pytest -v -s plugins_tests/test_io_processor_plugins.py
- pip uninstall prithvi_io_processor_plugin -y
# test bge_m3_sparse io_processor plugin
- pip install -e ./plugins/bge_m3_sparse_plugin
- pytest -v -s plugins_tests/test_bge_m3_sparse_io_processor_plugins.py
- pip uninstall bge_m3_sparse_plugin -y
# end io_processor plugins test
# begin stat_logger plugins test
- pip install -e ./plugins/vllm_add_dummy_stat_logger
@@ -2947,6 +2951,10 @@ steps:
- pip install -e ./plugins/prithvi_io_processor_plugin
- pytest -v -s plugins_tests/test_io_processor_plugins.py
- pip uninstall prithvi_io_processor_plugin -y
# test bge_m3_sparse io_processor plugin
- pip install -e ./plugins/bge_m3_sparse_plugin
- pytest -v -s plugins_tests/test_bge_m3_sparse_io_processor_plugins.py
- pip uninstall bge_m3_sparse_plugin -y
# end io_processor plugins test
# begin stat_logger plugins test
- pip install -e ./plugins/vllm_add_dummy_stat_logger
@@ -3228,4 +3236,4 @@ steps:
num_gpus: 4
working_dir: "/vllm-workspace"
commands:
- bash .buildkite/scripts/scheduled_integration_test/qwen3_next_mtp_async_eplb.sh 0.8 1319 8040
- bash .buildkite/scripts/scheduled_integration_test/qwen3_next_mtp_async_eplb.sh 0.8 1319 8040
4 changes: 4 additions & 0 deletions .buildkite/test_areas/plugins.yaml
@@ -19,6 +19,10 @@ steps:
- pip install -e ./plugins/prithvi_io_processor_plugin
- pytest -v -s plugins_tests/test_io_processor_plugins.py
- pip uninstall prithvi_io_processor_plugin -y
# test bge_m3_sparse io_processor plugin
- pip install -e ./plugins/bge_m3_sparse_plugin
- pytest -v -s plugins_tests/test_bge_m3_sparse_io_processor_plugins.py
- pip uninstall bge_m3_sparse_plugin -y
# end io_processor plugins test
# begin stat_logger plugins test
- pip install -e ./plugins/vllm_add_dummy_stat_logger
7 changes: 4 additions & 3 deletions docs/design/io_processor_plugins.md
@@ -13,12 +13,13 @@ IOProcessorInput = TypeVar("IOProcessorInput")
IOProcessorOutput = TypeVar("IOProcessorOutput")

class IOProcessor(ABC, Generic[IOProcessorInput, IOProcessorOutput]):
def __init__(self, vllm_config: VllmConfig):
"""Abstract interface for pre/post-processing of engine I/O."""

def __init__(self, vllm_config: VllmConfig, renderer: BaseRenderer):
super().__init__()

self.vllm_config = vllm_config

@abstractmethod
def parse_data(self, data: object) -> IOProcessorInput:
raise NotImplementedError

@@ -32,7 +33,7 @@ class IOProcessor(ABC, Generic[IOProcessorInput, IOProcessorOutput]):
self,
params: PoolingParams | None = None,
) -> PoolingParams:
return params or PoolingParams()
return params or PoolingParams(task="plugin")

@abstractmethod
def pre_process(
@@ -0,0 +1,6 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project


def register_bge_m3_sparse_embeddings_processor():
return "bge_m3_sparse_processor.sparse_embeddings_processor.BgeM3SparseEmbeddingsProcessor" # noqa: E501
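The registration hook above returns a dotted-path string naming the processor class. A plausible sketch of how such a string can be resolved to the class object (vLLM's actual plugin loader may differ) is:

```python
import importlib


def load_processor_class(dotted_path: str):
    """Resolve a 'package.module.ClassName' string to the class object.

    Illustrative helper only; not part of the vLLM API.
    """
    module_name, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

For example, `load_processor_class("collections.OrderedDict")` returns the `OrderedDict` class itself.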
@@ -0,0 +1,135 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

from collections.abc import Sequence

from vllm.config import VllmConfig
from vllm.entrypoints.openai.engine.protocol import UsageInfo
from vllm.inputs.data import PromptType
from vllm.logger import init_logger
from vllm.outputs import PoolingRequestOutput
from vllm.plugins.io_processors.interface import (
IOProcessor,
)
from vllm.pooling_params import PoolingParams
from vllm.renderers import BaseRenderer
from vllm.tokenizers.detokenizer_utils import convert_ids_list_to_tokens

from .types import (
SparseEmbeddingCompletionRequestMixin,
SparseEmbeddingResponse,
SparseEmbeddingResponseData,
SparseEmbeddingTokenWeight,
)

logger = init_logger(__name__)


class BgeM3SparseEmbeddingsProcessor(
IOProcessor[SparseEmbeddingCompletionRequestMixin, SparseEmbeddingResponse]
):
def __init__(self, vllm_config: VllmConfig, renderer: BaseRenderer):
super().__init__(vllm_config, renderer)
self.offline_requests: list[SparseEmbeddingCompletionRequestMixin] = []
self.online_requests: dict[str, SparseEmbeddingCompletionRequestMixin] = {}
self.renderer: BaseRenderer = renderer

def merge_pooling_params(
self,
params: PoolingParams | None = None,
) -> PoolingParams:
if params is None:
params = PoolingParams()
# refer to PoolingCompletionRequest.to_pooling_params
params.task = "token_classify"
return params

def parse_request(
self, request_data: object
) -> SparseEmbeddingCompletionRequestMixin:
# for vllm.entrypoints.llm.LLM, offline mode, calls `encode` directly.
if isinstance(request_data, dict):
return SparseEmbeddingCompletionRequestMixin(**request_data)
raise TypeError("request_data should be a dictionary")

def pre_process(
self,
prompt: SparseEmbeddingCompletionRequestMixin,
request_id: str | None = None,
**kwargs,
) -> PromptType | Sequence[PromptType]:
if request_id is not None:
assert request_id not in self.online_requests, "request_id duplicated"
self.online_requests[request_id] = prompt
else:
self.offline_requests.append(prompt)
return prompt.input

def _get_sparse_embedding_request(self, request_id: str | None = None):
if request_id:
return self.online_requests.pop(request_id, None)
return self.offline_requests.pop()

def _build_sparse_embedding_token_weights(
self,
sparse_embedding: dict[int, float],
return_tokens: bool = False,
) -> list[SparseEmbeddingTokenWeight]:
token_ids = list(sparse_embedding.keys())
token_weights = list(sparse_embedding.values())
tokens: list[str | None] = [None] * len(token_ids)

if return_tokens and self.renderer is not None:
tokens = convert_ids_list_to_tokens(
self.renderer.get_tokenizer(), token_ids
)
sparse_embedding_output: list[SparseEmbeddingTokenWeight] = []
for token_id, weight, token in zip(token_ids, token_weights, tokens):
sparse_embedding_output.append(
SparseEmbeddingTokenWeight(
token_id=token_id, weight=weight, token=token
)
)
return sparse_embedding_output

def post_process(
self,
model_output: Sequence[PoolingRequestOutput],
request_id: str | None = None,
**kwargs,
) -> SparseEmbeddingResponse:
num_prompt_tokens = 0
response_data = []
request = self._get_sparse_embedding_request(request_id)
return_tokens = bool(request.return_tokens) if request is not None else False
for idx in range(len(model_output)):
mo = model_output[idx]
sparse_embedding: dict[int, float] = {}
num_prompt_tokens += len(mo.prompt_token_ids)
if len(mo.prompt_token_ids) != len(mo.outputs.data):
# add_special_tokens was True, so prompt_token_ids carries a
# leading special token absent from the pooler output; drop it.
mo.prompt_token_ids = mo.prompt_token_ids[1:]
for token_id, weight in zip(mo.prompt_token_ids, mo.outputs.data.tolist()):
sparse_embedding[token_id] = max(
weight, sparse_embedding.get(token_id, 0.0)
)
response_data.append(
SparseEmbeddingResponseData(
index=idx,
sparse_embedding=self._build_sparse_embedding_token_weights(
sparse_embedding,
return_tokens,
),
)
)

usage = UsageInfo(
prompt_tokens=num_prompt_tokens,
total_tokens=num_prompt_tokens,
)
resp = SparseEmbeddingResponse(
data=response_data,
usage=usage,
)

return resp
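The core of `post_process` collapses per-token weights into one weight per token id, keeping the maximum whenever a token id repeats in the prompt. A standalone sketch of that aggregation step (an illustrative helper, not part of the plugin):

```python
def aggregate_sparse_weights(
    token_ids: list[int], weights: list[float]
) -> dict[int, float]:
    """Collapse per-position weights into one weight per token id,
    keeping the maximum when a token id appears more than once."""
    sparse: dict[int, float] = {}
    for tid, w in zip(token_ids, weights):
        sparse[tid] = max(w, sparse.get(tid, 0.0))
    return sparse
```

Taking the maximum (rather than summing) matches how `post_process` above fills `sparse_embedding` for repeated tokens.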
@@ -0,0 +1,32 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

from pydantic import BaseModel, Field

from vllm.entrypoints.openai.engine.protocol import UsageInfo
from vllm.entrypoints.pooling.base.protocol import CompletionRequestMixin


class SparseEmbeddingCompletionRequestMixin(CompletionRequestMixin):
return_tokens: bool | None = Field(
default=None,
description="Whether to include the mapping of token_id to token text "
"in the response. `None` or `False` means it is omitted.",
)


class SparseEmbeddingTokenWeight(BaseModel):
token_id: int
weight: float
token: str | None


class SparseEmbeddingResponseData(BaseModel):
index: int
object: str = "sparse-embedding"
sparse_embedding: list[SparseEmbeddingTokenWeight]


class SparseEmbeddingResponse(BaseModel):
data: list[SparseEmbeddingResponseData]
usage: UsageInfo
15 changes: 15 additions & 0 deletions tests/plugins/bge_m3_sparse_plugin/setup.py
@@ -0,0 +1,15 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

from setuptools import setup

setup(
name="bge-m3-sparse-plugin",
version="0.1",
packages=["bge_m3_sparse_processor"],
entry_points={
"vllm.io_processor_plugins": [
"bge_m3_sparse_plugin = bge_m3_sparse_processor:register_bge_m3_sparse_embeddings_processor", # noqa: E501
]
},
)
Expand Up @@ -22,6 +22,7 @@
from vllm.logger import init_logger
from vllm.outputs import PoolingRequestOutput
from vllm.plugins.io_processors.interface import IOProcessor
from vllm.renderers import BaseRenderer

from .types import DataModuleConfig, ImagePrompt, ImageRequestOutput

@@ -218,8 +219,8 @@ def load_image(
class PrithviMultimodalDataProcessor(IOProcessor[ImagePrompt, ImageRequestOutput]):
indices = [0, 1, 2, 3, 4, 5]

def __init__(self, vllm_config: VllmConfig):
super().__init__(vllm_config)
def __init__(self, vllm_config: VllmConfig, renderer: BaseRenderer):
super().__init__(vllm_config, renderer)

self.datamodule = Sen1Floods11NonGeoDataModule(
data_root=datamodule_config["data_root"],