[Bugfix] Use async preprocessing in pooling/embedding endpoints #38653
chuqiwang wants to merge 1 commit into vllm-project:main
Conversation
The pooling/embedding endpoint's `pre_process_online_async` was a fake async: it called the sync `pre_process_online` directly, blocking the asyncio event loop during CPU-bound multimodal preprocessing (image decode, resize, normalize, patch extraction via HuggingFace processors). With concurrent requests, this serializes all preprocessing through a single thread. Only a few requests reach the GPU at a time while the rest wait, causing low GPU utilization for multimodal workloads.

The async rendering infrastructure (`render_chat_async`, `render_cmpl_async`, `process_for_engine_async`) already exists and is used by the chat completion endpoint. This commit updates the pooling path to use it:

- Base `PoolingIOProcessor`: `pre_process_online_async` now dispatches to new async helpers that call `renderer.render_chat_async()` and `renderer.render_cmpl_async()`.
- `EmbedIOProcessor`: async overrides for Cohere request handling and batch chat rendering.
- Scoring IO processors (BiEncoder, CrossEncoder): async overrides using `_preprocess_completion_offline_async` and `asyncio.gather` for parallel `process_for_engine_async` calls.

Fixes: vllm-project#22444
Related: vllm-project#14360, vllm-project#11320, vllm-project#15869, vllm-project#25301
AI-assisted: yes (code authored with Claude, reviewed by human)

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: chuqiwang <chuqi.wang@doordash.com>
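To make the shape of the fix concrete, here is a minimal, self-contained sketch of the dispatch pattern described above. The request shape and the stand-in renderer are assumptions for illustration; only the method names (`pre_process_online_async`, `render_chat_async`, `render_cmpl_async`) come from this PR.

```python
import asyncio

class FakeRenderer:
    """Stand-in for vLLM's renderer; only the method names come from this PR."""

    async def render_chat_async(self, messages):
        await asyncio.sleep(0)  # the real path offloads CPU-bound HF work
        return {"prompt_token_ids": [1, 2, 3]}

    async def render_cmpl_async(self, prompt):
        await asyncio.sleep(0)
        return {"prompt_token_ids": [4, 5, 6]}

class PoolingIOProcessorSketch:
    def __init__(self) -> None:
        self.renderer = FakeRenderer()

    async def pre_process_online_async(self, request: dict):
        # Dispatch to the renderer's async entry points instead of calling
        # the synchronous pre_process_online, which blocked the event loop.
        if "messages" in request:
            return await self.renderer.render_chat_async(request["messages"])
        return await self.renderer.render_cmpl_async(request["prompt"])

print(asyncio.run(
    PoolingIOProcessorSketch().pre_process_online_async({"prompt": "hi"})
))
```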
Code Review
This pull request introduces asynchronous preprocessing for pooling, embedding, and scoring IO processors to prevent blocking the asyncio event loop during multimodal processing. The review identifies critical missing overrides for offline asynchronous processing in the `BiEncoderIOProcessor` and `CrossEncoderIOProcessor` classes, which would otherwise lead to runtime failures when handling `ScoringData`. Additionally, feedback highlights that `prompt_extras` are currently ignored in the `CrossEncoderIOProcessor` async path, which must be addressed to correctly support multimodal metadata and engine-level features.
Comment on:

```python
ctx.intermediates = len(scoring_data.data_1)
```

The `BiEncoderIOProcessor` must override `pre_process_offline_async`. The base class implementation in `PoolingIOProcessor` contains an assertion that explicitly disallows `ScoringData`, which will cause runtime failures when using scoring models in offline mode. Additionally, `ctx.offset` must be set to ensure that `post_process_offline` can correctly split the query and document embeddings.

Suggested change:

```python
ctx.intermediates = len(scoring_data.data_1)

async def pre_process_offline_async(
    self, ctx: OfflineInputsContext
) -> Sequence[EngineInput]:
    assert isinstance(ctx.prompts, ScoringData)
    tok_params = self.renderer.default_cmpl_tok_params.with_kwargs(
        **(ctx.tokenization_kwargs or {})
    )
    ctx.offset = len(ctx.prompts.data_1)
    return await self._pre_process_async(ctx.prompts, tok_params)
```
Comment on:

```python
ctx.pooling_params = pooling_params_list
```

Similar to `BiEncoderIOProcessor`, `CrossEncoderIOProcessor` needs to override `pre_process_offline_async`. The base class assertion against `ScoringData` will trigger a failure, and this override is necessary to correctly manage `ctx.pooling_params` and pass the `chat_template` from the context.

Suggested change:

```python
ctx.pooling_params = pooling_params_list

async def pre_process_offline_async(
    self, ctx: OfflineInputsContext
) -> Sequence[EngineInput]:
    assert isinstance(ctx.prompts, ScoringData)
    assert not isinstance(ctx.pooling_params, list)
    tok_params = self.renderer.default_cmpl_tok_params.with_kwargs(
        **(ctx.tokenization_kwargs or {})
    )
    engine_inputs, pooling_params_list = await self._pre_process_async(
        ctx.prompts,
        tok_params,
        ctx.pooling_params,
        chat_template=ctx.chat_template,
    )
    ctx.pooling_params = pooling_params_list
    return engine_inputs
```
Comment on:

```python
    chat_template: str | None = None,
    prompt_extras: dict[str, Any] | None = None,
) -> tuple[Sequence[EngineInput], list[PoolingParams]]:
    # todo: support prompt_extras
```

The `prompt_extras` (which include critical parameters like `mm_processor_kwargs` and `cache_salt`) are currently ignored in the async preprocessing path for `CrossEncoderIOProcessor`. This will lead to incorrect behavior for multimodal requests that provide custom processor arguments.
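For illustration, a hypothetical helper showing the kind of plumbing the reviewer is asking for. `apply_prompt_extras` and the dict-shaped engine prompt are assumptions, not vLLM's actual API; only the field names `mm_processor_kwargs` and `cache_salt` come from the comment.

```python
from typing import Any

def apply_prompt_extras(
    engine_prompt: dict[str, Any],
    prompt_extras: dict[str, Any] | None,
) -> dict[str, Any]:
    # Merge request-level extras (e.g. mm_processor_kwargs, cache_salt)
    # into the engine prompt instead of silently dropping them.
    if not prompt_extras:
        return engine_prompt
    return {**engine_prompt, **prompt_extras}

# Usage: a request carrying multimodal processor kwargs and a cache salt.
prompt = {"prompt_token_ids": [1, 2, 3]}
extras = {"mm_processor_kwargs": {"num_crops": 4}, "cache_salt": "req-42"}
print(apply_prompt_extras(prompt, extras))
```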
try #39763
Summary
- Fixes `pre_process_online_async` in the pooling/embedding endpoint to use the existing async renderer path instead of blocking the asyncio event loop with synchronous preprocessing
- `render_chat_async`/`render_cmpl_async` are already used by the chat completion endpoint; the pooling path simply never adopted them
- Affects the embedding (`/v1/embeddings`), scoring (`/v1/score`), and reranking endpoints

Root Cause
`PoolingIOProcessor.pre_process_online_async()` was a fake async: it simply returned the result of the synchronous implementation, as the sketch below demonstrates. This serializes all CPU-bound multimodal preprocessing (image decode, resize, normalize, patch extraction via HuggingFace processors) through the event loop. With concurrent requests, only a few reach the GPU scheduler at a time while the rest wait in preprocessing.
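A minimal, runnable illustration of why this serializes work. The class and timings are stand-ins built to mirror the described behavior, not vLLM code.

```python
import asyncio
import time

class FakeProcessor:
    def pre_process_online(self, ctx):
        time.sleep(0.5)  # stands in for CPU-bound HF image preprocessing
        return ctx

    async def pre_process_online_async(self, ctx):
        # "Fake async": no await, no thread offload -- the sleep above runs
        # on the event loop thread and stalls every other coroutine.
        return self.pre_process_online(ctx)

async def main():
    t0 = time.perf_counter()
    await asyncio.gather(
        *(FakeProcessor().pre_process_online_async(i) for i in range(4))
    )
    # Prints ~2.00s: the four "concurrent" calls ran strictly one at a time.
    print(f"{time.perf_counter() - t0:.2f}s")

asyncio.run(main())
```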
Changes
- `vllm/entrypoints/pooling/base/io_processor.py`: `pre_process_online_async` now dispatches to new async helpers that call `renderer.render_chat_async()` and `renderer.render_cmpl_async()`. Also fixed `pre_process_offline_async`, which had the same issue.
- `vllm/entrypoints/pooling/embed/io_processor.py`: Async overrides for Cohere request handling and batch chat rendering.
- `vllm/entrypoints/pooling/scoring/io_processor.py`: Async overrides for BiEncoder and CrossEncoder using `_preprocess_completion_offline_async` and `asyncio.gather` for parallel `process_for_engine_async` calls (sketched below). LateInteraction inherits the fix from BiEncoder.
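A sketch of the `asyncio.gather` fan-out used in the scoring path. The stand-in renderer and prompt shapes are assumptions; only the name `process_for_engine_async` comes from this PR.

```python
import asyncio

class FakeRenderer:
    async def process_for_engine_async(self, prompt: str):
        await asyncio.sleep(0.1)  # real code offloads CPU-bound processing
        return {"prompt": prompt, "prompt_token_ids": [0]}

async def preprocess_pairs(renderer: FakeRenderer, prompts: list[str]):
    # Fan out per-pair preprocessing concurrently instead of awaiting
    # each prompt in sequence.
    return await asyncio.gather(
        *(renderer.process_for_engine_async(p) for p in prompts)
    )

print(asyncio.run(preprocess_pairs(FakeRenderer(), ["q1 d1", "q1 d2"])))
```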
Why this is not a duplicate

Issue #22444 was closed by the stale bot without a fix. No existing open or merged PRs address the sync preprocessing bottleneck in the pooling IO processor path. Related open issues: #14360, #11320, #15869, #25301.
Test plan
- `pytest tests/entrypoints/pooling/ -v`
- `pytest tests/entrypoints/pooling/embed/test_online_vision.py -v`
- `pytest tests/entrypoints/pooling/scoring/test_cross_encoder_online_vision.py -v`
- Pre-commit hooks pass (ruff-check, ruff-format, typos, mypy).
Fixes: #22444
AI-assisted: yes (code authored with Claude, reviewed by human)
🤖 Generated with Claude Code