[Bugfix] Use async preprocessing in pooling/embedding endpoints #38653

Closed
chuqiwang wants to merge 1 commit into vllm-project:main from chuqiwang:fix/async-pooling-preprocessing

Conversation

@chuqiwang

Summary

  • Fix pre_process_online_async in the pooling/embedding endpoint to use the existing async renderer path instead of blocking the asyncio event loop with synchronous preprocessing
  • The chat completion endpoint already uses render_chat_async / render_cmpl_async — the pooling path simply never adopted them
  • Affects embedding (/v1/embeddings), scoring (/v1/score), and reranking endpoints

Root Cause

PoolingIOProcessor.pre_process_online_async() was async in name only:

```python
async def pre_process_online_async(self, ctx):
    self.pre_process_online(ctx)  # synchronous call: blocks the event loop
```

This serializes all CPU-bound multimodal preprocessing (image decode, resize, normalize, patch extraction via HuggingFace processors) through the event loop. With concurrent requests, only a few reach the GPU scheduler at a time while the rest wait in preprocessing.
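
As a self-contained illustration of the failure mode (generic asyncio code, not vLLM's), ten concurrent requests hitting a handler that does its CPU-bound work inline finish one after another, while a handler that actually yields the event loop lets them overlap:

```python
import asyncio
import time

def blocking_preprocess() -> None:
    # Stand-in for CPU-bound image decode/resize/patch extraction.
    time.sleep(0.2)

async def fake_async() -> None:
    blocking_preprocess()      # runs on the event loop itself

async def truly_async() -> None:
    await asyncio.sleep(0.2)   # yields the loop while the "work" happens

async def timed(handler) -> float:
    start = time.perf_counter()
    # Ten concurrent requests hitting the same handler.
    await asyncio.gather(*(handler() for _ in range(10)))
    return time.perf_counter() - start

async def main() -> None:
    print(f"fake_async:  {await timed(fake_async):.2f}s")   # ~2.0s, serialized
    print(f"truly_async: {await timed(truly_async):.2f}s")  # ~0.2s, concurrent

asyncio.run(main())
```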

Changes

  • vllm/entrypoints/pooling/base/io_processor.py: pre_process_online_async now dispatches to new async helpers that call renderer.render_chat_async() and renderer.render_cmpl_async(). Also fixed pre_process_offline_async which had the same issue.
  • vllm/entrypoints/pooling/embed/io_processor.py: Async overrides for Cohere request handling and batch chat rendering.
  • vllm/entrypoints/pooling/scoring/io_processor.py: Async overrides for BiEncoder and CrossEncoder using _preprocess_completion_offline_async and asyncio.gather for parallel process_for_engine_async calls; LateInteraction inherits the fix from BiEncoder. (A rough sketch of this pattern follows the list.)
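
As a rough sketch of the override pattern described above (hypothetical class and context shapes; only render_cmpl_async, process_for_engine_async, and asyncio.gather are taken from this PR's description), the scoring path renders query and document prompts through the async renderer and fans the per-prompt engine conversion out with asyncio.gather instead of running it serially on the event loop:

```python
import asyncio
from typing import Any, Sequence

class BiEncoderIOProcessorSketch:
    """Hypothetical, simplified stand-in for the scoring IO processor."""

    def __init__(self, renderer: Any) -> None:
        self.renderer = renderer

    async def pre_process_online_async(self, ctx: Any) -> Sequence[Any]:
        # Render both sides of the scoring request without blocking the loop.
        queries = await self.renderer.render_cmpl_async(ctx.request.data_1)
        documents = await self.renderer.render_cmpl_async(ctx.request.data_2)

        # Convert every rendered prompt into an engine input concurrently.
        return await asyncio.gather(
            *(self.process_for_engine_async(p) for p in (*queries, *documents))
        )

    async def process_for_engine_async(self, prompt: Any) -> Any:
        # Placeholder: in vLLM this would perform tokenization / multimodal processing.
        return prompt
```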

Why this is not a duplicate

Issue #22444 was closed by the stale bot without a fix. No existing open or merged PRs address the sync preprocessing bottleneck in the pooling IO processor path. Related open issues: #14360, #11320, #15869, #25301.

Test plan

  • Run existing pooling tests: pytest tests/entrypoints/pooling/ -v
  • Run vision-specific embedding tests: pytest tests/entrypoints/pooling/embed/test_online_vision.py -v
  • Run scoring vision tests: pytest tests/entrypoints/pooling/scoring/test_cross_encoder_online_vision.py -v
  • Throughput benchmark with reproduction script from [Bug]: Low GPU Utilization with Image Payloads for Qwen2-VL-2B-Instruct Embeddings #22444 (concurrent multimodal embedding requests)

Pre-commit hooks pass (ruff-check, ruff-format, typos, mypy).

Fixes: #22444

AI-assisted: yes (code authored with Claude, reviewed by human)

🤖 Generated with Claude Code

The pooling/embedding endpoint's `pre_process_online_async` was async in
name only: it called the sync `pre_process_online` directly, blocking the
asyncio event loop during CPU-bound multimodal preprocessing (image
decode, resize, normalize, patch extraction via HuggingFace processors).

With concurrent requests, this serializes all preprocessing through a
single thread. Only a few requests reach the GPU at a time while the
rest wait, causing low GPU utilization for multimodal workloads.

The async rendering infrastructure (`render_chat_async`,
`render_cmpl_async`, `process_for_engine_async`) already exists and is
used by the chat completion endpoint. This commit updates the pooling
path to use it:

- Base PoolingIOProcessor: `pre_process_online_async` now dispatches to
  new async helpers that call `renderer.render_chat_async()` and
  `renderer.render_cmpl_async()`.
- EmbedIOProcessor: async overrides for Cohere request handling and
  batch chat rendering.
- Scoring IO processors (BiEncoder, CrossEncoder): async overrides
  using `_preprocess_completion_offline_async` and `asyncio.gather`
  for parallel `process_for_engine_async` calls.

Fixes: vllm-project#22444
Related: vllm-project#14360, vllm-project#11320, vllm-project#15869, vllm-project#25301

AI-assisted: yes (code authored with Claude, reviewed by human)

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: chuqiwang <chuqi.wang@doordash.com>
@mergify bot added the frontend and bug (Something isn't working) labels on Mar 31, 2026

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces asynchronous preprocessing for pooling, embedding, and scoring IO processors to prevent blocking the asyncio event loop during multimodal processing. The review identifies critical missing overrides for offline asynchronous processing in the BiEncoderIOProcessor and CrossEncoderIOProcessor classes, which would otherwise lead to runtime failures when handling ScoringData. Additionally, feedback highlights that prompt_extras are currently ignored in the CrossEncoderIOProcessor async path, which must be addressed to correctly support multimodal metadata and engine-level features.

Comment on lines +125 to +126

```python
        ctx.intermediates = len(scoring_data.data_1)
```

Severity: high

The BiEncoderIOProcessor must override pre_process_offline_async. The base class implementation in PoolingIOProcessor contains an assertion that explicitly disallows ScoringData, which will cause runtime failures when using scoring models in offline mode. Additionally, ctx.offset must be set to ensure that post_process_offline can correctly split the query and document embeddings.

Suggested change:

```python
        ctx.intermediates = len(scoring_data.data_1)

    async def pre_process_offline_async(
        self, ctx: OfflineInputsContext
    ) -> Sequence[EngineInput]:
        assert isinstance(ctx.prompts, ScoringData)
        tok_params = self.renderer.default_cmpl_tok_params.with_kwargs(
            **(ctx.tokenization_kwargs or {})
        )
        ctx.offset = len(ctx.prompts.data_1)
        return await self._pre_process_async(ctx.prompts, tok_params)
```

Comment on lines +344 to +345

```python
        ctx.pooling_params = pooling_params_list
```

Severity: high

Similar to BiEncoderIOProcessor, CrossEncoderIOProcessor needs to override pre_process_offline_async. The base class assertion against ScoringData will trigger a failure, and this override is necessary to correctly manage ctx.pooling_params and pass the chat_template from the context.

Suggested change:

```python
        ctx.pooling_params = pooling_params_list

    async def pre_process_offline_async(
        self, ctx: OfflineInputsContext
    ) -> Sequence[EngineInput]:
        assert isinstance(ctx.prompts, ScoringData)
        assert not isinstance(ctx.pooling_params, list)
        tok_params = self.renderer.default_cmpl_tok_params.with_kwargs(
            **(ctx.tokenization_kwargs or {})
        )
        engine_inputs, pooling_params_list = await self._pre_process_async(
            ctx.prompts,
            tok_params,
            ctx.pooling_params,
            chat_template=ctx.chat_template,
        )
        ctx.pooling_params = pooling_params_list
        return engine_inputs
```

```python
        chat_template: str | None = None,
        prompt_extras: dict[str, Any] | None = None,
    ) -> tuple[Sequence[EngineInput], list[PoolingParams]]:
        # todo: support prompt_extras
```

Severity: high

The prompt_extras (which include critical parameters like mm_processor_kwargs and cache_salt) are currently ignored in the async preprocessing path for CrossEncoderIOProcessor. This will lead to incorrect behavior for multimodal requests that provide custom processor arguments.

```python
    ) -> tuple[Sequence[EngineInput], list[PoolingParams]]:
        arrival_time = time.time()
```

```python
            pooling_params_list.append(pooling_params)

            tok_params.apply_post_tokenization(self.tokenizer, engine_prompt)
            engine_prompts.append(engine_prompt)
```

Severity: high

Apply prompt_extras to the engine_prompt before appending it to the list. This ensures that multimodal processing and other engine-level features receive the necessary metadata.

```python
            if prompt_extras:
                engine_prompt.update(prompt_extras)

            engine_prompts.append(engine_prompt)
```

@chuqiwang closed this on Mar 31, 2026
@noooop
Collaborator

noooop commented Apr 2, 2026

  1. The Pooling model entrypoint is under rapid refactoring and is currently a work in progress.

  2. The sync API is used deliberately to avoid the overhead of the async path (~2ms). See #27407 (comment).

  3. We are preparing to apply #34789 ([Bugfix] Offload blocking tokenizer ops to shared thread pool to unblock event loop) to the Pooling entrypoint; a rough sketch of that offloading pattern follows below.

  4. Please increase --api-server-count to improve preprocessing performance. We will change the default from the current 1 to 4 in a later PR.
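
For context, a minimal sketch of the offloading pattern described in item 3 (this is not the #34789 implementation; the tokenizer function and pool size here are placeholders):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Shared pool for blocking, CPU-bound preprocessing (tokenization, image processing).
_PREPROCESS_POOL = ThreadPoolExecutor(max_workers=8)

def _tokenize_blocking(text: str) -> list[int]:
    # Placeholder for a real (blocking) HuggingFace tokenizer call.
    return [ord(c) for c in text]

async def tokenize_async(text: str) -> list[int]:
    # Run the blocking call in the shared pool so the event loop stays free.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_PREPROCESS_POOL, _tokenize_blocking, text)

async def main() -> None:
    results = await asyncio.gather(*(tokenize_async(f"request {i}") for i in range(16)))
    print(len(results), "requests preprocessed off the event loop")

if __name__ == "__main__":
    asyncio.run(main())
```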

@noooop
Collaborator

noooop commented Apr 14, 2026

try #39763

Labels

bug (Something isn't working), frontend

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Low GPU Utilization with Image Payloads for Qwen2-VL-2B-Instruct Embeddings

2 participants