[Bugfix] Use async preprocessing in pooling/embedding endpoints #38653

Closed
chuqiwang wants to merge 1 commit into vllm-project:main from chuqiwang:fix/async-pooling-preprocessing

Conversation

@chuqiwang

Summary

  • Fix pre_process_online_async in the pooling/embedding endpoint to use the existing async renderer path instead of blocking the asyncio event loop with synchronous preprocessing
  • The chat completion endpoint already uses render_chat_async / render_cmpl_async — the pooling path simply never adopted them
  • Affects embedding (/v1/embeddings), scoring (/v1/score), and reranking endpoints

Root Cause

PoolingIOProcessor.pre_process_online_async() was async in name only:

```python
async def pre_process_online_async(self, ctx):
    self.pre_process_online(ctx)  # synchronous call: blocks the event loop
```

This serializes all CPU-bound multimodal preprocessing (image decode, resize, normalize, patch extraction via HuggingFace processors) through the event loop. With concurrent requests, only a few reach the GPU scheduler at a time while the rest wait in preprocessing.
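
As a self-contained illustration of the failure mode (generic asyncio code, not vLLM's), ten concurrent requests hitting a handler that does its CPU-bound work inline finish one after another, while a handler that actually yields the event loop lets them overlap:

```python
import asyncio
import time

def blocking_preprocess() -> None:
    # Stand-in for CPU-bound image decode/resize/patch extraction.
    time.sleep(0.2)

async def fake_async() -> None:
    blocking_preprocess()      # runs on the event loop itself

async def truly_async() -> None:
    await asyncio.sleep(0.2)   # yields the loop while the "work" happens

async def timed(handler) -> float:
    start = time.perf_counter()
    # Ten concurrent requests hitting the same handler.
    await asyncio.gather(*(handler() for _ in range(10)))
    return time.perf_counter() - start

async def main() -> None:
    print(f"fake_async:  {await timed(fake_async):.2f}s")   # ~2.0s, serialized
    print(f"truly_async: {await timed(truly_async):.2f}s")  # ~0.2s, concurrent

asyncio.run(main())
```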

Changes

  • vllm/entrypoints/pooling/base/io_processor.py: pre_process_online_async now dispatches to new async helpers that call renderer.render_chat_async() and renderer.render_cmpl_async(). Also fixed pre_process_offline_async which had the same issue.
  • vllm/entrypoints/pooling/embed/io_processor.py: Async overrides for Cohere request handling and batch chat rendering.
  • vllm/entrypoints/pooling/scoring/io_processor.py: Async overrides for BiEncoder and CrossEncoder using _preprocess_completion_offline_async and asyncio.gather for parallel process_for_engine_async calls; LateInteraction inherits the fix from BiEncoder. (A rough sketch of this pattern follows the list.)
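
As a rough sketch of the override pattern described above (hypothetical class and context shapes; only render_cmpl_async, process_for_engine_async, and asyncio.gather are taken from this PR's description), the scoring path renders query and document prompts through the async renderer and fans the per-prompt engine conversion out with asyncio.gather instead of running it serially on the event loop:

```python
import asyncio
from typing import Any, Sequence

class BiEncoderIOProcessorSketch:
    """Hypothetical, simplified stand-in for the scoring IO processor."""

    def __init__(self, renderer: Any) -> None:
        self.renderer = renderer

    async def pre_process_online_async(self, ctx: Any) -> Sequence[Any]:
        # Render both sides of the scoring request without blocking the loop.
        queries = await self.renderer.render_cmpl_async(ctx.request.data_1)
        documents = await self.renderer.render_cmpl_async(ctx.request.data_2)

        # Convert every rendered prompt into an engine input concurrently.
        return await asyncio.gather(
            *(self.process_for_engine_async(p) for p in (*queries, *documents))
        )

    async def process_for_engine_async(self, prompt: Any) -> Any:
        # Placeholder: in vLLM this would perform tokenization / multimodal processing.
        return prompt
```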

Why this is not a duplicate

Issue #22444 was closed by the stale bot without a fix. No existing open or merged PRs address the sync preprocessing bottleneck in the pooling IO processor path. Related open issues: #14360, #11320, #15869, #25301.

Test plan

  • Run existing pooling tests: pytest tests/entrypoints/pooling/ -v
  • Run vision-specific embedding tests: pytest tests/entrypoints/pooling/embed/test_online_vision.py -v
  • Run scoring vision tests: pytest tests/entrypoints/pooling/scoring/test_cross_encoder_online_vision.py -v
  • Throughput benchmark with reproduction script from [Bug]: Low GPU Utilization with Image Payloads for Qwen2-VL-2B-Instruct Embeddings #22444 (concurrent multimodal embedding requests)

Pre-commit hooks pass (ruff-check, ruff-format, typos, mypy).

Fixes: #22444

AI-assisted: yes (code authored with Claude, reviewed by human)

🤖 Generated with Claude Code

The pooling/embedding endpoint's `pre_process_online_async` was async in
name only: it called the sync `pre_process_online` directly, blocking the
asyncio event loop during CPU-bound multimodal preprocessing (image
decode, resize, normalize, patch extraction via HuggingFace processors).

With concurrent requests, this serializes all preprocessing through a
single thread. Only a few requests reach the GPU at a time while the
rest wait, causing low GPU utilization for multimodal workloads.

The async rendering infrastructure (`render_chat_async`,
`render_cmpl_async`, `process_for_engine_async`) already exists and is
used by the chat completion endpoint. This commit updates the pooling
path to use it:

- Base PoolingIOProcessor: `pre_process_online_async` now dispatches to
  new async helpers that call `renderer.render_chat_async()` and
  `renderer.render_cmpl_async()`.
- EmbedIOProcessor: async overrides for Cohere request handling and
  batch chat rendering.
- Scoring IO processors (BiEncoder, CrossEncoder): async overrides
  using `_preprocess_completion_offline_async` and `asyncio.gather`
  for parallel `process_for_engine_async` calls.

Fixes: vllm-project#22444
Related: vllm-project#14360, vllm-project#11320, vllm-project#15869, vllm-project#25301

AI-assisted: yes (code authored with Claude, reviewed by human)

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: chuqiwang <chuqi.wang@doordash.com>
@mergify bot added the frontend and bug (Something isn't working) labels on Mar 31, 2026

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces asynchronous preprocessing for pooling, embedding, and scoring IO processors to prevent blocking the asyncio event loop during multimodal processing. The review identifies critical missing overrides for offline asynchronous processing in the BiEncoderIOProcessor and CrossEncoderIOProcessor classes, which would otherwise lead to runtime failures when handling ScoringData. Additionally, feedback highlights that prompt_extras are currently ignored in the CrossEncoderIOProcessor async path, which must be addressed to correctly support multimodal metadata and engine-level features.

Comment on lines +125 to +126

```python
        ctx.intermediates = len(scoring_data.data_1)
```

Severity: high

The BiEncoderIOProcessor must override pre_process_offline_async. The base class implementation in PoolingIOProcessor contains an assertion that explicitly disallows ScoringData, which will cause runtime failures when using scoring models in offline mode. Additionally, ctx.offset must be set to ensure that post_process_offline can correctly split the query and document embeddings.

Suggested change:

```python
        ctx.intermediates = len(scoring_data.data_1)

    async def pre_process_offline_async(
        self, ctx: OfflineInputsContext
    ) -> Sequence[EngineInput]:
        assert isinstance(ctx.prompts, ScoringData)
        tok_params = self.renderer.default_cmpl_tok_params.with_kwargs(
            **(ctx.tokenization_kwargs or {})
        )
        ctx.offset = len(ctx.prompts.data_1)
        return await self._pre_process_async(ctx.prompts, tok_params)
```

Comment on lines +344 to +345

```python
        ctx.pooling_params = pooling_params_list
```

Severity: high

Similar to BiEncoderIOProcessor, CrossEncoderIOProcessor needs to override pre_process_offline_async. The base class assertion against ScoringData will trigger a failure, and this override is necessary to correctly manage ctx.pooling_params and pass the chat_template from the context.

Suggested change:

```python
        ctx.pooling_params = pooling_params_list

    async def pre_process_offline_async(
        self, ctx: OfflineInputsContext
    ) -> Sequence[EngineInput]:
        assert isinstance(ctx.prompts, ScoringData)
        assert not isinstance(ctx.pooling_params, list)
        tok_params = self.renderer.default_cmpl_tok_params.with_kwargs(
            **(ctx.tokenization_kwargs or {})
        )
        engine_inputs, pooling_params_list = await self._pre_process_async(
            ctx.prompts,
            tok_params,
            ctx.pooling_params,
            chat_template=ctx.chat_template,
        )
        ctx.pooling_params = pooling_params_list
        return engine_inputs
```

```python
        chat_template: str | None = None,
        prompt_extras: dict[str, Any] | None = None,
    ) -> tuple[Sequence[EngineInput], list[PoolingParams]]:
        # todo: support prompt_extras
```

Severity: high

The prompt_extras (which include critical parameters like mm_processor_kwargs and cache_salt) are currently ignored in the async preprocessing path for CrossEncoderIOProcessor. This will lead to incorrect behavior for multimodal requests that provide custom processor arguments.

```python
    ) -> tuple[Sequence[EngineInput], list[PoolingParams]]:
        arrival_time = time.time()
```

```python
            pooling_params_list.append(pooling_params)

            tok_params.apply_post_tokenization(self.tokenizer, engine_prompt)
            engine_prompts.append(engine_prompt)
```

Severity: high

Apply prompt_extras to the engine_prompt before appending it to the list. This ensures that multimodal processing and other engine-level features receive the necessary metadata.

```python
            if prompt_extras:
                engine_prompt.update(prompt_extras)

            engine_prompts.append(engine_prompt)
```

@chuqiwang closed this on Mar 31, 2026
@noooop
Collaborator

noooop commented Apr 2, 2026

  1. The Pooling model entrypoint is under rapid refactoring and is currently a work in progress.

  2. The sync API is used deliberately to avoid the overhead of the async path (~2ms). See #27407 (comment).

  3. We are preparing to apply #34789 ([Bugfix] Offload blocking tokenizer ops to shared thread pool to unblock event loop) to the Pooling entrypoint; a rough sketch of that offloading pattern follows below.

  4. Please increase --api-server-count to improve preprocessing performance. We will change the default from the current 1 to 4 in a later PR.
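
For context, a minimal sketch of the offloading pattern described in item 3 (this is not the #34789 implementation; the tokenizer function and pool size here are placeholders):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Shared pool for blocking, CPU-bound preprocessing (tokenization, image processing).
_PREPROCESS_POOL = ThreadPoolExecutor(max_workers=8)

def _tokenize_blocking(text: str) -> list[int]:
    # Placeholder for a real (blocking) HuggingFace tokenizer call.
    return [ord(c) for c in text]

async def tokenize_async(text: str) -> list[int]:
    # Run the blocking call in the shared pool so the event loop stays free.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_PREPROCESS_POOL, _tokenize_blocking, text)

async def main() -> None:
    results = await asyncio.gather(*(tokenize_async(f"request {i}") for i in range(16)))
    print(len(results), "requests preprocessed off the event loop")

if __name__ == "__main__":
    asyncio.run(main())
```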

@noooop
Collaborator

noooop commented Apr 14, 2026

try #39763

Labels

bug (Something isn't working), frontend

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Low GPU Utilization with Image Payloads for Qwen2-VL-2B-Instruct Embeddings

2 participants