Skip to content

fix(harmony): run render_for_completion in thread pool to unblock event loop#38318

Open
ztimothy96 wants to merge 1 commit intovllm-project:mainfrom
ztimothy96:fix/harmony-render-for-completion-async-38266
Open

fix(harmony): run render_for_completion in thread pool to unblock event loop#38318
ztimothy96 wants to merge 1 commit intovllm-project:mainfrom
ztimothy96:fix/harmony-render-for-completion-async-38266

Conversation

@ztimothy96
Copy link
Copy Markdown

@ztimothy96 ztimothy96 commented Mar 27, 2026

Purpose

Resolves #38266. When a request contains a very long repetitive sequence, the synchronous Harmony tokenizer call (render_for_completion) could block the asyncio event loop for minutes, making the server unresponsive to all
other requests.

Fix: wrap render_for_completion with make_async() (run_in_executor) and propagate async/await through the entire call chain (_make_request_with_harmony, HarmonyContext.render_for_completion,
StreamingHarmonyContext.render_for_completion) so the event loop is free to serve concurrent requests while tokenization runs in a thread. The same fix is applied to OpenAIServingRender._make_request_with_harmony in the Chat
Completions path.

Checked for existing PRs addressing #38266 — none found as of March 26, 2026.

Test Plan

Add a unit test, test_render_for_completion_async.py, to ensure all callers of the function are async and that the function itself releases the event loop.

Also run existing tests that call _make_request_with_harmony, since those call sites were updated to add await.

.venv/bin/python -m pytest tests/entrypoints/openai/parser/test_render_for_completion_async.py -v
.venv/bin/python -m pytest tests/entrypoints/openai/responses/test_serving_responses.py -v

Chat Completion test (requires model artifacts)

.venv/bin/python -m pytest tests/entrypoints/openai/chat_completion/test_serving_chat.py -v

Test Result

This fix was developed with AI assistance (Claude). I have reviewed every changed line and personally ran the first two test commands above.

Note: test_serving_chat.py::TestGPTOSSChat requires the Harmony vocab file (not available in a standard dev environment) and was not run locally. The await fixes to that file are mechanical call-site corrections that follow directly from the async signature change.

.venv/bin/python -m pytest tests/entrypoints/openai/parser/test_render_for_completion_async.py -v
====================================== test session starts ======================================
platform darwin -- Python 3.12.13, pytest-9.0.2, pluggy-1.6.0 -- /Users/tzhou/Documents/GitHub/vllm/.venv/bin/python
cachedir: .pytest_cache
rootdir: /Users/tzhou/Documents/GitHub/vllm
configfile: pyproject.toml
plugins: asyncio-1.3.0, anyio-4.13.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 9 items                                                                               

tests/entrypoints/openai/parser/test_render_for_completion_async.py::TestMakeAsyncReleasesEventLoop::test_blocking_call_in_executor_allows_other_coroutines PASSED [ 11%]
tests/entrypoints/openai/parser/test_render_for_completion_async.py::TestMakeAsyncReleasesEventLoop::test_make_async_returns_awaitable PASSED [ 22%]
tests/entrypoints/openai/parser/test_render_for_completion_async.py::TestRenderForCompletionAsyncIsAwaitable::test_render_for_completion_async_exported PASSED [ 33%]
tests/entrypoints/openai/parser/test_render_for_completion_async.py::TestRenderForCompletionAsyncIsAwaitable::test_harmony_context_render_for_completion_is_coroutine PASSED [ 44%]
tests/entrypoints/openai/parser/test_render_for_completion_async.py::TestRenderForCompletionAsyncIsAwaitable::test_streaming_harmony_context_render_for_completion_is_coroutine PASSED [ 55%]
tests/entrypoints/openai/parser/test_render_for_completion_async.py::TestRenderForCompletionAsyncIsAwaitable::test_make_request_with_harmony_is_coroutine_in_render_serving PASSED [ 66%]
tests/entrypoints/openai/parser/test_render_for_completion_async.py::TestRenderForCompletionAsyncIsAwaitable::test_make_request_with_harmony_is_coroutine_in_responses_serving PASSED [ 77%]
tests/entrypoints/openai/parser/test_render_for_completion_async.py::TestGetEncodingThreadSafety::test_get_encoding_initialized_exactly_once_under_concurrency PASSED [ 88%]
tests/entrypoints/openai/parser/test_render_for_completion_async.py::TestRenderForCompletionAsyncResult::test_async_wrapper_preserves_return_value PASSED [100%]
.venv/bin/python -m pytest tests/entrypoints/openai/responses/test_serving_responses.py -v
====================================== test session starts ======================================
platform darwin -- Python 3.12.13, pytest-9.0.2, pluggy-1.6.0 -- /Users/tzhou/Documents/GitHub/vllm/.venv/bin/python
cachedir: .pytest_cache
rootdir: /Users/tzhou/Documents/GitHub/vllm
configfile: pyproject.toml
plugins: asyncio-1.3.0, anyio-4.13.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 16 items                                                                              

tests/entrypoints/openai/responses/test_serving_responses.py::test_extract_tool_types PASSED [  6%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestInitializeToolSessions::test_initialize_tool_sessions PASSED [ 12%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestInitializeToolSessions::test_validate_create_responses_input PASSED [ 18%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestValidateGeneratorInput::test_validate_generator_input PASSED [ 25%]
tests/entrypoints/openai/responses/test_serving_responses.py::test_reasoning_tokens_counted_for_text_reasoning_model PASSED [ 31%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestExtractAllowedToolsFromMcpRequests::test_extract_allowed_tools_basic_formats PASSED [ 37%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestExtractAllowedToolsFromMcpRequests::test_extract_allowed_tools_star_normalization PASSED [ 43%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestExtractAllowedToolsFromMcpRequests::test_extract_allowed_tools_filters_non_mcp PASSED [ 50%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestHarmonyPreambleStreaming::test_preamble_delta_emits_text_events PASSED [ 56%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestHarmonyPreambleStreaming::test_preamble_delta_second_token_no_added PASSED [ 62%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestHarmonyPreambleStreaming::test_commentary_with_function_recipient_not_preamble PASSED [ 68%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestHarmonyPreambleStreaming::test_preamble_done_emits_text_done_events PASSED [ 75%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestHarmonyPreambleStreaming::test_commentary_with_recipient_no_preamble_done PASSED [ 81%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestStreamingReasoningToContentTransition::test_mixed_delta_reasoning_and_content_emits_reasoning_delta PASSED [ 87%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestStreamingReasoningToContentTransition::test_transition_without_mixed_delta_no_extra_reasoning_event PASSED [ 93%]
tests/entrypoints/openai/responses/test_serving_responses.py::TestStreamingReasoningToContentTransition::test_reasoning_only_stream_no_content PASSED [100%]
Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions
Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a performance bottleneck where the synchronous render_for_completion tokenizer call could block the event loop for extended periods. By wrapping this call with make_async, the rendering now runs in a thread pool, and the associated methods in HarmonyContext, OpenAIServingRender, and OpenAIServingResponses have been updated to be asynchronous. New unit tests were added to ensure the event loop remains responsive during tokenization. A critical issue was identified regarding a race condition in the global encoding singleton's lazy initialization, which now requires thread-safe locking because the tokenization logic is executed concurrently across multiple threads.

Comment thread vllm/entrypoints/openai/parser/harmony_utils.py
@ztimothy96 ztimothy96 force-pushed the fix/harmony-render-for-completion-async-38266 branch from 09198bc to 94c558a Compare March 27, 2026 05:33
@DarkLight1337
Copy link
Copy Markdown
Member

Can you double check whether this is still needed after #34789?

Resolves vllm-project#38266. When a request contains a very long repetitive sequence, the synchronous Harmony tokenizer call (render_for_completion) could block the asyncio event loop for minutes, making the server unresponsive to all other requests.

Fix: wrap render_for_completion with make_async() (run_in_executor) and
propagate async/await through the entire call chain (_make_request_with_harmony, HarmonyContext.render_for_completion, StreamingHarmonyContext.render_for_completion) so the event loop is free to serve concurrent requests while tokenization runs in a thread.

Co-authored-by: Claude

Signed-off-by: Timothy Zhou <z.timothy96@gmail.com>
@ztimothy96 ztimothy96 force-pushed the fix/harmony-render-for-completion-async-38266 branch from 94c558a to 3298a85 Compare March 27, 2026 17:56
Comment thread vllm/entrypoints/openai/parser/harmony_utils.py
return token_ids


render_for_completion_async = make_async(render_for_completion)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will not have any effect if you don't provide make_async with another ThreadPoolExecutor, so the function is executed inside a dedicated thread

from concurrent.futures import ThreadPoolExecutor

render_for_completion_async = make_async(render_for_completion, ThreadPoolExecutor(max_workers=4))

global _harmony_encoding
if _harmony_encoding is None:
_harmony_encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
with _harmony_encoding_lock:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe that's not needed because the underlying code should already be thread safe ? tokenizer: Arc::new(encoding_ext.load()?)

https://doc.rust-lang.org/std/sync/struct.Arc.html#thread-safety

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

frontend gpt-oss Related to GPT-OSS models

Projects

Status: To Triage

Development

Successfully merging this pull request may close these issues.

[Bug]: tokenizing long redundant sequences causes API server deadlock (harmony and others)

3 participants