Make engine core client handshake timeout configurable #27444

Merged
njhill merged 8 commits into vllm-project:main from eicherseiji:config-client-timeout
Dec 19, 2025

Conversation

@eicherseiji
Contributor

@eicherseiji eicherseiji commented Oct 24, 2025

Purpose

  • Adds a configurable engine core client timeout environment variable
  • While starting a DPEP16 deployment with DeepGEMM and DeepEP low-latency all2all, the engine core client times out on the handshake because DeepGEMM warmup takes 10+ minutes, even though the engines are otherwise healthy
  • The handshake timeout is not easily configurable
(ApiServer_7 pid=161396)   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/async_llm.py", line 134, in __init__
(ApiServer_7 pid=161396)     self.engine_core = EngineCoreClient.make_async_mp_client(
(ApiServer_7 pid=161396)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ApiServer_7 pid=161396)   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(ApiServer_7 pid=161396)     return AsyncMPClient(*client_args)
(ApiServer_7 pid=161396)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ApiServer_7 pid=161396)   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 769, in __init__
(ApiServer_7 pid=161396)     super().__init__(
(ApiServer_7 pid=161396)   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 496, in __init__
(ApiServer_7 pid=161396)     raise TimeoutError("Timed out waiting for engines to send"
(ApiServer_7 pid=161396) TimeoutError: Timed out waiting for engines to send initial message on input socket.

Configuration with timeout:

#!/bin/bash
# DeepSeek-V3 deployment with Data+Expert Parallel across 8 GPUs using vllm serve

set -e

DS_V3_PATH="/hf-models/hub/models--deepseek-ai--DeepSeek-V3-0324/snapshots/e9b33add76883f293d6bf61f6bd89b497e80e335/"
MODEL_ID="dsv3"

COORDINATOR_IP="10.102.0.1"
COORDINATOR_RPC_PORT=13345

export VLLM_USE_DEEP_GEMM=1
export VLLM_ALL2ALL_BACKEND=deepep_low_latency
export VLLM_MOE_DP_CHUNK_SIZE=256
export VLLM_SKIP_P2P_CHECK=1
export VLLM_RANDOMIZE_DP_DUMMY_INPUTS=1
export NVIDIA_GDRCOPY=enabled
export VLLM_LOGGING_LEVEL=DEBUG
export NCCL_DEBUG=WARN
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export VLLM_MOE_ROUTING_SIMULATION_STRATEGY=uniform_random

vllm serve "$DS_V3_PATH" \
  --served-model-name "$MODEL_ID" \
  --data-parallel-address $COORDINATOR_IP \
  --data-parallel-rpc-port $COORDINATOR_RPC_PORT \
  --tensor-parallel-size 1 \
  --data-parallel-size 16 \
  --data-parallel-size-local 8 \
  --enable-expert-parallel \
  --max-model-len 16384 \
  --api-server-count 16 \
  --enable-dbo \
  --max-num-seqs 4096
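The change itself boils down to letting an environment variable override the hard-coded handshake timeout. A minimal sketch of that lookup, assuming a hypothetical default and helper name (only the VLLM_ENGINE_CORE_TIMEOUT_MS variable name comes from this PR; the rest is illustrative, not vLLM's actual internals):

```python
import os

# Hypothetical default; the real constant lives in vLLM's engine core client.
DEFAULT_HANDSHAKE_TIMEOUT_MS = 300_000

def handshake_timeout_ms() -> int:
    """Handshake timeout in ms, overridable via VLLM_ENGINE_CORE_TIMEOUT_MS."""
    raw = os.environ.get("VLLM_ENGINE_CORE_TIMEOUT_MS")
    if raw is None:
        return DEFAULT_HANDSHAKE_TIMEOUT_MS
    return int(raw)  # a non-integer value raises ValueError early
```

With something like this in place, a deployment such as the one above could export a larger value (e.g. 20 minutes, in milliseconds) to ride out DeepGEMM warmup instead of failing the handshake.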

Test Plan

  • CI test (we can remove this if overkill)
  • Manual test: add print statement to vllm/v1/engine/core_client.py

Test Result

CI test:

test_engine_core_client.py::test_mp_client_uses_env_timeout PASSED

Manual test:

(base) ray@k-d523f946249d60000:~/default/work/vllm/tests/v1/engine$ VLLM_ENGINE_CORE_TIMEOUT_MS=123456 vllm serve Qwen/Qwen2-0.5B
...
(APIServer pid=37647) VLLM_ENGINE_CORE_TIMEOUT_MS=123456
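The CI test above can be approximated with a self-contained sketch; `read_env_timeout_ms` here is a stand-in for the real client-side lookup in vllm/v1/engine/core_client.py, and only the environment variable name and test name come from this PR:

```python
import os
from unittest.mock import patch

def read_env_timeout_ms(default_ms: int = 300_000) -> int:
    # Stand-in for the client-side lookup; the default value is an assumption.
    return int(os.environ.get("VLLM_ENGINE_CORE_TIMEOUT_MS", default_ms))

def test_mp_client_uses_env_timeout():
    # patch.dict restores the environment when the block exits.
    with patch.dict(os.environ, {"VLLM_ENGINE_CORE_TIMEOUT_MS": "123456"}):
        assert read_env_timeout_ms() == 123456

test_mp_client_uses_env_timeout()
```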

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
@mergify mergify bot added the v1 label Oct 24, 2025
@eicherseiji eicherseiji changed the title Add VLLM_ENGINE_CORE_TIMEOUT_MS Make engine core client handshake timeout configurable Oct 24, 2025
@eicherseiji eicherseiji marked this pull request as ready for review October 24, 2025 17:03
@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 29, 2025
@tlrmchlsmth tlrmchlsmth requested a review from njhill October 29, 2025 20:14
@pbelevich

@eicherseiji @tlrmchlsmth this PR is closed but was not merged? why? how to deal with timeouts?

@eicherseiji
Contributor Author

@eicherseiji @tlrmchlsmth this PR is closed but was not merged? why? how to deal with timeouts?

Hi @pbelevich! I was able to resolve the timeout issue by using the runai model loader and updating to a release with DeepGEMM warmup heuristics.

We were not sure there were other legitimate cases where the client would time out. Do you have a repro?

@asharkhan3101

asharkhan3101 commented Dec 16, 2025

In our case, running DeepSeek V3 on MI300X, the timeout happens exactly when everything is loaded. I think making this option configurable would help in such cases.

@eicherseiji eicherseiji reopened this Dec 16, 2025
Member

@njhill njhill left a comment


Thanks @eicherseiji , lgtm

Member

@njhill njhill left a comment


Thanks @eicherseiji

@mergify

mergify bot commented Dec 16, 2025

Hi @eicherseiji, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

eicherseiji and others added 2 commits December 16, 2025 13:25
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
@njhill njhill enabled auto-merge (squash) December 18, 2025 17:51
@njhill njhill merged commit 1ab5213 into vllm-project:main Dec 19, 2025
48 checks passed
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Dec 22, 2025
Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025
yma11 pushed a commit to yma11/vllm that referenced this pull request Jan 12, 2026
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026