Make engine core client handshake timeout configurable #27444

Merged
njhill merged 8 commits into vllm-project:main from eicherseiji:config-client-timeout
Dec 19, 2025

Conversation

@eicherseiji
Contributor

@eicherseiji eicherseiji commented Oct 24, 2025

Purpose

  • Adds a configurable engine core client timeout environment variable
  • While starting a DPEP16 deployment with DeepGEMM and DeepEP low-latency all2all, the engine core client times out on the handshake because DeepGEMM warmup takes 10+ minutes, even though the engines are otherwise healthy
  • The handshake timeout is not easily configurable
(ApiServer_7 pid=161396)   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/async_llm.py", line 134, in __init__
(ApiServer_7 pid=161396)     self.engine_core = EngineCoreClient.make_async_mp_client(
(ApiServer_7 pid=161396)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ApiServer_7 pid=161396)   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(ApiServer_7 pid=161396)     return AsyncMPClient(*client_args)
(ApiServer_7 pid=161396)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ApiServer_7 pid=161396)   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 769, in __init__
(ApiServer_7 pid=161396)     super().__init__(
(ApiServer_7 pid=161396)   File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 496, in __init__
(ApiServer_7 pid=161396)     raise TimeoutError("Timed out waiting for engines to send"
(ApiServer_7 pid=161396) TimeoutError: Timed out waiting for engines to send initial message on input socket.

Configuration with timeout:

#!/bin/bash
# DeepSeek-V3 deployment with Data+Expert Parallel across 8 GPUs using vllm serve

set -e

DS_V3_PATH="/hf-models/hub/models--deepseek-ai--DeepSeek-V3-0324/snapshots/e9b33add76883f293d6bf61f6bd89b497e80e335/"
MODEL_ID="dsv3"

COORDINATOR_IP="10.102.0.1"
COORDINATOR_RPC_PORT=13345

export VLLM_USE_DEEP_GEMM=1
export VLLM_ALL2ALL_BACKEND=deepep_low_latency
export VLLM_MOE_DP_CHUNK_SIZE=256
export VLLM_SKIP_P2P_CHECK=1
export VLLM_RANDOMIZE_DP_DUMMY_INPUTS=1
export NVIDIA_GDRCOPY=enabled
export VLLM_LOGGING_LEVEL=DEBUG
export NCCL_DEBUG=WARN
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export VLLM_MOE_ROUTING_SIMULATION_STRATEGY=uniform_random

vllm serve "$DS_V3_PATH" \
  --served-model-name "$MODEL_ID" \
  --data-parallel-address $COORDINATOR_IP \
  --data-parallel-rpc-port $COORDINATOR_RPC_PORT \
  --tensor-parallel-size 1 \
  --data-parallel-size 16 \
  --data-parallel-size-local 8 \
  --enable-expert-parallel \
  --max-model-len 16384 \
  --api-server-count 16 \
  --enable-dbo \
  --max-num-seqs 4096
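The change itself boils down to letting an environment variable override the hard-coded handshake timeout. A minimal sketch of that lookup, assuming a hypothetical default and helper name (only the VLLM_ENGINE_CORE_TIMEOUT_MS variable name comes from this PR; the rest is illustrative, not vLLM's actual internals):

```python
import os

# Hypothetical default; the real constant lives in vLLM's engine core client.
DEFAULT_HANDSHAKE_TIMEOUT_MS = 300_000

def handshake_timeout_ms() -> int:
    """Handshake timeout in ms, overridable via VLLM_ENGINE_CORE_TIMEOUT_MS."""
    raw = os.environ.get("VLLM_ENGINE_CORE_TIMEOUT_MS")
    if raw is None:
        return DEFAULT_HANDSHAKE_TIMEOUT_MS
    return int(raw)  # a non-integer value raises ValueError early
```

With something like this in place, a deployment such as the one above could export a larger value (e.g. 20 minutes, in milliseconds) to ride out DeepGEMM warmup instead of failing the handshake.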

Test Plan

  • CI test (we can remove this if overkill)
  • Manual test: add print statement to vllm/v1/engine/core_client.py

Test Result

CI test:

test_engine_core_client.py::test_mp_client_uses_env_timeout PASSED

Manual test:

(base) ray@k-d523f946249d60000:~/default/work/vllm/tests/v1/engine$ VLLM_ENGINE_CORE_TIMEOUT_MS=123456 vllm serve Qwen/Qwen2-0.5B
...
(APIServer pid=37647) VLLM_ENGINE_CORE_TIMEOUT_MS=123456
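The CI test above can be approximated with a self-contained sketch; `read_env_timeout_ms` here is a stand-in for the real client-side lookup in vllm/v1/engine/core_client.py, and only the environment variable name and test name come from this PR:

```python
import os
from unittest.mock import patch

def read_env_timeout_ms(default_ms: int = 300_000) -> int:
    # Stand-in for the client-side lookup; the default value is an assumption.
    return int(os.environ.get("VLLM_ENGINE_CORE_TIMEOUT_MS", default_ms))

def test_mp_client_uses_env_timeout():
    # patch.dict restores the environment when the block exits.
    with patch.dict(os.environ, {"VLLM_ENGINE_CORE_TIMEOUT_MS": "123456"}):
        assert read_env_timeout_ms() == 123456

test_mp_client_uses_env_timeout()
```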

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
@mergify mergify bot added the v1 label Oct 24, 2025
@eicherseiji eicherseiji changed the title Add VLLM_ENGINE_CORE_TIMEOUT_MS Make engine core client handshake timeout configurable Oct 24, 2025
@eicherseiji eicherseiji marked this pull request as ready for review October 24, 2025 17:03
@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 29, 2025
@tlrmchlsmth tlrmchlsmth requested a review from njhill October 29, 2025 20:14
@pbelevich

@eicherseiji @tlrmchlsmth this PR is closed but was not merged? why? how to deal with timeouts?

@eicherseiji
Contributor Author

@eicherseiji @tlrmchlsmth this PR is closed but was not merged? why? how to deal with timeouts?

Hi @pbelevich! I was able to resolve the timeout issue by using the runai model loader and updating to a release with DeepGEMM warmup heuristics.

We were not sure there were other legitimate cases where the client would time out. Do you have a repro?

@asharkhan3101

asharkhan3101 commented Dec 16, 2025

In our case, running DeepSeek V3 on MI300X, the timeout happens exactly when everything is loaded. I think making this option configurable would help in such cases.

@eicherseiji eicherseiji reopened this Dec 16, 2025
Member

@njhill njhill left a comment


Thanks @eicherseiji , lgtm

Member

@njhill njhill left a comment


Thanks @eicherseiji

@mergify

mergify bot commented Dec 16, 2025

Hi @eicherseiji, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

eicherseiji and others added 2 commits December 16, 2025 13:25
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
@njhill njhill enabled auto-merge (squash) December 18, 2025 17:51
@njhill njhill merged commit 1ab5213 into vllm-project:main Dec 19, 2025
48 checks passed
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Dec 22, 2025
Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025
yma11 pushed a commit to yma11/vllm that referenced this pull request Jan 12, 2026
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026