[DO NOT MERGE] Reapply "[BugFix] Fix engine hanging after KV cache initialization failure #35478" #36650

Closed
markmc wants to merge 1 commit into vllm-project:main from markmc:revert-revert-fix-engine-hanging
Conversation

@markmc
Member

@markmc markmc commented Mar 10, 2026

Distributed Test 4 GPUs is still failing. Testing whether this revert fixes it.

Fixes #36624

See #36628 (comment)

The series of relevant PRs are:

So it appears that reverting #36262 might be sufficient

…i… (vllm-project#36262)

This reverts commit 26bd43b.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
markmc requested a review from njhill as a code owner March 10, 2026 12:08
markmc added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Mar 10, 2026
mergify bot added the "v1" and "bug" (Something isn't working) labels Mar 10, 2026
@markmc
Member Author

markmc commented Mar 10, 2026

Gah! I only found the main build failure now ...

The PR branch did not include the shutdown timeout work

Now I don't believe this PR will fix the tests

@markmc
Member Author

markmc commented Mar 10, 2026

> Now I don't believe this PR will fix the tests

Yep, this revert didn't help - https://buildkite.com/vllm/ci/builds/55483

@markmc markmc closed this Mar 10, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request reintroduces error handling for KV cache initialization and handshake processes to prevent engine hanging. It correctly propagates initialization failures to the frontend. The changes are well-structured, but the use of broad except Exception: clauses could be refined for better debugging and maintainability.

    num_gpu_blocks, num_cpu_blocks, kv_cache_config = (
        self._initialize_kv_caches(vllm_config)
    )
except Exception:
Contributor

high

Catching a broad Exception can mask specific underlying issues, making debugging more challenging. While the intent here is likely to catch any failure during KV cache initialization, it's generally recommended to catch more specific exceptions if possible. If the exact exceptions are unknown or if the intent is truly to catch all exceptions for a critical shutdown, consider adding a comment explaining this design choice.

For example, if there are known exceptions related to memory allocation or hardware, catching those specifically would provide clearer error messages.

Suggested change
- except Exception:
+ except Exception as e:
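
The trade-off in this comment can be sketched in a few lines. The following is only an illustration, not the PR's code: `initialize_kv_caches`, `init_engine`, and the `report_failure` callback are hypothetical stand-ins, not vLLM APIs. It assumes the broad catch is deliberate because the handler must signal failure for any error before re-raising, and it shows that binding the exception as `e` only helps if the handler actually uses it.

```python
import logging

logger = logging.getLogger("kv_cache_init_sketch")


def initialize_kv_caches(available_bytes: int, block_bytes: int) -> int:
    """Hypothetical stand-in for KV cache setup; returns the block count."""
    num_blocks = available_bytes // block_bytes
    if num_blocks <= 0:
        raise ValueError("not enough memory for a single KV cache block")
    return num_blocks


def init_engine(available_bytes: int, block_bytes: int, report_failure) -> int:
    """Run initialization and report any failure to an assumed callback."""
    try:
        return initialize_kv_caches(available_bytes, block_bytes)
    except Exception as e:
        # Deliberately broad: every failure must be reported so that the
        # waiting side does not hang. The exception is re-raised, so the
        # original type and traceback are not masked.
        logger.error("KV cache initialization failed: %r", e)
        report_failure(e)
        raise
```

Because the handler re-raises, callers can still catch the specific exception type further up the stack; the broad clause only guarantees that the failure is reported first.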

exc_during_init = False
try:
    yield addresses
except Exception:
Contributor

high

Similar to the previous comment, using a broad except Exception: here can obscure the root cause of failures during the handshake process. While exc_during_init ensures the FAILED status is sent, identifying the specific exception type would aid in diagnosing and resolving issues more effectively.

Consider catching more specific exceptions or adding a comment to justify the broad catch if it's a deliberate choice for robust error signaling during critical initialization.

Suggested change
- except Exception:
+ except Exception as e:
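
The flag-and-yield pattern discussed in this comment can be sketched as a generator-based context manager. This is a hedged illustration under stated assumptions rather than the PR's implementation: the `handshake` function, the `send_status` callback, and the "READY" status string are hypothetical; only the `exc_during_init` flag and the FAILED status come from the review text.

```python
from contextlib import contextmanager


@contextmanager
def handshake(addresses, send_status):
    """Hypothetical handshake scope; send_status is an assumed callback."""
    exc_during_init = False
    try:
        # The body of the caller's `with` block runs while we are suspended
        # here; an exception raised there is re-thrown at this yield.
        yield addresses
    except Exception:
        # Broad on purpose: record that initialization failed, then let the
        # original exception propagate unchanged.
        exc_during_init = True
        raise
    finally:
        # Always notify the peer so it never waits on a dead engine.
        # "READY" is an assumed name for the success status.
        send_status("FAILED" if exc_during_init else "READY")
```

In this sketch the broad clause never swallows the error, which is the usual justification for keeping it: the specific exception type still reaches the caller, and the peer still receives a status in every case.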

Labels

bug (Something isn't working)
ready (ONLY add when PR is ready to merge/full CI is needed)
v1

Development

Successfully merging this pull request may close these issues.

[Bug] External LB test_external_lb_dp[4] failing since shutdown timeout PR #34730

1 participant