[DO NOT MERGE] Reapply "[BugFix] Fix engine hanging after KV cache initialization failure #35478"#36650
[DO NOT MERGE] Reapply "[BugFix] Fix engine hanging after KV cache initialization failure #35478"#36650markmc wants to merge 1 commit intovllm-project:mainfrom
Conversation
…i… (vllm-project#36262) This reverts commit 26bd43b. Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Gah! I only found the main build failure now ... The PR branch did not include the shutdown timeout work Now I don't believe this PR will fix the tests |
Yep, this revert didn't help - https://buildkite.com/vllm/ci/builds/55483 |
There was a problem hiding this comment.
Code Review
The pull request reintroduces error handling for KV cache initialization and handshake processes to prevent engine hanging. It correctly propagates initialization failures to the frontend. The changes are well-structured, but the use of broad except Exception: clauses could be refined for better debugging and maintainability.
| num_gpu_blocks, num_cpu_blocks, kv_cache_config = ( | ||
| self._initialize_kv_caches(vllm_config) | ||
| ) | ||
| except Exception: |
There was a problem hiding this comment.
Catching a broad Exception can mask specific underlying issues, making debugging more challenging. While the intent here is likely to catch any failure during KV cache initialization, it's generally recommended to catch more specific exceptions if possible. If the exact exceptions are unknown or if the intent is truly to catch all exceptions for a critical shutdown, consider adding a comment explaining this design choice.
For example, if there are known exceptions related to memory allocation or hardware, catching those specifically would provide clearer error messages.
| except Exception: | |
| except Exception as e: |
| exc_during_init = False | ||
| try: | ||
| yield addresses | ||
| except Exception: |
There was a problem hiding this comment.
Similar to the previous comment, using a bare except Exception: here can obscure the root cause of failures during the handshake process. While exc_during_init ensures the FAILED status is sent, identifying the specific exception type would aid in diagnosing and resolving issues more effectively.
Consider catching more specific exceptions or adding a comment to justify the broad catch if it's a deliberate choice for robust error signaling during critical initialization.
| except Exception: | |
| except Exception as e: |
Distributed Test 4 GPUsis still failing. Testing whether this reverts fixes itFixes #36624
See #36628 (comment)
The series of relevant PRs are:
So it appears that reverting #36262 might be sufficient