Fix UnboundLocalError when DetokenizerManager constructor fails#21471
Fix UnboundLocalError when DetokenizerManager constructor fails#21471
Conversation
When the DetokenizerManager constructor fails (e.g., due to HF API rate limiting during tokenizer loading), the except block references `manager` which was never assigned, causing an UnboundLocalError. This prevents SIGQUIT from being sent to the parent process, leaving the server in a half-dead state that returns 503 for ~10 minutes until the test timeout.
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the robustness of the server's error handling mechanism. It resolves a critical issue where failures during the DetokenizerManager's initialization could lead to an UnboundLocalError, preventing proper signal propagation and leaving the server in a half-dead state. The changes ensure that the server gracefully exits and cleans up resources even when core components fail to start, improving overall system stability. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request improves the robustness of the run_detokenizer_process function by initializing the manager variable to None and adding a None check before attempting to call manager.maybe_clear_socket_mapping() within the exception handler. This prevents potential AttributeError if the manager fails to initialize. The review suggests a further improvement to error handling by wrapping the maybe_clear_socket_mapping() call in its own try...except block to ensure the critical parent_process.send_signal(signal.SIGQUIT) is always executed, even if the cleanup operation itself raises an exception, thus preventing a half-dead server state.
| if manager is not None: | ||
| manager.maybe_clear_socket_mapping() |
There was a problem hiding this comment.
To improve robustness, it's advisable to wrap the cleanup logic in a try...except block. This ensures that parent_process.send_signal(signal.SIGQUIT) is always executed, even if manager.maybe_clear_socket_mapping() were to raise an unexpected exception. The main goal is to prevent the server from entering a half-dead state, so guaranteeing the signal is sent is critical.
| if manager is not None: | |
| manager.maybe_clear_socket_mapping() | |
| if manager is not None: | |
| try: | |
| manager.maybe_clear_socket_mapping() | |
| except Exception as e_cleanup: | |
| logger.error(f"Error during detokenizer cleanup: {e_cleanup}") |
|
/tag-and-rerun-ci |
Motivation
When the DetokenizerManager constructor fails (e.g., due to HF API 429 rate limiting during
AutoTokenizer.from_pretrained), the except block inrun_detokenizer_processreferencesmanagerbefore it is assigned, raisingUnboundLocalError. This preventsSIGQUITfrom reaching the parent process, leaving the server in a half-dead state — it accepts HTTP connections but returns 503 on/health_generateindefinitely until the test timeout (~10 minutes).Example failure: https://github.com/sgl-project/sglang/actions/runs/23582037327/job/68670070787#step:7:5254
The server stays stuck returning 503 for ~10 minutes because SIGQUIT never fires:
Root cause in the logs:
Modification
Initialize
manager = Nonebefore the try block and guard themaybe_clear_socket_mapping()call with a None check, sosend_signal(SIGQUIT)always executes.Test plan