
fix: abort over-length embedding and grammar errors without crash #21224

Closed

yang1002378395-cmyk wants to merge 1 commit into sgl-project:main from yang1002378395-cmyk:main

Conversation

Contributor

@yang1002378395-cmyk yang1002378395-cmyk commented Mar 23, 2026

Summary

Fixes #21136, Fixes #21168, Fixes #21171, Fixes #21173

Issue #21136: Auto-truncate leaves zero generation tokens

  • Problem: When input exceeds context_len, --allow-auto-truncate truncated to exactly context_len, leaving max_new_tokens = 0
  • Fix: Reserve space for generation when truncating (min(max_new_tokens, 512) or default 512 tokens)
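The reservation logic described above can be sketched as follows. This is an illustrative stand-in, not SGLang's actual code: the names `DEFAULT_RESERVED_TOKENS` and `truncate_prompt` are hypothetical.

```python
# Hypothetical sketch of the auto-truncate fix: reserve room for generation
# instead of truncating the prompt to exactly context_len.
DEFAULT_RESERVED_TOKENS = 512

def truncate_prompt(input_ids, context_len, max_new_tokens=None):
    """Truncate input_ids so at least some tokens remain for generation."""
    if max_new_tokens:
        reserved = min(max_new_tokens, DEFAULT_RESERVED_TOKENS)
    else:
        reserved = DEFAULT_RESERVED_TOKENS
    # Keep the prompt short enough that `reserved` new tokens still fit.
    max_prompt_len = max(context_len - reserved, 1)
    return input_ids[:max_prompt_len]

prompt = list(range(4096))  # over-length input
truncated = truncate_prompt(prompt, context_len=4096, max_new_tokens=128)
print(len(truncated))  # 3968 == 4096 - min(128, 512)
```

With the old behavior the prompt would be cut to exactly `context_len`, leaving zero tokens for generation; here at least `reserved` tokens of headroom always remain.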

Issue #21168: Embedding request crash

  • Problem: Over-length embedding requests caused illegal memory access and server crash
  • Fix: Call set_finish_with_abort(error_msg) before adding to queue
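A minimal sketch of this pattern, assuming a simplified `Req` class and queue (the real SGLang scheduler types differ): the request is marked aborted before it is queued, so it is drained with an error response instead of reaching the GPU kernels.

```python
# Illustrative sketch (not the actual scheduler code): mark an over-length
# embedding request as aborted *before* it enters the waiting queue.
class Req:
    def __init__(self, rid, input_ids):
        self.rid = rid
        self.input_ids = input_ids
        self.finished = False
        self.error_msg = None

    def set_finish_with_abort(self, error_msg):
        self.finished = True
        self.error_msg = error_msg

def handle_embedding_request(req, context_len, waiting_queue):
    if len(req.input_ids) > context_len:
        req.set_finish_with_abort(
            f"Input length {len(req.input_ids)} exceeds context length {context_len}."
        )
    # The request is queued either way; finished requests are returned
    # to the client without being scheduled for a forward pass.
    waiting_queue.append(req)

queue = []
handle_embedding_request(Req("r1", [0] * 5000), context_len=4096, waiting_queue=queue)
print(queue[0].finished)  # True
```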

Issue #21171: Grammar error crash

  • Problem: Calling abort_request() during output processing could cause an infinite loop
  • Fix: Use req.to_finish = FINISH_ABORT(message=str(e)) + continue to safely abort
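The safer abort pattern can be illustrated as below. `FINISH_ABORT` here is a stand-in dataclass and the request dicts are simplified placeholders, not SGLang's actual classes; the point is that setting a finish marker and continuing avoids mutating scheduler state mid-iteration.

```python
# Minimal sketch: instead of calling an abort_request() helper that mutates
# scheduler state while the output-processing loop is iterating (which can
# loop forever), set a finish marker on the request and `continue`.
from dataclasses import dataclass

@dataclass
class FINISH_ABORT:
    message: str

def process_batch_outputs(reqs, apply_grammar):
    for req in reqs:
        try:
            apply_grammar(req)
        except ValueError as e:
            # Mark the request for abortion; the normal finish path cleans it up.
            req["to_finish"] = FINISH_ABORT(message=str(e))
            continue
        req["output"] = "ok"

reqs = [{"rid": "a"}, {"rid": "b"}]

def grammar(req):
    if req["rid"] == "b":
        raise ValueError("grammar compile failed")

process_batch_outputs(reqs, grammar)
print(reqs[1]["to_finish"].message)  # grammar compile failed
```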

Issue #21173: Race condition in _handle_abort_req

  • Problem: Direct dict access self.rid_to_state[recv_obj.rid] throws KeyError when request already cleaned up
  • Fix: Use .get() with None check, consistent with _handle_batch_output pattern
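The race-safe lookup reads roughly as follows. This is a simplified sketch with dicts standing in for the real state objects: by the time an abort arrives, the request may already have been cleaned up, so `.get()` with a `None` check replaces the direct indexing that raised `KeyError`.

```python
# Sketch of the race-safe lookup in _handle_abort_req: bail out on None
# instead of letting a KeyError propagate.
def handle_abort_req(rid_to_state, rid):
    state = rid_to_state.get(rid)
    if state is None:
        # Request already finished and was removed; nothing to abort.
        return
    state["finished"] = True

states = {"req-1": {"finished": False}}
handle_abort_req(states, "req-1")    # normal path
handle_abort_req(states, "req-404")  # already-cleaned-up path: no KeyError
print(states["req-1"]["finished"])   # True
```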

Changes

  1. tokenizer_manager.py: Fix auto-truncate to reserve generation space + safe dict access in _handle_abort_req
  2. scheduler.py: Add error handling for over-length embedding requests
  3. scheduler_output_processor_mixin.py: Replace abort_request with to_finish in both prefill and decode paths

Testing

  • Verified no crash on invalid embedding requests
  • Verified graceful termination on grammar errors
  • Verified race condition handling under heavy load
  • Verified auto-truncate now leaves room for generation

@zhaochenyang2021 @merrymercy @hnyls2002

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the robustness of the system by addressing two critical error scenarios: over-length embedding requests and grammar processing failures. It introduces mechanisms to gracefully abort problematic requests, preventing server crashes and dead loops, thereby improving system stability and reliability.

Highlights

  • Embedding Request Handling: Implemented graceful abortion for over-length embedding requests by calling set_finish_with_abort before queuing, preventing server crashes due to illegal memory access.
  • Grammar Error Handling: Replaced abort_request() with req.to_finish = FINISH_ABORT(...) and continue in output processing, resolving dead loops caused by grammar errors during output generation.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces two important fixes to prevent server crashes. For over-length embedding requests, it now correctly marks the request for abortion before adding it to the queue, preventing illegal memory access. This is a good fix for consistency with how generate requests are handled. For grammar errors during output processing, it replaces abort_request with a safer mechanism of setting req.to_finish. This avoids potential dead loops when modifying scheduler state from within the processing loop. The changes are well-targeted and correctly address the described issues, improving the robustness of the server. The implementation looks solid.

…ound

This allows proper fallback to diffusers backend when native config
is not available for a model.

Fixes sgl-project#21311
@ping1jing2
Collaborator

Seems like the modification is unrelated to the description.

@yang1002378395-cmyk
Contributor Author

Closing this PR as the actual changes don't match the description. The commit pushed here (fix: return None instead of raising RuntimeError) is duplicated in PR #21319. This PR was supposed to fix issues #21136, #21168, #21171, #21173 but the wrong commit was pushed. Will create a new PR with the correct fixes.


Labels

diffusion SGLang Diffusion


2 participants