Fix spec decoding acc length for dpsk-r1-fp4 tp8#12896
zhyncs merged 1 commit into sgl-project:main
Conversation
Summary of Changes: This pull request resolves a performance regression in speculative decoding.

Highlights
Code Review
This pull request addresses a performance issue with speculative decoding by reducing the default max_token_num for FlashInfer workspace initialization from 16384 to 2048. While this change appears to fix the reported problem, modifying default parameter values can have unintended consequences for other users, potentially causing silent performance regressions for use cases involving larger batch sizes. My feedback focuses on improving the maintainability of this change by adding comments to explain the reasoning behind the new default value, ensuring future developers understand the context.
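To make the concern concrete, here is a minimal sketch of the fallback behavior the review describes. The function and parameter names are assumed for illustration and are not the actual sglang/FlashInfer API; the point is that batches larger than the pre-sized workspace would silently take the slower, non-fused path.

```python
def fits_fused_workspace(num_tokens: int, max_token_num: int = 2048) -> bool:
    # The workspace is pre-allocated for at most `max_token_num` tokens,
    # so any larger batch must fall back to the unfused implementation.
    return num_tokens <= max_token_num

# With the new default, a 4096-token batch would no longer take the fused path,
# whereas under the old default of 16384 it would have.
use_fused = fits_fused_workspace(4096)  # False
```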
```diff
 def ensure_workspace_initialized(
-    max_token_num: int = 16384, hidden_dim: int = 4096, use_fp32_lamport: bool = False
+    max_token_num: int = 2048, hidden_dim: int = 4096, use_fp32_lamport: bool = False
```
Changing the default max_token_num from 16384 to 2048 is a significant reduction. While this fixes the speculative decoding issue you've identified, it's a potentially breaking change for users who might have been relying on the larger default for batches with more than 2048 tokens. They will now silently fall back to a slower, non-fused implementation. To improve maintainability and prevent future confusion, it would be beneficial to add a comment explaining why this value was changed and what specific issue it resolves. This context is crucial for understanding this 'magic number'.
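A sketch of what the suggested comment might look like at the definition site (the signature is taken from the diff; the comment wording and the `...` body are illustrative only):

```python
def ensure_workspace_initialized(
    # Default lowered from 16384 to 2048 to restore speculative-decoding
    # acceptance length (see PR #12896); batches with more tokens than
    # this fall back to the slower, non-fused implementation.
    max_token_num: int = 2048,
    hidden_dim: int = 4096,
    use_fp32_lamport: bool = False,
):
    ...  # body elided
```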
```diff
     weight: torch.Tensor,
     eps: float = 1e-6,
-    max_token_num: int = 16384,
+    max_token_num: int = 2048,
```
This reverts commit 9299a62.
Motivation
Before this PR, this command produced a very low acceptance length. I found the root cause to be #12758, so I reverted parts of that PR and have confirmed that the issue it originally fixed does not reappear.
I'm not sure what the purpose of increasing the workspace size was. If it is truly necessary, we can add it back after confirming that my command still works correctly. cc @elvischenv
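For context on what the workspace-size change means in memory terms, a back-of-the-envelope calculation. This assumes one contiguous fp16 buffer of `max_token_num * hidden_dim` elements per rank; the real FlashInfer workspace layout may differ.

```python
def workspace_bytes(max_token_num: int, hidden_dim: int = 4096, elem_size: int = 2) -> int:
    # Assumed layout: a single buffer of max_token_num * hidden_dim
    # elements, `elem_size` bytes each (2 for fp16).
    return max_token_num * hidden_dim * elem_size

old = workspace_bytes(16384)  # 134217728 bytes = 128 MiB under these assumptions
new = workspace_bytes(2048)   # 16777216 bytes  = 16 MiB
```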
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist