fix pd-mtp metadata buffer hidden size#23918
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates the disaggregation scheduler to use spec_hidden_size for EAGLE-based speculative algorithms and refactors the DSv4 Flash PD-disaggregation test. The test configuration now explicitly defines speculative decoding parameters for both prefill and decode stages and increases the GSM8K accuracy threshold to 0.95. Review feedback identifies an inconsistency in the scheduler where is_standalone() was omitted from a buffer sizing condition and suggests consolidating duplicated test arguments to improve maintainability.
| model_config.spec_hidden_size | ||
| if self.spec_algorithm.is_eagle() |
There was a problem hiding this comment.
The condition for using spec_hidden_size on the decode side is inconsistent with the prefill side (lines 952-954). It should also include is_standalone() to ensure that metadata buffers are sized correctly when using standalone speculative algorithms (such as MTP). Without this, a buffer size mismatch will occur during state transfer if is_standalone() is true. Note that hidden_states_dtype (lines 902-906) likely requires a similar update for consistency, although it is not part of this diff hunk.
| model_config.spec_hidden_size | |
| if self.spec_algorithm.is_eagle() | |
| model_config.spec_hidden_size | |
| if self.spec_algorithm.is_eagle() or self.spec_algorithm.is_standalone() |
| prefill_args = [ | ||
| *COMMON_ARGS, | ||
| "--trust-remote-code", | ||
| "--disaggregation-mode", | ||
| "prefill", | ||
| "--base-gpu-id", | ||
| "0", | ||
| "--tp", | ||
| "4", | ||
| "--cuda-graph-max-bs", | ||
| "128", | ||
| "--max-running-requests", | ||
| "256", | ||
| "--mem-fraction-static", | ||
| "0.7", | ||
| "--disaggregation-decode-tp", | ||
| "4", | ||
| "--disaggregation-decode-dp", | ||
| "4", | ||
| "--speculative-algorithm", | ||
| "EAGLE", | ||
| "--speculative-num-steps", | ||
| "3", | ||
| "--speculative-eagle-topk", | ||
| "1", | ||
| "--speculative-num-draft-tokens", | ||
| "4", |
There was a problem hiding this comment.
The speculative and runtime arguments are now duplicated between start_prefill and start_decode. To improve maintainability and prevent configuration drift, consider defining a shared list for common arguments (e.g., --cuda-graph-max-bs, --max-running-requests, and the --speculative-* flags) and unpacking it into both prefill_args and decode_args.
Use spec_hidden_size for the PD MetadataBuffers so the spec module's hidden state ferries correctly across prefill/decode. Adds matching EAGLE config to the PD-disagg NIXL test prefill side.