[Perf][Bugfix] Update dflash aux layer indexing #40727
benchislett wants to merge 1 commit into vllm-project:main from
Conversation
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Code Review
This pull request implements a +1 shift for DFlash auxiliary layer IDs in the GPU model runner to align with expected semantics. Feedback suggests applying this shift directly in the configuration utility to ensure consistency across execution paths and hardening the list comprehension against potential null values in the configuration dictionary.
```python
# TODO: does this need to be shifted by 1 like in gpu_model_runner?
aux_layer_ids = config_dict["aux_hidden_state_layer_ids"]
pre_trained_config["eagle_aux_hidden_state_layer_ids"] = aux_layer_ids
```
The TODO should be addressed by applying the +1 shift here. Since gpu_model_runner.py prioritizes the eagle_aux_hidden_state_layer_ids field (line 4937), the fix in the model runner's fallback logic is bypassed for DFlash models configured through this function. Applying the shift here ensures that the correct layer indices are used in the primary execution path.
Suggested change:

```diff
-# TODO: does this need to be shifted by 1 like in gpu_model_runner?
-aux_layer_ids = config_dict["aux_hidden_state_layer_ids"]
-pre_trained_config["eagle_aux_hidden_state_layer_ids"] = aux_layer_ids
+# Add 1 to convert DFlash's aux layer id semantics
+aux_layer_ids = [i + 1 for i in config_dict["aux_hidden_state_layer_ids"]]
+pre_trained_config["eagle_aux_hidden_state_layer_ids"] = aux_layer_ids
```
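A minimal sketch of the suggested conversion, isolated from the surrounding config-extraction code (the example layer ids are illustrative, not taken from the PR):

```python
# Example aux layer ids as a DFlash checkpoint might provide them
# (illustrative values only).
dflash_aux_layer_ids = [1, 15, 28]

# Add 1 to convert DFlash's aux layer id semantics to the indexing the
# vLLM model runner expects for eagle_aux_hidden_state_layer_ids.
eagle_aux_layer_ids = [i + 1 for i in dflash_aux_layer_ids]
```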
```python
# Add 1 to convert DFlash's aux layer id semantics
layer_ids = [i + 1 for i in dflash_config.get("target_layer_ids", [])]
```
Using dflash_config.get("target_layer_ids", []) can lead to a TypeError if the key exists in the dictionary but its value is explicitly set to None. It is safer to use dflash_config.get("target_layer_ids") or [] to ensure the list comprehension always receives an iterable.
Suggested change:

```diff
-# Add 1 to convert DFlash's aux layer id semantics
-layer_ids = [i + 1 for i in dflash_config.get("target_layer_ids", [])]
+# Add 1 to convert DFlash's aux layer id semantics
+layer_ids = [i + 1 for i in (dflash_config.get("target_layer_ids") or [])]
```
Threads layer_types and sliding_window through the DFlash Qwen3 drafter so target models with SWA layers can be drafted correctly:

- Per-layer SWA in qwen3_dflash: builds Attention layers with sliding_window for sliding_attention entries in layer_types, exposes sliding_attention_layer_names for the proposer.
- Speculators config: preserve layer_types, use_sliding_window, sliding_window, max_window_layers when extracting the HF config.
- DFlash proposer: force causal=True on the per-layer attention metadata for SWA layers so the windowed kernel runs correctly during parallel block drafting.

Built on top of vllm-project#40727 (target_layer_ids +1 shift). The shift in gpu_model_runner.py here overlaps with vllm-project#40727 and can be dropped once that lands.
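The per-layer SWA selection described above can be sketched in isolation. This is a simplified illustration, not the PR's actual code: it assumes an HF-style config exposing `layer_types` and a `sliding_window` size, and maps each `sliding_attention` entry to a window while full-attention layers get none:

```python
# Illustrative HF-style config fields (assumed, not from the PR).
layer_types = ["full_attention", "sliding_attention", "full_attention",
               "sliding_attention"]
sliding_window = 4096

# Per-layer window: sliding_attention layers get the window size,
# full_attention layers get None (no windowing).
per_layer_window = [
    sliding_window if t == "sliding_attention" else None
    for t in layer_types
]

# Names of the SWA layers, analogous to exposing
# sliding_attention_layer_names for the proposer (naming is hypothetical).
sliding_attention_layer_names = [
    f"layers.{i}" for i, t in enumerate(layer_types)
    if t == "sliding_attention"
]
```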
Purpose
A discrepancy in indexing causes a slight gap in the acceptance rates for DFlash vs. the reference implementation.
See: https://github.com/z-lab/dflash/blob/main/dflash/model.py#L44
This may have implications for Speculators; it is unclear whether they have the same issue.
Test Plan
The existing DFlash acceptance-rate test passes, and the acceptance rate increases.
Test Result