[main][feature] Support quarot for eagle3 without embedding #7038
MengqingCao merged 1 commit into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors the patch for quarot quantized models and adds support for eagle3 draft models that do not have their own embed_tokens. The logic correctly loads the embedding from the target model and applies the necessary anti-rotation.
I've found a couple of critical issues in the new patch_draft_quarot.py file that could lead to runtime errors, related to uninitialized variables and lack of None checks. I've also pointed out some typos that affect readability. Please see my detailed comments.
def make_load_weights(target_model_path, rotation_path):
    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]):
        Q = get_rotataion_matrix(rotation_path)
        Q3 = compute_rotataion_matrix3(Q)
        dtype = None
        model_weights = {}
        includes_draft_id_mapping = False
        includes_embed_tokens = False
        for name, loaded_weight in weights:
            if "t2d" in name:
                continue
            if "d2t" in name:
                name = name.replace("d2t", "draft_id_to_target_id")
                includes_draft_id_mapping = True
            elif "lm_head" not in name:
                name = "model." + name
            elif "fc." in name:
                # anti-rotate fc
                dtype = loaded_weight.dtype
                loaded_weight = loaded_weight @ Q3.to(dtype)
            if "embed_tokens" in name:
                includes_embed_tokens = True
            model_weights[name] = loaded_weight
            process_eagle_weight(self, name)

        # process embedding if drafter does not have embedding
        if not includes_embed_tokens:
            name = "model.embed_tokens.weight"
            loaded_weight = get_embedding_tensor(target_model_path).to(dtype) @ Q.T.to(dtype)
            model_weights[name] = loaded_weight
            includes_embed_tokens = True
            process_eagle_weight(self, name)

        skip_substrs = []
        if not includes_draft_id_mapping:
            skip_substrs.append("draft_id_to_target_id")
        if not includes_embed_tokens:
            skip_substrs.append("embed_tokens")
        if not self.model.use_aux_hidden_state:
            skip_substrs.append("fc.")
        loader = AutoWeightsLoader(
            self,
            skip_prefixes=None,
            skip_substrs=skip_substrs,
        )
        loader.load_weights(model_weights.items())

    return load_weights
This function has two critical issues that could lead to a crash:

- The `dtype` variable is initialized to `None` and only set if a weight name contains `"fc."`. If no such weight exists, `dtype` remains `None`, causing a `TypeError` on line 121 when `.to(dtype)` is called.
- The function `get_embedding_tensor` can return `None`, but the return value is used without a check on line 121, which would cause an `AttributeError` when `.to(dtype)` is called on `None`.

I suggest initializing `dtype` from `self.dtype` at the beginning of the function and adding a `None` check for the result of `get_embedding_tensor`.
def make_load_weights(target_model_path, rotation_path):
    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]):
        Q = get_rotataion_matrix(rotation_path)
        Q3 = compute_rotataion_matrix3(Q)
        dtype = self.dtype
        model_weights = {}
        includes_draft_id_mapping = False
        includes_embed_tokens = False
        for name, loaded_weight in weights:
            if "t2d" in name:
                continue
            if "d2t" in name:
                name = name.replace("d2t", "draft_id_to_target_id")
                includes_draft_id_mapping = True
            elif "lm_head" not in name:
                name = "model." + name
            elif "fc." in name:
                # anti-rotate fc
                loaded_weight = loaded_weight @ Q3.to(dtype)
            if "embed_tokens" in name:
                includes_embed_tokens = True
            model_weights[name] = loaded_weight
            process_eagle_weight(self, name)

        # process embedding if drafter does not have embedding
        if not includes_embed_tokens:
            embedding_tensor = get_embedding_tensor(target_model_path)
            if embedding_tensor is not None:
                name = "model.embed_tokens.weight"
                loaded_weight = embedding_tensor.to(dtype) @ Q.T.to(dtype)
                model_weights[name] = loaded_weight
                includes_embed_tokens = True
                process_eagle_weight(self, name)
            else:
                logger.warning(
                    f"Could not find embedding tensor in {target_model_path} "
                    "to patch draft model."
                )

        skip_substrs = []
        if not includes_draft_id_mapping:
            skip_substrs.append("draft_id_to_target_id")
        if not includes_embed_tokens:
            skip_substrs.append("embed_tokens")
        if not self.model.use_aux_hidden_state:
            skip_substrs.append("fc.")
        loader = AutoWeightsLoader(
            self,
            skip_prefixes=None,
            skip_substrs=skip_substrs,
        )
        loader.load_weights(model_weights.items())

    return load_weights

"""
Scans the directory and returns the first tensor found that contains 'embed' in its key.
Returns a tuple of (key_name, tensor) or (None, None) if not found.
"""
The docstring states that this function returns a tuple (key_name, tensor) or (None, None). However, the implementation returns only the tensor or None. The docstring should be updated to match the implementation.
| """ | |
| Scans the directory and returns the first tensor found that contains 'embed' in its key. | |
| Returns a tuple of (key_name, tensor) or (None, None) if not found. | |
| """ | |
| """ | |
| Scans the directory and returns the first tensor found that contains 'embed' in its key. | |
| Returns the tensor if found, otherwise None. | |
| """ |
def get_rotataion_matrix(rotation_path):
    """
    Anti-rotate maxtrix.
    """
    try:
        safetensor_data = load_file(rotation_path)
        Q = safetensor_data["global_rotation"]
        return Q
    except Exception as e:
        logger.error(
            f"Failed to load rotation weight from '{rotation_path}'. "
            "If you want to use quarot model with eagle3, take a check."
        )
        raise e


def compute_rotataion_matrix3(Q):
    """
    Anti-rotate matrix for 3 layers of hidden_states.
    """
    return torch.block_diag(Q, Q, Q)
There are a few typos in this block that affect readability and maintainability:

- On line 57, `get_rotataion_matrix` should be `get_rotation_matrix`.
- On line 59, `maxtrix` should be `matrix`.
- On line 74, `compute_rotataion_matrix3` should be `compute_rotation_matrix3`.

Please correct these. Remember to also update the function calls on lines 94 and 95.
def get_rotation_matrix(rotation_path):
    """
    Anti-rotate matrix.
    """
    try:
        safetensor_data = load_file(rotation_path)
        Q = safetensor_data["global_rotation"]
        return Q
    except Exception as e:
        logger.error(
            f"Failed to load rotation weight from '{rotation_path}'. "
            "If you want to use quarot model with eagle3, take a check."
        )
        raise e


def compute_rotation_matrix3(Q):
    """
    Anti-rotate matrix for 3 layers of hidden_states.
    """
    return torch.block_diag(Q, Q, Q)

Signed-off-by: drslark <slarksblood@qq.com>
MengqingCao left a comment:
LGTM, thanks for your effort!
### What this PR does / why we need it? Add an e2e test for QuaRot model with eagle3 that runs both the QuaRot model and the float model, and then compares their acceptance rates. The QuaRot model adapting eagle3 PR (#6914, #7038) - vLLM version: v0.16.0 - vLLM main: vllm-project/vllm@4034c3d Signed-off-by: zhaomingyu <zhaomingyu13@h-partners.com>

Cherry-pick from upstream main 6a7115f. Rename patch_qwen3_quarot to patch_draft_quarot (already present) and clean up redundant documentation.
What this PR does / why we need it?
If an `eagle3` model without embed_tokens works with a `quarot` target model, the acceptance rate will drop. We solve that in this PR; the toy sketch below illustrates why the target embedding must be anti-rotated.
The related vLLM PR is vllm-project/vllm#36225.
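For intuition only (not code from this PR): QuaRot stores the target model's weights in a rotated basis, so an embedding pulled from the quarot checkpoint is, under that assumption, roughly `E @ Q`. Since `Q` is orthogonal, multiplying by `Q.T` recovers the embedding the float drafter expects, which is what the patch's `@ Q.T` does:

import torch

d = 8  # toy embedding dim
Q, _ = torch.linalg.qr(torch.randn(d, d))  # orthogonal rotation: Q @ Q.T == I

E = torch.randn(100, d)   # original (float-model) embedding table
E_rotated = E @ Q         # what a quarot checkpoint stores, under our assumption

# Anti-rotation: why load_weights applies `@ Q.T` to the target embedding.
E_recovered = E_rotated @ Q.T
assert torch.allclose(E, E_recovered, atol=1e-5)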
Does this PR introduce any user-facing change?
N/A
How was this patch tested?
Tested with `Qwen3-32B-w8a8-quarot` + `Qwen3-32B-speculator.eagle3`, comparing acceptance rates without and with this change. [Result screenshots were attached in the original PR.]