
Conversation

@wpybtw wpybtw commented Jul 28, 2025

I am trying to add support for Qwen QwQ-32B.
QwQ-32B has already been trained for a few epochs.
Since QwQ-32B has the same architecture as Qwen2, I believe this also works for Qwen2.

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @wpybtw, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces full support for the Qwen2 and Qwen qwq-32B language models. It integrates these models into the system's architecture by providing both a specialized 'Eagle3 draft model' implementation for efficient speculative decoding and a 'distributed target model' for scalable training and inference. This expansion allows users to leverage the Qwen family of models within the framework, complete with necessary configurations and an example training script.

Highlights

  • New Model Support: Added comprehensive support for Qwen2 and Qwen qwq-32B models, enabling their integration into the existing framework.
  • Eagle3 Draft Model Integration: Introduced Qwen2ForCausalLMEagle3 to allow Qwen2 models to function as draft models within the Eagle3 architecture, complete with custom attention, MLP, and normalization layers.
  • Distributed Target Model Integration: Implemented Qwen2ForCausalLM as a distributed target model, leveraging tensor parallelism for efficient training and inference of Qwen2-based models.
  • Configuration and Example Script: Provided a new configuration file (qwq-32B-eagle3.json) and an example training script (run_qwq_eagle3_online.sh) to facilitate the setup and training of the new Qwen models.
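
As an aside on the configuration file listed above, here is a minimal loading sketch. The configs/ path is an assumption about where this PR places qwq-32B-eagle3.json, and the printed attributes are standard Qwen2Config fields; it is only an illustration, not the project's actual setup code.

    # Minimal sketch; the path below is assumed, not confirmed by the PR diff.
    from transformers import Qwen2Config

    draft_config = Qwen2Config.from_json_file("configs/qwq-32B-eagle3.json")
    print(draft_config.hidden_size, draft_config.num_hidden_layers)
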
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for the Qwen qwq-32B and Qwen2 models. It includes a new configuration file, an example training script, and the core model implementations for both the Eagle draft model and the distributed target model. The changes are well-structured and comprehensive. My feedback focuses on improving code clarity, correctness of type hints and docstrings, and adherence to best practices in both Python and shell scripting.

import torch.nn.functional as F
from transformers import GenerationMixin, Qwen2Config, PreTrainedModel
from transformers.activations import ACT2FN
from transformers.models.qwen2.configuration_qwen2 import Qwen2Config

This comment was marked as resolved.

Comment on lines +397 to +400
# if last:
self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
# else:
# self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size * 2, bias=False)

medium

This block of code is commented out. To improve code clarity and remove dead code, please remove these lines.

Comment on lines +451 to +453
) -> Tuple[
torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]
]:

medium

The return type hint Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]] does not match the actual return value, which is a single torch.FloatTensor (hidden_states). Please correct the type hint to match the implementation.

    ) -> torch.FloatTensor:

"""
Arguments:
hidden_states (`torch.FloatTensor`): input to the layer, cat low, mid high hidden_states of shape `(batch, seq_len, hidden_states * 3)`
input_ids (`torch.LongTensor`): input ids of shape `(batch, seq_len)`

medium

The docstring for the forward method mentions input_ids, but this is not an argument to the function. The function takes inputs_embeds. Please update the docstring to accurately reflect the function's parameters.

            inputs_embeds (`torch.Tensor`): embedded input tokens of shape `(batch, seq_len, hidden_size)`

past_key_value: Optional[Cache] = None,
cache_position: Optional[torch.LongTensor] = None,
**kwargs: Unpack[FlashAttentionKwargs],
) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]:

medium

The return type hint tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]] appears to be incorrect. The function returns attn_output, attn_weights, which corresponds to tuple[torch.Tensor, Optional[torch.Tensor]]. The third element in the hint seems to be for past_key_value, which is not returned here.

Suggested change
) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]:
) -> tuple[torch.Tensor, Optional[torch.Tensor]]:

Comment on lines +211 to +216
) -> tuple[
torch.FloatTensor, Optional[tuple[torch.FloatTensor, torch.FloatTensor]]
]:

medium

The return type hint tuple[torch.FloatTensor, Optional[tuple[torch.FloatTensor, torch.FloatTensor]]] does not match the returned value. The function returns a tuple containing hidden_states and optionally self_attn_weights. A more accurate type hint would be tuple[torch.FloatTensor, Optional[torch.FloatTensor]].

    ) -> tuple[torch.FloatTensor, Optional[torch.FloatTensor]]:

all_hidden_states = () if output_hidden_states else None
all_self_attns = () if output_attentions else None

for decoder_layer in self.layers[: self.config.num_hidden_layers]:

medium

The slice [: self.config.num_hidden_layers] is redundant because self.layers is already initialized with num_hidden_layers elements. You can simplify this by iterating directly over self.layers.

        for decoder_layer in self.layers:

Comment on lines +497 to +501
>>> model = Qwen2ForCausalLM.from_pretrained("meta-qwen2/Qwen2-2-7b-hf")
>>> tokenizer = AutoTokenizer.from_pretrained("meta-qwen2/Qwen2-2-7b-hf")

medium

The example in the docstring uses meta-qwen2/Qwen2-2-7b-hf. Given that this PR adds support for Qwen/QwQ-32B (as seen in the example script), it would be more consistent to use that model in the example.

        >>> model = Qwen2ForCausalLM.from_pretrained("Qwen/QwQ-32B")
        >>> tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")

"Qwen2ForSequenceClassification",
"Qwen2ForTokenClassification",
"Qwen2ForQuestionAnswering",
]
(No newline at end of file)

medium

This file is missing a final newline character. It's a common convention and good practice to end files with a newline.

@sleepcoo
Collaborator

Can you add a new test case like the one in tests/test_llama4_tp.py?
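
A rough sketch of what such a test could look like, assuming it mirrors the structure of tests/test_llama4_tp.py; the parametrization and the final steps described in comments are illustrative, not the project's actual harness.

    # Illustrative sketch only: the real test should reuse the launch/compare
    # helpers from tests/test_llama4_tp.py; only standard pytest/transformers
    # APIs are used here.
    import pytest
    from transformers import AutoConfig

    TARGET_MODEL = "Qwen/QwQ-32B"  # target model referenced elsewhere in this PR

    @pytest.mark.parametrize("tp_size", [2])
    def test_qwen2_target_model_tp(tp_size):
        # Sanity-check that the target really is a Qwen2-architecture model
        # and that its heads split evenly across the tensor-parallel ranks.
        config = AutoConfig.from_pretrained(TARGET_MODEL)
        assert config.model_type == "qwen2"
        assert config.num_attention_heads % tp_size == 0
        # The real test should build the distributed Qwen2ForCausalLM with
        # tensor parallel size tp_size, run a forward pass, and compare logits
        # against the single-device Hugging Face reference, mirroring
        # tests/test_llama4_tp.py.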

@wpybtw
Contributor Author

wpybtw commented Jul 28, 2025

Can you add a new test case like the one in tests/test_llama4_tp.py?

Sure. Done

@sleepcoo sleepcoo requested a review from ZhengHSI July 28, 2025 07:22
@sleepcoo
Collaborator

The code seems fine. I'll try training it later. Have you finished the training? Do you have any performance data, like the acceptance length?

Collaborator

Could you provide an example in the README.md?

Collaborator

https://huggingface.co/w497273/qwq-32b-eagle3/tree/main
Was this model weight obtained by training with this script?

Contributor Author

Yes. I ran two epochs of training on UltraChat. The weights are at https://huggingface.co/w497273/qwq-32b-eagle3/tree/main

I tested with:

python3 -m sglang.launch_server --model  /home/jovyan/computational_math/w00802858/QwQ-32B \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path  /home/jovyan/pvc-shared/computational_math/w00802858/SpecForge/outputs/QwQ-32B-eagle3/epoch_1 \
    --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --mem-fraction 0.7 \
    --disable-radix-cache --tp 2

python3 -m sglang.bench_serving --backend sglang  --dataset-name sharegpt  --warmup-requests 0 --num-prompt 1 --max-concurrency 1

The global accept length (self.cum_spec_accept_length / self.cum_spec_accept_count) is 1.67.
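
For context, a minimal sketch of how that ratio comes out; the counter values below are illustrative, not the actual run's numbers.

    # Illustrative numbers only; the real counters are accumulated by the server.
    cum_spec_accept_length = 835   # total accepted draft tokens
    cum_spec_accept_count = 500    # number of speculation rounds
    print(cum_spec_accept_length / cum_spec_accept_count)  # -> 1.67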

@FrankLeeeee
Collaborator

Can you run pre-commit?

@sleepcoo
Collaborator

sglang does not support Qwen2ForCausalLMEagle3 at present. Do you use Qwen2 as the draft model because the draft acceptance length of Qwen2 is higher?
@wpybtw

@wpybtw
Contributor Author

wpybtw commented Aug 1, 2025

sglang does not support Qwen2ForCausalLMEagle3 at present. Do you use Qwen2 as the draft model because the draft acceptance length of Qwen2 is higher? @wpybtw

I quickly added Qwen2ForCausalLMEagle3 to sglang by modifying LlamaForCausalLMEagle3 (wpybtw/sglang@910fca1).

What do you mean by "Do you use Qwen2 as the draft model because the draft acceptance length of Qwen2 is higher?"

wpybtw and others added 7 commits August 1, 2025 02:10
@wpybtw
Contributor Author

wpybtw commented Aug 1, 2025

Done.

from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
from transformers.models.qwen2.configuration_qwen2 import Qwen2Config
from transformers.models.qwen2.modeling_qwen2 import (
KwargsForCausalLM,

KwargsForCausalLM is not supported after transformers 4.53.
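
A possible way to keep the import working across transformers versions, sketched under the assumption that KwargsForCausalLM is only used as a typing alias here; the fallback type is illustrative and should match how the kwargs are actually consumed.

    # Sketch of a version-compatibility shim; the fallback is an assumption and
    # only makes sense if KwargsForCausalLM is used purely as a typing hint.
    try:
        from transformers.models.qwen2.modeling_qwen2 import KwargsForCausalLM
    except ImportError:  # removed/renamed in newer transformers releases
        from typing import Any, Dict
        KwargsForCausalLM = Dict[str, Any]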

@sleepcoo
Collaborator

Hello, can you fix the conflict and the transformers 4.53 bug? I have reviewed the code and can merge it.
@wpybtw

@ZhengHSI
Collaborator

Hello, can you fix the conflict and the transformers 4.53 bug? I have reviewed the code and can merge it. @wpybtw

@wpybtw please try to fix it

@ZhengHSI
Collaborator

This PR will be closed, and a new PR is available at #163.

@ZhengHSI ZhengHSI closed this Aug 21, 2025