Add support for Qwen qwq-32B and qwen2 #68
Conversation
Summary of Changes
Hello @wpybtw, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces full support for the Qwen2 and Qwen qwq-32B language models. It integrates these models into the system's architecture by providing both a specialized 'Eagle3 draft model' implementation for efficient speculative decoding and a 'distributed target model' for scalable training and inference. This expansion allows users to leverage the Qwen family of models within the framework, complete with necessary configurations and an example training script.
Highlights
- New Model Support: Added comprehensive support for Qwen2 and Qwen qwq-32B models, enabling their integration into the existing framework.
- Eagle3 Draft Model Integration: Introduced `Qwen2ForCausalLMEagle3` to allow Qwen2 models to function as draft models within the Eagle3 architecture, complete with custom attention, MLP, and normalization layers (see the sketch after this list).
- Distributed Target Model Integration: Implemented `Qwen2ForCausalLM` as a distributed target model, leveraging tensor parallelism for efficient training and inference of Qwen2-based models.
- Configuration and Example Script: Provided a new configuration file (`qwq-32B-eagle3.json`) and an example training script (`run_qwq_eagle3_online.sh`) to facilitate the setup and training of the new Qwen models.
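To make the Eagle3 integration concrete, here is a minimal, self-contained sketch of the kind of draft layer described above: a projection that fuses the concatenated low/mid/high target hidden states, followed by a Qwen2-style gated MLP and RMSNorm. The class name and dimensions are illustrative assumptions and do not reproduce the actual code added in this PR.

```python
# Illustrative sketch only -- not the PR's Qwen2ForCausalLMEagle3 implementation.
# It mirrors the structure described above: the draft layer consumes the
# concatenation of three target-model hidden states (3 * hidden_size wide),
# fuses them back to hidden_size, and applies a Qwen2-style gated MLP.
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Minimal RMSNorm, as used throughout the Qwen2 architecture."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)


class Eagle3DraftLayerSketch(nn.Module):
    """Hypothetical stand-in for an Eagle3-style draft decoder layer."""

    def __init__(self, hidden_size: int = 1024, intermediate_size: int = 2752):
        super().__init__()
        # Fuse the concatenated low/mid/high hidden states back to hidden_size.
        self.fc = nn.Linear(3 * hidden_size, hidden_size, bias=False)
        self.norm = RMSNorm(hidden_size)
        # Qwen2-style gated MLP (SwiGLU).
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act = nn.SiLU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, 3 * hidden_size)
        h = self.norm(self.fc(hidden_states))
        return h + self.down_proj(self.act(self.gate_proj(h)) * self.up_proj(h))


if __name__ == "__main__":
    layer = Eagle3DraftLayerSketch()
    out = layer(torch.randn(2, 8, 3 * 1024))
    print(out.shape)  # torch.Size([2, 8, 1024])
```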
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request adds support for the Qwen qwq-32B and Qwen2 models. It includes a new configuration file, an example training script, and the core model implementations for both the Eagle draft model and the distributed target model. The changes are well-structured and comprehensive. My feedback focuses on improving code clarity, correctness of type hints and docstrings, and adherence to best practices in both Python and shell scripting.
```python
import torch.nn.functional as F
from transformers import GenerationMixin, Qwen2Config, PreTrainedModel
from transformers.activations import ACT2FN
from transformers.models.qwen2.configuration_qwen2 import Qwen2Config
```
This comment was marked as resolved.
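The text of the resolved comment is no longer shown, but the snippet above imports `Qwen2Config` twice (once from `transformers` and once from `configuration_qwen2`). A deduplicated form — offered here only as an assumption about the intended fix, not the change actually committed — would be:

```python
# Assumed deduplicated imports; keeps a single Qwen2Config import.
import torch.nn.functional as F
from transformers import GenerationMixin, PreTrainedModel
from transformers.activations import ACT2FN
from transformers.models.qwen2.configuration_qwen2 import Qwen2Config
```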
```python
# if last:
self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
# else:
#     self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size * 2, bias=False)
```
```python
) -> Tuple[
    torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]
]:
```
| """ | ||
| Arguments: | ||
| hidden_states (`torch.FloatTensor`): input to the layer, cat low, mid high hidden_states of shape `(batch, seq_len, hidden_states * 3)` | ||
| input_ids (`torch.LongTensor`): input ids of shape `(batch, seq_len)` |
The docstring for the forward method mentions input_ids, but this is not an argument to the function. The function takes inputs_embeds. Please update the docstring to accurately reflect the function's parameters.
Suggested docstring line:

```python
    inputs_embeds (`torch.Tensor`): embedded input tokens of shape `(batch, seq_len, hidden_size)`
```

```python
    past_key_value: Optional[Cache] = None,
    cache_position: Optional[torch.LongTensor] = None,
    **kwargs: Unpack[FlashAttentionKwargs],
) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]:
```
The return type hint tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]] appears to be incorrect. The function returns attn_output, attn_weights, which corresponds to tuple[torch.Tensor, Optional[torch.Tensor]]. The third element in the hint seems to be for past_key_value, which is not returned here.
```diff
-) -> tuple[torch.Tensor, Optional[torch.Tensor], Optional[tuple[torch.Tensor]]]:
+) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
```
```python
) -> tuple[
    torch.FloatTensor, Optional[tuple[torch.FloatTensor, torch.FloatTensor]]
]:
```
The return type hint tuple[torch.FloatTensor, Optional[tuple[torch.FloatTensor, torch.FloatTensor]]] does not match the returned value. The function returns a tuple containing hidden_states and optionally self_attn_weights. A more accurate type hint would be tuple[torch.FloatTensor, Optional[torch.FloatTensor]].
Suggested type hint:

```python
) -> tuple[torch.FloatTensor, Optional[torch.FloatTensor]]:
```

```python
all_hidden_states = () if output_hidden_states else None
all_self_attns = () if output_attentions else None

for decoder_layer in self.layers[: self.config.num_hidden_layers]:
```
```python
>>> model = Qwen2ForCausalLM.from_pretrained("meta-qwen2/Qwen2-2-7b-hf")
>>> tokenizer = AutoTokenizer.from_pretrained("meta-qwen2/Qwen2-2-7b-hf")
```
The example in the docstring uses meta-qwen2/Qwen2-2-7b-hf. Given that this PR adds support for Qwen/QwQ-32B (as seen in the example script), it would be more consistent to use that model in the example.
```python
>>> model = Qwen2ForCausalLM.from_pretrained("Qwen/QwQ-32B")
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")
```
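For context, a runnable version of that docstring example using the public transformers API (rather than this repository's `Qwen2ForCausalLM` wrapper) could look like the following sketch; the prompt and generation settings are arbitrary placeholders.

```python
# Standalone usage sketch with the suggested Qwen/QwQ-32B checkpoint.
# Uses transformers' AutoModelForCausalLM, not this repository's wrapper.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/QwQ-32B", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")

inputs = tokenizer("Explain speculative decoding in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```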
specforge/modeling/target/qwen2.py (Outdated)
| "Qwen2ForSequenceClassification", | ||
| "Qwen2ForTokenClassification", | ||
| "Qwen2ForQuestionAnswering", | ||
| ] No newline at end of file |
Can you add a new test case like the one in tests/test_llama4_tp.py?

Sure. Done.
The code seems fine. I'll try training it later. Have you finished the training? Do you have any performance data, like the acceptance length?
Could you provide an example in the README.md?
https://huggingface.co/w497273/qwq-32b-eagle3/tree/main
Was this model weight obtained by training with this script?
Yes. I ran two epochs of training on UltraChat. The weights are at https://huggingface.co/w497273/qwq-32b-eagle3/tree/main

I tested with:

```bash
python3 -m sglang.launch_server --model /home/jovyan/computational_math/w00802858/QwQ-32B \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path /home/jovyan/pvc-shared/computational_math/w00802858/SpecForge/outputs/QwQ-32B-eagle3/epoch_1 \
    --speculative-num-steps 4 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --mem-fraction 0.7 \
    --disable-radix-cache --tp 2

python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --warmup-requests 0 --num-prompt 1 --max-concurrency 1
```

The global accept length (`self.cum_spec_accept_length / self.cum_spec_accept_count`) is 1.67.
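For readers unfamiliar with the metric, the quoted acceptance length is simply the ratio of the two cumulative counters named above; a toy illustration with made-up counter values:

```python
# Toy illustration of how the reported "global accept len" is computed.
# The counter values below are made up; only the formula comes from the comment.
cum_spec_accept_length = 1670  # total draft tokens accepted across all requests
cum_spec_accept_count = 1000   # total speculative verification rounds
global_accept_len = cum_spec_accept_length / cum_spec_accept_count
print(f"global accept len = {global_accept_len:.2f}")  # -> 1.67
```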
Can you run pre-commit?
sglang does not support Qwen2ForCausalLMEagle3 at present. Do you use Qwen2 as the draft because the draft acceptance length of Qwen2 is higher?

I quickly added Qwen2ForCausalLMEagle3 in sglang by modifying LlamaForCausalLMEagle3 (wpybtw/sglang@910fca1). What do you mean by "Do you use Qwen2 as the draft because the draft acceptance length of Qwen2 is higher?"
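The linked commit is described as reusing LlamaForCausalLMEagle3 for Qwen2. As a rough, self-contained illustration of that pattern — the classes and registry below are stand-ins, not sglang's actual internals:

```python
# Stand-in classes only; this does not reflect sglang's real module layout.
class LlamaForCausalLMEagle3:
    """Pretend existing Eagle3 draft model implementation."""

    def __init__(self, config: dict):
        self.config = config


class Qwen2ForCausalLMEagle3(LlamaForCausalLMEagle3):
    """Reuses the Llama Eagle3 forward path under the Qwen2 architecture name."""


# Registering the new architecture string so a loader could resolve it.
MODEL_REGISTRY = {
    "LlamaForCausalLMEagle3": LlamaForCausalLMEagle3,
    "Qwen2ForCausalLMEagle3": Qwen2ForCausalLMEagle3,
}

draft_cls = MODEL_REGISTRY["Qwen2ForCausalLMEagle3"]
print(draft_cls({"hidden_size": 1024}).config)
```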
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Done.
```python
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel
from transformers.models.qwen2.configuration_qwen2 import Qwen2Config
from transformers.models.qwen2.modeling_qwen2 import (
    KwargsForCausalLM,
```
KwargsForCausalLM is not supported after transformers 4.53.
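One common way to handle this kind of removal — sketched here as an assumption, not as the fix adopted in the follow-up PR — is a guarded import with a loose fallback alias:

```python
# Assumed compatibility shim: KwargsForCausalLM is imported when available and
# replaced with a loose typing alias on newer transformers releases.
from typing import Any, Dict

try:
    from transformers.models.qwen2.modeling_qwen2 import KwargsForCausalLM
except ImportError:  # e.g. transformers >= 4.53 no longer exports it
    KwargsForCausalLM = Dict[str, Any]
```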
Hello, can you fix the conflict and the transformers 4.53 bug? I have reviewed the code and can merge it.
This PR will be closed, and a new PR is available at #163.
I am trying to add support for Qwen QwQ-32B.
QwQ-32B is already trained for a few epochs.
Since QwQ-32B has the same architecture as Qwen2, I believe this also works for Qwen2.