[Model] Support IQuest-Coder-40B-Loop#16348
Conversation
Summary of Changes

Hello @attack204, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request integrates full support for the IQuest-Coder model into the SGLang framework, focusing on its distinctive multi-loop attention architecture. The changes introduce an attention mechanism that dynamically blends global and local contextual information across multiple processing loops, managed by learned gating. This required fundamental adjustments to how KV caches are handled and how model layers are counted, ensuring efficient and accurate execution of LoopCoder models.
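The gated global/local blend described in the summary can be sketched roughly as follows. All names and sizes here are hypothetical stand-ins for illustration, and the convex-blend form is an assumption; the model's actual projection and gating live in this PR's code.

```python
import torch

# Hypothetical sizes; the real model defines these in its config.
num_tokens, hidden = 4, 32

global_out = torch.randn(num_tokens, hidden)  # attention over the full context
local_out = torch.randn(num_tokens, hidden)   # attention over a local window

# Learned gate squashed into (0, 1) with a sigmoid, one value per feature.
gate = torch.sigmoid(torch.randn(num_tokens, hidden))

# Blend the two attention outputs under control of the gate.
blended = gate * global_out + (1.0 - gate) * local_out
print(blended.shape)  # torch.Size([4, 32])
```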
Code Review
This pull request adds support for the iquest-coder model, which utilizes a LoopCoder architecture. The changes include a new model implementation, modifications to the FlashInfer attention backend to handle cross-layer KV cache sharing, and adjustments in the model runner. The new model implementation is well-structured. I've provided a couple of suggestions to improve code clarity and reduce redundancy in the new model file and the attention backend. Overall, the changes are solid and the provided benchmarks are encouraging.
gate_logits = torch.diagonal(gate_logits, dim1=0, dim2=2)
gate_logits = gate_logits.transpose(0, 1)
gate_logits = gate_logits.unsqueeze(-1)

# Apply sigmoid
gate = torch.sigmoid(gate_logits)

# Expand and reshape to match q shape: [num_tokens, num_heads * head_dim]
gate = gate.transpose(0, 1)
gate = gate.expand(-1, -1, head_dim)
gate = gate.reshape(num_tokens, num_heads * head_dim)
The forward method in LoopGateProjection contains redundant transpose operations. The torch.diagonal call already produces a tensor with the desired [num_tokens, num_heads] shape. The subsequent transpositions can be removed to simplify the code and improve readability.
Suggested change:

Before:
gate_logits = torch.diagonal(gate_logits, dim1=0, dim2=2)
gate_logits = gate_logits.transpose(0, 1)
gate_logits = gate_logits.unsqueeze(-1)
# Apply sigmoid
gate = torch.sigmoid(gate_logits)
# Expand and reshape to match q shape: [num_tokens, num_heads * head_dim]
gate = gate.transpose(0, 1)
gate = gate.expand(-1, -1, head_dim)
gate = gate.reshape(num_tokens, num_heads * head_dim)

After:
gate_logits = torch.diagonal(gate_logits, dim1=0, dim2=2)
gate_logits = gate_logits.unsqueeze(-1)
# Apply sigmoid
gate = torch.sigmoid(gate_logits)
# Expand and reshape to match q shape: [num_tokens, num_heads * head_dim]
gate = gate.expand(-1, -1, head_dim)
gate = gate.reshape(num_tokens, num_heads * head_dim)
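A quick sketch of why the simplified version works, using made-up sizes. The `[num_tokens, num_heads]` starting shape for `gate_logits` is taken from the review comment; everything else here is illustrative.

```python
import torch

num_tokens, num_heads, head_dim = 4, 8, 16

# Assume gate_logits already has the [num_tokens, num_heads] shape the
# review says torch.diagonal yields; no transposes are then needed.
gate_logits = torch.randn(num_tokens, num_heads)

gate = torch.sigmoid(gate_logits).unsqueeze(-1)        # [T, H, 1]
gate = gate.expand(-1, -1, head_dim)                   # [T, H, head_dim]
gate = gate.reshape(num_tokens, num_heads * head_dim)  # matches q's layout

print(gate.shape)  # torch.Size([4, 128])
```

Note that `reshape` copies when needed, so it is safe on the non-contiguous result of `expand` (unlike `view`).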
Hi @attack204, thank you so much for your contribution supporting IQuest-Coder-40B-Loop. We will review the PR and test it further. This will definitely help more users leverage our model with SGLang's high-performance inference. We'll be keeping an eye on this. Let us know if you need any technical details or further support from our side! 🚀

Yes, Zhelong and I are from the IQuest Coder team. It is so great to see this PR. It will definitely help us bring our model to the SGLang community, and we hope that in the near future IQuest models can be supported by SGLang from day one.

@zelong518 @merlintang However, I still have some questions about the test results. May I ask how I can get in touch with you? My WeChat is L15006856731.
Force-pushed from d3a98d2 to ce0e6df

Force-pushed from ce0e6df to 541241d
/tag-run-ci-label

/tag-and-rerun-ci again

/rerun-failed-ci
# When k is None, we read from KV cache instead of computing attention
if k is None:
    # Read from KV cache (similar to decode mode)
    o = prefill_wrapper_paged.forward(
Why not just skip setting the KV buffer when k is None?
I moved the read-only key/value handling into a separate branch of the if statement. When the key/value pair is None, we only need to read from the cache.
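A minimal sketch of that control flow, with hypothetical names (`forward_extend` and `attend_over_cache` are stand-ins, not SGLang's actual API): when `k` is None, the write to the KV buffer is skipped and attention reads only the entries already in the cache.

```python
def forward_extend(q, k, v, kv_cache, attend_over_cache):
    # Normal path: append the new key/value pair before attending.
    if k is not None:
        kv_cache.append((k, v))
    # Read-only path (k is None) falls through and attends over the
    # existing cache entries without writing anything new.
    return attend_over_cache(q, kv_cache)

cache = []
forward_extend("q0", "k0", "v0", cache, lambda q, c: len(c))
n = forward_extend("q1", None, None, cache, lambda q, c: len(c))
print(n, len(cache))  # 1 1  (second call read the cache but did not grow it)
```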
And here are more test results for my commit:
Basic Test
python3 -m sglang.test.send_one --batch-size 1 --prompt "What is SGLang?" --max-new-tokens 64
SGLang is an open-source programming language and framework designed for high-performance, memory-efficient, and flexible AI model serving. It is built from the ground up to address the unique challenges of deploying large language models (LLMs) in production environments.
Key features of SGL
+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
| 2.032 | 64 | 1.000 | 31.50 |
+-------------+--------+------------+-----------------+

GSM8K
python -m sglang.test.few_shot_gsm8k --host http://127.0.0.1 --port 30000 --num-questions 200 --num-shots 5 --data-path test.jsonl
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:34<00:00, 5.86it/s]
Accuracy: 0.845
Invalid: 0.000
Latency: 34.289 s
Output throughput: 744.822 token/s

HellaSwag
python3 bench_sglang.py --num-questions 200 --num-shots 20 --host http://127.0.0.1 --port 30000 --data-path hellswag_val.json
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:07<00:00, 27.16it/s]
Latency: 7.523
Accuracy: 0.770

MMLU
python -m sglang.test.run_eval --eval-name mmlu --port 30000 --num-examples 1000
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [03:17<00:00, 5.06it/s]
Total latency: 197.665 s
Score: 0.734

GPQA
python -m sglang.test.run_eval --eval-name gpqa --port 30000
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [10:05<00:00, 3.06s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [10:34<00:00, 3.20s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [10:54<00:00, 3.31s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [11:11<00:00, 3.39s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [11:17<00:00, 3.42s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [11:20<00:00, 3.44s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [11:21<00:00, 3.44s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [11:24<00:00, 3.46s/it]
Repeat: 8, mean: 0.378
Scores: ['0.394', '0.379', '0.379', '0.374', '0.374', '0.379', '0.389', '0.359']
Also, the lint failed.
/tag-run-ci-label

/rerun-failed-ci
hnyls2002
left a comment
Add comments to the abnormal condition branch.
All CIs passed, excluding one broken by #16273
/rerun-failed-ci
Hey Zelong, thanks so much for your verification commands. Could you share how you are verifying the HumanEval and Multiple benchmarks? I only see your scores 😂 Thanks! cc @zijiexia
Co-authored-by: yxing <yxing@iquestlab.com> Co-authored-by: yzhu <yzhu@ubiquant.com> Co-authored-by: zelong518 <zelonghuang02@gmail.com>


ENV: 4 * H200
Launch
Basic Test
gsm8k
HellaSwag
cd benchmark/hellaswag
python3 bench_sglang.py --num-questions 200 --num-shots 20 --host http://127.0.0.1 --port 30000
Latency: 5.034
Accuracy: 0.775

MMLU
GPQA
Throughput
python3 -m sglang.bench_serving --backend sglang --num-prompt 100

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     100
Benchmark duration (s):                  89.93
Total input tokens:                      39104
Total input text tokens:                 39104
Total input vision tokens:               0
Total generated tokens:                  25195
Total generated tokens (retokenized):    25254
Request throughput (req/s):              1.11
Input token throughput (tok/s):          434.85
Output token throughput (tok/s):         280.18
Peak output token throughput (tok/s):    763.00
Peak concurrent requests:                100
Total token throughput (tok/s):          715.03
Concurrency:                             30.36
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   27302.99
Median E2E Latency (ms):                 23139.83
P90 E2E Latency (ms):                    60398.58
P99 E2E Latency (ms):                    70399.70
---------------Time to First Token----------------
Mean TTFT (ms):                          3111.02
Median TTFT (ms):                        3321.05
P99 TTFT (ms):                           4315.65
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          146.35
Median TPOT (ms):                        116.38
P99 TPOT (ms):                           669.88
---------------Inter-Token Latency----------------
Mean ITL (ms):                           96.43
Median ITL (ms):                         107.51
P95 ITL (ms):                            122.04
P99 ITL (ms):                            124.83