[Model] Support IQuest-Coder-40B-Loop#16348
Conversation
Summary of Changes

Hello @attack204, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request integrates full support for the IQuest-Coder model into the SGLang framework, focusing on its distinctive multi-loop attention architecture. The changes introduce an attention mechanism that dynamically blends global and local contextual information across multiple processing loops, managed by learned gating. This required fundamental adjustments to how KV caches are handled and how model layers are counted, ensuring efficient and accurate execution of LoopCoder models.
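The gated global/local blend described in the summary can be sketched roughly as follows. All names and sizes here are hypothetical stand-ins for illustration, and the convex-blend form is an assumption; the model's actual projection and gating live in this PR's code.

```python
import torch

# Hypothetical sizes; the real model defines these in its config.
num_tokens, hidden = 4, 32

global_out = torch.randn(num_tokens, hidden)  # attention over the full context
local_out = torch.randn(num_tokens, hidden)   # attention over a local window

# Learned gate squashed into (0, 1) with a sigmoid, one value per feature.
gate = torch.sigmoid(torch.randn(num_tokens, hidden))

# Blend the two attention outputs under control of the gate.
blended = gate * global_out + (1.0 - gate) * local_out
print(blended.shape)  # torch.Size([4, 32])
```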
Code Review
This pull request adds support for the iquest-coder model, which utilizes a LoopCoder architecture. The changes include a new model implementation, modifications to the FlashInfer attention backend to handle cross-layer KV cache sharing, and adjustments in the model runner. The new model implementation is well-structured. I've provided a couple of suggestions to improve code clarity and reduce redundancy in the new model file and the attention backend. Overall, the changes are solid and the provided benchmarks are encouraging.
gate_logits = torch.diagonal(gate_logits, dim1=0, dim2=2)
gate_logits = gate_logits.transpose(0, 1)
gate_logits = gate_logits.unsqueeze(-1)

# Apply sigmoid
gate = torch.sigmoid(gate_logits)

# Expand and reshape to match q shape: [num_tokens, num_heads * head_dim]
gate = gate.transpose(0, 1)
gate = gate.expand(-1, -1, head_dim)
gate = gate.reshape(num_tokens, num_heads * head_dim)
The forward method in LoopGateProjection contains redundant transpose operations. The torch.diagonal call already produces a tensor with the desired [num_tokens, num_heads] shape. The subsequent transpositions can be removed to simplify the code and improve readability.
Suggested change:

Before:
gate_logits = torch.diagonal(gate_logits, dim1=0, dim2=2)
gate_logits = gate_logits.transpose(0, 1)
gate_logits = gate_logits.unsqueeze(-1)
# Apply sigmoid
gate = torch.sigmoid(gate_logits)
# Expand and reshape to match q shape: [num_tokens, num_heads * head_dim]
gate = gate.transpose(0, 1)
gate = gate.expand(-1, -1, head_dim)
gate = gate.reshape(num_tokens, num_heads * head_dim)

After:
gate_logits = torch.diagonal(gate_logits, dim1=0, dim2=2)
gate_logits = gate_logits.unsqueeze(-1)
# Apply sigmoid
gate = torch.sigmoid(gate_logits)
# Expand and reshape to match q shape: [num_tokens, num_heads * head_dim]
gate = gate.expand(-1, -1, head_dim)
gate = gate.reshape(num_tokens, num_heads * head_dim)
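A quick sketch of why the simplified version works, using made-up sizes. The `[num_tokens, num_heads]` starting shape for `gate_logits` is taken from the review comment; everything else here is illustrative.

```python
import torch

num_tokens, num_heads, head_dim = 4, 8, 16

# Assume gate_logits already has the [num_tokens, num_heads] shape the
# review says torch.diagonal yields; no transposes are then needed.
gate_logits = torch.randn(num_tokens, num_heads)

gate = torch.sigmoid(gate_logits).unsqueeze(-1)        # [T, H, 1]
gate = gate.expand(-1, -1, head_dim)                   # [T, H, head_dim]
gate = gate.reshape(num_tokens, num_heads * head_dim)  # matches q's layout

print(gate.shape)  # torch.Size([4, 128])
```

Note that `reshape` copies when needed, so it is safe on the non-contiguous result of `expand` (unlike `view`).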
Hi @attack204, thank you so much for your contribution supporting IQuest-Coder-40B-Loop. We will review the PR and test it further. This will definitely help more users leverage our model with SGLang's high-performance inference. We'll be keeping an eye on this. Let us know if you need any technical details or further support from our side! 🚀

Yes, Zhelong and I are from the IQuest Coder team. It is so great to see this PR. It will definitely help us bring our model to the SGLang community, and we hope that in the near future IQuest models can be supported by SGLang from day one.

@zelong518 @merlintang However, I still have some questions about the test results. May I ask how I can get in touch with you? My WeChat is L15006856731.
Force-pushed from d3a98d2 to ce0e6df

Force-pushed from ce0e6df to 541241d
/tag-run-ci-label

/tag-and-rerun-ci again

/rerun-failed-ci
# When k is None, we read from KV cache instead of computing attention
if k is None:
    # Read from KV cache (similar to decode mode)
    o = prefill_wrapper_paged.forward(
Why not just skip setting the KV buffer when k is None?
I moved the read-only key/value handling into a separate branch of the if statement. When the key/value pair is None, we only need to read from the cache.
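A minimal sketch of that control flow, with hypothetical names (`forward_extend` and `attend_over_cache` are stand-ins, not SGLang's actual API): when `k` is None, the write to the KV buffer is skipped and attention reads only the entries already in the cache.

```python
def forward_extend(q, k, v, kv_cache, attend_over_cache):
    # Normal path: append the new key/value pair before attending.
    if k is not None:
        kv_cache.append((k, v))
    # Read-only path (k is None) falls through and attends over the
    # existing cache entries without writing anything new.
    return attend_over_cache(q, kv_cache)

cache = []
forward_extend("q0", "k0", "v0", cache, lambda q, c: len(c))
n = forward_extend("q1", None, None, cache, lambda q, c: len(c))
print(n, len(cache))  # 1 1  (second call read the cache but did not grow it)
```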
And here are more test results for my commit:
Basic Test
python3 -m sglang.test.send_one --batch-size 1 --prompt "What is SGLang?" --max-new-tokens 64
SGLang is an open-source programming language and framework designed for high-performance, memory-efficient, and flexible AI model serving. It is built from the ground up to address the unique challenges of deploying large language models (LLMs) in production environments.
Key features of SGL
+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
| 2.032 | 64 | 1.000 | 31.50 |
+-------------+--------+------------+-----------------+

GSM8K
python -m sglang.test.few_shot_gsm8k --host http://127.0.0.1 --port 30000 --num-questions 200 --num-shots 5 --data-path test.jsonl
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:34<00:00, 5.86it/s]
Accuracy: 0.845
Invalid: 0.000
Latency: 34.289 s
Output throughput: 744.822 token/s

HellaSwag
python3 bench_sglang.py --num-questions 200 --num-shots 20 --host http://127.0.0.1 --port 30000 --data-path hellswag_val.json
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:07<00:00, 27.16it/s]
Latency: 7.523
Accuracy: 0.770

MMLU
python -m sglang.test.run_eval --eval-name mmlu --port 30000 --num-examples 1000
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [03:17<00:00, 5.06it/s]
Total latency: 197.665 s
Score: 0.734

GPQA
python -m sglang.test.run_eval --eval-name gpqa --port 30000
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [10:05<00:00, 3.06s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [10:34<00:00, 3.20s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [10:54<00:00, 3.31s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [11:11<00:00, 3.39s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [11:17<00:00, 3.42s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [11:20<00:00, 3.44s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [11:21<00:00, 3.44s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [11:24<00:00, 3.46s/it]
Repeat: 8, mean: 0.378
Scores: ['0.394', '0.379', '0.379', '0.374', '0.374', '0.379', '0.389', '0.359']
Also, the lint failed.
/tag-run-ci-label

/rerun-failed-ci
hnyls2002
left a comment
Add comments to the abnormal condition branch.
All CIs passed, excluding one broken by #16273
/rerun-failed-ci
Hey Zelong, thanks so much for your verification commands. Could you share how you are verifying the HumanEval and Multiple benchmarks? I only see your scores 😂 Thanks! cc @zijiexia
Co-authored-by: yxing <yxing@iquestlab.com> Co-authored-by: yzhu <yzhu@ubiquant.com> Co-authored-by: zelong518 <zelonghuang02@gmail.com>


ENV: 4 * H200
Launch
Basic Test
gsm8k
HellaSwag
cd benchmark/hellaswag
python3 bench_sglang.py --num-questions 200 --num-shots 20 --host http://127.0.0.1 --port 30000
Latency: 5.034
Accuracy: 0.775

MMLU
GPQA
Throughput
python3 -m sglang.bench_serving --backend sglang --num-prompt 100

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     100
Benchmark duration (s):                  89.93
Total input tokens:                      39104
Total input text tokens:                 39104
Total input vision tokens:               0
Total generated tokens:                  25195
Total generated tokens (retokenized):    25254
Request throughput (req/s):              1.11
Input token throughput (tok/s):          434.85
Output token throughput (tok/s):         280.18
Peak output token throughput (tok/s):    763.00
Peak concurrent requests:                100
Total token throughput (tok/s):          715.03
Concurrency:                             30.36
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   27302.99
Median E2E Latency (ms):                 23139.83
P90 E2E Latency (ms):                    60398.58
P99 E2E Latency (ms):                    70399.70
---------------Time to First Token----------------
Mean TTFT (ms):                          3111.02
Median TTFT (ms):                        3321.05
P99 TTFT (ms):                           4315.65
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          146.35
Median TPOT (ms):                        116.38
P99 TPOT (ms):                           669.88
---------------Inter-Token Latency----------------
Mean ITL (ms):                           96.43
Median ITL (ms):                         107.51
P95 ITL (ms):                            122.04
P99 ITL (ms):                            124.83