
[Model] Support IQuest-Coder-40B-Loop#16348

Merged
hnyls2002 merged 12 commits into sgl-project:main from
attack204:feature/gaoji_support_iquest_coder_40b
Jan 12, 2026

Conversation

@attack204
Contributor

@attack204 attack204 commented Jan 3, 2026

ENV: 4 * H200

Launch

export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m sglang.launch_server \
  --model-path IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct \
  --tp 4 \
  --cuda-graph-bs $(seq -s ' ' 1 16) \
  --trust-remote-code \
  --port 30000 > out.log 2>&1

Basic Test

python3 -m sglang.test.send_one --batch-size 1 --prompt "What is SGLang?" --max-new-tokens 64
SGLang is an open-source programming language and framework designed for high-performance, memory-efficient, and flexible AI model serving. It is built from the ground up to address the unique challenges of deploying large language models (LLMs) in production environments.

Key features include:
+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    2.273    |   64   |   1.000    |      28.16      |
+-------------+--------+------------+-----------------+

gsm8k

python -m sglang.test.few_shot_gsm8k --host http://127.0.0.1 --port 30000 --num-questions 200 --num-shots 5
100%|█████████████████████████| 200/200 [00:42<00:00,  4.70it/s]
Accuracy: 0.835
Invalid: 0.000
Latency: 42.735 s
Output throughput: 596.748 token/s

HellaSwag

cd benchmark/hellaswag
python3 bench_sglang.py --num-questions 200 --num-shots 20 --host http://127.0.0.1 --port 30000

Latency: 5.034
Accuracy: 0.775

MMLU

 python -m sglang.test.run_eval --eval-name mmlu --port 30000 --num-examples 1000

100%|██████████| 1000/1000 [02:47<00:00,  5.98it/s]
Total latency: 167.257 s
Score: 0.729

GPQA

100%|██████████████████████████████████████████| 198/198 [05:39<00:00,  1.71s/it]
100%|██████████████████████████████████████████| 198/198 [05:47<00:00,  1.76s/it]
100%|████████████████████████████████████████| 198/198 [1:00:28<00:00, 18.33s/it]
100%|████████████████████████████████████████| 198/198 [1:00:28<00:00, 18.33s/it]
100%|████████████████████████████████████████| 198/198 [1:00:32<00:00, 18.34s/it]
100%|████████████████████████████████████████| 198/198 [1:00:32<00:00, 18.35s/it]
100%|████████████████████████████████████████| 198/198 [1:00:36<00:00, 18.36s/it]
100%|████████████████████████████████████████| 198/198 [1:00:38<00:00, 18.37s/it]
Repeat: 8, mean: 0.387
Scores: ['0.404', '0.394', '0.379', '0.399', '0.389', '0.374', '0.374', '0.384']

Throughput

python3 -m sglang.bench_serving --backend sglang --num-prompt 100


============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     100
Benchmark duration (s):                  89.93
Total input tokens:                      39104
Total input text tokens:                 39104
Total input vision tokens:               0
Total generated tokens:                  25195
Total generated tokens (retokenized):    25254
Request throughput (req/s):              1.11
Input token throughput (tok/s):          434.85
Output token throughput (tok/s):         280.18
Peak output token throughput (tok/s):    763.00
Peak concurrent requests:                100
Total token throughput (tok/s):          715.03
Concurrency:                             30.36
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   27302.99
Median E2E Latency (ms):                 23139.83
P90 E2E Latency (ms):                    60398.58
P99 E2E Latency (ms):                    70399.70
---------------Time to First Token----------------
Mean TTFT (ms):                          3111.02
Median TTFT (ms):                        3321.05
P99 TTFT (ms):                           4315.65
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          146.35
Median TPOT (ms):                        116.38
P99 TPOT (ms):                           669.88
---------------Inter-Token Latency----------------
Mean ITL (ms):                           96.43
Median ITL (ms):                         107.51
P95 ITL (ms):                            122.04
P99 ITL (ms):                            124.83

@gemini-code-assist
Contributor

Summary of Changes

Hello @attack204, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates full support for the IQuest-Coder model into the SGLang framework, focusing on its distinctive multi-loop attention architecture. The changes involve introducing a sophisticated attention mechanism that dynamically blends global and local contextual information across multiple processing loops, managed by learned gating. This required fundamental adjustments to how KV caches are handled and how model layers are counted, ensuring efficient and accurate execution of LoopCoder models.

Highlights

  • IQuest-Coder Model Support: This pull request introduces comprehensive support for the IQuest-Coder model, specifically the IQuest-Coder-V1-40B-Loop-Instruct variant, by adding its unique architecture and components to the SGLang framework.
  • Multi-Loop Attention Mechanism: A novel attention strategy is implemented for LoopCoder models, where the initial loop performs standard global attention, and subsequent loops utilize a mixed attention approach. This mixed attention intelligently combines global context (read from the first loop's KV cache) with local, sliding-window attention, controlled by a learned gating mechanism.
  • Enhanced KV Cache Management: The FlashInfer backend has been modified to support cross-layer KV cache sharing, a critical feature for LoopCoder models. This allows later attention loops to efficiently read and leverage the KV cache populated by earlier loops without recomputing, optimizing performance.
  • Dynamic Layer Counting for LoopCoder: The model runner's initialization logic is updated to correctly account for the loop_num attribute of LoopCoder models. This ensures that the num_effective_layers is accurately calculated, which is crucial for proper resource allocation and management within the multi-loop architecture.
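The gated global/local blend described in these highlights can be sketched in a few lines. This is a minimal illustration under assumed shapes and an assumed gate direction (the gate weighting the local sliding-window branch), not the PR's actual implementation:

```python
import torch

def blend_loop_attention(global_out, local_out, gate_logits):
    """Blend global attention output (computed against the first loop's
    KV cache) with local sliding-window attention output via a learned
    sigmoid gate. All names here are illustrative, not the PR's code."""
    gate = torch.sigmoid(gate_logits)  # elementwise, in (0, 1)
    # gate -> 1 favors local attention; gate -> 0 favors global context
    return gate * local_out + (1.0 - gate) * global_out
```

With zero gate logits the sigmoid is exactly 0.5, so the two branches are averaged; the model learns to shift this balance per token and head.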



Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the iquest-coder model, which utilizes a LoopCoder architecture. The changes include a new model implementation, modifications to the FlashInfer attention backend to handle cross-layer KV cache sharing, and adjustments in the model runner. The new model implementation is well-structured. I've provided a couple of suggestions to improve code clarity and reduce redundancy in the new model file and the attention backend. Overall, the changes are solid and the provided benchmarks are encouraging.

Comment on lines +160 to +170
gate_logits = torch.diagonal(gate_logits, dim1=0, dim2=2)
gate_logits = gate_logits.transpose(0, 1)
gate_logits = gate_logits.unsqueeze(-1)

# Apply sigmoid
gate = torch.sigmoid(gate_logits)

# Expand and reshape to match q shape: [num_tokens, num_heads * head_dim]
gate = gate.transpose(0, 1)
gate = gate.expand(-1, -1, head_dim)
gate = gate.reshape(num_tokens, num_heads * head_dim)
Contributor


medium

The forward method in LoopGateProjection contains redundant transpose operations. The torch.diagonal call already produces a tensor with the desired [num_tokens, num_heads] shape. The subsequent transpositions can be removed to simplify the code and improve readability.

Suggested change

Before:

gate_logits = torch.diagonal(gate_logits, dim1=0, dim2=2)
gate_logits = gate_logits.transpose(0, 1)
gate_logits = gate_logits.unsqueeze(-1)
# Apply sigmoid
gate = torch.sigmoid(gate_logits)
# Expand and reshape to match q shape: [num_tokens, num_heads * head_dim]
gate = gate.transpose(0, 1)
gate = gate.expand(-1, -1, head_dim)
gate = gate.reshape(num_tokens, num_heads * head_dim)

After:

gate_logits = torch.diagonal(gate_logits, dim1=0, dim2=2)
gate_logits = gate_logits.unsqueeze(-1)
# Apply sigmoid
gate = torch.sigmoid(gate_logits)
# Expand and reshape to match q shape: [num_tokens, num_heads * head_dim]
gate = gate.expand(-1, -1, head_dim)
gate = gate.reshape(num_tokens, num_heads * head_dim)
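A quick sanity check on why dropping the pair of transposes is safe: sigmoid is elementwise and unsqueeze(-1) does not touch dims 0 and 1, so the transpose(0, 1) before and the transpose(0, 1) after cancel out. A minimal demonstration (the shapes here are arbitrary, not the model's):

```python
import torch

x = torch.randn(4, 8)
# Original path: transpose -> unsqueeze -> sigmoid -> transpose back.
a = torch.sigmoid(x.transpose(0, 1).unsqueeze(-1)).transpose(0, 1)
# Simplified path: no transposes at all.
b = torch.sigmoid(x.unsqueeze(-1))
# Both produce identical values with shape [4, 8, 1].
assert torch.equal(a, b)
```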

@attack204 attack204 changed the title from [Model]Support iquest-coder to [Model] Support IQuest-Coder-40B-Loop on Jan 3, 2026
@zelong518
Contributor

zelong518 commented Jan 4, 2026


Hi @attack204 ,
I'm from the IQuest-Coder team. We were actually working on the SGLang integration ourselves, but we're thrilled to see the community moving so fast!

Thank you so much for your contribution to supporting IQuest-Coder-40B-Loop. We will review the PR and test it further. This will definitely help more users leverage our model with SGLang's high-performance inference.

We'll be keeping an eye on this. Let us know if you need any technical details or further support from our side! 🚀

@merlintang

@attack204

Yes, Zelong and I are from the IQuest Coder team. It is great to see this PR; it will definitely help us bring our model to the SGLang community. We hope that in the near future, IQuest models can be supported in SGLang from day one.

@attack204
Contributor Author

@zelong518 @merlintang
Hey, thank you very much for following the progress of IQuest-Coder in SGLang. Since the code for IQuest is quite comprehensive, the development process went very smoothly, and it only took about three hours to essentially complete the integration with SGLang.

However, I still have some questions about the test results. May I ask how I can get in touch with you? My WeChat is L15006856731.

@attack204 attack204 force-pushed the feature/gaoji_support_iquest_coder_40b branch from d3a98d2 to ce0e6df on January 4, 2026 14:23
@yxing-bj yxing-bj mentioned this pull request Jan 6, 2026
@attack204 attack204 force-pushed the feature/gaoji_support_iquest_coder_40b branch from ce0e6df to 541241d on January 7, 2026 14:08
@attack204
Contributor Author

/tag-run-ci-label

@github-actions github-actions bot added the run-ci label Jan 7, 2026
@attack204
Contributor Author

attack204 commented Jan 7, 2026

/tag-and-rerun-ci again

@attack204
Contributor Author

/rerun-failed-ci

# When k is None, we read from KV cache instead of computing attention
if k is None:
# Read from KV cache (similar to decode mode)
o = prefill_wrapper_paged.forward(
Collaborator


Why not just skip setting the KV buffer when k is None?

Contributor


I moved the read-only path to a separate branch of the if statement. When k and v are None, we only need to read from the KV cache, not write to it.
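The branch structure under discussion might be sketched like this. The names (`ToyKVCache`, `attention_step`) are invented for illustration and are not the FlashInfer backend's actual API:

```python
class ToyKVCache:
    """Stand-in for a paged KV cache (illustrative only)."""
    def __init__(self):
        self.k = self.v = None

    def write(self, k, v):
        self.k, self.v = k, v

def attention_step(q, k, v, cache):
    # When k is None, skip the cache write entirely: later loops only
    # read the KV entries populated by an earlier loop.
    if k is None:
        return ("read", cache.k, cache.v)
    # Normal path: write this layer's K/V into the cache, then attend.
    cache.write(k, v)
    return ("write", cache.k, cache.v)
```

A first-loop call writes K/V; a later-loop call with k=None reuses those entries without recomputing them, which is the cross-layer sharing the PR adds.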

Contributor


And here are more tests for my commit:

Basic Test

python3 -m sglang.test.send_one --batch-size 1 --prompt "What is SGLang?" --max-new-tokens 64


SGLang is an open-source programming language and framework designed for high-performance, memory-efficient, and flexible AI model serving. It is built from the ground up to address the unique challenges of deploying large language models (LLMs) in production environments.

Key features of SGL

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    2.032    |   64   |   1.000    |      31.50      |
+-------------+--------+------------+-----------------+

GSM8K

python -m sglang.test.few_shot_gsm8k --host http://127.0.0.1 --port 30000 --num-questions 200 --num-shots 5 --data-path test.jsonl
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:34<00:00,  5.86it/s]
Accuracy: 0.845
Invalid: 0.000
Latency: 34.289 s
Output throughput: 744.822 token/s

HellaSwag

python3 bench_sglang.py --num-questions 200 --num-shots 20 --host http://127.0.0.1 --port 30000 --data-path hellswag_val.json
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:07<00:00, 27.16it/s]
Latency: 7.523
Accuracy: 0.770

MMLU

python -m sglang.test.run_eval --eval-name mmlu --port 30000 --num-examples 1000
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [03:17<00:00,  5.06it/s]
Total latency: 197.665 s
Score: 0.734

GPQA

python -m sglang.test.run_eval --eval-name gpqa --port 30000
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [10:05<00:00,  3.06s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [10:34<00:00,  3.20s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [10:54<00:00,  3.31s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [11:11<00:00,  3.39s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [11:17<00:00,  3.42s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [11:20<00:00,  3.44s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [11:21<00:00,  3.44s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [11:24<00:00,  3.46s/it]
Repeat: 8, mean: 0.378
Scores: ['0.394', '0.379', '0.379', '0.374', '0.374', '0.379', '0.389', '0.359']

@hnyls2002 hnyls2002 removed the run-ci label Jan 9, 2026
@github-actions github-actions bot added the run-ci label Jan 9, 2026
@hnyls2002
Collaborator

Also, the lint failed.

@zelong518
Contributor

/tag-run-ci-label

@attack204
Contributor Author

/rerun-failed-ci

Collaborator

@hnyls2002 hnyls2002 left a comment


Add comments to the abnormal condition branch.


@zelong518
Contributor

zelong518 commented Jan 12, 2026

Here is the test again, as a double check.
ENV: 2 * H200

Launch

python -m sglang.launch_server \
  --model-path IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct \
  --tp 2 \
  --cuda-graph-bs $(seq -s ' ' 1 16) \
  --trust-remote-code \
  --port 30000

Basic Test

python3 -m sglang.test.send_one --batch-size 1 --prompt "What is SGLang?" --max-new-tokens 64
SGLang is an open-source programming language and framework designed for high-performance, memory-efficient, and flexible AI model serving. It is built from the ground up to address the unique challenges of deploying large language models (LLMs) in production environments.

Key features of SGL

acc_length=1.00
speed=32.85 token/s

GSM8K

python -m sglang.test.few_shot_gsm8k --host http://127.0.0.1 --port 30000 --num-questions 200 --num-shots 5
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:34<00:00,  5.81it/s]
Accuracy: 0.825
Invalid: 0.000
Latency: 34.590 s
Output throughput: 744.139 token/s

MMLU

python -m sglang.test.run_eval --eval-name mmlu --port 30000 --num-examples 1000
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [02:50<00:00,  5.87it/s]
Total latency: 170.435 s
Score: 0.736

Hellaswag

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:04<00:00, 49.88it/s]
Latency: 4.174
Accuracy: 0.770

GPQA

python -m sglang.test.run_eval --eval-name gpqa --port 30000 --repeat 8
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [06:52<00:00,  2.08s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [07:44<00:00,  2.35s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [07:45<00:00,  2.35s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [07:49<00:00,  2.37s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [07:53<00:00,  2.39s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [07:53<00:00,  2.39s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [07:55<00:00,  2.40s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [08:03<00:00,  2.44s/it]
Repeat: 8, mean: 0.358
Scores: ['0.364', '0.394', '0.308', '0.404', '0.354', '0.343', '0.333', '0.364']
====================
[METRIC] gpqa_mean_score=0.3579545454545455 labels={"model": "IQuest-Coder-V1-40B-Loop-Instruct", "eval": "gpqa", "repeat": 8}
Writing report to /tmp/gpqa__IQuest-Coder-V1-40B-Loop-Instruct.html
{'chars': np.float64(1915.0353535353536), 'chars:std': np.float64(900.5486171670248), 'score:std': np.float64(0.48104569292083466), 'scores': ['0.364', '0.394', '0.308', '0.404', '0.354', '0.343', '0.333', '0.364'], 'mean_score': np.float64(0.3579545454545455)}

Throughput

python3 -m sglang.bench_serving --backend sglang --num-prompt 100 --dataset-name sharegpt
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 not set   
Successful requests:                     100       
Benchmark duration (s):                  74.96     
Total input tokens:                      39104     
Total input text tokens:                 39104     
Total generated tokens:                  25195     
Total generated tokens (retokenized):    25261     
Request throughput (req/s):              1.33      
Input token throughput (tok/s):          521.68    
Output token throughput (tok/s):         336.12    
Peak output token throughput (tok/s):    1473.00   
Peak concurrent requests:                100       
Total token throughput (tok/s):          857.80    
Concurrency:                             25.56     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   19160.51  
Median E2E Latency (ms):                 15912.64  
P90 E2E Latency (ms):                    38875.57  
P99 E2E Latency (ms):                    50444.83  
---------------Time to First Token----------------
Mean TTFT (ms):                          4189.49   
Median TTFT (ms):                        4603.00   
P99 TTFT (ms):                           5886.91   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          120.00    
Median TPOT (ms):                        67.48     
P99 TPOT (ms):                           863.13    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           59.67     
Median ITL (ms):                         58.89     
P95 ITL (ms):                            60.14     
P99 ITL (ms):                            60.69     
Max ITL (ms):                            4893.52   
==================================================

Humaneval

(screenshot of HumanEval results)

Multiple

(screenshot of results)

@attack204
Contributor Author

/rerun-failed-ci

@hnyls2002 hnyls2002 merged commit 7b682de into sgl-project:main Jan 12, 2026
106 of 134 checks passed

@zhaochenyang20
Collaborator

Hey Zelong, thanks so much for your verification commands. Could you share how you ran these two benchmarks?

Humaneval

Multiple

I only see your score 😂 thanks!

cc @zijiexia

whybeyoung pushed a commit to whybeyoung/sglang that referenced this pull request Jan 14, 2026
Co-authored-by: yxing <yxing@iquestlab.com>
Co-authored-by: yzhu <yzhu@ubiquant.com>
Co-authored-by: zelong518 <zelonghuang02@gmail.com>

7 participants