[CI]Add Kimi k2 nightly test#5682
Conversation
Code Review
This pull request adds a new nightly test for the Kimi-K2 model and a multi-node test configuration for Qwen3-235B. The changes include a new test configuration file, a new Python test file, and modifications to a shell script. My review has identified a critical issue in the run.sh script where a temporary CI workaround, which involves checking out a hardcoded pull request, has been included. This must be removed before merging to ensure the integrity of the nightly tests.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
### What this PR does / why we need it?
This PR adds performance and accuracy tests for **Kimi-K2-Instruct-W8A8**
and **Kimi-K2-Thinking** to the nightly test suite.
#### Test Configuration
**Kimi-K2-Instruct-W8A8**
- Model: vllm-ascend/Kimi-K2-Instruct-W8A8
- Hardware: A3, 2 Nodes (32 NPUs total, 16 NPUs per node)
- Architecture: Unified Distributed Inference
- Parallelism: **DP4 + TP8 + EP** (Data Parallel 4, Tensor Parallel 8,
Expert Parallel enabled).
- Optimization: **torchair graph**, **no-prefix-caching**.
- Node 0: DP Rank 0-1, Local DP 2, Tensor Parallel 8.
- Node 1: DP Rank 2-3, Local DP 2, Tensor Parallel 8.
- Benchmarks:
- Performance: vllm-ascend/GSM8K-in3500-bs2800.
- Accuracy: vllm-ascend/gsm8k-lite.
**Kimi-K2-Thinking**
- Model: moonshotai/Kimi-K2-Thinking
- Hardware: A3, 1 Node (16 NPUs total)
- Architecture: Single Node Distributed Inference
- Parallelism: TP16 + EP (Tensor Parallel 16, Expert Parallel enabled).
- Optimization: **no-prefix-caching**
- Benchmarks:
- Performance: vllm-ascend/GSM8K-in3500-bs400.
- Accuracy: vllm-ascend/gsm8k-lite.
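For reference, the single-node Kimi-K2-Thinking configuration above corresponds roughly to a `vllm serve` invocation like the following. This is a sketch only; the exact flags and environment used by the nightly job are defined in the test configuration file, not here:

```shell
# Sketch: single A3 node, 16 NPUs, TP16 + EP, prefix caching disabled.
vllm serve moonshotai/Kimi-K2-Thinking \
    --tensor-parallel-size 16 \
    --enable-expert-parallel \
    --no-enable-prefix-caching \
    --trust-remote-code
```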
### Does this PR introduce _any_ user-facing change?
**Yes.** This PR enhances the `AisbenchRunner` to support dynamic
configuration of the `trust_remote_code` flag. This allows the
AISBench client to load tokenizers for models that require
custom code execution (e.g., **Kimi-K2-Thinking** and
**Kimi-K2-Instruct-W8A8**).
**Changes:**
1. `AisbenchRunner.__init__`: added the ability to capture the
`trust_remote_code` parameter from the case configuration.
```diff
self.batch_size = aisbench_config["batch_size"]
self.request_rate = aisbench_config.get("request_rate", 0)
+ self.trust_remote_code = aisbench_config.get("trust_remote_code", False)
self.temperature = aisbench_config.get("temperature")
self.top_k = aisbench_config.get("top_k")
```
2. `AisbenchRunner._init_request_conf`: added a regex substitution to
inject the parameter into the generated dynamic configuration file.
```diff
content = re.sub(r'batch_size.*', f'batch_size = {self.batch_size},',
content)
+ content = re.sub(r'trust_remote_code=.*',
+ f'trust_remote_code={self.trust_remote_code},',
+ content)
content = content.replace("top_k", "#top_k")
content = content.replace("seed", "#seed")
```
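To illustrate the substitution, here is a minimal standalone sketch. The config excerpt below is hypothetical and only mirrors the field names from the diff above:

```python
import re

# Hypothetical excerpt of a generated AISBench request config
# (illustrative only; real templates live in the AISBench repo).
content = """
    batch_size = 64,
    trust_remote_code=False,
    top_k = 1,
"""

trust_remote_code = True  # value taken from the aisbench case dict
# Rewrite the trust_remote_code line in place, as _init_request_conf does.
content = re.sub(r'trust_remote_code=.*',
                 f'trust_remote_code={trust_remote_code},',
                 content)
print(content)
```

Note that `.*` does not cross line boundaries, so only the remainder of the matching line is rewritten.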
**Details:**
- New config key: users can add `"trust_remote_code": True` to any
dictionary within the `aisbench_cases` list.
- Default value: defaults to `False` to maintain existing security
behavior for standard models.
- Impact: resolves the `ValueError` raised when benchmarking reasoning
models or models with custom tokenizers, which previously failed during
the AISBench local initialization phase.
**User Example:**
Users can now enable custom code execution for specific models (like
Kimi-K2-Thinking) directly in their test suite:
```python
# Now supported in test scripts:
aisbench_cases = [{
"case_type": "performance",
"request_conf": "vllm_api_stream_chat",
"trust_remote_code": True, # New user-facing parameter
...
}]
```
### How was this patch tested?
Actions:
- https://github.com/vllm-project/vllm-ascend/actions/runs/20849768433
Results as follows:
- **Kimi-K2-Instruct-W8A8** (25m25s)
1. Accuracy test
```
dataset version metric mode vllm-api-general-chat
--------- --------- -------- ------ -----------------------
gsm8k 7cd45e accuracy gen 96.88
```
2. Perf test
```
╒══════════════════════════╤═════════╤════════════════╤════════════════╤═══════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕
│ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪═══════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡
│ E2EL │ total │ 34571.489 ms │ 28657.8054 ms │ 36294.1788 ms │ 34714.7329 ms │ 35247.2724 ms │ 35526.6758 ms │ 36146.4314 ms │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TTFT │ total │ 2043.9136 ms │ 627.4718 ms │ 3532.3978 ms │ 1906.0194 ms │ 2307.7979 ms │ 2883.8528 ms │ 3283.7012 ms │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TPOT │ total │ 127.5591 ms │ 106.4937 ms │ 137.107 ms │ 128.3135 ms │ 129.5704 ms │ 131.1332 ms │ 134.1087 ms │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ ITL │ total │ 126.5571 ms │ 0.0095 ms │ 1340.783 ms │ 104.1398 ms │ 110.1272 ms │ 119.6124 ms │ 950.2924 ms │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ InputTokens │ total │ 3516.6055 │ 3014.0 │ 3985.0 │ 3525.0 │ 3525.0 │ 3586.8 │ 3800.67 │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokens │ total │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokenThroughput │ total │ 7.4143 token/s │ 7.0535 token/s │ 8.933 token/s │ 7.3744 token/s │ 7.4118 token/s │ 7.5608 token/s │ 8.7051 token/s │ 512 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧═══════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛
╒══════════════════════════╤═════════╤═══════════════════╕
│ Common Metric │ Stage │ Value │
╞══════════════════════════╪═════════╪═══════════════════╡
│ Benchmark Duration │ total │ 279430.9375 ms │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Requests │ total │ 512 │
├──────────────────────────┼─────────┼───────────────────┤
│ Failed Requests │ total │ 0 │
├──────────────────────────┼─────────┼───────────────────┤
│ Success Requests │ total │ 512 │
├──────────────────────────┼─────────┼───────────────────┤
│ Concurrency │ total │ 63.3452 │
├──────────────────────────┼─────────┼───────────────────┤
│ Max Concurrency │ total │ 64 │
├──────────────────────────┼─────────┼───────────────────┤
│ Request Throughput │ total │ 1.8323 req/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Input Tokens │ total │ 1800502 │
├──────────────────────────┼─────────┼───────────────────┤
│ Prefill Token Throughput │ total │ 1720.5255 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total generated tokens │ total │ 131072 │
├──────────────────────────┼─────────┼───────────────────┤
│ Input Token Throughput │ total │ 6443.4598 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Output Token Throughput │ total │ 469.0676 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Token Throughput │ total │ 6912.5274 token/s │
╘══════════════════════════╧═════════╧═══════════════════╛
```
- **Kimi-K2-Thinking** (43m51s)
1. Accuracy test
```
dataset version metric mode vllm-api-general-chat
--------- --------- -------- ------ -----------------------
gsm8k 7cd45e accuracy gen 100.00
```
2. Perf test
```
╒══════════════════════════╤═════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕
│ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡
│ E2EL │ total │ 172384.3573 ms │ 34456.5517 ms │ 205922.9407 ms │ 174844.2216 ms │ 202656.092 ms │ 204428.9502 ms │ 205468.6776 ms │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TTFT │ total │ 138740.3228 ms │ 655.1066 ms │ 171777.3003 ms │ 141088.0561 ms │ 169237.5599 ms │ 170716.4954 ms │ 171393.1278 ms │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TPOT │ total │ 131.9374 ms │ 90.6331 ms │ 135.4144 ms │ 132.405 ms │ 132.948 ms │ 133.7549 ms │ 135.2543 ms │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ ITL │ total │ 130.9028 ms │ 0.0099 ms │ 960.3683 ms │ 116.9623 ms │ 122.3127 ms │ 132.0522 ms │ 886.4662 ms │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ InputTokens │ total │ 3514.575 │ 3014.0 │ 3843.0 │ 3525.0 │ 3525.0 │ 3588.0 │ 3801.08 │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokens │ total │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokenThroughput │ total │ 1.6799 token/s │ 1.2432 token/s │ 7.4296 token/s │ 1.4642 token/s │ 1.4737 token/s │ 1.8754 token/s │ 7.125 token/s │ 400 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛
╒══════════════════════════╤═════════╤═══════════════════╕
│ Common Metric │ Stage │ Value │
╞══════════════════════════╪═════════╪═══════════════════╡
│ Benchmark Duration │ total │ 1166795.568 ms │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Requests │ total │ 400 │
├──────────────────────────┼─────────┼───────────────────┤
│ Failed Requests │ total │ 0 │
├──────────────────────────┼─────────┼───────────────────┤
│ Success Requests │ total │ 400 │
├──────────────────────────┼─────────┼───────────────────┤
│ Concurrency │ total │ 59.0967 │
├──────────────────────────┼─────────┼───────────────────┤
│ Max Concurrency │ total │ 64 │
├──────────────────────────┼─────────┼───────────────────┤
│ Request Throughput │ total │ 0.3428 req/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Input Tokens │ total │ 1405830 │
├──────────────────────────┼─────────┼───────────────────┤
│ Prefill Token Throughput │ total │ 25.332 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total generated tokens │ total │ 102400 │
├──────────────────────────┼─────────┼───────────────────┤
│ Input Token Throughput │ total │ 1204.864 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Output Token Throughput │ total │ 87.7617 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Token Throughput │ total │ 1292.6258 token/s │
╘══════════════════════════╧═════════╧═══════════════════╛
```
- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
[Bugfix]Add register_kv_cache in ucm_connector (#5657)
To adapt to different KV-cache shapes, UCM moved store initialization
into `register_kv_caches`. This update therefore adds the
`register_kv_caches` interface to `UCMConnectorV1`.
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>
[misc]Add Kimi-K2 series to CI model list (#5656)
Add the model to CI for subsequent addition of nightly test cases:
- moonshotai/Kimi-K2-Thinking
- vllm-ascend/Kimi-K2-Instruct-W8A8
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
[CI] cleanup single/multi-card test (#5623)
1. Speed up the e2e light test.
2. Create `2-cards` and `4-cards` folders in multicard.
3. Move ops tests to nightly.
4. Run tests in alphabetical order.
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/8be6432bdaf6275664d857b1e5e9bf8ed1ce299e
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
[CI] Bump lm-eval version to v0.4.9.2 (#5655)
fix https://github.com/vllm-project/vllm-ascend/issues/2865, lm-eval [got an official update last
month](https://github.com/EleutherAI/lm-evaluation-harness/releases/tag/v0.4.9.2),
so let's bump the version.
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
Signed-off-by: wangli <wangli858794774@gmail.com>
[CI] Add workflow to cancel running workflows on PR close (#5646)
Example:
https://github.com/vllm-project/vllm-ascend/actions/runs/20735955959/job/59533181655
is still running after
https://github.com/vllm-project/vllm-ascend/pull/5612 is closed.
The action would keep running for more than 2 hours, so it needs to be
cleaned up. GitHub Actions does not cancel it automatically, so this
workflow cancels the PR-related actions once the PR is closed.
Tested in https://github.com/pacoxu/pacoxu/actions/runs/20743173119.
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
Signed-off-by: Paco Xu <roollingstone@gmail.com>
[Bugfix] fix resource are insufficient when pcp and piecewise (#5377)
Resolving the issue of insufficient resources during service operation
when PCP is enabled in a piecewise scenario.
When enabling PCP and executing in piecewise mode, the curl request
fails due to insufficient resources, resulting in the error message "The
resources are insufficient." Through profiling analysis, it was found
that the PCP communication domain also occupies streams and consumes
resources. Therefore, when updating aclgraph sizes, the PCP
communication domain needs to be taken into account.
- vLLM version: release/v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/bc0a5a0c089844b17cb93f3294348f411e523586
---------
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
[Bugfix] Fix the graph capture failure issue in the eagle3+full scenario. (#5553)
When launching the service with cudagraph_mode set to FULL and Eagle3
acceleration enabled for inference, an error in fia causes graph capture
to fail. This PR fixes the issue.
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/7157596103666ee7ccb7008acee8bff8a8ff1731
Signed-off-by: WithHades <244036962@qq.com>
[CI] move image and wheel job to schedule way (#5685)
Move the image and wheel build jobs to a scheduled trigger to save CI resources.
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
[Refactor] Fix AttentionMaskBuilder singleton and remove redundant pcp_prefill_mask (#4870)
This PR fixes the `AttentionMaskBuilder` singleton initialization issue
introduced in PR #4779 and removes the unused `pcp_prefill_mask` field.
After PR #4779 made `AttentionMaskBuilder` a singleton with `@singleton`
decorator, the class constructor now requires a `device` parameter.
However, two initialization sites were still using the old parameterless
constructor, causing failures.
1. **Fix singleton initialization**
- Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)`
in `AscendMLAMetadataBuilder.__init__()`
- Fixed `AttentionMaskBuilder()` → `AttentionMaskBuilder(self.device)`
in `AscendAttentionMetadataBuilder.__init__()`
2. **Remove unused field**
- Removed `pcp_prefill_mask` field from
`AscendPrefillContextParallelMetadata` (never used in codebase)
- Updated related test assertions
- Issue #5463
- PR #4779 (Unify all mask generation methods)
- PR #5389 (Make AttentionMaskBuilder singleton)
No. This is an internal refactoring.
- ✅ Local testing: No linter errors
- ✅ Unit tests for attention modules verified
- ⏳ CI pipeline
Signed-off-by: lico67373 <918688502@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
[Refactor] Import global var from vllm instead of overwriting it (#5469)
Import the global var from vllm instead of overwriting it, so that we
use the correct global variable value.
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/5326c89803566a131c928f7fdd2100b75c981a42
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
[Tests] Add qwen3-8b nightly test (#5597)
Add qwen3-8b nightly test
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/7157596103666ee7ccb7008acee8bff8a8ff1731
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
[BugFix][Fusion] Fix graph fusion failure problem (#5676)
Currently, the vllm pull request
(https://github.com/vllm-project/vllm/pull/24252) is causing operator
fusion to fail. This issue was previously fixed by patching the backend.
The root cause has been identified, and the problem can be resolved with
this pull request.
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
---------
Signed-off-by: wxsIcey <1790571317@qq.com>
[1/N][CI] Refactor accuracy test (#5400)
1. Accuracy testing no longer compares eager and graph modes; instead,
it directly extracts the golden result under the graph mode
configuration (the implicit purpose of this case is to verify whether
modifications affect existing results)
2. Next step: finer-grained supervision of logits/sampler results
- vLLM version: release/v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/254f6b986720c92ddf97fbb1a6a6465da8e87e29
Signed-off-by: wangli <wangli858794774@gmail.com>
[Kernel] Add moe_gating_top_k operator support for Ascend NPU (#5579)
1. Replace moe_gating_top_k from torch_npu with a custom op
2. Enable the renorm function of moe_gating_top_k in the softmax scenario
No
No test needed
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/7157596103666ee7ccb7008acee8bff8a8ff1731
---------
Signed-off-by: ZCG12345 <2097562023@qq.com>
[BugFix][P/D] Fix pre-create link parameter error (#5694)
Fix a pre-create link parameter error: `batch_transfer_sync_write`
requires a list.
No.
By CI.
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
[refactor] Refactor the interface for shard weight and remove the flashcomm2 o_shared interface. (#5181)
- Delete the environment variable
`VLLM_ASCEND_ENABLE_FLASHCOMM2_OSHARED`
- Introduce layer_sharding as a configurable feature in
additional_config
- Revise the term "shared weight" to "shard weight."
Configuration : The feature is opt-in via the additional_config
argument:
```
--additional-config '{
"layer_sharding": ["o_proj", "q_b_proj"]
}'
```
This is orthogonal to standard tensor parallelism and weight replication
strategies and is treated as a separate, explicit feature. It can be
used in any scenario, combined with the FlashComm2 feature
(https://github.com/vllm-project/vllm-ascend/pull/3232)
or the ShardedCP feature (#4702), to achieve significant performance gains.
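A hypothetical sketch of how an engine might consume the option above; the `layer_sharding` key matches the documented `additional_config`, but the helper function is invented for illustration.

```python
import json

# Parse the additional_config JSON shown above.
additional_config = json.loads('{"layer_sharding": ["o_proj", "q_b_proj"]}')

def is_sharded(param_name, sharded_layers):
    # Shard a weight layer-by-layer only if its name matches a listed layer.
    return any(layer in param_name for layer in sharded_layers)

assert is_sharded("model.layers.0.self_attn.o_proj.weight",
                  additional_config["layer_sharding"])
assert not is_sharded("model.layers.0.mlp.gate_proj.weight",
                      additional_config["layer_sharding"])
```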
- vLLM version: v0.12.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9
---------
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: clrs97 <524936896@qq.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
[bugfix] adapt to new implemented get_kv_cache_spec in cpuoffload connector (#4311)
func `get_kv_cache_spec` in model_runner changed a lot and caused error
in cpuoffloading connector which is copied from model_runner, this PR
adapts to new implemented `get_kv_cache_spec` to fix it.
- vLLM version: v0.11.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2918c1b49c88c29783c86f78d2c4221cb9622379
Signed-off-by: lidenghui <lidenghui1110@gmail.com>
[Feature] add the magicmtp speculative decoding acceleration algorithm (#5542)
1. MagicMTP (paper: "Block Verification Accelerates Speculative
Decoding") was introduced to consider the influence among multiple draft
tokens, improving the acceptance rate without compromising accuracy.
2. Added Triton and PyTorch implementations, and added E2E test cases.
MagicMTP will automatically take effect when the parameter
"num_speculative_tokens" >= 3.
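The activation rule stated above can be gated as follows; the function name is invented for illustration and is not the actual vllm-ascend API.

```python
# Illustrative gate: per this PR, MagicMTP block verification kicks in
# automatically only when at least 3 draft tokens are proposed per step.
def pick_verifier(num_speculative_tokens: int) -> str:
    return "magic_mtp" if num_speculative_tokens >= 3 else "default"

assert pick_verifier(2) == "default"
assert pick_verifier(3) == "magic_mtp"
```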
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/7157596103666ee7ccb7008acee8bff8a8ff1731
Signed-off-by: chenaoxuan <cax1165@163.com>
Optimize the print info format when deprecated code is used in vllm-ascend (#5696)
Optimize the warning message format printed when deprecated code is
detected in vllm-ascend.
NA
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
[CI] fix image build tag (#5703)
`ref` doesn't work with workflow_dispatch, so switch it to the raw way.
This PR also merges the pr_create job into one runner to save resources.
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
[EPLB][CI] EPLB add aclgraph and redundant expert ci (#5625)
EPLB currently does not have CI related to aclgraph and redundancy
experts; this PR adds them.
Relies on #5529.
Tested the cases added in this PR.
```text
PASSED
====================================================== warnings summary ==========================================================
<frozen importlib._bootstrap>:241
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute
<frozen importlib._bootstrap>:241
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================== 1 passed, 2 warnings in 272.24s (0:04:32) =====================================================
```
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/8be6432bdaf6275664d857b1e5e9bf8ed1ce299e
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Create 111
Signed-off-by: guanguan0308 <162653673+guanguan0308@users.noreply.github.com>
cleancode
Signed-off-by: guanguan0308 <dulinyi.me@qq.com>
fix
Signed-off-by: guanguan0308 <dulinyi.me@qq.com>
[CI] Drop outdated cases (#5709)
Correcting some outdated use cases:
`tests/e2e/singlecard/test_aclgraph_accuracy.py::test_models_output` ->
`tests/e2e/singlecard/test_aclgraph_accuracy.py::test_piecewise_res_consistency`
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
Signed-off-by: wangli <wangli858794774@gmail.com>
[CI] Fix image build workflow_dispatch error (#5717)
type `raw` must contain `value` section. This PR fix the image build
error
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
[Feat][Bugfix][main] Adapted SP to eagle3 (#5562)
Adapted sp to eagle3.
There may still be some problems, e.g., accuracy in some scenes,
`sp`+`dp`...
We will fix them later.
N/A
We tested it mainly in a new `e2e`.
```shell
pytest -s tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py::test_llama_qwen_eagle_acceptance
```
```text
.
=============================== warnings summary ===============================
<frozen importlib._bootstrap>:241
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute
<frozen importlib._bootstrap>:241
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============= 3 passed, 1 skipped, 2 warnings in 142.05s (0:02:22) =============
```
It passed.
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/7157596103666ee7ccb7008acee8bff8a8ff1731
Signed-off-by: drslark <slarksblood@qq.com>
[bugfix] Support dsv3.2 enable both mtp and full_decode_only (#5679)
When both MTP and full_decode_only are enabled for the DSV3.2 model,
the operators cannot be compiled into the graph. This PR fixes that issue.
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
[Doc] Add Qwen3-Omni-30B-A3B-Thinking Tutorials (#3991)
Add Qwen3-Omni-30B-A3B-Thinking Tutorials
No
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/5326c89803566a131c928f7fdd2100b75c981a42
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
[Fix] Fixes speculative decode indexing and unpad condition for attention metadata (#5626)
This addresses the issue brought up by #5356 and #4963, and we believe
the unnecessary conditions are the root cause.
Change the unpad trigger to be driven by actual size mismatches
(num_reqs vs base_num_reqs or scheduled vs input token counts) rather
than specific speculative-method flags. Then remove brittle workarounds
that forced request counts and sliced query start locations.
This prevents incorrect indexing and length mismatches during
speculative decoding and makes metadata unpadding more robust across
scheduling modes.
None.
Tested by existing cases.
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/8be6432bdaf6275664d857b1e5e9bf8ed1ce299e
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
[CI] Add triton ascend in nightly CI (#5716)
Add triton ascend in nightly
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
---------
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
[feature]dcp&pcp support mlapo (#5672)
MLAPO brings a huge decode performance improvement for DeepSeek; this
PR supports PCP & DCP with MLAPO.
NO
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
---------
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
[CI] Remove workflow_dispatch way for image build (#5742)
There is some problem for workflow_dispatch way for image build. Let's
remove it first to make CI happy. I'll add it back once it's well
tested.
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
[Nightly] Move ops to the correct path (#5642)
Move ops to the correct path where they belong
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
Signed-off-by: wangli <wangli858794774@gmail.com>
[OP] Enable custom op aclnnMoeInitRoutingCustom (#5332)
This PR enables custom op `aclnnMoeInitRoutingCustom` introduced in PR
No
- vLLM version: release/v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/bc0a5a0c089844b17cb93f3294348f411e523586
---------
Signed-off-by: QianChenxi <chenxi.qian.cq@outlook.com>
Signed-off-by: zzzzwwjj <1183291235@qq.com>
Co-authored-by: zzzzwwjj <1183291235@qq.com>
[CI] Add qwen3 next ci (#5395)
Add Qwen3Next CI
NO
- vLLM version: release/v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/254f6b986720c92ddf97fbb1a6a6465da8e87e29
---------
Signed-off-by: SunnyLee219 <3294305115@qq.com>
[Doc] add PaddleOCR-VL tutorials guide (#5556)
1. add PaddleOCR-VL.md in the `docs/source/tutorials/`
2. add PaddleOCR-VL index in `docs/source/tutorials/index.md`
No
by CI
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/7157596103666ee7ccb7008acee8bff8a8ff1731
Signed-off-by: zouyizhou <zouyizhou@huawei.com>
[BugFix][DS 3.2] Fix ds indexer accuracy problem caused by rope. (#4641)
The rotary algorithm in the DeepSeek indexer should be NeoX-style instead
of GPT-J-style. PR #4413 fixed this accuracy bug with a new Triton kernel;
this PR fixes the original PyTorch version.
None
CI passed with existing test.
- vLLM version: 86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24
- vLLM main:
https://github.com/vllm-project/vllm/commit/86e178f7c4d8c3b0eaf3c8e3f810a83f63b90e24
Signed-off-by: whx-sjtu <2952154980@qq.com>
[Doc][fix] Fix the title of the document for the layer_sharding feature (#5759)
Fix the title of the document for the layer_sharding feature
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
---------
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
[CI] lint and ut use self_hosted runner (#5652)
lint and ut use self_hosted runner
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
[BugFix] NetLoader: No backend type associated with device type npu (#5700)
**What this PR does / why we need it?**
This PR fixes a bug in NetLoader
[PR#2888](https://github.com/vllm-project/vllm-ascend/pull/2888). The
bug was caused by
[PR#3612](https://github.com/vllm-project/vllm-ascend/pull/3612)
([1/N][Refactor] Refactor code to adapt with vllm main), which removed
the `stateless_init_device_torch_dist_pg` function from platform.py,
leading to a failure in the call. This PR adds a way to create a
stateless process group that does not depend on external code.
**Does this PR introduce any user-facing change?**
No
**How was this patch tested?**
Same with
[PR#2888](https://github.com/vllm-project/vllm-ascend/pull/2888)
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
---------
Signed-off-by: destinysky <kangrui10@126.com>
[CI] Accuracy issue of qwen3-next-w8a8 nightly test fix. (#5746)
Disable **Full Graph** mode to temporarily avoid an accuracy issue for
**Qwen3-Next-80B-A3B-Instruct-W8A8**.
N/A
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
---------
Signed-off-by: InSec <1790766300@qq.com>
[BugFix] Xlite: Bypass the padding of the graph mode in non-MTP cases to obtain the correct decode num. (#5711)
This PR fixes a bug in Xlite
backend(https://atomgit.com/openeuler/GVirt/issues/1), The direct cause
of the problem is that the XModel::PrepareAttn function obtained an
illegal number of tokens to be inferred, -540. This illegal value is due
to the padding feature of inference in graph mode and the residual state
across steps. This issue is triggered when a prefill request is newly
added in a step and a decode ends simultaneously. It is first fixed
using num_decode_tokens instead of attn_metadata.num_decodes.
1. In graph mode, vllm_ascend has padding characteristics. In the
_prepare_inputs function, if the number of tokens to be inferred is less
than the set threshold (8 in this case), the attn_metadata.num_decode
array will be expanded to 8.
2. Meanwhile, vllm_ascend uses the class variable self.query_start_loc
of NPUModelRunner to record the tokens to be inferred. Due to poor
coordination with the graph mode padding mechanism when crossing steps,
in some cases (such as when a decode request is completed in a certain
step and a new prefill request is added at the same time), negative
values may be calculated for attn_metadata.query_lens.
3. After type conversion, the negative values in query_lens cause an
overflow. Xlite detects that the number of tokens to be inferred for the
decode request is too large and triggers a "decode len too long" alert.
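The three steps above can be reconstructed in a pure-Python sketch (illustrative, not the actual model-runner code): a stale `query_start_loc` entry survives across steps while the decode count is padded, so a computed query length goes negative and then overflows on cast.

```python
# Stale residual: the new request's start offset (4) is smaller than the
# previous step's end offset (8), so the length diff goes negative.
query_start_loc = [0, 8, 4]
query_lens = [b - a for a, b in zip(query_start_loc, query_start_loc[1:])]
assert query_lens == [8, -4]

# Casting the negative length to an unsigned 32-bit value wraps around,
# producing the huge token count behind Xlite's "decode len too long" alert.
wrapped = query_lens[1] % 2**32
assert wrapped == 4294967292
```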
No
Same with https://atomgit.com/openeuler/GVirt/issues/1
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
Signed-off-by: wwwumr <1127858301@qq.com>
[CustomOp] support TensorList for dispatchFFNCombine (#5665)
Support TensorList for `dispatch_ffn_combine`, to adapt EPLB.
N/A
Single Operator Testing
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
---------
Signed-off-by: lhchg <lhao_cheng@163.com>
Co-authored-by: lihaocheng <lihaosheng1@h-partners.com>
[CI] Avoid lint and ut for PR push (#5762)
1. Don't run lint and ut again once the PR is merged, to save CI resources
2. Update codecov every 4 hours
3. Rename `model_downloader` to a suitable name
4. Update the schedule job to a better time
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
[BugFix] Fix the error when using Ascend custom operators with rank=128 (#5394)
The custom Ascend operators sgmv_expand and sgmv_shrink apply only when
the rank is 8, 16, 32, or 64. When rank >= 128, the operator indexes out
of range, causing the model to report an error.
Depends on this commit https://github.com/vllm-project/vllm/pull/31408
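A hypothetical guard matching the limitation above: the custom sgmv kernels only cover these LoRA ranks, so larger ranks should be rejected (or fall back) instead of indexing out of range. The helper name is invented.

```python
SUPPORTED_SGMV_RANKS = (8, 16, 32, 64)

def check_sgmv_rank(rank: int) -> None:
    # Reject ranks the custom sgmv_expand/sgmv_shrink kernels cannot handle.
    if rank not in SUPPORTED_SGMV_RANKS:
        raise ValueError(
            f"custom sgmv ops support ranks {SUPPORTED_SGMV_RANKS}, got {rank}")

check_sgmv_rank(64)        # in range: fine
try:
    check_sgmv_rank(128)   # previously triggered an out-of-range kernel error
except ValueError:
    pass                   # rejected early instead of corrupting the kernel
```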
- vLLM version: release/v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/254f6b986720c92ddf97fbb1a6a6465da8e87e29
---------
Signed-off-by: ZT-AIA <1028681969@qq.com>
Signed-off-by: ZT-AIA <63220130+ZT-AIA@users.noreply.github.com>
[Refactor] Replace the implementations of o_proj, q_b_proj, and kv_b_proj with custom_op for sharded CP (#5698)
Based on the Sharded-CP feature
PR:https://github.com/vllm-project/vllm-ascend/pull/4702;
RFC:https://github.com/vllm-project/vllm/issues/30055
This PR officially integrates Deepseek V3.2's DSA-CP support on the
basis of https://github.com/vllm-project/vllm-ascend/pull/4702,
improving inference efficiency and scalability under mixed
prefill-decode workloads. The main improvements include:
- Replace the implementations of o_proj, q_b_proj, and kv_b_proj with
custom_op for TP=1.
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
---------
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Signed-off-by: Kurumi5210 <jaychou1620@gmail.com>
Co-authored-by: clrs97 <524936896@qq.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
[Bugfix] Fix matmul allreduce precision issue by using original weight (#4939)
This PR fixes the precision issue from improper Tensor maintenance in
`vllm_ascend/ops/linear_op.py` under the Verl reinforcement learning
(RL) scenario. issue:
https://github.com/vllm-project/vllm-ascend/issues/5747
Key changes:
1. Remove the custom class member `self.weight_t` in
`vllm_ascend/ops/linear_op.py`;
2. Adjust the input logic of the `npu_mm_all_reduce_base` operator to
directly fetch weight parameters from the model's `nn.Parameters`,
instead of using pre-created Tensors.
> In the vllm model, it is recommended to avoid creating additional
parameter copies (such as self.weight_t) for computation; if already
created, they must be synchronized with the model's original parameters.
This is because parameter synchronization between training and inference
in the Verl reinforcement learning (RL) scenario may cause memory
address changes to nn.Parameters, and unsynchronized extra Tensors will
reference old memory without updating with the parameters—ultimately
leading to precision issues.
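A minimal pure-Python sketch of the staleness described above; in the real module the parameter is a torch `nn.Parameter` and the cached copy was a pre-transposed tensor (`self.weight_t`), so the class and names here are illustrative only.

```python
class LinearLike:
    def __init__(self, weight):
        self.weight = weight
        self.weight_t = list(weight)  # BAD: extra copy captured at init time

    def forward_stale(self):
        return self.weight_t          # keeps referencing the old values

    def forward_live(self):
        return self.weight            # the fix: read the live parameter

layer = LinearLike([1.0, 2.0])
layer.weight = [10.0, 20.0]           # RL weight sync rebinds the parameter
assert layer.forward_stale() == [1.0, 2.0]   # stale copy -> precision issue
assert layer.forward_live() == [10.0, 20.0]  # correct after the fix
```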
No.
- vLLM version: v0.12.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9
Signed-off-by: icerain-alt <450125138@qq.com>
Co-authored-by: Shangwei-Li <lishangwei@mail.ustc.edu.cn>
[Feature] GLM4.6 support mtp with fullgraph (#5460)
GLM4.6 support mtp with fullgraph to improve performance
```shell
export HCCL_BUFFSIZE=1024
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE=AIV
vllm serve /weight/glm4.6_w8a8_with_float_mtp \
  --data-parallel-size 1 \
  --tensor-parallel-size 16 \
  --seed 1024 \
  --served-model-name glm \
  --max-model-len 35000 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 16 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --speculative-config '{"num_speculative_tokens": 1, "model":"/weight/glm4.6_w8a8_with_float_mtp", "method":"mtp"}' \
  --compilation-config '{"cudagraph_capture_sizes": [1,2,4,8,16,32], "cudagraph_mode": "FULL_DECODE_ONLY"}' \
  --async-scheduling
```
test case:
```shell
vllm bench serve \
  --backend vllm \
  --dataset-name prefix_repetition \
  --prefix-repetition-prefix-len 22400 \
  --prefix-repetition-suffix-len 9600 \
  --prefix-repetition-output-len 1024 \
  --num-prompts 1 \
  --prefix-repetition-num-prefixes 1 \
  --ignore-eos \
  --model glm \
  --tokenizer /weight/glm4.6_w8a8_with_float_mtp \
  --seed 1000 \
  --host 0.0.0.0 \
  --port 8000 \
  --endpoint /v1/completions \
  --max-concurrency 1 \
  --request-rate 1
```
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/5326c89803566a131c928f7fdd2100b75c981a42
Signed-off-by: 1092626063 <1092626063@qq.com>
[CI]Add Disaggregated PD Nightly Test for Qwen3-235B and Qwen3-VL-235B (#5502)
This PR adds online **Disaggregated Prefill/Decode** performance and
accuracy tests for the **Qwen3-235B-A22B** and
**Qwen3-VL-235B-A22B-Instruct** models to the Nightly test suite.
These test configurations simulate the deployment of massive MoE and
Vision-Language models in **a dual-node (32 NPU)** environment,
utilizing Mooncake (KVCache Transfer) technology to achieve efficient KV
cache transfer between the Prefill node and the Decode node.
**Qwen3-235B-A22B**
- Model: Qwen/Qwen3-235B-A22B
- Hardware: A3, 2 Nodes (32 NPUs total, 16 NPUs per node)
- Architecture: Disaggregated Prefill & Decode
- Node 0 (Producer/Prefill): **DP2 + TP8 + EP + FLASHCOMM1 +
FUSED_MC2**.
- Node 1 (Consumer/Decode): **DP4 + TP4 + EP + FLASHCOMM1 + FUSED_MC2 +
FULL_DECODE_ONLY**.
- Benchmarks:
- Performance: vllm-ascend/GSM8K-in3500-bs2800.
- Accuracy: vllm-ascend/gsm8k-lite.
**Qwen3-VL-235B-A22B-Instruct**
- Model: Qwen/Qwen3-VL-235B-A22B-Instruct
- Hardware: A3, 2 Nodes (32 NPUs total, 16 NPUs per node)
- Architecture: Disaggregated Prefill & Decode
- Node 0 (Producer/Prefill): **DP2 + TP8 + EP**.
- Node 1 (Consumer/Decode): **DP4 + TP4 + EP + FULL_DECODE_ONLY**.
- Benchmarks:
- Performance: vllm-ascend/textvqa-perf-1080p.
- Accuracy: vllm-ascend/textvqa-lite.
Nightly test action on CI
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/45c1ca1ca1ee8fa06df263c8715e8a412ff408d4
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
support mxfp8 quantization (qwen dense) (#5723)
Support mxfp8 quantization (Qwen linear layers).
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
Signed-off-by: wangyao <iwangyao@outlook.com>
[Doc] Add GLM4.5 GLM4.6 doc (#5740)
Add GLM4.5 GLM4.6 doc
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
Signed-off-by: 1092626063 <1092626063@qq.com>
[bugfix] Fixing KV Pool Memory Retention and Performance Degradation Issues (#5751)
1. Fixed memory retention on certain GPUs caused by missing PUT
operations.
2. Fixed performance degradation resulting from architectural
incompatibilities in the underlying refactor.
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
---------
Signed-off-by: fems14 <1804143737@qq.com>
[P/D][bugfix]Fix the PCP port mapping error issue (#5706)
Fix the PCP port mapping error. In a multi-node PD disaggregation
scenario, when the PCP feature is enabled, there is an issue with the
ZMQ transmission port: the IP and port received by the D side do not
match. The cause is a logic error in the port mapping update strategy.
No
By ci
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
[Feat] flashcomm2+oshard Generalized (#4723)
[FlashComm2](https://gitcode.com/ascend-tribe/ascend-inference-cluster/blob/main/FlashComm/FlashComm2%E5%A4%A7%E6%A8%A1%E5%9E%8B%E6%8E%A8%E7%90%86%E4%B8%AD%E4%BB%A5%E5%AD%98%E6%8D%A2%E4%BC%A0%E7%9A%84%E9%80%9A%E4%BF%A1%E4%BC%98%E5%8C%96%E6%8A%80%E6%9C%AF.pdf)
introduces redundant storage of the o_proj matrix, which imposes
pressure on GPU memory. We propose the FlashComm2+Oshard approach by
integrating the shared linear layer feature (#2931). This approach
distributes weights layer-by-layer to each GPU and accesses the o_proj
of each layer via asynchronous broadcast operations, thereby alleviating
memory pressure while achieving nearly lossless performance compared to
the original FlashComm2. This PR implements a generalized
FlashComm2+Oshard solution.
Using following env to support flashcomm2 with oshard
```shell
export VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1
--additional-config '{
"layer_sharding": ["o_proj"]
}'
```
- vLLM version: v0.12.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/ad32e3e19ccf0526cb6744a5fed09a138a5fb2f9
---------
Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
adapt to minimax_m2 (#5624)
This PR fixes Minimax model loading in the vLLM Ascend backend by:
- Adding a model type check for "minimax" and "minimax_m2" to replace
the "mlp" prefix with "block_sparse_moe"
- Implementing special handling for Minimax expert layer naming
conventions
- Adding Minimax configuration to packed_modules_model_mapping for
proper qkv_proj and experts module handling
Without these changes, Minimax models fail to load on Ascend devices due
to incompatible layer naming and module packing.
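The prefix replacement described above can be sketched as follows; the real logic lives in the model loader, and the helper name here is invented for illustration.

```python
def remap_minimax_weight(name: str, model_type: str) -> str:
    # For Minimax checkpoints, the "mlp" prefix maps to "block_sparse_moe".
    if model_type in ("minimax", "minimax_m2"):
        return name.replace("mlp", "block_sparse_moe", 1)
    return name

assert remap_minimax_weight("model.layers.0.mlp.gate.weight", "minimax_m2") \
    == "model.layers.0.block_sparse_moe.gate.weight"
assert remap_minimax_weight("model.layers.0.mlp.gate.weight", "qwen3") \
    == "model.layers.0.mlp.gate.weight"
```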
Yes. Users can now successfully load and run Minimax models on Ascend
hardware with vLLM. This enables inference capabilities for this model
family on Ascend devices.
Local testing:
- Verified model loading for minimax-xxx and minimax_m2-xxx model
variants on Atlas 800I A2 hardware
- Tested inference with sample prompts using vLLM's OpenAI-compatible
API server
Benchmark validation:
- Compared throughput and latency metrics against the GPU baseline
- Verified memory usage stays within expected limits for different
batch sizes
- Tested multi-card inference scenarios with tensor parallelism
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/8be6432bdaf6275664d857b1e5e9bf8ed1ce299e
---------
Signed-off-by: Feng-xiaosuo <tengchang1@huawei.com>
[P/D] layerwise connector supports DeepSeek-V3.2 sparse attention && Distribute transfer tasks to redundant kv_head cards (#5722)
Add new functions to the mooncake layerwise connector, including:
1. sparse attention support for DeepSeek-V3.2
2. distributing transfer tasks to redundant kv_head cards
This PR is related to [[RFC]: CDCP Scheduling for Disaggregated
Prefilling with KV Cache Layerwise Push
Support](https://github.com/vllm-project/vllm-ascend/issues/4842)
No.
By CI.
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
[main][bugfix] Fix fullgraph padding bug in mtp eagle refactor (#5692)
The condition for determining padding when fullgraph is combined with
MTP and PCP has been modified to accommodate corner cases where the
graph capture size is manually specified.
no
ut and tests
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
---------
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
[Perf] Supports compute-communication overlap in the forward of sfa_v1 in the Sharded-CP feature. (#5701)
> Extracted from PR #5513
Based on the Sharded-CP feature PR:#4702;
RFC:https://github.com/vllm-project/vllm/issues/30055
- This PR adjusts the calculation order in the SFA.
- split `index_select` into `indexer_select_pre_process` and
`indexer_select_post_process`.
- Combine `nope`, `rope` and `index-k` into a tensor to perform
asynchronous all-gather.
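The pack-then-gather idea above can be shown at shape level; row counts and widths below are invented, and the collective communication call itself is elided.

```python
def pack(nope, rope, index_k):
    # Concatenate the three per-token pieces so one collective moves them all.
    return [a + b + c for a, b, c in zip(nope, rope, index_k)]

def unpack(packed, w_nope, w_rope):
    # Split back into the three pieces after the (asynchronous) all-gather.
    return ([row[:w_nope] for row in packed],
            [row[w_nope:w_nope + w_rope] for row in packed],
            [row[w_nope + w_rope:] for row in packed])

nope = [[1.0] * 8 for _ in range(4)]
rope = [[2.0] * 4 for _ in range(4)]
index_k = [[3.0] * 2 for _ in range(4)]
packed = pack(nope, rope, index_k)   # 4 rows of width 8 + 4 + 2 = 14
n, r, k = unpack(packed, 8, 4)
assert n == nope and r == rope and k == index_k
```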
input=40k && num_batch_token=20k
- before:
```
Mean TTFT (ms): 2614.52
Median TTFT (ms): 3148.03
P50 TTFT (ms): 3148.03
P90 TTFT (ms): 3163.48
P99 TTFT (ms): 3170.20
```
- after:
```
Mean TTFT (ms): 2529.92
Median TTFT (ms): 3051.69
P50 TTFT (ms): 3051.69
P90 TTFT (ms): 3067.31
P99 TTFT (ms): 3072.15
```
None
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
---------
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
[Feature] Support for cross-attention and whisper model (#5592)
To solve the problem of the
issue:https://github.com/vllm-project/vllm-ascend/issues/2262
- support for cross-attention when the model is encoder-decoder
- support for whisper model
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/7157596103666ee7ccb7008acee8bff8a8ff1731
Signed-off-by: gh924 <guihao2@huawei.com>
Co-authored-by: Aoxuan Chen <43376869+chenaoxuan@users.noreply.github.com>
[CI] Align multi-node nightly test parameter with corresponding tutorials document (#5756)
Align multi-node nightly test parameters with the tutorials documents.
NA
Tested locally and with nightly e2e multi-node test cases.
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
[Doc] Update doc url link (#5781)
Drop `dev` suffix for doc url.
Rename url to `https://docs.vllm.ai/projects/ascend`
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
[CI] adapt v0.13.0 change (#5793)
Add `releases` match case for CI jobs and update related doc for v0.13.0
branch
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
[Doc] add tls check to pd disaggregation readme (#5638)
Update the PD disaggregation multi_node readme: update the environment
check command for A3 and add a TLS check.
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/8be6432bdaf6275664d857b1e5e9bf8ed1ce299e
Signed-off-by: liziyu <liziyu16@huawei.com>
[CI]Add Kimi k2 nightly test (#5682)
This PR adds performance and accuracy tests for **Kimi-K2-Instruct-W8A8**
and **Kimi-K2-Thinking** models to the nightly test suite.
**Kimi-K2-Instruct-W8A8**
- model: vllm-ascend/Kimi-K2-Instruct-W8A8
- Hardware: A3, 2 Nodes (32 NPUs total, 16 NPUs per node)
- Architecture: Unified Distributed Inference
- Parallelism: **DP4 + TP8 + EP** (Data Parallel 4, Tensor Parallel 8,
Expert Parallel enabled).
- Optimization: **torchair graph**, **no-prefix-caching**.
- Node 0: DP Rank 0-1, Local DP 2, Tensor Parallel 8.
- Node 1: DP Rank 2-3, Local DP 2, Tensor Parallel 8.
- Benchmarks:
- Performance: vllm-ascend/GSM8K-in3500-bs2800.
- Accuracy: vllm-ascend/gsm8k-lite.
**Kimi-K2-Thinking**
- Model: moonshotai/Kimi-K2-Thinking
- Hardware: A3, 1 Node (16 NPUs total)
- Architecture: Single Node Distributed Inference
- Parallelism: TP16 + EP (Tensor Parallel 16, Expert Parallel enabled).
- Optimization: **no-prefix-caching**
- Benchmarks:
- Performance: vllm-ascend/GSM8K-in3500-bs400.
- Accuracy: vllm-ascend/gsm8k-lite.
**Yes.** This PR enhances the `AisbenchRunner` to support dynamic
configuration of the `trust_remote_code` flag. This allows the
AISBench client to successfully load tokenizers for models that require
custom code execution (e.g., **Kimi-K2-Thinking** and
**Kimi-K2-Instruct-W8A8**).
**Changes:**
1. `AisbenchRunner.__init__`: Added the ability to capture the
`trust_remote_code` parameter from the case configuration.
``` diff
self.batch_size = aisbench_config["batch_size"]
self.request_rate = aisbench_config.get("request_rate", 0)
+ self.trust_remote_code = aisbench_config.get("trust_remote_code", False)
self.temperature = aisbench_config.get("temperature")
self.top_k = aisbench_config.get("top_k")
```
2. `AisbenchRunner._init_request_conf`: Added regex substitution to
inject the parameter into the generated dynamic configuration file.
``` diff
content = re.sub(r'batch_size.*', f'batch_size = {self.batch_size},',
content)
+ content = re.sub(r'trust_remote_code=.*',
+ f'trust_remote_code={self.trust_remote_code},',
+ content)
content = content.replace("top_k", "#top_k")
content = content.replace("seed", "#seed")
```
**Details:**
- New Config Key: Users can add `"trust_remote_code": True` to any
dictionary within the `aisbench_cases` list.
- Default Value: Defaults to `False` to maintain existing security
protocols for standard models.
- Impact: Resolves a `ValueError` when benchmarking reasoning models
or models with custom tokenizers that previously failed during the
AISBench local initialization phase.
**User Example:**
Users can now enable custom code execution for specific models (like
Kimi-K2-Thinking) directly in their test suite:
``` python
aisbench_cases = [{
"case_type": "performance",
"request_conf": "vllm_api_stream_chat",
"trust_remote_code": True, # New user-facing parameter
...
}]
```
Actions:
- https://github.com/vllm-project/vllm-ascend/actions/runs/20849768433
Results are as follows:
- **Kimi-K2-Instruct-W8A8** (25m25s)
1. Accuracy test
```
dataset version metric mode vllm-api-general-chat
--------- --------- -------- ------ -----------------------
gsm8k 7cd45e accuracy gen 96.88
```
2. Perf test
```
╒══════════════════════════╤═════════╤════════════════╤════════════════╤═══════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕
│ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪═══════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡
│ E2EL │ total │ 34571.489 ms │ 28657.8054 ms │ 36294.1788 ms │ 34714.7329 ms │ 35247.2724 ms │ 35526.6758 ms │ 36146.4314 ms │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TTFT │ total │ 2043.9136 ms │ 627.4718 ms │ 3532.3978 ms │ 1906.0194 ms │ 2307.7979 ms │ 2883.8528 ms │ 3283.7012 ms │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TPOT │ total │ 127.5591 ms │ 106.4937 ms │ 137.107 ms │ 128.3135 ms │ 129.5704 ms │ 131.1332 ms │ 134.1087 ms │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ ITL │ total │ 126.5571 ms │ 0.0095 ms │ 1340.783 ms │ 104.1398 ms │ 110.1272 ms │ 119.6124 ms │ 950.2924 ms │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ InputTokens │ total │ 3516.6055 │ 3014.0 │ 3985.0 │ 3525.0 │ 3525.0 │ 3586.8 │ 3800.67 │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokens │ total │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokenThroughput │ total │ 7.4143 token/s │ 7.0535 token/s │ 8.933 token/s │ 7.3744 token/s │ 7.4118 token/s │ 7.5608 token/s │ 8.7051 token/s │ 512 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧═══════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛
╒══════════════════════════╤═════════╤═══════════════════╕
│ Common Metric │ Stage │ Value │
╞══════════════════════════╪═════════╪═══════════════════╡
│ Benchmark Duration │ total │ 279430.9375 ms │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Requests │ total │ 512 │
├──────────────────────────┼─────────┼───────────────────┤
│ Failed Requests │ total │ 0 │
├──────────────────────────┼─────────┼───────────────────┤
│ Success Requests │ total │ 512 │
├──────────────────────────┼─────────┼───────────────────┤
│ Concurrency │ total │ 63.3452 │
├──────────────────────────┼─────────┼───────────────────┤
│ Max Concurrency │ total │ 64 │
├──────────────────────────┼─────────┼───────────────────┤
│ Request Throughput │ total │ 1.8323 req/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Input Tokens │ total │ 1800502 │
├──────────────────────────┼─────────┼───────────────────┤
│ Prefill Token Throughput │ total │ 1720.5255 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total generated tokens │ total │ 131072 │
├──────────────────────────┼─────────┼───────────────────┤
│ Input Token Throughput │ total │ 6443.4598 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Output Token Throughput │ total │ 469.0676 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Token Throughput │ total │ 6912.5274 token/s │
╘══════════════════════════╧═════════╧═══════════════════╛
```
- **Kimi-K2-Thinking** (43m51s)
1. Accuracy test
```
dataset version metric mode vllm-api-general-chat
--------- --------- -------- ------ -----------------------
gsm8k 7cd45e accuracy gen 100.00
```
2. Perf test
```
╒══════════════════════════╤═════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕
│ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡
│ E2EL │ total │ 172384.3573 ms │ 34456.5517 ms │ 205922.9407 ms │ 174844.2216 ms │ 202656.092 ms │ 204428.9502 ms │ 205468.6776 ms │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TTFT │ total │ 138740.3228 ms │ 655.1066 ms │ 171777.3003 ms │ 141088.0561 ms │ 169237.5599 ms │ 170716.4954 ms │ 171393.1278 ms │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TPOT │ total │ 131.9374 ms │ 90.6331 ms │ 135.4144 ms │ 132.405 ms │ 132.948 ms │ 133.7549 ms │ 135.2543 ms │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ ITL │ total │ 130.9028 ms │ 0.0099 ms │ 960.3683 ms │ 116.9623 ms │ 122.3127 ms │ 132.0522 ms │ 886.4662 ms │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ InputTokens │ total │ 3514.575 │ 3014.0 │ 3843.0 │ 3525.0 │ 3525.0 │ 3588.0 │ 3801.08 │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokens │ total │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokenThroughput │ total │ 1.6799 token/s │ 1.2432 token/s │ 7.4296 token/s │ 1.4642 token/s │ 1.4737 token/s │ 1.8754 token/s │ 7.125 token/s │ 400 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛
╒══════════════════════════╤═════════╤═══════════════════╕
│ Common Metric │ Stage │ Value │
╞══════════════════════════╪═════════╪═══════════════════╡
│ Benchmark Duration │ total │ 1166795.568 ms │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Requests │ total │ 400 │
├──────────────────────────┼─────────┼───────────────────┤
│ Failed Requests │ total │ 0 │
├──────────────────────────┼─────────┼───────────────────┤
│ Success Requests │ total │ 400 │
├──────────────────────────┼─────────┼───────────────────┤
│ Concurrency │ total │ 59.0967 │
├──────────────────────────┼─────────┼───────────────────┤
│ Max Concurrency │ total │ 64 │
├──────────────────────────┼─────────┼───────────────────┤
│ Request Throughput │ total │ 0.3428 req/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Input Tokens │ total │ 1405830 │
├──────────────────────────┼─────────┼───────────────────┤
│ Prefill Token Throughput │ total │ 25.332 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total generated tokens │ total │ 102400 │
├──────────────────────────┼─────────┼───────────────────┤
│ Input Token Throughput │ total │ 1204.864 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Output Token Throughput │ total │ 87.7617 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Token Throughput │ total │ 1292.6258 token/s │
╘══════════════════════════╧═════════╧═══════════════════╛
```
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
[bugfix]limit graph replay sync (#5761)
When the graph mode is piecewise, replaying with a synchronize hurts
performance; each sync costs roughly 250 us.

Only synchronize when the graph mode contains full mode.
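A minimal sketch of the guard, with illustrative names (the actual graph-mode enum and replay path in vllm-ascend differ):

``` python
from enum import Flag, auto

class GraphMode(Flag):
    # Illustrative stand-in for the real graph-mode enum.
    PIECEWISE = auto()
    FULL = auto()

def replay(replay_fn, synchronize_fn, mode: GraphMode) -> bool:
    """Replay the captured graph; synchronize only for full-graph modes.

    An unconditional synchronize after every piecewise replay costs
    roughly 250 us, so it is skipped unless the mode contains FULL.
    """
    replay_fn()
    if mode & GraphMode.FULL:
        synchronize_fn()
        return True
    return False

calls = []
replay(lambda: calls.append("replay"), lambda: calls.append("sync"),
       GraphMode.PIECEWISE)
```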
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
---------
Signed-off-by: wangyongjun <wangyongjun7@huawei.com>
[bugfix](cp) align max_context_chunk to cp_virtual_block_size (#5767)
In the chunked prefill scenario, CP needs to align the
`max_context_chunk` to the `cp_virtual_block_size`, but the current
implementation only aligns it to the `block_size`. For
PD-disaggregation, `cp_kv_cache_interleave_size` is typically set equal
to `block_size`, in which case `cp_virtual_block_size=block_size *
dcp_size * pcp_size`. Under specific conditions, this can lead to
misalignment of certain chunks, subsequently triggering assertion check
errors.
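Numerically, with made-up sizes (the real values come from the parallel config), the alignment difference looks like:

``` python
def align_down(value: int, multiple: int) -> int:
    # Round value down to the nearest multiple of `multiple`.
    return (value // multiple) * multiple

# Hypothetical sizes; in the PD-disaggregation case described above,
# cp_kv_cache_interleave_size == block_size, so:
block_size = 128
dcp_size, pcp_size = 2, 2
cp_virtual_block_size = block_size * dcp_size * pcp_size  # 512

max_context_chunk = 1000
old_chunk = align_down(max_context_chunk, block_size)             # 896
new_chunk = align_down(max_context_chunk, cp_virtual_block_size)  # 512

# 896 is block_size-aligned but not cp_virtual_block_size-aligned,
# which is what triggered the assertion errors.
```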
No
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
Bump actions/github-script from 7 to 8 (#5796)
Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 8.
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bump actions/checkout from 4 to 6 (#5795)
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6.
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[CI] Show disk usage for CI shared volume (#5821)
1. Remove some unused but overly large models from the shared volume
2. Add a new step to show current usage
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
[CI] Move nightly-a2 test to hk (#5807)
As initial testing, this patch connects two nodes from the HK region to
the nightly A2 test.
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
[Bugfix] bugfix for the order of dummy run pad and sync (#5777)
This PR addresses an issue in piecewise graph mode when Multi-Token
Prediction (MTP) is enabled. Specifically, the original dummy run
performs the following steps in order:
1. Sync DP (input length = 1 + k)
2. Dispatch (input length = 1 + k, with padding==graph size)
However, in the model execution phase, the sequence differs, resulting
in:
1. Padding (input length = 1, with padding)
2. Sync DP (input length = 1 + k)
3. Dispatch (input length 1 + k != graph size 1 + k, with padding)
This discrepancy leads to a mismatch between the input sizes used in the
model execution and those expected by the dispatch graph, causing an
inconsistency in graph size.
This PR ensures that the dispatch graph size aligns correctly by
modifying the sequence of operations during model execution to match the
dummy run sequence, resolving the mismatch issue.
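A toy sketch of the size mismatch, with invented graph sizes (`k` is the number of speculative tokens MTP adds per request; the real padding logic is more involved):

``` python
def pad_to_graph_size(num_tokens: int, graph_sizes: list[int]) -> int:
    # Pick the smallest captured graph size that fits the batch.
    return min(s for s in graph_sizes if s >= num_tokens)

graph_sizes = [2, 4, 8]  # hypothetical captured sizes
k = 3                    # speculative tokens added by MTP per request

# Old execution order: padding is computed from the pre-MTP length (1),
# but dispatch later sees 1 + k tokens -> sizes disagree with the graph.
old_pad = pad_to_graph_size(1, graph_sizes)      # 2
dispatched = 1 + k                               # 4

# Fixed order (matching the dummy run): pad after accounting for k.
new_pad = pad_to_graph_size(1 + k, graph_sizes)  # 4
```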
No
- vLLM version: v0.13.0
- vLLM main:
https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d
Signed-off-by: LiuYi-UP <1150854440@qq.com>
[Refactor] Add comments for Metadata classes in attention module (#5789)
Add docstrings for Metadata and MetadataBuilder classes in the attention
module to improve code readability.
Related to #5463 (Item 11: Add some comments for CommonMetadata and
others)
**Modified files:**
- `vllm_ascend/attention/context_parallel/common_cp.py`: Added comments
for `AscendPCPMetadata`, `CPChunkedContextMetadata`,
`AscendMetadataForPrefill`, `AscendMetadataForDecode`
- `vllm_ascend/attention/utils.py`: Added comments for
`AscendPrefillContextParallelMetadata`
- `vllm_ascend/attention/mla_v1.py`: Added comments for
`Chunke…
### What this PR does / why we need it?
The PR add performance and accuracy tests for **Kimi-K2-Instruct-W8A8**
and **Kimi-K2-Thinking** models to the Nightly test suite.
#### Test Configuration
**Kimi-K2-Instruct-W8A8**
- model: vllm-ascend/Kimi-K2-Instruct-W8A8
- Hardware: A3, 2 Nodes (32 NPUs total, 16 NPUs per node)
- Architecture: Unified Distributed Inference
- Parallelism: **DP4 + TP8 + EP** (Data Parallel 4, Tensor Parallel 8,
Expert Parallel enabled).
- Optimization: **torchair graph**, **no-prefix-caching**.
- Node 0: DP Rank 0-1, Local DP 2, Tensor Parallel 8.
- Node 1: DP Rank 2-3, Local DP 2, Tensor Parallel 8.
- Benchmarks:
- Performance: vllm-ascend/GSM8K-in3500-bs2800.
- Accuracy: vllm-ascend/gsm8k-lite.
**Kimi-K2-Thinking**
- Model: moonshotai/Kimi-K2-Thinking
- Hardware: A3, 1 Node (16 NPUs total)
- Architecture: Single Node Distributed Inference
- Parallelism: TP16 + EP (Tensor Parallel 16, Expert Parallel enabled).
- Optimization: **no-prefix-caching**
- Benchmarks:
- Performance: vllm-ascend/GSM8K-in3500-bs400.
- Accuracy: vllm-ascend/gsm8k-lite.
### Does this PR introduce _any_ user-facing change?
**Yes.** This PR enhances the ```AisbenchRunner``` to support dynamic
configuration of the ```trust_remote_code``` flag. This allows the
AISBench client to successfully load tokenizers for models that require
custom code execution (e.g., **Kimi-K2-Thinking and
Kimi-K2-Instruct-W8A8**).
**Changes:**
1. ```AisbenchRunner.__init__ ```Added the ability to capture the
```trust_remote_code``` parameter from the case configuration.
``` python
self.batch_size = aisbench_config["batch_size"]
self.request_rate = aisbench_config.get("request_rate", 0)
+ self.trust_remote_code = aisbench_config.get("trust_remote_code", False)
self.temperature = aisbench_config.get("temperature")
self.top_k = aisbench_config.get("top_k")
```
2. ```AisbenchRunner._init_request_conf``` Added regex substitution to
inject the parameter into the generated dynamic configuration file.
``` python
content = re.sub(r'batch_size.*', f'batch_size = {self.batch_size},',
content)
+ content = re.sub(r'trust_remote_code=.*',
+ f'trust_remote_code={self.trust_remote_code},',
+ content)
content = content.replace("top_k", "#top_k")
content = content.replace("seed", "#seed")
```
**Details:**
- New config key: users can add `"trust_remote_code": True` to any
dictionary within the `aisbench_cases` list.
- Default value: defaults to `False`, preserving the existing security
posture for standard models.
- Impact: resolves the `ValueError` raised when benchmarking reasoning
models or models with custom tokenizers, which previously failed during
the AISBench local initialization phase.
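The default behavior can be sketched with plain `dict.get` (illustrative only — the case dict below is a made-up example):

```python
# When a case dict omits the key, .get falls back to False,
# matching the existing security posture for standard models.
aisbench_config = {"batch_size": 64}  # no trust_remote_code key
assert aisbench_config.get("trust_remote_code", False) is False

# Opting in per case enables custom tokenizer code for that model only.
aisbench_config["trust_remote_code"] = True
assert aisbench_config.get("trust_remote_code", False) is True
```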
**User Example:**
Users can now enable custom code execution for specific models (like
Kimi-K2-Thinking) directly in their test suite:
```python
# Now supported in test scripts:
aisbench_cases = [{
"case_type": "performance",
"request_conf": "vllm_api_stream_chat",
"trust_remote_code": True, # New user-facing parameter
...
}]
```
### How was this patch tested?
Actions:
- https://github.com/vllm-project/vllm-ascend/actions/runs/20849768433
Results are as follows:
- **Kimi-K2-Instruct-W8A8** (25m25s)
1. Accuracy test
```
dataset version metric mode vllm-api-general-chat
--------- --------- -------- ------ -----------------------
gsm8k 7cd45e accuracy gen 96.88
```
2. Perf test
```
╒══════════════════════════╤═════════╤════════════════╤════════════════╤═══════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕
│ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪═══════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡
│ E2EL │ total │ 34571.489 ms │ 28657.8054 ms │ 36294.1788 ms │ 34714.7329 ms │ 35247.2724 ms │ 35526.6758 ms │ 36146.4314 ms │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TTFT │ total │ 2043.9136 ms │ 627.4718 ms │ 3532.3978 ms │ 1906.0194 ms │ 2307.7979 ms │ 2883.8528 ms │ 3283.7012 ms │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TPOT │ total │ 127.5591 ms │ 106.4937 ms │ 137.107 ms │ 128.3135 ms │ 129.5704 ms │ 131.1332 ms │ 134.1087 ms │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ ITL │ total │ 126.5571 ms │ 0.0095 ms │ 1340.783 ms │ 104.1398 ms │ 110.1272 ms │ 119.6124 ms │ 950.2924 ms │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ InputTokens │ total │ 3516.6055 │ 3014.0 │ 3985.0 │ 3525.0 │ 3525.0 │ 3586.8 │ 3800.67 │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokens │ total │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 512 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokenThroughput │ total │ 7.4143 token/s │ 7.0535 token/s │ 8.933 token/s │ 7.3744 token/s │ 7.4118 token/s │ 7.5608 token/s │ 8.7051 token/s │ 512 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧═══════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛
╒══════════════════════════╤═════════╤═══════════════════╕
│ Common Metric │ Stage │ Value │
╞══════════════════════════╪═════════╪═══════════════════╡
│ Benchmark Duration │ total │ 279430.9375 ms │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Requests │ total │ 512 │
├──────────────────────────┼─────────┼───────────────────┤
│ Failed Requests │ total │ 0 │
├──────────────────────────┼─────────┼───────────────────┤
│ Success Requests │ total │ 512 │
├──────────────────────────┼─────────┼───────────────────┤
│ Concurrency │ total │ 63.3452 │
├──────────────────────────┼─────────┼───────────────────┤
│ Max Concurrency │ total │ 64 │
├──────────────────────────┼─────────┼───────────────────┤
│ Request Throughput │ total │ 1.8323 req/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Input Tokens │ total │ 1800502 │
├──────────────────────────┼─────────┼───────────────────┤
│ Prefill Token Throughput │ total │ 1720.5255 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total generated tokens │ total │ 131072 │
├──────────────────────────┼─────────┼───────────────────┤
│ Input Token Throughput │ total │ 6443.4598 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Output Token Throughput │ total │ 469.0676 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Token Throughput │ total │ 6912.5274 token/s │
╘══════════════════════════╧═════════╧═══════════════════╛
```
- **Kimi-K2-Thinking** (43m51s)
1. Accuracy test
```
dataset version metric mode vllm-api-general-chat
--------- --------- -------- ------ -----------------------
gsm8k 7cd45e accuracy gen 100.00
```
2. Perf test
```
╒══════════════════════════╤═════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕
│ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡
│ E2EL │ total │ 172384.3573 ms │ 34456.5517 ms │ 205922.9407 ms │ 174844.2216 ms │ 202656.092 ms │ 204428.9502 ms │ 205468.6776 ms │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TTFT │ total │ 138740.3228 ms │ 655.1066 ms │ 171777.3003 ms │ 141088.0561 ms │ 169237.5599 ms │ 170716.4954 ms │ 171393.1278 ms │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TPOT │ total │ 131.9374 ms │ 90.6331 ms │ 135.4144 ms │ 132.405 ms │ 132.948 ms │ 133.7549 ms │ 135.2543 ms │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ ITL │ total │ 130.9028 ms │ 0.0099 ms │ 960.3683 ms │ 116.9623 ms │ 122.3127 ms │ 132.0522 ms │ 886.4662 ms │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ InputTokens │ total │ 3514.575 │ 3014.0 │ 3843.0 │ 3525.0 │ 3525.0 │ 3588.0 │ 3801.08 │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokens │ total │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 256.0 │ 400 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokenThroughput │ total │ 1.6799 token/s │ 1.2432 token/s │ 7.4296 token/s │ 1.4642 token/s │ 1.4737 token/s │ 1.8754 token/s │ 7.125 token/s │ 400 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛
╒══════════════════════════╤═════════╤═══════════════════╕
│ Common Metric │ Stage │ Value │
╞══════════════════════════╪═════════╪═══════════════════╡
│ Benchmark Duration │ total │ 1166795.568 ms │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Requests │ total │ 400 │
├──────────────────────────┼─────────┼───────────────────┤
│ Failed Requests │ total │ 0 │
├──────────────────────────┼─────────┼───────────────────┤
│ Success Requests │ total │ 400 │
├──────────────────────────┼─────────┼───────────────────┤
│ Concurrency │ total │ 59.0967 │
├──────────────────────────┼─────────┼───────────────────┤
│ Max Concurrency │ total │ 64 │
├──────────────────────────┼─────────┼───────────────────┤
│ Request Throughput │ total │ 0.3428 req/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Input Tokens │ total │ 1405830 │
├──────────────────────────┼─────────┼───────────────────┤
│ Prefill Token Throughput │ total │ 25.332 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total generated tokens │ total │ 102400 │
├──────────────────────────┼─────────┼───────────────────┤
│ Input Token Throughput │ total │ 1204.864 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Output Token Throughput │ total │ 87.7617 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Token Throughput │ total │ 1292.6258 token/s │
╘══════════════════════════╧═════════╧═══════════════════╛
```
- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
│ Request Throughput │ total │ 0.3428 req/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Input Tokens │ total │ 1405830 │
├──────────────────────────┼─────────┼───────────────────┤
│ Prefill Token Throughput │ total │ 25.332 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total generated tokens │ total │ 102400 │
├──────────────────────────┼─────────┼───────────────────┤
│ Input Token Throughput │ total │ 1204.864 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Output Token Throughput │ total │ 87.7617 token/s │
├──────────────────────────┼─────────┼───────────────────┤
│ Total Token Throughput │ total │ 1292.6258 token/s │
╘══════════════════════════╧═════════╧═══════════════════╛
```
- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@2f4e654
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: root <root@LAPTOP-VQKDDVMG.localdomain>
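The `trust_remote_code` plumbing this PR adds to `AisbenchRunner` can be sketched as a small standalone function. This is an illustrative reconstruction, not the exact vllm-ascend source: the function name and the sample config text are assumptions; only the regex substitutions mirror the snippets quoted in the PR description.

```python
import re


def inject_aisbench_overrides(content: str, batch_size: int,
                              trust_remote_code: bool) -> str:
    """Sketch of how AisbenchRunner._init_request_conf patches a generated
    AISBench request config: scalar fields are rewritten in place with
    regex substitution, and unsupported sampling keys are commented out."""
    # Rewrite the batch_size line with the value from the case config.
    content = re.sub(r'batch_size.*', f'batch_size = {batch_size},', content)
    # Inject the trust_remote_code flag (the new behaviour in this PR).
    content = re.sub(r'trust_remote_code=.*',
                     f'trust_remote_code={trust_remote_code},', content)
    # Comment out keys AISBench should not pass through.
    content = content.replace("top_k", "#top_k")
    content = content.replace("seed", "#seed")
    return content
```

With a config template containing `trust_remote_code=False,`, calling `inject_aisbench_overrides(template, 64, True)` yields a config with `trust_remote_code=True,`, which is what lets the AISBench tokenizer load models such as Kimi-K2-Thinking.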
… to `.yaml` (#6503)

### What this PR does / why we need it?
This PR refactors the nightly single-node model tests by migrating test configurations from Python scripts to a more maintainable YAML-based format.

| Original PR | Python (`.py`) | YAML (`.yaml`) |
| :--- | :--- | :--- |
| [#3568](#3568) | `test_deepseek_r1_0528_w8a8_eplb.py` | `DeepSeek-R1-0528-W8A8.yaml` |
| [#3631](#3631) | `test_deepseek_r1_0528_w8a8.py` | `DeepSeek-R1-0528-W8A8.yaml` |
| [#5874](#5874) | `test_deepseek_r1_w8a8_hbm.py` | `DeepSeek-R1-W8A8-HBM.yaml` |
| [#3908](#3908) | `test_deepseek_v3_2_w8a8.py` | `DeepSeek-V3.2-W8A8.yaml` |
| [#5682](#5682) | `test_kimi_k2_thinking.py` | `Kimi-K2-Thinking.yaml` |
| [#4111](#4111) | `test_mtpx_deepseek_r1_0528_w8a8.py` | `MTPX-DeepSeek-R1-0528-W8A8.yaml` |
| [#3733](#3733) | `test_prefix_cache_deepseek_r1_0528_w8a8.py` | `Prefix-Cache-DeepSeek-R1-0528-W8A8.yaml` |
| [#6543](#6543) | `test_qwen3_235b_w8a8.py` | `Qwen3-235B-A22B-W8A8.yaml` |
| [#6543](#6543) | `test_qwen3_235b_a22b_w8a8_eplb.py` | `Qwen3-235B-A22B-W8A8.yaml` |
| [#3973](#3973) | `test_qwen3_30b_w8a8.py` | `Qwen3-30B-A3B-W8A8.yaml` |
| [#3541](#3541) | `test_qwen3_32b_int8.py` | `Qwen3-32B-Int8.yaml` |
| [#3757](#3757) | `test_qwq_32b.py` | `QwQ-32B.yaml` |
| [#5616](#5616) | `test_qwen3_next_w8a8.py` | `Qwen3-Next-80B-A3B-Instruct-W8A8.yaml` |
| [#3541](#3541) | `test_qwen2_5_vl_7b.py` | `Qwen2.5-VL-7B-Instruct.yaml` |
| [#5301](#5301) | `test_qwen2_5_vl_7b_epd.py` | `Qwen2.5-VL-7B-Instruct-EPD.yaml` |
| [#3707](#3707) | `test_qwen2_5_vl_32b.py` | `Qwen2.5-VL-32B-Instruct.yaml` |
| [#3676](#3676) | `test_qwen3_32b_int8_a3_feature_stack3.py` | `Qwen3-32B-Int8-A3-Feature-Stack3.yaml` |
| [#3709](#3709) | `test_prefix_cache_qwen3_32b_int8.py` | `Prefix-Cache-Qwen3-32B-Int8.yaml` |
| [#5395](#5395) | `test_qwen3_next.py` | `Qwen3-Next-80B-A3B-Instruct-A2.yaml` |
| [#3474](#3474) | `test_qwen3_32b.py` | `Qwen3-32B.yaml` |
| [#3541](#3541) | `test_qwen3_32b_int8.py` | `Qwen3-32B-Int8-A2.yaml` |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------
Signed-off-by: MrZ20 <2609716663@qq.com>
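For context, one of the migrated files (e.g. `Kimi-K2-Thinking.yaml`) might look roughly like the sketch below. The field names are illustrative assumptions based on the settings reported in the Kimi-K2 PR above, not the actual schema consumed by the test runner.

```yaml
# Hypothetical sketch of a migrated nightly-test config; field names are
# illustrative, only the values come from the Kimi-K2-Thinking PR description.
model: moonshotai/Kimi-K2-Thinking
tensor_parallel_size: 16
enable_expert_parallel: true
enable_prefix_caching: false
aisbench_cases:
  - case_type: accuracy
    dataset: vllm-ascend/gsm8k-lite
    trust_remote_code: true
  - case_type: performance
    dataset: vllm-ascend/GSM8K-in3500-bs400
    request_conf: vllm_api_stream_chat
    trust_remote_code: true
```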
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>