
[NVIDIA] Add Kimi K2.5 NVFP4 GB200 disaggregated vLLM benchmarks via Dynamo #1008

Merged
functionstackx merged 2 commits into main from nv/kimi2.5-disagg-gb200-8k1k-1k1k on Apr 7, 2026

Add Kimi K2.5 NVFP4 GB200 disaggregated vLLM benchmarks via Dynamo

3e358a8
Claude / Claude Code Review completed Apr 6, 2026 in 10m 8s

Code review found 2 potential issues

Found 5 candidates, confirmed 2. See review comments for details.

Details

| Severity | Count |
| --- | --- |
| 🔴 Important | 0 |
| 🟡 Nit | 2 |
| 🟣 Pre-existing | 0 |

| Severity | File:Line | Issue |
| --- | --- | --- |
| 🟡 Nit | runners/launch_gb200-nv.sh:42-43 | Misleading error message omits precision constraint for dynamo-vllm |
| 🟡 Nit | .github/configs/nvidia-master.yaml:6694 | Missing concurrency 64 in 8k1k 1P4D sweep |

Annotations

Check warning on line 43 in runners/launch_gb200-nv.sh

@claude claude / Claude Code Review

Misleading error message omits precision constraint for dynamo-vllm

The error message on line 42 says "Supported prefixes for dynamo-vllm: kimik2.5", but the guard condition checks both `MODEL_PREFIX == "kimik2.5"` and `PRECISION == "fp4"`. A user who passes MODEL_PREFIX=kimik2.5 with PRECISION=fp8 therefore sees a self-contradictory message that reports kimik2.5 as unsupported while also listing it as supported, which hides the real problem (an unsupported precision). The message should instead say "Supported combinations for dynamo-vllm: kimik2.5/fp4" so that the constraint is explicit.
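
A minimal sketch of a guard that reports the precision constraint, assuming the check compares the two values described in this comment. The function name `check_dynamo_vllm_combo`, the use of positional arguments, and the wording of the first half of the message are hypothetical; the actual code in runners/launch_gb200-nv.sh is not shown here and may be structured differently.

```shell
# Hypothetical helper, not taken from the PR.
# $1 = model prefix, $2 = precision.
check_dynamo_vllm_combo() {
    # Only the kimik2.5 + fp4 pair is accepted, per the review comment.
    if [ "$1" = "kimik2.5" ] && [ "$2" = "fp4" ]; then
        return 0
    fi
    # Echo the rejected pair and the supported pair, so a precision
    # mismatch is visible instead of looking like a prefix problem.
    echo "Unsupported model prefix/precision for dynamo-vllm: $1/$2. Supported combinations for dynamo-vllm: kimik2.5/fp4" >&2
    return 1
}
```

With this shape, calling `check_dynamo_vllm_combo kimik2.5 fp8` fails and names `kimik2.5/fp8` as the rejected pair, which points the user at the precision rather than the prefix.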

Check warning on line 6694 in .github/configs/nvidia-master.yaml

@claude claude / Claude Code Review

Missing concurrency 64 in 8k1k 1P4D sweep

The 8k1k 1P4D search-space has `conc-list: [4, 8, 16, 32, 128]`, skipping 64, while the structurally identical 1k1k 1P4D config has `[4, 8, 16, 32, 64, 128]`. This creates a gap in the performance curve at concurrency=64 for the low-latency 8k1k configuration and appears to be a copy-paste oversight.
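
The gap can be confirmed with a quick comparison of the two lists. This is only an illustrative sketch: the values are copied from the comment above, while the variable names and the `missing_conc` helper are hypothetical and do not exist in the repository.

```shell
# Concurrency lists as quoted in the review comment.
conc_1k1k="4 8 16 32 64 128"
conc_8k1k="4 8 16 32 128"

# Hypothetical helper: print each value of list $1 that is absent from list $2.
missing_conc() {
    for c in $1; do
        # Pad with spaces so that e.g. 4 does not match inside 64.
        case " $2 " in
            *" $c "*) ;;
            *) echo "$c" ;;
        esac
    done
}

# Values present in the 1k1k sweep but missing from the 8k1k sweep.
missing_conc "$conc_1k1k" "$conc_8k1k"   # prints 64
```

If the omission is unintentional, adding 64 to the 8k1k 1P4D `conc-list` would make the two sweeps match.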