docs(README): add Qwen/Qwen3.6-35B-A3B-FP8 (DFlash spec) to leaderboard #33
dineshreddy91 wants to merge 1 commit into main from
Conversation
Benchmarked the Cloud Run vLLM deployment of Qwen3.6-35B-A3B with DFlash speculative decoding (vllm-project/vllm#40898) against FineVision-vlmbench-mini, 64 inputs / 144 images, 3 runs per concurrency. Peak: 523.1 tok/s at 16 workers, ITL 35.4 ms.

Notes vs existing rows:
- 35B params (vs 1-8B in rows 1-5), so fewer tok/s are expected
- vLLM 0.19.2rc1.dev129 (PR-40898 branch) vs 0.15.1
- Routes through Cloud Run HTTPS, not local vLLM
- Cloud Run `--concurrency=4` caps in-flight requests at 8 across 2 instances; higher client concurrency just queues at the load balancer

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| 3 | `PaddlePaddle/PaddleOCR-VL` | 2,341.9 | 64 | 6,385 ms | 49.0 ms |
| 4 | `deepseek-ai/DeepSeek-OCR` | 1,195.8 | 32 | 3,571 ms | 15.9 ms |
| 5 | `Qwen/Qwen3-VL-8B-Instruct` | 953.8 | 64 | 448 ms | 25.7 ms |
| 6 | `Qwen/Qwen3.6-35B-A3B-FP8` (DFlash spec) | 523.1 | 16 | 18,399 ms | 0.3 ms |
🟡 PR does not bump version in vlmbench/version.py as required by CLAUDE.md
CLAUDE.md mandates: "Every PR must bump the version in vlmbench/version.py (__version__ = "X.Y.Z"). If the version is already bumped, do not bump it again." This PR only modifies README.md and does not include a version bump. The version remains at 0.5.5 (unchanged from the prior commit vlmbench/version.py:1).
Prompt for agents
The CLAUDE.md rule requires every PR to bump the version in vlmbench/version.py. The current version is 0.5.5 (defined in vlmbench/version.py). Since this is a docs-only change, a patch bump to 0.5.6 would be appropriate. Edit vlmbench/version.py and change __version__ = "0.5.5" to __version__ = "0.5.6".
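As a sketch, the requested patch bump would leave `vlmbench/version.py` reading:

```python
# vlmbench/version.py
# Patch bump for this docs-only change, per the CLAUDE.md rule cited above.
__version__ = "0.5.6"
```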
Code Review
This pull request adds a new performance benchmark entry for the Qwen/Qwen3.6-35B-A3B-FP8 model to the README. Feedback was provided to clarify the entry by noting that the environment (Cloud Run) and software version (vLLM nightly) differ from the table's established criteria, and to highlight that the reported TPOT is influenced by speculative decoding.
| 3 | `PaddlePaddle/PaddleOCR-VL` | 2,341.9 | 64 | 6,385 ms | 49.0 ms |
| 4 | `deepseek-ai/DeepSeek-OCR` | 1,195.8 | 32 | 3,571 ms | 15.9 ms |
| 5 | `Qwen/Qwen3-VL-8B-Instruct` | 953.8 | 64 | 448 ms | 25.7 ms |
| 6 | `Qwen/Qwen3.6-35B-A3B-FP8` (DFlash spec) | 523.1 | 16 | 18,399 ms | 0.3 ms |
This entry introduces several inconsistencies with the leaderboard's established criteria (defined in line 116):
- Environment: The header specifies local hardware (RTX 6000), while this result was obtained via Cloud Run with HTTPS overhead.
- Software: The header specifies vLLM v0.15.1, but this used a nightly build (v0.19.2rc1...).
- Metrics: The 0.3 ms TPOT is, as noted in the PR description, "unrealistically low" due to speculative decoding and not directly comparable to the other models' TPOT.
To avoid misleading users, these caveats should be explicitly mentioned in the table row to clarify why the results (especially TTFT and TPOT) differ so significantly from the other entries.
| 6 | `Qwen/Qwen3.6-35B-A3B-FP8` (DFlash spec) | 523.1 | 16 | 18,399 ms | 0.3 ms |
| 6 | `Qwen/Qwen3.6-35B-A3B-FP8` (DFlash spec, Cloud Run, vLLM nightly) | 523.1 | 16 | 18,399 ms | 0.3 ms |
Summary
Adds row #6 to the leaderboard for `Qwen/Qwen3.6-35B-A3B-FP8` with DFlash speculative decoding (vllm-project/vllm#40898).
Peak: 523.1 tok/s at 16 workers, ITL 35.4 ms.
Benchmark
Throughput regresses past 16 workers because the Cloud Run service is configured with `--concurrency=4` per instance × 2 instances = 8 effective in-flight requests; extra workers just queue at the load balancer.
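For intuition, here is a back-of-the-envelope sketch of the in-flight vs. queued split implied by that configuration (not part of the benchmark harness; the worker counts are chosen for illustration):

```python
# Model of the Cloud Run bottleneck described above.
# Values from the PR: --concurrency=4 per instance, 2 instances.
CONCURRENCY_PER_INSTANCE = 4
INSTANCES = 2
EFFECTIVE_IN_FLIGHT = CONCURRENCY_PER_INSTANCE * INSTANCES  # 8

for client_workers in (8, 16, 32, 64):
    in_flight = min(client_workers, EFFECTIVE_IN_FLIGHT)
    queued = max(0, client_workers - EFFECTIVE_IN_FLIGHT)
    print(f"{client_workers:>2} workers -> {in_flight} in flight, {queued} queued at the load balancer")
```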
Apples-to-oranges caveats
- vLLM 0.19.2rc1 nightly (PR-40898 branch), not the table's v0.15.1
- TPOT (0.3 ms) is unrealistically low: accepted draft windows return multiple tokens per step; the more honest user-perceived metric is ITL (35.4 ms)
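Below is an illustrative sketch (hypothetical window sizes, not the harness's actual accounting) of why speculative decoding collapses TPOT while ITL stays meaningful: each accepted DFlash draft window emits several tokens in one decode step, so dividing step time by token count yields a tiny per-token figure, but the gap a user waits between chunks is still roughly one step.

```python
# Illustrative only: TPOT vs. ITL when one decode step can emit many tokens.
step_time_ms = 35.4                 # observed gap between decode steps (ITL)
tokens_per_step = [8, 12, 16, 10]   # hypothetical accepted draft-window sizes

total_tokens = sum(tokens_per_step)
total_decode_ms = step_time_ms * len(tokens_per_step)

tpot_ms = total_decode_ms / total_tokens          # per-token figure: tiny
itl_ms = total_decode_ms / len(tokens_per_step)   # per-step gap the user sees

print(f"TPOT ~ {tpot_ms:.1f} ms/token (looks unrealistically low)")
print(f"ITL  ~ {itl_ms:.1f} ms/step")
```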
Test plan
`~/.vlmbench/benchmarks/`

🤖 Generated with Claude Code