docs(README): add Qwen/Qwen3.6-35B-A3B-FP8 (DFlash spec) to leaderboard #33
dineshreddy91 wants to merge 1 commit into main from
Conversation
Benchmarked the Cloud Run vLLM deployment of Qwen3.6-35B-A3B with DFlash speculative decoding (vllm-project/vllm#40898) against FineVision-vlmbench-mini, 64 inputs / 144 images, 3 runs per concurrency. Peak: 523.1 tok/s at 16 workers, ITL 35.4 ms.

Notes vs existing rows:
- 35B params (vs 1-8B in rows 1-5), so fewer tok/s are expected
- vLLM 0.19.2rc1.dev129 (PR-40898 branch) vs 0.15.1
- Routes through Cloud Run HTTPS, not local vLLM
- Cloud Run `--concurrency=4` caps in-flight requests at 8 across 2 instances; higher client concurrency just queues at the load balancer

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| 3 | `PaddlePaddle/PaddleOCR-VL` | 2,341.9 | 64 | 6,385 ms | 49.0 ms |
| 4 | `deepseek-ai/DeepSeek-OCR` | 1,195.8 | 32 | 3,571 ms | 15.9 ms |
| 5 | `Qwen/Qwen3-VL-8B-Instruct` | 953.8 | 64 | 448 ms | 25.7 ms |
| 6 | `Qwen/Qwen3.6-35B-A3B-FP8` (DFlash spec) | 523.1 | 16 | 18,399 ms | 0.3 ms |
🟡 PR does not bump version in vlmbench/version.py as required by CLAUDE.md
CLAUDE.md mandates: "Every PR must bump the version in vlmbench/version.py (__version__ = "X.Y.Z"). If the version is already bumped, do not bump it again." This PR only modifies README.md and does not include a version bump. The version remains at 0.5.5 (unchanged from the prior commit vlmbench/version.py:1).
Prompt for agents
The CLAUDE.md rule requires every PR to bump the version in vlmbench/version.py. The current version is 0.5.5 (defined in vlmbench/version.py). Since this is a docs-only change, a patch bump to 0.5.6 would be appropriate. Edit vlmbench/version.py and change __version__ = "0.5.5" to __version__ = "0.5.6".
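As a sketch, the requested patch bump would leave `vlmbench/version.py` reading:

```python
# vlmbench/version.py
# Patch bump for this docs-only change, per the CLAUDE.md rule cited above.
__version__ = "0.5.6"
```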
Code Review
This pull request adds a new performance benchmark entry for the Qwen/Qwen3.6-35B-A3B-FP8 model to the README. Feedback was provided to clarify the entry by noting that the environment (Cloud Run) and software version (vLLM nightly) differ from the table's established criteria, and to highlight that the reported TPOT is influenced by speculative decoding.
| 3 | `PaddlePaddle/PaddleOCR-VL` | 2,341.9 | 64 | 6,385 ms | 49.0 ms |
| 4 | `deepseek-ai/DeepSeek-OCR` | 1,195.8 | 32 | 3,571 ms | 15.9 ms |
| 5 | `Qwen/Qwen3-VL-8B-Instruct` | 953.8 | 64 | 448 ms | 25.7 ms |
| 6 | `Qwen/Qwen3.6-35B-A3B-FP8` (DFlash spec) | 523.1 | 16 | 18,399 ms | 0.3 ms |
This entry introduces several inconsistencies with the leaderboard's established criteria (defined in line 116):
- Environment: The header specifies local hardware (RTX 6000), while this result was obtained via Cloud Run with HTTPS overhead.
- Software: The header specifies vLLM v0.15.1, but this used a nightly build (v0.19.2rc1...).
- Metrics: The 0.3 ms TPOT is, as noted in the PR description, "unrealistically low" due to speculative decoding and not directly comparable to the other models' TPOT.
To avoid misleading users, these caveats should be explicitly mentioned in the table row to clarify why the results (especially TTFT and TPOT) differ so significantly from the other entries.
| 6 | `Qwen/Qwen3.6-35B-A3B-FP8` (DFlash spec) | 523.1 | 16 | 18,399 ms | 0.3 ms |
| 6 | `Qwen/Qwen3.6-35B-A3B-FP8` (DFlash spec, Cloud Run, vLLM nightly) | 523.1 | 16 | 18,399 ms | 0.3 ms |
Summary
Adds row #6 to the leaderboard for `Qwen/Qwen3.6-35B-A3B-FP8` with DFlash speculative decoding (vllm-project/vllm#40898).
Peak: 523.1 tok/s at 16 workers, ITL 35.4 ms.
Benchmark
Throughput regresses past 16 workers because the Cloud Run service is configured with `--concurrency=4` per instance × 2 instances = 8 effective in-flight requests; extra workers just queue at the load balancer.
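For intuition, here is a back-of-the-envelope sketch of the in-flight vs. queued split implied by that configuration (not part of the benchmark harness; the worker counts are chosen for illustration):

```python
# Model of the Cloud Run bottleneck described above.
# Values from the PR: --concurrency=4 per instance, 2 instances.
CONCURRENCY_PER_INSTANCE = 4
INSTANCES = 2
EFFECTIVE_IN_FLIGHT = CONCURRENCY_PER_INSTANCE * INSTANCES  # 8

for client_workers in (8, 16, 32, 64):
    in_flight = min(client_workers, EFFECTIVE_IN_FLIGHT)
    queued = max(0, client_workers - EFFECTIVE_IN_FLIGHT)
    print(f"{client_workers:>2} workers -> {in_flight} in flight, {queued} queued at the load balancer")
```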
Apples-to-oranges caveats
- vLLM 0.19.2rc1 nightly (PR-40898 branch), not the table's v0.15.1
- TPOT (0.3 ms) is unrealistically low: accepted draft windows return multiple tokens per step; the more honest user-perceived metric is ITL (35.4 ms)
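Below is an illustrative sketch (hypothetical window sizes, not the harness's actual accounting) of why speculative decoding collapses TPOT while ITL stays meaningful: each accepted DFlash draft window emits several tokens in one decode step, so dividing step time by token count yields a tiny per-token figure, but the gap a user waits between chunks is still roughly one step.

```python
# Illustrative only: TPOT vs. ITL when one decode step can emit many tokens.
step_time_ms = 35.4                 # observed gap between decode steps (ITL)
tokens_per_step = [8, 12, 16, 10]   # hypothetical accepted draft-window sizes

total_tokens = sum(tokens_per_step)
total_decode_ms = step_time_ms * len(tokens_per_step)

tpot_ms = total_decode_ms / total_tokens          # per-token figure: tiny
itl_ms = total_decode_ms / len(tokens_per_step)   # per-step gap the user sees

print(f"TPOT ~ {tpot_ms:.1f} ms/token (looks unrealistically low)")
print(f"ITL  ~ {itl_ms:.1f} ms/step")
```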
Test plan
`~/.vlmbench/benchmarks/`

🤖 Generated with Claude Code