
docs(README): add Qwen/Qwen3.6-35B-A3B-FP8 (DFlash spec) to leaderboard #33

Closed

dineshreddy91 wants to merge 1 commit into main from leaderboard/qwen3-6-35b-a3b-dflash

Conversation


dineshreddy91 commented May 8, 2026

Summary

Adds row #6 to the leaderboard for Qwen/Qwen3.6-35B-A3B-FP8 with DFlash
speculative decoding (vllm-project/vllm#40898).

Peak: 523.1 tok/s at 16 workers, ITL 35.4 ms.

Benchmark

uvx vlmbench run \
  -m Qwen/Qwen3.6-35B-A3B-FP8 \
  -d hf://vlm-run/FineVision-vlmbench-mini \
  --max-samples 64 \
  --prompt "Describe this image in 80 words or less" \
  --concurrency 4,8,16,32,64 \
  --base-url "https://<our-cloud-run-endpoint>/v1" \
  --api-key "$M2M"
| Workers | Tok/s | TTFT (ms) | TPOT (ms) | ITL (ms) | Reliability |
|---------|-------|-----------|-----------|----------|-------------|
| 4 | 368.3 | 8,284 | 0.3 | 41.7 | 192/192 |
| 8 | 448.2 | 14,399 | 0.3 | 37.0 | 192/192 |
| 16 | 523.1 | 18,399 | 0.3 | 35.4 | 192/192 |
| 32 | 433.1 | 40,824 | 0.3 | 38.2 | 192/192 |
| 64 | 453.2 | 58,050 | 0.3 | 35.4 | 192/192 |

Throughput regresses past 16 workers because the Cloud Run service is
configured with --concurrency=4 per instance × 2 instances = 8 effective
in-flight; extra workers just queue at the load balancer.
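
For reference, a minimal sketch of the kind of deploy flags that would produce that cap. The service name and image below are placeholders; the actual deployment command is not part of this PR.

# Hypothetical gcloud invocation mirroring the description above:
# 4 in-flight requests per instance × 2 instances = 8 effective concurrency.
gcloud run deploy vllm-qwen \
  --image "<vllm-serving-image>" \
  --concurrency 4 \
  --min-instances 2 \
  --max-instances 2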

Apples-to-oranges caveats

  • 35B model vs the existing 1-8B rows
  • vLLM 0.19.2rc1.dev129+g3cfc8f8b7 (PR-40898 nightly) vs v0.15.1
  • Hosted on Cloud Run GPU with HTTPS round-trip overhead, not local vLLM
  • DFlash spec decoding produces an unrealistically low TPOT (0.3 ms) because
    accepted draft windows return multiple tokens per step; the more honest
    user-perceived metric is ITL (35.4 ms), as the sketch after this list
    illustrates
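
A back-of-envelope illustration of that last caveat. The accepted-window size here is an assumption for illustration, not a number measured in this run.

# One decode step returns several accepted draft tokens, so per-token TPOT
# collapses while ITL still reflects the per-step latency the user waits for.
step_ms=35.4        # one decode step, roughly the reported ITL
tokens_per_step=8   # assumed accepted DFlash draft-window size
echo "TPOT ≈ $(echo "scale=1; $step_ms / $tokens_per_step" | bc) ms per token"
echo "ITL  ≈ ${step_ms} ms per step"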

Test plan

  • All 5 concurrency levels: 192/192 reqs ok
  • Results saved to ~/.vlmbench/benchmarks/

🤖 Generated with Claude Code



Benchmarked the Cloud Run vLLM deployment of Qwen3.6-35B-A3B with
DFlash speculative decoding (vllm-project/vllm#40898) against
FineVision-vlmbench-mini, 64 inputs / 144 images, 3 runs per concurrency.

Peak: 523.1 tok/s at 16 workers, ITL 35.4 ms.

Note vs existing rows:
- 35B params (vs 1-8B in rows 1-5) — fewer tok/s expected
- vLLM 0.19.2rc1.dev129 (PR-40898 branch) vs 0.15.1
- Routes through Cloud Run HTTPS, not local vLLM
- Cloud Run --concurrency=4 caps in-flight at 8 across 2 instances; higher
  client concurrency just queues at the load balancer

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

devin-ai-integration (bot) left a comment


Devin Review found 1 potential issue.

View 1 additional finding in Devin Review.


Comment thread: README.md
| 3 | `PaddlePaddle/PaddleOCR-VL` | 2,341.9 | 64 | 6,385 ms | 49.0 ms |
| 4 | `deepseek-ai/DeepSeek-OCR` | 1,195.8 | 32 | 3,571 ms | 15.9 ms |
| 5 | `Qwen/Qwen3-VL-8B-Instruct` | 953.8 | 64 | 448 ms | 25.7 ms |
| 6 | `Qwen/Qwen3.6-35B-A3B-FP8` (DFlash spec) | 523.1 | 16 | 18,399 ms | 0.3 ms |


🟡 PR does not bump version in vlmbench/version.py as required by CLAUDE.md

CLAUDE.md mandates: "Every PR must bump the version in vlmbench/version.py (__version__ = "X.Y.Z"). If the version is already bumped, do not bump it again." This PR only modifies README.md and does not include a version bump. The version remains at 0.5.5 (unchanged from the prior commit vlmbench/version.py:1).

Prompt for agents
The CLAUDE.md rule requires every PR to bump the version in vlmbench/version.py. The current version is 0.5.5 (defined in vlmbench/version.py). Since this is a docs-only change, a patch bump to 0.5.6 would be appropriate. Edit vlmbench/version.py and change __version__ = "0.5.5" to __version__ = "0.5.6".
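
For example, a one-line way to apply that change; the file path and version strings are quoted from the finding above, and sed is just one way to make the edit.

# Bump the patch version as CLAUDE.md requires.
sed -i 's/__version__ = "0.5.5"/__version__ = "0.5.6"/' vlmbench/version.py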


gemini-code-assist (bot) left a comment


Code Review

This pull request adds a new performance benchmark entry for the Qwen/Qwen3.6-35B-A3B-FP8 model to the README. Feedback was provided to clarify the entry by noting that the environment (Cloud Run) and software version (vLLM nightly) differ from the table's established criteria, and to highlight that the reported TPOT is influenced by speculative decoding.

Comment thread: README.md
| 3 | `PaddlePaddle/PaddleOCR-VL` | 2,341.9 | 64 | 6,385 ms | 49.0 ms |
| 4 | `deepseek-ai/DeepSeek-OCR` | 1,195.8 | 32 | 3,571 ms | 15.9 ms |
| 5 | `Qwen/Qwen3-VL-8B-Instruct` | 953.8 | 64 | 448 ms | 25.7 ms |
| 6 | `Qwen/Qwen3.6-35B-A3B-FP8` (DFlash spec) | 523.1 | 16 | 18,399 ms | 0.3 ms |


Severity: medium

This entry introduces several inconsistencies with the leaderboard's established criteria (defined in line 116):

  1. Environment: The header specifies local hardware (RTX 6000), while this result was obtained via Cloud Run with HTTPS overhead.
  2. Software: The header specifies vLLM v0.15.1, but this used a nightly build (v0.19.2rc1...).
  3. Metrics: The 0.3 ms TPOT is, as noted in the PR description, "unrealistically low" due to speculative decoding and not directly comparable to the other models' TPOT.

To avoid misleading users, these caveats should be explicitly mentioned in the table row to clarify why the results (especially TTFT and TPOT) differ so significantly from the other entries.

Suggested change
| 6 | `Qwen/Qwen3.6-35B-A3B-FP8` (DFlash spec) | 523.1 | 16 | 18,399 ms | 0.3 ms |
| 6 | `Qwen/Qwen3.6-35B-A3B-FP8` (DFlash spec, Cloud Run, vLLM nightly) | 523.1 | 16 | 18,399 ms | 0.3 ms |

@dineshreddy91 dineshreddy91 deleted the leaderboard/qwen3-6-35b-a3b-dflash branch May 8, 2026 23:53
