
[Paged KV] Add ShareGPT avg generation length tool#127

Merged
LxYuan0420 merged 3 commits into vllm-project:main from WindChimeRan:avg_gen_length
Mar 2, 2026

Conversation

Collaborator

@WindChimeRan WindChimeRan commented Mar 1, 2026

Summary

Add a diagnostic tool for paged KV cache development. Runs offline vLLM inference on ShareGPT prompts and reports response length statistics (mean/std tokens).

The intended workflow is to compare the non-paged (standard MLX cache) and paged (Metal kernel) paths: if a KV cache bugfix improves alignment between the two distributions, that's a strong signal the fix is correct.

Also useful for comparing across batch sizes (--max-num-seqs 1 vs 8) to verify batched decode consistency. Related: #119
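The alignment check described above can be sketched as a simple tolerance comparison on the reported (mean, std) pairs. The function name and tolerance values here are illustrative, not part of the tool:

```python
def distributions_align(stats_a, stats_b, mean_tol=1.0, std_tol=2.0):
    """Rough check that two (mean tokens, std tokens) pairs agree.

    stats_a / stats_b: (mean, std) from, e.g., a non-paged and a paged run.
    Tolerances are illustrative defaults, not values from the PR.
    """
    (mean_a, std_a), (mean_b, std_b) = stats_a, stats_b
    return abs(mean_a - mean_b) <= mean_tol and abs(std_a - std_b) <= std_tol
```

In practice you would run the tool once per cache path and compare the printed statistics; a large gap in either mean or std flags a likely KV cache bug.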

I learned this approach from Table 1 of https://arxiv.org/pdf/2601.11580.

Relevant post from Thinking Machines Lab: https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/

Usage

VLLM_METAL_USE_PAGED_ATTENTION=1 VLLM_METAL_MEMORY_FRACTION=0.7 \
    python tools/avg_gen_length.py --max-num-seqs 1 8
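The core of such a tool can be sketched as follows. This is a hypothetical outline, not the actual `tools/avg_gen_length.py`; the model name and function names are placeholders:

```python
import statistics


def length_stats(token_counts):
    """Mean/std of generated-token counts -- the statistics the tool reports."""
    mean = statistics.fmean(token_counts)
    std = statistics.stdev(token_counts) if len(token_counts) > 1 else 0.0
    return mean, std


def measure(prompts, max_num_seqs, seed, max_tokens=256):
    # Hypothetical core loop: greedy, seeded offline generation with vLLM.
    from vllm import LLM, SamplingParams  # lazy import; requires vLLM installed

    llm = LLM(
        model="facebook/opt-125m",  # placeholder model for illustration
        max_num_seqs=max_num_seqs,
    )
    params = SamplingParams(temperature=0, seed=seed, max_tokens=max_tokens)
    outputs = llm.generate(prompts, params)
    return length_stats([len(o.outputs[0].token_ids) for o in outputs])
```

Running `measure` once per `--max-num-seqs` value and printing the pairs gives the batch-size comparison described above.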

Signed-off-by: ran <hzz5361@psu.edu>
@WindChimeRan WindChimeRan changed the title Add ShareGPT avg generation length tool [Paged KV] Add ShareGPT avg generation length tool Mar 2, 2026
Collaborator

@LxYuan0420 LxYuan0420 left a comment


A few minor changes and I think we're good to merge; I like the direction here. Having small, repeatable benchmark/smoke scripts will really help us validate end-to-end behavior as paged attention evolves.

Nit (optional): could we put this under benchmarks/ instead of tools/ to match the repo layout?

Comment thread tools/avg_gen_length.py Outdated
Comment on lines +7 to +8
huggingface-cli download anon8231489123/ShareGPT_Vicuna_unfiltered \
--repo-type dataset --local-dir . ShareGPT_V3_unfiltered_cleaned_split.json
Collaborator


Can we update the example command to use `hf download ...`, since `huggingface-cli download` is deprecated?

Collaborator Author


done

Comment thread tools/avg_gen_length.py Outdated
max_model_len=max_model_len,
max_num_seqs=max_num_seqs,
)
sampling_params = SamplingParams(temperature=0, seed=42, max_tokens=max_tokens)
Collaborator


The 42 is a hardcoded value; it should be wired to use --seed.
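One way to wire this up with argparse. This is a sketch based on the flags discussed in this thread, not the actual patch:

```python
import argparse


def build_parser():
    # Expose the sampling seed as a CLI flag instead of hardcoding 42.
    parser = argparse.ArgumentParser(description="ShareGPT avg generation length tool")
    parser.add_argument("--seed", type=int, default=42,
                        help="sampling seed passed to SamplingParams")
    parser.add_argument("--max-num-seqs", type=int, nargs="+", default=[1, 8],
                        help="batch sizes to measure")
    return parser


args = build_parser().parse_args(["--seed", "7"])
# The seed then flows into sampling, e.g.:
# sampling_params = SamplingParams(temperature=0, seed=args.seed, max_tokens=...)
```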

Collaborator Author


done

Signed-off-by: ran <hzz5361@psu.edu>
@WindChimeRan WindChimeRan requested a review from LxYuan0420 March 2, 2026 05:24
@LxYuan0420 LxYuan0420 merged commit a00b661 into vllm-project:main Mar 2, 2026
5 checks passed


2 participants