
[Paged KV] Add ShareGPT avg generation length tool#127

Merged
LxYuan0420 merged 3 commits into vllm-project:main from WindChimeRan:avg_gen_length
Mar 2, 2026

Conversation

Collaborator

@WindChimeRan WindChimeRan commented Mar 1, 2026

Summary

Add a diagnostic tool for paged KV cache development. Runs offline vLLM inference on ShareGPT prompts and reports response length statistics (mean/std tokens).

The intended workflow is to compare the non-paged (standard MLX cache) and paged (Metal kernel) paths: if a KV cache bugfix improves alignment between the two distributions, that's a strong signal the fix is correct.

Also useful for comparing across batch sizes (--max-num-seqs 1 vs 8) to verify batched decode consistency. Related: #119
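The alignment check described above can be sketched as a simple tolerance comparison on the reported (mean, std) pairs. The function name and tolerance values here are illustrative, not part of the tool:

```python
def distributions_align(stats_a, stats_b, mean_tol=1.0, std_tol=2.0):
    """Rough check that two (mean tokens, std tokens) pairs agree.

    stats_a / stats_b: (mean, std) from, e.g., a non-paged and a paged run.
    Tolerances are illustrative defaults, not values from the PR.
    """
    (mean_a, std_a), (mean_b, std_b) = stats_a, stats_b
    return abs(mean_a - mean_b) <= mean_tol and abs(std_a - std_b) <= std_tol
```

In practice you would run the tool once per cache path and compare the printed statistics; a large gap in either mean or std flags a likely KV cache bug.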

I learned this approach from Table 1 of https://arxiv.org/pdf/2601.11580.

Relevant post from Thinking Machines Lab: https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/

Usage

VLLM_METAL_USE_PAGED_ATTENTION=1 VLLM_METAL_MEMORY_FRACTION=0.7 \
    python tools/avg_gen_length.py --max-num-seqs 1 8
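The core of such a tool can be sketched as follows. This is a hypothetical outline, not the actual `tools/avg_gen_length.py`; the model name and function names are placeholders:

```python
import statistics


def length_stats(token_counts):
    """Mean/std of generated-token counts -- the statistics the tool reports."""
    mean = statistics.fmean(token_counts)
    std = statistics.stdev(token_counts) if len(token_counts) > 1 else 0.0
    return mean, std


def measure(prompts, max_num_seqs, seed, max_tokens=256):
    # Hypothetical core loop: greedy, seeded offline generation with vLLM.
    from vllm import LLM, SamplingParams  # lazy import; requires vLLM installed

    llm = LLM(
        model="facebook/opt-125m",  # placeholder model for illustration
        max_num_seqs=max_num_seqs,
    )
    params = SamplingParams(temperature=0, seed=seed, max_tokens=max_tokens)
    outputs = llm.generate(prompts, params)
    return length_stats([len(o.outputs[0].token_ids) for o in outputs])
```

Running `measure` once per `--max-num-seqs` value and printing the pairs gives the batch-size comparison described above.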

Signed-off-by: ran <hzz5361@psu.edu>
@WindChimeRan WindChimeRan changed the title Add ShareGPT avg generation length tool [Paged KV] Add ShareGPT avg generation length tool Mar 2, 2026
Collaborator

@LxYuan0420 LxYuan0420 left a comment


A few minor changes and I think we're good to merge; I like the direction here. Having small, repeatable benchmark/smoke scripts will really help us validate end-to-end behavior as paged attention evolves.

Nit (optional): could we put this under benchmarks/ instead of tools/ to match the repo layout?

Comment thread tools/avg_gen_length.py Outdated
Comment on lines +7 to +8
huggingface-cli download anon8231489123/ShareGPT_Vicuna_unfiltered \
--repo-type dataset --local-dir . ShareGPT_V3_unfiltered_cleaned_split.json
Collaborator


Can we update the example command to use `hf download ...`, since `huggingface-cli download` is deprecated?

Collaborator Author


done

Comment thread tools/avg_gen_length.py Outdated
max_model_len=max_model_len,
max_num_seqs=max_num_seqs,
)
sampling_params = SamplingParams(temperature=0, seed=42, max_tokens=max_tokens)
Collaborator


The 42 is a hardcoded value; it should be wired to use --seed.
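One way to wire this up with argparse. This is a sketch based on the flags discussed in this thread, not the actual patch:

```python
import argparse


def build_parser():
    # Expose the sampling seed as a CLI flag instead of hardcoding 42.
    parser = argparse.ArgumentParser(description="ShareGPT avg generation length tool")
    parser.add_argument("--seed", type=int, default=42,
                        help="sampling seed passed to SamplingParams")
    parser.add_argument("--max-num-seqs", type=int, nargs="+", default=[1, 8],
                        help="batch sizes to measure")
    return parser


args = build_parser().parse_args(["--seed", "7"])
# The seed then flows into sampling, e.g.:
# sampling_params = SamplingParams(temperature=0, seed=args.seed, max_tokens=...)
```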

Collaborator Author


done

Signed-off-by: ran <hzz5361@psu.edu>
@WindChimeRan WindChimeRan requested a review from LxYuan0420 March 2, 2026 05:24
@LxYuan0420 LxYuan0420 merged commit a00b661 into vllm-project:main Mar 2, 2026
5 checks passed


2 participants