
Use runtime profiling to replace manual memory analyzers #81

Merged
zhuohan123 merged 16 commits into main from dynamic-memory-profiler on May 19, 2023
Conversation

@zhuohan123 (Member) commented May 7, 2023

Fixes #59.

Previously we used a manual memory profiler, which had to be implemented separately for each model. This PR replaces it with a general runtime memory profiler.
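A minimal sketch of the idea (illustrative names, not the PR's actual API): run a forward pass with dummy inputs at the maximum batch size, record the peak memory, and size the KV cache from whatever remains:

```python
def num_gpu_blocks(total_gpu_memory: int,
                   peak_profiled_memory: int,
                   block_size: int,
                   kv_bytes_per_token: int,
                   gpu_memory_utilization: float = 0.95) -> int:
    """Number of KV-cache blocks that fit after the profiled peak.

    peak_profiled_memory would come from something like
    torch.cuda.max_memory_allocated() measured after a dummy forward
    pass at the largest batch/sequence size; all names here are
    illustrative, not the PR's actual code.
    """
    usable = int(total_gpu_memory * gpu_memory_utilization) - peak_profiled_memory
    bytes_per_block = block_size * kv_bytes_per_token
    return max(usable // bytes_per_block, 0)

# e.g. a 16 GiB card with a 6 GiB profiled peak, 16-token blocks,
# and 1 MiB of KV cache per token:
blocks = num_gpu_blocks(16 << 30, 6 << 30, 16, 1 << 20)
```

Because the profiling pass exercises the model itself, this works for any architecture without per-model memory formulas.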

@zhuohan123 zhuohan123 requested a review from WoosukKwon May 9, 2023 04:11
@zhuohan123 (Member, Author)

@WoosukKwon This PR is ready for review.

@WoosukKwon (Collaborator) left a comment

Thanks @zhuohan123 for this PR. Please check my comments.

Review threads:

  • cacheflow/master/server.py (outdated)
  • cacheflow/http_frontend/fastapi_frontend.py (outdated)
  • cacheflow/master/server.py
  • cacheflow/master/server.py
  • cacheflow/models/llama.py (outdated)
  • cacheflow/worker/worker.py (outdated)
  • cacheflow/worker/worker.py (outdated)
  • cacheflow/worker/worker.py
  • cacheflow/worker/worker.py (outdated)
  • cacheflow/worker/worker.py (outdated)
@WoosukKwon (Collaborator)
Additionally, I've found that the PR changes the output of the examples in simple_server.py. We should reset the random generator states after the profiling run.
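For illustration of the reseeding idea (stdlib only; a real helper would also reset the NumPy and torch generators), resetting the seed after the profiling run restores the pre-profiling output stream:

```python
import random

def set_random_seed(seed: int) -> None:
    # Sketch only: a real implementation would also call
    # np.random.seed(seed), torch.manual_seed(seed), and
    # torch.cuda.manual_seed_all(seed).
    random.seed(seed)

set_random_seed(0)
baseline = [random.random() for _ in range(3)]   # outputs without profiling

set_random_seed(0)
_ = [random.random() for _ in range(5)]          # profiling run consumes randomness
after_profiling = [random.random() for _ in range(3)]

set_random_seed(0)
_ = [random.random() for _ in range(5)]          # profiling run again
set_random_seed(0)                               # ...but reset afterwards
after_reset = [random.random() for _ in range(3)]
```

The profiling run shifts the generator state, so `after_profiling` differs from `baseline`, while `after_reset` matches it exactly.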

@WoosukKwon (Collaborator)

@zhuohan123 I will start merging the other PRs first, so that this PR does not block any of them.

@zhuohan123 (Member, Author) commented May 13, 2023

@WoosukKwon I have addressed all the review comments and merged with the latest master. Please review again. Feel free to merge it.

One thing I noticed is this bug in the sampler. Because of this bug, I discovered that these two sorts (link1, link2) take a lot of memory (4-5 GB) when sampling 2560 tokens at the same time: they are sorting a tensor of shape (2560, 51200). I'm not sure why these two sorts take that much memory, or whether there is a more memory-efficient way to implement top-p and top-k.
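Some back-of-the-envelope arithmetic (my own estimate, not from the PR) suggests why a full sort is expensive: torch.sort materializes both the sorted fp32 values and int64 indices, before counting any temporary workspace:

```python
def sort_output_bytes(rows: int, cols: int,
                      value_bytes: int = 4, index_bytes: int = 8) -> int:
    """Lower bound on extra memory for a full sort over a (rows, cols)
    tensor: sorted fp32 values plus int64 indices. The sort's temporary
    workspace comes on top of this figure."""
    return rows * cols * (value_bytes + index_bytes)

# The (2560, 51200) logits tensor mentioned above:
est = sort_output_bytes(2560, 51200)   # ~1.57 GB for the outputs alone
```

The remaining gap up to the observed 4-5 GB would be the sort's internal workspace, which is the known PyTorch issue referenced later in this thread.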

@zhuohan123 (Member, Author)

> Additionally, I've found that the PR changes the output of the examples in simple_server.py. We should reset the random generator states after the profiling run.

I haven't added it in this PR, since adding an extra "set_random_seed" call to all the workers after profiling looks redundant: it would require a nested Ray call and would complicate the code. I think it would confuse future readers.

@WoosukKwon (Collaborator)

> I haven't added it in the current PR since it looks pretty redundant to add an additional "set_random_seed" call to all the workers after profiling. It will be a nested ray call and complicates the code. I think this will confuse future people when they read the code.

What do you mean by a "nested ray call"? I didn't quite follow. The purpose of resetting the random state is to make sure that nothing that happens before serving starts affects the outputs. In that sense, I think we should reset the random states after the profiling run, though we don't have to reset them before it.

@WoosukKwon (Collaborator)

> One thing I noticed is this bug in the sampler. With this bug, I discover that these two sorts (link1, link2) take a lot of memory (4-5GB) when sampling for 2560 tokens at the same time. In this case, they are sorting a (2560, 51200) shape tensor. I'm not sure why these two sorts take that much memory and whether there are more memory efficient way to implement topp and topk.

  1. Thanks for finding the bug. As you did, -1 should be replaced with vocab_size.
  2. On the memory consumption: it seems to be a known issue ("Peak GPU-memory usage extremely huge when sorting with torch.sort", pytorch/pytorch#77049). For now, I think we can take max_num_sequences (256 by default) into account when calculating the peak memory usage. This would reduce the memory consumption of sorting by 10x.
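The chunking idea, sketched in pure Python (the real code would slice a torch tensor; the chunk size and names here are illustrative):

```python
def chunked_sorted(rows, chunk_size=256):
    """Sort rows in groups of chunk_size so the sort's temporary memory
    scales with the chunk rather than the full batch (2560 -> 256 rows
    live at once is roughly the 10x reduction mentioned above)."""
    out = []
    for start in range(0, len(rows), chunk_size):
        for row in rows[start:start + chunk_size]:
            out.append(sorted(row, reverse=True))
    return out
```

The results are identical to sorting the whole batch at once; only the peak temporary memory changes.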

Review thread: cacheflow/model_executor/model_loader.py (outdated)
@WoosukKwon (Collaborator) left a comment

LGTM. Thanks!

@zhuohan123 zhuohan123 merged commit f756799 into main May 19, 2023
@zhuohan123 zhuohan123 deleted the dynamic-memory-profiler branch May 24, 2023 04:40
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
dllehr-amd pushed a commit to dllehr-amd/vllm that referenced this pull request Jul 22, 2024
JHLEE17 pushed a commit to JHLEE17/vllm that referenced this pull request Aug 1, 2024
dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request Mar 26, 2025
…io-vllm-cpu-v2-19

Red Hat Konflux update vllm-cpu-v2-19
@ivnle commented Jan 8, 2026

Resolution

Root Cause Identified: BPE tokenizer merges </think> differently based on preceding context:

  • ".</think>" → [4005, 27963, 29] (the period merges with "</")
  • " </think>" → [524, 27963, 29] (the space merges with "</")

The vLLM reasoning parser was using token sequence matching, which failed because it couldn't predict which variant the model would generate.

Fix Implemented: Replaced token-sequence matching with a windowed-decode approach in libs/vllm/vllm/reasoning/olmo3_reasoning_parser.py:

  • Decode last 15 tokens and search for </think> string
  • O(1) complexity, ~8μs overhead per call (<1% at 1000 tok/s)
  • Works for all BPE tokenization variants
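A sketch of the windowed-decode check (here `decode` is a stand-in for the tokenizer's decode; the names and toy vocabulary are illustrative, not the parser's actual code):

```python
END_MARKER = "</think>"
WINDOW = 15  # tokens: large enough to cover any BPE split of the marker

def reasoning_ended(token_ids, decode) -> bool:
    """Detect end-of-reasoning by decoding a small trailing window of
    tokens and searching for the marker string. This is robust to
    context-dependent BPE merges, unlike matching a fixed token-id
    sequence, which breaks when the marker tokenizes differently."""
    return END_MARKER in decode(token_ids[-WINDOW:])

# Toy vocabulary where the marker is split differently by context:
vocab = {0: "so", 1: ".", 2: ".</", 3: "think>", 4: " </", 5: "hello"}
decode = lambda ids: "".join(vocab[i] for i in ids)
```

Both merge variants from the root-cause analysis above are caught by the same string search, since decoding collapses them to identical text.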

Verification:

  • 45/45 unit tests passed
  • Real data: 100% detection rate (vs 2% with old implementation)
  • Scratch experiment scratch_issue81_gcd_fix_olmo_gcd_fix_verify confirmed GCD activates correctly after </think>

Commits (in vLLM fork):

  • 4b30e5f62 - Fix OLMo3 reasoning parser: use windowed decode for detection

Related Issues Created:

@ivnle commented Jan 8, 2026

Fixed in vLLM fork commit 4b30e5f62. See comment above for details.

tianshu-Michael-yu added a commit to tianshu-Michael-yu/vllm that referenced this pull request Feb 13, 2026
lfm2-vl support: lfm2_vl_dense and lfm2_vl_moe are both supported.

## Purpose
Support lfm2_vl_dense and lfm2_vl_moe.

## Test Plan
Added vLLM backend support to our private VLMEvalKit.

## Test Result
The MMStar results from the vLLM backend match those from the HF backend.

---------

Signed-off-by: Paul Pak <paulpak58@gmail.com>
Co-authored-by: Paul Pak <paulpak58@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment


Development

Successfully merging this pull request may close these issues:

  • Profile memory usage

3 participants