
Use runtime profiling to replace manual memory analyzers #81

Merged
zhuohan123 merged 16 commits into main from dynamic-memory-profiler on May 19, 2023
Conversation

@zhuohan123 (Member) commented May 7, 2023

Fixes #59.

Previously we used a manual memory profiler, which had to be implemented separately for each model. This PR replaces it with a general runtime memory profiler.
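A minimal sketch of the idea (illustrative names, not the PR's actual API): run a forward pass with dummy inputs at the maximum batch size, record the peak memory, and size the KV cache from whatever remains:

```python
def num_gpu_blocks(total_gpu_memory: int,
                   peak_profiled_memory: int,
                   block_size: int,
                   kv_bytes_per_token: int,
                   gpu_memory_utilization: float = 0.95) -> int:
    """Number of KV-cache blocks that fit after the profiled peak.

    peak_profiled_memory would come from something like
    torch.cuda.max_memory_allocated() measured after a dummy forward
    pass at the largest batch/sequence size; all names here are
    illustrative, not the PR's actual code.
    """
    usable = int(total_gpu_memory * gpu_memory_utilization) - peak_profiled_memory
    bytes_per_block = block_size * kv_bytes_per_token
    return max(usable // bytes_per_block, 0)

# e.g. a 16 GiB card with a 6 GiB profiled peak, 16-token blocks,
# and 1 MiB of KV cache per token:
blocks = num_gpu_blocks(16 << 30, 6 << 30, 16, 1 << 20)
```

Because the profiling pass exercises the model itself, this works for any architecture without per-model memory formulas.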

@zhuohan123 zhuohan123 requested a review from WoosukKwon May 9, 2023 04:11
@zhuohan123 (Member, Author)

@WoosukKwon This PR is ready for review.

@WoosukKwon (Collaborator) left a comment

Thanks @zhuohan123 for this PR. Please check my comments.

Review threads:

  • cacheflow/master/server.py (outdated)
  • cacheflow/http_frontend/fastapi_frontend.py (outdated)
  • cacheflow/master/server.py
  • cacheflow/master/server.py
  • cacheflow/models/llama.py (outdated)
  • cacheflow/worker/worker.py (outdated)
  • cacheflow/worker/worker.py (outdated)
  • cacheflow/worker/worker.py
  • cacheflow/worker/worker.py (outdated)
  • cacheflow/worker/worker.py (outdated)
@WoosukKwon (Collaborator)
Additionally, I've found that the PR changes the output of the examples in simple_server.py. We should reset the random generator states after the profiling run.
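For illustration of the reseeding idea (stdlib only; a real helper would also reset the NumPy and torch generators), resetting the seed after the profiling run restores the pre-profiling output stream:

```python
import random

def set_random_seed(seed: int) -> None:
    # Sketch only: a real implementation would also call
    # np.random.seed(seed), torch.manual_seed(seed), and
    # torch.cuda.manual_seed_all(seed).
    random.seed(seed)

set_random_seed(0)
baseline = [random.random() for _ in range(3)]   # outputs without profiling

set_random_seed(0)
_ = [random.random() for _ in range(5)]          # profiling run consumes randomness
after_profiling = [random.random() for _ in range(3)]

set_random_seed(0)
_ = [random.random() for _ in range(5)]          # profiling run again
set_random_seed(0)                               # ...but reset afterwards
after_reset = [random.random() for _ in range(3)]
```

The profiling run shifts the generator state, so `after_profiling` differs from `baseline`, while `after_reset` matches it exactly.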

@WoosukKwon (Collaborator)

@zhuohan123 I will start merging the other PRs first, so that this PR does not block any of them.

@zhuohan123 (Member, Author) commented May 13, 2023

@WoosukKwon I have addressed all the review comments and merged with the latest master. Please review again. Feel free to merge it.

One thing I noticed is this bug in the sampler. Because of this bug, I discovered that these two sorts (link1, link2) take a lot of memory (4-5 GB) when sampling 2560 tokens at the same time: they are sorting a tensor of shape (2560, 51200). I'm not sure why these two sorts take that much memory, or whether there is a more memory-efficient way to implement top-p and top-k.
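Some back-of-the-envelope arithmetic (my own estimate, not from the PR) suggests why a full sort is expensive: torch.sort materializes both the sorted fp32 values and int64 indices, before counting any temporary workspace:

```python
def sort_output_bytes(rows: int, cols: int,
                      value_bytes: int = 4, index_bytes: int = 8) -> int:
    """Lower bound on extra memory for a full sort over a (rows, cols)
    tensor: sorted fp32 values plus int64 indices. The sort's temporary
    workspace comes on top of this figure."""
    return rows * cols * (value_bytes + index_bytes)

# The (2560, 51200) logits tensor mentioned above:
est = sort_output_bytes(2560, 51200)   # ~1.57 GB for the outputs alone
```

The remaining gap up to the observed 4-5 GB would be the sort's internal workspace, which is the known PyTorch issue referenced later in this thread.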

@zhuohan123 (Member, Author)

> Additionally, I've found that the PR changes the output of the examples in simple_server.py. We should reset the random generator states after the profiling run.

I haven't added it in this PR, since adding an extra "set_random_seed" call to all the workers after profiling looks redundant: it would require a nested Ray call and would complicate the code. I think it would confuse future readers.

@WoosukKwon (Collaborator)

> I haven't added it in the current PR since it looks pretty redundant to add an additional "set_random_seed" call to all the workers after profiling. It will be a nested ray call and complicates the code. I think this will confuse future people when they read the code.

What do you mean by a "nested ray call"? I didn't quite follow. The purpose of resetting the random state is to make sure that nothing that happens before serving starts affects the outputs. In that sense, I think we should reset the random states after the profiling run, though we don't have to reset them before it.

@WoosukKwon (Collaborator)

> One thing I noticed is this bug in the sampler. With this bug, I discover that these two sorts (link1, link2) take a lot of memory (4-5GB) when sampling for 2560 tokens at the same time. In this case, they are sorting a (2560, 51200) shape tensor. I'm not sure why these two sorts take that much memory and whether there are more memory efficient way to implement topp and topk.

  1. Thanks for finding the bug. As you did, -1 should be replaced with vocab_size.
  2. On the memory consumption: it seems to be a known issue ("Peak GPU-memory usage extremely huge when sorting with torch.sort", pytorch/pytorch#77049). For now, I think we can take max_num_sequences (256 by default) into account when calculating the peak memory usage. This would reduce the memory consumption of sorting by 10x.
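The chunking idea, sketched in pure Python (the real code would slice a torch tensor; the chunk size and names here are illustrative):

```python
def chunked_sorted(rows, chunk_size=256):
    """Sort rows in groups of chunk_size so the sort's temporary memory
    scales with the chunk rather than the full batch (2560 -> 256 rows
    live at once is roughly the 10x reduction mentioned above)."""
    out = []
    for start in range(0, len(rows), chunk_size):
        for row in rows[start:start + chunk_size]:
            out.append(sorted(row, reverse=True))
    return out
```

The results are identical to sorting the whole batch at once; only the peak temporary memory changes.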

Review thread: cacheflow/model_executor/model_loader.py (outdated)
@WoosukKwon (Collaborator) left a comment

LGTM. Thanks!

@zhuohan123 zhuohan123 merged commit f756799 into main May 19, 2023
@zhuohan123 zhuohan123 deleted the dynamic-memory-profiler branch May 24, 2023 04:40
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
dllehr-amd pushed a commit to dllehr-amd/vllm that referenced this pull request Jul 22, 2024
JHLEE17 pushed a commit to JHLEE17/vllm that referenced this pull request Aug 1, 2024
dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request Mar 26, 2025
…io-vllm-cpu-v2-19

Red Hat Konflux update vllm-cpu-v2-19
@ivnle commented Jan 8, 2026

Resolution

Root Cause Identified: BPE tokenizer merges </think> differently based on preceding context:

  • ".</think>" → [4005, 27963, 29] (the period merges with "</")
  • " </think>" → [524, 27963, 29] (the space merges with "</")

The vLLM reasoning parser was using token sequence matching, which failed because it couldn't predict which variant the model would generate.

Fix Implemented: Replaced token-sequence matching with a windowed-decode approach in libs/vllm/vllm/reasoning/olmo3_reasoning_parser.py:

  • Decode last 15 tokens and search for </think> string
  • O(1) complexity, ~8μs overhead per call (<1% at 1000 tok/s)
  • Works for all BPE tokenization variants
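A sketch of the windowed-decode check (here `decode` is a stand-in for the tokenizer's decode; the names and toy vocabulary are illustrative, not the parser's actual code):

```python
END_MARKER = "</think>"
WINDOW = 15  # tokens: large enough to cover any BPE split of the marker

def reasoning_ended(token_ids, decode) -> bool:
    """Detect end-of-reasoning by decoding a small trailing window of
    tokens and searching for the marker string. This is robust to
    context-dependent BPE merges, unlike matching a fixed token-id
    sequence, which breaks when the marker tokenizes differently."""
    return END_MARKER in decode(token_ids[-WINDOW:])

# Toy vocabulary where the marker is split differently by context:
vocab = {0: "so", 1: ".", 2: ".</", 3: "think>", 4: " </", 5: "hello"}
decode = lambda ids: "".join(vocab[i] for i in ids)
```

Both merge variants from the root-cause analysis above are caught by the same string search, since decoding collapses them to identical text.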

Verification:

  • 45/45 unit tests passed
  • Real data: 100% detection rate (vs 2% with old implementation)
  • Scratch experiment scratch_issue81_gcd_fix_olmo_gcd_fix_verify confirmed GCD activates correctly after </think>

Commits (in vLLM fork):

  • 4b30e5f62 - Fix OLMo3 reasoning parser: use windowed decode for detection

Related Issues Created:

@ivnle commented Jan 8, 2026

Fixed in vLLM fork commit 4b30e5f62. See comment above for details.

tianshu-Michael-yu added a commit to tianshu-Michael-yu/vllm that referenced this pull request Feb 13, 2026
lfm2-vl support: lfm2_vl_dense and lfm2_vl_moe are both supported.

## Purpose
Support lfm2_vl_dense and lfm2_vl_moe.

## Test Plan
Added vLLM backend support to our private VLMEvalKit.

## Test Result
The MMStar results from the vLLM backend match those from the HF backend.

---------

Signed-off-by: Paul Pak <paulpak58@gmail.com>
Co-authored-by: Paul Pak <paulpak58@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment


Development

Successfully merging this pull request may close these issues:

  • Profile memory usage

3 participants