Use runtime profiling to replace manual memory analyzers #81
zhuohan123 merged 16 commits into main
Conversation
@WoosukKwon This PR is ready for review.
WoosukKwon left a comment
Thanks @zhuohan123 for this PR. Please check my comments.
Additionally, I've found that the PR changes the output of the examples in
@zhuohan123 I will start to merge the other PRs first, so that this PR does not block any of them.
@WoosukKwon I fixed all the review comments and merged with the latest master. Please review again; feel free to merge it. One thing I noticed is this bug in the sampler. While looking into it, I discovered that these two sorts (link1, link2) take a lot of memory (4-5 GB) when sampling 2560 tokens at the same time; in that case, they are sorting a tensor of shape (2560, 51200). I'm not sure why these two sorts take so much memory, or whether there is a more memory-efficient way to implement top-p and top-k.
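
For context, here is a rough illustration (plain PyTorch, not the actual sampler code; the helper names are made up) of why a full-vocab sort during top-p/top-k sampling is memory-hungry, and how top-k alone could avoid it:

```python
import torch

# Illustrative sketch (not the actual vLLM sampler): a full sort of a
# (num_tokens, vocab_size) logits tensor materializes same-sized value and
# int64 index tensors plus softmax/cumsum temporaries.
def apply_top_p_with_full_sort(logits: torch.Tensor, top_p: float) -> torch.Tensor:
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, dim=-1, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Zero out tokens outside the nucleus (cumulative mass above top_p).
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    # Scatter the filtered probabilities back into vocabulary order.
    return torch.zeros_like(probs).scatter_(-1, sorted_idx, sorted_probs)

def apply_top_k_without_full_sort(logits: torch.Tensor, k: int) -> torch.Tensor:
    # torch.topk only materializes (num_tokens, k) values/indices, so for
    # top-k alone it avoids allocating full-vocab sort outputs.
    top_vals, _ = torch.topk(logits, k, dim=-1)
    return logits.masked_fill(logits < top_vals[..., -1, None], float("-inf"))
```

For a (2560, 51200) fp32 tensor, each full-sized temporary is roughly 0.5 GB (and the int64 index tensor about 1 GB), so a handful of them easily reaches the observed 4-5 GB peak.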
I haven't added it in the current PR, since it looks pretty redundant to add an additional `set_random_seed` call to all the workers after profiling. It would be a nested Ray call and would complicate the code; I think this would confuse future readers of the code.
What do you mean by a "nested Ray call"? I didn't quite get it. Also, the purpose of resetting the random state is to make sure that nothing that happens before serving starts can affect the outputs. In that sense, I think we should reset the random states after the profiling run, while we don't need to reset them before the profiling run.
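
To make the discussion concrete, a hypothetical sketch of the kind of reset being debated (the real helper and its call sites in the codebase may differ):

```python
import random
import numpy as np
import torch

# Hypothetical sketch: re-seed every RNG after the profiling run so that
# profiling cannot perturb the outputs of the actual serving run.
def set_random_seed(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

# The driver would then call this once per Ray worker after profiling, e.g.
#   ray.get([w.set_random_seed.remote(seed) for w in workers])
```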
Fix #59.
Previously we used a manual memory profiler, which had to be implemented separately for each model. This PR replaces it with a general runtime memory profiler.
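
At a high level, the runtime approach looks roughly like the following sketch (illustrative only; the function name and the `gpu_memory_utilization` parameter are assumptions, not the exact code in this PR):

```python
import torch

# Minimal sketch of the general idea: run one forward pass on dummy inputs at
# the maximum batch size, record the peak GPU memory it needs, and give
# whatever remains (under a utilization cap) to the KV cache.
@torch.no_grad()
def profile_available_cache_memory(model, dummy_inputs,
                                   gpu_memory_utilization: float = 0.95) -> int:
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    model(**dummy_inputs)                      # worst-case activation footprint
    torch.cuda.synchronize()
    peak = torch.cuda.max_memory_allocated()
    total = torch.cuda.get_device_properties(0).total_memory
    return int(total * gpu_memory_utilization) - peak
```

The advantage is that the same measurement works for any model architecture, since it observes actual allocations instead of relying on a hand-written per-model formula.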