feat: [1/2] Top-k and Top-p support for dtensor worker with vLLM V0 when TP==1 (#773)
base: main
Conversation
Pull Request Overview
This PR adds support for top-k and top-p sampling in the dtensor worker when using vLLM v0 engine with tensor parallelism size equal to 1. The implementation includes the sampling logic and comprehensive test coverage.
Key changes:
- Implements `apply_top_k_top_p` and `apply_top_k_only` functions for sampling logits (a sketch follows this list)
- Updates the dtensor policy worker to apply sampling parameters after temperature scaling for vLLM v0
- Adds comprehensive unit tests for the sampling utilities
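
For reference, a minimal sketch of the vLLM-style masking these functions are based on (scalar `k`/`p` for brevity; the actual implementation handles per-request tensors):

```python
import torch

def apply_top_k_top_p(logits: torch.Tensor, k: int | None, p: float | None) -> torch.Tensor:
    """Sketch: mask logits outside the top-k / top-p (nucleus) set with -inf."""
    logits_sort, logits_idx = logits.sort(dim=-1, descending=False)  # ascending
    if k is not None:
        # The k-th largest sorted logit is the cutoff (assumes k <= vocab size);
        # everything strictly below it is masked out.
        top_k_mask = logits_sort < logits_sort[..., -k, None]
        logits_sort.masked_fill_(top_k_mask, float("-inf"))
    if p is not None:
        probs_sum = logits_sort.softmax(dim=-1).cumsum(dim=-1)
        # Drop the low-probability tail whose cumulative mass is <= 1 - p,
        # but never drop the most likely token (last in ascending order).
        top_p_mask = probs_sum <= 1 - p
        top_p_mask[..., -1] = False
        logits_sort.masked_fill_(top_p_mask, float("-inf"))
    # Scatter back so masked logits line up with the original vocab order.
    return torch.empty_like(logits).scatter_(-1, logits_idx, logits_sort)
```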
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| nemo_rl/models/policy/utils.py | Implements top-k and top-p sampling functions based on vLLM's approach |
| nemo_rl/models/policy/dtensor_policy_worker.py | Integrates sampling into logits post-processing with the TP=1 restriction |
| tests/unit/models/policy/test_utils.py | Adds a comprehensive test suite for the sampling utilities |
Thanks for adding this! With the upcoming vllm 0.10.0, we should be able to get v1 to output post-processed logprobs as well.
@SahilJain314 Thanks for pointing this out. In that case, I will wait for the PR for upgrading to vllm 0.10.0.
Signed-off-by: Zhanda Zhu <[email protected]>
Since vllm has been upgraded to 0.10.0, I revisited this. However, I found that the meaning of the returned logprobs has changed in the v1 engine. Specifically, in https://github.com/vllm-project/vllm/blob/v0.10.0/vllm/v1/sample/sampler.py#L28-L91, if logprobs are requested, they are gathered from the raw logits before the sampling parameters (temperature, top-k, top-p) are applied. From my perspective, it is not possible to re-apply sampling parameters like top-k/top-p on our side, since we only receive the truncated top-n logprobs rather than the full distribution. Please correct me if I'm wrong. What are your thoughts on how we should proceed? The most effective solution is to request a feature that returns the after-sampling logprobs. cc: @wangshangsam
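
To make the distinction concrete, a toy comparison (not vLLM code) of raw logprobs versus logprobs with sampling parameters applied:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.5, -1.0]])  # one step, vocab of 4

# What the v1 engine returns: raw logprobs, no sampling parameters applied.
raw_logprobs = F.log_softmax(logits, dim=-1)

# What the training side expects here: temperature + top-k applied first.
temperature, k = 0.7, 2
processed = logits / temperature
cutoff = processed.topk(k, dim=-1).values[..., -1, None]
processed = processed.masked_fill(processed < cutoff, float("-inf"))
processed_logprobs = F.log_softmax(processed, dim=-1)

print(raw_logprobs)        # finite everywhere
print(processed_logprobs)  # -inf outside the top-2, renormalized inside
```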
@Dazz993 I think you are right that in the vllm==0.10.0 v1 engine, the returned logprobs are the raw logprobs, computed before the sampling parameters are applied.
The way I can think of for now is to have a vllm patch to get the after-sampling logprobs.
Also, we may ask the vllm team if they could support this in the future. wdyt? @Dazz993 @SahilJain314
BTW, two questions:
@YUki-666 Thanks for your comments!
Yes. Temperature also requires the entire vocabulary, since the softmax normalizes over all logits.
Yes, that is exactly what I am thinking.
Not sure about the exact behavior for different recipes. In our …
According to https://github.com/vllm-project/vllm/blob/v0.10.0/vllm/model_executor/layers/sampler.py#L219-L328, I think it is returning the final logprobs (sampling metadata applied).
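
A toy illustration of the full-vocabulary point: re-applying temperature to only the returned top-n raw logprobs normalizes over the wrong set, so the result differs from temperature applied over the whole vocabulary:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
T = 0.5

# Ground truth: temperature applied over the full vocabulary.
full = F.log_softmax(logits / T, dim=-1)

# Naive reconstruction from only the top-2 raw logprobs (all v1 gives us).
top2_raw = F.log_softmax(logits, dim=-1)[:2]
approx = F.log_softmax(top2_raw / T, dim=-1)

print(full[:2])  # normalized over all 4 tokens
print(approx)    # normalized over 2 tokens only -> different values
```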
Update: I found this PR to get the final logprobs from vllm: vllm-project/vllm#22387
Thank you for the interest! That PR will apply not only top-k & top-p, but also temperature & min-p. Does that work for your use case?
Yep! We want the final logprobs just before the sampling. Thank you for your PR!
@zhandaz Just to confirm - the state of this PR right now is that it's waiting for the vLLM 0.11 release?
Yep. Unless we want another, more complicated file patch, we have to wait for vLLM to support this first.
@zhandaz is 0.10.2 new enough?
@terrykong @zhandaz vllm 0.10.2 seems to include PR 22387: https://github.com/vllm-project/vllm/releases/tag/v0.10.2 |
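
If 0.10.2 does include that change, and assuming it landed as the `logprobs_mode` engine argument (an assumption based on this thread; worth verifying against the vLLM docs for that release), usage would look roughly like this:

```python
from vllm import LLM, SamplingParams

# Assumption: `logprobs_mode="processed_logprobs"` is the knob added around
# vllm-project/vllm#22387; it returns logprobs after temperature / min-p /
# top-k / top-p have been applied, rather than the raw logprobs.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", logprobs_mode="processed_logprobs")
params = SamplingParams(temperature=0.7, top_k=50, top_p=0.9, logprobs=1)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].logprobs)  # post-processed, pre-sampling logprobs
```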
What does this PR do?
tldr: Support top-k and top-p for the dtensor worker with vLLM v0. This PR only supports `tp==1`; the PR for `tp>1` is in #774.

`tp_size > 1` would need changes to the distributed functions, because top-k/top-p filtering requires seeing the full vocabulary to determine which tokens to keep. From what I can tell, we need to change `DistributedLogprob` or `_compute_distributed_log_softmax` to support p and k (see the sketch below). We put that case in another PR so that people can easily revert if the changes to the distributed functions are considered inelegant.

For the detailed implementation, we base our implementation on vLLM's: https://github.com/vllm-project/vllm/blob/34a20c49b3f81f64133428b3a0d62309db1256f9/vllm/v1/sample/ops/topk_topp_sampler.py.
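
For context on why `tp>1` needs distributed changes, a hypothetical sketch (the helper name is made up; `DistributedLogprob` / `_compute_distributed_log_softmax` are the real entry points) of finding a top-k cutoff across vocab-sharded ranks:

```python
import torch
import torch.distributed as dist

def _topk_threshold_tp_sharded(local_logits: torch.Tensor, k: int, group=None) -> torch.Tensor:
    """Hypothetical helper: global top-k cutoff when the vocab dim is TP-sharded.

    The global top-k set is always a subset of the union of each rank's local
    top-k candidates, so exchanging k values per rank is enough.
    """
    local_topk = local_logits.topk(min(k, local_logits.size(-1)), dim=-1).values
    gathered = [torch.empty_like(local_topk) for _ in range(dist.get_world_size(group))]
    dist.all_gather(gathered, local_topk, group=group)
    # Global k-th largest logit across all shards is the threshold.
    return torch.cat(gathered, dim=-1).topk(k, dim=-1).values[..., -1, None]

# Usage sketch: each rank masks its own shard, after which the existing
# distributed log-softmax can run on the filtered logits.
# threshold = _topk_threshold_tp_sharded(local_logits, k=50)
# local_logits = local_logits.masked_fill(local_logits < threshold, float("-inf"))
```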
Issues
Related Issue: #69