Add tests for chunked prefill and prefix cache with causal pooling models #26526
…dels This test uses Qwen3-Embedding-0.6B, which has causal attention and LAST pooling and therefore supports these features. To verify them, the test wraps the Pooler in an interceptor that checks whether the prompt is processed in one go or piecewise. For prefix caching, it checks that the interceptor has seen fewer tokens than expected. Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Co-authored-by: Ayush Singh <ayush1009208@gmail.com>
Code Review
This pull request adds end-to-end tests for chunked prefill and prefix caching with pooling models. The tests use a wrapper to intercept calls and verify the behavior. The implementation is mostly correct, but I found one issue in the prefix cache test where a hardcoded value is used, making the test brittle. My review provides a suggestion to make the test more robust by dynamically getting the value from the configuration.
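The review does not say which value was hardcoded. As a minimal sketch of the general suggestion, assume purely for illustration that it is a cache block size. `CacheConfig`, `block_size`, and `tokens_in_full_blocks` are hypothetical names, not identifiers taken from the PR or from vLLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheConfig:
    """Hypothetical stand-in for whatever configuration object the test can read."""
    block_size: int


def tokens_in_full_blocks(prompt_len: int, config: CacheConfig) -> int:
    """Number of prompt tokens covered by complete cache blocks.

    The block size comes from the configuration instead of a literal,
    so the expectation follows the setting if it ever changes.
    """
    return (prompt_len // config.block_size) * config.block_size


# Brittle form: a literal 16 baked into the assertion.
# Robust form: derive the expectation from the configuration.
config = CacheConfig(block_size=16)
assert tokens_in_full_blocks(35, config) == 32
```

Deriving the expectation this way keeps the assertion valid when the configured value differs between environments.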
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
e2e testing is now running on CPU LOL
noooop left a comment
Thanks for this fascinating work
Head branch was pushed to by a user without write access
I've added an annotation to skip the tests if CPU is detected; let me know if that's OK. The tests are also running on GPU in "V1 Test e2e + engine". There was a successful execution here: https://buildkite.com/vllm/ci/builds/34649/steps/canvas?jid=0199de40-8310-4e84-9cf7-624a5d1922b0
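The annotation itself is not shown in this conversation. A minimal sketch of the idea, using the standard library's `unittest.skipIf` so it runs without extra dependencies, could look like the following. The `running_on_cpu` helper and the `TEST_DEVICE` environment variable are hypothetical; a real suite would ask its framework which device is in use.

```python
import os
import unittest


def running_on_cpu() -> bool:
    """Hypothetical probe: treat the run as CPU-only unless TEST_DEVICE says otherwise."""
    return os.environ.get("TEST_DEVICE", "cpu") == "cpu"


class PoolingFeatureTest(unittest.TestCase):
    # The condition is evaluated once, when the class body is executed.
    @unittest.skipIf(running_on_cpu(), "test is skipped when CPU is detected")
    def test_chunked_prefill(self):
        # Placeholder body; the real test would exercise the model here.
        self.assertTrue(True)


if __name__ == "__main__":
    unittest.main()
```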
…dels (vllm-project#26526) Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Co-authored-by: Ayush Singh <ayush1009208@gmail.com> Signed-off-by: 1994 <1994@users.noreply.github.com>
…dels (vllm-project#26526) Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Co-authored-by: Ayush Singh <ayush1009208@gmail.com> Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
@maxdebayser, thanks for the PR. I sure have a lot to learn. I will come back with another good pull request for sure. Thank you for giving me a chance.
No problem, @ArkVex. Thanks for getting this PR started. I had to bring it home for time reasons, but I hope you'll pick up other issues to work on. It sure is a steep learning curve, but it's rewarding.
…dels (vllm-project#26526) Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Co-authored-by: Ayush Singh <ayush1009208@gmail.com> Signed-off-by: bbartels <benjamin@bartels.dev>
…dels (vllm-project#26526) Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Co-authored-by: Ayush Singh <ayush1009208@gmail.com>
…dels (vllm-project#26526) Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Co-authored-by: Ayush Singh <ayush1009208@gmail.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
Addresses: #23436
This test uses Qwen3-Embedding-0.6B, which has causal attention and LAST pooling and therefore supports these features. To verify them, the test wraps the Pooler in an interceptor that checks whether the prompt is processed in one go or piecewise. For prefix caching, it checks that the interceptor has seen fewer tokens than expected.
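The interceptor idea can be illustrated with a toy model that does not import vLLM. This is only a sketch under stated assumptions: `InterceptingPooler`, `last_pool`, and `run_prefill` are hypothetical names, and `run_prefill` is a stand-in for the engine, not how vLLM schedules work.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def last_pool(hidden_states: Sequence[Vector]) -> Vector:
    """LAST pooling: the embedding is the hidden state of the final token."""
    return list(hidden_states[-1])


@dataclass
class InterceptingPooler:
    """Hypothetical wrapper that records how many tokens each call receives."""
    inner: Callable[[Sequence[Vector]], Vector]
    chunk_sizes: List[int] = field(default_factory=list)

    def __call__(self, hidden_states: Sequence[Vector]) -> Vector:
        self.chunk_sizes.append(len(hidden_states))
        return self.inner(hidden_states)


def run_prefill(pooler: InterceptingPooler, hidden_states: Sequence[Vector],
                chunk_size: int, num_cached_tokens: int = 0) -> Optional[Vector]:
    """Toy engine: skip tokens served from cache, feed the rest in chunks."""
    remaining = hidden_states[num_cached_tokens:]
    output = None
    for start in range(0, len(remaining), chunk_size):
        output = pooler(remaining[start:start + chunk_size])
    return output


prompt = [[float(i)] for i in range(10)]

# Chunked prefill: the interceptor sees the prompt piecewise.
chunked = InterceptingPooler(last_pool)
assert run_prefill(chunked, prompt, chunk_size=4) == [9.0]
assert chunked.chunk_sizes == [4, 4, 2]

# Prefix caching: the interceptor sees fewer tokens than the prompt length.
cached = InterceptingPooler(last_pool)
assert run_prefill(cached, prompt, chunk_size=16, num_cached_tokens=8) == [9.0]
assert sum(cached.chunk_sizes) < len(prompt)
```

In both cases the pooled output is the same final-token vector, which is why a model with causal attention and LAST pooling can tolerate piecewise or partially cached prompt processing.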
@ArkVex, this issue has been open since August, so I finished the implementation here, but I've added you as co-author. Can you take a look and see if you agree with the changes?
cc: @noooop