Add support for the /rerank endpoint in vllm bench serve #26602
DarkLight1337 merged 8 commits into vllm-project:main
Conversation
The /rerank API can be supported both by embedding models and by native reranker models. However, with reranker models the query is concatenated with each document, with a separator token in between, so the number of tokens that passes through the model has to be accounted for differently in each case. Because of these details, this PR adds a specialized random dataset that generates requests carrying the expected number of tokens. When the user sets `random-input-len`, `num-prompts` and `random-batch-size`, in both cases we generate requests such that the total number of tokens is `num-prompts * input-len`, delivered in batches of `batch-size * input-len` tokens each.

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
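To make the accounting concrete (with illustrative values, not defaults from this PR): setting `num-prompts=1000`, `random-input-len=512` and `random-batch-size=8` gives a total budget of 1000 * 512 = 512,000 tokens, sent as requests of 8 * 512 = 4,096 tokens each, i.e. roughly 125 rerank requests. For a native reranker, the per-document query concatenation is what the specialized dataset adjusts for, so the same budget should still hold.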
Documentation preview: https://vllm--26602.org.readthedocs.build/en/26602/
cc: @noooop, @DarkLight1337, @ZJY0516
Code Review
This pull request adds valuable support for benchmarking the /rerank endpoint, including a new specialized random dataset and documentation. The implementation is well-structured, refactoring existing embedding benchmark logic into a more general _run_pooling_request function to accommodate both embeddings and reranking. However, I've identified a critical issue that can cause the benchmark to crash under specific default conditions. Please see the detailed comment for the fix.
Related to #21796
This pull request has merge conflicts that must be resolved before it can be merged.
Here is an example of how this works. With the server running a reranker or embedding model, run a benchmark using the `vllm-rerank` backend and the `random-rerank` dataset:
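The command below is a minimal sketch of such a run. The `vllm-rerank` backend, the `random-rerank` dataset and the `random-batch-size` option are taken from this PR's description; the model name is only a placeholder, and the exact flag spellings should be checked against `vllm bench serve --help`.

```bash
# Sketch of a rerank benchmark run against a server that is already up
# (started e.g. with `vllm serve <model>`). Flag names follow the options
# mentioned in this PR; the model below is just a placeholder reranker.
vllm bench serve \
  --backend vllm-rerank \
  --model BAAI/bge-reranker-v2-m3 \
  --dataset-name random-rerank \
  --num-prompts 1000 \
  --random-input-len 512 \
  --random-batch-size 8
```

With these illustrative values the benchmark should issue about 125 /rerank requests of 8 documents each, for a total of roughly 512,000 input tokens, matching the token accounting described above.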