
[WIP] Use vllm transformers backend for pooling model runner.#752

Closed
maxdebayser wants to merge 8 commits intomainfrom
vllm_model_loader_pooling

Conversation


maxdebayser (Collaborator) commented Feb 19, 2026

This PR explores using the vLLM model loader to load either the vLLM or the transformers modeling code in the pooling model runner. It also shows how to plug in a custom attention class.
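The "plug in a custom attention class" idea can be illustrated with a registry pattern: implementations are registered under a name, and the model runner looks its attention function up by name at load time instead of hard-coding it. This is a minimal, self-contained sketch; the names (`ATTENTION_REGISTRY`, `spyre_attention`) are illustrative, not the actual vLLM or transformers APIs.

```python
from typing import Callable, Dict

# Hypothetical registry mapping an implementation name to a callable.
ATTENTION_REGISTRY: Dict[str, Callable] = {}

def register_attention(name: str):
    """Decorator that registers an attention implementation under a name."""
    def wrapper(fn: Callable) -> Callable:
        ATTENTION_REGISTRY[name] = fn
        return fn
    return wrapper

@register_attention("eager")
def eager_attention(q, k, v):
    # Stand-in for a reference (eager) attention implementation.
    return "eager", (q, k, v)

@register_attention("spyre")
def spyre_attention(q, k, v):
    # Stand-in for a device-specific attention implementation.
    return "spyre", (q, k, v)

def run_attention(impl: str, q, k, v):
    # The model runner selects the implementation by name at load time.
    return ATTENTION_REGISTRY[impl](q, k, v)
```

transformers exposes a similar mechanism for custom attention functions, and vLLM dispatches attention through pluggable backends; the sketch only shows the shape of the pattern.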

Why? Because 1) pooling models are simpler and don't require paged attention, and 2) we were already loading the transformers code for pooling, but in a hacky way, without taking advantage of the vLLM pooler code.

This investigation showed the following:

  1. In eager mode, both the vLLM and the transformers backends work correctly.
  2. With torch 2.7, we can't compile the vLLM linear classes without graph breaks. The transformers backend tries to replace the linear classes in the transformers code, but this can be circumvented.
  3. The transformers backend only supports the pooling models in transformers 5.0.0, but with a few hacks transformers 4.57 can be supported as well.
  4. torch_sendnn can't transform the fx graph correctly for transformers 5.0.0.

With a few hacks in this PR, both transformers 4.57 and 5.0 can run the pooling models correctly on cpu and inductor, but only 4.57 runs on the spyre device. With the 4.57 version, the custom attention class is not used. This means that this PR can be simplified a lot if we remove the code to support 5.0.
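The version split described above amounts to gating the pooling-model code path on the installed transformers version. A minimal sketch of that gating, with hypothetical names (`select_pooling_path` and the returned labels are illustrative only):

```python
def parse_version(v: str) -> tuple:
    """Parse a dotted version string into a comparable tuple of ints."""
    return tuple(int(part) for part in v.split("."))

def select_pooling_path(transformers_version: str) -> str:
    """Pick the pooling-model code path based on the transformers version."""
    if parse_version(transformers_version) >= (5, 0, 0):
        # 5.0 supports the pooling models natively, but torch_sendnn cannot
        # transform its fx graph, so this path only works on cpu/inductor.
        return "native-5.0"
    # 4.57 needs a few hacks, but it runs on the spyre device.
    return "hacked-4.57"
```

Dropping 5.0 support, as the paragraph above suggests, would remove the first branch entirely.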

This PR builds on ideas from PR #217

maxdebayser and others added 8 commits February 16, 2026 18:13
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
@github-actions

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, run ./format.sh.
Now you are good to go 🚀.

We also recommend installing prek and configuring it to check your code before every local commit.

maxdebayser requested review from joerunde, tdoublep and tjohnson31415 and removed the request for tjohnson31415 on February 19, 2026 14:42
@maxdebayser
Collaborator Author

Upstream transformers 5.0 issue: vllm-project/vllm#30566

@maxdebayser
Collaborator Author

Apparently the current stack works with torch 2.10. I'm going to test whether the vLLM modeling code compiles with this version.

@joerunde
Collaborator

bot:test
MARKERS="spyre and embedding"

@maxdebayser
Collaborator Author

This was an interesting experiment, but it hit too many compatibility roadblocks with transformers, sendnn, and the compiler. Still, I think the learnings from this PR can be used to create a vLLM model loader that loads the embedding models in a better way than we currently do. I'll open a new PR for that.

