
[WIP] Use vllm transformers backend for pooling model runner.#752

Closed
maxdebayser wants to merge 8 commits intomainfrom
vllm_model_loader_pooling

Conversation


maxdebayser (Collaborator) commented Feb 19, 2026

This PR explores using the vLLM model loader to load either the vLLM or the transformers modeling code in the pooling model runner. It also shows how to plug in a custom attention class.
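The "plug in a custom attention class" idea can be illustrated with a registry pattern: implementations are registered under a name, and the model runner looks its attention function up by name at load time instead of hard-coding it. This is a minimal, self-contained sketch; the names (`ATTENTION_REGISTRY`, `spyre_attention`) are illustrative, not the actual vLLM or transformers APIs.

```python
from typing import Callable, Dict

# Hypothetical registry mapping an implementation name to a callable.
ATTENTION_REGISTRY: Dict[str, Callable] = {}

def register_attention(name: str):
    """Decorator that registers an attention implementation under a name."""
    def wrapper(fn: Callable) -> Callable:
        ATTENTION_REGISTRY[name] = fn
        return fn
    return wrapper

@register_attention("eager")
def eager_attention(q, k, v):
    # Stand-in for a reference (eager) attention implementation.
    return "eager", (q, k, v)

@register_attention("spyre")
def spyre_attention(q, k, v):
    # Stand-in for a device-specific attention implementation.
    return "spyre", (q, k, v)

def run_attention(impl: str, q, k, v):
    # The model runner selects the implementation by name at load time.
    return ATTENTION_REGISTRY[impl](q, k, v)
```

transformers exposes a similar mechanism for custom attention functions, and vLLM dispatches attention through pluggable backends; the sketch only shows the shape of the pattern.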

Why? Because 1) pooling models are simpler and don't require paged attention, and 2) we were already loading the transformers code for pooling, but in a hacky way, without taking advantage of the vLLM pooler code.

This investigation showed the following:

  1. In eager mode, both the vLLM and the transformers backends work correctly.
  2. With torch 2.7, we can't compile the vLLM linear classes without graph breaks. The transformers backend tries to replace the linear classes in the transformers code, but this can be circumvented.
  3. The transformers backend only supports the pooling models in transformers 5.0.0, but with a few hacks transformers 4.57 can be supported as well.
  4. torch_sendnn can't transform the fx graph correctly for transformers 5.0.0.

With a few hacks in this PR, both transformers 4.57 and 5.0 can run the pooling models correctly on cpu and inductor, but only 4.57 runs on the spyre device. With the 4.57 version, the custom attention class is not used. This means that this PR can be simplified a lot if we remove the code to support 5.0.
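The version split described above amounts to gating the pooling-model code path on the installed transformers version. A minimal sketch of that gating, with hypothetical names (`select_pooling_path` and the returned labels are illustrative only):

```python
def parse_version(v: str) -> tuple:
    """Parse a dotted version string into a comparable tuple of ints."""
    return tuple(int(part) for part in v.split("."))

def select_pooling_path(transformers_version: str) -> str:
    """Pick the pooling-model code path based on the transformers version."""
    if parse_version(transformers_version) >= (5, 0, 0):
        # 5.0 supports the pooling models natively, but torch_sendnn cannot
        # transform its fx graph, so this path only works on cpu/inductor.
        return "native-5.0"
    # 4.57 needs a few hacks, but it runs on the spyre device.
    return "hacked-4.57"
```

Dropping 5.0 support, as the paragraph above suggests, would remove the first branch entirely.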

This PR builds on ideas from PR #217

maxdebayser and others added 8 commits February 16, 2026 18:13
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
@github-actions

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, run ./format.sh.
Now you are good to go 🚀.

We also recommend installing prek and configuring it to check your code before every local commit.

maxdebayser requested review from joerunde, tdoublep and tjohnson31415 and removed the request for tjohnson31415 on February 19, 2026 14:42
@maxdebayser
Collaborator Author

Upstream transformers 5.0 issue: vllm-project/vllm#30566

@maxdebayser
Collaborator Author

Apparently the current stack works with torch 2.10. I'm going to test whether the vLLM modeling code compiles with this version.

@joerunde
Collaborator

bot:test
MARKERS="spyre and embedding"

@maxdebayser
Collaborator Author

This was an interesting experiment, but it hit too many compatibility roadblocks with transformers, sendnn, and the compiler. Still, I think the learnings from this PR can be used to create a vLLM model loader that loads the embedding models in a better way than we currently do. I'll open a new PR for that.

