[Draft] Migrate bitsandbytes support to OOT plugin by Isotr0py · Pull Request #43529 · vllm-project/vllm

Isotr0py · 2026-05-24T15:08:04Z

Purpose

Test Plan

Test Result

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

mergify · 2026-05-24T15:08:41Z

Documentation preview: https://vllm--43529.org.readthedocs.build/en/43529/

gemini-code-assist

Code Review

This pull request migrates BitsAndBytes support to an out-of-tree plugin, removing hardcoded BNB logic from the core vLLM codebase. It introduces generic hooks for quantization configurations, model loaders, and weight sharding to support this plugin architecture. Feedback focuses on performance optimizations in vllm/model_executor/layers/linear.py, specifically recommending that the calculation of shard offsets and indices be guarded by a check for the presence of a shard_indexer to avoid unnecessary overhead during standard model loading.

gemini-code-assist · 2026-05-24T15:13:29Z

+                index = list(itertools.accumulate([0] + self.output_sizes))
+                orig_offsets = {
+                    str(i): (index[i], size) for i, size in enumerate(self.output_sizes)
+                }
+                orig_offsets["total"] = (self.output_size, 0)
+                shard_size, shard_offset = adjust_shard_indexes(
+                    param, orig_offsets, str(shard_id), shard_size, shard_offset
+                )


The calculation of index and orig_offsets is performed inside the shard loop for every parameter. This introduces unnecessary overhead during model loading for all models using merged linear layers, even when no custom quantization plugin is used. These calculations should be guarded by a check for shard_indexer to avoid performance regressions in standard model loading.

if getattr(param, "shard_indexer", None) is not None:
    index = list(itertools.accumulate([0] + self.output_sizes))
    orig_offsets = {
        str(i): (index[i], size) for i, size in enumerate(self.output_sizes)
    }
    orig_offsets["total"] = (self.output_size, 0)
    shard_size, shard_offset = adjust_shard_indexes(
        param, orig_offsets, str(shard_id), shard_size, shard_offset
    )
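To make the offset table concrete, here is a minimal standalone sketch of the same `itertools.accumulate` computation. The helper name `build_orig_offsets` and the sample sizes are illustrative assumptions, not part of the PR, and the sketch assumes `self.output_size` equals the sum of `self.output_sizes`.

```python
import itertools


def build_orig_offsets(output_sizes):
    """Map each shard id to (start_offset, size) in the unsharded weight.

    Hypothetical helper mirroring the snippet under review; it assumes the
    total output size is simply the sum of the per-shard output sizes.
    """
    # Prefix sums give the starting offset of every shard.
    index = list(itertools.accumulate([0] + output_sizes))
    orig_offsets = {
        str(i): (index[i], size) for i, size in enumerate(output_sizes)
    }
    # The "total" entry stores the full size in the first slot, as in the diff.
    orig_offsets["total"] = (sum(output_sizes), 0)
    return orig_offsets


# Example sizes chosen only for illustration.
offsets = build_orig_offsets([4096, 4096, 1024])
print(offsets)
```

The table depends only on the layer's `output_sizes`, not on the parameter being loaded, which is why rebuilding it for every parameter is redundant work when no indexer consumes it.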

gemini-code-assist · 2026-05-24T15:13:30Z

+            index = list(itertools.accumulate([0] + self.output_sizes))
+            orig_offsets = {
+                str(i): (index[i], size) for i, size in enumerate(self.output_sizes)
+            }
+            orig_offsets["total"] = (self.output_size, 0)
+            shard_size, shard_offset = adjust_shard_indexes(
+                param, orig_offsets, str(loaded_shard_id), shard_size, shard_offset
+            )


The orig_offsets dictionary is constructed for every parameter in the weight_loader, which is unnecessary for standard models that do not utilize a custom shard_indexer. Adding a guard check for the indexer will prevent this overhead during model initialization.

Suggested change

-            index = list(itertools.accumulate([0] + self.output_sizes))
-            orig_offsets = {
-                str(i): (index[i], size) for i, size in enumerate(self.output_sizes)
-            }
-            orig_offsets["total"] = (self.output_size, 0)
-            shard_size, shard_offset = adjust_shard_indexes(
-                param, orig_offsets, str(loaded_shard_id), shard_size, shard_offset
-            )
+            if getattr(param, "shard_indexer", None) is not None:
+                index = list(itertools.accumulate([0] + self.output_sizes))
+                orig_offsets = {
+                    str(i): (index[i], size) for i, size in enumerate(self.output_sizes)
+                }
+                orig_offsets["total"] = (self.output_size, 0)
+                shard_size, shard_offset = adjust_shard_indexes(
+                    param, orig_offsets, str(loaded_shard_id), shard_size, shard_offset
+                )

gemini-code-assist · 2026-05-24T15:13:30Z

+                orig_qkv_offsets = {
+                    "q": (0, self.total_num_heads * self.head_size),
+                    "k": (
+                        self.total_num_heads * self.head_size,
+                        self.total_num_kv_heads * self.head_size,
+                    ),
+                    "v": (
+                        (self.total_num_heads + self.total_num_kv_heads)
+                        * self.head_size,
+                        self.total_num_kv_heads * self.v_head_size,
+                    ),
+                    "total": (
+                        (self.total_num_heads + self.total_num_kv_heads)
+                        * self.head_size
+                        + self.total_num_kv_heads * self.v_head_size,
+                        0,
+                    ),
+                }
+                shard_size, shard_offset = adjust_shard_indexes(
+                    param, orig_qkv_offsets, shard_id, shard_size, shard_offset
+                )


In QKVParallelLinear.weight_loader, the orig_qkv_offsets dictionary is recalculated inside the shard loop for every parameter. This results in redundant computations during model loading for all models. This block should be guarded to only execute when a shard_indexer is present on the parameter.

if getattr(param, "shard_indexer", None) is not None:
    orig_qkv_offsets = {
        "q": (0, self.total_num_heads * self.head_size),
        "k": (
            self.total_num_heads * self.head_size,
            self.total_num_kv_heads * self.head_size,
        ),
        "v": (
            (self.total_num_heads + self.total_num_kv_heads) * self.head_size,
            self.total_num_kv_heads * self.v_head_size,
        ),
        "total": (
            (self.total_num_heads + self.total_num_kv_heads) * self.head_size
            + self.total_num_kv_heads * self.v_head_size,
            0,
        ),
    }
    shard_size, shard_offset = adjust_shard_indexes(
        param, orig_qkv_offsets, shard_id, shard_size, shard_offset
    )
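As a worked example of the QKV layout, the sketch below evaluates the same arithmetic for one set of head counts. The function name `build_orig_qkv_offsets` and the numbers (32 query heads, 8 KV heads, head size 128) are assumptions picked for illustration and do not come from the PR.

```python
def build_orig_qkv_offsets(total_num_heads, total_num_kv_heads,
                           head_size, v_head_size):
    """Return (start_offset, size) for q, k, v in the fused QKV weight.

    Hypothetical helper that mirrors the dictionary in the diff.
    """
    q_size = total_num_heads * head_size
    k_size = total_num_kv_heads * head_size
    v_size = total_num_kv_heads * v_head_size
    return {
        "q": (0, q_size),
        "k": (q_size, k_size),
        "v": (q_size + k_size, v_size),
        # "total" keeps the full fused size in the first slot, as in the diff.
        "total": (q_size + k_size + v_size, 0),
    }


# Illustrative GQA-style configuration (assumed values).
qkv_offsets = build_orig_qkv_offsets(
    total_num_heads=32, total_num_kv_heads=8, head_size=128, v_head_size=128
)
print(qkv_offsets)
```

Every value is derived from layer-level attributes, so the dictionary is identical for each parameter and shard that passes through the loader.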

gemini-code-assist · 2026-05-24T15:13:30Z

+            orig_qkv_offsets = {
+                "q": (0, self.num_heads * self.head_size),
+                "k": (
+                    self.num_heads * self.head_size,
+                    self.num_kv_heads * self.head_size,
+                ),
+                "v": (
+                    (self.num_heads + self.num_kv_heads) * self.head_size,
+                    self.num_kv_heads * self.v_head_size,
+                ),
+                "total": (
+                    (self.num_heads + self.num_kv_heads) * self.head_size
+                    + self.num_kv_heads * self.v_head_size,
+                    0,
+                ),
+            }
+            shard_size, shard_offset = adjust_shard_indexes(
+                param, orig_qkv_offsets, loaded_shard_id, shard_size, shard_offset
+            )


Constructing orig_qkv_offsets for every parameter in the QKV weight loader is wasteful for non-quantized models. A guard check for shard_indexer should be added to maintain optimal loading performance.

Suggested change

-            orig_qkv_offsets = {
-                "q": (0, self.num_heads * self.head_size),
-                "k": (
-                    self.num_heads * self.head_size,
-                    self.num_kv_heads * self.head_size,
-                ),
-                "v": (
-                    (self.num_heads + self.num_kv_heads) * self.head_size,
-                    self.num_kv_heads * self.v_head_size,
-                ),
-                "total": (
-                    (self.num_heads + self.num_kv_heads) * self.head_size
-                    + self.num_kv_heads * self.v_head_size,
-                    0,
-                ),
-            }
-            shard_size, shard_offset = adjust_shard_indexes(
-                param, orig_qkv_offsets, loaded_shard_id, shard_size, shard_offset
-            )
+            if getattr(param, "shard_indexer", None) is not None:
+                orig_qkv_offsets = {
+                    "q": (0, self.num_heads * self.head_size),
+                    "k": (
+                        self.num_heads * self.head_size,
+                        self.num_kv_heads * self.head_size,
+                    ),
+                    "v": (
+                        (self.num_heads + self.num_kv_heads) * self.head_size,
+                        self.num_kv_heads * self.v_head_size,
+                    ),
+                    "total": (
+                        (self.num_heads + self.num_kv_heads) * self.head_size
+                        + self.num_kv_heads * self.v_head_size,
+                        0,
+                    ),
+                }
+                shard_size, shard_offset = adjust_shard_indexes(
+                    param, orig_qkv_offsets, loaded_shard_id, shard_size, shard_offset
+                )
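All four review comments above ask for the same `shard_indexer` guard. One possible way to avoid repeating it at each call site is a small helper that builds the offset table lazily. This is only a sketch: `maybe_adjust_shard`, the `build_offsets` callable, and the indexer's call signature are hypothetical and are not taken from the PR or from vLLM.

```python
from types import SimpleNamespace


def maybe_adjust_shard(param, build_offsets, shard_id, shard_size, shard_offset):
    """Adjust shard bounds only when the parameter carries a shard_indexer."""
    indexer = getattr(param, "shard_indexer", None)
    if indexer is None:
        # Standard path: the offset table is never built.
        return shard_size, shard_offset
    # Plugin path: build the table on demand and delegate to the indexer.
    # The indexer signature used here is an assumption for this sketch.
    return indexer(build_offsets(), shard_id, shard_size, shard_offset)


calls = {"built": 0}


def build_offsets():
    # Count invocations so the lazy behaviour is observable.
    calls["built"] += 1
    return {"0": (0, 8), "1": (8, 8), "total": (16, 0)}


def halving_indexer(orig_offsets, shard_id, shard_size, shard_offset):
    # Toy indexer: pretend the packed weight stores two values per element.
    return shard_size // 2, shard_offset // 2


plain_param = SimpleNamespace()
packed_param = SimpleNamespace(shard_indexer=halving_indexer)

plain_result = maybe_adjust_shard(plain_param, build_offsets, "1", 8, 8)
packed_result = maybe_adjust_shard(packed_param, build_offsets, "1", 8, 8)
print(plain_result, packed_result, calls["built"])
```

Passing a callable rather than a prebuilt dictionary means parameters without an indexer pay only for a single `getattr` lookup.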

mergify · 2026-05-25T07:29:41Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Isotr0py.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

mergify · 2026-06-04T16:55:52Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Isotr0py.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Isotr0py added 2 commits May 21, 2026 23:39

remove bnb

c341346

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

Merge remote-tracking branch 'upstream/main' into remove-bnb

06aef2d

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

mergify Bot added the documentation, ci/build, nvidia, and rocm labels May 24, 2026

github-project-automation Bot added this to NVIDIA and AMD May 24, 2026

mergify Bot added the v1 label May 24, 2026

github-project-automation Bot moved this to Todo in AMD May 24, 2026

gemini-code-assist Bot reviewed May 24, 2026


Isotr0py mentioned this pull request May 25, 2026

[Migration] Migrate GGUF quantization support to plugin #39612

Open

5 tasks

mergify Bot added the needs-rebase label May 25, 2026

Isotr0py mentioned this pull request May 27, 2026

[Bugfix] Convert Gemma4-MM ViT linear layers to vllm native impl #43798

Merged

4 tasks

Isotr0py added 5 commits May 28, 2026 21:56

clean

985ffb0

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

clean

c0a8f39

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

clean

db3deb1

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

clean

c77d04b

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

Merge remote-tracking branch 'upstream/main' into remove-bnb

c045b70

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

mergify Bot removed the needs-rebase label Jun 4, 2026

mergify Bot added the needs-rebase label Jun 4, 2026

Isotr0py wants to merge 7 commits into vllm-project:main from Isotr0py:remove-bnb
