[Draft] Migrate bitsandbytes support to OOT plugin #43529

Draft
Isotr0py wants to merge 7 commits into vllm-project:main from Isotr0py:remove-bnb

Conversation

@Isotr0py
Member

@Isotr0py Isotr0py commented May 24, 2026

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Isotr0py added 2 commits May 21, 2026 23:39
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@mergify
Contributor

mergify Bot commented May 24, 2026

Documentation preview: https://vllm--43529.org.readthedocs.build/en/43529/

@mergify mergify Bot added the documentation, ci/build, nvidia, and rocm labels May 24, 2026
@mergify mergify Bot added the v1 label May 24, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD May 24, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request migrates BitsAndBytes support to an out-of-tree plugin, removing hardcoded BNB logic from the core vLLM codebase. It introduces generic hooks for quantization configurations, model loaders, and weight sharding to support this plugin architecture. Feedback focuses on performance optimizations in vllm/model_executor/layers/linear.py, specifically recommending that the calculation of shard offsets and indices be guarded by a check for the presence of a shard_indexer to avoid unnecessary overhead during standard model loading.

Comment thread vllm/model_executor/layers/linear.py Outdated
Comment on lines +771 to +778
    index = list(itertools.accumulate([0] + self.output_sizes))
    orig_offsets = {
        str(i): (index[i], size) for i, size in enumerate(self.output_sizes)
    }
    orig_offsets["total"] = (self.output_size, 0)
    shard_size, shard_offset = adjust_shard_indexes(
        param, orig_offsets, str(shard_id), shard_size, shard_offset
    )
Contributor

Severity: high

The calculation of index and orig_offsets is performed inside the shard loop for every parameter. This introduces unnecessary overhead during model loading for all models using merged linear layers, even when no custom quantization plugin is used. These calculations should be guarded by a check for shard_indexer to avoid performance regressions in standard model loading.

                if getattr(param, "shard_indexer", None) is not None:
                    index = list(itertools.accumulate([0] + self.output_sizes))
                    orig_offsets = {
                        str(i): (index[i], size)
                        for i, size in enumerate(self.output_sizes)
                    }
                    orig_offsets["total"] = (self.output_size, 0)
                    shard_size, shard_offset = adjust_shard_indexes(
                        param, orig_offsets, str(shard_id), shard_size, shard_offset
                    )
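
To make the effect of the guard concrete, here is a minimal standalone sketch. It is not the vLLM code: `adjust_shard_indexes` is a hypothetical stand-in (the real helper's body is not shown in this thread), `merged_offsets` and `resolve_shard` are illustrative names, and the "total" entry assumes `output_size` equals the sum of `output_sizes`.

```python
import itertools
from types import SimpleNamespace


def adjust_shard_indexes(param, orig_offsets, shard_id, shard_size, shard_offset):
    """Hypothetical stand-in: defer to the parameter's shard_indexer, if any."""
    indexer = getattr(param, "shard_indexer", None)
    if indexer is None:
        return shard_size, shard_offset
    return indexer(orig_offsets, shard_id)


def merged_offsets(output_sizes):
    """Map each shard id to (offset, size) along the fused output dimension."""
    index = list(itertools.accumulate([0] + list(output_sizes)))
    offsets = {str(i): (index[i], size) for i, size in enumerate(output_sizes)}
    # Assumption: the total output size is the sum of the per-shard sizes.
    offsets["total"] = (sum(output_sizes), 0)
    return offsets


def resolve_shard(param, output_sizes, shard_id, shard_size, shard_offset):
    """Build the offsets table only when the parameter carries a shard_indexer."""
    if getattr(param, "shard_indexer", None) is not None:
        orig_offsets = merged_offsets(output_sizes)
        shard_size, shard_offset = adjust_shard_indexes(
            param, orig_offsets, str(shard_id), shard_size, shard_offset
        )
    return shard_size, shard_offset


# A plain parameter skips the table entirely and keeps its size and offset.
plain = SimpleNamespace()
# A toy indexer that halves both values, as a packed format might.
packed = SimpleNamespace(
    shard_indexer=lambda offsets, sid: (offsets[sid][1] // 2, offsets[sid][0] // 2)
)

print(resolve_shard(plain, [4096, 1024, 1024], 1, 1024, 4096))
print(resolve_shard(packed, [4096, 1024, 1024], 1, 1024, 4096))
```

With the guard in place, parameters without a `shard_indexer` never pay for the `accumulate` call or the dictionary construction.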

Comment thread vllm/model_executor/layers/linear.py Outdated
Comment on lines +812 to +819
    index = list(itertools.accumulate([0] + self.output_sizes))
    orig_offsets = {
        str(i): (index[i], size) for i, size in enumerate(self.output_sizes)
    }
    orig_offsets["total"] = (self.output_size, 0)
    shard_size, shard_offset = adjust_shard_indexes(
        param, orig_offsets, str(loaded_shard_id), shard_size, shard_offset
    )
Contributor

Severity: high

The orig_offsets dictionary is constructed for every parameter in the weight_loader, which is unnecessary for standard models that do not utilize a custom shard_indexer. Adding a guard check for the indexer will prevent this overhead during model initialization.

Suggested change
- index = list(itertools.accumulate([0] + self.output_sizes))
- orig_offsets = {
-     str(i): (index[i], size) for i, size in enumerate(self.output_sizes)
- }
- orig_offsets["total"] = (self.output_size, 0)
- shard_size, shard_offset = adjust_shard_indexes(
-     param, orig_offsets, str(loaded_shard_id), shard_size, shard_offset
- )
+ if getattr(param, "shard_indexer", None) is not None:
+     index = list(itertools.accumulate([0] + self.output_sizes))
+     orig_offsets = {
+         str(i): (index[i], size) for i, size in enumerate(self.output_sizes)
+     }
+     orig_offsets["total"] = (self.output_size, 0)
+     shard_size, shard_offset = adjust_shard_indexes(
+         param, orig_offsets, str(loaded_shard_id), shard_size, shard_offset
+     )

Comment thread vllm/model_executor/layers/linear.py Outdated
Comment on lines +1244 to +1264
    orig_qkv_offsets = {
        "q": (0, self.total_num_heads * self.head_size),
        "k": (
            self.total_num_heads * self.head_size,
            self.total_num_kv_heads * self.head_size,
        ),
        "v": (
            (self.total_num_heads + self.total_num_kv_heads)
            * self.head_size,
            self.total_num_kv_heads * self.v_head_size,
        ),
        "total": (
            (self.total_num_heads + self.total_num_kv_heads)
            * self.head_size
            + self.total_num_kv_heads * self.v_head_size,
            0,
        ),
    }
    shard_size, shard_offset = adjust_shard_indexes(
        param, orig_qkv_offsets, shard_id, shard_size, shard_offset
    )
Contributor

Severity: high

In QKVParallelLinear.weight_loader, the orig_qkv_offsets dictionary is recalculated inside the shard loop for every parameter. This results in redundant computations during model loading for all models. This block should be guarded to only execute when a shard_indexer is present on the parameter.

                if getattr(param, "shard_indexer", None) is not None:
                    orig_qkv_offsets = {
                        "q": (0, self.total_num_heads * self.head_size),
                        "k": (
                            self.total_num_heads * self.head_size,
                            self.total_num_kv_heads * self.head_size,
                        ),
                        "v": (
                            (self.total_num_heads + self.total_num_kv_heads)
                            * self.head_size,
                            self.total_num_kv_heads * self.v_head_size,
                        ),
                        "total": (
                            (self.total_num_heads + self.total_num_kv_heads)
                            * self.head_size
                            + self.total_num_kv_heads * self.v_head_size,
                            0,
                        ),
                    }
                    shard_size, shard_offset = adjust_shard_indexes(
                        param, orig_qkv_offsets, shard_id, shard_size, shard_offset
                    )
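
For readers unfamiliar with the layout this dictionary encodes, the following standalone sketch computes the same (offset, size) pairs from plain integers. The function name `qkv_offsets` and the example head counts are illustrative assumptions, not values taken from the PR.

```python
def qkv_offsets(num_heads, num_kv_heads, head_size, v_head_size=None):
    """Return (offset, size) per projection in a fused Q/K/V output dimension.

    Mirrors the arithmetic of orig_qkv_offsets in the diff above; the "total"
    entry stores the full fused size in the offset slot and 0 in the size slot.
    """
    if v_head_size is None:
        # Assumption for this sketch: V heads default to the Q/K head size.
        v_head_size = head_size
    q_size = num_heads * head_size
    k_size = num_kv_heads * head_size
    v_size = num_kv_heads * v_head_size
    return {
        "q": (0, q_size),
        "k": (q_size, k_size),
        "v": (q_size + k_size, v_size),
        "total": (q_size + k_size + v_size, 0),
    }


# Hypothetical grouped-query configuration: 32 query heads, 8 KV heads.
offsets = qkv_offsets(num_heads=32, num_kv_heads=8, head_size=128)
for name, (offset, size) in offsets.items():
    print(name, offset, size)
```

The table depends only on layer attributes, not on the parameter being loaded, which is why building it for every parameter without a `shard_indexer` is avoidable work.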

Comment thread vllm/model_executor/layers/linear.py Outdated
Comment on lines +1306 to +1324
    orig_qkv_offsets = {
        "q": (0, self.num_heads * self.head_size),
        "k": (
            self.num_heads * self.head_size,
            self.num_kv_heads * self.head_size,
        ),
        "v": (
            (self.num_heads + self.num_kv_heads) * self.head_size,
            self.num_kv_heads * self.v_head_size,
        ),
        "total": (
            (self.num_heads + self.num_kv_heads) * self.head_size
            + self.num_kv_heads * self.v_head_size,
            0,
        ),
    }
    shard_size, shard_offset = adjust_shard_indexes(
        param, orig_qkv_offsets, loaded_shard_id, shard_size, shard_offset
    )
Contributor

Severity: high

Constructing orig_qkv_offsets for every parameter in the QKV weight loader is wasteful for non-quantized models. A guard check for shard_indexer should be added to maintain optimal loading performance.

Suggested change
- orig_qkv_offsets = {
-     "q": (0, self.num_heads * self.head_size),
-     "k": (
-         self.num_heads * self.head_size,
-         self.num_kv_heads * self.head_size,
-     ),
-     "v": (
-         (self.num_heads + self.num_kv_heads) * self.head_size,
-         self.num_kv_heads * self.v_head_size,
-     ),
-     "total": (
-         (self.num_heads + self.num_kv_heads) * self.head_size
-         + self.num_kv_heads * self.v_head_size,
-         0,
-     ),
- }
- shard_size, shard_offset = adjust_shard_indexes(
-     param, orig_qkv_offsets, loaded_shard_id, shard_size, shard_offset
- )
+ if getattr(param, "shard_indexer", None) is not None:
+     orig_qkv_offsets = {
+         "q": (0, self.num_heads * self.head_size),
+         "k": (
+             self.num_heads * self.head_size,
+             self.num_kv_heads * self.head_size,
+         ),
+         "v": (
+             (self.num_heads + self.num_kv_heads) * self.head_size,
+             self.num_kv_heads * self.v_head_size,
+         ),
+         "total": (
+             (self.num_heads + self.num_kv_heads) * self.head_size
+             + self.num_kv_heads * self.v_head_size,
+             0,
+         ),
+     }
+     shard_size, shard_offset = adjust_shard_indexes(
+         param, orig_qkv_offsets, loaded_shard_id, shard_size, shard_offset
+     )

@mergify
Contributor

mergify Bot commented May 25, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Isotr0py.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Isotr0py added 5 commits May 28, 2026 21:56
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@mergify mergify Bot removed the needs-rebase label Jun 4, 2026
@mergify
Contributor

mergify Bot commented Jun 4, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Isotr0py.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 4, 2026

Labels

ci/build, documentation, needs-rebase, nvidia, rocm, v1

Projects

Status: Todo
Status: No status

1 participant