Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
The quality checks have failed. Please run …
Code Review
This pull request refactors MoE calibration by replacing model-specific modules with a generic linearization framework that unfuses expert weights into standard nn.Linear layers. Feedback identifies critical bugs, such as missing imports in the GPT-OSS module and incorrect handling of 3D input tensors in the LinearExperts forward pass. Further improvements were suggested regarding the fragility of using source code inspection for module detection and the efficiency of the gated MLP implementation.
```python
def forward(
    self,
    hidden_states: torch.Tensor,
    top_k_index: torch.Tensor,
    top_k_weights: torch.Tensor,
) -> torch.Tensor:
    final_hidden_states = torch.zeros_like(hidden_states)
    num_experts = len(self)

    # create tokens mask
    with torch.no_grad():
        expert_mask = torch.nn.functional.one_hot(top_k_index, num_experts)
        expert_mask = expert_mask.permute(2, 1, 0)

    for expert_idx in range(num_experts):
        # select tokens for this expert
        top_k_pos, token_indices = torch.where(expert_mask[expert_idx])
        if token_indices.numel() == 0:
            continue

        # apply expert, maybe pass all tokens to the expert
        expert = self[expert_idx]
        if context.CALIBRATE_ALL_EXPERTS:
            expert_output = expert(hidden_states)[token_indices]
        else:
            expert_output = expert(hidden_states[token_indices])

        # apply weighting to outputs
        expert_weights = top_k_weights[token_indices, top_k_pos, None]
        weighted_output = expert_output * expert_weights

        # accumulate the selected tokens
        final_hidden_states.index_add_(0, token_indices, weighted_output)

    return final_hidden_states
```
The forward method of LinearExperts needs to handle 3D hidden_states (e.g., [batch, sequence, hidden]) by flattening them before processing. Otherwise, index_add_(0, token_indices, ...) will incorrectly index into the batch dimension instead of the token dimension, leading to incorrect results or out-of-bounds errors. Additionally, an explicit cast to the destination dtype is recommended for weighted_output to ensure compatibility with index_add_ when using mixed precision (e.g., float32 weights with bfloat16 states).
```python
def forward(
    self,
    hidden_states: torch.Tensor,
    top_k_index: torch.Tensor,
    top_k_weights: torch.Tensor,
) -> torch.Tensor:
    orig_shape = hidden_states.shape
    # Flatten to (total_tokens, hidden_dim)
    hidden_states = hidden_states.view(-1, orig_shape[-1])
    top_k_index = top_k_index.view(-1, top_k_index.shape[-1])
    top_k_weights = top_k_weights.view(-1, top_k_weights.shape[-1])
    final_hidden_states = torch.zeros_like(hidden_states)
    num_experts = len(self)

    # create tokens mask
    with torch.no_grad():
        expert_mask = torch.nn.functional.one_hot(top_k_index, num_experts)
        expert_mask = expert_mask.permute(2, 1, 0)

    for expert_idx in range(num_experts):
        # select tokens for this expert
        top_k_pos, token_indices = torch.where(expert_mask[expert_idx])
        if token_indices.numel() == 0:
            continue

        # apply expert, maybe pass all tokens to the expert
        expert = self[expert_idx]
        if context.CALIBRATE_ALL_EXPERTS:
            expert_output = expert(hidden_states)[token_indices]
        else:
            expert_output = expert(hidden_states[token_indices])

        # apply weighting to outputs
        expert_weights = top_k_weights[token_indices, top_k_pos, None]
        weighted_output = expert_output * expert_weights

        # accumulate the selected tokens
        final_hidden_states.index_add_(
            0, token_indices, weighted_output.to(final_hidden_states.dtype)
        )

    return final_hidden_states.view(orig_shape)
```
Hidden states are expected to already be flattened; see:
https://github.com/huggingface/transformers/blob/a66638d854ae536e0ca31e8bcfa480adfaf58284/src/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py#L141
```python
def _is_moe_experts_module(module) -> bool:
    """Detect modules whose class is decorated with
    ``@use_experts_implementation`` by inspecting the class source AST."""
    try:
        source = inspect.getsource(type(module))
        tree = ast.parse(source)
    except (OSError, TypeError):
        return False

    for node in ast.iter_child_nodes(tree):
        if not isinstance(node, ast.ClassDef):
            continue
        for decorator in node.decorator_list:
            if isinstance(decorator, ast.Name):
                name = decorator.id
            elif isinstance(decorator, ast.Call) and isinstance(
                decorator.func, ast.Name
            ):
                name = decorator.func.id
            else:
                continue
            if name == "use_experts_implementation":
                return True

    return False
```
Using inspect.getsource and ast.parse to detect MoE modules is fragile and potentially slow. It will fail if the source code is unavailable (e.g., in some deployment environments) and adds overhead for every module in the model. A more robust approach would be to check for specific attributes (like gate_up_proj and down_proj) or use a more direct way to identify these modules if possible.
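For illustration, a minimal sketch of the attribute-based check suggested above (the attribute names and 3D layout are assumptions based on common fused-expert implementations, not taken from the PR):

```python
import torch
import torch.nn as nn


def _is_moe_experts_module(module: nn.Module) -> bool:
    # Hypothetical duck-typed check: fused expert modules typically store
    # 3D (num_experts, in_features, out_features) parameters named
    # gate_up_proj / down_proj instead of child nn.Linear layers.
    gate_up = getattr(module, "gate_up_proj", None)
    down = getattr(module, "down_proj", None)
    return (
        isinstance(gate_up, torch.Tensor) and gate_up.dim() == 3
        and isinstance(down, torch.Tensor) and down.dim() == 3
    )
```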
Yep, will think more on this. I think this is the most robust solution, but it should probably be LRU-cached against the module class.
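A rough sketch of that caching, assuming the AST inspection is factored into a per-class helper (the helper name is illustrative):

```python
import ast
import inspect
import textwrap
from functools import lru_cache


@lru_cache(maxsize=None)
def _class_uses_experts_implementation(module_cls: type) -> bool:
    # run the (slow) source inspection once per class; results are memoized
    try:
        tree = ast.parse(textwrap.dedent(inspect.getsource(module_cls)))
    except (OSError, TypeError, SyntaxError):
        return False
    for node in ast.iter_child_nodes(tree):
        if not isinstance(node, ast.ClassDef):
            continue
        for decorator in node.decorator_list:
            target = decorator.func if isinstance(decorator, ast.Call) else decorator
            if isinstance(target, ast.Name) and target.id == "use_experts_implementation":
                return True
    return False


def _is_moe_experts_module(module) -> bool:
    return _class_uses_experts_implementation(type(module))
```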
```python
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
    return self.down_proj(
        self._apply_gate(
            torch.cat(
                [self.gate_proj(hidden_states), self.up_proj(hidden_states)], dim=-1
            )
        )
    )
```
The use of torch.cat followed by _apply_gate (which typically performs a chunk operation) is inefficient. Since _apply_gate is currently restricted to _default_apply_gate (which just splits the input and multiplies), you could optimize this by applying the gate logic directly to the separate projection outputs, avoiding the concatenation and subsequent chunking.
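A sketch of that optimization, assuming `_default_apply_gate` chunks its input into gate/up halves and computes `act_fn(gate) * up` (the `act_fn` attribute is an assumption here):

```python
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
    # project each branch once and gate directly, skipping the
    # torch.cat(...) -> chunk(...) round trip
    gate = self.gate_proj(hidden_states)
    up = self.up_proj(hidden_states)
    return self.down_proj(self.act_fn(gate) * up)
```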
```python
def linearize_moe_model(model: PreTrainedModel):
```
In the base case, we need to do a conversion. However, we can optimize this on a per-model basis by writing our own weight converters that are used directly during loading.
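As a rough illustration of the base-case conversion (shapes, names, and the fused-weight layout are assumptions; an actual converter would follow each model's specific layout):

```python
import torch
import torch.nn as nn


def unfuse_expert_weights(fused: torch.Tensor) -> nn.ModuleList:
    # fused is assumed to be (num_experts, in_features, out_features);
    # nn.Linear stores weight as (out_features, in_features), hence the .T
    num_experts, in_features, out_features = fused.shape
    experts = nn.ModuleList()
    for idx in range(num_experts):
        linear = nn.Linear(
            in_features, out_features, bias=False,
            dtype=fused.dtype, device=fused.device,
        )
        linear.weight.data.copy_(fused[idx].T)
        experts.append(linear)
    return experts
```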
The quality checks have failed. Please run …
Thanks for working on this @kylesayrs. The IBM Spyre stack for vLLM depends on llm-compressor because of this library: https://github.com/foundation-model-stack/fms-model-optimizer. Currently we're unable to upgrade to transformers 5 due to this. Do you need any help?
This pull request has merge conflicts that must be resolved before it can be merged.