
Transformers v5 #2647

Draft

kylesayrs wants to merge 9 commits into main from kylesayrs/transformers-v5

Conversation

@kylesayrs
Collaborator

Examples Changes

MoE

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@coderabbitai
Contributor

coderabbitai Bot commented Apr 24, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@kylesayrs changed the title from "Transformers v5 Support" to "Transformers v5" on Apr 24, 2026
@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@mergify
Contributor

mergify Bot commented Apr 24, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional dependencies to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

Contributor

@gemini-code-assist Bot left a comment


Code Review

This pull request refactors MoE calibration by replacing model-specific modules with a generic linearization framework that unfuses expert weights into standard nn.Linear layers. Feedback identifies critical bugs, such as missing imports in the GPT-OSS module and incorrect handling of 3D input tensors in the LinearExperts forward pass. Further improvements were suggested regarding the fragility of using source code inspection for module detection and the efficiency of the gated MLP implementation.

Comment thread src/llmcompressor/modeling/moe/gpt_oss.py
Comment on lines +207 to +241
def forward(
    self,
    hidden_states: torch.Tensor,
    top_k_index: torch.Tensor,
    top_k_weights: torch.Tensor,
) -> torch.Tensor:
    final_hidden_states = torch.zeros_like(hidden_states)
    num_experts = len(self)

    # create tokens mask
    with torch.no_grad():
        expert_mask = torch.nn.functional.one_hot(top_k_index, num_experts)
        expert_mask = expert_mask.permute(2, 1, 0)

    for expert_idx in range(num_experts):
        # select tokens for this expert
        top_k_pos, token_indices = torch.where(expert_mask[expert_idx])
        if token_indices.numel() == 0:
            continue

        # apply expert, maybe pass all tokens to the expert
        expert = self[expert_idx]
        if context.CALIBRATE_ALL_EXPERTS:
            expert_output = expert(hidden_states)[token_indices]
        else:
            expert_output = expert(hidden_states[token_indices])

        # apply weighting to outputs
        expert_weights = top_k_weights[token_indices, top_k_pos, None]
        weighted_output = expert_output * expert_weights

        # accumulate the selected tokens
        final_hidden_states.index_add_(0, token_indices, weighted_output)

    return final_hidden_states
Contributor

critical

The forward method of LinearExperts needs to handle 3D hidden_states (e.g., [batch, sequence, hidden]) by flattening them before processing. Otherwise, index_add_(0, token_indices, ...) will incorrectly index into the batch dimension instead of the token dimension, leading to incorrect results or out-of-bounds errors. Additionally, an explicit cast to the destination dtype is recommended for weighted_output to ensure compatibility with index_add_ when using mixed precision (e.g., float32 weights with bfloat16 states).

    def forward(
        self,
        hidden_states: torch.Tensor,
        top_k_index: torch.Tensor,
        top_k_weights: torch.Tensor,
    ) -> torch.Tensor:
        orig_shape = hidden_states.shape
        # Flatten to (total_tokens, hidden_dim)
        hidden_states = hidden_states.view(-1, orig_shape[-1])
        top_k_index = top_k_index.view(-1, top_k_index.shape[-1])
        top_k_weights = top_k_weights.view(-1, top_k_weights.shape[-1])

        final_hidden_states = torch.zeros_like(hidden_states)
        num_experts = len(self)

        # create tokens mask
        with torch.no_grad():
            expert_mask = torch.nn.functional.one_hot(top_k_index, num_experts)
            expert_mask = expert_mask.permute(2, 1, 0)

        for expert_idx in range(num_experts):
            # select tokens for this expert
            top_k_pos, token_indices = torch.where(expert_mask[expert_idx])
            if token_indices.numel() == 0:
                continue

            # apply expert, maybe pass all tokens to the expert
            expert = self[expert_idx]
            if context.CALIBRATE_ALL_EXPERTS:
                expert_output = expert(hidden_states)[token_indices]
            else:
                expert_output = expert(hidden_states[token_indices])

            # apply weighting to outputs
            expert_weights = top_k_weights[token_indices, top_k_pos, None]
            weighted_output = expert_output * expert_weights

            # accumulate the selected tokens
            final_hidden_states.index_add_(
                0, token_indices, weighted_output.to(final_hidden_states.dtype)
            )

        return final_hidden_states.view(orig_shape)

Comment on lines +36 to +60
def _is_moe_experts_module(module) -> bool:
    """Detect modules whose class is decorated with
    ``@use_experts_implementation`` by inspecting the class source AST."""
    try:
        source = inspect.getsource(type(module))
        tree = ast.parse(source)
    except (OSError, TypeError):
        return False

    for node in ast.iter_child_nodes(tree):
        if not isinstance(node, ast.ClassDef):
            continue
        for decorator in node.decorator_list:
            if isinstance(decorator, ast.Name):
                name = decorator.id
            elif isinstance(decorator, ast.Call) and isinstance(
                decorator.func, ast.Name
            ):
                name = decorator.func.id
            else:
                continue
            if name == "use_experts_implementation":
                return True

    return False
Contributor

medium

Using inspect.getsource and ast.parse to detect MoE modules is fragile and potentially slow. It will fail if the source code is unavailable (e.g., in some deployment environments) and adds overhead for every module in the model. A more robust approach would be to check for specific attributes (like gate_up_proj and down_proj) or use a more direct way to identify these modules if possible.
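
A minimal sketch of the attribute-based alternative described above; the gate_up_proj and down_proj attribute names come from this comment, while the helper name and the 3D-shape assumption are illustrative only:

    import torch

    def _looks_like_fused_experts(module: torch.nn.Module) -> bool:
        # Hypothetical duck-typed check: fused expert modules are assumed to
        # carry 3D gate_up_proj / down_proj parameters of shape
        # (num_experts, in_features, out_features)
        gate_up = getattr(module, "gate_up_proj", None)
        down = getattr(module, "down_proj", None)
        return (
            isinstance(gate_up, torch.nn.Parameter)
            and isinstance(down, torch.nn.Parameter)
            and gate_up.dim() == 3
            and down.dim() == 3
        )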

Collaborator Author

Yep, will think more on this. I think this is the most robust solution, but it should probably be lru-cached against the module class.
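
A rough sketch of that lru-cached variant, keeping the AST-based detection but keying it on the module class so each class is parsed at most once (the cached helper name is hypothetical):

    import ast
    import functools
    import inspect

    @functools.lru_cache(maxsize=None)
    def _class_uses_experts_implementation(cls: type) -> bool:
        # Same AST check as above, cached per class so the source is parsed once
        try:
            tree = ast.parse(inspect.getsource(cls))
        except (OSError, TypeError):
            return False
        for node in ast.iter_child_nodes(tree):
            if not isinstance(node, ast.ClassDef):
                continue
            for decorator in node.decorator_list:
                if isinstance(decorator, ast.Name):
                    name = decorator.id
                elif isinstance(decorator, ast.Call) and isinstance(
                    decorator.func, ast.Name
                ):
                    name = decorator.func.id
                else:
                    continue
                if name == "use_experts_implementation":
                    return True
        return False

    def _is_moe_experts_module(module) -> bool:
        return _class_uses_experts_implementation(type(module))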

Comment on lines +114 to +121
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
    return self.down_proj(
        self._apply_gate(
            torch.cat(
                [self.gate_proj(hidden_states), self.up_proj(hidden_states)], dim=-1
            )
        )
    )
Contributor

medium

The use of torch.cat followed by _apply_gate (which typically performs a chunk operation) is inefficient. Since _apply_gate is currently restricted to _default_apply_gate (which just splits the input and multiplies), you could optimize this by applying the gate logic directly to the separate projection outputs, avoiding the concatenation and subsequent chunking.
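
A hedged sketch of that optimization, assuming the default gate reduces to an elementwise activation on the gate projection multiplied by the up projection; the act_fn attribute here stands in for whatever _default_apply_gate actually applies:

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Gate the separate projections directly, avoiding the torch.cat
        # followed by the chunk inside _apply_gate
        gate = self.gate_proj(hidden_states)
        up = self.up_proj(hidden_states)
        return self.down_proj(self.act_fn(gate) * up)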

@dsikka mentioned this pull request Apr 27, 2026
def linearize_moe_model(model: PreTrainedModel):
Collaborator Author

In the base case, we need to do a conversion. However, we can optimize this on a per-model basis by writing our own weight converters that are used directly during loading.
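
For reference, a rough sketch of what the base-case conversion could look like: splitting a fused expert weight of shape (num_experts, in_features, out_features) into per-expert nn.Linear modules. The function name and the shape convention are assumptions, not the actual linearize_moe_model implementation:

    import torch

    def unfuse_expert_weights(fused: torch.Tensor) -> torch.nn.ModuleList:
        # fused is assumed to be (num_experts, in_features, out_features);
        # nn.Linear stores weight as (out_features, in_features), hence the transpose
        num_experts, in_features, out_features = fused.shape
        experts = torch.nn.ModuleList()
        for idx in range(num_experts):
            linear = torch.nn.Linear(
                in_features, out_features, bias=False,
                device=fused.device, dtype=fused.dtype,
            )
            linear.weight.data.copy_(fused[idx].T)
            experts.append(linear)
        return experts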

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@mergify mergify Bot removed the quality-failed label Apr 30, 2026
@mergify
Contributor

mergify Bot commented Apr 30, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional dependencies to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

@maxdebayser

Thanks for working on this @kylesayrs. The IBM Spyre stack for vLLM depends on llm-compressor because of this library: https://github.com/foundation-model-stack/fms-model-optimizer. Currently we're unable to upgrade to transformers 5 due to this. Do you need any help?

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@mergify mergify Bot removed the quality-failed label May 5, 2026
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented May 5, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kylesayrs.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 5, 2026
kylesayrs added 2 commits May 5, 2026 16:36
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
kylesayrs added 2 commits May 5, 2026 18:24
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
