
[Bugfix] Fix FusedMoE weight loading with padded hidden dimensions#37010

Open
SandishKumarHN wants to merge 1 commit into vllm-project:main from SandishKumarHN:issue-36926

Conversation


@SandishKumarHN SandishKumarHN commented Mar 13, 2026

Summary

Fixes #36926

When DeepEP/NIXL EP backends round up hidden_size for alignment (e.g., 2688 → 3072), FusedMoE weight parameters are allocated with the padded size but
checkpoint weights have the original size. This causes RuntimeError: The size of tensor a (3072) must match the size of tensor b (2688) during
expert_data.copy_(loaded_weight) in weight loading.
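The mismatch is easy to reproduce in isolation (the sizes below are taken from the issue; this is an illustrative standalone snippet, not vLLM code):

```python
import torch

# Parameter allocated with the padded hidden size...
param = torch.zeros(3072, 512)
# ...while the checkpoint weight keeps the original hidden size.
ckpt = torch.ones(2688, 512)

try:
    # A direct copy fails: dim 0 differs (3072 vs 2688) and cannot broadcast.
    param.copy_(ckpt)
except RuntimeError as e:
    print(e)
```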

  • Added _narrow_expert_data_for_padding() static method that narrows padded parameter dimensions to match checkpoint weights before copying
  • Applied to _load_w2, _load_w13, and _load_per_channel_weight_scale (3 paths)
  • Excluded BitsAndBytes w2 path — BnB params are flat packed-integer tensors where copy_() is intercepted by __torch_function__ for in-flight quantization;
    shapes are intentionally different
  • When hidden_size is not padded (common case), the helper is a no-op since all dimensions already match
  • Follows the existing narrowing pattern used by the mxfp4 quantization path (lines 1069-1078)
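A minimal sketch of what such a helper might look like, based on the description above (the method name is from this PR, but the body here is an illustrative reconstruction, not the merged code):

```python
import torch

def _narrow_expert_data_for_padding(
    expert_data: torch.Tensor, loaded_weight: torch.Tensor
) -> torch.Tensor:
    """Narrow padded parameter dims down to the checkpoint weight's shape.

    narrow() returns a view into the same storage, so the subsequent
    copy_() still writes into the original (padded) parameter. When no
    dimension is padded, the loop never narrows and this is a no-op.
    """
    for d in range(min(expert_data.ndim, loaded_weight.ndim)):
        if expert_data.shape[d] != loaded_weight.shape[d]:
            expert_data = expert_data.narrow(d, 0, loaded_weight.shape[d])
    return expert_data

# hidden_size padded 2688 -> 3072 for alignment; checkpoint keeps 2688
param = torch.zeros(3072, 512)
ckpt = torch.ones(2688, 512)
view = _narrow_expert_data_for_padding(param, ckpt)
view.copy_(ckpt)  # fits: writes land in the first 2688 rows of param
```

Because the narrowed tensor is a view, the padded tail rows of the parameter are simply left untouched.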

Not a duplicate: Checked open PRs — #34285 and #30647 address roundup refactoring and forward-pass padding, not weight loading.

Test plan

  • New unit tests for _narrow_expert_data_for_padding (7 cases: matching shapes, w2/w13 dims, 3D tensors, 1D scales, scalar weights, storage sharing)
  • New integration tests for padded weight loading (w2, w13, no-padding no-op)
  • python -m pytest tests/kernels/moe/test_moe_weight_loading_padded.py -v — 10/10 pass
  • python -m pytest tests/kernels/moe/ -v -k "not deepep" — existing MoE tests

@mergify bot added the bug (Something isn't working) label on Mar 13, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a bug in FusedMoE weight loading where padded hidden dimensions cause shape mismatches. The solution introduces a _narrow_expert_data_for_padding static method to correctly slice the parameter tensor before copying weights, which is a sound approach. The changes are applied to all relevant weight loading paths and are accompanied by a comprehensive set of new tests. My review found a potential IndexError in the new helper function if the input tensors have different ranks. I've provided a suggestion to make the implementation more robust. Overall, this is a good fix for the issue.

Comment on lines +951 to +953:

    for d in range(loaded_weight.ndim):
        if expert_data.shape[d] != loaded_weight.shape[d]:
            expert_data = expert_data.narrow(d, 0, loaded_weight.shape[d])

Severity: high

The loop for d in range(loaded_weight.ndim): is not entirely safe. If expert_data.ndim < loaded_weight.ndim, this will raise an IndexError when accessing expert_data.shape[d]. While this might not be a typical scenario for padding issues, making the code more robust against such cases would prevent potential crashes.

I recommend iterating up to min(expert_data.ndim, loaded_weight.ndim). This ensures the loop doesn't go out of bounds. If the tensor ranks are different, it's likely not a simple padding problem, and the subsequent copy_ operation will correctly fail with a shape mismatch error, providing a clearer signal of the underlying issue.

Suggested change:

    # before
    for d in range(loaded_weight.ndim):
        if expert_data.shape[d] != loaded_weight.shape[d]:
            expert_data = expert_data.narrow(d, 0, loaded_weight.shape[d])

    # after
    for d in range(min(expert_data.ndim, loaded_weight.ndim)):
        if expert_data.shape[d] != loaded_weight.shape[d]:
            expert_data = expert_data.narrow(d, 0, loaded_weight.shape[d])
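To see why the bound matters, a quick standalone illustration with hypothetical shapes (not taken from the PR): with a 1-D parameter and a 2-D loaded weight, the unbounded loop indexes past the parameter's rank, while the bounded loop simply falls through.

```python
import torch

expert_data = torch.zeros(4)      # 1-D parameter (e.g. a per-channel scale)
loaded_weight = torch.ones(4, 2)  # higher-rank loaded weight

# range(loaded_weight.ndim) would reach d=1 and raise IndexError on
# expert_data.shape[1]; bounding by min() avoids the crash and lets a
# later copy_() report the real shape mismatch with a clearer error.
for d in range(min(expert_data.ndim, loaded_weight.ndim)):
    if expert_data.shape[d] != loaded_weight.shape[d]:
        expert_data = expert_data.narrow(d, 0, loaded_weight.shape[d])
```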

… backends round up hidden_size (e.g., 2688 -> 3072)

Signed-off-by: SandishKumarHN <sandish@fb.com>

Successfully merging this pull request may close these issues.

[Bug]: nemotron_h does not work with DeepEP all2all backends due to hidden dim rounding
