[Bugfix] Fix FusedMoE weight loading with padded hidden dimensions #37010
SandishKumarHN wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request addresses a bug in FusedMoE weight loading where padded hidden dimensions cause shape mismatches. The solution introduces a _narrow_expert_data_for_padding static method to correctly slice the parameter tensor before copying weights, which is a sound approach. The changes are applied to all relevant weight loading paths and are accompanied by a comprehensive set of new tests. My review found a potential IndexError in the new helper function if the input tensors have different ranks. I've provided a suggestion to make the implementation more robust. Overall, this is a good fix for the issue.
```python
for d in range(loaded_weight.ndim):
    if expert_data.shape[d] != loaded_weight.shape[d]:
        expert_data = expert_data.narrow(d, 0, loaded_weight.shape[d])
```
The loop `for d in range(loaded_weight.ndim):` is not entirely safe. If `expert_data.ndim < loaded_weight.ndim`, this will raise an `IndexError` when accessing `expert_data.shape[d]`. While this might not be a typical scenario for padding issues, making the code more robust against such cases would prevent potential crashes.

I recommend iterating up to `min(expert_data.ndim, loaded_weight.ndim)`. This ensures the loop doesn't go out of bounds. If the tensor ranks are different, it's likely not a simple padding problem, and the subsequent `copy_` operation will correctly fail with a shape mismatch error, providing a clearer signal of the underlying issue.
```diff
-for d in range(loaded_weight.ndim):
+for d in range(min(expert_data.ndim, loaded_weight.ndim)):
     if expert_data.shape[d] != loaded_weight.shape[d]:
         expert_data = expert_data.narrow(d, 0, loaded_weight.shape[d])
```
… backends round up hidden_size (e.g., 2688 -> 3072) Signed-off-by: SandishKumarHN <sandish@fb.com>
Force-pushed from ee71cce to 1181bd0
Summary
Fixes #36926
When DeepEP/NIXL EP backends round up `hidden_size` for alignment (e.g., 2688 → 3072), `FusedMoE` weight parameters are allocated with the padded size, but checkpoint weights keep the original size. This causes `RuntimeError: The size of tensor a (3072) must match the size of tensor b (2688)` during `expert_data.copy_(loaded_weight)` in weight loading.

- Adds a `_narrow_expert_data_for_padding()` static method that narrows padded parameter dimensions to match checkpoint weights before copying.
- Applies it in `_load_w2`, `_load_w13`, and `_load_per_channel_weight_scale` (3 paths).
- The path where `copy_()` is intercepted by `__torch_function__` for in-flight quantization is left alone; its shapes are intentionally different.
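A minimal reproduction of the failure described above, using the sizes from the issue (2688 padded to 3072); this is a sketch assuming only PyTorch, not the actual loader code:

```python
import torch

padded = torch.zeros(3072)      # parameter allocated at padded hidden_size
checkpoint = torch.ones(2688)   # checkpoint weight at original hidden_size

try:
    padded.copy_(checkpoint)    # raises: size of tensor a (3072) must
except RuntimeError as err:     # match size of tensor b (2688)
    print(type(err).__name__)

# Narrowing the parameter view first, as the fix does, lets the copy
# succeed while leaving the padding region zero-initialized:
padded.narrow(0, 0, 2688).copy_(checkpoint)
assert padded[:2688].eq(1).all() and padded[2688:].eq(0).all()
```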
Not a duplicate: Checked open PRs — #34285 and #30647 address roundup refactoring and forward-pass padding, not weight loading.
Test plan
- Unit tests for `_narrow_expert_data_for_padding` (7 cases: matching shapes, w2/w13 dims, 3D tensors, 1D scales, scalar weights, storage sharing)
- `python -m pytest tests/kernels/moe/test_moe_weight_loading_padded.py -v` — 10/10 pass
- `python -m pytest tests/kernels/moe/ -v -k "not deepep"` — existing MoE tests pass
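A few of the listed cases can be illustrated with hypothetical standalone checks (the real suite is `tests/kernels/moe/test_moe_weight_loading_padded.py`; `narrow_to_match` below is an illustrative stand-in for the helper):

```python
import torch

def narrow_to_match(expert_data, loaded_weight):
    # Stand-in for _narrow_expert_data_for_padding, for illustration only.
    for d in range(min(expert_data.ndim, loaded_weight.ndim)):
        if expert_data.shape[d] != loaded_weight.shape[d]:
            expert_data = expert_data.narrow(d, 0, loaded_weight.shape[d])
    return expert_data

# Matching shapes: the tensor is returned unchanged.
t = torch.zeros(4, 4)
assert narrow_to_match(t, torch.zeros(4, 4)) is t

# 1D per-channel scale narrowed to the checkpoint length.
assert narrow_to_match(torch.zeros(3072), torch.zeros(2688)).shape == (2688,)

# Scalar (0-dim) weights: the loop body never runs.
assert narrow_to_match(torch.tensor(1.0), torch.tensor(2.0)).ndim == 0

# Storage sharing: the narrowed result is a view of the parameter.
p = torch.zeros(4, 8)
v = narrow_to_match(p, torch.zeros(4, 6))
assert v.data_ptr() == p.data_ptr()
```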