Fix nan in global scaling factor for large scale nvfp4 EP#13162
Fridge003 merged 2 commits into sgl-project:main
Conversation
This reverts commit 99e2580.
Just had a sync with @wenscarl.
The root cause of the NaN issue wasn't that the loader failed to map physical to logical experts correctly; rather, it was that the mapping must include all physical expert indices when `sglang_require_global_experts=True`. This is because each layer's `w13_input_scale` is allocated based on the total number of physical experts.
For example, if there are 256 logical experts and 2 redundant experts, an incorrect logical→physical map might look like:
logical -> physical
0 -> 256
1 -> 257
2 -> 2
...
255 -> 255
The input_scales tensor would have shape (258,), and it then would look like this after loading:
physical
0 -> loading nothing (wrong!)
1 -> loading nothing (wrong!)
2 -> loading logical 2
...
255 -> loading logical 255
256 -> loading logical 0
257 -> loading logical 1
After this PR, we ensure that all 258 physical experts correctly load something, which resolves the NaN issue.
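The behavior described above can be sketched as follows. This is an illustrative toy, not the actual sglang loader code: the map `physical_to_logical`, the shapes, and the checkpoint tensor are all hypothetical, chosen to match the 256-logical / 2-redundant example in the discussion.

```python
import torch

# Illustrative setup matching the example above (hypothetical names/shapes).
num_logical = 256            # logical experts in the checkpoint
num_redundant = 2            # redundant (duplicated) experts
num_physical = num_logical + num_redundant  # 258

# Hypothetical physical -> logical map: redundant slots 256/257 duplicate
# logical experts 0/1; every physical index maps to some logical expert.
physical_to_logical = list(range(num_logical)) + [0, 1]

# The checkpoint holds one input scale per logical expert.
ckpt_input_scale = torch.rand(num_logical)

# Allocate per-PHYSICAL storage (as the real loader does) and fill every
# row, so no slot is left holding the undefined contents of torch.empty.
w13_input_scale = torch.empty(num_physical)
for phys in range(num_physical):
    w13_input_scale[phys] = ckpt_input_scale[physical_to_logical[phys]]

assert not torch.isnan(w13_input_scale).any()
```

The key point is the loop over physical indices: iterating over the logical→physical map instead (as in the buggy version) can leave physical slots 0 and 1 unwritten whenever their logical counterparts map elsewhere.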
That's exactly my understanding.
…-project#13162) Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
… and sgl-project#13341) (sgl-project#13348)" This reverts commit 78a4b44.
Motivation
Part of #12866. As found in this PR, the root cause is that `w13_input_scale`/`w2_input_scale` should not depend on `logical_to_all_physical_map`. In this fix, we read in all physical experts' input scales regardless of their logical expert IDs. cc @kaixih @Fridge003
The root cause is:
The `w13_input_scale` tensor has shape [288, ...], but if the map `logical_to_all_physical_map` is as in the example above, the 0th and 1st rows are never written; and since `w13_input_scale` is initialized by `torch.empty`, the corresponding values are undefined.

Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist