[MoE Refactor][15/N] Apply Refactor to Fp8 #31415
Merged
Changes from all commits — 213 commits
e5c50db
cleanup process weights after loading
b1dddfd
removing spurious aiter stuff
78e9289
removing spurious aiter stuff
f1ae727
good codex bot
1a576b8
revert spurious aiter stuff
70367ac
reduce LOC changes
8425200
reduce LOC changes
f414f6c
further simplification
b31a8cb
updated
f2c70e1
cleanup comment
bd2046b
fix custom routing function for flashinfer
6655381
invalid checks in FP8MoE
a8820d8
stashing ... mixtral via flashinfer is not working properly
844093c
merged aiter
cc2df79
updated
6db374b
updated
c2cc8e1
stash
1cf3b88
fix bad merge and bad qdq for per-tensor
985a0ab
weight rotation should only happen for per-tensor
7defa14
improve error message
078fdc4
updated
b55d8e3
update marlin ordering
8fbae90
improve comments
4bbb70f
add helper functions to share between online and offline quantization
bd72e61
fix up condition
24d219b
fix up condition
c290ebb
fix up condition
550e763
update to revert cleanup
d1fba0e
Merge branch 'main' into clean-up-fp8-process-after-loading
robertgshaw2-redhat 7822c4d
Merge remote-tracking branch 'origin/main' into clean-up-fp8-process-…
32ef76a
updated
a8c5927
updated with proper imports
80039a0
added missing file
d9dfa7b
abstract the kernel swapping
bbcd012
abstract the kernel swapping
5822325
start applying to ct_moe_fp8
3f2322a
further progress
a7b6550
further progress
1737ed6
nit logging
cba5504
factor out the process after loading for requantize
b2bc870
rename setup_kernel -> make_kernel
14247f0
make convert_weights_to_runtime_format a pure function
570c359
factor out process weights after load
50d0bea
make names shorter
1e750fc
remove marlin_input_dtype
d4c5c4d
force reset
d626763
remove unneeded change
f7aaa16
remove imports
cdcc9f5
guard against TRT-LLM
ad6dc86
rename get_fp8_moe_backend -> select_fp8_moe_backend
dab20c6
add todo
5128d77
initial refactoring
2bc0a42
apply refactor to fp8 and ct
8a3b303
apply refactor globally
47a58a2
updated fused moe quant config for weight scale name
4aa3a64
Merge branch 'main' into fix-up-marlin-prepare-layer
robertgshaw2-redhat 3eaa036
update comment
7ce4ef8
fix marlin tensor
5a55df8
Merge remote-tracking branch 'origin/main' into apply-refactor-to-ct
9d49046
merged main
8369d49
in process of refactoring cutlass fp8
4c3b253
convert to use mk for compressed-tensors
34cdaaa
updated
9ad1a09
fix online quantization
dad15e2
fix online quantization
542d2de
reduce name length for fewer newlines
758fae9
update name to convert_to_fp8_moe_kernel_format
35a0426
w13_weight -> w13 etc to reduce line breaks
33898b4
try to reduce loc change
7e8db4f
try to reduce loc change
0e9629d
rename make_kernel -> make_fp8_moe_kernel
2a32bac
update commentary about disable expert map
864379e
cleanup
b7091e3
removing run_cutlass_moe_fp8
da47fe0
add back ops import
3658b48
remove strides construction
464bb50
remove run_cutlass_fp8
b411440
apply to test cutlass moe
b17cb0e
remove import
e045766
fixed failing test
1d134d5
attempt to fix cutlass moe unit test
8e864f0
init workspace manager
b3f9a10
pre-commit on marlin input dtype
f915373
some more tweaks
f30ae4b
revert change to stray file
5f99481
clean up select_gemm_impl
8bdedc2
we are now passing for fp8.py triton block!
face1ce
merged main
9f31d65
reduce loc change
47ca569
update comment
df49c3a
update tritonordeepgemmexperts
4828063
updated
6cefc23
split into separate situation
4880aef
add small batch fallback for cutlass
b89f68f
add small batch fallback for cutlass
d43dbb5
fix fallback
19ae34f
revert changes to marlin utils file
3845ee8
revert changes to get_marlin_input_dtype
3196cb1
nits
461cd8c
the order of ABC, Class matters for some reason in python
16aa74e
stash
be0abe2
apply changes to modelopt
robertgshaw2-redhat 4eed452
remove unneeded cruft
robertgshaw2-redhat 450f035
cleanup initialization
robertgshaw2-redhat d6a1f64
initial commit
9a28683
move to cuda before the reshapes for r&d
e84eaa2
clean
058a998
clean
d53b6ff
clean
9a7cf4d
clean
1182e1d
clean
2126f98
clean
33741a8
clean
31c4e22
stash
e0129dd
working end to end
eb6699b
comment nits
5be7ab1
comment nits
844a65a
remove
24a0302
rename method
7edf70f
stash trtllm fix
9d994a6
stash changes
6ff4b75
Merge branch 'fix-flashinfer-experts-quant-config-hack' of https://gi…
c9a7e5b
updated
2408ad2
make trtllm work
96ff599
add back import
f9a4724
update comments
e8831f9
apply changes to fp8.py
f8f9a33
nit
59f97a6
revert unneeded assets
113e472
rename
a98a380
update comment
df82e9c
Merge branch 'fix-flashinfer-experts-quant-config-hack' of https://gi…
df5035c
naming
3678402
add back check to prevent mixtral
783b64d
Merge branch 'main' into fix-flashinfer-experts-quant-config-hack
robertgshaw2-redhat c30d404
remove delete me
a285f5e
update to address pavani's feedback
2d96161
reduce LOC change
a910872
fix
d2decd6
reduce loc nit
870fc6a
clean up fi checking
23e79fd
updated
86a0e5c
fix assign float to parameter
7eaa18b
updated doc string
83a7d9b
fix up
7300bc5
fix up
344167d
fix up configs
b887c4f
cleanup
d4d4231
remove unneeded assert
218e697
standardize how we add the params
b2e3a50
updated
dd30416
updated
3d22ba3
updated
56edeca
a few small nits
e917f5d
fix tests
39987f6
revert the llama weight loading hack
140f447
stash
de6faa1
unstash
173e67d
Merge remote-tracking branch 'origin/main' into fix-flashinfer-expert…
fac4014
merge
robertgshaw2-redhat c1c1195
fix merge from amd guard
robertgshaw2-redhat 79acaac
merge the fi branch
robertgshaw2-redhat caf46be
rename flashinfer trtllm function names
robertgshaw2-redhat 2a5a58b
cleanup!
robertgshaw2-redhat 4d76cb6
tests are passing
robertgshaw2-redhat d304a73
remove comments
robertgshaw2-redhat 2ee1c44
updated
a5a1d0b
circular import
5a627d8
updated the details
2a5c255
fix typing
08a1979
fixed cutlass block
8f2341f
clean up deepgemm a bit
85d59c8
use proper naming for modelopt
1a69be2
use proper naming for modelopt
8c4dddf
use proper naming for modelopt
16721e5
merged main
8fec574
update log for unsupported
95f0b37
Merge remote-tracking branch 'origin/main' into apply-refactor-to-ct
609b9b9
merge main
3ceb254
nit
af2cbd3
update convert_weights_to_kernel_format
fb6e402
revert use for flashinfer_moe_backend
da6218a
stash work
64b7ba5
fix trtllm kernel
ebd76f2
fix importing issues
6afe4bb
fix compressed tensors issue
eecb7dc
fix lint
804c147
fix error from lack of routing
b6e5dc5
delayed imports
617c662
fix cutlass tensor
6b1d1ad
make marlin pass
2c2e274
make things easier to follow in the ci logs
eb83e8d
add dp/ep
c7424b7
Merge branch 'main' into apply-refactor-to-ct
robertgshaw2-redhat e41d147
nit
3469b8d
updated
af5a4f4
updated
34175d1
update the test coverage for dp/ep
2687d2c
fix up again, some of the a2a backends are not working
1eceb09
Merge branch 'main' into apply-refactor-to-ct
robertgshaw2-redhat 9ce786a
update oracle to not use cutlass for block quant
fefa376
delete llama 4 load time optim
5cccc6c
docs fix
9810137
revert .to(cuda)
35c3bc3
updated with Bill's comments
ce8deb7
revert change for llama experts not being loaded properly
08742c5
fix marlin comment
a369a51
Merge branch 'main' into apply-refactor-to-ct
robertgshaw2-redhat d4486f8
updated the access logs
be3dc9a
updated
a927bec
updated
192339d
delay imports
5dccfc6
fix missing
tests/evals/gsm8k/configs/moe-refactor-dp-ep/Llama-4-Scout-Fp8-ModelOpt-triton.yaml (5 additions, 0 deletions)

```yaml
model_name: "nvidia/Llama-4-Scout-17B-16E-Instruct-FP8"
accuracy_threshold: 0.92
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --data-parallel-size 2 --enable-expert-parallel"
```
.../evals/gsm8k/configs/moe-refactor-dp-ep/Qwen3-30B-A3B-Fp8-AutoFp8-deepgemm-deepep-ht.yaml (8 additions, 0 deletions)

```yaml
model_name: "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --data-parallel-size 2 --enable-expert-parallel --all2all-backend deepep_high_throughput"
env:
  VLLM_USE_DEEP_GEMM: "1"
  VLLM_USE_DEEP_GEMM_MOE: "1"
```
.../evals/gsm8k/configs/moe-refactor-dp-ep/Qwen3-30B-A3B-Fp8-AutoFp8-deepgemm-deepep-ll.yaml (9 additions, 0 deletions)

```yaml
model_name: "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --data-parallel-size 2 --enable-expert-parallel --all2all-backend deepep_low_latency --disable-uvicorn-access-log"
env:
  VLLM_USE_DEEP_GEMM: "1"
  VLLM_USE_DEEP_GEMM_MOE: "1"
  VLLM_USE_DEEP_GEMM_E8M0: "0"
```
tests/evals/gsm8k/configs/moe-refactor-dp-ep/Qwen3-30B-A3B-Fp8-AutoFp8-deepgemm.yaml (8 additions, 0 deletions)

```yaml
model_name: "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8"
accuracy_threshold: 0.88
num_questions: 1319
num_fewshot: 5
server_args: "--enforce-eager --max-model-len 8192 --data-parallel-size 2 --enable-expert-parallel"
env:
  VLLM_USE_DEEP_GEMM: "1"
  VLLM_USE_DEEP_GEMM_MOE: "1"
```
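The four eval configs added by this PR share one shape: a model name, a gsm8k accuracy threshold, a question and few-shot count, a `server_args` string, and an optional `env` mapping. A minimal sketch of how a harness could turn one of these parsed configs into a server launch command — the `vllm serve <model> <args>` command shape and the `build_server_cmd` helper are illustrative assumptions, not code from this PR:

```python
import os
import shlex

# One of the configs above, as it would look after YAML parsing
# (values copied from Qwen3-30B-A3B-Fp8-AutoFp8-deepgemm.yaml).
cfg = {
    "model_name": "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
    "accuracy_threshold": 0.88,
    "num_questions": 1319,
    "num_fewshot": 5,
    "server_args": "--enforce-eager --max-model-len 8192 "
                   "--data-parallel-size 2 --enable-expert-parallel",
    "env": {"VLLM_USE_DEEP_GEMM": "1", "VLLM_USE_DEEP_GEMM_MOE": "1"},
}


def build_server_cmd(cfg: dict) -> tuple[list[str], dict[str, str]]:
    """Assemble a server command line and env overrides from an eval config.

    The server_args string is tokenized shell-style, and the config's env
    mapping is layered on top of the current process environment.
    """
    cmd = ["vllm", "serve", cfg["model_name"], *shlex.split(cfg["server_args"])]
    env = {**os.environ, **{k: str(v) for k, v in cfg.get("env", {}).items()}}
    return cmd, env


cmd, env = build_server_cmd(cfg)
```

The resulting `cmd` would be passed to a process launcher with `env`, after which the harness scores gsm8k and compares the result against `accuracy_threshold`.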
Review comment: this function was unused.