Add model gpt-oss #8822
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Add model gpt-oss #8822
138 commits
0123240
init orangina
1b6d426
trivial modification
jychen21 b28605c
debugging weight loading
jychen21 966bab4
fix o_proj linear error
jychen21 8c545d9
fix config compatiblity & RMSNorm failed with no kernel is available
jychen21 918e3c0
rename
jychen21 0bba216
add sliding window attn every other layer & fix attn sinks weight loa…
jychen21 87d81a6
update
jychen21 81bb3bf
use silu as a WA
jychen21 8febdd6
moe bias support (not finished)
jychen21 606961e
fix moe intermediate size configuration to match intermediate size
xutizhou d1b159a
fix the problem that mlp weight is not really loaded, and mlp bias su…
linhu-nv fab80db
Merge branch 'linhu/dev' into 'feat/orangina'
linhu-nv 884666a
add structure
xutizhou fc37426
to display
xutizhou 2355609
rm model sstructure
xutizhou c49f8ee
Update OpenAIMoeAttention to set sinks parameter dtype to bfloat16
xutizhou c426b23
Refactor OpenAIMoeAttention by removing RMSNorm normalization for que…
xutizhou 4eb4680
Add a TODO comment to ensure correct sliding window size for flashinf…
xutizhou f4ab9cb
mark attn o_proj all reduce
xutizhou ee3c851
Add bias support to FusedMoE and related quantization methods
xutizhou 7958fa3
Add TODO comment to indicate future replacement of gate with router i…
xutizhou 5bb606f
fix rope yarn mismatch issue
jychen21 76c206d
Merge remote-tracking branch 'refs/remotes/origin/feat/orangina' into…
xutizhou c9fdd37
Merge branch 'feat/orangina' into 'feat/orangina'
xutizhou 51f9a92
Fix naming mismatch for gate and router parameters in OpenAIMoeForCau…
xutizhou 4abddbc
Update comment to clarify naming convention for gate and router param…
xutizhou 9b9320d
Merge branch 'feat/orangina' into 'feat/orangina'
xutizhou 314bf1a
Bug fix: FusedMoE expert weight loader can not load weights
jychen21 3f0bbd4
bf16 fusedmoe integration
zhuofan1123 00ae587
Merge branch 'moe' into 'feat/orangina'
zhuofan1123 d129e5b
add accuracy test
jychen21 ca3061f
trivial code changes
jychen21 63af97e
add one prompt test
jychen21 e9fcf71
loop import issue solved even we dont change the import codes
linhu-nv 209247a
Merge branch 'linhu/dev' into 'feat/orangina'
linhu-nv 38b2d2e
add mxfp4 triton api
zhuofan1123 da5fc63
Merge branch 'moe' into 'feat/orangina'
zhuofan1123 358347c
Enhance FlashInfer backend with attention sink support and add relate…
xutizhou ccfa992
Merge remote-tracking branch 'upstream/feat/orangina' into feat/atten…
xutizhou 1322f26
Update OpenAIMoeAttention to incorporate attention sink parameter in …
xutizhou b5c0eb4
Refactor OpenAIMoeAttention to use plural 'sinks' for clarity in atte…
xutizhou 1a61c44
Add enable_attention_sink parameter to OpenAIMoeAttention initializat…
xutizhou c528f22
Add sink attention mechanism to OpenAIMoe and Qwen3Moe models, introd…
xutizhou 5c3f293
support mxfp4 moe
zhuofan1123 b4ac555
pad weight for Hopper
zhuofan1123 6bb7fa2
add args for mxfp4
zhuofan1123 842f3e6
Merge branch 'moe' into 'feat/orangina'
zhuofan1123 f4faaae
make continuous after transpose
zhuofan1123 4b0d768
Update d_rcp calculation in FlashInfer backend to incorporate exponen…
xutizhou 43355f8
Refactor OpenAIMoeAttention to utilize precomputed sink values, simpl…
xutizhou 4a57691
Merge remote-tracking branch 'upstream/feat/orangina' into feat/atten…
xutizhou b558084
Add debug prints for prompt and generated output in throughput test
xutizhou 3a78899
Remove unused sink_softmax and sink_attention_ref functions from Qwen…
xutizhou ecca06e
Remove unused imports from openai_moe.py to clean up the codebase and…
xutizhou 452d567
Accuracy fix: Attn using reference sdpa impl as a WA
jychen21 c9e075a
Add a TODO comment to OpenAIMoeAttention regarding potential exponent…
xutizhou a2c3376
Add tensor logging functionality in FlashInfer backend to track input…
xutizhou 70c28c7
Refactor OpenAIMoeAttention to implement sink attention mechanism, en…
xutizhou 5b8e09d
Refactor debug logging in OpenAIMoeAttention to always print layer_id…
xutizhou f1ba131
Update OpenAIMoeAttention to conditionally log tensor shapes based on…
xutizhou ce0ec2a
Refactor FlashInfer backend to simplify window size calculation by re…
xutizhou 32fa9ee
Update tolerance levels in OpenAIMoeAttention tensor comparison to im…
xutizhou 27a562f
Merge remote-tracking branch 'upstream/feat/orangina' into feat/atten…
xutizhou cd9a5b1
Add sdpa function to openai_moe.py for enhanced attention mechanism w…
xutizhou 281b18e
Add new openai_moe.py file and implement QK attention calculation in …
xutizhou 1d00b7e
Update sliding_window_size parameter in OpenAIMoeAttention to ensure …
xutizhou 300f14f
Add flashinfer_attention_ref method to OpenAIMoeAttention for improve…
xutizhou b322437
Refactor flashinfer_attention_ref method in OpenAIMoeAttention to acc…
xutizhou 55840e6
Update OpenAIMoeAttention to check if sinks need fp32 before exp and …
xutizhou 399feed
debugging acc & bug fix
jychen21 dc9187e
Refactor attention output calculation in OpenAIMoeAttention by renami…
xutizhou fdff13c
Merge branch 'feat/attention_sink_final' into 'feat/orangina'
xutizhou 9c08480
Update sliding_window handling in OpenAIMoeAttention to default to -1…
xutizhou d29a62c
Merge branch 'feat/attention_sink_final' into 'feat/orangina'
xutizhou 467ca00
Update sliding_window handling in OpenAIMoeAttention to default to -1…
xutizhou 6d3c212
Merge branch 'feat/attention_sink_final' into 'feat/orangina'
xutizhou 2401d39
Add key tokens into promt for wrong detokenizing
jychen21 d847171
Implement torch native attention version supporting both sink and sli…
jychen21 7ddc192
remove sdpa ref cause decode phase can not simply use this, use 'torc…
jychen21 8f87d2e
remove sdpa ref cause decode phase can not simply use this, use 'torc…
jychen21 4f87f27
disable shuffle for pre-final weights
zhuofan1123 2e99f16
First e2e accuracy test, verified on gsm8k(0.735) and mmlu(0.828)
jychen21 92ebb1b
uncomment mmlu acc target assertion
jychen21 1d507a4
fix mxfp4 for tp
zhuofan1123 72eea4f
renaming model to gpt-oss as recommended
jychen21 b3bcf23
Refactor weight processing in UnquantizedFusedMoEMethodOpenAI to remo…
xutizhou 2a33809
Refactor weight and bias parameter handling in FusedMoE to streamline…
xutizhou eef6505
Add two modes for SwiGLU act (chunk / pairwise)
jychen21 2c8e56d
remove WA in layernorm forward_cuda, just call layernorm forward_nati…
jychen21 930b974
Enhance FlashInfer attention mechanism by adjusting window handling a…
xutizhou d7c40d7
Merge remote-tracking branch 'upstream/feat/orangina' into feat/orangina
xutizhou 07375b3
sliding_window remove -1 for torch native impl
jychen21 3e38c7c
Refactor attention sink handling in OpenAIMoeAttention to conditional…
xutizhou 40c3220
Merge remote-tracking branch 'upstream/feat/orangina' into feat/orangina
xutizhou ede293e
load weight for pair-wise act
zhuofan1123 519ad18
Enhance SwiGLU implementation by adding a pair_wise option for tensor…
xutizhou 0f110f0
Refactor FlashInfer attention backend to utilize layer.attention_sink…
xutizhou e41fda4
Refactor sink_attention_ref function to improve handling of query and…
xutizhou 9b2fd12
fix torch native backend bug
xutizhou 2ca2c16
Remove workaround for orangina in RMSNorm and add pair_wise_act param…
xutizhou d66f0f3
Refactor get_attention_sliding_window_size function to simplify logic…
xutizhou 1513244
update acc test serving args
jychen21 b74b149
add clamp limit
zhuofan1123 a4f7c2e
fix flashinfer accuracy bug
xutizhou 010093a
Merge branch 'main' into feat/orangina
jychen21 ce0ab79
Merge branch 'feat/orangina-backup' into feat/orangina
jychen21 eba1e69
fix code format
jychen21 e270628
Tune accuracy test params: temperature1.0 top_p1.0 top_k0.0, add chat…
jychen21 9b5b80b
Refactor version assertion logic and enhance SWAChunkCache eviction h…
xutizhou 939a2b8
reduce mxfp4 memory usage
zhuofan1123 4a56bef
Update SWAChunkCache to require attention_chunk_size as an int and in…
xutizhou 5b7169b
Merge remote-tracking branch 'upstream/feat/orangina' into feat/orangina
xutizhou f2a7796
Set default attention_chunk_size to 128 in model_config.py and remove…
xutizhou c407e58
Adapt simple eval to support orangina reasoning mode, system_message,…
jychen21 3e76fad
gpqa bug fix
jychen21 c9ad3cf
Fix: Make anser comparison case-insenstive (GPQA)
jychen21 6e7da63
Update default attention_chunk_size to None in model_config.py
xutizhou 7268b8a
Update default enable_attention_sink to False in FlashInferAttnBacken…
xutizhou 1f3aeff
support mxfp4 quant config
zhuofan1123 c322662
fix scale loading issue when tp_size>2
zhuofan1123 4288d18
make gemm2_output contiguous for hopper
zhuofan1123 3feb217
remove original moe impl
zhuofan1123 48f7864
add args for fp8 activation
zhuofan1123 7eba750
rename arg
zhuofan1123 99d0f75
remove checkpoint_weights_transposed
zhuofan1123 022b646
Merge branch 'quant' into 'feat/orangina'
zhuofan1123 2c56ec8
rename to gpt_oss
zhuofan1123 8205914
Add Triton kernel to set tensor to zero and update scale initializati…
xutizhou 46653db
fix flashinfer + cuda graph
PerkzZheng 0835c26
remove file
zhuofan1123 6661e8c
Update sink parameter type to float32 and remove unused flashinfer code
xutizhou d16cdd4
Merge remote-tracking branch 'upstream/feat/orangina' into feat/orangina
xutizhou 9982849
recover QUERY_TEMPLATE_MULTICHOICE and ANSWER_PATTERN_MULTICHOICE
jychen21 e4db1ba
Merge remote-tracking branch 'github/main' into final_rebase
xutizhou 1972bb1
[refactor] Update imports and enhance deepep mode handling in GptOssM…
xutizhou 8ac1978
Update SamplerResponse
jychen21 6640f5c
Clean up
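Several commits above add "two modes for SwiGLU act (chunk / pairwise)" and corresponding weight loading. As a rough sketch of what those two layouts likely mean (an assumption based on the commit messages, not the PR's actual kernels): the fused gate/up projection output can store the gate either as the first half of the last dimension ("chunk") or interleaved with the up values ("pairwise").

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_chunk(x):
    # "chunk" layout: first half of the last dim is the gate, second half is up
    gate, up = np.split(x, 2, axis=-1)
    return silu(gate) * up

def swiglu_pairwise(x):
    # "pairwise" layout: gate and up values alternate element by element
    gate, up = x[..., 0::2], x[..., 1::2]
    return silu(gate) * up

# The two modes compute the same activation on correspondingly permuted inputs.
x_chunk = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
x_pair = np.empty_like(x_chunk)
x_pair[0::2], x_pair[1::2] = x_chunk[:4], x_chunk[4:]
assert np.allclose(swiglu_chunk(x_chunk), swiglu_pairwise(x_pair))
```

The `clamp limit` mentioned in a later commit (gpt-oss clamps the gate/up values) is omitted here for brevity.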
New file (the submodule declaration, i.e. `.gitmodules`):

```diff
@@ -0,0 +1,4 @@
+[submodule "3rdparty/triton"]
+	path = 3rdparty/triton
+	url = https://github.com/dongfengy/triton.git
+	branch = fused_moe_triton_0613
```
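Since `.gitmodules` uses git's INI-style config syntax, the stanza above can be inspected with Python's stdlib `configparser`. This is a hypothetical check script, not part of the PR:

```python
import configparser

# The submodule stanza added by this PR (tabs removed; configparser treats
# leading whitespace as line continuations, unlike git's own parser).
GITMODULES = """\
[submodule "3rdparty/triton"]
path = 3rdparty/triton
url = https://github.com/dongfengy/triton.git
branch = fused_moe_triton_0613
"""

parser = configparser.ConfigParser()
parser.read_string(GITMODULES)
section = parser['submodule "3rdparty/triton"']
print(section["path"])    # 3rdparty/triton
print(section["branch"])  # fused_moe_triton_0613
```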
Debug prints added to `throughput_test_once` (the added lines are +237 to +238, per the commit "Add debug prints for prompt and generated output in throughput test"):

```diff
@@ -234,6 +234,8 @@ def throughput_test_once(
     st = time.perf_counter()
     gen_out = backend.generate(prompt=prompt, sampling_params=sampling_params)
+    print(f"prompt: {prompt}")
+    print(f"gen_out: {gen_out}")
     latency = time.perf_counter() - st

     if profile:
```
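The hunk above follows the standard wall-clock latency pattern: take `time.perf_counter()` before and after the generate call and diff the two readings. A minimal self-contained sketch, with a stand-in for `backend.generate` (which is not reproduced here):

```python
import time

def fake_generate(prompt):
    # Stand-in for backend.generate(); the real call runs model inference.
    return prompt.upper()

st = time.perf_counter()
gen_out = fake_generate("hello gpt-oss")
latency = time.perf_counter() - st

print(f"gen_out: {gen_out}")
print(f"latency: {latency:.6f}s")
```

`perf_counter` is preferred over `time.time` for intervals because it is monotonic and high-resolution, so the measured latency cannot go negative under clock adjustments.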
python/sglang/srt/layers/attention/flashinfer_backend.py
399 changes: 335 additions & 64 deletions (large diffs are not rendered by default)
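Much of the FlashInfer backend change adds attention-sink support, which many commits above iterate on (`sink_attention_ref`, "sinks need fp32 before exp", etc.). As a rough sketch of the mechanism as described in those commit messages (an assumption, not the PR's actual kernels): each head carries a learned sink logit that joins the softmax denominator, letting a head place probability mass on "nothing".

```python
import numpy as np

def sink_softmax(scores, sink_logit):
    # scores: (num_queries, num_keys) pre-softmax attention logits.
    # Subtract the running max (including the sink) for numerical stability.
    m = np.maximum(scores.max(axis=-1, keepdims=True), sink_logit)
    exp_scores = np.exp(scores - m)
    denom = exp_scores.sum(axis=-1, keepdims=True) + np.exp(sink_logit - m)
    return exp_scores / denom

scores = np.random.default_rng(0).normal(size=(4, 6))
probs = sink_softmax(scores, sink_logit=0.5)

# Rows sum to less than 1: the residual mass is absorbed by the sink.
assert np.all(probs.sum(axis=-1) < 1.0)
# A very negative sink logit recovers the ordinary softmax.
plain = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
assert np.allclose(sink_softmax(scores, sink_logit=-1e9), plain)
```

This also illustrates why the commits care about sink dtype: the sink logit passes through an `exp`, where bfloat16 precision loss is magnified, hence the fp32/"check if sinks need fp32 before exp" changes.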
Reviewer comment on the submodule change:

The submodule for Triton points to a personal fork (dongfengy/triton) and a specific branch (fused_moe_triton_0613). This introduces a dependency on a personal repository, which can be a maintenance and security risk. It is highly recommended to use an official repository or a fork under the project's organization to ensure stability and long-term maintenance. If this is a temporary measure for development, it should be replaced before merging into the main branch.