Remove `mha` param from Wave decode attention kernel by paulzzy · Pull Request #28 · harsh-nod/sglang

paulzzy · 2025-07-08T17:52:05Z

Motivation

See iree-org/iree-turbine#1039

Modifications

Update Wave decode attention kernel integration to align with iree-org/iree-turbine#1039

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

Depends on iree-org/iree-turbine#1039 Signed-off-by: Paul Zhang <paul.zhang@amd.com>

Signed-off-by: Stanley Winata <stanley.winata@amd.com> [Wave] Add wave extend attention kernel Signed-off-by: Harsh Menon <harsh@nod-labs.com> [Wave] Adding logit_cap and layer scaling to API Also add support for the wave backend to the model runner. And use Triton decode kernels for now. [Wave] Run chunked prefill for perf comparison on Wave test Need to rename the non chunked/regular prefill version because otherwise rpd will treat it as the same kernel Signed-off-by: Stanley Winata <stanley.winata@amd.com> [Wave] Cache the function that loads the wave kernel Also maintain a global kernel hash to avoid recomputing the hash on every call. [Wave] Don't specify block size and enable buffer ops [Wave] Enable wave runtime and update scheduling API [Wave] Update API to use wave_compile & WaveCompileOptions [Wave] Update wave backend and extend attention to latest [Wave] Add speculative decode kernel Signed-off-by: nithinsubbiah <nithinsubbiah@gmail.com> cache kernels using lru_cache Update WaveBackend to use Wave Decode (#6) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Revert "Update WaveBackend to use Wave Decode (#6)" (#7) This reverts commit eac4599. Wave Backend decode (#8) * align shapes Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> * fix Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> --------- Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Wave backend fixes (#10) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> More fixes to Wave decode (#12) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> is_causal Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Enable the grok in3 model (#14) Set unique cache dir for each worker (#16) update kernel (#18) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> updated spec decode test as per wave Signed-off-by: xintin <gaurav.verma@amd.com> fix extend (#23) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Refactor paged decode intermediate arrays shapes (#24) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> remove dyn symbols (#26) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> cleanup shapes (#27) Some fields were removed from `paged_decode_attention_shape`. Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Remove `mha` param from Wave decode attention kernel (#28) Depends on iree-org/iree-turbine#1039 Signed-off-by: Paul Zhang <paul.zhang@amd.com> nfc: fix problems reported by linting update references from iree.turbine to wave_lang

Signed-off-by: Stanley Winata <stanley.winata@amd.com> [Wave] Add wave extend attention kernel Signed-off-by: Harsh Menon <harsh@nod-labs.com> [Wave] Adding logit_cap and layer scaling to API Also add support for the wave backend to the model runner. And use Triton decode kernels for now. [Wave] Run chunked prefill for perf comparison on Wave test Need to rename the non chunked/regular prefill version because otherwise rpd will treat it as the same kernel Signed-off-by: Stanley Winata <stanley.winata@amd.com> [Wave] Cache the function that loads the wave kernel Also maintain a global kernel hash to avoid recomputing the hash on every call. [Wave] Don't specify block size and enable buffer ops [Wave] Enable wave runtime and update scheduling API [Wave] Update API to use wave_compile & WaveCompileOptions [Wave] Update wave backend and extend attention to latest [Wave] Add speculative decode kernel Signed-off-by: nithinsubbiah <nithinsubbiah@gmail.com> cache kernels using lru_cache Update WaveBackend to use Wave Decode (harsh-nod#6) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Revert "Update WaveBackend to use Wave Decode (harsh-nod#6)" (harsh-nod#7) This reverts commit eac4599. Wave Backend decode (harsh-nod#8) * align shapes Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> * fix Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> --------- Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Wave backend fixes (harsh-nod#10) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> More fixes to Wave decode (harsh-nod#12) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> is_causal Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Enable the grok in3 model (harsh-nod#14) Set unique cache dir for each worker (harsh-nod#16) update kernel (harsh-nod#18) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> updated spec decode test as per wave Signed-off-by: xintin <gaurav.verma@amd.com> fix extend (harsh-nod#23) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Refactor paged decode intermediate arrays shapes (harsh-nod#24) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> remove dyn symbols (harsh-nod#26) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> cleanup shapes (harsh-nod#27) Some fields were removed from `paged_decode_attention_shape`. Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Remove `mha` param from Wave decode attention kernel (harsh-nod#28) Depends on iree-org/iree-turbine#1039 Signed-off-by: Paul Zhang <paul.zhang@amd.com> nfc: fix problems reported by linting update references from iree.turbine to wave_lang

Signed-off-by: Stanley Winata <stanley.winata@amd.com> [Wave] Add wave extend attention kernel Signed-off-by: Harsh Menon <harsh@nod-labs.com> [Wave] Adding logit_cap and layer scaling to API Also add support for the wave backend to the model runner. And use Triton decode kernels for now. [Wave] Run chunked prefill for perf comparison on Wave test Need to rename the non chunked/regular prefill version because otherwise rpd will treat it as the same kernel Signed-off-by: Stanley Winata <stanley.winata@amd.com> [Wave] Cache the function that loads the wave kernel Also maintain a global kernel hash to avoid recomputing the hash on every call. [Wave] Don't specify block size and enable buffer ops [Wave] Enable wave runtime and update scheduling API [Wave] Update API to use wave_compile & WaveCompileOptions [Wave] Update wave backend and extend attention to latest [Wave] Add speculative decode kernel Signed-off-by: nithinsubbiah <nithinsubbiah@gmail.com> cache kernels using lru_cache Update WaveBackend to use Wave Decode (#6) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Revert "Update WaveBackend to use Wave Decode (#6)" (#7) This reverts commit eac4599. Wave Backend decode (#8) * align shapes Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> * fix Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> --------- Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Wave backend fixes (#10) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> More fixes to Wave decode (#12) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> is_causal Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Enable the grok in3 model (#14) Set unique cache dir for each worker (#16) update kernel (#18) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> updated spec decode test as per wave Signed-off-by: xintin <gaurav.verma@amd.com> fix extend (#23) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Refactor paged decode intermediate arrays shapes (#24) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> remove dyn symbols (#26) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> cleanup shapes (#27) Some fields were removed from `paged_decode_attention_shape`. Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Remove `mha` param from Wave decode attention kernel (#28) Depends on iree-org/iree-turbine#1039 Signed-off-by: Paul Zhang <paul.zhang@amd.com> nfc: fix problems reported by linting update references from iree.turbine to wave_lang

Signed-off-by: Stanley Winata <stanley.winata@amd.com> [Wave] Add wave extend attention kernel Signed-off-by: Harsh Menon <harsh@nod-labs.com> [Wave] Adding logit_cap and layer scaling to API Also add support for the wave backend to the model runner. And use Triton decode kernels for now. [Wave] Run chunked prefill for perf comparison on Wave test Need to rename the non chunked/regular prefill version because otherwise rpd will treat it as the same kernel Signed-off-by: Stanley Winata <stanley.winata@amd.com> [Wave] Cache the function that loads the wave kernel Also maintain a global kernel hash to avoid recomputing the hash on every call. [Wave] Don't specify block size and enable buffer ops [Wave] Enable wave runtime and update scheduling API [Wave] Update API to use wave_compile & WaveCompileOptions [Wave] Update wave backend and extend attention to latest [Wave] Add speculative decode kernel Signed-off-by: nithinsubbiah <nithinsubbiah@gmail.com> cache kernels using lru_cache Update WaveBackend to use Wave Decode (harsh-nod#6) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Revert "Update WaveBackend to use Wave Decode (harsh-nod#6)" (harsh-nod#7) This reverts commit eac4599. Wave Backend decode (harsh-nod#8) * align shapes Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> * fix Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> --------- Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Wave backend fixes (harsh-nod#10) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> More fixes to Wave decode (harsh-nod#12) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> is_causal Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Enable the grok in3 model (harsh-nod#14) Set unique cache dir for each worker (harsh-nod#16) update kernel (harsh-nod#18) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> updated spec decode test as per wave Signed-off-by: xintin <gaurav.verma@amd.com> fix extend (harsh-nod#23) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Refactor paged decode intermediate arrays shapes (harsh-nod#24) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> remove dyn symbols (harsh-nod#26) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> cleanup shapes (harsh-nod#27) Some fields were removed from `paged_decode_attention_shape`. Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Remove `mha` param from Wave decode attention kernel (harsh-nod#28) Depends on iree-org/iree-turbine#1039 Signed-off-by: Paul Zhang <paul.zhang@amd.com> nfc: fix problems reported by linting update references from iree.turbine to wave_lang

Signed-off-by: Stanley Winata <stanley.winata@amd.com> [Wave] Add wave extend attention kernel Signed-off-by: Harsh Menon <harsh@nod-labs.com> [Wave] Adding logit_cap and layer scaling to API Also add support for the wave backend to the model runner. And use Triton decode kernels for now. [Wave] Run chunked prefill for perf comparison on Wave test Need to rename the non chunked/regular prefill version because otherwise rpd will treat it as the same kernel Signed-off-by: Stanley Winata <stanley.winata@amd.com> [Wave] Cache the function that loads the wave kernel Also maintain a global kernel hash to avoid recomputing the hash on every call. [Wave] Don't specify block size and enable buffer ops [Wave] Enable wave runtime and update scheduling API [Wave] Update API to use wave_compile & WaveCompileOptions [Wave] Update wave backend and extend attention to latest [Wave] Add speculative decode kernel Signed-off-by: nithinsubbiah <nithinsubbiah@gmail.com> cache kernels using lru_cache Update WaveBackend to use Wave Decode (#6) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Revert "Update WaveBackend to use Wave Decode (#6)" (#7) This reverts commit eac4599. Wave Backend decode (#8) * align shapes Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> * fix Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> --------- Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Wave backend fixes (#10) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> More fixes to Wave decode (#12) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> is_causal Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Enable the grok in3 model (#14) Set unique cache dir for each worker (#16) update kernel (#18) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> updated spec decode test as per wave Signed-off-by: xintin <gaurav.verma@amd.com> fix extend (#23) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Refactor paged decode intermediate arrays shapes (#24) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> remove dyn symbols (#26) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> cleanup shapes (#27) Some fields were removed from `paged_decode_attention_shape`. Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Remove `mha` param from Wave decode attention kernel (#28) Depends on iree-org/iree-turbine#1039 Signed-off-by: Paul Zhang <paul.zhang@amd.com> nfc: fix problems reported by linting update references from iree.turbine to wave_lang

Signed-off-by: Stanley Winata <stanley.winata@amd.com> [Wave] Add wave extend attention kernel Signed-off-by: Harsh Menon <harsh@nod-labs.com> [Wave] Adding logit_cap and layer scaling to API Also add support for the wave backend to the model runner. And use Triton decode kernels for now. [Wave] Run chunked prefill for perf comparison on Wave test Need to rename the non chunked/regular prefill version because otherwise rpd will treat it as the same kernel Signed-off-by: Stanley Winata <stanley.winata@amd.com> [Wave] Cache the function that loads the wave kernel Also maintain a global kernel hash to avoid recomputing the hash on every call. [Wave] Don't specify block size and enable buffer ops [Wave] Enable wave runtime and update scheduling API [Wave] Update API to use wave_compile & WaveCompileOptions [Wave] Update wave backend and extend attention to latest [Wave] Add speculative decode kernel Signed-off-by: nithinsubbiah <nithinsubbiah@gmail.com> cache kernels using lru_cache Update WaveBackend to use Wave Decode (harsh-nod#6) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Revert "Update WaveBackend to use Wave Decode (harsh-nod#6)" (harsh-nod#7) This reverts commit eac4599. Wave Backend decode (harsh-nod#8) * align shapes Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> * fix Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> --------- Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Wave backend fixes (harsh-nod#10) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> More fixes to Wave decode (harsh-nod#12) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> is_causal Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Enable the grok in3 model (harsh-nod#14) Set unique cache dir for each worker (harsh-nod#16) update kernel (harsh-nod#18) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> updated spec decode test as per wave Signed-off-by: xintin <gaurav.verma@amd.com> fix extend (harsh-nod#23) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Refactor paged decode intermediate arrays shapes (harsh-nod#24) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> remove dyn symbols (harsh-nod#26) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> cleanup shapes (harsh-nod#27) Some fields were removed from `paged_decode_attention_shape`. Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Remove `mha` param from Wave decode attention kernel (harsh-nod#28) Depends on iree-org/iree-turbine#1039 Signed-off-by: Paul Zhang <paul.zhang@amd.com> nfc: fix problems reported by linting update references from iree.turbine to wave_lang

Signed-off-by: Stanley Winata <stanley.winata@amd.com> [Wave] Add wave extend attention kernel Signed-off-by: Harsh Menon <harsh@nod-labs.com> [Wave] Adding logit_cap and layer scaling to API Also add support for the wave backend to the model runner. And use Triton decode kernels for now. [Wave] Run chunked prefill for perf comparison on Wave test Need to rename the non chunked/regular prefill version because otherwise rpd will treat it as the same kernel Signed-off-by: Stanley Winata <stanley.winata@amd.com> [Wave] Cache the function that loads the wave kernel Also maintain a global kernel hash to avoid recomputing the hash on every call. [Wave] Don't specify block size and enable buffer ops [Wave] Enable wave runtime and update scheduling API [Wave] Update API to use wave_compile & WaveCompileOptions [Wave] Update wave backend and extend attention to latest [Wave] Add speculative decode kernel Signed-off-by: nithinsubbiah <nithinsubbiah@gmail.com> cache kernels using lru_cache Update WaveBackend to use Wave Decode (#6) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Revert "Update WaveBackend to use Wave Decode (#6)" (#7) This reverts commit eac4599. Wave Backend decode (#8) * align shapes Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> * fix Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> --------- Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Wave backend fixes (#10) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> More fixes to Wave decode (#12) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> is_causal Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Enable the grok in3 model (#14) Set unique cache dir for each worker (#16) update kernel (#18) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> updated spec decode test as per wave Signed-off-by: xintin <gaurav.verma@amd.com> fix extend (#23) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Refactor paged decode intermediate arrays shapes (#24) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> remove dyn symbols (#26) Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> cleanup shapes (#27) Some fields were removed from `paged_decode_attention_shape`. Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Remove `mha` param from Wave decode attention kernel (#28) Depends on iree-org/iree-turbine#1039 Signed-off-by: Paul Zhang <paul.zhang@amd.com> nfc: fix problems reported by linting update references from iree.turbine to wave_lang

Remove mha param from Wave decode attention kernel

94697a5

Depends on iree-org/iree-turbine#1039 Signed-off-by: Paul Zhang <paul.zhang@amd.com>

paulzzy force-pushed the push-ywzyounlnpmm branch from e21219f to 94697a5 Compare July 8, 2025 17:56

Hardcode84 merged commit 1bb7af3 into harsh-nod:main Jul 8, 2025

xintin pushed a commit that referenced this pull request Jul 14, 2025

Remove mha param from Wave decode attention kernel (#28)

552ab7e

Depends on iree-org/iree-turbine#1039 Signed-off-by: Paul Zhang <paul.zhang@amd.com>

willghatch pushed a commit that referenced this pull request Jul 28, 2025

Remove mha param from Wave decode attention kernel (#28)

2b03e29

Depends on iree-org/iree-turbine#1039 Signed-off-by: Paul Zhang <paul.zhang@amd.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove `mha` param from Wave decode attention kernel#28

Remove `mha` param from Wave decode attention kernel#28
Hardcode84 merged 1 commit into
harsh-nod:mainfrom
paulzzy:push-ywzyounlnpmm

paulzzy commented Jul 8, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

paulzzy commented Jul 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Modifications

Checklist

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

paulzzy commented Jul 8, 2025 •

edited

Loading