[Wave] Remove mha param from paged decode attention #1039

Merged
Hardcode84 merged 4 commits into iree-org:main from paulzzy:push-pxzqqzypmwop on Jul 8, 2025
paulzzy (Contributor) commented Jul 8, 2025

The `mha` flag can be derived from `shape.num_query_heads == shape.num_kv_heads`, so the user does not need to specify it.
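
A minimal sketch of that derivation, for illustration only. `DecodeShape` and `is_mha` are hypothetical names, not the actual Wave types or API; only the two field names come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeShape:
    """Hypothetical stand-in for the paged decode attention shape."""

    num_query_heads: int
    num_kv_heads: int


def is_mha(shape: DecodeShape) -> bool:
    # Multi-head attention: every query head has its own KV head.
    # Fewer KV heads than query heads means the heads are grouped (GQA/MQA).
    return shape.num_query_heads == shape.num_kv_heads


print(is_mha(DecodeShape(num_query_heads=16, num_kv_heads=16)))  # True
print(is_mha(DecodeShape(num_query_heads=16, num_kv_heads=4)))   # False
```

Because the flag is a pure function of the shape, keeping it as a separate argument only adds a way for the two to disagree.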

Signed-off-by: Paul Zhang <paul.zhang@amd.com>
@paulzzy paulzzy force-pushed the push-pxzqqzypmwop branch from aa757c1 to f19c2d0 Compare July 8, 2025 17:55
paulzzy added a commit to paulzzy/sglang that referenced this pull request Jul 8, 2025
Depends on iree-org/iree-turbine#1039

Signed-off-by: Paul Zhang <paul.zhang@amd.com>
Hardcode84 (Contributor) commented

You need to update the test as well

@paulzzy paulzzy requested a review from Hardcode84 July 8, 2025 18:01
Signed-off-by: Paul Zhang <paul.zhang@amd.com>
paulzzy (Contributor, Author) commented Jul 8, 2025

Looks like all tests pass except a flaky one.

Details
 =========================== short test summary info ============================
FAILED tests/kernel/wave/attention/vanilla_attention_test.py::testAttentionBSHD[mfma_variant0-no_dyn-SchedulingType.NONE-8x128x128x64x256] - AssertionError: Tensor-likes are not close!

Mismatched elements: 64986 / 131072 (49.6%)
Greatest absolute difference: 0.5736904740333557 at index (0, 2, 2, 46) (up to 0.001 allowed)
Greatest relative difference: 1.0 at index (0, 0, 2, 0) (up to 0.001 allowed)
= 1 failed, 656 passed, 980 skipped, 9 xfailed, 42 warnings in 314.50s (0:05:14) =
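
For readers unfamiliar with this failure message, here is a rough sketch of the closeness criterion it reports. It assumes `atol = rtol = 1e-3`, inferred from the "up to 0.001 allowed" lines in the log, and uses the elementwise rule documented for `torch.testing.assert_close`. The helper names are hypothetical and plain Python floats stand in for tensors.

```python
def is_close(actual, expected, atol=1e-3, rtol=1e-3):
    # Elementwise rule: |actual - expected| <= atol + rtol * |expected|
    return abs(actual - expected) <= atol + rtol * abs(expected)


def mismatch_ratio(actual, expected, atol=1e-3, rtol=1e-3):
    # Fraction of element pairs that violate the tolerance.
    pairs = list(zip(actual, expected))
    bad = sum(not is_close(a, e, atol, rtol) for a, e in pairs)
    return bad / len(pairs)


# One element within tolerance, one far outside it.
print(mismatch_ratio([1.0005, 0.0], [1.0, 0.5]))
```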

@Hardcode84 Hardcode84 merged commit 8757da2 into iree-org:main Jul 8, 2025
11 of 12 checks passed
Hardcode84 pushed a commit to harsh-nod/sglang that referenced this pull request Jul 8, 2025
Depends on iree-org/iree-turbine#1039

Signed-off-by: Paul Zhang <paul.zhang@amd.com>
@paulzzy paulzzy deleted the push-pxzqqzypmwop branch July 8, 2025 18:39
xintin pushed a commit to harsh-nod/sglang that referenced this pull request Jul 14, 2025
Depends on iree-org/iree-turbine#1039

Signed-off-by: Paul Zhang <paul.zhang@amd.com>
willghatch pushed a commit to harsh-nod/sglang that referenced this pull request Jul 28, 2025
Depends on iree-org/iree-turbine#1039

Signed-off-by: Paul Zhang <paul.zhang@amd.com>
xintin pushed a commit to harsh-nod/sglang that referenced this pull request Aug 15, 2025
Signed-off-by: Stanley Winata <stanley.winata@amd.com>

[Wave] Add wave extend attention kernel

Signed-off-by: Harsh Menon <harsh@nod-labs.com>

[Wave] Adding logit_cap and layer scaling to API

Also add support for the wave backend to the model
runner. And use Triton decode kernels for now.

[Wave] Run chunked prefill for perf comparison on Wave test

Need to rename the non chunked/regular prefill version because otherwise
rpd will treat it as the same kernel

Signed-off-by: Stanley Winata <stanley.winata@amd.com>

[Wave] Cache the function that loads the wave kernel

Also maintain a global kernel hash to avoid
recomputing the hash on every call.
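
A rough sketch of the caching idea in this commit message, not the actual sglang or Wave code. `kernel_hash`, `load_kernel`, and `_KERNEL_HASHES` are hypothetical names, and the loader body is a placeholder for the real load or compile step.

```python
import hashlib
from functools import lru_cache

# Hypothetical module-level store: each kernel's hash is computed once.
_KERNEL_HASHES = {}


def kernel_hash(name, source):
    # Hash only the first time a kernel name is seen; later calls reuse it.
    # Assumption: the source for a given name does not change within a process.
    if name not in _KERNEL_HASHES:
        _KERNEL_HASHES[name] = hashlib.sha256(source.encode()).hexdigest()
    return _KERNEL_HASHES[name]


@lru_cache(maxsize=None)
def load_kernel(cache_key):
    # Placeholder for an expensive load; the body runs once per distinct key.
    print(f"loading kernel {cache_key[:8]}")
    return lambda x: x


digest = kernel_hash("decode", "kernel source text")
first = load_kernel(digest)
second = load_kernel(digest)  # served from the cache, no second load
```

Keying the cached loader on the precomputed hash keeps the per-call cost to a dictionary lookup.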

[Wave] Don't specify block size and enable buffer ops

[Wave] Enable wave runtime and update scheduling API

[Wave] Update API to use wave_compile & WaveCompileOptions

[Wave] Update wave backend and extend attention to latest

[Wave] Add speculative decode kernel

Signed-off-by: nithinsubbiah <nithinsubbiah@gmail.com>

cache kernels using lru_cache

Update WaveBackend to use Wave Decode  (#6)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Revert "Update WaveBackend to use Wave Decode  (#6)" (#7)

This reverts commit eac4599.

Wave Backend decode (#8)

* align shapes

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

* fix

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

---------

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Wave backend fixes (#10)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

More fixes to Wave decode (#12)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

is_causal

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Enable the grok in3 model (#14)

Set unique cache dir for each worker (#16)

update kernel (#18)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

updated spec decode test as per wave

Signed-off-by: xintin <gaurav.verma@amd.com>

fix extend (#23)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Refactor paged decode intermediate arrays shapes (#24)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

remove dyn symbols (#26)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

cleanup shapes (#27)

Some fields were removed from `paged_decode_attention_shape`.

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Remove `mha` param from Wave decode attention kernel (#28)

Depends on iree-org/iree-turbine#1039

Signed-off-by: Paul Zhang <paul.zhang@amd.com>

nfc: fix problems reported by linting

update references from iree.turbine to wave_lang
Hardcode84 pushed a commit to Hardcode84/sglang that referenced this pull request Aug 27, 2025
raikonenfnu added a commit to harsh-nod/sglang that referenced this pull request Sep 8, 2025
willghatch pushed a commit to harsh-nod/sglang that referenced this pull request Oct 17, 2025
xintin pushed a commit to harsh-nod/sglang that referenced this pull request Nov 3, 2025
Hardcode84 pushed a commit to Hardcode84/sglang that referenced this pull request Nov 10, 2025
raikonenfnu added a commit to harsh-nod/sglang that referenced this pull request Nov 17, 2025
panditsa pushed a commit to harsh-nod/sglang that referenced this pull request Nov 20, 2025
willghatch pushed a commit to harsh-nod/sglang that referenced this pull request Dec 15, 2025
willghatch pushed a commit to harsh-nod/sglang that referenced this pull request Dec 19, 2025
xintin pushed a commit to harsh-nod/sglang that referenced this pull request Jan 5, 2026
Hardcode84 pushed a commit to Hardcode84/sglang that referenced this pull request Jan 12, 2026

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Remove `mha` param from Wave decode attention kernel (harsh-nod#28)

Depends on iree-org/iree-turbine#1039

Signed-off-by: Paul Zhang <paul.zhang@amd.com>

nfc: fix problems reported by linting

update references from iree.turbine to wave_lang
Hardcode84 pushed a commit to Hardcode84/sglang that referenced this pull request Jan 12, 2026
Hardcode84 pushed a commit to harsh-nod/sglang that referenced this pull request Jan 12, 2026
Hardcode84 pushed a commit to harsh-nod/sglang that referenced this pull request Jan 12, 2026
panditsa pushed a commit to harsh-nod/sglang that referenced this pull request Jan 16, 2026
panditsa pushed a commit to harsh-nod/sglang that referenced this pull request Jan 16, 2026
raikonenfnu added a commit to harsh-nod/sglang that referenced this pull request Jan 26, 2026
raikonenfnu added a commit to harsh-nod/sglang that referenced this pull request Jan 26, 2026
willghatch pushed a commit to harsh-nod/sglang that referenced this pull request Feb 12, 2026
willghatch pushed a commit to harsh-nod/sglang that referenced this pull request Feb 12, 2026