
Remove `mha` param from Wave decode attention kernel #28

Merged
Hardcode84 merged 1 commit into harsh-nod:main from paulzzy:push-ywzyounlnpmm
Jul 8, 2025


@paulzzy commented Jul 8, 2025

Motivation

See iree-org/iree-turbine#1039

Modifications

Update the Wave decode attention kernel integration to remove the `mha` parameter, aligning it with iree-org/iree-turbine#1039.

Checklist

Depends on iree-org/iree-turbine#1039

Signed-off-by: Paul Zhang <paul.zhang@amd.com>
@paulzzy force-pushed the push-ywzyounlnpmm branch from e21219f to 94697a5 on July 8, 2025 17:56
@Hardcode84 merged commit 1bb7af3 into harsh-nod:main on Jul 8, 2025
xintin pushed a commit that referenced this pull request Jul 14, 2025
willghatch pushed a commit that referenced this pull request Jul 28, 2025
xintin pushed a commit that referenced this pull request Aug 15, 2025
Signed-off-by: Stanley Winata <stanley.winata@amd.com>

[Wave] Add wave extend attention kernel

Signed-off-by: Harsh Menon <harsh@nod-labs.com>

[Wave] Adding logit_cap and layer scaling to API

Also add support for the wave backend to the model
runner. And use Triton decode kernels for now.

[Wave] Run chunked prefill for perf comparison on Wave test

Need to rename the non chunked/regular prefill version because otherwise
rpd will treat it as the same kernel

Signed-off-by: Stanley Winata <stanley.winata@amd.com>

[Wave] Cache the function that loads the wave kernel

Also maintain a global kernel hash to avoid
recomputing the hash on every call.

[Wave] Don't specify block size and enable buffer ops

[Wave] Enable wave runtime and update scheduling API

[Wave] Update API to use wave_compile & WaveCompileOptions

[Wave] Update wave backend and extend attention to latest

[Wave] Add speculative decode kernel

Signed-off-by: nithinsubbiah <nithinsubbiah@gmail.com>

cache kernels using lru_cache

Update WaveBackend to use Wave Decode  (#6)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Revert "Update WaveBackend to use Wave Decode  (#6)" (#7)

This reverts commit eac4599.

Wave Backend decode (#8)

* align shapes

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

* fix

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

---------

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Wave backend fixes (#10)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

More fixes to Wave decode (#12)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

is_causal

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Enable the grok in3 model (#14)

Set unique cache dir for each worker (#16)

update kernel (#18)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

updated spec decode test as per wave

Signed-off-by: xintin <gaurav.verma@amd.com>

fix extend (#23)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Refactor paged decode intermediate arrays shapes (#24)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

remove dyn symbols (#26)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

cleanup shapes (#27)

Some fields were removed from `paged_decode_attention_shape`.

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Remove `mha` param from Wave decode attention kernel (#28)

Depends on iree-org/iree-turbine#1039

Signed-off-by: Paul Zhang <paul.zhang@amd.com>

nfc: fix problems reported by linting

update references from iree.turbine to wave_lang
Hardcode84 pushed a commit to Hardcode84/sglang that referenced this pull request Aug 27, 2025
raikonenfnu added a commit that referenced this pull request Sep 8, 2025
willghatch pushed a commit that referenced this pull request Oct 17, 2025
xintin pushed a commit that referenced this pull request Nov 3, 2025
Hardcode84 pushed a commit to Hardcode84/sglang that referenced this pull request Nov 10, 2025
raikonenfnu added a commit that referenced this pull request Nov 17, 2025
panditsa pushed a commit that referenced this pull request Nov 20, 2025
willghatch pushed a commit that referenced this pull request Dec 15, 2025
willghatch pushed a commit that referenced this pull request Dec 19, 2025
xintin pushed a commit that referenced this pull request Jan 5, 2026
Hardcode84 pushed a commit to Hardcode84/sglang that referenced this pull request Jan 12, 2026
Hardcode84 pushed a commit that referenced this pull request Jan 12, 2026
Signed-off-by: Stanley Winata <stanley.winata@amd.com>

[Wave] Add wave extend attention kernel

Signed-off-by: Harsh Menon <harsh@nod-labs.com>

[Wave] Adding logit_cap and layer scaling to API

Also add support for the wave backend to the model
runner. And use Triton decode kernels for now.

[Wave] Run chunked prefill for perf comparison on Wave test

Need to rename the non chunked/regular prefill version because otherwise
rpd will treat it as the same kernel

Signed-off-by: Stanley Winata <stanley.winata@amd.com>

[Wave] Cache the function that loads the wave kernel

Also maintain a global kernel hash to avoid
recomputing the hash on every call.

[Wave] Don't specify block size and enable buffer ops

[Wave] Enable wave runtime and update scheduling API

[Wave] Update API to use wave_compile & WaveCompileOptions

[Wave] Update wave backend and extend attention to latest

[Wave] Add speculative decode kernel

Signed-off-by: nithinsubbiah <nithinsubbiah@gmail.com>

cache kernels using lru_cache

Update WaveBackend to use Wave Decode  (#6)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Revert "Update WaveBackend to use Wave Decode  (#6)" (#7)

This reverts commit eac4599.

Wave Backend decode (#8)

* align shapes

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

* fix

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

---------

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Wave backend fixes (#10)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

More fixes to Wave decode (#12)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

is_causal

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Enable the grok in3 model (#14)

Set unique cache dir for each worker (#16)

update kernel (#18)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

updated spec decode test as per wave

Signed-off-by: xintin <gaurav.verma@amd.com>

fix extend (#23)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Refactor paged decode intermediate arrays shapes (#24)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

remove dyn symbols (#26)

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

cleanup shapes (#27)

Some fields were removed from `paged_decode_attention_shape`.

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>

Remove `mha` param from Wave decode attention kernel (#28)

Depends on iree-org/iree-turbine#1039

Signed-off-by: Paul Zhang <paul.zhang@amd.com>

nfc: fix problems reported by linting

update references from iree.turbine to wave_lang
panditsa pushed a commit that referenced this pull request Jan 16, 2026
raikonenfnu added a commit that referenced this pull request Jan 26, 2026
willghatch pushed a commit that referenced this pull request Feb 12, 2026