Ameyn/wide vec t1 #3147
Merged
Commits
31 commits
8a6e981 perf(gdn): fix bf16_state T=1 per-call overhead and add pool+padding … (ameynaik-hub)
1ad4dbe perf(gdn): add wide_vec BF16 MTP kernel, auto-dispatch, and T=2 heuri… (ameynaik-hub)
bb558d5 perf(gdn): restore ILP=4 MTP launcher for small-batch BF16 MTP decode (ameynaik-hub)
316c36b perf(gdn): extend wide_vec fast path to small batches via tile_v (ameynaik-hub)
241bdd2 perf(gdn): route T=1 decode through wide_vec at large work_units (poo… (ameynaik-hub)
6b7a731 chore(gdn): remove dead cooprow BF16 decode kernel and launcher (ameynaik-hub)
ee54669 chore(gdn): drop unused cpasync import after cooprow removal (ameynaik-hub)
898171f fix(gdn): drop dead import of _reference_gdn_mtp from wide_vec module (ameynaik-hub)
7aa0141 chore(gdn): drop cooprow-era module constants and stale module docstring (ameynaik-hub)
936231f chore(gdn): drop redundant V!=128 gate in _select_wide_vec_tile_v (ameynaik-hub)
088d879 chore(gdn): make BF16 GDN kernels pool-only; auto-promote at wrapper (ameynaik-hub)
0a92290 chore(gdn): remove dead T=1 ILP=8 kernel; route T=1 fallback through MTP (ameynaik-hub)
97c71f3 test(gdn): pass initial_state_indices=arange(B) for pool-only BF16 ke… (ameynaik-hub)
4da2b2a bench(gdn): add --pool-mode {single,split} for BF16 state benchmark (ameynaik-hub)
e108118 feat(gdn): add split-pool support to gdn_wide_vec_kernel (ameynaik-hub)
0b7b66f feat(gdn): add split-pool support to gdn_decode_bf16state_mtp_ilp4_ke… (ameynaik-hub)
3260f43 chore(gdn): remove dead gdn_decode_bf16state_mtp_kernel and ILP=8 path (ameynaik-hub)
9e475a5 test(gdn): add split-pool MTP coverage for wide_vec and mtp_ilp4 (ameynaik-hub)
49d9d2a refactor(gdn): merge gdn_decode_bf16_state_wide_vec.py into the main … (ameynaik-hub)
a234644 fix(gdn): use batch-scoped i_n for intermediate_states indexing (OOB;… (ameynaik-hub)
aa63cba chore(gdn): refresh stale comments / docstrings post-cleanup (ameynaik-hub)
e1f6c53 perf(gdn): elide write-side address arithmetic when reads/writes alia… (ameynaik-hub)
7266b5e Merge branch 'main' into ameyn/wide_vec_t1 (ameynaik-hub)
4c7c6c9 Merge branch 'main' into ameyn/wide_vec_t1 (ameynaik-hub)
f8c73dc Merge branch 'main' into ameyn/wide_vec_t1 (ameynaik-hub)
bab0309 Merge branch 'main' into ameyn/wide_vec_t1 (ameynaik-hub)
bc6c269 Merge branch 'main' into ameyn/wide_vec_t1 (ameynaik-hub)
51ccd37 test(gdn): chunk wide_vec MTP intermediate-state assert to avoid OOM (ameynaik-hub)
e61db27 Merge branch 'main' into ameyn/wide_vec_t1 (ameynaik-hub)
1133601 Merge branch 'main' into ameyn/wide_vec_t1 (ameynaik-hub)
89c2ebe Merge branch 'main' into ameyn/wide_vec_t1 (ameynaik-hub)
Reapply the K-contiguous guard on the synthetic-pool BF16 path.

The real pool path already rejects tensors with `stride(-1) != 1`, but this new `use_pool=False` branch skips that check and passes `state` straight into the BF16 kernel as `initial_state_source`. A non-K-contiguous `[B, HV, V, K]` view can still satisfy the shape check and then produce wrong reads/writes in the fast path.

Suggested fix:

     if use_pool:
         bf16_pool = initial_state
         bf16_indices = initial_state_indices
     else:
    +    assert state.stride(-1) == 1, (
    +        "state must be K-contiguous (stride[-1] == 1) for bf16 pretranspose decode, "
    +        f"got stride={state.stride()}"
    +    )
         bf16_pool = state
         bf16_indices = torch.arange(B, dtype=torch.int32, device=q.device)
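To illustrate why the shape check alone is not enough, here is a minimal sketch in plain Python, with tuples standing in for tensor metadata so that it runs without PyTorch. The helper names (`contiguous_strides`, `transpose_last_two`, `check_k_contiguous`) and the sizes are hypothetical and are not part of this PR. In PyTorch the same situation would arise from something like a `transpose(-1, -2)` view of a `[B, HV, K, V]` allocation when `V == K`.

```python
from typing import Tuple

Dims = Tuple[int, ...]


def contiguous_strides(shape: Dims) -> Dims:
    """Row-major element strides for a densely packed tensor of `shape`."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def transpose_last_two(shape: Dims, strides: Dims) -> Tuple[Dims, Dims]:
    """Shape and strides of a view that swaps the last two dims (no copy)."""
    new_shape = shape[:-2] + (shape[-1], shape[-2])
    new_strides = strides[:-2] + (strides[-1], strides[-2])
    return new_shape, new_strides


def check_k_contiguous(strides: Dims) -> None:
    """Hypothetical guard mirroring the assert in the suggested fix."""
    if strides[-1] != 1:
        raise ValueError(
            f"state must be K-contiguous (stride[-1] == 1), got stride={strides}"
        )


# Illustrative sizes only; V == K makes the transposed view shape-compatible.
B, HV, V, K = 2, 4, 128, 128

# A genuinely K-contiguous [B, HV, V, K] state.
good_shape = (B, HV, V, K)
good_strides = contiguous_strides(good_shape)

# A [B, HV, K, V] allocation viewed as [B, HV, V, K] by swapping the last dims.
base_shape = (B, HV, K, V)
bad_shape, bad_strides = transpose_last_two(base_shape, contiguous_strides(base_shape))

print(good_shape == bad_shape)    # both layouts pass a shape-only check
print(good_strides, bad_strides)  # only the last stride tells them apart

check_k_contiguous(good_strides)  # passes silently
try:
    check_k_contiguous(bad_strides)
except ValueError as err:
    print("rejected:", err)
```

Both layouts report the same shape, so only the stride check separates them. A kernel that assumes unit stride along K would read the transposed view with the wrong element spacing.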