
@createthis (Owner)

No description provided.

@createthis self-assigned this on Oct 27, 2025

- Remove the forced CPU backend assignment of kvaware_indices:
  - src/llama-sparse-topk.cpp: deleted the block that moved the result to backend_cpu; the indices now stay on the backend that produced them.
  - src/llama-model.cpp: removed both instances of ggml_backend_sched_set_tensor_backend(sched, kvaware_indices, backend_cpu), so we no longer bounce the indices to host in the MLA and MHA sparse paths (see the first sketch after this list).
- Gate the debug-only float32 cast of the indices:
  - src/llama-sparse-topk.cpp: only cast to F32 and log the f32 indices when LLAMA_SPARSE_DEBUG is set. This cuts extra graph nodes and copies in normal runs (see the second sketch after this list).
- Increase the default Top-K token tile size:
  - src/llama-sparse-topk.cpp: raised the default TILE_T from 32 to 128, still overridable via LLAMA_SPARSE_TOPK_TILE_T (see the third sketch after this list).

Together these cover both the MLA and MHA branches, so we avoid the extra backend hop to CPU after apply_sparse_attention_kvaware.
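
A minimal sketch of the placement change, assuming ggml's scheduler API: the wrapper function `place_kvaware_indices` and its shape are hypothetical, while `ggml_backend_sched_set_tensor_backend` and the tensor/variable names come from the description above.

```cpp
#include "ggml-backend.h"

// Hypothetical helper illustrating the before/after of the change above.
static void place_kvaware_indices(
        ggml_backend_sched_t  sched,
        struct ggml_tensor  * kvaware_indices,
        ggml_backend_t        backend_cpu) {
    // Before: pin the top-k indices to the CPU backend. This forced the
    // scheduler to insert a device->host copy right after
    // apply_sparse_attention_kvaware on every graph run:
    //
    //     ggml_backend_sched_set_tensor_backend(sched, kvaware_indices, backend_cpu);
    //
    // After: no placement hint at all. The scheduler leaves the indices on
    // whichever backend produced them, so consumers on that backend read
    // them without a round trip through host memory.
    (void) sched; (void) kvaware_indices; (void) backend_cpu;
}
```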
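The debug gate can be pictured like this; a sketch only, where `maybe_debug_indices` and the graph plumbing are assumptions, while the F32 cast and the LLAMA_SPARSE_DEBUG variable are from the description above.

```cpp
#include <cstdlib>

#include "ggml.h"

// Hypothetical helper: only build the F32 debug view of the indices when
// LLAMA_SPARSE_DEBUG is set, so normal runs get no extra nodes or copies.
static void maybe_debug_indices(
        struct ggml_context * ctx,
        struct ggml_cgraph  * gf,
        struct ggml_tensor  * indices) {
    if (std::getenv("LLAMA_SPARSE_DEBUG") == nullptr) {
        return; // normal path: the integer indices go straight to the consumer
    }
    // Debug path: cast to F32 for logging and keep the node in the graph.
    struct ggml_tensor * indices_f32 = ggml_cast(ctx, indices, GGML_TYPE_F32);
    ggml_build_forward_expand(gf, indices_f32);
}
```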
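And the new tile-size default as a sketch; the helper name and parsing details are assumptions, while the 32 → 128 change and the LLAMA_SPARSE_TOPK_TILE_T override are from the description above.

```cpp
#include <cstdlib>

// Hypothetical reader for the Top-K token tile size: the default is now 128
// (previously 32), overridable via LLAMA_SPARSE_TOPK_TILE_T.
static int sparse_topk_tile_t() {
    int tile_t = 128;
    if (const char * env = std::getenv("LLAMA_SPARSE_TOPK_TILE_T")) {
        const int v = std::atoi(env);
        if (v > 0) {
            tile_t = v; // accept only positive overrides
        }
    }
    return tile_t;
}
```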
@createthis merged commit 7780061 into deepseek_v3_2_exp on Oct 28, 2025
33 of 64 checks passed