[BUG] Support 32 head size for SGLang #1048

@DavidBao03

Description

Recently, SGLang added support for BERT encoder models with a head size of 32, but only for the Torch and Triton backends. When I try to use the FlashInfer backend, I encounter:

Error: Invalid configuration : NUM_MMA_Q=2 NUM_MMA_D_QK=2 NUM_MMA_D_VO=2 NUM_MMA_KV=4 NUM_WARPS_Q=4 NUM_WARPS_KV=1
please create an issue (https://github.com/flashinfer-ai/flashinfer/issues) and report the issue to the developers.

I'm using JIT mode with CUDA 12.4, torch 2.5, flashinfer 0.2.5, and sglang 0.4.6.
