<img width="3110" height="1292" alt="Image" src="https://github.com/user-attachments/assets/1c106f0d-3613-4462-abd6-be31b81dc6b1" />

I found this kernel extremely slow. In TensorRT-LLM, it takes about 1 to 2 µs, so I think there may be a bug in the `fp4_quantize` kernel.