Fix apply_bit_mask cuda implementation by Jialin · Pull Request #394 · mlc-ai/xgrammar

Jialin · 2025-08-11T08:20:35Z

Issue

Originally, most of the apply_token_bitmask unit tests failed with a CUDA invalid memory access.

pytest tests/python/test_token_bitmask_operations.py

Fix

After some debugging, we found the root cause: logits.stride and bitmasks.stride were passed in the wrong order.
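To illustrate why swapped strides can lead to an invalid memory access, here is a minimal plain-Python model of row-strided indexing. It is not the xgrammar kernel. The helper names, the contiguous layout, and the assumption that one 32-bit bitmask word covers 32 tokens are all hypothetical, chosen only for the sketch.

```python
# Illustrative model only: plain-Python row-strided indexing, not the xgrammar CUDA kernel.
BITS_PER_WORD = 32  # assumption: one 32-bit bitmask word covers 32 tokens


def first_out_of_bounds_row(batch_size, vocab_size, logits_stride, bitmask_stride):
    """Return the first batch row whose accesses leave either buffer, or None."""
    words_per_row = (vocab_size + BITS_PER_WORD - 1) // BITS_PER_WORD
    logits_len = batch_size * vocab_size      # assumed contiguous logits buffer
    bitmask_len = batch_size * words_per_row  # assumed contiguous bitmask buffer
    for row in range(batch_size):
        logits_off = row * logits_stride
        bitmask_off = row * bitmask_stride
        if logits_off + vocab_size > logits_len:
            return row
        if bitmask_off + words_per_row > bitmask_len:
            return row
    return None


batch_size, vocab_size = 8, 128000
words_per_row = (vocab_size + BITS_PER_WORD - 1) // BITS_PER_WORD  # 4000 words per row

# Correct order: logits stride first, bitmask stride second.
correct = first_out_of_bounds_row(batch_size, vocab_size, vocab_size, words_per_row)

# Swapped order: the bitmask is stepped by the much larger logits stride.
swapped = first_out_of_bounds_row(batch_size, vocab_size, words_per_row, vocab_size)

print(correct, swapped)  # None 1
```

In this model the swapped call already leaves the bitmask buffer at row 1, because the row offset 128000 exceeds the whole 8 x 4000 word buffer. With a batch size of 1 only row 0 is touched, so the same swap stays in bounds in this sketch.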

Benchmark

With the fix, the benchmark runs now also include the CUDA implementation, which is still significantly faster than the Triton implementation at batch sizes of 64 and above.

| Batch size | Vocab size | Masked cnt | Torch Compile baseline (us) | Triton us (speedup) | CUDA us (speedup) |
|-----------:|-----------:|-----------:|----------------------------:|--------------------:|------------------:|
|       1 |  128000 |            1 |            5.85 |    5.41 (1.08x) |    5.46 (1.07x) |
|       1 |  128000 |        64000 |            5.84 |    6.01 (0.97x) |    6.24 (0.94x) |
|       1 |  128000 |       127000 |            5.84 |    6.09 (0.96x) |    5.95 (0.98x) |
|       8 |  128000 |            1 |           10.75 |    5.86 (1.83x) |    5.90 (1.82x) |
|       8 |  128000 |        64000 |           10.75 |    7.59 (1.42x) |    9.85 (1.09x) |
|       8 |  128000 |       127000 |           10.77 |    7.85 (1.37x) |    8.06 (1.34x) |
|      64 |  128000 |            1 |           48.59 |   13.10 (3.71x) |    9.68 (5.02x) |
|      64 |  128000 |        64000 |           48.59 |   45.43 (1.07x) |   38.76 (1.25x) |
|      64 |  128000 |       127000 |           48.58 |   32.84 (1.48x) |   26.29 (1.85x) |
|     512 |  128000 |            1 |          349.84 |   67.34 (5.20x) |   37.06 (9.44x) |
|     512 |  128000 |        64000 |          346.94 |  330.36 (1.05x) |  256.53 (1.35x) |
|     512 |  128000 |       127000 |          345.54 |  249.66 (1.38x) |  157.51 (2.19x) |
|    4096 |  128000 |            1 |         2895.83 |  494.47 (5.86x) | 249.96 (11.59x) |
|    4096 |  128000 |        64000 |         2863.31 | 2517.85 (1.14x) | 1993.29 (1.44x) |
|    4096 |  128000 |       127000 |         2720.67 | 1935.24 (1.41x) | 1207.38 (2.25x) |
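The speedup figures in parentheses appear to be the Torch Compile baseline time divided by the implementation's time. A small sketch that checks this reading against three rows copied from the table (the `speedup` helper is hypothetical, not part of the benchmark script):

```python
# Rows copied from the table above: (baseline_us, triton_us, cuda_us).
rows = [
    (5.85, 5.41, 5.46),         # batch 1, masked cnt 1
    (48.59, 13.10, 9.68),       # batch 64, masked cnt 1
    (2895.83, 494.47, 249.96),  # batch 4096, masked cnt 1
]


def speedup(baseline_us, candidate_us):
    """Speedup relative to the baseline, rounded to two decimals as in the table."""
    return round(baseline_us / candidate_us, 2)


speedups = [(speedup(b, t), speedup(b, c)) for b, t, c in rows]
print(speedups)  # [(1.08, 1.07), (3.71, 5.02), (5.86, 11.59)]
```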

Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>

Ubospica

LGTM. Thanks for the fix!

Jialin added 3 commits August 11, 2025 01:04

Fix bitmask cuda implementation

e42b339

Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>

Update benchmark data

3d5a71c

Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>

Revert unnecessary changes

c7787d1

Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>

Jialin mentioned this pull request Aug 11, 2025

Fix apply_bitmask logit for both CPU and triton versions when shape and stride doesn't match #390

Merged

Ubospica approved these changes Aug 11, 2025

View reviewed changes

Ubospica merged commit 2b4775c into mlc-ai:main Aug 11, 2025
38 checks passed

Jialin deleted the bitmask_cuda_fix branch August 13, 2025 04:17
