support multithread compute bitmask for spec decode #225
hudson-ai merged 8 commits into guidance-ai:main from
Conversation
Still need to look more closely at your implementation and the example/test, but what is the failure mode when draft tokens are disallowed by the mask? Are they always expected to be valid?
When constrained decoding and speculative decoding are combined as orthogonal techniques, the current sglang and vLLM bitmask calculations involve steps that are all performed inside the framework and are not parallelized with multi-threading. Taking vLLM as an example (PR: https://github.com/vllm-project/vllm/pull/14702/files), the algorithm's pseudocode can be abstracted as the following logic:
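A rough sketch of that abstracted logic, covering the five steps referenced below. The matcher API here (`ToyMatcher`, `fill_bitmask`, `is_allowed`, `accept_token`) is an illustrative stand-in, not llguidance's or vLLM's actual interface:

```python
import numpy as np

class ToyMatcher:
    """Toy stand-in for a grammar matcher: allows a fixed token set.
    Illustrative only; a real matcher tracks grammar state."""
    def __init__(self, allowed):
        self.allowed = set(allowed)
    def fill_bitmask(self, bitmask, row):
        bitmask[row] = 0
        bitmask[row, list(self.allowed)] = 1
    def is_allowed(self, tok):
        return tok in self.allowed
    def accept_token(self, tok):
        pass  # a real matcher would advance its grammar state here

def compute_spec_bitmask(matchers, draft_tokens, vocab_size):
    """Steps 1-2: fill one mask row per (request, draft position)."""
    spec_k = len(draft_tokens[0])
    # One row per request for position 0, plus one per draft token.
    bitmask = np.ones((len(matchers) * (spec_k + 1), vocab_size), dtype=np.int8)
    for i, m in enumerate(matchers):
        base = i * (spec_k + 1)
        m.fill_bitmask(bitmask, base)             # mask for the first position
        for j, tok in enumerate(draft_tokens[i]):
            if not m.is_allowed(tok):             # draft token disallowed:
                break                             # later rows keep default 1s
            m.accept_token(tok)
            m.fill_bitmask(bitmask, base + j + 1)
    return bitmask
    # 3. run the target model on all positions at once
    # 4. apply the bitmask to the target logits
    # 5. rejection-sample the longest valid prefix of draft tokens
```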
It is clear that in the second step, since each request's matcher is independent, the requests can be processed in parallel. In llguidance, without speculative decoding, the bitmask's first dimension is the batch size, not batch size * (spec_k + 1), and that calculation can already be parallelized. However, when combined with spec decode, the bitmask calculation follows the second-step approach and can also be parallelized. The parallel processing can be done in the llguidance backend rather than in the inference framework (vLLM, SGLang), which better avoids the impact of the GIL.
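To make the per-request independence concrete, here is a sketch that parallelizes step 2 across requests with a thread pool. In pure Python the GIL limits the speedup, which is exactly why this PR moves the loop into the Rust backend; the structure is the same either way. All names here (`ToyMatcher`, `fill_request`) are illustrative, not the actual API:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

class ToyMatcher:
    """Toy grammar matcher allowing a fixed token set (illustrative only)."""
    def __init__(self, allowed):
        self.allowed = set(allowed)
    def fill_bitmask(self, bitmask, row):
        bitmask[row] = 0
        bitmask[row, list(self.allowed)] = 1

def fill_request(matcher, drafts, bitmask, base):
    """Step-2 work for one request: touches only this request's mask rows,
    so it is independent of every other request."""
    matcher.fill_bitmask(bitmask, base)
    for j, tok in enumerate(drafts):
        if tok not in matcher.allowed:     # disallowed draft token:
            break                          # later rows keep default 1s
        matcher.fill_bitmask(bitmask, base + j + 1)

def compute_spec_bitmask_par(matchers, draft_tokens, vocab_size):
    spec_k = len(draft_tokens[0])
    bitmask = np.ones((len(matchers) * (spec_k + 1), vocab_size), dtype=np.int8)
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(fill_request, m, draft_tokens[i], bitmask,
                        i * (spec_k + 1))
            for i, m in enumerate(matchers)
        ]
        for f in futures:
            f.result()  # re-raise any worker exception
    return bitmask
```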
In step 4, applying the bitmask ensures that the multiple tokens generated during spec decode still satisfy the matcher's constraints. Furthermore, for tokens that were not accepted by the matcher in step 2, no operation is needed on the mask, whose default value is 1. Tokens not accepted by the matcher will be rejected during spec decode's rejection sampling (step 5), which ensures correctness.
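A minimal sketch of why the default mask value is harmless, using greedy verification (real spec decode accepts/rejects probabilistically; the function name is hypothetical): a draft token survives only if the masked target distribution gives it nonzero probability, and everything from the first rejection onward is discarded, so mask rows past that point are never consulted.

```python
def verify_drafts(draft_tokens, masked_probs):
    """Greedy-verification sketch: masked_probs[j] maps token -> probability
    after the bitmask has been applied at draft position j."""
    accepted = []
    for j, tok in enumerate(draft_tokens):
        if masked_probs[j][tok] == 0.0:   # the bitmask zeroed this token
            break                          # positions > j are discarded
        accepted.append(tok)
    return accepted
```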
The vLLM PR (vllm-project/vllm#21862) has implemented multi-threaded computation in non-spec mode. The main purpose of this PR is step two: parallelizing the bitmask computation in spec mode. Once this PR is merged, I plan to follow up accordingly. I know that combining spec decode and constrained decode can be a bit difficult to understand; if there are any questions, feel free to discuss them.
@ZonePG your explanation here is really useful, thank you!
I'm still struggling to understand this point though -- shouldn't the mask's default value not matter here, given that the previous draft token would necessarily be rejected? If I'm misunderstanding and it is indeed important that the mask take a "high" value in the following positions, please add a check to your test that ensures this behavior. Either way, adding a third matcher to your test and a sequence of draft tokens that should be rejected at some position would be useful, at minimum asserting that the draft token that should be rejected is indeed rejected.
Yes, the bitmask value for rejected positions does not matter here. The mask simply has a default bit value of 1 at initialization; since we did not perform any operations on those rows, they remain at the default value of 1.
Hi @hudson-ai, I added more detailed test cases to test_par_draft_tokens, including a third matcher that is a state machine accepting any token, and a fourth matcher that receives only partially valid draft tokens.
Looking pretty good to me (beyond some typing fixes needed in your tests, it seems), although I'd appreciate some input from @mmoskal if you get a chance soon!
mmoskal
left a comment
LGTM, though I didn't check all the address calculations exactly; I guess if they work, they work!
Thank you @ZonePG!
Support speculative decoding when we have draft tokens.
Compute the bitmask for draft tokens in parallel in the Rust constrained backend to avoid the Python GIL.