
support multithread compute bitmask for spec decode #225

Merged
hudson-ai merged 8 commits into guidance-ai:main from ZonePG:parallel_compute_bitmask_with_draft_token
Aug 12, 2025

Conversation

@ZonePG
Contributor

@ZonePG ZonePG commented Jul 31, 2025

Support speculative decoding when we have draft tokens.

Put the parallel compute bitmask with draft tokens in the Rust Constrained Backend to avoid Python GIL.

Comment thread python/llguidance/_lib.pyi
@hudson-ai
Contributor

Still need to look more closely at your implementation and the example/test, but what is the failure mode when draft tokens are disallowed by the mask? Are they always expected to be valid?

@hudson-ai hudson-ai requested a review from mmoskal August 1, 2025 20:31
@ZonePG
Contributor Author

ZonePG commented Aug 6, 2025

When constrained decoding and speculative decoding are combined as orthogonal techniques, the bitmask computations in current sglang and vLLM are performed entirely within the framework and are not parallelized with multi-threading.

Taking vLLM as an example (PR: https://github.com/vllm-project/vllm/pull/14702/files), the algorithm pseudocode can be abstracted as the following logic:

# 0. constrained decode: init bitmask
bitmask_tensor = llguidance_torch.allocate_token_bitmask((batch_size * (spec_k + 1)), self.vocab_size)

# 1. spec decode: draft sample
batch_draft_tokens = draft_sample_func() # shape is [batch_size, spec_k]

# 2. constrained decode: compute bitmask
for idx, matcher in enumerate(batch_ll_matchers):
    state_advancements = 0
    tokens = batch_draft_tokens[idx] + [None]
    for token_idx, token in enumerate(tokens):
        matcher.fill_bitmask(bitmask_tensor, idx * (spec_k + 1) + token_idx)
        if (
            token is None
            or matcher.is_terminated()
            or not matcher.try_consume_token(token)
        ):
            break
        state_advancements += 1
    # restore the matcher state after speculating forward
    if state_advancements > 0:
        matcher.rollback(state_advancements)
            
# 3. spec decode: target verify
target_logits = target_verify_func() # shape is [batch_size * (spec_k + 1), vocab_size]

# 4. constrained decode: apply bitmask
apply_bitmask(target_logits, bitmask_tensor)

# 5. spec decode: reject sample
batch_new_tokens = reject_sample(target_logits, batch_draft_tokens) # shape is List[List[int]], len(batch_new_tokens) == batch_size

# 6. constrained decode: matcher accept token
for matcher, new_tokens in zip(batch_ll_matchers, batch_new_tokens):
    matcher.try_consume_tokens(new_tokens)
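The step-2 loop above can be sketched as a runnable toy. `ToyMatcher` and `compute_bitmask_for_draft` below are illustrative stand-ins, not llguidance API: a plain list of lists stands in for the bit-packed mask tensor, and the matcher simply tracks a position in a fixed target sequence.

```python
class ToyMatcher:
    """Toy stand-in for a constraint matcher: accepts tokens from a
    fixed target sequence, tracking its position."""

    def __init__(self, target):
        self.target = target
        self.pos = 0

    def fill_bitmask(self, bitmask, row):
        # Allow only the next expected token (real masks allow a set).
        if self.pos < len(self.target):
            bitmask[row][self.target[self.pos]] = 1

    def is_terminated(self):
        return self.pos >= len(self.target)

    def try_consume_token(self, token):
        if not self.is_terminated() and self.target[self.pos] == token:
            self.pos += 1
            return True
        return False

    def rollback(self, n):
        self.pos -= n


def compute_bitmask_for_draft(matcher, draft_tokens, bitmask, base_row):
    """Fill one bitmask row per draft position (plus one for the bonus
    token), then roll the matcher back to its original state."""
    advancements = 0
    for row_off, token in enumerate(draft_tokens + [None]):
        matcher.fill_bitmask(bitmask, base_row + row_off)
        if token is None or matcher.is_terminated() or not matcher.try_consume_token(token):
            break
        advancements += 1
    if advancements > 0:
        matcher.rollback(advancements)
    return advancements
```

With a matcher for the sequence `[1, 2, 3]` and draft tokens `[1, 2]`, rows 0-2 are filled to allow tokens 1, 2, and 3 respectively, and the matcher ends back at its original state; with draft `[1, 9]`, the loop stops at the invalid token 9 and the row after it is never filled.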

@ZonePG
Contributor Author

ZonePG commented Aug 6, 2025

It is clear that in step 2, since each request's matcher is independent, the requests can be processed in parallel.

In llguidance, without speculative decoding, the bitmask's first dimension is the batch size, not batch size * (spec_k + 1). The calculation can be parallelized using fill_next_token_bitmask_par.

However, when combined with spec decode, the bitmask calculation follows the step-2 approach above and can likewise be parallelized. The parallel processing can be done in the llguidance backend rather than in the inference framework (vLLM, SGLang), which better avoids the impact of the Python GIL.
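Because each matcher owns a disjoint block of `spec_k + 1` rows, the per-request fills can run concurrently without coordination. The sketch below illustrates only that row-ownership layout with a Python thread pool; `fill_request_rows` is a hypothetical stand-in for a matcher's fill calls, and in pure Python these threads serialize on the GIL anyway, which is why moving the work into the Rust backend pays off.

```python
from concurrent.futures import ThreadPoolExecutor

def fill_request_rows(idx, draft_tokens, spec_k, vocab, bitmask):
    """Fill the (spec_k + 1) rows owned by request `idx`.
    Each request writes only to its own row block, so no locking is needed."""
    base = idx * (spec_k + 1)
    for off, tok in enumerate(draft_tokens + [None]):
        bitmask[base + off] = [idx + 1] * vocab  # tag each row with its owner
        if tok is None:
            break

spec_k, vocab = 2, 4
batch_draft_tokens = [[3, 1], [2, 2], [0, 3]]
bitmask = [[0] * vocab for _ in range(len(batch_draft_tokens) * (spec_k + 1))]
with ThreadPoolExecutor(max_workers=3) as pool:
    for idx, draft in enumerate(batch_draft_tokens):
        pool.submit(fill_request_rows, idx, draft, spec_k, vocab, bitmask)
# All tasks finish when the pool shuts down; each row block carries its owner's tag.
```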

@ZonePG
Contributor Author

ZonePG commented Aug 6, 2025

In Step 4, applying the bitmask ensures that multiple tokens generated during spec decode still meet the matcher's constraints.

Furthermore, for tokens that were not accepted by the matcher in Step 2, no operation is needed on the mask, whose default value is 1. Tokens that are not accepted by the matcher will be rejected during spec decode's reject sampling (step 5), thus ensuring correctness.
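A minimal pure-Python sketch of steps 4-5 can make this concrete. `apply_bitmask` and `greedy_reject_sample` below are illustrative stand-ins (greedy argmax verification instead of real probabilistic rejection sampling), not vLLM or llguidance API. The point is that once a draft token diverges from the target, all later positions are never read, so their mask rows can safely stay at the default all-ones value.

```python
def apply_bitmask(logits, bitmask):
    """Set logits of disallowed tokens to -inf, leaving allowed ones intact."""
    return [
        [l if bit else float("-inf") for l, bit in zip(row, mask_row)]
        for row, mask_row in zip(logits, bitmask)
    ]

def greedy_reject_sample(masked_logits, draft_tokens):
    """Greedy stand-in for rejection sampling: accept draft tokens while the
    target's argmax agrees, then emit the target's token and stop."""
    accepted = []
    for pos, row in enumerate(masked_logits):
        best = max(range(len(row)), key=row.__getitem__)
        accepted.append(best)
        if pos >= len(draft_tokens) or best != draft_tokens[pos]:
            break  # positions after this are never consulted
    return accepted
```

For example, with draft tokens `[2, 3]` and a mask that constrains only row 0 (rows 1-2 left at the default all-ones), the target may agree at position 0, disagree at position 1, and row 2 is never used.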

@ZonePG
Contributor Author

ZonePG commented Aug 6, 2025

The vLLM PR (vllm-project/vllm#21862) implemented multi-threaded bitmask computation in non-spec mode, but in fact, when using llguidance, the fill_next_token_bitmask_par interface can be called directly. Multi-threaded bitmask computation is still not supported in spec mode.

The main purpose of this PR is step 2: parallelizing the bitmask computation in spec mode.

Once this PR is merged, I plan to call the fill_next_token_bitmask_par_with_draft_tokens interface directly from vLLM and sglang, avoiding the need to compute the bitmask with Python multi-threading inside the inference framework, and thus avoiding the GIL overhead of Python threads.

I know that combining spec decode and constrained decode might be a bit difficult to understand. If there are any questions, feel free to discuss them.

@ZonePG ZonePG force-pushed the parallel_compute_bitmask_with_draft_token branch from e68f3df to 5dd26a6 Compare August 6, 2025 20:29
@ZonePG ZonePG requested a review from hudson-ai August 6, 2025 20:31
@hudson-ai
Contributor

@ZonePG your explanation here is really useful, thank you!

In Step 4, applying the bitmask ensures that multiple tokens generated during spec decode still meet the matcher's constraints.
Furthermore, for tokens that were not accepted by the matcher in Step 2, no operation is needed on the mask, whose default value is 1. Tokens that are not accepted by the matcher will be rejected during spec decode's reject sampling (step 5), thus ensuring correctness.

I'm still struggling to understand this point though -- shouldn't the mask's default value not matter here, given that the previous draft token would necessarily be rejected? If I'm misunderstanding and it is indeed important that the mask take a "high" value in the following positions, please add a check to your test that ensures this behavior.

Either way, adding a third matcher to your test and a sequence of draft tokens that should be rejected at some position would be useful, at least making the assertion that the draft token that is to be rejected is indeed rejected.

@ZonePG
Contributor Author

ZonePG commented Aug 7, 2025

shouldn't the mask's default value not matter here, given that the previous draft token would necessarily be rejected?

Yes, the bitmask value for rejected positions does not matter here. The mask just has a default bit value of 1 from step 0.

Since we did not perform any operations on those positions, they remain at the default value of 1.

@ZonePG ZonePG closed this Aug 7, 2025
@ZonePG ZonePG reopened this Aug 7, 2025
@ZonePG ZonePG force-pushed the parallel_compute_bitmask_with_draft_token branch from 3341109 to fc7571b Compare August 7, 2025 06:56
@ZonePG ZonePG force-pushed the parallel_compute_bitmask_with_draft_token branch from fc7571b to 40afda1 Compare August 7, 2025 06:58
@ZonePG
Contributor Author

ZonePG commented Aug 7, 2025

hi @hudson-ai, I added more detailed test cases to test_par_draft_tokens, including a third matcher that is a state machine accepting any token, and a fourth matcher that receives only partially valid draft tokens.

@hudson-ai
Contributor

hudson-ai commented Aug 8, 2025

Looking pretty good to me (beyond some typing fixes needed in your tests it seems), although I'd appreciate some input from @mmoskal if you get a chance soon!

Member

@mmoskal mmoskal left a comment


LGTM, though I didn't check all the address calculations exactly; I guess if they work, they work!

Comment thread python/llguidance/_lib.pyi Outdated
zonepg666@gmail.com added 2 commits August 11, 2025 03:38
Comment thread python/llguidance/_lib.pyi Outdated
@ZonePG ZonePG force-pushed the parallel_compute_bitmask_with_draft_token branch from 5525c98 to b674b66 Compare August 12, 2025 05:22
@hudson-ai hudson-ai merged commit 0bb5ee9 into guidance-ai:main Aug 12, 2025
3 checks passed
@hudson-ai
Contributor

Thank you @ZonePG!

