support multithread compute bitmask for spec decode #225
hudson-ai merged 8 commits into guidance-ai:main from
Conversation
Still need to look more closely at your implementation and the example/test, but what is the failure mode when draft tokens are disallowed by the mask? Are they always expected to be valid?
When constrained decoding and speculative decoding are combined as orthogonal techniques, the current sglang and vLLM bitmask calculations involve steps that are all performed inside the framework and are not parallelized with multi-threading. Taking vLLM as an example (PR: https://github.com/vllm-project/vllm/pull/14702/files), the algorithm's pseudocode can be abstracted as the following logic:
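A rough sketch of that abstracted logic, covering the five steps referenced below. The matcher API here (`ToyMatcher`, `fill_bitmask`, `is_allowed`, `accept_token`) is an illustrative stand-in, not llguidance's or vLLM's actual interface:

```python
import numpy as np

class ToyMatcher:
    """Toy stand-in for a grammar matcher: allows a fixed token set.
    Illustrative only; a real matcher tracks grammar state."""
    def __init__(self, allowed):
        self.allowed = set(allowed)
    def fill_bitmask(self, bitmask, row):
        bitmask[row] = 0
        bitmask[row, list(self.allowed)] = 1
    def is_allowed(self, tok):
        return tok in self.allowed
    def accept_token(self, tok):
        pass  # a real matcher would advance its grammar state here

def compute_spec_bitmask(matchers, draft_tokens, vocab_size):
    """Steps 1-2: fill one mask row per (request, draft position)."""
    spec_k = len(draft_tokens[0])
    # One row per request for position 0, plus one per draft token.
    bitmask = np.ones((len(matchers) * (spec_k + 1), vocab_size), dtype=np.int8)
    for i, m in enumerate(matchers):
        base = i * (spec_k + 1)
        m.fill_bitmask(bitmask, base)             # mask for the first position
        for j, tok in enumerate(draft_tokens[i]):
            if not m.is_allowed(tok):             # draft token disallowed:
                break                             # later rows keep default 1s
            m.accept_token(tok)
            m.fill_bitmask(bitmask, base + j + 1)
    return bitmask
    # 3. run the target model on all positions at once
    # 4. apply the bitmask to the target logits
    # 5. rejection-sample the longest valid prefix of draft tokens
```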
It is clear that in the second step, since each request's matcher is independent, the requests can be processed in parallel. In llguidance, without speculative decoding, the bitmask's first dimension is the batch size, not batch size * (spec_k + 1), and that calculation can already be parallelized. However, when combined with spec decode, the bitmask calculation follows the second-step approach and can also be parallelized. The parallel processing can be done in the llguidance backend rather than in the inference framework (vLLM, SGLang), which better avoids the impact of the GIL.
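To make the per-request independence concrete, here is a sketch that parallelizes step 2 across requests with a thread pool. In pure Python the GIL limits the speedup, which is exactly why this PR moves the loop into the Rust backend; the structure is the same either way. All names here (`ToyMatcher`, `fill_request`) are illustrative, not the actual API:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

class ToyMatcher:
    """Toy grammar matcher allowing a fixed token set (illustrative only)."""
    def __init__(self, allowed):
        self.allowed = set(allowed)
    def fill_bitmask(self, bitmask, row):
        bitmask[row] = 0
        bitmask[row, list(self.allowed)] = 1

def fill_request(matcher, drafts, bitmask, base):
    """Step-2 work for one request: touches only this request's mask rows,
    so it is independent of every other request."""
    matcher.fill_bitmask(bitmask, base)
    for j, tok in enumerate(drafts):
        if tok not in matcher.allowed:     # disallowed draft token:
            break                          # later rows keep default 1s
        matcher.fill_bitmask(bitmask, base + j + 1)

def compute_spec_bitmask_par(matchers, draft_tokens, vocab_size):
    spec_k = len(draft_tokens[0])
    bitmask = np.ones((len(matchers) * (spec_k + 1), vocab_size), dtype=np.int8)
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(fill_request, m, draft_tokens[i], bitmask,
                        i * (spec_k + 1))
            for i, m in enumerate(matchers)
        ]
        for f in futures:
            f.result()  # re-raise any worker exception
    return bitmask
```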
In step 4, applying the bitmask ensures that the multiple tokens generated during spec decode still satisfy the matcher's constraints. Furthermore, for tokens that were not accepted by the matcher in step 2, no operation is needed on the mask, whose default value is 1. Tokens not accepted by the matcher will be rejected during spec decode's rejection sampling (step 5), which ensures correctness.
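A minimal sketch of why the default mask value is harmless, using greedy verification (real spec decode accepts/rejects probabilistically; the function name is hypothetical): a draft token survives only if the masked target distribution gives it nonzero probability, and everything from the first rejection onward is discarded, so mask rows past that point are never consulted.

```python
def verify_drafts(draft_tokens, masked_probs):
    """Greedy-verification sketch: masked_probs[j] maps token -> probability
    after the bitmask has been applied at draft position j."""
    accepted = []
    for j, tok in enumerate(draft_tokens):
        if masked_probs[j][tok] == 0.0:   # the bitmask zeroed this token
            break                          # positions > j are discarded
        accepted.append(tok)
    return accepted
```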
The vLLM PR (vllm-project/vllm#21862) has implemented multi-threaded computation in non-spec mode. The main purpose of this PR is step two: parallelizing the bitmask computation in spec mode. Once this PR is merged, I plan to follow up accordingly. I know that combining spec decode and constrained decode can be a bit difficult to understand; if there are any questions, feel free to discuss them.
@ZonePG your explanation here is really useful, thank you!
I'm still struggling to understand this point though -- shouldn't the mask's default value not matter here, given that the previous draft token would necessarily be rejected? If I'm misunderstanding and it is indeed important that the mask take a "high" value in the following positions, please add a check to your test that ensures this behavior. Either way, adding a third matcher to your test and a sequence of draft tokens that should be rejected at some position would be useful, at minimum asserting that the draft token that should be rejected is indeed rejected.
Yes, the bitmask value for rejected positions does not matter here. The mask simply has a default bit value of 1 at initialization; since we did not perform any operations on those rows, they remain at the default value of 1.
Hi @hudson-ai, I added more detailed test cases to test_par_draft_tokens, including a third matcher that is a state machine accepting any token, and a fourth matcher that receives only partially valid draft tokens.
Looking pretty good to me (beyond some typing fixes needed in your tests, it seems), although I'd appreciate some input from @mmoskal if you get a chance soon!
mmoskal
left a comment
LGTM, though I didn't check all the address calculations exactly; I guess if they work, they work!
Thank you @ZonePG!
Support speculative decoding when we have draft tokens.
Compute the bitmask for draft tokens in parallel in the Rust constrained backend to avoid the Python GIL.