[Spec Decoding] Streamline batch expansion tensor manipulation #7851
There are some inefficiencies in the spec decoding logic, particularly related to batch expansion and the handling of mixed spec decode / non spec decode batches:

- Make the `split_batch_by_proposal_len` method split the batch in a single iteration, rather than iterating separately to build the spec and non-spec lists
- Run `_run_no_spec` rather than `_run_speculative_decoding_step` in the case that all sequences have `max_proposal_len == 0`
- Add a `_contract_batch_all_spec` method for the (common) case that all batch sequences have spec decode enabled, which skips the logic to split and recombine the spec/non-spec sequences
- Streamline the `_split_scoring_output` method used in `_contract_batch` to avoid unnecessary intermediate tensor manipulation

In an anecdotal test of mlpspeculator with bs=1, this gives a consistent 2-3% increase in throughput.
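The single-pass split described in the first bullet can be sketched in plain Python. The dict records and the `proposal_len` key are illustrative stand-ins, not vLLM's actual sequence-group metadata types:

```python
# Partition a batch into spec / non-spec groups in ONE pass, instead of
# two separate filtering iterations over the same batch. The original
# indices are kept so the results can be recombined in batch order later.

def split_batch_by_proposal_len(batch):
    """Return ([(idx, seq) with proposal_len > 0], [(idx, seq) otherwise])."""
    spec, non_spec = [], []
    for i, seq in enumerate(batch):
        (spec if seq["proposal_len"] > 0 else non_spec).append((i, seq))
    return spec, non_spec

batch = [{"proposal_len": 3}, {"proposal_len": 0}, {"proposal_len": 5}]
spec, non_spec = split_batch_by_proposal_len(batch)
# spec holds original indices 0 and 2; non_spec holds index 1
```

A single loop touches each sequence once, whereas two list comprehensions (one filtering spec, one filtering non-spec) would iterate the batch twice.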
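The two fast paths (all-zero proposal lengths, and all-spec batches) amount to checking batch homogeneity before choosing a code path. A simplified, hypothetical dispatch illustrating the idea (the real control flow is spread across the methods named above, and the string returns here are just labels):

```python
# Illustrative dispatch only: pick a fast path when the batch is
# homogeneous, and fall back to the general split/recombine path for
# mixed batches. Not vLLM's actual control flow.

def choose_step(batch):
    proposal_lens = [seq["proposal_len"] for seq in batch]
    if all(p == 0 for p in proposal_lens):
        # No sequence has proposals: skip speculative machinery entirely.
        return "_run_no_spec"
    if all(p > 0 for p in proposal_lens):
        # Every sequence is speculative: no split/recombine needed.
        return "_contract_batch_all_spec"
    # Mixed batch: general path that splits and recombines.
    return "_contract_batch"
```

Since all-spec batches are the common case, routing them around the split/recombine logic avoids per-step tensor bookkeeping that only mixed batches actually need.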