spec - default MTP draft backend_sampling to off (#23903) #23921
Closed
ssam18 wants to merge 1 commit into
PR ggml-org#23287 enabled backend draft sampling by default for the MTP path, attaching a per-seq_id sampler chain (top_k=10) to the draft context. This adds compute-buffer footprint that scales with n_seq, so configs that fit comfortably in VRAM at --parallel N>1 on b9246 now OOM during the first decode on b9410+. See ggml-org#23903 for the bisect: b9246 fit two slots in 15.6 GB, while b9426 needs essentially the full 16 GB for one slot under the same model and flags.

Default the new behavior to off so the regression does not fire on configs that worked before. Users who want backend sampling can opt back in with --spec-draft-backend-sampling (already wired by PR ggml-org#23287). The help text automatically reflects the new default via string_format("default: %s", ... ? "enabled" : "disabled").
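For readers unfamiliar with the pattern, here is a minimal sketch of how a boolean default can drive both the runtime behavior and the help text, so that flipping the default needs no separate help-string edit. The struct `spec_draft_params`, the helpers `format_default` and `apply_flag`, and the assumption that the flag takes no value are all hypothetical stand-ins for illustration; they are not the actual llama.cpp code.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for the draft parameters; the real struct and
// field names in the codebase may differ.
struct spec_draft_params {
    bool backend_sampling = false; // default off, as proposed in this PR
};

// Stand-in for the project's printf-style string_format helper:
// derives the help-text suffix from the current default value.
static std::string format_default(bool enabled) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "default: %s", enabled ? "enabled" : "disabled");
    return std::string(buf);
}

// Hypothetical flag handler: passing the flag opts back in.
// Assumes the flag is a plain switch that takes no value.
static void apply_flag(spec_draft_params & p, const std::string & arg) {
    if (arg == "--spec-draft-backend-sampling") {
        p.backend_sampling = true;
    }
}
```

Because the help suffix is computed from the same field that holds the default, the two cannot drift apart when the default changes.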
Contributor
|
The poster should use |
Fixes #23903.
PR #23287 turned on backend sampling by default for the MTP draft path, but the per-sequence compute buffer overhead is big enough that configs which ran cleanly on b9246 now OOM at --parallel 2 on b9426 and later. Flipping the default back to off restores the working baseline, and anyone wanting backend sampling can still opt in with --spec-draft-backend-sampling. I tested locally with a CPU build and a small non-MTP model because the reporter's exact setup needs a Blackwell card, so the MTP path itself still needs someone with that config to confirm the OOM is gone.