
spec - default MTP draft backend_sampling to off (#23903) #23921

Closed
ssam18 wants to merge 1 commit into ggml-org:master from ssam18:fix/issue-23903-mtp-backend-sampling-default


ssam18 commented May 30, 2026

Fixes #23903.

PR #23287 turned on backend sampling by default for the MTP draft path, but the per-sequence compute buffer overhead is large enough that configs which ran cleanly on b9246 now OOM at --parallel 2 on b9426 and later. Flipping the default back to off restores the working baseline, and anyone who wants backend sampling can still opt in with --spec-draft-backend-sampling. I tested locally with a CPU build and a small non-MTP model, since the reporter's exact setup needs a Blackwell card. The MTP path itself still needs someone with that config to confirm the OOM is gone.

PR ggml-org#23287 enabled backend draft sampling by default for the MTP path, attaching a per-seq_id sampler chain (top_k=10) to the draft context. This adds compute-buffer footprint that scales with n_seq, so configs that fit comfortably in VRAM at --parallel N>1 on b9246 now OOM during the first decode on b9410+ (see ggml-org#23903 for the bisect: b9246 fit two slots in 15.6 GB, while b9426 needs essentially the full 16 GB for one slot under the same model and flags).
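To make the "scales with n_seq" claim concrete, here is a toy linear model of the footprint. It is only a sketch: the function names and the fixed-plus-per-sequence split are assumptions for illustration, not how llama.cpp actually reserves its compute buffer, and no measured numbers are baked in.

```cpp
#include <cstdint>

// Toy model (hypothetical): a fixed cost plus a cost that grows with each
// sequence. The real allocator is not modeled here.
static std::uint64_t compute_buffer_bytes(std::uint64_t base_bytes,
                                          std::uint64_t per_seq_bytes,
                                          std::uint32_t n_seq) {
    return base_bytes + per_seq_bytes * n_seq;
}

// Largest n_seq whose estimated footprint still fits in the budget.
// Returns 0 when even the fixed cost does not fit, or when per_seq_bytes
// is 0 (the model is degenerate in that case).
static std::uint32_t max_slots(std::uint64_t budget_bytes,
                               std::uint64_t base_bytes,
                               std::uint64_t per_seq_bytes) {
    if (budget_bytes < base_bytes || per_seq_bytes == 0) {
        return 0;
    }
    return static_cast<std::uint32_t>((budget_bytes - base_bytes) / per_seq_bytes);
}
```

Under this model, any increase in the per-sequence term (such as an extra sampler chain per seq_id) lowers the number of slots that fit in a fixed VRAM budget, which matches the shape of the regression described above.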

Default the new behavior off so the regression does not fire on configs that worked before. Users wanting backend sampling can opt back in with --spec-draft-backend-sampling (already wired by PR ggml-org#23287).

The help text auto-reflects the new default via string_format("default: %s", ... ? "enabled" : "disabled").
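The pattern behind this can be sketched as follows. This is a minimal, self-contained illustration, not the actual llama.cpp argument parser: spec_params, help_line, and parse_args are hypothetical names, and std::snprintf stands in for the project's string_format helper. Only the two flag names come from this thread.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for the draft-speculation options.
struct spec_params {
    bool backend_sampling = false; // the default proposed in this PR
};

// Build the help line from the live default so the text cannot drift
// from the value in the struct.
static std::string help_line(const spec_params & defaults) {
    char buf[128];
    std::snprintf(buf, sizeof(buf),
                  "--spec-draft-backend-sampling, --no-spec-draft-backend-sampling (default: %s)",
                  defaults.backend_sampling ? "enabled" : "disabled");
    return buf;
}

// Apply the flags in order; the last one on the command line wins.
static spec_params parse_args(const std::vector<std::string> & args) {
    spec_params p;
    for (const std::string & a : args) {
        if (a == "--spec-draft-backend-sampling")    { p.backend_sampling = true;  }
        if (a == "--no-spec-draft-backend-sampling") { p.backend_sampling = false; }
    }
    return p;
}
```

Because the help string is derived from a default-constructed spec_params, changing the initializer in one place updates both the behavior and the documentation. Either flag still overrides the default explicitly.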

am17an commented May 31, 2026

The poster should use --no-spec-draft-backend-sampling

@am17an am17an closed this May 31, 2026
Development

Successfully merging this pull request may close these issues.

Eval bug: MTP draft path OOMs reserving compute buffer on first decode (b9410+), a --parallel 2 config that b9246 runs fine
