StepFun 3.5 MTP #23274
|
Converting to draft while I try something out. |
|
@ggerganov just FYI because it's related to the cleanup - I modified the TODO-annotated line with |
|
Sigh. Back to draft, let's see if I can do anything about that. |
|
@pwilkin The fill in llm_graph_input_mtp_chain_tokens::set_input writes contaminated KV entries, and the pad share is especially bad on the all-accept verify path. Nothing cleans them up afterwards, so they pile up across rounds. Also, block 0's last-position input should be the token we just sampled from the target (T_new), not a reuse of ubatch->token[n_tokens-1]. Even with T_new threaded in, block k+1's tail still has k slots that no real token can fill. Those have to come from the previous block's chain sample, same as what the DRAFT branch already does. So for block k+1 in PREFILL: chain-sampled tokens from block k at the tail (last shift = k slots), ubatch-shifted tokens for the rest. If the analysis holds up, I can put a PR on your step35mtp branch. Or take it yourself if you'd rather; no preference on my end, just let me know. |
3043a4b to c0fad87
|
@forforever73 yeah, I see the problem. I've revised the goals for this PR - since doing proper multi-step MTP will require more significant changes, I'll stick to the simple version, which is to only use the first layer similarly to the Qwen3.5 MTP code and save the proper architecture for a future PR (so you're free to propose one). |
|
All right, here are the benchmark stats for the single-layer version (only small changes to the iSWA cache code left from core stuff):
Non-MTP (
code_python pred= 192 draft= 0 acc= 0 rate=n/a tok/s=14.2
code_cpp pred= 192 draft= 0 acc= 0 rate=n/a tok/s=14.4
explain_concept pred= 192 draft= 0 acc= 0 rate=n/a tok/s=14.4
summarize pred= 192 draft= 0 acc= 0 rate=n/a tok/s=14.5
qa_factual pred= 192 draft= 0 acc= 0 rate=n/a tok/s=14.3
translation pred= 192 draft= 0 acc= 0 rate=n/a tok/s=14.6
creative_short pred= 192 draft= 0 acc= 0 rate=n/a tok/s=14.1
stepwise_math pred= 192 draft= 0 acc= 0 rate=n/a tok/s=14.1
long_code_review pred= 192 draft= 0 acc= 0 rate=n/a tok/s=14.2
Aggregate: {
"n_requests": 9,
"total_predicted": 1728,
"total_draft": 0,
"total_draft_accepted": 0,
"aggregate_accept_rate": null,
"wall_s_total": 144.35
}
MTP (
code_python pred= 192 draft= 123 acc= 105 rate=0.854 tok/s=18.2
code_cpp pred= 192 draft= 108 acc= 92 rate=0.852 tok/s=17.1
explain_concept pred= 192 draft= 102 acc= 84 rate=0.824 tok/s=16.1
summarize pred= 186 draft= 117 acc= 97 rate=0.829 tok/s=16.7
qa_factual pred= 192 draft= 131 acc= 111 rate=0.847 tok/s=18.3
translation pred= 192 draft= 116 acc= 93 rate=0.802 tok/s=16.9
creative_short pred= 192 draft= 95 acc= 85 rate=0.895 tok/s=16.6
stepwise_math pred= 192 draft= 121 acc= 112 rate=0.926 tok/s=19.0
long_code_review pred= 192 draft= 116 acc= 100 rate=0.862 tok/s=17.0
Aggregate: {
"n_requests": 9,
"total_predicted": 1722,
"total_draft": 1029,
"total_draft_accepted": 879,
"aggregate_accept_rate": 0.8542,
"wall_s_total": 123.49
}
I'd argue it's good enough. |
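As a sanity check on the figures above, the short sketch below recomputes the aggregate accept rate and the speedup from the numbers posted in this comment. The variable names are illustrative, and the reading of `wall_s_total` as total wall-clock seconds for the whole run is an assumption; nothing here comes from the benchmark script itself.

```python
# Per-prompt generation speed (tok/s), copied from the two runs above.
non_mtp = {
    "code_python": 14.2, "code_cpp": 14.4, "explain_concept": 14.4,
    "summarize": 14.5, "qa_factual": 14.3, "translation": 14.6,
    "creative_short": 14.1, "stepwise_math": 14.1, "long_code_review": 14.2,
}
mtp = {
    "code_python": 18.2, "code_cpp": 17.1, "explain_concept": 16.1,
    "summarize": 16.7, "qa_factual": 18.3, "translation": 16.9,
    "creative_short": 16.6, "stepwise_math": 19.0, "long_code_review": 17.0,
}

# Aggregate counters from the MTP run.
total_draft = 1029
total_draft_accepted = 879

# Accept rate = accepted draft tokens / proposed draft tokens.
accept_rate = total_draft_accepted / total_draft

# Per-prompt speedup as the ratio of tok/s with and without MTP.
speedups = {name: mtp[name] / non_mtp[name] for name in non_mtp}
mean_speedup = sum(speedups.values()) / len(speedups)

# End-to-end ratio, assuming wall_s_total is wall-clock seconds per run.
throughput_non_mtp = 1728 / 144.35
throughput_mtp = 1722 / 123.49
throughput_ratio = throughput_mtp / throughput_non_mtp

print(f"accept rate      : {accept_rate:.4f}")
print(f"mean tok/s ratio : {mean_speedup:.3f}")
print(f"wall-clock ratio : {throughput_ratio:.3f}")
for name, ratio in sorted(speedups.items(), key=lambda kv: kv[1]):
    print(f"{name:18s} {ratio:.3f}")
```

The per-prompt ratio and the wall-clock ratio need not agree, since wall time presumably also covers prompt processing and request overhead.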
|
4xMI50 32GB, 2xMI50 16GB
apohelios/step3p5_flash_Q4_1-00001-of-00005.gguf
Non-MTP
MTP
MTP from #20981
@pwilkin Is the stepfun-mtp.gguf from your cmd publicly available? |
|
@toastytorque yeah, |
|
Also, try |
|
Tests above were indeed with
With n=3 the performance is still lacking, might be an MI50 thing:
It seems I get best results with p-min default (0.0) and n=1: |
|
Yeah, probably depends on the quant as well. |
|
@pwilkin |
|
@am17an any chance we could fast track it? This is the simple version and with fixes on main, no core changes are needed. @forforever73 already said he'd help with the proper full 3-layer support in a followup. |
|
Are these extra scripts required? Seems a bit hacky to have them in master |
|
Well, they're useful, but I'm willing to throw them out to some helper repo if you think that's better. |
|
@CISC should be GTG? |
|
AFAIU this (might) only work with draft size of 1 and we don't have meaningful performance numbers yet to make a conclusion. Probably better to see if the fully functional MTP that uses all MTP layers can be implemented instead of merging this partial work? |
|
@ggerganov all the performance numbers in this thread (from me and from @toastytorque) show a consistent speedup of about 20-25%, I think that's strong enough to warrant inclusion on its own (and this is a very simple patch, the complex solution is bound to be much more complicated and require more turnaround due to the changes to core). |
|
Yay, |
Fortunately every Linux distribution still has this wonderful tool named "dos2unix" :) |
|
For me the speedup is quite significant and consistent in its current form (see numbers above), so of course I'd like to see it merged. Just my 2ct though. |
|
@CISC bump? |
Up to @ggerganov |
|
Do we know if this PR will work for step-3.7-flash as well? I tried to test it, but none of the step-3.7-flash GGUFs that I have came with any MTP layers. |
|
@coder543 I don't really know what I'm doing, but I gave this a shot:
|
|
@coder543 It should work for 3.7-Flash. I'm working on converting and updating my quants with it. |
* StepFun 3.5 MTP
* Simplify to single layer
* Rollback core changes
* fix flake8 errors
* Remove scripts
* modify to convention
* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* dos2unix

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
|
An MTP-only draft model can be taken from https://huggingface.co/notSnix/Step-3.7-Flash-Q4_K_M-MTP-GGUF |
|
I uploaded an MTP model converted with the latest code here: https://huggingface.co/stepfun-ai/Step-3.7-Flash-GGUF/tree/main
It works with
But I didn't observe any speedup on an Apple M4 Max.
no mtp:
mtp:
Are there any parameter recommendations? |
|
@forforever73 |
|
@slavap Interesting. On my device, those parameters make performance worse (35.93 t/s). Looks like I'll need to spend some time investigating it further. |
|
@forforever73 you might try different combinations of |
|
@pwilkin tried several parameter combinations, but all seem to make things slower. A bit strange |
|
I tried several Step-3.7-Flash MTP models, and most of them used an immense amount of memory (compared to not using MTP), so I couldn't really do anything with them. @AesSedai's IQ4 worked well enough for me to fit more than 128k tokens of context with MTP enabled on a DGX Spark. Some performance results on the DGX Spark using draft-p-min of 0.6:
Prompt A is "What is the LHC?" Prompt B is "Write a TypeScript React example." |
|
This test was using f16 KV cache |


Overview
MTP implementation for StepFun 3.5.
Additional information
Required a few changes to the core logic because StepFun uses a slightly different MTP architecture - it has 3 MTP layers which are used in a round-robin manner for tokens n+1, n+2 and n+3 respectively.
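To make the round-robin description concrete, here is a minimal sketch of the layer selection it implies. The function name and the 0-based layer indexing are hypothetical and are not taken from the llama.cpp code. The wrap-around for offsets beyond n+3 is also an assumption, since the description only covers n+1 through n+3.

```python
N_MTP_LAYERS = 3  # from the description: StepFun 3.5 has 3 MTP layers


def mtp_layer_for_offset(offset: int, n_layers: int = N_MTP_LAYERS) -> int:
    """Return the 0-based MTP layer assumed to draft token n+offset.

    offset=1 is the first token after the last accepted token n.
    Offsets past n_layers wrap around (an assumption, not from the PR).
    """
    if offset < 1:
        raise ValueError("offset must be >= 1")
    return (offset - 1) % n_layers


# Layer used for each of the next three draft positions.
schedule = {f"n+{k}": mtp_layer_for_offset(k) for k in range(1, 4)}
print(schedule)
```

The single-layer version discussed later in the thread corresponds to using only the first MTP layer (index 0 in this sketch).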
I'm running a suboptimal setup for testing this, but FWIW testing this on a --cpu-moe StepFun 3.5 increased token generation from 15 to 18 t/s.
Requirements