Draft: FP4 Disagg + MTP Configs #48

Merged
Fridge003 merged 1 commit into ishandhanani:main from trevor-m:disagg-mtp
Dec 6, 2025
@trevor-m
Collaborator

@trevor-m trevor-m commented Dec 2, 2025

Adds two configs:

  • 1p2d-mtp - Required only minimal changes from the original 1p2d config: I added the speculative args and reduced the memory fraction on the decode nodes.
  • max-tpt-2-mtp - This one required a lot of workarounds:
    • To stay under the SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK cap of 1024, I basically halved the token-related settings (and doubled the reserved decode tokens) to account for speculative-num-steps=1.
      • max running requests: 67584->33792
      • cuda graph max bs: 1024->512 (later reduced to 256)
      • num reserved decode tokens: 112->224
    • Use a shell script patch for --speculative-moe-a2a-backend (remove once sgl-project/sglang#13115, "support mtp with deepseek r1 nvfp4 model", is merged).
    • After those changes I hit OOMs on the decode side during draft-mode CUDA graph capture. Reducing the mem fraction from 0.83->0.73 didn't appear to make a difference, so I ultimately reduced the cuda graph bs to 256.
    • On the prefill side, there is a bug with single batch overlap (SBO) and the speculative layer, so I disabled SBO for now.
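
A rough sketch of the arithmetic behind the halving, for readers checking the numbers. It assumes each running request contributes 1 + speculative-num-steps tokens per decode step; that is an illustrative assumption, not something confirmed by the SGLang source. The helper names `tokens_per_rank` and `fits_cap` are hypothetical, and only the cap of 1024 and the batch sizes come from the description above.

```python
# Cap named in the PR description (SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK).
DISPATCH_CAP = 1024


def tokens_per_rank(batch_size: int, speculative_num_steps: int) -> int:
    """Tokens dispatched per rank in one decode step.

    Assumption (not verified against SGLang): every request carries
    one regular token plus one token per speculative step.
    """
    return batch_size * (1 + speculative_num_steps)


def fits_cap(batch_size: int, speculative_num_steps: int, cap: int = DISPATCH_CAP) -> bool:
    """True if the batch stays at or under the dispatch cap."""
    return tokens_per_rank(batch_size, speculative_num_steps) <= cap


if __name__ == "__main__":
    # Batch sizes mentioned in the PR: original, halved, and final value.
    for bs in (1024, 512, 256):
        print(bs, tokens_per_rank(bs, 1), fits_cap(bs, 1))
```

Under that assumption, the original cuda graph max bs of 1024 would exceed the cap with speculative-num-steps=1, while 512 sits exactly at it. The later drop to 256 was driven by the OOMs, not by the cap.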

@ishandhanani
Owner

Can you rename that shell script and add a comment explaining why it's used and when we should remove it? I'm assuming we will remove it once we ship sgl > 0.5.6.

@trevor-m
Collaborator Author

trevor-m commented Dec 2, 2025

Thanks, yes it can be removed once sgl-project/sglang#13115 is merged.

I also updated the PR description with all of the changes I made to the configs.

@Fridge003
Collaborator

@trevor-m Is there any accuracy or performance data?

@@ -0,0 +1,8 @@
#!/bin/bash
Owner

Can we rename this file to gb200-fp4-mtp-setup.sh?

I don't have a good way of organizing these lol, so descriptive naming is probably the best.

Collaborator Author

What about checkout-pr-13115.sh?
I also found out we need to disable the engine patch since this PR is based on main. I tried rebasing to 0.5.5 but it might have some dependencies on other PRs.

Owner

I see. I can assist with this tomorrow

@trevor-m
Collaborator Author

trevor-m commented Dec 4, 2025

@Fridge003

@trevor-m Is there any accuracy or performance data?

This is the pareto with the low latency config:
[image: 1k1k]
Let me try to check the accuracy.
The high throughput one is still not ready.

@Fridge003 Fridge003 merged commit e526265 into ishandhanani:main Dec 6, 2025
@ishandhanani ishandhanani mentioned this pull request Dec 17, 2025