Graph parallel for dense Qwen-3.5 models#1331
Merged
Thanks to PR #1329, it is now easy to add graph parallel (a.k.a. split mode graph) support for Qwen-3.5 (just the dense models for now). As with graph parallel for Qwen3-Next (#1292), the recurrent attention layers are not split between GPUs. Nevertheless (and unlike Qwen3-Next), we do see a small performance gain compared to split mode layer even at zero context.

Here are some `sweep-bench` results for Qwen-3.5-27B quantized with `Q4_K_S` on a 2x3090 system. We see about 10% better PP at zero context, and 25% at a context of 64k tokens. TG is ~4% better at zero context, and ~12% better at a context of 64k.
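For anyone wanting to reproduce this kind of comparison, a sketch of the two `sweep-bench` runs is below. The flag names (`-sm` for split mode, `-ngl` for offloaded layers) follow mainline llama.cpp conventions and the model filename is a placeholder; check `./llama-sweep-bench --help` for the options actually exposed by this build.

```shell
# Baseline: split mode "layer" across both GPUs (hypothetical model path)
./llama-sweep-bench -m qwen3.5-27b-q4_k_s.gguf -c 65536 -ngl 99 -sm layer

# This PR: split mode "graph"; compare PP/TG at matching context depths
./llama-sweep-bench -m qwen3.5-27b-q4_k_s.gguf -c 65536 -ngl 99 -sm graph
```

Running both to a context of 64k and comparing the PP and TG columns at the same depth should show the gaps quoted above.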