[Feature]: support Flux.2-dev CFG-Parallel#2010
[Feature]: support Flux.2-dev CFG-Parallel#2010hsliuustc0106 merged 4 commits intovllm-project:mainfrom
Conversation
|
does it apply to all flux.2 family models? what's the recommended parallel strategy if we have 2/4 devices? |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 395e644c12
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
Does the CFG-parallel speed problem still persist, as you mentioned in the wechat group? Can you show me the speed comparison of cfg-sequential plan before and after this PR? @nuclearwu |
lishunyang12
left a comment
There was a problem hiding this comment.
Left a few comments. Main concern is that do_true_cfg activates unconditionally with the default guidance_scale=4.0, which means every request now pays for 2x transformer forward passes even when the user doesn't intend to use CFG.
hsliuustc0106
left a comment
There was a problem hiding this comment.
Review Summary
BLOCKER scan:
| Category | Result |
|---|---|
| Correctness | PASS |
| Reliability/Safety | PASS |
| Breaking Changes | PASS |
| Test Coverage | PASS - Comprehensive test results in PR body with generated images |
| Documentation | PASS - Tables updated for CFG parallel |
| Security | PASS |
OVERALL: 1 ISSUE (merge conflicts, need to be resolved by author)
VERDICT: COMMENT
Issues
- [Gate Failure] Merge conflicts - PR cannot be merged. Please rebase on main and resolve conflicts.
Non-blocking observations
- The PR provides comprehensive testing evidence with generated images across multiple configurations (eager mode, CFG parallel combinations)
- Code follows the established patterns for CFGParallelMixin (similar to other models like FLUX.1-dev)
- Documentation tables correctly updated to reflect CFG parallel support
- No MRO issues: CFGParallelMixin uses met metaclass=ABCMeta without init
@lishunyang12 Thank you for your review. I have made the necessary revisions based on the feedback provided. |
@wtomin You can compare this to #1629 without CFG-Parallel. The generation time of the same prompt is inconsistent, I considered that it might be due to the negative prompt for this. |
@hsliuustc0106 Currently, the machine resources are insufficient. Only single machine with multiple cards can be used for testing. |
@wtomin I pulled the main branch and test again, the performance of cfg-parallel has declined.Specifically as shown in the above table. |
|
I tried to test the performance before and after #2063 PR got merged, with
This table indicates that, at least for Qwen-Image, #2063 reduces the e2e latency by a small margin, instead of slowing it down. This is in line with the claims in #2063. Regarding the problems you identified here, I have some suggestions:
Also cc @TKONIY for some suggestions. |
I will check it this week. Looks like some compatibility problem between #2063 and |
@lishunyang12 Done, please review again. |
@wtomin Sorry, I try again. #2063 reduces the e2e latency by a small margin and update results to the above table. |
|
Missing e2e test for CFG parallelism. Please add a test that covers Feature support table not updated for Flux.2-dev CFG parallel in Can you also report the peak VRAM usage in your PR body? |
@wtomin Done, the peak VRAM usage as shown in the above tables, PTAL. |
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
Signed-off-by: wuzhongjian wuzhongjian_yewu@cmss.chinamobile.com
PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
support Flux.2-dev CFG-Parallel
Test Plan
Reference #851 image generation:
The bash script to run all t2i tasks
Memory Profile:
Test Result
Reproduced with 4xA800.
Text-To-Image:
Memory Profiling (FLUX.2-dev, 1024x1024, 50 steps):
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model. Please runmkdocs serveto sync the documentation editions to./docs.BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)