[BugFix] Fix diffusion parallel_config YAML override and add deploy config field allowlist#3483
Conversation
|
A bit more context on why this fix is structured this way:
The core issue is that these override fields were still being carried as flat top-level engine args, while diffusion stages actually consume them from Could you take another look? Thanks! @lishunyang12 @hsliuustc0106 |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0c1573f9e0
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| continue | ||
| if parallel_config_dict is None: | ||
| parallel_config_dict = {} | ||
| parallel_config_dict[key] = runtime_overrides.pop(key) |
There was a problem hiding this comment.
Recompute sequence size when moving degree overrides
When a diffusion stage already has a YAML parallel_config with sequence_parallel_size: 1 (for example hunyuan_image3_moe.yaml) and the user only overrides --ulysses-degree or --ring-degree, this writes the new degree into the nested config but leaves the stale sequence_parallel_size. The resulting DiffusionParallelConfig then fails its sequence_parallel_size == ulysses_degree * ring_degree validation unless users also know to pass --sequence-parallel-size; the flat normalization path computes that default automatically, so the nested override path should clear or recompute it when the degrees change.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
I think if users explicitly override --ulysses-degree or --ring-degree, they should also make sure the related fields are set correctly.
There was a problem hiding this comment.
Fair, but the flat path in _create_default_diffusion_stage_cfg auto-derives sequence_parallel_size = ulysses_degree * ring_degree when it's unset, so today overriding only --ulysses-degree/--ring-degree works without touching SP. With this change that silently breaks for existing configs like hunyuan_image3_moe.yaml (YAML pins sequence_parallel_size: 1, the override bumps the degree, and validation then fails). Can we mirror that derivation here when a degree is overridden and SP wasn't explicitly set, so the two paths stay consistent?
There was a problem hiding this comment.
Fixed in d06c7f0:
- if
ulysses_degreeorring_degreeis overridden andsequence_parallel_sizeis not explicitly set, we recomputesequence_parallel_size = ulysses_degree * ring_degreein the nestedparallel_configoverride path. - Also added regression tests for both the recompute case and the explicit-override case.
0c1573f to
d05f261
Compare
|
this PR looks better than the previous one :) |
Haha sorry, ran into some local branch issues on the previous one |
|
Added two follow-up fixes in 3a5d176 and 9fc0ba3:
|
|
LGTM. This is a solid bugfix with comprehensive test coverage. The fix correctly normalizes diffusion parallel overrides into nested while preserving LLM stage behavior. The test cases are thorough and verify both scenarios. |
Co-authored-by: zzhuoxin1508 <zzhuoxin1508@users.noreply.github.com> Signed-off-by: xiaohajiayou <923390377@qq.com>
Signed-off-by: xiaohajiayou <923390377@qq.com>
Signed-off-by: xiaohajiayou <923390377@qq.com>
Signed-off-by: xiaohajiayou <923390377@qq.com>
Signed-off-by: xiaohajiayou <923390377@qq.com>
Signed-off-by: xiaohajiayou <923390377@qq.com>
b324593 to
583af4c
Compare
Signed-off-by: xiaohajiayou <923390377@qq.com>
583af4c to
d06c7f0
Compare
|
Updated this PR since the earlier review. Main changes are:
Could you take another look when you have time? A review here would help unblock the follow-up cleanup on the same config path. |
|
@alex-jw-brooks @wuhang2014 PTAL for the final check |
|
Considering the need for #3819 and since the CI has already passed, could you please take a look and see if this PR can be merged when you have time? |
|
Pulled this onto a local multi-GPU box and gave it a proper run-through, both the static config path and a real generation. On the config side I went through all 13 Then ran Wan2.2-TI2V-5B for real on 4 GPUs with
Generation finished without issues, so the override path does what it says. LGTM from me. One unrelated thing I ran into: with |
…onfig field allowlist (vllm-project#3483) Signed-off-by: xiaohajiayou <923390377@qq.com> Co-authored-by: zzhuoxin1508 <zzhuoxin1508@users.noreply.github.com> Signed-off-by: WeiQing Chen <david6666666@users.noreply.github.com>
…onfig field allowlist (vllm-project#3483) Signed-off-by: xiaohajiayou <923390377@qq.com> Co-authored-by: zzhuoxin1508 <zzhuoxin1508@users.noreply.github.com>
Purpose
Diffusion stages in omni multi-stage configs consume parallel settings from
engine_args.parallel_config, but CLI overrides were being applied as flat top-levelengine_argsfields.This could produce a resolved stage config where:
parallel_configstill kept the old valuesThis change fixes that at stage-config materialization time:
DiffusionParallelConfigare normalized intoengine_args.parallel_configTest Plan
parallel_config, verifying that CLI overrides replace the nested valuesparallel_config, verifying that the nested config is created from CLI overridesparallel_configTest Result