[Cache Refactor 1/N] Simplify CacheDiT Integration #2527
alex-jw-brooks wants to merge 15 commits into vllm-project:main from
Conversation
It's better to create an RFC to explain the design.
lishunyang12 left a comment
Review: [Cache Refactor 1/N] Simplify CacheDiT Integration
Overall direction is great -- collapsing ~10 near-identical enabler functions into a declarative `_block_fwd_patterns` class attribute plus generic `enable_cache_for_dit` / `build_cache_context_refresh` is a big win for maintainability. Net -511 lines with cleaner abstractions. A few issues need addressing before merging:
Behavioral regressions
1. Flux2-Klein lost its `Fn_compute_blocks = 2` override (regression)

   The old `enable_cache_for_flux2_klein` explicitly set `db_cache_config.Fn_compute_blocks = 2` after building the config, with a comment about quality degradation at `Fn=1`. The new generic path has no mechanism for this model-specific override, so it will silently change Flux2-Klein caching behavior. Consider either:
   - keeping Flux2-Klein in `CUSTOM_DIT_ENABLERS`, or
   - extending `_block_fwd_patterns` (or adding a sibling attribute like `_cache_config_overrides`) to let models declare config tweaks.

2. `check_forward_pattern=False` lost for LTX2 and HunyuanImage3

   Both old enablers passed `check_forward_pattern=False` to `BlockAdapter`. The new `maybe_build_block_adapter` never passes this flag. The TODO comments in the model files acknowledge the uncertainty but don't resolve it. If the validation fails at runtime for these models, cache enablement will break. Please either:
   - add `check_forward_pattern` to the `_block_fwd_patterns` declaration (e.g., make it a richer descriptor), or
   - pass `check_forward_pattern=False` in `maybe_build_block_adapter` for now and file a follow-up issue.

3. HunyuanImage3 pipeline accessor changed from `pipeline.model` to `pipeline.transformer`

   The old `enable_cache_for_hunyuan_image3` accessed `pipeline.model` (with `pipeline.model.layers`). The new generic path uses `default_get_pipeline_transformer(pipeline)`, which returns `pipeline.transformer`. `_block_fwd_patterns` is defined on `HunyuanImage3Model`, but if the pipeline stores the model at `.model` rather than `.transformer`, this will fail. Please verify that the pipeline attribute name is correct.
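To make the suggestions in items 1 and 2 concrete, a richer `_block_fwd_patterns` descriptor could look roughly like this. Only the `_block_fwd_patterns` name comes from this PR; the descriptor fields, the `Flux2KleinTransformer` class, and the `apply_pattern_overrides` helper are hypothetical sketches, not the actual API:

```python
from dataclasses import dataclass, field


# Hypothetical descriptor; field names are illustrative assumptions.
@dataclass(frozen=True)
class BlockFwdPattern:
    blocks_attr: str                     # e.g. "transformer_blocks"
    check_forward_pattern: bool = True   # lets LTX2 / HunyuanImage3 opt out
    config_overrides: dict = field(default_factory=dict)


class Flux2KleinTransformer:
    # Keeps the Fn=2 quality override declarative instead of burying it
    # in a custom enabler function.
    _block_fwd_patterns = BlockFwdPattern(
        blocks_attr="transformer_blocks",
        config_overrides={"Fn_compute_blocks": 2},
    )


def apply_pattern_overrides(model_cls, db_cache_config):
    """Generic enabler step: apply any declared per-model config tweaks."""
    pattern = model_cls._block_fwd_patterns
    for key, value in pattern.config_overrides.items():
        setattr(db_cache_config, key, value)
    return pattern.check_forward_pattern
```

The generic enabler would then read the overrides and the `check_forward_pattern` flag from the declaration instead of hard-coding either per model.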
Minor issues
4. Double `_build_db_cache_config` call in the `enable_cache_for_wan22` single-transformer path

   `db_cache_config` is built once at the top of the function and again inside the `if getattr(pipeline, "transformer_2", None) is None:` branch. The second call shadows the first unnecessarily.

5. Naming convention for the `refresh_cache_context_func` TypeAlias

   PEP 8 and common practice use PascalCase for type aliases (e.g., `RefreshCacheContextFunc`). The lowercase `refresh_cache_context_func` reads like a variable name rather than a type. Minor, but worth aligning with conventions.

6. Flux2-Klein SCM refresh also lost the `Fn_compute_blocks` override

   The old refresh function for Flux2-Klein passed `Fn_compute_blocks=db_cache_config.Fn_compute_blocks` in the `DBCacheConfig().reset(...)` call during SCM-enabled refreshes. The generic `build_cache_context_refresh` does not do this, so SCM refreshes will also regress.
Looks good
- The `build_cache_context_refresh` factory is clean and well-structured. The `_resolve_calibrator_config` extraction is a nice touch.
- Keeping Wan22 and Bagel as custom enablers makes sense given their multi-transformer complexity.
- Removing the broken GLM Image enabler is the right call.
- The declarative `_block_fwd_patterns` approach on models is a good foundation for the rest of the refactor series.
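For readers following along, the closure-factory shape in question is roughly the following. Only the `build_cache_context_refresh` name, the `reset(...)` call, and the `Fn_compute_blocks` kwarg come from this review; the signature and body are an assumed sketch:

```python
def build_cache_context_refresh(db_cache_config, extra_reset_kwargs=None):
    """Return a closure that resets the cache context between generations."""
    extra_reset_kwargs = extra_reset_kwargs or {}

    def refresh(cache_context):
        # Per-model overrides (e.g. Fn_compute_blocks for Flux2-Klein)
        # are threaded through extra_reset_kwargs, so SCM refreshes
        # do not silently drop them.
        cache_context.reset(**extra_reset_kwargs)

    return refresh
```

Threading the per-model kwargs through the factory once, at build time, is what lets the ~10 copy-pasted refresh functions collapse into a single closure.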
Please address items 1-3 (the behavioral regressions) before merge. Items 4-6 are nice-to-fix.
Signed-off-by: Alex Brooks <albrooks@redhat.com>
I think it is a great design, but a lot of validation work is needed, both unit tests and end-to-end. I don't think this PR is in a ready state yet. Please show your test plan in more detail, and let's put it after the 0.20.0 cut.
Purpose
See RFC #2535
Currently, integration with Cache DiT involves copying a lot of code, but the underlying library that implements Cache DiT is very generic. We should not have to write much code to support new models in vLLM Omni unless they need to handle special cases, e.g., Wan2's multi-transformer architecture.
This PR is one of a few intended to simplify transformer caching in vLLM Omni, so that we can hopefully have more generic integrations and wider model support essentially for free, or at least make progress towards supporting a new model being as simple as adding a new block adapter object.
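As a rough illustration of the target shape: the names `_block_fwd_patterns`, `BlockAdapter`, `maybe_build_block_adapter`, and `enable_cache_for_dit` mirror this PR, but the bodies below are simplified assumptions, not the real implementation:

```python
class BlockAdapter:
    """Pairs a transformer with the blocks that cache-dit should wrap."""
    def __init__(self, transformer, blocks):
        self.transformer = transformer
        self.blocks = blocks


def maybe_build_block_adapter(pipeline):
    """Build a BlockAdapter from the model's declared forward pattern."""
    transformer = getattr(pipeline, "transformer", None)
    pattern = getattr(type(transformer), "_block_fwd_patterns", None)
    if pattern is None:
        return None  # model has not opted into generic caching
    # In this sketch the pattern is just the name of the blocks attribute.
    return BlockAdapter(transformer, getattr(transformer, pattern))


def enable_cache_for_dit(pipeline):
    """One generic enabler instead of ~10 near-identical per-model ones."""
    adapter = maybe_build_block_adapter(pipeline)
    if adapter is None:
        raise ValueError("model does not declare _block_fwd_patterns")
    # ... hand the adapter to cache-dit and install the refresh closure ...
    return adapter
```

Under this shape, supporting a new model means declaring `_block_fwd_patterns` on its transformer class rather than writing another enabler function.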
For this PR:
- Adds `build_cache_context_refresh`, which we can use to build the cache refresh closure for different models; currently we have ~10 identical copy-pastes of this function.
- Adds `enable_cache_for_dit`, which takes as input a `BlockAdapter`, since this is the main thing that changes across models.

Test Plan
We should be able to run the DiT cache on all models on main and on this branch and verify that we get the same output for each. However, some of these models are currently broken, so testing is still in progress.
CC @lishunyang12 @wtomin, could you please take a look when you get the chance?