[Perf] Regional torch.compile for code predictor decoder layers #4
Closed
marksverdhei wants to merge 3 commits into
Apply regionally_compile(mode="reduce-overhead") to the 5 Qwen3TTSDecoderLayer blocks inside the code predictor, reducing per-kernel launch overhead across the 31-iteration generate loop. Uses the existing diffusion/compile.py pattern. Falls back gracefully if compilation fails on a given GPU/CUDA version. Also adds a debug log for code_predictor.config.use_cache to inform Phase 1b KV cache work.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
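As a rough illustration of the idea in this commit message, the sketch below wraps each repeated layer individually while the outer generate loop stays plain Python. It is not the PR's `regionally_compile`: the helper name `compile_layers_regionally`, its `compile_fn` parameter, and the counting stand-in compiler are assumptions made here so the example runs without PyTorch or a GPU. When `compile_fn` is omitted it falls back to `torch.compile` if PyTorch can be imported.

```python
import logging
from typing import Any, Callable, MutableSequence, Optional

logger = logging.getLogger(__name__)


def compile_layers_regionally(
    layers: MutableSequence[Any],
    compile_fn: Optional[Callable[..., Any]] = None,
    **compile_kwargs: Any,
) -> int:
    """Wrap each layer in place with compile_fn; return how many were wrapped.

    Hypothetical helper for illustration only.
    """
    if compile_fn is None:
        try:
            import torch  # lazy import so the sketch loads without PyTorch
        except ImportError:
            logger.warning("PyTorch not available; leaving layers eager")
            return 0
        compile_fn = torch.compile

    wrapped = 0
    for index, layer in enumerate(layers):
        try:
            layers[index] = compile_fn(layer, **compile_kwargs)
            wrapped += 1
        except RuntimeError as exc:
            # Fallback: keep the original eager layer. torch.compile is lazy,
            # so many backend failures only surface on the first forward call,
            # not here.
            logger.warning("Could not compile layer %d (%s)", index, exc)
    return wrapped


# Demo with a stand-in "compiler" that only counts calls.
stats = {"calls": 0}


def counting_compile(layer: Callable[[int], int], **_kwargs: Any) -> Callable[[int], int]:
    def wrapped_layer(x: int) -> int:
        stats["calls"] += 1
        return layer(x)
    return wrapped_layer


demo_layers = [lambda x: x + 1 for _ in range(5)]  # 5 decoder-layer stand-ins
num_wrapped = compile_layers_regionally(
    demo_layers, compile_fn=counting_compile, mode="reduce-overhead"
)

hidden = 0
for _step in range(31):        # the outer loop is never handed to the compiler
    for layer in demo_layers:
        hidden = layer(hidden)

print(num_wrapped, stats["calls"], hidden)  # prints: 5 155 155
```

The count of 155 matches the 5 layers times 31 iterations quoted in the PR summary. With a real `nn.ModuleList`, item assignment works the same way as on the plain list used here.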
- Fix attribute chain: .model.model.code_predictor.model → .model.model.talker.code_predictor.model (was silently failing)
- Use dynamic=True instead of mode="reduce-overhead" to avoid CUDA graph shape mismatches in autoregressive generate() loop
- Narrow except: separate ImportError from RuntimeError to avoid masking real bugs like wrong attribute paths
- Add enforce_eager gate: skip compilation when enforce_eager=True, matching the diffusion model runner pattern
- Use structured mock in tests to catch attribute chain bugs
- Add test for enforce_eager gate

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
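To show why a structured mock catches the attribute chain bug described above, here is a small sketch. It assumes `types.SimpleNamespace` as the structured stand-in, which may differ from what the PR's tests actually use. The function names `resolve_fixed` and `resolve_buggy` are hypothetical; only the two attribute paths come from the commit message.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def resolve_fixed(root):
    # Corrected path from the commit message.
    return root.model.model.talker.code_predictor.model


def resolve_buggy(root):
    # Original path, missing the `talker` hop.
    return root.model.model.code_predictor.model


# A bare MagicMock auto-creates any attribute, so the buggy path "works"
# and a test built on it cannot notice the mistake.
loose = MagicMock()
buggy_on_loose = resolve_buggy(loose)  # no error raised

# A structured stand-in exposes only the attributes that really exist.
predictor_model = SimpleNamespace(name="code_predictor_model")
structured = SimpleNamespace(
    model=SimpleNamespace(
        model=SimpleNamespace(
            talker=SimpleNamespace(
                code_predictor=SimpleNamespace(model=predictor_model)
            )
        )
    )
)

fixed_result = resolve_fixed(structured)

try:
    resolve_buggy(structured)
    buggy_detected = False
except AttributeError:
    # The missing `talker` hop now fails loudly instead of silently.
    buggy_detected = True

print(fixed_result is predictor_model, buggy_detected)  # prints: True True
```

The same reasoning motivates the narrowed `except` clauses: a broad handler around the buggy path would swallow the `AttributeError` and make the failure look like an ordinary compilation fallback.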
- M1: Rate-limit debug log to fire once instead of 31x per token
- M2: Add atexit teardown to restore sys.modules after tests
- M3: Add __init__.py to test directory
- M4: Add autouse pytest fixture for mock reset between tests
- Update Test A for new profile run behavior (caps max_new_tokens=2 instead of returning dummy audio)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
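One simple way to get the fire-once behaviour from M1 is to remember which messages have already been emitted. The sketch below is an assumption about the general approach, not the PR's code: the `LogOnce` class, its `debug` method, and the logger name are all hypothetical.

```python
import logging

logger = logging.getLogger("code_predictor_sketch")  # hypothetical logger name


class LogOnce:
    """Emit a debug message at most once per key (illustrative helper)."""

    def __init__(self) -> None:
        self._seen = set()

    def debug(self, key: str, msg: str, *args: object) -> bool:
        """Log msg the first time key is seen; return True if it was logged."""
        if key in self._seen:
            return False
        self._seen.add(key)
        logger.debug(msg, *args)
        return True


log_once = LogOnce()

# Simulate the 31 iterations of the generate loop for a single token.
emitted = [
    log_once.debug("use_cache", "code_predictor.config.use_cache=%s", True)
    for _ in range(31)
]

print(emitted.count(True), emitted.count(False))  # prints: 1 30
```

Keying on a string rather than the formatted message keeps the check cheap inside a hot loop, since the message is only formatted when it is actually logged.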
Author: Superseded by ht branch (same torch.compile work included).
Summary
- Apply regionally_compile(mode="reduce-overhead") to the 5 Qwen3TTSDecoderLayer blocks inside the code predictor model, reducing per-kernel launch overhead across the 31-iteration generate() loop (155 layer forward passes per token)
- Use the existing vllm_omni/diffusion/compile.py pattern, with graceful fallback if compilation fails on a given GPU/CUDA version
- Add logger.debug for code_predictor.config.use_cache to diagnose whether HF's generate() is already using KV caching (informs Phase 1b priority)

Context
Phase 1a of the Qwen3 TTS optimization plan. The code predictor's forward() has trace-breaking ops (Python-level ModuleList indexing), so full torch.compile on the method doesn't work. Regional compilation of just the repeated decoder layers avoids those graph breaks while still reducing kernel launch overhead.

Test plan
- pytest tests/model_executor/models/qwen3_tts/ -v: 8 tests pass (3 new compilation tests)
- Check the use_cache value in the debug log to inform Phase 1b

🤖 Generated with Claude Code