[Perf] Regional torch.compile for code predictor decoder layers#4

Closed
marksverdhei wants to merge 3 commits into ht from feat/code-predictor-compile

@marksverdhei

Summary

  • Apply regionally_compile(mode="reduce-overhead") to the 5 Qwen3TTSDecoderLayer blocks inside the code predictor model, reducing per-kernel launch overhead across the 31-iteration generate() loop (5 layers × 31 iterations = 155 layer forward passes per token)
  • Uses the existing vllm_omni/diffusion/compile.py pattern with graceful fallback if compilation fails on a given GPU/CUDA version
  • Adds a logger.debug for code_predictor.config.use_cache to diagnose whether HF's generate() is already using KV caching (informs Phase 1b priority)
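
To make the regional approach concrete, here is a minimal sketch of compiling only the repeated layers while the surrounding Python loop stays eager. It is not the code from this PR: `compile_layers_in_place`, `CountingLayer`, `identity_compile` and `run_generate_loop` are hypothetical names, and the real `regionally_compile` helper's signature is not shown here. A no-op stand-in replaces the compiler so the sketch runs without PyTorch.

```python
from typing import Any, Callable, List


def compile_layers_in_place(
    layers: Any,
    compile_fn: Callable[..., Any],
    **compile_kwargs: Any,
) -> int:
    """Swap every layer for its compiled wrapper; return how many were swapped.

    Hypothetical stand-in for the repo's regionally_compile helper.
    `layers` is any indexable, assignable container of layers.
    """
    for i in range(len(layers)):
        layers[i] = compile_fn(layers[i], **compile_kwargs)
    return len(layers)


class CountingLayer:
    """Stand-in for one decoder layer; counts its forward calls."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, x: float) -> float:
        self.calls += 1
        return x + 1.0


def identity_compile(layer: Any, **_: Any) -> Any:
    """No-op stand-in for a real compiler so the sketch runs without PyTorch."""
    return layer


def run_generate_loop(layers: List[Any], iterations: int = 31) -> float:
    """The outer Python-level loop is left uncompiled; only the layers are wrapped."""
    x = 0.0
    for _ in range(iterations):
        for layer in layers:
            x = layer(x)
    return x


layers = [CountingLayer() for _ in range(5)]
num_compiled = compile_layers_in_place(layers, identity_compile, dynamic=True)
result = run_generate_loop(layers)
total_calls = sum(layer.calls for layer in layers)
print(num_compiled, total_calls)  # 5 layers, 5 * 31 = 155 layer forward calls
```

With PyTorch installed, `torch.compile` could be passed as `compile_fn`. Note that `torch.compile` compiles lazily on the first call, so a try/except around the wrapping step alone would not catch errors raised during the first forward pass.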

Context

Phase 1a of the Qwen3 TTS optimization plan. The code predictor's forward() has trace-breaking ops (Python-level ModuleList indexing), so full torch.compile on the method doesn't work. Regional compilation of just the repeated decoder layers avoids those graph breaks while still eliminating kernel launch overhead.

Test plan

  • pytest tests/model_executor/models/qwen3_tts/ -v: 8 tests pass (3 of them new compilation tests)
  • Docker build + server start — check logs for "Code predictor decoder layers compiled"
  • Run inference request and compare latency to baseline (~0.91x realtime)
  • Check diagnostic log for use_cache value to inform Phase 1b

🤖 Generated with Claude Code

marksverdhei and others added 3 commits January 29, 2026 13:17
Apply regionally_compile(mode="reduce-overhead") to the 5
Qwen3TTSDecoderLayer blocks inside the code predictor, reducing
per-kernel launch overhead across the 31-iteration generate loop.
Uses the existing diffusion/compile.py pattern. Falls back gracefully
if compilation fails on a given GPU/CUDA version.

Also adds a debug log for code_predictor.config.use_cache to inform
Phase 1b KV cache work.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix attribute chain: .model.model.code_predictor.model →
  .model.model.talker.code_predictor.model (was silently failing)
- Use dynamic=True instead of mode="reduce-overhead" to avoid
  CUDA graph shape mismatches in autoregressive generate() loop
- Narrow except: separate ImportError from RuntimeError to avoid
  masking real bugs like wrong attribute paths
- Add enforce_eager gate: skip compilation when enforce_eager=True,
  matching the diffusion model runner pattern
- Use structured mock in tests to catch attribute chain bugs
- Add test for enforce_eager gate

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
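
The fixes in this commit can be sketched as one guarded setup function: an `enforce_eager` gate, attribute lookup left outside any `try` so a wrong path fails loudly, and separate handling of `ImportError` and `RuntimeError`. This is an illustration under stated assumptions, not the PR's code: the function names, the `runner` starting object, the `.layers` attribute and the `load_compile_fn` callback are all hypothetical; only the dotted attribute chain comes from the commit message.

```python
import logging
from operator import attrgetter
from types import SimpleNamespace
from typing import Any, Callable

logger = logging.getLogger("code_predictor_compile_sketch")

# Attribute chain quoted in the commit message. The object it starts from
# and the `.layers` attribute at its end are assumptions for illustration.
LAYER_OWNER_PATH = "model.model.talker.code_predictor.model"


def maybe_compile_decoder_layers(
    runner: Any,
    enforce_eager: bool,
    load_compile_fn: Callable[[], Callable[..., Any]],
) -> bool:
    """Return True only if every decoder layer was wrapped."""
    if enforce_eager:
        logger.info("enforce_eager=True, skipping compilation")
        return False

    # Deliberately outside any try block: a wrong path raises AttributeError
    # instead of being swallowed as a "compilation failure".
    owner = attrgetter(LAYER_OWNER_PATH)(runner)

    try:
        compile_fn = load_compile_fn()
    except ImportError as exc:
        logger.warning("compile helper unavailable, staying eager: %s", exc)
        return False

    try:
        # Build the full list first so a failure leaves the model untouched.
        wrapped = [compile_fn(layer, dynamic=True) for layer in owner.layers]
    except RuntimeError as exc:
        logger.warning("compilation failed, staying eager: %s", exc)
        return False

    for i, layer in enumerate(wrapped):
        owner.layers[i] = layer
    return True


def build_fake_runner(num_layers: int = 5) -> Any:
    """Structured stand-in that mirrors the attribute chain exactly."""
    inner = SimpleNamespace(layers=[f"layer{i}" for i in range(num_layers)])
    code_predictor = SimpleNamespace(model=inner)
    talker = SimpleNamespace(code_predictor=code_predictor)
    return SimpleNamespace(model=SimpleNamespace(model=SimpleNamespace(talker=talker)))


def fake_compile(layer: Any, **kwargs: Any) -> Any:
    """Stand-in compiler that tags the layer it was given."""
    return ("compiled", layer)
```

A structured stand-in such as `build_fake_runner` is what lets a test catch the earlier bug: dropping `talker` from the chain raises `AttributeError` instead of silently skipping compilation.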
- M1: Rate-limit debug log to fire once instead of 31x per token
- M2: Add atexit teardown to restore sys.modules after tests
- M3: Add __init__.py to test directory
- M4: Add autouse pytest fixture for mock reset between tests
- Update Test A for new profile run behavior (caps max_new_tokens=2
  instead of returning dummy audio)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
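
As an illustration of M1, one simple way to make a debug message fire once rather than on each of the 31 loop iterations is to remember which keys have already been logged. This is a sketch only; `OnceLogger`, `debug_once` and `ListHandler` are hypothetical names, and the PR may implement the rate limit differently.

```python
import logging
from typing import Any, Hashable, List, Set


class OnceLogger:
    """Wraps a logger and emits each keyed debug message at most once."""

    def __init__(self, logger: logging.Logger) -> None:
        self._logger = logger
        self._seen: Set[Hashable] = set()

    def debug_once(self, key: Hashable, msg: str, *args: Any) -> bool:
        """Log msg the first time key is seen; return whether it was logged."""
        if key in self._seen:
            return False
        self._seen.add(key)
        self._logger.debug(msg, *args)
        return True


class ListHandler(logging.Handler):
    """Collects formatted messages in memory so the demo can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


base = logging.getLogger("log_once_sketch")
base.setLevel(logging.DEBUG)
base.propagate = False
handler = ListHandler()
base.addHandler(handler)

once = OnceLogger(base)
for _ in range(31):  # one call per generate() iteration
    once.debug_once("use_cache", "code_predictor.config.use_cache=%s", True)

print(len(handler.messages))  # a single record despite 31 calls
```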
@marksverdhei marksverdhei changed the base branch from q3-tts to ht January 29, 2026 16:28
@marksverdhei

Author

Superseded by the ht branch, which already includes the same torch.compile work.

@marksverdhei marksverdhei deleted the feat/code-predictor-compile branch February 12, 2026 21:25