[Perf] Regional torch.compile for code predictor decoder layers by marksverdhei · Pull Request #4 · heiervang-technologies/ht-vllm-omni

marksverdhei · 2026-01-29T12:17:59Z

Summary

- Apply regionally_compile(mode="reduce-overhead") to the 5 Qwen3TTSDecoderLayer blocks inside the code predictor model, reducing per-kernel launch overhead across the 31-iteration generate() loop (155 layer forward passes per token)
- Use the existing vllm_omni/diffusion/compile.py pattern, with graceful fallback if compilation fails on a given GPU/CUDA version
- Add a logger.debug for code_predictor.config.use_cache to diagnose whether HF's generate() is already using KV caching (informs Phase 1b priority)

Context

Phase 1a of the Qwen3 TTS optimization plan. The code predictor's forward() has trace-breaking ops (Python-level ModuleList indexing), so full torch.compile on the method doesn't work. Regional compilation of just the repeated decoder layers avoids those graph breaks while still eliminating kernel launch overhead.
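A minimal sketch of the regional approach described above, not the code from this PR. It calls torch.compile directly as a stand-in for regionally_compile, whose signature is not shown here. The helper names (layer_calls_per_token, compile_regionally, compile_with_torch) are hypothetical and exist only for illustration.

```python
from typing import Callable, MutableSequence, TypeVar

T = TypeVar("T")

# Figures quoted in the summary: 31 generate() iterations, 5 decoder layers.
GENERATE_STEPS_PER_TOKEN = 31
DECODER_LAYERS = 5


def layer_calls_per_token(steps: int = GENERATE_STEPS_PER_TOKEN,
                          layers: int = DECODER_LAYERS) -> int:
    """Number of decoder-layer forward passes needed for one token."""
    return steps * layers


def compile_regionally(layers: MutableSequence[T],
                       compile_fn: Callable[[T], T]) -> int:
    """Replace each layer in place with compile_fn(layer).

    Only the repeated blocks are wrapped. The enclosing forward(), with its
    Python-level indexing, keeps running eagerly and is never traced.
    Returns the number of layers that were wrapped.
    """
    for index in range(len(layers)):
        layers[index] = compile_fn(layers[index])
    return len(layers)


def compile_with_torch(layers, **compile_kwargs) -> int:
    """Wrap every layer with torch.compile (assumes torch is installed)."""
    import torch  # imported lazily so the helpers above work without torch

    return compile_regionally(
        layers, lambda layer: torch.compile(layer, **compile_kwargs)
    )
```

Under these assumptions, compile_with_torch(layers, mode="reduce-overhead") would correspond to the option named in the summary, and layer_calls_per_token() gives the 31 × 5 = 155 figure quoted there.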

Test plan

- pytest tests/model_executor/models/qwen3_tts/ -v: 8 tests pass (3 new compilation tests)
- Docker build + server start: check logs for "Code predictor decoder layers compiled"
- Run an inference request and compare latency to baseline (~0.91x realtime)
- Check the diagnostic log for the use_cache value to inform Phase 1b

🤖 Generated with Claude Code

Apply regionally_compile(mode="reduce-overhead") to the 5 Qwen3TTSDecoderLayer blocks inside the code predictor, reducing per-kernel launch overhead across the 31-iteration generate loop. Uses the existing diffusion/compile.py pattern. Falls back gracefully if compilation fails on a given GPU/CUDA version. Also adds a debug log for code_predictor.config.use_cache to inform Phase 1b KV cache work.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Fix attribute chain: .model.model.code_predictor.model → .model.model.talker.code_predictor.model (was silently failing)
- Use dynamic=True instead of mode="reduce-overhead" to avoid CUDA graph shape mismatches in autoregressive generate() loop
- Narrow except: separate ImportError from RuntimeError to avoid masking real bugs like wrong attribute paths
- Add enforce_eager gate: skip compilation when enforce_eager=True, matching the diffusion model runner pattern
- Use structured mock in tests to catch attribute chain bugs
- Add test for enforce_eager gate

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
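The gate, the narrowed exception handling, and the structured mock from this commit could look roughly like the sketch below. It is an illustration under assumptions, not the committed code. The function names, the logger name, and the trailing .layers attribute are hypothetical. Only the attribute chain up to code_predictor.model, the dynamic=True flag, and the "Code predictor decoder layers compiled" log line come from this PR.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("code_predictor_compile_sketch")  # hypothetical name


def maybe_compile_code_predictor(runner, enforce_eager: bool) -> bool:
    """Compile the code predictor's decoder layers; return True on success."""
    if enforce_eager:
        # Gate: eager mode was requested, so leave every layer untouched.
        logger.info("enforce_eager=True; skipping code predictor compilation")
        return False

    # Resolved outside any try block: a wrong attribute path raises
    # AttributeError and cannot be swallowed by the handlers below.
    predictor = runner.model.model.talker.code_predictor.model
    layers = predictor.layers  # hypothetical attribute name

    try:
        import torch
    except ImportError:
        logger.warning("torch is not importable; running eagerly")
        return False

    try:
        # dynamic=True asks for shape-generic graphs instead of specializing
        # on the first shapes seen. torch.compile is lazy, so this handler
        # only covers errors raised while wrapping, not at first call.
        compiled = [torch.compile(layer, dynamic=True) for layer in layers]
    except RuntimeError as exc:
        logger.warning("torch.compile unavailable (%s); running eagerly", exc)
        return False

    # Swap layers only after every wrap succeeded, so a failure never
    # leaves the model half compiled.
    for index, layer in enumerate(compiled):
        layers[index] = layer
    logger.info("Code predictor decoder layers compiled")
    return True


def make_structured_runner(layers, with_talker: bool = True):
    """Structured mock: only the attributes listed here exist."""
    predictor = SimpleNamespace(model=SimpleNamespace(layers=layers))
    inner = SimpleNamespace(code_predictor=predictor)
    if with_talker:
        inner = SimpleNamespace(talker=inner)
    return SimpleNamespace(model=SimpleNamespace(model=inner))
```

Unlike an auto-attribute mock such as unittest.mock.MagicMock, a SimpleNamespace chain raises AttributeError on any path it was not built with, which is what lets a test catch the old .model.model.code_predictor.model chain.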

- M1: Rate-limit debug log to fire once instead of 31x per token
- M2: Add atexit teardown to restore sys.modules after tests
- M3: Add __init__.py to test directory
- M4: Add autouse pytest fixture for mock reset between tests
- Update Test A for new profile run behavior (caps max_new_tokens=2 instead of returning dummy audio)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
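One way the M1 rate limit could be written is a module-level flag, sketched below. This is an assumption about the shape of the fix, not the committed code. The function name, the flag, and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("code_predictor_debug_sketch")  # hypothetical name

# Module-level flag: flipped on the first call, never reset.
_use_cache_logged = False


def log_use_cache_once(use_cache) -> bool:
    """Emit the use_cache diagnostic on the first call only.

    Returns True if this call attempted to log, False if it was suppressed.
    """
    global _use_cache_logged
    if _use_cache_logged:
        return False
    _use_cache_logged = True
    logger.debug("code_predictor.config.use_cache=%s", use_cache)
    return True
```

Called from inside the 31-iteration loop, this would emit the diagnostic on the first iteration and stay silent afterwards for the lifetime of the process.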

marksverdhei · 2026-01-29T17:59:56Z

Superseded by ht branch (same torch.compile work included).

marksverdhei and others added 3 commits January 29, 2026 13:17

marksverdhei changed the base branch from q3-tts to ht January 29, 2026 16:28

marksverdhei closed this Jan 29, 2026

marksverdhei deleted the feat/code-predictor-compile branch February 12, 2026 21:25

marksverdhei mentioned this pull request Feb 23, 2026

feat: reconcile HT TTS features onto upstream main #17 (Merged, 11 tasks)

marksverdhei wants to merge 3 commits into ht from feat/code-predictor-compile
