plugin: hybrid Gemma4 wiring — block_table padding + model registration by handrewsTT · Pull Request #390 · tenstorrent/vllm

handrewsTT · 2026-05-13T13:08:30Z

Summary

Two changes are required for vLLM to serve Gemma4 through the hybrid kv-cache-groups path. This PR is the companion to tt-metal PR #44265 (the bridge itself); the two go in together.

1. `model_runner.py` — pad `block_tables_per_group[i]` to `max_num_blocks_per_req`

vLLM's per-block-byte unifier (`unify_kv_cache_spec_page_size`) produces per-group block_table native widths of `cdiv(max_model_len, group_block_size)`. For Gemma4-E2B with `cache_config.block_size=64`, the unifier doubles the sliding group's block_size to 128 so that its per-block bytes match the full group's. The sliding group therefore ends up with width `cdiv(131072, 128) = 1024` and the full group with width `cdiv(131072, 64) = 2048`.
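The width arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: `cdiv` is written out locally as plain ceiling division, and the variable names are chosen here for readability rather than taken from the plugin code.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division: smallest integer >= a / b."""
    return -(-a // b)


max_model_len = 131072
base_block_size = 64                       # cache_config.block_size
sliding_block_size = base_block_size * 2   # doubled by the unifier, per the description

full_width = cdiv(max_model_len, base_block_size)
sliding_width = cdiv(max_model_len, sliding_block_size)

print(full_width, sliding_width)  # 2048 1024
```

The two groups cover the same `max_model_len`, so doubling the block size halves the number of block_table columns, which is where the width mismatch comes from.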

The legacy single-tensor view (`block_tables_per_group[0]`) was sliced to `:max_num_blocks_per_req` but never padded. On single-card runs where the sliding group ends up as group 0, the runtime tensor was narrower than the page_table that `warmup_model_decode` captured the trace against, and `copy_host_to_device` then tripped its shape assertion on the second decode trace replay. The fix pads each group's block_table with zeros up to `max_num_blocks_per_req` width. Zeros are safe because the kernel only reads up to each layer's active block count.

The per-layer expansion in _block_tables_per_layer already padded; the legacy single-tensor view was the gap.
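A minimal sketch of the slice-then-pad behaviour described in this section, using plain Python lists in place of tensors. The helper name `pad_block_table` and its signature are hypothetical and are not taken from `model_runner.py`; the sketch only illustrates the intended shape contract.

```python
def pad_block_table(rows, max_num_blocks_per_req, pad_value=0):
    """Return rows with exactly max_num_blocks_per_req columns each.

    Wider rows are sliced (the pre-existing behaviour); narrower rows are
    right-padded with pad_value (the behaviour this PR adds).
    """
    padded = []
    for row in rows:
        row = list(row[:max_num_blocks_per_req])
        row.extend([pad_value] * (max_num_blocks_per_req - len(row)))
        padded.append(row)
    return padded


# A group whose native width (3) is narrower than the captured width (6).
narrow_group = [[7, 3, 9]]
fixed = pad_block_table(narrow_group, max_num_blocks_per_req=6)
print(fixed)  # [[7, 3, 9, 0, 0, 0]]
```

When a group's native width already equals `max_num_blocks_per_req`, the helper returns the rows unchanged, which matches the no-op claim in the test plan.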

2. `platform.py` — register Gemma4 arch names

Gemma4ForCausalLM and Gemma4ForConditionalGeneration are added to the TT model registry so vLLM dispatches them to the bridge in models/demos/gemma4/tt/generator_vllm.py. Other TT models (Gemma3, GptOss, ...) get away without explicit registration because upstream vLLM has a torch impl that the platform layer finds first via the inferred registry path; Gemma4 has no upstream vLLM implementation yet, so the inspection has to land on the TT class directly.
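As a rough illustration of what registering an architecture name means, the sketch below maps the two Gemma4 arch names to the bridge module. Everything except the two arch names and the module path is an assumption: `TT_MODEL_REGISTRY`, `register_tt_model`, and `resolve_tt_model` are hypothetical stand-ins, not the actual `platform.py` API.

```python
# Hypothetical registry: architecture name -> dotted module path of the TT bridge.
TT_MODEL_REGISTRY: dict = {}

# Module path derived from models/demos/gemma4/tt/generator_vllm.py in the description.
BRIDGE_MODULE = "models.demos.gemma4.tt.generator_vllm"


def register_tt_model(arch: str, module_path: str) -> None:
    """Add an entry without overwriting an existing registration."""
    if arch in TT_MODEL_REGISTRY:
        raise ValueError(f"{arch} is already registered")
    TT_MODEL_REGISTRY[arch] = module_path


def resolve_tt_model(architectures) -> str:
    """Return the module path for the first registered architecture name."""
    for arch in architectures:
        if arch in TT_MODEL_REGISTRY:
            return TT_MODEL_REGISTRY[arch]
    raise KeyError(f"no TT model registered for {list(architectures)}")


for arch in ("Gemma4ForCausalLM", "Gemma4ForConditionalGeneration"):
    register_tt_model(arch, BRIDGE_MODULE)

print(resolve_tt_model(["Gemma4ForConditionalGeneration"]))
```

Refusing to overwrite an existing key mirrors the test-plan claim that the new registrations do not touch existing entries.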

Dependencies

tt-metal PR #44265: the bridge this plugin change dispatches to. Without it, this PR is a no-op (no Gemma4 model code to dispatch to). Without this PR, the bridge can't be reached from vLLM.

Test plan

Verified live vllm serve with --model google/gemma-4-E2B-it --max_num_seqs 1 produces coherent chat completions ("Paris", working jokes/haiku).
No regressions on existing hybrid-kv-cache-groups models (Gemma3, GptOss) — the block_table pad is a no-op when per-group native widths already equal max_num_blocks_per_req, and the registration additions don't touch existing entries.

🤖 Generated with Claude Code

tchedaTT · 2026-05-14T02:32:16Z

The comment and explanation regarding model registration seem really suspicious and at least partially wrong: other TT models don't get away without explicit registration.

Two changes, both required for vLLM to serve Gemma4 through the hybrid kv-cache-groups path:

``model_runner.py``: pad legacy ``block_tables_per_group[0]`` to ``max_num_blocks_per_req``. vLLM's per-block-byte unifier produces per-group block_table native widths of ``cdiv(max_model_len, group_block_size)``. For Gemma4-E2B with ``cache_config.block_size=64``, sliding ends up at block_size=128 (width=cdiv(131072,128)=1024) and full at block_size=64 (width=2048). The legacy single-tensor view (``block_tables_per_group[0]``) is sliced to ``:max_num_blocks_per_req`` but was not padded, so on single-card runs where sliding wins group 0 the runtime tensor was narrower than ``warmup_model_decode``'s page_table. ``copy_host_to_device`` then tripped its shape assertion on the second decode trace replay.

``platform.py``: register ``Gemma4ForCausalLM`` / ``Gemma4ForConditionalGeneration`` arch names so vLLM resolves them to the TT bridge in ``models/demos/gemma4/tt/generator_vllm.py``. Other TT models get away without explicit registration because upstream vLLM has a torch impl that the platform layer finds first; Gemma4 has none yet.

The bridge itself lands in tt-metal (separate PR). Without that PR this plugin change is a no-op (no Gemma4 model code to dispatch to); without this plugin change the bridge can't be reached from vLLM. They go in together.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

viktorpusTT approved these changes May 13, 2026

handrewsTT marked this pull request as ready for review May 13, 2026 15:52

handrewsTT requested review from ppetrovicTT and tchedaTT as code owners May 13, 2026 15:52

handrewsTT force-pushed the handrews/gemma4-hybrid-bridge branch from baa503d to 17b75df Compare May 21, 2026 09:02

handrewsTT merged commit 5eb61e8 into dev Jun 2, 2026
6 checks passed

handrewsTT deleted the handrews/gemma4-hybrid-bridge branch June 2, 2026 10:39
