
plugin: hybrid Gemma4 wiring — block_table padding + model registration #390

Merged: handrewsTT merged 1 commit into dev from handrews/gemma4-hybrid-bridge on Jun 2, 2026


@handrewsTT

Summary

Two changes required for vLLM to serve Gemma4 through the hybrid kv-cache-groups path. Companion to tt-metal PR vllm-project#44265 (the bridge itself); these go in together.

1. model_runner.py — pad block_tables_per_group[i] to max_num_blocks_per_req

vLLM's per-block-byte unifier (unify_kv_cache_spec_page_size) produces per-group block_table native widths of cdiv(max_model_len, group_block_size). For Gemma4-E2B with cache_config.block_size=64, the unifier doubles the sliding group's block_size to 128 so that its per-block bytes match the full group's. Sliding therefore ends up with width cdiv(131072, 128) = 1024, and full with width cdiv(131072, 64) = 2048.
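The width arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: `cdiv` here is a local helper written for the example, and the `group_block_sizes` dictionary is a hypothetical stand-in for the per-group block sizes, not a structure taken from vLLM or the plugin.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for positive integers."""
    return -(-a // b)


# Values quoted in the description for Gemma4-E2B.
max_model_len = 131072

# Hypothetical stand-in: block size per KV-cache group after unification.
group_block_sizes = {"sliding": 128, "full": 64}

# Native block_table width per group: cdiv(max_model_len, group_block_size).
widths = {name: cdiv(max_model_len, bs) for name, bs in group_block_sizes.items()}

for name, width in widths.items():
    print(f"{name}: {width} blocks")
```

The sliding group's table is half as wide as the full group's, which is the mismatch the padding change addresses.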

The legacy single-tensor view (block_tables_per_group[0]) was sliced to :max_num_blocks_per_req but never padded. On single-card runs where sliding wins group 0, the runtime tensor was therefore narrower than the page_table that warmup_model_decode captured the trace against, and copy_host_to_device tripped its shape assertion on the second decode trace replay. The fix pads each group's block_table with zeros up to max_num_blocks_per_req width. Zeros are safe because the kernel only reads up to each layer's active block count.

The per-layer expansion in _block_tables_per_layer already padded; the legacy single-tensor view was the gap.
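As a rough illustration of the slice-then-pad behaviour described in this section, the sketch below uses plain Python lists in place of tensors. `pad_block_table` is a hypothetical helper written for this example, not the function in model_runner.py, and the value 2048 for `max_num_blocks_per_req` is assumed here to match the full group's width.

```python
def pad_block_table(rows, width, pad_value=0):
    """Slice each row to `width`, then right-pad it with `pad_value`
    so every row ends up exactly `width` entries long."""
    padded = []
    for row in rows:
        row = list(row[:width])
        padded.append(row + [pad_value] * (width - len(row)))
    return padded


# Assumed target width (the width the decode trace was captured against).
max_num_blocks_per_req = 2048

# One request in the sliding group: native width 1024, three blocks in use.
sliding_table = [[7, 3, 9] + [0] * 1021]

# One request in the full group: already at the target width.
full_table = [[5, 1] + [0] * 2046]

sliding_padded = pad_block_table(sliding_table, max_num_blocks_per_req)
full_padded = pad_block_table(full_table, max_num_blocks_per_req)

print(len(sliding_table[0]), "->", len(sliding_padded[0]))
print(len(full_table[0]), "->", len(full_padded[0]))
```

For a group whose native width already equals the target, the helper returns the rows unchanged, which matches the no-op behaviour claimed for existing models in the test plan.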

2. platform.py — register Gemma4 arch names

Gemma4ForCausalLM and Gemma4ForConditionalGeneration are added to the TT model registry so vLLM dispatches them to the bridge in models/demos/gemma4/tt/generator_vllm.py. Other TT models (Gemma3, GptOss, ...) get away without explicit registration because upstream vLLM has a torch impl that the platform layer finds first via the inferred registry path; Gemma4 has no upstream vLLM implementation yet, so the inspection has to land on the TT class directly.

Dependencies

Test plan

  • Verified live vllm serve with --model google/gemma-4-E2B-it --max_num_seqs 1 produces coherent chat completions ("Paris", working jokes/haiku).
  • No regressions on existing hybrid-kv-cache-groups models (Gemma3, GptOss) — the block_table pad is a no-op when per-group native widths already equal max_num_blocks_per_req, and the registration additions don't touch existing entries.

🤖 Generated with Claude Code

@handrewsTT handrewsTT marked this pull request as ready for review May 13, 2026 15:52
@tchedaTT

The comment and explanation regarding model registration seem really suspicious and at least partially wrong: other TT models don't get away without explicit registration.

Two changes both required for vLLM to serve Gemma4 through the
hybrid kv-cache-groups path:

``model_runner.py`` — pad legacy ``block_tables_per_group[0]`` to
``max_num_blocks_per_req``. vLLM's per-block-byte unifier produces
per-group block_table native widths of
``cdiv(max_model_len, group_block_size)``. For Gemma4-E2B with
``cache_config.block_size=64``, sliding ends up at block_size=128
(width=cdiv(131072,128)=1024) and full at block_size=64 (width=2048).
The legacy single-tensor view (``block_tables_per_group[0]``) is
sliced to ``:max_num_blocks_per_req`` but was not padded, so on
single-card runs where sliding wins group 0 the runtime tensor was
narrower than ``warmup_model_decode``'s page_table.
``copy_host_to_device`` then tripped its shape assertion on the
second decode trace replay.

``platform.py`` — register ``Gemma4ForCausalLM`` /
``Gemma4ForConditionalGeneration`` arch names so vLLM resolves them
to the TT bridge in ``models/demos/gemma4/tt/generator_vllm.py``.
Other TT models get away without explicit registration because
upstream vLLM has a torch impl that the platform layer finds first;
Gemma4 has none yet.

The bridge itself lands in tt-metal (separate PR). Without that PR
this plugin change is a no-op (no Gemma4 model code to dispatch
to); without this plugin change the bridge can't be reached from
vLLM. They go in together.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@handrewsTT handrewsTT force-pushed the handrews/gemma4-hybrid-bridge branch from baa503d to 17b75df May 21, 2026 09:02
@handrewsTT handrewsTT merged commit 5eb61e8 into dev Jun 2, 2026
6 checks passed
@handrewsTT handrewsTT deleted the handrews/gemma4-hybrid-bridge branch June 2, 2026 10:39

3 participants