document Gemma 4 audio design #1
keegoid-codex commented on May 3, 2026
- Add design notes for Gemma 4 audio input support.
- Capture model capability findings, API formats, architecture decisions, and test coverage.

Verification: docs-only; not run.
keegoid-cc left a comment
════ PAI | NATIVE MODE ═══════════════════════
🗒️ TASK: Adversarial PR review of Gemma 4 audio design doc
Summary
Adds a docs-only design note at docs/design/gemma4-audio-input.md describing audio input support for /v1/chat/completions (model capability findings, data flow, API formats, key design decisions, test matrix, files-changed summary).
Bugs & Correctness
- [CAT-2] Doc claims "Zero audio weights in google/gemma-4-31b-it" as a Google design choice. Gemma 3 27B IT is text+vision-only upstream; if "Gemma 4 31B" is a local naming convention for the Gemma 3 27B successor or an unreleased model, the absolute claim risks aging poorly. Consider dating the finding or pinning to the exact HF revision inspected.
- [CAT-3] "Max audio duration: 30 seconds at 16kHz (480,000 samples)": the arithmetic checks out (30 s × 16,000 samples/s = 480,000 ✓), but the doc later lists "Audio longer than 30 seconds" under Not Tested without saying what happens (truncate? 400? silent drop?). Worth one sentence on observed behavior so future readers know the failure mode.
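To make the long-audio question concrete, here is a minimal sketch of the two behaviors the doc could commit to. The 16 kHz rate and 30 second cap come from the doc under review; the function name `check_audio_length` and the `policy` parameter are hypothetical and do not describe what the server currently does.

```python
SAMPLE_RATE_HZ = 16_000  # from the design doc
MAX_DURATION_S = 30      # from the design doc
MAX_SAMPLES = SAMPLE_RATE_HZ * MAX_DURATION_S  # 480,000 samples


def check_audio_length(num_samples: int, policy: str = "reject") -> int:
    """Return how many samples to keep, or raise if the clip is too long.

    `policy` is a hypothetical knob: "reject" raises, "truncate" clips.
    """
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    if num_samples <= MAX_SAMPLES:
        return num_samples
    if policy == "truncate":
        # Keep only the first 30 seconds of audio.
        return MAX_SAMPLES
    if policy == "reject":
        seconds = num_samples / SAMPLE_RATE_HZ
        raise ValueError(
            f"audio is {seconds:.1f}s long; limit is {MAX_DURATION_S}s"
        )
    raise ValueError(f"unknown policy: {policy!r}")
```

Whichever branch matches the observed behavior is the one the doc should name; a silent drop would correspond to neither.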
Design & Maintainability
- [CAT-3] "Files Changed" table duplicates information that will drift from the actual diff history. Either link to the implementing PR/commit SHA or drop the table — design docs that re-state file lists go stale within one refactor.
- [CAT-3] Decision #4 (prefix cache disabled for audio) names a "future improvement" but gives no tracking issue/AD reference. Without a pointer this becomes a buried TODO.
- [CAT-3] Decision #2 explains that base64 decoding is intentionally not done in extract_multimodal_content(). Good rationale, but the doc doesn't mention how audio_url (remote fetch) cleanup is handled, only input_audio. Section says audio_url is "Not Tested"; clarify whether URL fetch path also goes through _temp_manager or is a known gap.
- [CAT-3] The 10MB size guard for input_audio is mentioned in prose only: no statement on what error surface the client sees (HTTP 400? 413?) or whether audio_url payloads are similarly bounded. SSRF/large-response risk on the URL path isn't addressed at all.
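For reference, a rough sketch of what an explicit error surface for the size guard could look like. Everything here is an assumption for illustration: the exception classes, the `decode_input_audio` name, reading "10MB" as 10 MiB, and the 400 versus 413 mapping are not taken from the implementation.

```python
import base64
import binascii

MAX_AUDIO_BYTES = 10 * 1024 * 1024  # "10MB" read as 10 MiB (assumption)


class AudioInvalid(Exception):
    status_code = 400  # hypothetical mapping: malformed payload


class AudioTooLarge(Exception):
    status_code = 413  # hypothetical mapping: payload over the limit


def decode_input_audio(b64_data: str, max_bytes: int = MAX_AUDIO_BYTES) -> bytes:
    """Decode base64 audio, rejecting oversized or malformed input."""
    # Every 4 base64 characters decode to at most 3 bytes; padding can
    # remove up to 2, so this rejects clearly oversized input before
    # any decoding work is done.
    estimated = (len(b64_data) * 3) // 4
    if estimated - 2 > max_bytes:
        raise AudioTooLarge(f"audio exceeds {max_bytes} bytes")
    try:
        raw = base64.b64decode(b64_data, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise AudioInvalid("input_audio data is not valid base64") from exc
    # Exact check after decoding covers the boundary cases.
    if len(raw) > max_bytes:
        raise AudioTooLarge(f"audio exceeds {max_bytes} bytes")
    return raw
```

Stating the chosen status code in the doc, whichever it is, would let clients distinguish "bad payload" from "too big".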
Security & Safety
- [CAT-3] audio_url fetch path is described but its threat model isn't: no mention of allowed schemes (file://? http to RFC1918?), redirect handling, response-size cap, or timeout. Even for a single-operator deployment this is worth one line so the reader knows it was considered.
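As an illustration of the scheme and address checks in question, a minimal sketch using only the Python standard library. The names `audio_url_allowed` and `address_is_public`, and the choice to allow both http and https, are hypothetical; the caller is assumed to have resolved the hostname already and to pass in the resulting addresses. Redirect handling, response-size caps, and timeouts are not covered here.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_SCHEMES = frozenset({"http", "https"})  # hypothetical allowlist


def address_is_public(ip: str) -> bool:
    """True only for globally routable addresses (not RFC 1918, loopback, link-local)."""
    try:
        return ipaddress.ip_address(ip).is_global
    except ValueError:
        return False


def audio_url_allowed(url: str, resolved_ips: list[str]) -> bool:
    """Check the URL scheme and every address the hostname resolved to."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        return False
    # Reject if resolution produced nothing or any address is non-public.
    return bool(resolved_ips) and all(address_is_public(ip) for ip in resolved_ips)
```

A real guard would also need to connect to the same address it validated, since re-resolving at fetch time reopens the gap.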
Questions for the author
- Is "Gemma 4" the upstream Google name or a local alias for Gemma 3.x? The doc treats it as established; a one-line provenance note would prevent confusion.
- Decision #5 says Gemma 4 doesn't use the native video path because video_token_id is unset. Is that asserted from config inspection, or assumed? Worth citing.
- "mlx-vlm warns and uses only first" for multi-audio: does the server surface that warning to the client or silently drop the extra inputs? Current doc implies silent.
Severity of findings
low — docs-only change; findings are clarifications and minor gaps in threat-model/operational notes, no correctness defects.
🔧 CHANGE: None — review only, no files modified.
✅ VERIFY: Read full diff; arithmetic and cross-references checked.
🗣️ Fig: Solid design doc; main gaps are audio_url threat model, oversize/long-audio failure modes, and the file-list table that will drift.
VERDICT: approve
[DEV SecOps] verdict: PASS
PR: #1
Finding count: high 0, medium 0, low 0.
Security review categories:
Notes: