document Gemma 4 audio design by keegoid-codex · Pull Request #1 · keegoid/vllm-mlx

keegoid-codex · 2026-05-03T02:49:59Z

Add design notes for Gemma 4 audio input support.
- Capture model capability findings, API formats, architecture decisions, and test coverage.

Verification: docs-only; not run.

keegoid-cc

════ PAI | NATIVE MODE ═══════════════════════
🗒️ TASK: Adversarial PR review of Gemma 4 audio design doc

Summary

Adds a docs-only design note at docs/design/gemma4-audio-input.md describing audio input support for /v1/chat/completions (model capability findings, data flow, API formats, key design decisions, test matrix, files-changed summary).

Bugs & Correctness

[CAT-2] Doc claims "Zero audio weights in google/gemma-4-31b-it" as a Google design choice. Gemma 3 27B IT is text+vision-only upstream; if "Gemma 4 31B" is a local naming convention for the Gemma 3 27B successor or an unreleased model, the absolute claim risks aging poorly. Consider dating the finding or pinning to the exact HF revision inspected.
[CAT-3] "Max audio duration: 30 seconds at 16kHz (480,000 samples)" — 30s × 16kHz = 480,000 ✓, but the doc later lists "Audio longer than 30 seconds" under Not Tested without saying what happens (truncate? 400? silent drop?). Worth one sentence on observed behavior so future readers know the failure mode.
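
The arithmetic behind the 30-second limit, and one way the over-length case could be made explicit, can be sketched as follows. The two constants come from the design doc as quoted above; the function name `check_audio_length` and the choice to raise instead of truncating are assumptions for illustration, not observed server behavior.

```python
SAMPLE_RATE_HZ = 16_000  # from the design doc: 16 kHz
MAX_DURATION_S = 30      # from the design doc: 30 seconds
MAX_SAMPLES = SAMPLE_RATE_HZ * MAX_DURATION_S  # 480,000 samples


def check_audio_length(num_samples: int) -> float:
    """Return the clip duration in seconds, or raise if it exceeds the limit.

    Raising is a hypothetical policy; the doc does not say whether the real
    server truncates, rejects, or silently drops over-length audio.
    """
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    if num_samples > MAX_SAMPLES:
        raise ValueError(
            f"audio has {num_samples} samples; limit is {MAX_SAMPLES} "
            f"({MAX_DURATION_S} s at {SAMPLE_RATE_HZ} Hz)"
        )
    return num_samples / SAMPLE_RATE_HZ
```

Whichever policy the implementation actually follows, stating it in the doc next to the limit would close this gap.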

Design & Maintainability

[CAT-3] "Files Changed" table duplicates information that will drift from the actual diff history. Either link to the implementing PR/commit SHA or drop the table — design docs that re-state file lists go stale within one refactor.
[CAT-3] Decision #4 (prefix cache disabled for audio) names a "future improvement" but gives no tracking issue or AD reference. Without a pointer this becomes a buried TODO.
[CAT-3] Decision #2 explains that base64 decoding is intentionally not done in extract_multimodal_content(). The rationale is good, but the doc covers cleanup only for input_audio and does not say how audio_url (remote fetch) cleanup is handled. The section lists audio_url as "Not Tested"; clarify whether the URL fetch path also goes through _temp_manager or is a known gap.
[CAT-3] The 10MB size guard for input_audio is mentioned in prose only — no statement on what error surface the client sees (HTTP 400? 413?) or whether audio_url payloads are similarly bounded. SSRF/large-response risk on the URL path isn't addressed at all.
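
To make the error-surface question concrete, here is a minimal sketch of a size guard that maps each failure to an explicit HTTP status. Everything in it is hypothetical: `AudioPayloadError`, `decode_input_audio`, the 400/413 mapping, and the reading of "10MB" as 10 × 2^20 bytes are assumptions, not the server's actual behavior.

```python
import base64
import binascii

MAX_AUDIO_BYTES = 10 * 1024 * 1024  # assumption: "10MB" means 10 MiB


class AudioPayloadError(Exception):
    """Hypothetical error carrying the HTTP status the client would see."""

    def __init__(self, status_code: int, message: str):
        super().__init__(message)
        self.status_code = status_code


def decode_input_audio(data_b64: str, max_bytes: int = MAX_AUDIO_BYTES) -> bytes:
    # Base64 encodes every 3 bytes as 4 characters, so the longest valid
    # encoding of max_bytes bytes is 4 * ceil(max_bytes / 3) characters.
    # Checking this first avoids decoding an obviously oversized payload.
    max_encoded_len = 4 * ((max_bytes + 2) // 3)
    if len(data_b64) > max_encoded_len:
        raise AudioPayloadError(413, "input_audio payload too large")
    try:
        raw = base64.b64decode(data_b64, validate=True)
    except binascii.Error as exc:
        raise AudioPayloadError(400, "input_audio is not valid base64") from exc
    # The pre-check can overshoot by up to 2 bytes, so confirm the real size.
    if len(raw) > max_bytes:
        raise AudioPayloadError(413, "input_audio payload too large")
    return raw
```

A sentence in the doc naming the status code for each branch would give clients the same information without reading the source.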

Security & Safety

[CAT-3] audio_url fetch path is described but its threat model isn't: no mention of allowed schemes (file://? http to RFC1918?), redirect handling, response-size cap, or timeout. Even for a single-operator deployment this is worth one line so the reader knows it was considered.
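
As an illustration of the kind of check the threat-model line could refer to, the sketch below allows only http(s) URLs whose host resolves exclusively to globally routable addresses. The function names, the allowed-scheme set, and the injectable `resolve` parameter are assumptions for this example; nothing here is taken from the vllm-mlx code.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}  # assumption: file://, ftp://, etc. are rejected


def _resolve(host: str) -> list[str]:
    """Resolve a host name to its IP address strings via the system resolver."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_fetchable_audio_url(url: str, resolve=_resolve) -> bool:
    """Return True only for http(s) URLs whose host maps to public addresses."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL, e.g. unbalanced IPv6 brackets
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        return False
    try:
        addresses = resolve(parts.hostname)
    except OSError:
        return False  # resolution failed
    # Every resolved address must be globally routable; this rejects
    # loopback, RFC 1918 ranges, and link-local addresses.
    return bool(addresses) and all(
        ipaddress.ip_address(addr).is_global for addr in addresses
    )
```

A pre-fetch check like this is only part of the picture: it would have to be repeated for every redirect hop, and the fetch itself still needs a response-size cap and a timeout. It also does not defend against DNS rebinding between the check and the connection.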

Questions for the author

Is "Gemma 4" the upstream Google name or a local alias for Gemma 3.x? The doc treats it as established; a one-line provenance note would prevent confusion.
Decision #5 says Gemma 4 doesn't use the native video path because video_token_id is unset. Is that asserted from config inspection, or assumed? Worth citing.
"mlx-vlm warns and uses only first" for multi-audio — does the server surface that warning to the client or silently drop? Current doc implies silent.

Severity of findings

low — docs-only change; findings are clarifications and minor gaps in threat-model/operational notes, no correctness defects.

🔧 CHANGE: None — review only, no files modified.
✅ VERIFY: Read full diff; arithmetic and cross-references checked.
🗣️ Fig: Solid design doc; main gaps are audio_url threat model, oversize/long-audio failure modes, and the file-list table that will drift.

VERDICT: approve

keegoid-cc · 2026-05-06T16:10:28Z

[DEV SecOps] verdict: PASS
verified at 0df9117

PR: #1
Scope inspected: docs/design/gemma4-audio-input.md (added, 167 lines), the commit message "document gemma4 audio design", the PR file metadata, and the PR diff.

Finding count: high 0, medium 0, low 0.

Security review categories:

Prompt-injection: no findings. Scanned the added Markdown, JSON examples, code fences, table text, and commit message for instruction-bearing steering text and hidden/control Unicode. The only trigger-like terms are benign design content such as base64, audio examples, and API placeholders.
Dependency review: no findings. No dependency manifests or lockfiles changed, so no new registry packages required owner/age/download/post-install verification.
CI / workflow injection: no findings. No .github/workflows/*, .gitlab-ci.yml, Jenkinsfile, or other CI files changed.
Generated/copied code provenance: no findings. The PR adds a design document and contains no executable copied/generated code block requiring source provenance.
Lockfile sanity: no findings. No lockfile changed.
Source-level security / malicious-code pattern scan: no findings. No runtime source files changed; the diff adds documentation only. No executable network egress, credential reads, shell invocation, dynamic eval, install hooks, or secret-scanning anomalies are introduced by this PR.

Notes:

The declared prompt trigger fixture ./data/prompt-injection-triggers.txt was not present in this agent instruction bundle at review time, so I used the fallback static prompt-injection and hidden-Unicode scan path rather than skipping the category.
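
For readers unfamiliar with what a hidden-Unicode scan involves, a minimal sketch follows. It is an assumption about what such a fallback check could look like, not the tooling used for this review, and `find_hidden_characters` is a hypothetical name. It flags characters in Unicode categories Cc (control) and Cf (format), which cover zero-width characters and bidirectional overrides.

```python
import unicodedata

ALLOWED_CONTROLS = {"\n", "\r", "\t"}  # ordinary whitespace controls
SUSPECT_CATEGORIES = {"Cc", "Cf"}      # control and format characters


def find_hidden_characters(text: str) -> list[tuple[int, int, str]]:
    """Return (line, column, code point) for each suspect character.

    Lines and columns are 1-based. Lines are split on "\n" only, so other
    control characters stay inside a line and are reported, not consumed.
    """
    findings = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in ALLOWED_CONTROLS:
                continue
            if unicodedata.category(ch) in SUSPECT_CATEGORIES:
                findings.append((line_no, col, f"U+{ord(ch):04X}"))
    return findings
```

This is a coarse filter: it also flags the zero-width joiner (U+200D) used in legitimate emoji sequences, so findings need a human look before they count as anomalies.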

document gemma4 audio design

0df9117

keegoid-codex added the codex Codex-authored work label May 3, 2026

keegoid-cc approved these changes May 3, 2026

keegoid-codex wants to merge 1 commit into main from codex/gemma4-audio-design-doc