
feat: add repetition_penalty support for MLLM path #258

Merged
Thump604 merged 1 commit into waybarrios:main from janhilgard:feat/mllm-repetition-penalty
Apr 10, 2026

Conversation

@janhilgard
Collaborator

@janhilgard janhilgard commented Apr 6, 2026

Summary

  • MLLM scheduler previously ignored the repetition_penalty parameter from API requests — tokens were sampled without any penalty, leading to excessive repetition in some models
  • Adds per-request logits processors built from repetition_penalty via mlx_lm.sample_utils.make_logits_processors, applied in _step() before sampling (see the sketch after this list)
  • Propagates repetition_penalty from SamplingParams through BatchedEngine → MLLMScheduler → MLLMBatchRequest → MLLMBatchGenerator
  • Handles filter() and extend() for logits processors to support continuous batching lifecycle
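
For reference, here is a minimal sketch of the per-request processor pattern described above. mlx_lm.sample_utils.make_logits_processors is the real helper; the surrounding function names and the decode-step shape are simplified stand-ins for the scheduler code, not the PR's exact implementation:

```python
# Sketch only: helper names and shapes are simplified stand-ins for the scheduler code.
import mlx.core as mx
from mlx_lm.sample_utils import make_logits_processors

def build_processors(repetition_penalty: float | None):
    """Build the per-request logits-processor list; empty when the penalty is off."""
    if not repetition_penalty or repetition_penalty == 1.0:
        return []  # zero overhead for requests that don't set a penalty
    return make_logits_processors(repetition_penalty=repetition_penalty)

def apply_before_sampling(history: mx.array, logits: mx.array, processors) -> mx.array:
    """Run each processor over this request's logits row before the sampler sees it."""
    for proc in processors:
        logits = proc(history, logits)
    return logits
```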

Test plan

  • Send a request with repetition_penalty: 1.1 to an MLLM server and verify reduced repetition (an example request follows this list)
  • Send request without repetition_penalty, verify default behavior unchanged
  • Verify [rep_penalty] log message appears when penalty is active
  • Concurrent requests with different penalties should each apply their own processor
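
For illustration, the first two items could be exercised with a request like the one below; the endpoint path, port, and model name are assumptions about a local OpenAI-compatible server, not details from this PR:

```python
# Assumed local OpenAI-compatible endpoint, port, and model name; adjust as needed.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "some-mllm-model",
        "messages": [{"role": "user", "content": "Describe the attached image in detail."}],
        "repetition_penalty": 1.1,  # drop this field to check the unchanged default path
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```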

🤖 Generated with Claude Code

@janhilgard janhilgard force-pushed the feat/mllm-repetition-penalty branch from ff2d700 to 2d0acc0 on April 6, 2026 at 10:36
@Thump604
Collaborator

Thump604 commented Apr 7, 2026

@waybarrios, @janhilgard: brief note. This PR is the MLLM-path counterpart to my #213, which adds top_k, min_p, presence_penalty, and repetition_penalty to the LLM and SimpleEngine paths. Together you get full sampling-param coverage across all engine modes (SimpleEngine LLM, SimpleEngine MLLM, BatchedEngine, BatchedEngine MLLM scheduler).

No conflict between the two PRs at the file level (different scheduler paths). Both should land. Mergeable on current main.

Extends MLLM batch generator to support top_k, min_p, and
presence_penalty alongside the existing repetition_penalty.
This gives the MLLM path full parity with the LLM/SimpleEngine
sampling parameter coverage.

Changes:
- MLLMBatchRequest: add top_k, min_p, presence_penalty fields
- MLLMBatch: add per-request samplers list (filter/extend support)
- _process_prompts: build per-request logits processors for
  presence_penalty and per-request samplers for top_k/min_p
- _step: accept and apply per-request samplers
- SamplingParams: add presence_penalty field
- MLLMScheduler: propagate new params from kwargs to batch requests
- BatchedEngine: pass new params through generate/stream_generate

When a request uses default values (top_k=0, min_p=0.0,
presence_penalty=0.0), no extra processors or samplers are created
— zero overhead for standard requests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@janhilgard janhilgard force-pushed the feat/mllm-repetition-penalty branch from 2d0acc0 to 4116542 on April 8, 2026 at 08:12
@janhilgard
Collaborator Author

@Thump604 Good call — I've expanded the PR to support all 4 sampling parameters in the MLLM path:

Parameter            Implementation                                    MLLM support
repetition_penalty   logits processor                                  (existing)
presence_penalty     logits processor via make_logits_processors()     new
top_k                per-request sampler via make_sampler()            new
min_p                per-request sampler via make_sampler()            new

Architecture:

  • presence_penalty uses the same make_logits_processors() pattern as repetition_penalty — both are logits processors that modify logits based on token history
  • top_k and min_p require per-request samplers (different requests in a batch can have different values), so I added a samplers field to MLLMBatch with full filter()/extend() support for continuous batching (see the sketch after this list)
  • When a request uses default values (top_k=0, min_p=0.0, presence_penalty=0.0), no extra processors or samplers are created — zero overhead for standard requests
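
A rough sketch of that split, assuming mlx_lm.sample_utils.make_sampler for the per-request samplers; the helper name, the batch class, and the default checks here are illustrative simplifications rather than the PR's exact code:

```python
# Illustrative sketch: per-request sampler creation plus filter()/extend() bookkeeping.
from mlx_lm.sample_utils import make_sampler

def maybe_make_sampler(temperature: float, top_p: float, top_k: int, min_p: float):
    """Return None for default requests so the batch keeps its shared fast path."""
    if top_k <= 0 and min_p <= 0.0:
        return None  # zero overhead: no per-request sampler for standard requests
    return make_sampler(temp=temperature, top_p=top_p, top_k=top_k, min_p=min_p)

class BatchSamplersSketch:
    """Keeps one sampler slot (or None) per batch row, aligned across rescheduling."""

    def __init__(self):
        self.samplers: list = []

    def filter(self, keep_indices: list[int]) -> None:
        # Drop rows for finished requests so samplers stay aligned with the batch.
        self.samplers = [self.samplers[i] for i in keep_indices]

    def extend(self, new_samplers: list) -> None:
        # Append samplers for newly scheduled requests (continuous batching).
        self.samplers.extend(new_samplers)
```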

Files changed:

  • request.py — added presence_penalty to SamplingParams
  • mllm_batch_generator.py — MLLMBatchRequest fields, MLLMBatch.samplers, _process_prompts, _step
  • mllm_scheduler.py — propagate new params through add_request and _schedule_waiting (see the propagation sketch after this list)
  • engine/batched.py — pass new params through generate/stream_generate
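
To make the propagation path concrete, a hypothetical simplification follows; the SamplingParams defaults and the add_request keyword names are assumptions for illustration, not copied from the files above:

```python
# Hypothetical simplification of the parameter propagation; names and defaults are assumed.
from dataclasses import dataclass

@dataclass
class SamplingParamsSketch:
    temperature: float = 1.0
    top_p: float = 1.0
    top_k: int = 0
    min_p: float = 0.0
    repetition_penalty: float = 1.0
    presence_penalty: float = 0.0  # the field added in request.py

def submit(scheduler, prompt: str, params: SamplingParamsSketch) -> None:
    """Engine-level call-through: hand the new fields to the MLLM scheduler unchanged."""
    scheduler.add_request(
        prompt,
        top_k=params.top_k,
        min_p=params.min_p,
        presence_penalty=params.presence_penalty,
        repetition_penalty=params.repetition_penalty,
    )
```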

Together with your #213, this gives full sampling-param coverage across all engine modes. No conflicts at the file level (your PR touches LLM/SimpleEngine/server.py, this one touches MLLM path).

Collaborator

@Thump604 Thump604 left a comment


Nice expansion. The architecture split — logits processors for the history-dependent penalties (repetition_penalty, presence_penalty) vs per-request samplers for the distribution-shape parameters (top_k, min_p) — matches the constraint that batched inference needs per-request sampler state while logits processors are a single stack. The zero-overhead default path (no sampler creation when all params are default) keeps batched throughput unchanged for the 99% case.

Confirmed this lands independently of #213 — they touch disjoint files (mllm_batch_generator.py / mllm_scheduler.py / batched.py vs llm.py / simple.py / server.py). Between the two, every engine mode gets full sampling param coverage.

Approving.

@Thump604 Thump604 merged commit 31017b0 into waybarrios:main Apr 10, 2026
7 checks passed
janhilgard added a commit to janhilgard/vllm-mlx that referenced this pull request Apr 11, 2026
Incorporates 53 upstream commits including:
- O(1) state-machine reasoning parser (PR waybarrios#234)
- Resumable model download (PR waybarrios#77)
- Block-aware prefix cache (PR waybarrios#217)
- Message normalization (PR waybarrios#240)
- Full sampling params (PR waybarrios#258)
- ThinkRouter for Anthropic streaming
- 22 new test files
- License file, docs updates

Conflict resolution: preserved production features
(frequency_penalty conversion, tool markup safety nets,
openai_to_anthropic import) while adopting upstream
improvements (Gemma4 parser rewrite, cleaner logging,
_model_name in streaming chunks).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>