
feat: add repetition_penalty support for MLLM path #258

Merged
Thump604 merged 1 commit into waybarrios:main from janhilgard:feat/mllm-repetition-penalty
Apr 10, 2026

Conversation

@janhilgard
Collaborator

@janhilgard janhilgard commented Apr 6, 2026

Summary

  • MLLM scheduler previously ignored the repetition_penalty parameter from API requests — tokens were sampled without any penalty, leading to excessive repetition in some models
  • Adds per-request logits processors built from repetition_penalty via mlx_lm.sample_utils.make_logits_processors, applied in _step() before sampling (see the sketch after this list)
  • Propagates repetition_penalty from SamplingParams through BatchedEngine → MLLMScheduler → MLLMBatchRequest → MLLMBatchGenerator
  • Handles filter() and extend() for logits processors to support continuous batching lifecycle
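
For reference, here is a minimal sketch of the per-request processor pattern described above. mlx_lm.sample_utils.make_logits_processors is the real helper; the surrounding function names and the decode-step shape are simplified stand-ins for the scheduler code, not the PR's exact implementation:

```python
# Sketch only: helper names and shapes are simplified stand-ins for the scheduler code.
import mlx.core as mx
from mlx_lm.sample_utils import make_logits_processors

def build_processors(repetition_penalty: float | None):
    """Build the per-request logits-processor list; empty when the penalty is off."""
    if not repetition_penalty or repetition_penalty == 1.0:
        return []  # zero overhead for requests that don't set a penalty
    return make_logits_processors(repetition_penalty=repetition_penalty)

def apply_before_sampling(history: mx.array, logits: mx.array, processors) -> mx.array:
    """Run each processor over this request's logits row before the sampler sees it."""
    for proc in processors:
        logits = proc(history, logits)
    return logits
```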

Test plan

  • Send a request with repetition_penalty: 1.1 to an MLLM server and verify reduced repetition (an example request follows this list)
  • Send request without repetition_penalty, verify default behavior unchanged
  • Verify [rep_penalty] log message appears when penalty is active
  • Concurrent requests with different penalties should each apply their own processor
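
For illustration, the first two items could be exercised with a request like the one below; the endpoint path, port, and model name are assumptions about a local OpenAI-compatible server, not details from this PR:

```python
# Assumed local OpenAI-compatible endpoint, port, and model name; adjust as needed.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "some-mllm-model",
        "messages": [{"role": "user", "content": "Describe the attached image in detail."}],
        "repetition_penalty": 1.1,  # drop this field to check the unchanged default path
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```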

🤖 Generated with Claude Code

@janhilgard janhilgard force-pushed the feat/mllm-repetition-penalty branch from ff2d700 to 2d0acc0 on April 6, 2026 at 10:36
@Thump604
Collaborator

Thump604 commented Apr 7, 2026

@waybarrios, @janhilgard: brief note. This PR is the MLLM-path counterpart to my #213, which adds top_k, min_p, presence_penalty, and repetition_penalty to the LLM and SimpleEngine paths. Together you get full sampling-param coverage across all engine modes (SimpleEngine LLM, SimpleEngine MLLM, BatchedEngine, BatchedEngine MLLM scheduler).

No conflict between the two PRs at the file level (different scheduler paths). Both should land. Mergeable on current main.

Extends MLLM batch generator to support top_k, min_p, and
presence_penalty alongside the existing repetition_penalty.
This gives the MLLM path full parity with the LLM/SimpleEngine
sampling parameter coverage.

Changes:
- MLLMBatchRequest: add top_k, min_p, presence_penalty fields
- MLLMBatch: add per-request samplers list (filter/extend support)
- _process_prompts: build per-request logits processors for
  presence_penalty and per-request samplers for top_k/min_p
- _step: accept and apply per-request samplers
- SamplingParams: add presence_penalty field
- MLLMScheduler: propagate new params from kwargs to batch requests
- BatchedEngine: pass new params through generate/stream_generate

When a request uses default values (top_k=0, min_p=0.0,
presence_penalty=0.0), no extra processors or samplers are created
— zero overhead for standard requests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@janhilgard janhilgard force-pushed the feat/mllm-repetition-penalty branch from 2d0acc0 to 4116542 on April 8, 2026 at 08:12
@janhilgard
Collaborator Author

@Thump604 Good call — I've expanded the PR to support all 4 sampling parameters in the MLLM path:

Parameter            Implementation                                    MLLM support
repetition_penalty   logits processor                                  (existing)
presence_penalty     logits processor via make_logits_processors()     new
top_k                per-request sampler via make_sampler()            new
min_p                per-request sampler via make_sampler()            new

Architecture:

  • presence_penalty uses the same make_logits_processors() pattern as repetition_penalty — both are logits processors that modify logits based on token history
  • top_k and min_p require per-request samplers (different requests in a batch can have different values), so I added a samplers field to MLLMBatch with full filter()/extend() support for continuous batching (see the sketch after this list)
  • When a request uses default values (top_k=0, min_p=0.0, presence_penalty=0.0), no extra processors or samplers are created — zero overhead for standard requests
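
A rough sketch of that split, assuming mlx_lm.sample_utils.make_sampler for the per-request samplers; the helper name, the batch class, and the default checks here are illustrative simplifications rather than the PR's exact code:

```python
# Illustrative sketch: per-request sampler creation plus filter()/extend() bookkeeping.
from mlx_lm.sample_utils import make_sampler

def maybe_make_sampler(temperature: float, top_p: float, top_k: int, min_p: float):
    """Return None for default requests so the batch keeps its shared fast path."""
    if top_k <= 0 and min_p <= 0.0:
        return None  # zero overhead: no per-request sampler for standard requests
    return make_sampler(temp=temperature, top_p=top_p, top_k=top_k, min_p=min_p)

class BatchSamplersSketch:
    """Keeps one sampler slot (or None) per batch row, aligned across rescheduling."""

    def __init__(self):
        self.samplers: list = []

    def filter(self, keep_indices: list[int]) -> None:
        # Drop rows for finished requests so samplers stay aligned with the batch.
        self.samplers = [self.samplers[i] for i in keep_indices]

    def extend(self, new_samplers: list) -> None:
        # Append samplers for newly scheduled requests (continuous batching).
        self.samplers.extend(new_samplers)
```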

Files changed:

  • request.py — added presence_penalty to SamplingParams
  • mllm_batch_generator.py — MLLMBatchRequest fields, MLLMBatch.samplers, _process_prompts, _step
  • mllm_scheduler.py — propagate new params through add_request and _schedule_waiting (see the propagation sketch after this list)
  • engine/batched.py — pass new params through generate/stream_generate
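
To make the propagation path concrete, a hypothetical simplification follows; the SamplingParams defaults and the add_request keyword names are assumptions for illustration, not copied from the files above:

```python
# Hypothetical simplification of the parameter propagation; names and defaults are assumed.
from dataclasses import dataclass

@dataclass
class SamplingParamsSketch:
    temperature: float = 1.0
    top_p: float = 1.0
    top_k: int = 0
    min_p: float = 0.0
    repetition_penalty: float = 1.0
    presence_penalty: float = 0.0  # the field added in request.py

def submit(scheduler, prompt: str, params: SamplingParamsSketch) -> None:
    """Engine-level call-through: hand the new fields to the MLLM scheduler unchanged."""
    scheduler.add_request(
        prompt,
        top_k=params.top_k,
        min_p=params.min_p,
        presence_penalty=params.presence_penalty,
        repetition_penalty=params.repetition_penalty,
    )
```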

Together with your #213, this gives full sampling-param coverage across all engine modes. No conflicts at the file level (your PR touches LLM/SimpleEngine/server.py, this one touches MLLM path).

Collaborator

@Thump604 Thump604 left a comment


Nice expansion. The architecture split — logits processors for the history-dependent penalties (repetition_penalty, presence_penalty) vs per-request samplers for the distribution-shape parameters (top_k, min_p) — matches the constraint that batched inference needs per-request sampler state while logits processors are a single stack. The zero-overhead default path (no sampler creation when all params are default) keeps batched throughput unchanged for the 99% case.

Confirmed this lands independently of #213 — they touch disjoint files (mllm_batch_generator.py / mllm_scheduler.py / batched.py vs llm.py / simple.py / server.py). Between the two, every engine mode gets full sampling param coverage.

Approving.

@Thump604 Thump604 merged commit 31017b0 into waybarrios:main Apr 10, 2026
7 checks passed
janhilgard added a commit to janhilgard/vllm-mlx that referenced this pull request Apr 11, 2026
Incorporates 53 upstream commits including:
- O(1) state-machine reasoning parser (PR waybarrios#234)
- Resumable model download (PR waybarrios#77)
- Block-aware prefix cache (PR waybarrios#217)
- Message normalization (PR waybarrios#240)
- Full sampling params (PR waybarrios#258)
- ThinkRouter for Anthropic streaming
- 22 new test files
- License file, docs updates

Conflict resolution: preserved production features
(frequency_penalty conversion, tool markup safety nets,
openai_to_anthropic import) while adopting upstream
improvements (Gemma4 parser rewrite, cleaner logging,
_model_name in streaming chunks).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>