feat: add repetition_penalty support for MLLM path #258
Thump604 merged 1 commit into waybarrios:main from
Conversation
ff2d700 to 2d0acc0
@waybarrios, @janhilgard: brief note. This PR is the MLLM-path counterpart to my #213, which adds the corresponding support on the LLM/SimpleEngine path. No conflict between the two PRs at the file level (different scheduler paths). Both should land. Mergeable on current main.
Extends MLLM batch generator to support top_k, min_p, and presence_penalty alongside the existing repetition_penalty. This gives the MLLM path full parity with the LLM/SimpleEngine sampling parameter coverage.

Changes:
- MLLMBatchRequest: add top_k, min_p, presence_penalty fields
- MLLMBatch: add per-request samplers list (filter/extend support)
- _process_prompts: build per-request logits processors for presence_penalty and per-request samplers for top_k/min_p
- _step: accept and apply per-request samplers
- SamplingParams: add presence_penalty field
- MLLMScheduler: propagate new params from kwargs to batch requests
- BatchedEngine: pass new params through generate/stream_generate

When a request uses default values (top_k=0, min_p=0.0, presence_penalty=0.0), no extra processors or samplers are created — zero overhead for standard requests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
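A minimal sketch of how the per-request construction described in this commit message could look. The function and variable names below are illustrative, not the PR's actual code; it assumes mlx_lm.sample_utils.make_sampler / make_logits_processors with their usual keyword arguments, which may vary by mlx_lm version.

```python
# Illustrative sketch only -- not the PR's code. Assumes mlx_lm's
# make_sampler/make_logits_processors keyword arguments, which can vary
# by mlx_lm version.
from mlx_lm.sample_utils import make_logits_processors, make_sampler


def build_request_sampling(top_k=0, min_p=0.0, presence_penalty=0.0,
                           repetition_penalty=None, temperature=1.0):
    """Return (logits_processors, sampler) for a single request.

    With all defaults (top_k=0, min_p=0.0, presence_penalty=0.0, no
    repetition_penalty), this returns ([], None): nothing extra is
    created, which is the zero-overhead path the commit message describes.
    """
    processors = []
    if repetition_penalty:
        # mlx_lm already ships a repetition-penalty logits processor.
        processors.extend(
            make_logits_processors(repetition_penalty=repetition_penalty))
    if presence_penalty:
        # History-dependent flat penalty on tokens that already appeared.
        def presence_processor(tokens, logits, penalty=presence_penalty):
            if len(tokens):
                logits[:, tokens] = logits[:, tokens] - penalty
            return logits
        processors.append(presence_processor)

    sampler = None
    if top_k > 0 or min_p > 0.0:
        # Distribution-shape parameters need a per-request sampler.
        sampler = make_sampler(temp=temperature,
                               top_k=top_k if top_k > 0 else -1,  # -1 = disabled
                               min_p=min_p)
    return processors, sampler
```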
2d0acc0 to 4116542
@Thump604 Good call — I've expanded the PR to support all 4 sampling parameters in the MLLM path:
Architecture:
Files changed:
Together with your #213, this gives full sampling-param coverage across all engine modes. No conflicts at the file level (your PR touches LLM/SimpleEngine/server.py, this one touches MLLM path).
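For orientation, a sketch of what the extended request container could look like. The field names come from the commit message above; the defaults, types, and the helper method are assumptions, not the repository's actual definition.

```python
# Sketch only: field names from the commit message; the defaults, types,
# and helper method are assumptions rather than the repository's code.
from dataclasses import dataclass
from typing import Optional


@dataclass
class MLLMBatchRequest:
    # Existing parameter (supported before this PR):
    repetition_penalty: Optional[float] = None
    # Parameters added by this PR:
    top_k: int = 0                 # 0 disables top-k filtering
    min_p: float = 0.0             # 0.0 disables min-p filtering
    presence_penalty: float = 0.0  # 0.0 disables the presence penalty

    def needs_custom_sampler(self) -> bool:
        # Hypothetical helper: default-valued requests keep the fast path.
        return self.top_k > 0 or self.min_p > 0.0
```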
Thump604 left a comment
Nice expansion. The architecture split — logits processors for the history-dependent penalties (repetition_penalty, presence_penalty) vs per-request samplers for the distribution-shape parameters (top_k, min_p) — matches the constraint that batched inference needs per-request sampler state while logits processors are a single stack. The zero-overhead default path (no sampler creation when all params are default) keeps batched throughput unchanged for the 99% case.
Confirmed this lands independently of #213 — they touch disjoint files (mllm_batch_generator.py / mllm_scheduler.py / batched.py vs llm.py / simple.py / server.py). Between the two, every engine mode gets full sampling param coverage.
Approving.
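To make the split concrete, here is a rough sketch of how the per-request state could be applied at sampling time. This is not the repository's _step implementation; the request attributes (logits_processors, generated_tokens, sampler) are illustrative names only.

```python
# Rough sketch of the apply order discussed in the review above; not the
# repository's _step code. Request attributes are illustrative names.
import mlx.core as mx


def sample_step(logits, requests, default_sampler):
    """logits: (batch, vocab_size); one row per active request."""
    next_tokens = []
    for i, req in enumerate(requests):
        row = logits[i:i + 1]  # keep a (1, vocab_size) shape
        # History-dependent penalties run first, against this request's history.
        for proc in req.logits_processors:
            row = proc(req.generated_tokens, row)
        # Distribution-shape params use the per-request sampler when present,
        # otherwise the shared default sampler (the zero-overhead default path).
        sampler = req.sampler if req.sampler is not None else default_sampler
        next_tokens.append(sampler(row))
    return mx.concatenate(next_tokens)
```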
Incorporates 53 upstream commits including:
- O(1) state-machine reasoning parser (PR waybarrios#234)
- Resumable model download (PR waybarrios#77)
- Block-aware prefix cache (PR waybarrios#217)
- Message normalization (PR waybarrios#240)
- Full sampling params (PR waybarrios#258)
- ThinkRouter for Anthropic streaming
- 22 new test files
- License file, docs updates

Conflict resolution: preserved production features (frequency_penalty conversion, tool markup safety nets, openai_to_anthropic import) while adopting upstream improvements (Gemma4 parser rewrite, cleaner logging, _model_name in streaming chunks).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- The MLLM path ignored the repetition_penalty parameter from API requests — tokens were sampled without any penalty, leading to excessive repetition in some models
- Implement repetition_penalty via mlx_lm.sample_utils.make_logits_processors, applied in _step() before sampling
- Plumb repetition_penalty from SamplingParams through BatchedEngine → MLLMScheduler → MLLMBatchRequest → MLLMBatchGenerator
- Add filter() and extend() for logits processors to support the continuous batching lifecycle (see the sketch after this list)
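A hypothetical sketch of the filter()/extend() lifecycle named above: the per-request processor lists have to stay index-aligned with the batch as finished requests leave and newly admitted ones join. This is not the repository's implementation, only the shape of the bookkeeping.

```python
# Hypothetical bookkeeping sketch, not the repository's code: keep
# per-request logits-processor lists aligned with the batch slots.
class LogitsProcessorState:
    def __init__(self):
        self.per_request = []  # one list of processors per active request

    def extend(self, new_processor_lists):
        # Newly admitted requests append their processor lists at the end,
        # matching the order in which the scheduler adds them to the batch.
        self.per_request.extend(new_processor_lists)

    def filter(self, keep_indices):
        # Finished requests are dropped so the remaining entries stay
        # aligned with the rows of the batch for the next step.
        self.per_request = [self.per_request[i] for i in keep_indices]
```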
Test plan
- Send requests with repetition_penalty: 1.1 to an MLLM server, verify reduced repetition
- Send requests without repetition_penalty, verify default behavior unchanged
- Verify the [rep_penalty] log message appears when the penalty is active (a request example follows below)

🤖 Generated with Claude Code
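A minimal manual check for the test plan above, assuming an OpenAI-compatible chat completions endpoint. The host, port, path, and model placeholder are assumptions; adjust them for your deployment.

```python
# Manual smoke test for repetition_penalty. Host, port, endpoint path,
# and model name are assumptions about the local deployment.
import requests

payload = {
    "model": "<your-mllm-model>",
    "messages": [{"role": "user", "content": "List ten animals."}],
    "max_tokens": 200,
    "repetition_penalty": 1.1,  # the parameter this PR wires through
}
resp = requests.post("http://localhost:8000/v1/chat/completions",
                     json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])

# Repeat without "repetition_penalty": output should match the pre-PR
# defaults, and the penalty log line (e.g. "[rep_penalty]") should only
# appear when the parameter is present.
```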