feat: repetition detector for degenerate token loops #65

Closed
janhilgard wants to merge 1 commit into waybarrios:main from janhilgard:feature/repetition-detector

@janhilgard
Collaborator

Summary

  • Adds a lightweight repetition detector to the scheduler that monitors the last 32 generated tokens per request
  • Stops generation with finish_reason="stop" when degenerate patterns are detected:
    • Single-token repetition (8+ identical tokens, e.g. 0 0 0 0 0 0 0 0)
    • Short sequence repetition (2-4 token patterns repeated 6+ times, e.g. ab ab ab ab ab ab)
  • Ring buffer per UID with automatic cleanup on request finish/abort
  • Near-zero overhead when no repetition occurs (a list append plus a periodic check)
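
For readers who want to see the two checks concretely, here is a minimal sketch of the behaviour described in the list above. It is not the code from this PR: the names `RepetitionTracker`, `is_degenerate`, and the constants are hypothetical, and it checks on every token rather than periodically to keep the example short.

```python
from collections import deque

# Thresholds taken from the summary above; the names are illustrative only.
WINDOW = 32               # tokens remembered per request
MIN_SINGLE_RUN = 8        # identical tokens in a row
MIN_PATTERN_REPEATS = 6   # repeats of a short pattern
PATTERN_LENGTHS = (2, 3, 4)


def is_degenerate(tokens):
    """Return True if the tail of `tokens` is a single-token run or a short loop."""
    toks = list(tokens)

    # Single-token repetition: the last 8 tokens are all the same.
    if len(toks) >= MIN_SINGLE_RUN and len(set(toks[-MIN_SINGLE_RUN:])) == 1:
        return True

    # Short-sequence repetition: a 2-4 token pattern repeated 6+ times at the tail.
    for n in PATTERN_LENGTHS:
        need = n * MIN_PATTERN_REPEATS
        if len(toks) < need:
            continue
        tail = toks[-need:]
        if all(tail[i] == tail[i % n] for i in range(need)):
            return True
    return False


class RepetitionTracker:
    """Hypothetical per-request tracker: one bounded buffer per request UID."""

    def __init__(self):
        self._buffers = {}

    def add(self, uid, token):
        """Record a token for `uid`; return True if generation should stop."""
        buf = self._buffers.setdefault(uid, deque(maxlen=WINDOW))
        buf.append(token)
        return is_degenerate(buf)

    def discard(self, uid):
        """Drop the buffer when a request finishes or is aborted."""
        self._buffers.pop(uid, None)
```

A scheduler would call `add(uid, token)` after each sampled token, finish the request with finish_reason="stop" when it returns True, and call `discard(uid)` on finish or abort. The longest pattern check needs 4 × 6 = 24 tokens, so it fits inside the 32-token window.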

Split out from PR #53 per review feedback — this touches the scheduler hot path and is independent of the GPT-OSS reasoning parser.

Test plan

  • 15 unit tests covering all detection patterns and edge cases (tests/test_repetition_detector.py)
  • Manual testing with models known to produce degenerate output
  • Verify no performance regression on normal generation
pytest tests/test_repetition_detector.py -v

🤖 Generated with Claude Code

Adds a lightweight repetition detector to the scheduler that monitors
the last 32 generated tokens per request and stops generation when
degenerate patterns are detected:

- Single-token repetition (8+ identical tokens)
- Short sequence repetition (2-4 token patterns repeated 6+ times)

This prevents runaway generation when models enter degenerate loops,
saving compute and improving reliability for long-running requests.

Includes 15 unit tests covering all detection patterns and edge cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
janhilgard added a commit to janhilgard/vllm-mlx that referenced this pull request Feb 11, 2026
Moves repetition detection logic to feature/repetition-detector branch
(PR waybarrios#65) per review feedback on PR waybarrios#53.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@TomLucidor

How is this different from repetition penalties or DRY?

@janhilgard
Collaborator Author

Good question! They solve different problems:

Repetition penalty / DRY are preventative — they modify logits during sampling to discourage repetition before it happens. They work well most of the time.

This detector is a safety net — it doesn't touch sampling at all. It monitors output and terminates generation when degenerate loops have already formed. Think of it as a circuit breaker for the server.

Why both are needed:

  • Repetition penalties don't always prevent loops, especially with aggressively quantized models (4-bit, 6-bit) or certain MoE architectures where expert routing can get stuck
  • Without a detector, a stuck request keeps burning compute until it reaches max_tokens, potentially hundreds of seconds of wasted GPU time on a serving endpoint
  • vllm-mlx is an inference server, not a chat frontend — we can't rely on users configuring sampling params correctly. This catches the cases where penalties fail or aren't set

The overhead is near-zero (list append + periodic check on a 32-token window), so it's cheap insurance.
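
To make the circuit-breaker idea concrete, here is a toy sketch, not the scheduler code from this PR. `generate`, `stuck_model`, and `tail_is_constant` are hypothetical names, and the "length" finish reason for hitting max_tokens is an assumption borrowed from the common OpenAI-style convention.

```python
def generate(next_token, max_tokens, should_stop=None):
    """Toy generation loop: sampling is untouched, the detector only watches output."""
    out = []
    for _ in range(max_tokens):
        out.append(next_token(out))
        if should_stop is not None and should_stop(out):
            return out, "stop"      # detector tripped: end the request early
    return out, "length"            # assumed reason for running out of max_tokens


def stuck_model(history):
    """Stand-in for a model caught in a degenerate loop: always emits token 7."""
    return 7


def tail_is_constant(history, run=8):
    """Simplest possible detector: the last `run` tokens are identical."""
    return len(history) >= run and len(set(history[-run:])) == 1


without_detector = generate(stuck_model, max_tokens=1000)
with_detector = generate(stuck_model, max_tokens=1000, should_stop=tail_is_constant)
```

Without the detector the loop emits all 1000 tokens; with it, the request ends after 8. A repetition penalty would instead change what `next_token` returns, which is why the two mechanisms can be combined.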

Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Mar 21, 2026
Detect and stop repeating token patterns during generation.
Sliding window (200 tokens), checks every 20 tokens for patterns
of length 2-50 repeated 3+ times. Enabled via --repetition-detector.

Prevents stuck loops that waste up to 13 minutes on large models.
Addresses waybarrios#65.
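
A rough sketch of the sliding-window variant described in this commit message, using its stated parameters (200-token window, a check every 20 tokens, patterns of length 2-50 repeated 3+ times). This is an illustration under those assumptions, not the code from the referenced commit; the class name `SlidingRepetitionDetector` and its parameter names are hypothetical.

```python
from collections import deque


class SlidingRepetitionDetector:
    """Hypothetical detector: checks the tail of a bounded window at a fixed interval."""

    def __init__(self, window=200, check_interval=20,
                 min_len=2, max_len=50, min_repeats=3):
        self.check_interval = check_interval
        self.min_len = min_len
        self.max_len = max_len
        self.min_repeats = min_repeats
        self._tokens = deque(maxlen=window)
        self._since_check = 0

    def add(self, token):
        """Record a token; return True only on a check that finds a repeating tail."""
        self._tokens.append(token)
        self._since_check += 1
        if self._since_check < self.check_interval:
            return False            # between checks: just an append
        self._since_check = 0
        return self._has_repeating_tail()

    def _has_repeating_tail(self):
        toks = list(self._tokens)
        for n in range(self.min_len, self.max_len + 1):
            need = n * self.min_repeats
            if len(toks) < need:
                break               # longer patterns need even more tokens
            tail = toks[-need:]
            # The tail must be the same n-token pattern repeated min_repeats times.
            if all(tail[i] == tail[i % n] for i in range(need)):
                return True
        return False
```

With these defaults the longest check needs 50 × 3 = 150 tokens, which fits in the 200-token window. Checking only every 20 tokens trades detection latency (up to 19 extra tokens) for less work per generated token.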
@janhilgard
Collaborator Author

Closing in favor of #188, which has a cleaner architecture: a standalone, reusable RepetitionDetector class; a configurable window, pattern length, and check interval; an opt-in CLI flag; and wider detection (patterns up to 50 tokens and a 200-token window, versus 32 tokens here).

#188 currently covers SimpleEngine only. Happy to help integrate the same RepetitionDetector into BatchedEngine (which this PR covered) if that would be useful.

@janhilgard janhilgard closed this Mar 21, 2026