fix: Replace decode-based prefix matching with EOS-boundary splicing #1337
Conversation
…bustly preserve prior-turn tokens and prevent off-policy drift from retokenization
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
📝 Walkthrough

Replaced _maybe_correct_merged_tokens with _replace_prefix_tokens and updated all call sites. Adjusted chat preprocessing to align prompts at the last assistant turn. Added a vLLM async logger import and a filter to suppress “Added request” logs. Updated tests to reflect the new API and expanded edge-case coverage.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant OpenAI_Server as vLLM OpenAI Server
    participant Preprocess as _preprocess_chat
    participant Tokenizer
    participant Prefix as _replace_prefix_tokens
    participant Engine as Generation Engine
    Client->>OpenAI_Server: ChatCompletion request
    OpenAI_Server->>Preprocess: messages, tools
    Preprocess->>Preprocess: deepcopy messages, find last assistant turn
    Preprocess->>Tokenizer: tokenize model/template prefixes
    Tokenizer-->>Preprocess: token_ids
    Preprocess->>Prefix: model_prefix_ids, template_prefix_ids, template_ids
    Prefix-->>Preprocess: spliced token_ids
    Preprocess->>Engine: prompt with aligned tokens
    Engine-->>OpenAI_Server: completion stream/result
    OpenAI_Server-->>Client: response
    rect rgba(240,248,255,0.6)
        note right of Prefix: New/changed logic: EOS-based splice
    end
```

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant TokenizeServer as Tokenize Endpoint
    participant Logger as vLLM Async Logger
    Client->>TokenizeServer: Tokenize request
    TokenizeServer->>Logger: log request
    Note right of Logger: Filter suppresses "Added request"
    Logger-->>TokenizeServer: (filtered logs)
    TokenizeServer-->>Client: tokenization result
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings) | ✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (4)
nemo_rl/models/generation/vllm/vllm_worker_async.py (2)
38-121: Solid EOS-boundary splicing; add a small safety check and docstring polish.

The logic is sound and handles the trailing-EOS and “last EOS in prefix” cases well. Two small improvements:

- Guard against malformed inputs: ensure len(template_prefix_token_ids) <= len(template_token_ids).
- Fix the docstring typo (“uppdate” → “update”) and consider Google-style Args/Returns sections.

```diff
 def _replace_prefix_tokens(
     tokenizer,
     model_prefix_token_ids: list[int],
     template_prefix_token_ids: list[int],
     template_token_ids: list[int],
 ) -> list[int]:
@@
-    and image tokenization is non-unique, then we will need to uppdate this
+    and image tokenization is non-unique, then we will need to update this
     function.
@@
-    template_cut_start = -1
+    # Sanity check to prevent out-of-bounds (defensive)
+    assert len(template_prefix_token_ids) <= len(template_token_ids), (
+        "template_prefix_token_ids longer than template_token_ids"
+    )
+    template_cut_start = -1
```

As per coding guidelines.
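The splicing idea under review can be sketched as follows. This is a deliberately simplified, hypothetical version: it takes an eos_token_id directly instead of a tokenizer, and omits the trailing-EOS handling the real _replace_prefix_tokens performs.

```python
def replace_prefix_tokens_sketch(
    eos_token_id: int,
    model_prefix_token_ids: list[int],
    template_prefix_token_ids: list[int],
    template_token_ids: list[int],
) -> list[int]:
    """Splice the model's prior-turn tokens over the retokenized template prefix.

    Cut both sequences just after their last EOS token, keep the model's tokens
    before the cut (preserving on-policy tokens), and keep the freshly
    retokenized template tokens after it.
    """
    # Defensive: the prefix must be a leading slice of the full template.
    assert len(template_prefix_token_ids) <= len(template_token_ids)

    # Find the splice point: just after the last EOS inside the template prefix.
    template_cut = -1
    for i, tok in enumerate(template_prefix_token_ids):
        if tok == eos_token_id:
            template_cut = i + 1

    # Find the matching boundary in the model's own token stream.
    model_cut = -1
    for i, tok in enumerate(model_prefix_token_ids):
        if tok == eos_token_id:
            model_cut = i + 1

    if template_cut == -1 or model_cut == -1:
        # No EOS boundary to splice at: fall back to the retokenized template.
        return template_token_ids

    return model_prefix_token_ids[:model_cut] + template_token_ids[template_cut:]
```

Because EOS tokens mark turn boundaries, cutting at the last EOS keeps every completed prior turn byte-for-byte identical to what the model actually emitted, while only the in-progress suffix is retokenized.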
411-421: Prefer logger over print; ensure the filter attaches to the effective logger.

Use vllm_async_llm_logger (or its parent) for the “Adding a vLLM logging filter …” notice instead of print, and confirm the filter is added to the logger that actually emits the “Added request …” messages (handler attachment and propagation may differ).
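A minimal sketch of the suppression filter being discussed. The logger name here is an assumption for illustration; note that filters attached to a logger only apply to records logged directly on it, not to records propagated up from child loggers, so the filter may need to go on the handler instead.

```python
import logging


class DropAddedRequest(logging.Filter):
    """Drop vLLM's noisy 'Added request ...' log lines (illustrative name)."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False discards the record; everything else passes through.
        return "Added request" not in record.getMessage()


# Hypothetical logger name; attach to whichever logger actually emits the
# messages. If handlers live on a parent logger, add the filter to that
# handler, since logger-level filters do not see propagated records.
vllm_logger = logging.getLogger("vllm.v1.engine.async_llm")
vllm_logger.addFilter(DropAddedRequest())
```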
tests/unit/models/generation/test_vllm_generation.py (2)
1231-1395: Great coverage for _replace_prefix_tokens edge cases. Minor nit: avoid a hard-coded EOS ID.

The tests thoroughly cover EOS/no-EOS, multiple-EOS, and empty-model-prefix cases. Consider relaxing the explicit eos_token_id == 151645 assertion to just assert it is not None, to reduce brittleness across tokenizer variants of the same model.
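The suggested relaxation could look like this (the helper name is hypothetical; the point is asserting presence rather than pinning an exact token id):

```python
def assert_has_eos(tokenizer) -> None:
    """Check the tokenizer defines an EOS token without pinning its exact id.

    Pinning a literal id (e.g. 151645) breaks when tokenizer variants of the
    same model assign a different id to the EOS token.
    """
    assert tokenizer.eos_token_id is not None, "tokenizer must define an EOS token"
```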
1188-1211: Use an explicit /tokenize path to avoid '/../' traversal.

Building the tokenize URL by stripping “/v1” from base_urls[0] is a bit clearer than appending “…/../tokenize”:

```diff
-        response = requests.post(url=f"{base_urls[0]}/../tokenize", json=body)
+        base = base_urls[0].removesuffix("/v1")
+        response = requests.post(url=f"{base}/tokenize", json=body)
```
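One nicety of str.removesuffix (Python 3.9+): it strips the suffix only when present, so the rewrite stays safe even if a base URL were ever configured without the /v1 suffix. The URLs below are hypothetical examples:

```python
base_urls = ["http://localhost:8000/v1", "http://localhost:8001"]

# removesuffix is a no-op when the suffix is absent, so both forms are handled.
tokenize_urls = [u.removesuffix("/v1") + "/tokenize" for u in base_urls]
```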
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- nemo_rl/models/generation/vllm/vllm_worker_async.py (6 hunks)
- tests/unit/models/generation/test_vllm_generation.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts
Files:
- nemo_rl/models/generation/vllm/vllm_worker_async.py
- tests/unit/models/generation/test_vllm_generation.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)
Files:
nemo_rl/models/generation/vllm/vllm_worker_async.py
🧠 Learnings (1)
📚 Learning: 2025-09-10T05:29:34.349Z
Learnt from: bxyu-nvidia
PR: NVIDIA-NeMo/RL#1110
File: nemo_rl/models/generation/vllm/vllm_worker_async.py:98-105
Timestamp: 2025-09-10T05:29:34.349Z
Learning: In the _maybe_correct_merged_tokens function in nemo_rl/models/generation/vllm/vllm_worker_async.py, the loop condition `len(candidate_token_ids) < len(actual_token_ids) - 1` is intentionally designed to prevent accessing the final token in actual_token_ids, likely to handle specific tokenization edge cases in the vLLM HTTP server integration.
Applied to files:
nemo_rl/models/generation/vllm/vllm_worker_async.py
🧬 Code graph analysis (2)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
tests/unit/models/generation/test_vllm_generation.py (1)
tokenizer (239-242)
tests/unit/models/generation/test_vllm_generation.py (2)
nemo_rl/models/generation/vllm/vllm_worker_async.py (1)
_replace_prefix_tokens (38-121)

tests/unit/environments/test_code_environment.py (1)
tokenizer (85-94)
🪛 Ruff (0.13.3)
nemo_rl/models/generation/vllm/vllm_worker_async.py
281-281: Local variable actual_corresponding_token_ids is assigned to but never used
Remove assignment to unused variable actual_corresponding_token_ids
(F841)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint check
- GitHub Check: Post automodel integration comment / Comment on PR
- GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (1)
tests/unit/models/generation/test_vllm_generation.py (1)
35-37: Tests aligned with new API.

Import update to _replace_prefix_tokens looks good.
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
terrykong
left a comment
very clear docstring!
Signed-off-by: Parth Chadha <pchadha@nvidia.com>
…ing-improve Signed-off-by: Parth Chadha <pchadha@nvidia.com>
Force-pushed from 4448c60 to 78811ba
…1337) Signed-off-by: Parth Chadha <pchadha@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
```python
final_token_ids = _replace_prefix_tokens(
    tokenizer=tokenizer,
    model_prefix_token_ids=request.required_prefix_token_ids,
    template_prefix_token_ids=request.required_prefix_token_ids,
```
I think this should actually be actual_corresponding_token_ids
…1337) Signed-off-by: Parth Chadha <pchadha@nvidia.com> Signed-off-by: Lawrence Lane <llane@nvidia.com>
…VIDIA-NeMo#1337) Signed-off-by: Parth Chadha <pchadha@nvidia.com>
…VIDIA-NeMo#1337) Signed-off-by: Parth Chadha <pchadha@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Replace decode-based prefix matching with EOS-boundary splicing to robustly preserve prior-turn tokens and prevent off-policy drift from retokenization
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Issues
List issues that this PR closes (syntax):
Usage
```
# Add a code snippet demonstrating how to use this
```

Before your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
Bug Fixes
Refactor
Chores