fix: Anthropic streaming double-parsing + reasoning_content roundtrip #7358
MatejKosec merged 26 commits into main
Conversation
Walkthrough

These changes add IPv6 address handling to ZMQ endpoint formatting in a publisher module, and implement fallback mechanisms in LLM protocol stream handlers to surface reasoning content when token limits are reached without producing other text output.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/src/dynamo/sglang/publisher.py (1)
41-43: ⚠️ Potential issue | 🟡 Minor — Update stale docs/comments about helper ownership.
The docs still say this uses SGLang's maybe_wrap_ipv6_address, but the helper is now local in this module.

✏️ Suggested doc/comment fix:

```diff
 def format_zmq_endpoint(endpoint_template: str, ip_address: str) -> str:
     """Format ZMQ endpoint by replacing wildcard with IP address.

     Properly handles IPv6 addresses by wrapping them in square brackets.
-    Uses SGLang's maybe_wrap_ipv6_address for consistent formatting.
+    Uses the local IPv6 wrapping helper for consistent formatting.
@@
-    # Use SGLang's utility to wrap IPv6 addresses in brackets
+    # Use the local helper to wrap IPv6 addresses in brackets
     formatted_ip = maybe_wrap_ipv6_address(ip_address)
```

Also applies to: 57-57
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/src/dynamo/sglang/publisher.py` around lines 41-43, update the stale docs/comments that claim IPv6 wrapping uses SGLang's maybe_wrap_ipv6_address to instead state that the helper is implemented locally in this module (the local maybe_wrap_ipv6_address function); update any inline comments or docstrings near the publisher logic and the other occurrence around the second mention (line ~57) to reference the local helper name and remove references to SGLang so readers look for the function in this file.
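For orientation, a locally-defined IPv6 wrapping helper of the kind this review describes could look like the sketch below. This is illustrative only (the module's actual function body is not shown in the review); the standard-library `ipaddress` check and the wildcard replacement are assumptions.

```python
import ipaddress


def maybe_wrap_ipv6_address(ip_address: str) -> str:
    """Wrap a bare IPv6 address in square brackets for endpoint URLs."""
    try:
        if isinstance(ipaddress.ip_address(ip_address), ipaddress.IPv6Address):
            return f"[{ip_address}]"
    except ValueError:
        pass  # hostnames and already-bracketed addresses fall through
    return ip_address


def format_zmq_endpoint(endpoint_template: str, ip_address: str) -> str:
    """Replace the wildcard host in a ZMQ endpoint with a concrete IP."""
    return endpoint_template.replace("*", maybe_wrap_ipv6_address(ip_address))
```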
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@lib/llm/src/protocols/anthropic/stream_converter.rs`:
- Around line 363-393: The new fallback branch that emits a synthetic empty text
content block and a subsequent content_block_delta when thinking_block_started
&& !text_block_started && stop_reason == Some(AnthropicStopReason::MaxTokens)
changes the SSE shape and needs test coverage: update the tagged test helper
used by the Anthropic stream tests to recognize the new events emitted by
make_sse_event for AnthropicStreamEvent::ContentBlockStart and
AnthropicStreamEvent::ContentBlockDelta (with the "[Reasoning exceeded token
limit]" text), and add a unit test asserting that when thinking_block_started is
true, text_block_started is false, and stop_reason is MaxTokens the helper
captures both the content_block_start and content_block_delta in order
(including index handling via text_block_index/next_block_index). Ensure the
helper path/parser used in existing tagged tests is adjusted to accept this
two-event sequence so regressions will be caught.
In `@lib/llm/src/protocols/openai/chat_completions/aggregator.rs`:
- Around line 314-325: Add a unit test that exercises the reasoning→content
promotion: construct a delta where content is None, reasoning_content is
non-empty, and finish_reason is
dynamo_async_openai::types::FinishReason::Length, feed it through the same
aggregation path that contains the logic using reasoning_content and
ChatCompletionMessageContent::Text, and assert that the resulting message's
content is promoted to ChatCompletionMessageContent::Text with the reasoning
text (and that reasoning_content is cleared). Ensure the test initializes any
Aggregator/collector used by aggregator.rs so the branch with finish_reason ==
Length is executed.
---
Outside diff comments:
In `@components/src/dynamo/sglang/publisher.py`:
- Around line 41-43: Update the stale doc/comments that claim IPv6 wrapping uses
SGLang’s maybe_wrap_ipv6_address to instead state that the helper is implemented
locally in this module (the local maybe_wrap_ipv6_address function); update any
inline comments or docstrings near the publisher logic and the other occurrence
around the second mention (line ~57) to reference the local helper name and
remove references to SGLang so readers look for the function in this file.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0190c53f-8603-4be0-8c07-cfce1f1f0484
📒 Files selected for processing (3)
- components/src/dynamo/sglang/publisher.py
- lib/llm/src/protocols/anthropic/stream_converter.rs
- lib/llm/src/protocols/openai/chat_completions/aggregator.rs
Force-pushed: 8d57ae4 → fd8c366 → 508fcf5 → 2bb6430 → b850bea → f735dc8 → 0c6c720, then 8c82383 → 4b86cb3
…dary

The Anthropic endpoint's reasoning parser fails to transition from thinking to content in the streaming path, causing all output (including the actual response after </think>) to be classified as reasoning_content. This makes clients like Claude Code appear stuck in the 'Thinking...' state.

Root cause: the Anthropic-to-OpenAI request conversion sets chat_template_args to None, so the model's chat template never receives enable_thinking=true. Without this, reasoning models like Nemotron-3-Super generate plain text without <think>...</think> tags, and the force_reasoning parser classifies everything as reasoning.

Additionally, the Anthropic path passes thinking_enabled (from the request's thinking field) as prompt_injected_reasoning, which affects the streaming parser's stripped_think_start flag needed for correct </think> boundary detection.

Changes:
- types.rs: When the Anthropic request has thinking enabled, pass enable_thinking=true in chat_template_args so the model's chat template emits <think>...</think> tags
- anthropic.rs: Always pass prompt_injected_reasoning=true when a reasoning parser is configured, so the streaming parser's stripped_think_start flag is set correctly
- publisher.py: Inline maybe_wrap_ipv6_address (removed in sglang 0.6.0)
- 3 parser unit tests covering streaming thinking scenarios

Tested with Nemotron-3-Super-120B-A12B-FP8 on B200.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
The Anthropic handler applied parse_reasoning_content_from_stream() on top of the engine stream. But the engine pipeline already includes the OpenAI preprocessor (for ModelInput::Tokens backends), which applies reasoning parsing in its backward edge. The stream arriving at the Anthropic handler already has reasoning_content and content correctly split. The second parser, starting in force_reasoning=true mode, re-processed content chunks and misclassified them as reasoning because </think> was consumed by the first parser and never appears in the detokenized text.

Also:
- Forward chat_template_kwargs from the request to apply_chat_template() in InputParamManager (was silently dropped for the ModelInput::Text path)
- Set enable_thinking=true in chat_template_args when a reasoning parser is configured and thinking isn't explicitly disabled

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
The previous comment incorrectly stated that the non-streaming aggregator handles reasoning parsing for the ModelInput::Text path. In fact, DeltaAggregator::apply() ignores parsing_options entirely. Document this as a known gap affecting all streaming handlers equally.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
The skip-when-reasoning_content-is-set logic caused DeepSeek v3 test failures: when the backend sets reasoning_content AND leaves reasoning text in content (without <think> tags), the skip+strip approach leaked reasoning into content.

This logic is no longer needed — the Anthropic handler no longer applies a second reasoning parser, so there is no double-parsing to guard against in the preprocessor.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
…template

Unpacking user-supplied chat_template_args/kwargs alongside explicit tokenize=False and add_generation_prompt=True could raise TypeError if either key was present in the user dict. Strip them defensively.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
- Revert publisher.py changes (PR #6736 handles the SGLang compat)
- Unify /// doc comments to // regular comments in reasoning parser tests

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Chat templates only reference {{ message.content }} — they never look at
reasoning_content. When a model generates <think>reasoning</think> followed
by tool calls, the reasoning is parsed into reasoning_content on the
response. But when the client sends it back for the next turn, the Jinja
template ignores reasoning_content and the model never sees its own prior
chain-of-thought.
Before template rendering, convert reasoning_content back into <think>
blocks inside the content field. For Segments (interleaved reasoning),
each non-empty segment is individually wrapped. For Text (flat string),
a single <think> block wraps the entire reasoning.
Guarded by enable_thinking in chat_template_args to avoid injecting
<think> tokens for non-reasoning models.
Signed-off-by: Matej Kosec <mkosec@nvidia.com>
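The wrapping step described in this commit can be sketched as a standalone function. This is an illustrative Python version only, with an assumed message shape (the actual implementations described later live in oai.rs and input_params.py); it wraps Text reasoning in one <think> block, wraps each Segments entry individually, and prepends a text part for multimodal content arrays.

```python
def inject_reasoning_content(message: dict) -> dict:
    """Fold an assistant message's reasoning_content back into content as
    <think> blocks so the chat template can render prior reasoning."""
    reasoning = message.get("reasoning_content")
    if message.get("role") != "assistant" or not reasoning:
        return message  # only assistant turns with reasoning are touched
    msg = dict(message)
    if isinstance(reasoning, list):
        # Segments variant: wrap each non-empty segment individually
        think = "".join(f"<think>{seg}</think>" for seg in reasoning if seg)
    else:
        # Text variant: a single block wraps the whole reasoning string
        think = f"<think>{reasoning}</think>"
    content = msg.get("content")
    if isinstance(content, list):
        # multimodal array: prepend as a text part instead of replacing
        msg["content"] = [{"type": "text", "text": think}] + content
    else:
        msg["content"] = think + (content or "")
    del msg["reasoning_content"]  # remove the field after injection
    return msg
```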
Signed-off-by: Matej Kosec <mkosec@nvidia.com>
The enable_thinking flag is only set on the Anthropic path (by anthropic.rs). On the /v1/chat/completions path, clients don't set it and the preprocessor doesn't add it before render() runs, so the injection was silently skipped.

The presence of reasoning_content on an assistant message is itself a sufficient signal — a non-reasoning model would never produce it. This matches the DeepSeek V3.2 reference implementation, which unconditionally processes reasoning_content without any flag checks.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
…disabled

Unconditional injection is too permissive — if a client replays old messages but explicitly disables thinking, <think> tags would confuse the model. Invert the check: inject by default, skip only when enable_thinking is explicitly false.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Revert the enable_thinking gate — enable_thinking controls output (whether the model generates new reasoning), not input. Prior reasoning should always be visible in the prompt regardless of the current turn's thinking mode.

Add an end-to-end test that constructs a multi-turn conversation with reasoning_content, renders it through HfTokenizerConfigJsonFormatter, and verifies the <think> block appears in the final prompt text.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Revert the enable_thinking gate — enable_thinking controls output (whether the model generates new reasoning), not input. Prior reasoning should always be visible regardless of the current turn's thinking mode.

Replace the simple render test with two end-to-end roundtrip tests:
1. Text variant: assistant with reasoning_content + content
2. Agentic flow: assistant reasons → tool_call → tool result → assistant reasons again → final answer. Verifies both reasoning turns survive into the rendered prompt.

Both tests go through HfTokenizerConfigJsonFormatter::render() with a real NvCreateChatCompletionRequest, exercising the full pipeline.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
When content is an array (multimodal), prepend reasoning as a text part instead of replacing the entire array. Prevents silently dropping multimodal content on assistant messages that also have reasoning.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Signed-off-by: Matej Kosec <mkosec@nvidia.com>
The oai.rs injection only runs for ModelInput::Tokens (preprocessor in the pipeline). For ModelInput::Text, the worker's input_params.py calls apply_chat_template directly — and reasoning_content was invisible to the template.

Add _inject_reasoning_content() to input_params.py, called before apply_chat_template. Same logic as the Rust version: wraps reasoning in <think> blocks, handles Text/Segments/multimodal, removes the field after injection.

This fixes the /v1/messages path where thinking blocks were being dropped (verified: input_tokens was identical regardless of thinking content).

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
8 tests covering: text variant, segments variant, null content, absent content, multimodal arrays, non-assistant skip, empty reasoning skip, and full agentic multi-turn flow with tool calls. Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Some chat templates (Nemotron, Qwen3) natively reference reasoning_content in their Jinja logic. Injecting <think> blocks AND letting the template render reasoning_content produces duplicate tags.

Detect at model load time whether the template source contains "reasoning_content". If it does, skip injection on both paths:
- Rust (oai.rs): check self.template_handles_reasoning
- Python (input_params.py): check the tokenizer.chat_template source

Add tests for both cases:
- Template without reasoning_content → injection happens
- Template with reasoning_content → injection skipped, the template renders natively, exactly one <think> block in output

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Nemotron (and likely other) chat templates default truncate_history_thinking to true, which strips <think> content from all assistant turns before the last user message. This means the model never sees its own prior reasoning — breaking multi-turn agentic flows where the model needs context on why it made prior decisions.

Set truncate_history_thinking=false in chat_template_args alongside enable_thinking when a reasoning parser is configured. This flows through to the Jinja template on both ModelInput paths:
- Rust: merged into the template context in oai.rs render()
- Python: forwarded as extra_kwargs in input_params.py

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Pre-existing code that breaks under the Docker build toolchain. v.len() returns Option<usize> on minijinja::Value, and the and_then + is_some_and chain was fragile. Use map_or for clarity.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
…me_and"

Pre-existing code; it shouldn't be touched in this PR.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
…y_thinking

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
The rebase conflict resolution left stale changes in publisher.py that reverted _compat imports back to sglang.srt.utils. Reset to main's version since this PR doesn't touch publisher.py.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Summary
Two related fixes for reasoning model support:
1. Double-parsing fix: The Anthropic /v1/messages streaming endpoint was double-parsing reasoning content, causing all model output to be classified as reasoning_content with no content — making thinking models like Nemotron-3-Super unusable via Claude Code and OpenClaw.

2. Reasoning roundtrip fix: reasoning_content from prior assistant turns was silently dropped from the prompt. Chat templates only reference {{ message.content }} — they don't know about reasoning_content. The model never saw its own prior chain-of-thought across turns, degrading multi-turn reasoning quality and breaking KV cache prefix reuse.

Root Cause (Double-parsing)

anthropic.rs applied a second parse_reasoning_content_from_stream() on the engine stream. But the engine pipeline already includes the OpenAI preprocessor (for ModelInput::Tokens backends), which applies reasoning parsing in its backward edge. The stream arriving at the Anthropic handler already has reasoning_content and content correctly split.

The second parser (with force_reasoning=true) re-classified post-think content chunks as reasoning because the </think> boundary was already consumed by the first parser and no longer appears in the detokenized text.

Root Cause (Reasoning roundtrip)

After serialization to JSON, reasoning_content is present on assistant messages, but Jinja chat templates never reference it. Before this fix, the field was carried through but ignored during template rendering. Verified by comparing prompt_tokens — mutating reasoning_content produced identical token counts on both the /v1/chat/completions and /v1/messages paths.

Changes
1. anthropic.rs — Remove the redundant parse_reasoning_content_from_stream call. When a reasoning parser is configured and thinking isn't explicitly disabled, set enable_thinking=true in chat_template_args and infer prompt_injected_reasoning=true so the preprocessor's parser starts in the correct mode.

2. types.rs — During Anthropic→OpenAI request conversion, forward enable_thinking=true in chat_template_args when the Anthropic request has thinking explicitly enabled. The handler (anthropic.rs) may augment this further based on parsing options.

3. input_params.py — Forward chat_template_kwargs/chat_template_args to tokenizer.apply_chat_template(). Previously these were silently dropped, breaking per-request thinking control (e.g. enable_thinking=false) on the ModelInput::Text path.

4. oai.rs — Before Jinja template rendering, inject reasoning_content back into the content field as <think> blocks. Handles both Text (flat string) and Segments (interleaved reasoning) variants. Runs unconditionally — enable_thinking controls output (whether the model generates new reasoning), not input (prior reasoning should always be visible).

5. reasoning/mod.rs — Add 3 streaming reasoning parser unit tests covering: the normal set_in_reasoning flow, force_reasoning without set_in_reasoning (the Anthropic path bug scenario), and </think> split across token boundaries.

Test Plan
- inject_reasoning_content unit tests (segments, text, null content, non-assistant skip)
- Roundtrip with reasoning_content → verifies the <think> block appears in the rendered prompt via HfTokenizerConfigJsonFormatter::render()
- Streaming emits thinking_delta + text_delta (was 0 text_delta before the fix)
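The "</think> split across token boundaries" parser test covers the case where the closing tag straddles two streamed chunks. As an illustrative sketch of that technique (not the repo's Rust parser in reasoning/mod.rs), a chunk-wise splitter can hold back any buffer suffix that could be the start of the closing tag:

```python
END = "</think>"


def split_reasoning_stream(chunks):
    """Classify streamed text into (reasoning, content) when generation
    starts inside a <think> block and the closing tag may be split
    across chunk boundaries."""
    reasoning, content, buf, in_reasoning = [], [], "", True
    for chunk in chunks:
        if not in_reasoning:
            content.append(chunk)
            continue
        buf += chunk
        if END in buf:
            pre, post = buf.split(END, 1)
            reasoning.append(pre)
            content.append(post)
            buf, in_reasoning = "", False
        else:
            # hold back the longest suffix that could begin a split tag
            keep = next((i for i in range(min(len(END) - 1, len(buf)), 0, -1)
                         if END.startswith(buf[-i:])), 0)
            if keep:
                reasoning.append(buf[:-keep])
                buf = buf[-keep:]
            else:
                reasoning.append(buf)
                buf = ""
    if buf:  # stream ended inside the thinking block
        reasoning.append(buf)
    return "".join(reasoning), "".join(content)
```

Feeding it a stream whose `</think>` is broken across chunks still yields a clean reasoning/content split, which is exactly the regression the added unit test guards.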