
fix: Anthropic streaming double-parsing + reasoning_content roundtrip#7358

Merged
MatejKosec merged 26 commits into main from user/mkosec/reasoning-fallback-nemotron
Mar 30, 2026

Conversation

@MatejKosec
Contributor

@MatejKosec MatejKosec commented Mar 13, 2026

Summary

Two related fixes for reasoning model support:

  1. Double-parsing fix: The Anthropic /v1/messages streaming endpoint was double-parsing reasoning content, causing all model output to be classified as reasoning_content with no content — making thinking models like Nemotron-3-Super unusable via Claude Code and OpenClaw.

  2. Reasoning roundtrip fix: reasoning_content from prior assistant turns was silently dropped from the prompt. Chat templates only reference {{ message.content }} — they don't know about reasoning_content. The model never saw its own prior chain-of-thought across turns, degrading multi-turn reasoning quality and breaking KV cache prefix reuse.

Root Cause (Double-parsing)

anthropic.rs applied a second parse_reasoning_content_from_stream() on the engine stream. But the engine pipeline already includes the OpenAI preprocessor (for ModelInput::Tokens backends), which applies reasoning parsing in its backward edge. The stream arriving at the Anthropic handler already has reasoning_content and content correctly split.

The second parser (with force_reasoning=true) re-classified post-think content chunks as reasoning because the </think> boundary was already consumed by the first parser and no longer appears in the detokenized text.
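The misclassification can be sketched in a few lines of Python. This is an illustrative stand-in, not the actual Rust parser: a streaming splitter that starts inside a reasoning block (the `force_reasoning` mode) and only exits when it sees `</think>` in the incoming text. Note the real parser also handles `</think>` split across token boundaries, which this sketch omits.

```python
def split_stream(chunks, force_reasoning=True, end_tag="</think>"):
    """Split a chunk stream into (reasoning, content), exiting reasoning
    mode only when end_tag is observed. Illustrative sketch only."""
    in_reasoning = force_reasoning
    reasoning, content = [], []
    for chunk in chunks:
        if in_reasoning and end_tag in chunk:
            before, _, after = chunk.partition(end_tag)
            reasoning.append(before)
            content.append(after)
            in_reasoning = False
        elif in_reasoning:
            reasoning.append(chunk)
        else:
            content.append(chunk)
    return "".join(reasoning), "".join(content)

# First parser (in the preprocessor) sees the raw stream and splits correctly:
r1, c1 = split_stream(["I should", " check.</think>", "The answer is 4."])
# r1 == "I should check.", c1 == "The answer is 4."

# A second parser applied to the already-split content stream never sees
# </think>, so with force_reasoning=True everything lands in reasoning:
r2, c2 = split_stream(["The answer is 4."])
# r2 == "The answer is 4.", c2 == ""
```

This is exactly the failure mode: the boundary token was consumed upstream, so the downstream parser has no signal to ever transition out of reasoning mode.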

Root Cause (Reasoning roundtrip)

After serialization to JSON, reasoning_content is present on assistant messages but Jinja chat templates never reference it. Before this fix, the field was carried through but ignored during template rendering. Verified by comparing prompt_tokens — mutating reasoning_content produced identical token counts on both /v1/chat/completions and /v1/messages paths.
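A minimal illustration of the symptom (a toy renderer, not a real chat template): like the Jinja templates in question, it reads only `message["content"]` and never touches `reasoning_content`, so mutating the latter cannot change the rendered prompt.

```python
def render(messages):
    """Toy template that, like the real Jinja templates, only reads content."""
    return "".join(f"<|{m['role']}|>{m['content']}\n" for m in messages)

base = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4", "reasoning_content": "2+2 is 4."},
]
mutated = [dict(base[0]), {**base[1], "reasoning_content": "totally different"}]

# Identical output regardless of reasoning_content -- the same symptom
# observed as identical prompt_tokens counts on both endpoints.
assert render(base) == render(mutated)
```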

Changes

1. anthropic.rs — Remove the redundant parse_reasoning_content_from_stream call. When a reasoning parser is configured and thinking isn't explicitly disabled, set enable_thinking=true in chat_template_args and infer prompt_injected_reasoning=true so the preprocessor's parser starts in the correct mode.

2. types.rs — During Anthropic→OpenAI request conversion, forward enable_thinking=true in chat_template_args when the Anthropic request has thinking explicitly enabled. The handler (anthropic.rs) may augment this further based on parsing options.

3. input_params.py — Forward chat_template_kwargs / chat_template_args to tokenizer.apply_chat_template(). Previously these were silently dropped, breaking per-request thinking control (e.g. enable_thinking=false) on the ModelInput::Text path.

4. oai.rs — Before Jinja template rendering, inject reasoning_content back into the content field as <think> blocks. Handles both Text (flat string) and Segments (interleaved reasoning) variants. Runs unconditionally — enable_thinking controls output (whether the model generates new reasoning), not input (prior reasoning should always be visible).

5. reasoning/mod.rs — Add 3 streaming reasoning parser unit tests covering: normal set_in_reasoning flow, force_reasoning without set_in_reasoning (the Anthropic path bug scenario), and </think> split across token boundaries.
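The injection step in change 4 (and its Python twin in input_params.py) can be sketched as follows. Function and field names mirror the OpenAI message shape; the helper name here is illustrative, and the actual implementations live in oai.rs and input_params.py.

```python
def inject_reasoning_content(message: dict) -> dict:
    """Fold reasoning_content back into content as <think> blocks before
    template rendering, then drop the field. Sketch of the PR's logic."""
    if message.get("role") != "assistant":
        return message
    reasoning = message.pop("reasoning_content", None)
    if not reasoning:
        return message
    if isinstance(reasoning, list):
        # "Segments" variant: wrap each non-empty segment individually.
        think = "".join(f"<think>{seg}</think>" for seg in reasoning if seg)
    else:
        # "Text" variant: one flat block wraps the entire reasoning.
        think = f"<think>{reasoning}</think>"
    content = message.get("content")
    if isinstance(content, list):
        # Multimodal array: prepend as a text part instead of replacing.
        message["content"] = [{"type": "text", "text": think}] + content
    else:
        # Flat string or null content.
        message["content"] = think + (content or "")
    return message
```

Running it on a prior assistant turn yields `<think>...</think>` prepended to the visible answer, which is what the Jinja template then renders into the prompt.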

Test Plan

  • 83 reasoning parser unit tests pass (+ 3 new streaming parser tests)
  • 4 new inject_reasoning_content unit tests (segments, text, null content, non-assistant skip)
  • 1 new end-to-end render test: multi-turn conversation with reasoning_content → verifies <think> block appears in rendered prompt via HfTokenizerConfigJsonFormatter::render()
  • Nemotron-3-Super-120B-A12B-FP8 on B200: Anthropic streaming produces thinking_delta + text_delta (was 0 text_delta before fix)
  • Verified on vLLM FP8 and vLLM NVFP4 backends
  • Confirmed working with Claude Code via Anthropic API
  • Qwen3.5-35B-A3B-FP8 on H100 (SGLang, qwen3 parser): 49 thinking_delta + 1 text_delta ✅

@MatejKosec MatejKosec requested review from a team as code owners March 13, 2026 21:22
@github-actions github-actions bot added fix backend::sglang Relates to the sglang backend frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` labels Mar 13, 2026
@coderabbitai
Contributor

coderabbitai bot commented Mar 13, 2026

Walkthrough

These changes add IPv6 address handling to ZMQ endpoint formatting in a publisher module, and implement fallback mechanisms in LLM protocol stream handlers to surface reasoning content when token limits are reached without producing other text output.

Changes

Cohort / File(s) Summary
IPv6 Endpoint Formatting
components/src/dynamo/sglang/publisher.py
Added maybe_wrap_ipv6_address() utility function to detect and wrap IPv6 addresses in brackets, with updated format_zmq_endpoint() to use it for consistent endpoint formatting.
Anthropic Protocol Reasoning Fallback
lib/llm/src/protocols/anthropic/stream_converter.rs
Added fallback path in emit_end_events() to emit empty text content block and truncation delta when reasoning block started but reached token limit without producing text, preventing UI stalls.
OpenAI Protocol Reasoning Content Promotion
lib/llm/src/protocols/openai/chat_completions/aggregator.rs
Added logic in From<DeltaChoice> to promote reasoning_content to content field when finish_reason indicates token exhaustion and no other content exists, ensuring reasoning segments surface as text content.
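The IPv6 helper described in the first cohort is straightforward to sketch; this version uses the standard-library `ipaddress` module and is only an approximation of the helper in components/src/dynamo/sglang/publisher.py.

```python
import ipaddress

def maybe_wrap_ipv6_address(addr: str) -> str:
    """Wrap bare IPv6 literals in brackets for host:port endpoints;
    leave IPv4 addresses and hostnames untouched."""
    try:
        if ipaddress.ip_address(addr).version == 6:
            return f"[{addr}]"
    except ValueError:
        pass  # hostname or wildcard, not a literal IP address
    return addr

def format_zmq_endpoint(endpoint_template: str, ip_address: str) -> str:
    """Replace the wildcard in a ZMQ endpoint template with a usable address.
    e.g. "tcp://*:5555" with "::1" becomes "tcp://[::1]:5555"."""
    return endpoint_template.replace("*", maybe_wrap_ipv6_address(ip_address), 1)
```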

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A rabbit hops through IPv6 brackets bright,
While reasoning tokens peek through the night,
When limits are reached and text takes its bow,
The thinking shines through—no stalling now!
Protocols flourish with fallbacks so right. ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Title check ✅ Passed The title mentions 'promote reasoning to content' and 'exhausts token budget', accurately reflecting changes in multiple files that handle this scenario.
Description check ✅ Passed The description provides comprehensive overview, details, reviewer guidance, and test plan, but is missing explicit 'Related Issues' section with issue numbers as required by template.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/src/dynamo/sglang/publisher.py (1)

41-43: ⚠️ Potential issue | 🟡 Minor

Update stale docs/comments about helper ownership.

The docs still say this uses SGLang’s maybe_wrap_ipv6_address, but the helper is now local in this module.

✏️ Suggested doc/comment fix
 def format_zmq_endpoint(endpoint_template: str, ip_address: str) -> str:
     """Format ZMQ endpoint by replacing wildcard with IP address.

     Properly handles IPv6 addresses by wrapping them in square brackets.
-    Uses SGLang's maybe_wrap_ipv6_address for consistent formatting.
+    Uses local IPv6 wrapping helper for consistent formatting.

@@
-    # Use SGLang's utility to wrap IPv6 addresses in brackets
+    # Use local helper to wrap IPv6 addresses in brackets
     formatted_ip = maybe_wrap_ipv6_address(ip_address)

Also applies to: 57-57

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/sglang/publisher.py` around lines 41 - 43, Update the
stale doc/comments that claim IPv6 wrapping uses SGLang’s
maybe_wrap_ipv6_address to instead state that the helper is implemented locally
in this module (the local maybe_wrap_ipv6_address function); update any inline
comments or docstrings near the publisher logic and the other occurrence around
the second mention (line ~57) to reference the local helper name and remove
references to SGLang so readers look for the function in this file.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lib/llm/src/protocols/anthropic/stream_converter.rs`:
- Around line 363-393: The new fallback branch that emits a synthetic empty text
content block and a subsequent content_block_delta when thinking_block_started
&& !text_block_started && stop_reason == Some(AnthropicStopReason::MaxTokens)
changes the SSE shape and needs test coverage: update the tagged test helper
used by the Anthropic stream tests to recognize the new events emitted by
make_sse_event for AnthropicStreamEvent::ContentBlockStart and
AnthropicStreamEvent::ContentBlockDelta (with the "[Reasoning exceeded token
limit]" text), and add a unit test asserting that when thinking_block_started is
true, text_block_started is false, and stop_reason is MaxTokens the helper
captures both the content_block_start and content_block_delta in order
(including index handling via text_block_index/next_block_index). Ensure the
helper path/parser used in existing tagged tests is adjusted to accept this
two-event sequence so regressions will be caught.

In `@lib/llm/src/protocols/openai/chat_completions/aggregator.rs`:
- Around line 314-325: Add a unit test that exercises the reasoning→content
promotion: construct a delta where content is None, reasoning_content is
non-empty, and finish_reason is
dynamo_async_openai::types::FinishReason::Length, feed it through the same
aggregation path that contains the logic using reasoning_content and
ChatCompletionMessageContent::Text, and assert that the resulting message's
content is promoted to ChatCompletionMessageContent::Text with the reasoning
text (and that reasoning_content is cleared). Ensure the test initializes any
Aggregator/collector used by aggregator.rs so the branch with finish_reason ==
Length is executed.

---

Outside diff comments:
In `@components/src/dynamo/sglang/publisher.py`:
- Around line 41-43: Update the stale doc/comments that claim IPv6 wrapping uses
SGLang’s maybe_wrap_ipv6_address to instead state that the helper is implemented
locally in this module (the local maybe_wrap_ipv6_address function); update any
inline comments or docstrings near the publisher logic and the other occurrence
around the second mention (line ~57) to reference the local helper name and
remove references to SGLang so readers look for the function in this file.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0190c53f-8603-4be0-8c07-cfce1f1f0484

📥 Commits

Reviewing files that changed from the base of the PR and between 33604af and 5708908.

📒 Files selected for processing (3)
  • components/src/dynamo/sglang/publisher.py
  • lib/llm/src/protocols/anthropic/stream_converter.rs
  • lib/llm/src/protocols/openai/chat_completions/aggregator.rs

@MatejKosec MatejKosec force-pushed the user/mkosec/reasoning-fallback-nemotron branch from 8d57ae4 to fd8c366 Compare March 14, 2026 23:03
@MatejKosec MatejKosec changed the title fix: promote reasoning to content when model exhausts token budget fix: Anthropic streaming reasoning parser not detecting </think> boundary Mar 14, 2026
@MatejKosec MatejKosec force-pushed the user/mkosec/reasoning-fallback-nemotron branch from fd8c366 to 508fcf5 Compare March 14, 2026 23:34
@pull-request-size pull-request-size bot added size/L and removed size/M labels Mar 14, 2026
@MatejKosec MatejKosec force-pushed the user/mkosec/reasoning-fallback-nemotron branch from 508fcf5 to 2bb6430 Compare March 16, 2026 00:51
@MatejKosec MatejKosec force-pushed the user/mkosec/reasoning-fallback-nemotron branch from 2bb6430 to b850bea Compare March 16, 2026 01:19
@MatejKosec MatejKosec force-pushed the user/mkosec/reasoning-fallback-nemotron branch from b850bea to f735dc8 Compare March 16, 2026 03:51
@MatejKosec MatejKosec force-pushed the user/mkosec/reasoning-fallback-nemotron branch from f735dc8 to 0c6c720 Compare March 16, 2026 05:28
…dary

The Anthropic endpoint's reasoning parser fails to transition from thinking
to content in the streaming path, causing all output (including the actual
response after </think>) to be classified as reasoning_content. This makes
clients like Claude Code appear stuck in 'Thinking...' state.

Root cause: the Anthropic-to-OpenAI request conversion sets
chat_template_args to None, so the model's chat template never receives
enable_thinking=true. Without this, reasoning models like Nemotron-3-Super
generate plain text without <think>...</think> tags, and the
force_reasoning parser classifies everything as reasoning.

Additionally, the Anthropic path passes thinking_enabled (from the
request's thinking field) as prompt_injected_reasoning, which affects
the streaming parser's stripped_think_start flag needed for correct
</think> boundary detection.

Changes:
- types.rs: When Anthropic request has thinking enabled, pass
  enable_thinking=true in chat_template_args so the model's chat template
  emits <think>...</think> tags
- anthropic.rs: Always pass prompt_injected_reasoning=true when a
  reasoning parser is configured, so the streaming parser's
  stripped_think_start flag is set correctly
- publisher.py: Inline maybe_wrap_ipv6_address (removed in sglang 0.6.0)
- 3 parser unit tests covering streaming thinking scenarios

Tested with Nemotron-3-Super-120B-A12B-FP8 on B200.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
The Anthropic handler applied parse_reasoning_content_from_stream() on
top of the engine stream. But the engine pipeline already includes the
OpenAI preprocessor (for ModelInput::Tokens backends), which applies
reasoning parsing in its backward edge. The stream arriving at the
Anthropic handler already has reasoning_content and content correctly
split.

The second parser, starting in force_reasoning=true mode, re-processed
content chunks and misclassified them as reasoning because </think> was
consumed by the first parser and never appears in detokenized text.

Also:
- Forward chat_template_kwargs from request to apply_chat_template()
  in InputParamManager (was silently dropped for ModelInput::Text path)
- Set enable_thinking=true in chat_template_args when reasoning parser
  is configured and thinking isn't explicitly disabled

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
The previous comment incorrectly stated the non-streaming aggregator
handles reasoning parsing for the ModelInput::Text path. In fact,
DeltaAggregator::apply() ignores parsing_options entirely. Document
this as a known gap affecting all streaming handlers equally.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
The skip-when-reasoning_content-is-set logic caused DeepSeek v3 test
failures: when the backend sets reasoning_content AND leaves reasoning
text in content (without <think> tags), the skip+strip approach leaked
reasoning into content.

This logic is no longer needed — the Anthropic handler no longer applies
a second reasoning parser, so there is no double-parsing to guard against
in the preprocessor.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
…template

Unpacking user-supplied chat_template_args/kwargs alongside explicit
tokenize=False and add_generation_prompt=True could raise TypeError if
either key was present in the user dict. Strip them defensively.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
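The TypeError described above is a plain Python keyword-collision: unpacking a user dict that contains `tokenize` or `add_generation_prompt` next to explicit keyword arguments fails. A stand-in function (not the real `tokenizer.apply_chat_template`) shows both the failure and the defensive strip:

```python
def apply_chat_template(messages, tokenize=True, add_generation_prompt=False, **kw):
    """Stand-in for tokenizer.apply_chat_template, to show the collision."""
    return (tokenize, add_generation_prompt)

user_kwargs = {"enable_thinking": False, "tokenize": True}

try:
    # Collides: tokenize appears both explicitly and inside **user_kwargs.
    apply_chat_template([], tokenize=False, add_generation_prompt=True, **user_kwargs)
except TypeError:
    pass  # "got multiple values for keyword argument 'tokenize'"

# Defensive strip before unpacking, as the commit describes:
safe = {k: v for k, v in user_kwargs.items()
        if k not in ("tokenize", "add_generation_prompt")}
result = apply_chat_template([], tokenize=False, add_generation_prompt=True, **safe)
assert result == (False, True)
```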
- Revert publisher.py changes (PR #6736 handles the SGLang compat)
- Unify /// doc comments to // regular comments in reasoning parser tests

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Chat templates only reference {{ message.content }} — they never look at
reasoning_content. When a model generates <think>reasoning</think> followed
by tool calls, the reasoning is parsed into reasoning_content on the
response. But when the client sends it back for the next turn, the Jinja
template ignores reasoning_content and the model never sees its own prior
chain-of-thought.

Before template rendering, convert reasoning_content back into <think>
blocks inside the content field. For Segments (interleaved reasoning),
each non-empty segment is individually wrapped. For Text (flat string),
a single <think> block wraps the entire reasoning.

Guarded by enable_thinking in chat_template_args to avoid injecting
<think> tokens for non-reasoning models.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Signed-off-by: Matej Kosec <mkosec@nvidia.com>
The enable_thinking flag is only set on the Anthropic path (by
anthropic.rs). On the /v1/chat/completions path, clients don't set it
and the preprocessor doesn't add it before render() runs, so the
injection was silently skipped.

The presence of reasoning_content on an assistant message is itself
sufficient signal — a non-reasoning model would never produce it. This
matches the DeepSeek V3.2 reference implementation which unconditionally
processes reasoning_content without any flag checks.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
…disabled

Unconditional injection is too permissive — if a client replays old
messages but explicitly disables thinking, <think> tags would confuse
the model. Invert the check: inject by default, skip only when
enable_thinking is explicitly false.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Revert the enable_thinking gate — enable_thinking controls output
(whether the model generates new reasoning), not input. Prior reasoning
should always be visible in the prompt regardless of the current turn's
thinking mode.

Add end-to-end test that constructs a multi-turn conversation with
reasoning_content, renders it through HfTokenizerConfigJsonFormatter,
and verifies the <think> block appears in the final prompt text.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Revert the enable_thinking gate — enable_thinking controls output
(whether the model generates new reasoning), not input. Prior reasoning
should always be visible regardless of the current turn's thinking mode.

Replace the simple render test with two end-to-end roundtrip tests:
1. Text variant: assistant with reasoning_content + content
2. Agentic flow: assistant reasons → tool_call → tool result →
   assistant reasons again → final answer. Verifies both reasoning
   turns survive into the rendered prompt.

Both tests go through HfTokenizerConfigJsonFormatter::render() with a
real NvCreateChatCompletionRequest, exercising the full pipeline.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
When content is an array (multimodal), prepend reasoning as a text part
instead of replacing the entire array. Prevents silently dropping
multimodal content on assistant messages that also have reasoning.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Signed-off-by: Matej Kosec <mkosec@nvidia.com>
The oai.rs injection only runs for ModelInput::Tokens (preprocessor in
the pipeline). For ModelInput::Text, the worker's input_params.py calls
apply_chat_template directly — and reasoning_content was invisible to
the template.

Add _inject_reasoning_content() to input_params.py, called before
apply_chat_template. Same logic as the Rust version: wraps reasoning in
<think> blocks, handles Text/Segments/multimodal, removes the field
after injection.

This fixes the /v1/messages path where thinking blocks were being
dropped (verified: input_tokens was identical regardless of thinking
content).

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
8 tests covering: text variant, segments variant, null content, absent
content, multimodal arrays, non-assistant skip, empty reasoning skip,
and full agentic multi-turn flow with tool calls.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Some chat templates (Nemotron, Qwen3) natively reference
reasoning_content in their Jinja logic. Injecting <think> blocks AND
letting the template render reasoning_content produces duplicate tags.

Detect at model load time whether the template source contains
"reasoning_content". If it does, skip injection on both paths:
- Rust (oai.rs): check self.template_handles_reasoning
- Python (input_params.py): check tokenizer.chat_template source

Add tests for both cases:
- Template without reasoning_content → injection happens
- Template with reasoning_content → injection skipped, template
  renders natively, exactly one <think> block in output

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
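The load-time check described above amounts to a substring probe on the Jinja template source; a hedged sketch (function name illustrative, templates abbreviated):

```python
def template_handles_reasoning(template_source):
    """Return True if the chat template natively references
    reasoning_content, in which case <think> injection is skipped
    to avoid duplicate tags."""
    return "reasoning_content" in (template_source or "")

# Abbreviated template shapes for illustration:
nemotron_like = (
    "{% if message.reasoning_content %}<think>{{ message.reasoning_content }}"
    "</think>{% endif %}{{ message.content }}"
)
plain = "{{ message.content }}"

assert template_handles_reasoning(nemotron_like)   # injection skipped
assert not template_handles_reasoning(plain)       # injection happens
assert not template_handles_reasoning(None)        # no template at all
```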
Nemotron (and likely other) chat templates default
truncate_history_thinking to true, which strips <think> content from
all assistant turns before the last user message. This means the model
never sees its own prior reasoning — breaking multi-turn agentic flows
where the model needs context on why it made prior decisions.

Set truncate_history_thinking=false in chat_template_args alongside
enable_thinking when a reasoning parser is configured. This flows
through to the Jinja template on both ModelInput paths:
- Rust: merged into template context in oai.rs render()
- Python: forwarded as extra_kwargs in input_params.py

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Pre-existing code that breaks under the Docker build toolchain.
v.len() returns Option<usize> on minijinja::Value, and_then + is_some_and
chain was fragile. Use map_or for clarity.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
…me_and"

Pre-existing code, shouldn't be touched in this PR.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
…y_thinking

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
The rebase conflict resolution left stale changes in publisher.py
that reverted _compat imports back to sglang.srt.utils. Reset to
main's version since this PR doesn't touch publisher.py.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Signed-off-by: Matej Kosec <mkosec@nvidia.com>

Labels

backend::sglang Relates to the sglang backend fix frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` size/XL
