
fix: Anthropic streaming double-parsing + reasoning_content roundtrip#7358

Merged
MatejKosec merged 26 commits into main from user/mkosec/reasoning-fallback-nemotron
Mar 30, 2026

Conversation

@MatejKosec
Contributor

@MatejKosec MatejKosec commented Mar 13, 2026

Summary

Two related fixes for reasoning model support:

  1. Double-parsing fix: The Anthropic /v1/messages streaming endpoint was double-parsing reasoning content, causing all model output to be classified as reasoning_content with no content — making thinking models like Nemotron-3-Super unusable via Claude Code and OpenClaw.

  2. Reasoning roundtrip fix: reasoning_content from prior assistant turns was silently dropped from the prompt. Chat templates only reference {{ message.content }} — they don't know about reasoning_content. The model never saw its own prior chain-of-thought across turns, degrading multi-turn reasoning quality and breaking KV cache prefix reuse.

Root Cause (Double-parsing)

anthropic.rs applied a second parse_reasoning_content_from_stream() on the engine stream. But the engine pipeline already includes the OpenAI preprocessor (for ModelInput::Tokens backends), which applies reasoning parsing in its backward edge. The stream arriving at the Anthropic handler already has reasoning_content and content correctly split.

The second parser (with force_reasoning=true) re-classified post-think content chunks as reasoning because the </think> boundary was already consumed by the first parser and no longer appears in the detokenized text.
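The misclassification can be sketched in a few lines of Python. This is an illustrative stand-in, not the actual Rust parser: a streaming splitter that starts inside a reasoning block (the `force_reasoning` mode) and only exits when it sees `</think>` in the incoming text. Note the real parser also handles `</think>` split across token boundaries, which this sketch omits.

```python
def split_stream(chunks, force_reasoning=True, end_tag="</think>"):
    """Split a chunk stream into (reasoning, content), exiting reasoning
    mode only when end_tag is observed. Illustrative sketch only."""
    in_reasoning = force_reasoning
    reasoning, content = [], []
    for chunk in chunks:
        if in_reasoning and end_tag in chunk:
            before, _, after = chunk.partition(end_tag)
            reasoning.append(before)
            content.append(after)
            in_reasoning = False
        elif in_reasoning:
            reasoning.append(chunk)
        else:
            content.append(chunk)
    return "".join(reasoning), "".join(content)

# First parser (in the preprocessor) sees the raw stream and splits correctly:
r1, c1 = split_stream(["I should", " check.</think>", "The answer is 4."])
# r1 == "I should check.", c1 == "The answer is 4."

# A second parser applied to the already-split content stream never sees
# </think>, so with force_reasoning=True everything lands in reasoning:
r2, c2 = split_stream(["The answer is 4."])
# r2 == "The answer is 4.", c2 == ""
```

This is exactly the failure mode: the boundary token was consumed upstream, so the downstream parser has no signal to ever transition out of reasoning mode.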

Root Cause (Reasoning roundtrip)

After serialization to JSON, reasoning_content is present on assistant messages but Jinja chat templates never reference it. Before this fix, the field was carried through but ignored during template rendering. Verified by comparing prompt_tokens — mutating reasoning_content produced identical token counts on both /v1/chat/completions and /v1/messages paths.
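A minimal illustration of the symptom (a toy renderer, not a real chat template): like the Jinja templates in question, it reads only `message["content"]` and never touches `reasoning_content`, so mutating the latter cannot change the rendered prompt.

```python
def render(messages):
    """Toy template that, like the real Jinja templates, only reads content."""
    return "".join(f"<|{m['role']}|>{m['content']}\n" for m in messages)

base = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4", "reasoning_content": "2+2 is 4."},
]
mutated = [dict(base[0]), {**base[1], "reasoning_content": "totally different"}]

# Identical output regardless of reasoning_content -- the same symptom
# observed as identical prompt_tokens counts on both endpoints.
assert render(base) == render(mutated)
```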

Changes

1. anthropic.rs — Remove the redundant parse_reasoning_content_from_stream call. When a reasoning parser is configured and thinking isn't explicitly disabled, set enable_thinking=true in chat_template_args and infer prompt_injected_reasoning=true so the preprocessor's parser starts in the correct mode.

2. types.rs — During Anthropic→OpenAI request conversion, forward enable_thinking=true in chat_template_args when the Anthropic request has thinking explicitly enabled. The handler (anthropic.rs) may augment this further based on parsing options.

3. input_params.py — Forward chat_template_kwargs / chat_template_args to tokenizer.apply_chat_template(). Previously these were silently dropped, breaking per-request thinking control (e.g. enable_thinking=false) on the ModelInput::Text path.

4. oai.rs — Before Jinja template rendering, inject reasoning_content back into the content field as <think> blocks. Handles both Text (flat string) and Segments (interleaved reasoning) variants. Runs unconditionally — enable_thinking controls output (whether the model generates new reasoning), not input (prior reasoning should always be visible).

5. reasoning/mod.rs — Add 3 streaming reasoning parser unit tests covering: normal set_in_reasoning flow, force_reasoning without set_in_reasoning (the Anthropic path bug scenario), and </think> split across token boundaries.
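The injection step in change 4 (and its Python twin in input_params.py) can be sketched as follows. Function and field names mirror the OpenAI message shape; the helper name here is illustrative, and the actual implementations live in oai.rs and input_params.py.

```python
def inject_reasoning_content(message: dict) -> dict:
    """Fold reasoning_content back into content as <think> blocks before
    template rendering, then drop the field. Sketch of the PR's logic."""
    if message.get("role") != "assistant":
        return message
    reasoning = message.pop("reasoning_content", None)
    if not reasoning:
        return message
    if isinstance(reasoning, list):
        # "Segments" variant: wrap each non-empty segment individually.
        think = "".join(f"<think>{seg}</think>" for seg in reasoning if seg)
    else:
        # "Text" variant: one flat block wraps the entire reasoning.
        think = f"<think>{reasoning}</think>"
    content = message.get("content")
    if isinstance(content, list):
        # Multimodal array: prepend as a text part instead of replacing.
        message["content"] = [{"type": "text", "text": think}] + content
    else:
        # Flat string or null content.
        message["content"] = think + (content or "")
    return message
```

Running it on a prior assistant turn yields `<think>...</think>` prepended to the visible answer, which is what the Jinja template then renders into the prompt.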

Test Plan

  • 83 reasoning parser unit tests pass (+ 3 new streaming parser tests)
  • 4 new inject_reasoning_content unit tests (segments, text, null content, non-assistant skip)
  • 1 new end-to-end render test: multi-turn conversation with reasoning_content → verifies <think> block appears in rendered prompt via HfTokenizerConfigJsonFormatter::render()
  • Nemotron-3-Super-120B-A12B-FP8 on B200: Anthropic streaming produces thinking_delta + text_delta (was 0 text_delta before fix)
  • Verified on vLLM FP8 and vLLM NVFP4 backends
  • Confirmed working with Claude Code via Anthropic API
  • Qwen3.5-35B-A3B-FP8 on H100 (SGLang, qwen3 parser): 49 thinking_delta + 1 text_delta ✅

@MatejKosec MatejKosec requested review from a team as code owners March 13, 2026 21:22
@github-actions github-actions bot added fix backend::sglang Relates to the sglang backend frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` labels Mar 13, 2026
@coderabbitai
Contributor

coderabbitai bot commented Mar 13, 2026

Walkthrough

These changes add IPv6 address handling to ZMQ endpoint formatting in a publisher module, and implement fallback mechanisms in LLM protocol stream handlers to surface reasoning content when token limits are reached without producing other text output.

Changes

Cohort / File(s) Summary
IPv6 Endpoint Formatting
components/src/dynamo/sglang/publisher.py
Added maybe_wrap_ipv6_address() utility function to detect and wrap IPv6 addresses in brackets, with updated format_zmq_endpoint() to use it for consistent endpoint formatting.
Anthropic Protocol Reasoning Fallback
lib/llm/src/protocols/anthropic/stream_converter.rs
Added fallback path in emit_end_events() to emit empty text content block and truncation delta when reasoning block started but reached token limit without producing text, preventing UI stalls.
OpenAI Protocol Reasoning Content Promotion
lib/llm/src/protocols/openai/chat_completions/aggregator.rs
Added logic in From<DeltaChoice> to promote reasoning_content to content field when finish_reason indicates token exhaustion and no other content exists, ensuring reasoning segments surface as text content.
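The IPv6 helper described in the first cohort is straightforward to sketch; this version uses the standard-library `ipaddress` module and is only an approximation of the helper in components/src/dynamo/sglang/publisher.py.

```python
import ipaddress

def maybe_wrap_ipv6_address(addr: str) -> str:
    """Wrap bare IPv6 literals in brackets for host:port endpoints;
    leave IPv4 addresses and hostnames untouched."""
    try:
        if ipaddress.ip_address(addr).version == 6:
            return f"[{addr}]"
    except ValueError:
        pass  # hostname or wildcard, not a literal IP address
    return addr

def format_zmq_endpoint(endpoint_template: str, ip_address: str) -> str:
    """Replace the wildcard in a ZMQ endpoint template with a usable address.
    e.g. "tcp://*:5555" with "::1" becomes "tcp://[::1]:5555"."""
    return endpoint_template.replace("*", maybe_wrap_ipv6_address(ip_address), 1)
```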

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A rabbit hops through IPv6 brackets bright,
While reasoning tokens peek through the night,
When limits are reached and text takes its bow,
The thinking shines through—no stalling now!
Protocols flourish with fallbacks so right. ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Title check ✅ Passed The title mentions 'promote reasoning to content' and 'exhausts token budget', accurately reflecting changes in multiple files that handle this scenario.
Description check ✅ Passed The description provides comprehensive overview, details, reviewer guidance, and test plan, but is missing explicit 'Related Issues' section with issue numbers as required by template.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/src/dynamo/sglang/publisher.py (1)

41-43: ⚠️ Potential issue | 🟡 Minor

Update stale docs/comments about helper ownership.

The docs still say this uses SGLang’s maybe_wrap_ipv6_address, but the helper is now local in this module.

✏️ Suggested doc/comment fix
 def format_zmq_endpoint(endpoint_template: str, ip_address: str) -> str:
     """Format ZMQ endpoint by replacing wildcard with IP address.

     Properly handles IPv6 addresses by wrapping them in square brackets.
-    Uses SGLang's maybe_wrap_ipv6_address for consistent formatting.
+    Uses local IPv6 wrapping helper for consistent formatting.

@@
-    # Use SGLang's utility to wrap IPv6 addresses in brackets
+    # Use local helper to wrap IPv6 addresses in brackets
     formatted_ip = maybe_wrap_ipv6_address(ip_address)

Also applies to: 57-57

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/src/dynamo/sglang/publisher.py` around lines 41 - 43, Update the
stale doc/comments that claim IPv6 wrapping uses SGLang’s
maybe_wrap_ipv6_address to instead state that the helper is implemented locally
in this module (the local maybe_wrap_ipv6_address function); update any inline
comments or docstrings near the publisher logic and the other occurrence around
the second mention (line ~57) to reference the local helper name and remove
references to SGLang so readers look for the function in this file.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lib/llm/src/protocols/anthropic/stream_converter.rs`:
- Around line 363-393: The new fallback branch that emits a synthetic empty text
content block and a subsequent content_block_delta when thinking_block_started
&& !text_block_started && stop_reason == Some(AnthropicStopReason::MaxTokens)
changes the SSE shape and needs test coverage: update the tagged test helper
used by the Anthropic stream tests to recognize the new events emitted by
make_sse_event for AnthropicStreamEvent::ContentBlockStart and
AnthropicStreamEvent::ContentBlockDelta (with the "[Reasoning exceeded token
limit]" text), and add a unit test asserting that when thinking_block_started is
true, text_block_started is false, and stop_reason is MaxTokens the helper
captures both the content_block_start and content_block_delta in order
(including index handling via text_block_index/next_block_index). Ensure the
helper path/parser used in existing tagged tests is adjusted to accept this
two-event sequence so regressions will be caught.

In `@lib/llm/src/protocols/openai/chat_completions/aggregator.rs`:
- Around line 314-325: Add a unit test that exercises the reasoning→content
promotion: construct a delta where content is None, reasoning_content is
non-empty, and finish_reason is
dynamo_async_openai::types::FinishReason::Length, feed it through the same
aggregation path that contains the logic using reasoning_content and
ChatCompletionMessageContent::Text, and assert that the resulting message's
content is promoted to ChatCompletionMessageContent::Text with the reasoning
text (and that reasoning_content is cleared). Ensure the test initializes any
Aggregator/collector used by aggregator.rs so the branch with finish_reason ==
Length is executed.

---

Outside diff comments:
In `@components/src/dynamo/sglang/publisher.py`:
- Around line 41-43: Update the stale doc/comments that claim IPv6 wrapping uses
SGLang’s maybe_wrap_ipv6_address to instead state that the helper is implemented
locally in this module (the local maybe_wrap_ipv6_address function); update any
inline comments or docstrings near the publisher logic and the other occurrence
around the second mention (line ~57) to reference the local helper name and
remove references to SGLang so readers look for the function in this file.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0190c53f-8603-4be0-8c07-cfce1f1f0484

📥 Commits

Reviewing files that changed from the base of the PR and between 33604af and 5708908.

📒 Files selected for processing (3)
  • components/src/dynamo/sglang/publisher.py
  • lib/llm/src/protocols/anthropic/stream_converter.rs
  • lib/llm/src/protocols/openai/chat_completions/aggregator.rs

@MatejKosec MatejKosec force-pushed the user/mkosec/reasoning-fallback-nemotron branch from 8d57ae4 to fd8c366 Compare March 14, 2026 23:03
@MatejKosec MatejKosec changed the title fix: promote reasoning to content when model exhausts token budget fix: Anthropic streaming reasoning parser not detecting </think> boundary Mar 14, 2026
@MatejKosec MatejKosec force-pushed the user/mkosec/reasoning-fallback-nemotron branch from fd8c366 to 508fcf5 Compare March 14, 2026 23:34
@pull-request-size pull-request-size bot added size/L and removed size/M labels Mar 14, 2026
@MatejKosec MatejKosec force-pushed the user/mkosec/reasoning-fallback-nemotron branch from 508fcf5 to 2bb6430 Compare March 16, 2026 00:51
@MatejKosec MatejKosec force-pushed the user/mkosec/reasoning-fallback-nemotron branch from 2bb6430 to b850bea Compare March 16, 2026 01:19
@MatejKosec MatejKosec force-pushed the user/mkosec/reasoning-fallback-nemotron branch from b850bea to f735dc8 Compare March 16, 2026 03:51
@MatejKosec MatejKosec force-pushed the user/mkosec/reasoning-fallback-nemotron branch from f735dc8 to 0c6c720 Compare March 16, 2026 05:28
…dary

The Anthropic endpoint's reasoning parser fails to transition from thinking
to content in the streaming path, causing all output (including the actual
response after </think>) to be classified as reasoning_content. This makes
clients like Claude Code appear stuck in 'Thinking...' state.

Root cause: the Anthropic-to-OpenAI request conversion sets
chat_template_args to None, so the model's chat template never receives
enable_thinking=true. Without this, reasoning models like Nemotron-3-Super
generate plain text without <think>...</think> tags, and the
force_reasoning parser classifies everything as reasoning.

Additionally, the Anthropic path passes thinking_enabled (from the
request's thinking field) as prompt_injected_reasoning, which affects
the streaming parser's stripped_think_start flag needed for correct
</think> boundary detection.

Changes:
- types.rs: When Anthropic request has thinking enabled, pass
  enable_thinking=true in chat_template_args so the model's chat template
  emits <think>...</think> tags
- anthropic.rs: Always pass prompt_injected_reasoning=true when a
  reasoning parser is configured, so the streaming parser's
  stripped_think_start flag is set correctly
- publisher.py: Inline maybe_wrap_ipv6_address (removed in sglang 0.6.0)
- 3 parser unit tests covering streaming thinking scenarios

Tested with Nemotron-3-Super-120B-A12B-FP8 on B200.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
The Anthropic handler applied parse_reasoning_content_from_stream() on
top of the engine stream. But the engine pipeline already includes the
OpenAI preprocessor (for ModelInput::Tokens backends), which applies
reasoning parsing in its backward edge. The stream arriving at the
Anthropic handler already has reasoning_content and content correctly
split.

The second parser, starting in force_reasoning=true mode, re-processed
content chunks and misclassified them as reasoning because </think> was
consumed by the first parser and never appears in detokenized text.

Also:
- Forward chat_template_kwargs from request to apply_chat_template()
  in InputParamManager (was silently dropped for ModelInput::Text path)
- Set enable_thinking=true in chat_template_args when reasoning parser
  is configured and thinking isn't explicitly disabled

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
The previous comment incorrectly stated the non-streaming aggregator
handles reasoning parsing for the ModelInput::Text path. In fact,
DeltaAggregator::apply() ignores parsing_options entirely. Document
this as a known gap affecting all streaming handlers equally.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
The skip-when-reasoning_content-is-set logic caused DeepSeek v3 test
failures: when the backend sets reasoning_content AND leaves reasoning
text in content (without <think> tags), the skip+strip approach leaked
reasoning into content.

This logic is no longer needed — the Anthropic handler no longer applies
a second reasoning parser, so there is no double-parsing to guard against
in the preprocessor.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
…template

Unpacking user-supplied chat_template_args/kwargs alongside explicit
tokenize=False and add_generation_prompt=True could raise TypeError if
either key was present in the user dict. Strip them defensively.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
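The TypeError described above is a plain Python keyword-collision: unpacking a user dict that contains `tokenize` or `add_generation_prompt` next to explicit keyword arguments fails. A stand-in function (not the real `tokenizer.apply_chat_template`) shows both the failure and the defensive strip:

```python
def apply_chat_template(messages, tokenize=True, add_generation_prompt=False, **kw):
    """Stand-in for tokenizer.apply_chat_template, to show the collision."""
    return (tokenize, add_generation_prompt)

user_kwargs = {"enable_thinking": False, "tokenize": True}

try:
    # Collides: tokenize appears both explicitly and inside **user_kwargs.
    apply_chat_template([], tokenize=False, add_generation_prompt=True, **user_kwargs)
except TypeError:
    pass  # "got multiple values for keyword argument 'tokenize'"

# Defensive strip before unpacking, as the commit describes:
safe = {k: v for k, v in user_kwargs.items()
        if k not in ("tokenize", "add_generation_prompt")}
result = apply_chat_template([], tokenize=False, add_generation_prompt=True, **safe)
assert result == (False, True)
```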
- Revert publisher.py changes (PR #6736 handles the SGLang compat)
- Unify /// doc comments to // regular comments in reasoning parser tests

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Chat templates only reference {{ message.content }} — they never look at
reasoning_content. When a model generates <think>reasoning</think> followed
by tool calls, the reasoning is parsed into reasoning_content on the
response. But when the client sends it back for the next turn, the Jinja
template ignores reasoning_content and the model never sees its own prior
chain-of-thought.

Before template rendering, convert reasoning_content back into <think>
blocks inside the content field. For Segments (interleaved reasoning),
each non-empty segment is individually wrapped. For Text (flat string),
a single <think> block wraps the entire reasoning.

Guarded by enable_thinking in chat_template_args to avoid injecting
<think> tokens for non-reasoning models.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Signed-off-by: Matej Kosec <mkosec@nvidia.com>
The enable_thinking flag is only set on the Anthropic path (by
anthropic.rs). On the /v1/chat/completions path, clients don't set it
and the preprocessor doesn't add it before render() runs, so the
injection was silently skipped.

The presence of reasoning_content on an assistant message is itself
sufficient signal — a non-reasoning model would never produce it. This
matches the DeepSeek V3.2 reference implementation which unconditionally
processes reasoning_content without any flag checks.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
…disabled

Unconditional injection is too permissive — if a client replays old
messages but explicitly disables thinking, <think> tags would confuse
the model. Invert the check: inject by default, skip only when
enable_thinking is explicitly false.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Revert the enable_thinking gate — enable_thinking controls output
(whether the model generates new reasoning), not input. Prior reasoning
should always be visible in the prompt regardless of the current turn's
thinking mode.

Add end-to-end test that constructs a multi-turn conversation with
reasoning_content, renders it through HfTokenizerConfigJsonFormatter,
and verifies the <think> block appears in the final prompt text.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Revert the enable_thinking gate — enable_thinking controls output
(whether the model generates new reasoning), not input. Prior reasoning
should always be visible regardless of the current turn's thinking mode.

Replace the simple render test with two end-to-end roundtrip tests:
1. Text variant: assistant with reasoning_content + content
2. Agentic flow: assistant reasons → tool_call → tool result →
   assistant reasons again → final answer. Verifies both reasoning
   turns survive into the rendered prompt.

Both tests go through HfTokenizerConfigJsonFormatter::render() with a
real NvCreateChatCompletionRequest, exercising the full pipeline.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
When content is an array (multimodal), prepend reasoning as a text part
instead of replacing the entire array. Prevents silently dropping
multimodal content on assistant messages that also have reasoning.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Signed-off-by: Matej Kosec <mkosec@nvidia.com>
The oai.rs injection only runs for ModelInput::Tokens (preprocessor in
the pipeline). For ModelInput::Text, the worker's input_params.py calls
apply_chat_template directly — and reasoning_content was invisible to
the template.

Add _inject_reasoning_content() to input_params.py, called before
apply_chat_template. Same logic as the Rust version: wraps reasoning in
<think> blocks, handles Text/Segments/multimodal, removes the field
after injection.

This fixes the /v1/messages path where thinking blocks were being
dropped (verified: input_tokens was identical regardless of thinking
content).

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
8 tests covering: text variant, segments variant, null content, absent
content, multimodal arrays, non-assistant skip, empty reasoning skip,
and full agentic multi-turn flow with tool calls.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Some chat templates (Nemotron, Qwen3) natively reference
reasoning_content in their Jinja logic. Injecting <think> blocks AND
letting the template render reasoning_content produces duplicate tags.

Detect at model load time whether the template source contains
"reasoning_content". If it does, skip injection on both paths:
- Rust (oai.rs): check self.template_handles_reasoning
- Python (input_params.py): check tokenizer.chat_template source

Add tests for both cases:
- Template without reasoning_content → injection happens
- Template with reasoning_content → injection skipped, template
  renders natively, exactly one <think> block in output

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
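The load-time check described above amounts to a substring probe on the Jinja template source; a hedged sketch (function name illustrative, templates abbreviated):

```python
def template_handles_reasoning(template_source):
    """Return True if the chat template natively references
    reasoning_content, in which case <think> injection is skipped
    to avoid duplicate tags."""
    return "reasoning_content" in (template_source or "")

# Abbreviated template shapes for illustration:
nemotron_like = (
    "{% if message.reasoning_content %}<think>{{ message.reasoning_content }}"
    "</think>{% endif %}{{ message.content }}"
)
plain = "{{ message.content }}"

assert template_handles_reasoning(nemotron_like)   # injection skipped
assert not template_handles_reasoning(plain)       # injection happens
assert not template_handles_reasoning(None)        # no template at all
```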
Nemotron (and likely other) chat templates default
truncate_history_thinking to true, which strips <think> content from
all assistant turns before the last user message. This means the model
never sees its own prior reasoning — breaking multi-turn agentic flows
where the model needs context on why it made prior decisions.

Set truncate_history_thinking=false in chat_template_args alongside
enable_thinking when a reasoning parser is configured. This flows
through to the Jinja template on both ModelInput paths:
- Rust: merged into template context in oai.rs render()
- Python: forwarded as extra_kwargs in input_params.py

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Pre-existing code that breaks under the Docker build toolchain.
v.len() returns Option<usize> on minijinja::Value, and_then + is_some_and
chain was fragile. Use map_or for clarity.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
…me_and"

Pre-existing code, shouldn't be touched in this PR.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
…y_thinking

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
The rebase conflict resolution left stale changes in publisher.py
that reverted _compat imports back to sglang.srt.utils. Reset to
main's version since this PR doesn't touch publisher.py.

Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Signed-off-by: Matej Kosec <mkosec@nvidia.com>
Signed-off-by: Matej Kosec <mkosec@nvidia.com>

Labels

backend::sglang Relates to the sglang backend fix frontend `python -m dynamo.frontend` and `dynamo-run in=http|text|grpc` size/XL
