Skip to content

studio: auto-continue when model stops mid-plan#5549

Open
danielhanchen wants to merge 11 commits into
mainfrom
studio/auto-continue-mid-plan
Open

studio: auto-continue when model stops mid-plan#5549
danielhanchen wants to merge 11 commits into
mainfrom
studio/auto-continue-mid-plan

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Summary

The existing intent-signal re-prompt at studio/backend/core/inference/llama_cpp.py fires only when (a) tools are armed and (b) the response is short. In practice models often stop mid-plan in three other shapes that this path misses:

  1. Trailing intent ending the buffer: "Let me clone the repo."
  2. Numbered or bulleted list under a header: "Let me now proceed to:\n1. ...\n6. Run validation"
  3. Bare trailing colon: "Let me check the repo:"

When the model stops on any of these the turn ends with the workload only partially delivered.

Changes

  • Add three module-level regexes (_TRAILING_PLAN_INTENT, _TRAILING_PLAN_LIST, _TRAILING_PLAN_COLON) covering the observed shapes, with a 600-char trailing window so list endings are caught even when the cue is several lines above the last item.
  • Add _trailing_plan_hit() helper that ORs the three matchers.
  • Add _MAX_CONTINUES = 3 and a separate _continue_count so the auto-continue budget does not steal from the tool-coercive re-prompt budget.
  • In generate_chat_completion_with_tools, branch the existing _S_STREAMING follow-up block: tool-coercive re-prompt keeps the old behavior, the new auto-continue path nudges with a neutral "Continue." and works with or without tools.

Test plan

  • python -c 'import ast; ast.parse(open(path).read())' clean
  • Regex matchers verified against:
    • "Let me check the repo structure." -> match
    • "Let me now proceed to:\n1. Clone\n...\n6. Run validation" -> match
    • "I have the answer. Let me show you the diff:" -> match
    • "Here's my plan:\n- Investigate\n- Fix\n- Test" -> match
    • "The answer is 42. I have addressed all 9 sections." -> no match
    • "In the first place, let me clarify. The fix is X. The tests all pass. We are done." -> no match
  • Run a multi-section workload against Qwen3.5-27B GGUF quants: 9 of 9 required output sections delivered in a single user turn (previously stopped at 2 of 9).
  • Sanity-check that a clean complete response (period at end, no trailing colon, no numbered list) does not trigger a spurious continuation.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a8a648c216

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +67 to +69
r"let me|now let me|i['’]ll now|next,?\s*i['’]ll|"
r"i['’]m going to|i will now|let['’]s now"
r")[^.!?\n]*[.!?]?\s*$"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Exclude closing 'let me know' from auto-continue

When a complete answer ends with a common closing such as If you need anything else, let me know., this regex still matches the trailing let me phrase because it is searched over the final window and does not exclude let me know. That sends the turn through _trailing_plan_hit() and can add up to three spurious Continue. generations after an already-finished response, increasing latency/cost and appending unrelated text for normal user-facing completions.

Useful? React with 👍 / 👎.

f"model ended turn mid-plan "
f"({len(_stripped)} chars)"
)
_nudge = "Continue."

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reset Anthropic cursors before auto-continued text

For Anthropic Messages requests with tools enabled, this new Continue. branch starts another assistant generation after yielding only a status event; however AnthropicStreamEmitter and _anthropic_tool_non_streaming ignore status and reset their cumulative-text cursor only on tool_end. When auto-continue fires without a tool call, the continuation is diffed against the previous assistant turn length, so shorter/equal continuations are dropped and longer ones lose their prefix. Either emit a boundary the Anthropic adapters handle or reset their cursor on this empty status.

Useful? React with 👍 / 👎.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism to detect and handle mid-plan End-Of-Sequence (EOS) events by implementing auto-continue logic alongside existing tool-coercive re-prompts. It adds regex-based detectors for trailing intents, lists, and colons, and updates the inference loop to nudge the model with a "Continue." message when these patterns are detected. Feedback was provided to refactor the _trailing_plan_hit function using any() with a generator expression to improve conciseness and align with efficiency guidelines.

Comment on lines +98 to +104
if _TRAILING_PLAN_INTENT.search(tail) is not None:
return True
if _TRAILING_PLAN_LIST.search(tail) is not None:
return True
if _TRAILING_PLAN_COLON.search(tail) is not None:
return True
return False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For improved conciseness and readability, you can use any() with a generator expression to check against the multiple regex patterns. This avoids the series of if statements and combines checks into a single iteration, aligning with repository efficiency guidelines regarding redundant data iterations.

Suggested change
if _TRAILING_PLAN_INTENT.search(tail) is not None:
return True
if _TRAILING_PLAN_LIST.search(tail) is not None:
return True
if _TRAILING_PLAN_COLON.search(tail) is not None:
return True
return False
return any(
regex.search(tail) is not None
for regex in (_TRAILING_PLAN_INTENT, _TRAILING_PLAN_LIST, _TRAILING_PLAN_COLON)
)
References
  1. To improve efficiency, avoid redundant data iterations. Combine checks and transformations into a single loop and return computed values for callers to reuse.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6e6090d5f7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

r")[^.!?\n]*[.!?]?\s*$"
)
_TRAILING_PLAN_LIST = re.compile(
r"(?ims)"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Remove multiline mode from trailing-list end detector

_TRAILING_PLAN_LIST is compiled with (?ims) and then searched with a terminal \s*$; the m flag makes $ match end-of-line, so this regex can match a list block even when more response text follows later in the same turn. That causes _trailing_plan_hit() to fire on already-complete answers (for example a list followed by a concluding sentence), triggering unnecessary Continue. retries and potentially appending unrelated extra output while increasing latency/cost.

Useful? React with 👍 / 👎.

Comment on lines 4085 to 4086
_extra = _MAX_REPROMPTS + _MAX_CONTINUES if max_tool_iterations > 0 else 0
for iteration in range(max_tool_iterations + _extra):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid inflating tool-iteration cap with unused continue budget

The loop bound now always adds _MAX_CONTINUES whenever max_tool_iterations > 0, so the agent can run up to max_tool_iterations + 6 full iterations even if no auto-continue is used. Because each extra iteration can still execute tool calls, this weakens the caller-provided cap (max_tool_iterations) and can produce unexpected extra tool executions/cost during long tool loops.

Useful? React with 👍 / 👎.

rhsCZ pushed a commit to rhsCZ/unsloth that referenced this pull request May 18, 2026
Codex review on unslothai#5549 flagged that endings like
"If you need anything else, let me know." match the trailing-plan
intent pattern (the regex matches "let me <anything>." at end of
buffer). On a finished, user-facing closing this fires the auto-
continue branch up to three times, costing latency / tokens and
appending unrelated text after the response.

Add a negative lookahead so "let me" only counts as a mid-plan signal
when it is NOT immediately followed by "know". Other intent phrases
("now let me", "i'll now", "i'm going to", "i will now", "let's now")
already require a planning verb so they are unaffected.

Verified against `scripts/r6_trailing_plan_regex_test.py`: closing
"let me know" variants no longer match; "let me clone/check/run …"
still does.
rhsCZ pushed a commit to rhsCZ/unsloth that referenced this pull request May 18, 2026
Codex P2 on unslothai#5549 flagged that the auto-continue branch yields only
a `{"type":"status","text":""}` event between turns; the Anthropic
streaming emitter (`AnthropicStreamEmitter`) and the non-streaming
tool path (`_anthropic_tool_non_streaming`) both ignore `status`
events, so their cumulative-text cursor still holds the previous
turn's full length when the continuation starts streaming. Shorter
continuations get dropped entirely and longer ones lose their prefix.

Treat the empty-text status as an auto-continue boundary in both
paths:

  - `AnthropicStreamEmitter`: close any open text block, open a fresh
    one (matches the `tool_end` reset pattern), and clear `_prev_text`.
  - `_anthropic_tool_non_streaming`: clear `prev_text` so the next
    `content` event's diff baseline is empty.

Non-empty status events (tool progress text) keep their existing
no-op semantics.
danielhanchen pushed a commit that referenced this pull request May 18, 2026
Codex P2 review on #5549 surfaced two related risks in the auto-continue
plumbing:

1. `_TRAILING_PLAN_LIST` was compiled with `(?ims)`. The `m` flag makes
   the terminal `\s*$` match end-of-line, so a complete answer like
   "Here's my plan:\n- a\n- b\n\nDone, that should work." still matched
   the list-block sub-pattern and tripped a spurious `Continue.` retry.
   Drop the `m` (and the unused `s`) flag and re-anchor with `\Z` so the
   list pattern only fires when the list is genuinely the last thing in
   the buffer.

2. The agent loop pre-reserved `_MAX_REPROMPTS + _MAX_CONTINUES` (= 6)
   extra iterations on top of the caller's `max_tool_iterations`
   unconditionally. That weakens the caller-provided budget: a turn
   that never trips the reprompt or continue path could still run up
   to N+6 full iterations and execute their tool calls.

   Switch the bound to a dynamic cap that grows only as reprompts /
   continues are actually consumed: `iteration < max_tool_iterations +
   _reprompt_count + _continue_count`. With both counters at zero the
   loop honors the caller cap exactly; once a continue or reprompt
   fires it earns its own slot back.

   Implemented with `itertools.count()` so the existing `continue`
   statements in the loop body keep their semantics.

Regex behaviour pinned by `scripts/r6_trailing_plan_regex_test.py`
(updated separately for the new list-tail case).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2784872494

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +256 to +264
elif etype == "status" and not event.get("text"):
# Auto-continue boundary marker emitted by
# generate_chat_completion_with_tools — the next "content"
# event resets to a fresh cumulative baseline, so close any
# open text block and clear the diff cursor. Without this
# the next continuation gets diffed against the previous
# turn's length (shorter continuations are dropped, longer
# ones lose their prefix).
return self._handle_boundary()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Don’t treat every empty status as a content boundary

Empty status events are emitted for more than auto-continue boundaries: the GGUF tool loop also yields them to clear the status badge at normal completion and after each tool call (llama_cpp.py lines 4501 and 4766). With Anthropic streaming, this branch now closes the currently open text block and immediately opens a new one for all of those cases, so a normal final answer or any tool call produces extra empty content_block_start/content_block_stop pairs in the Messages SSE stream. Reset the diff cursor only for the actual continuation boundary, or make the generator emit a distinct boundary event instead of overloading every empty status.

Useful? React with 👍 / 👎.

danielhanchen pushed a commit that referenced this pull request May 19, 2026
…am cleanly

Codex P2 on #5549 flagged that the cycle-5 cursor-reset (efa43c4) keyed
on every empty-status event, but generate_chat_completion_with_tools
emits empty status events in five places, only two of which are real
iteration boundaries:

  - Line 4497: emitted right before `continue` after a re-prompt /
    auto-continue. The next iteration starts a fresh assistant turn.
    BOUNDARY: reset cursor.
  - Line 4766: emitted right before `continue` after a tool call. The
    next iteration regenerates with tool results in history. BOUNDARY:
    reset cursor.
  - Line 4501: emitted at metadata-yield after normal streaming. Stream
    is about to end, no new iteration follows. NOT a boundary.
  - Line 4584: emitted in DRAINING-no-tool-call fallback path. Stream
    is about to end with buffered content_accum. NOT a boundary.
  - Line 4794: emitted at the final exit of the generator. NOT a
    boundary.

Treating all five as boundaries gave every Anthropic-streaming response
an extra content_block_stop + content_block_start pair around its final
text and around every tool call.

Fix by tagging the two real boundary sites with `"boundary": True` and
tightening both the Anthropic emitter (`anthropic_compat.py`) and the
OpenAI-compat tool path + Anthropic non-streaming path
(`routes/inference.py`) to reset the cumulative-text cursor only when
that flag is set. Plain empty-status events keep their existing badge-
clear semantics on the frontend (`tool_status` SSE with content "").

Add two regression tests in
`backend/tests/test_anthropic_messages.py::TestAnthropicStreamEmitter`:

  - test_boundary_flag_closes_block_and_resets_cursor: a boundary=True
    status closes the open text block and the next content delta
    streams from zero.
  - test_empty_status_without_boundary_does_not_close_block: a plain
    empty status leaves block_index unchanged and the next content
    delta is diffed against the previous text length.

107 tests pass across the four anthropic + trailing-plan test files.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 610c3873b5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

# Clear tool status badge before next generation iteration.
# boundary=True: the model is about to start a fresh turn
# so cumulative-text adapters must reset their cursor.
yield {"type": "status", "text": "", "boundary": True}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid marking post-tool badge clears as boundaries

Fresh evidence in this revision is that the post-tool badge clear is now explicitly emitted with boundary=True. In the Anthropic streaming path, tool_end already resets the cursor and opens the next text block, so this immediately sends the same event through _handle_boundary(), closing that empty block and opening another one before any model text arrives; any streamed Anthropic request that executes a tool now gets an extra empty text content block in the SSE sequence before the real post-tool response.

Useful? React with 👍 / 👎.

danielhanchen pushed a commit that referenced this pull request May 19, 2026
… tool_end

Codex 02:42Z on #5549 caught that flagging the post-tool empty-status
event with boundary=True (cycle-12 commit 610c387) double-handles the
cursor reset in the Anthropic streaming path.

AnthropicStreamEmitter._handle_tool_end already:
  - closes the open tool_use block,
  - emits tool_result,
  - increments block_index,
  - opens a fresh text block, and
  - resets _prev_text = "".

When llama_cpp.py then yielded boundary=True on the very next event,
_handle_boundary fired _close_block + _open_text_block on that freshly
opened (still-empty) text block. Result: every tool call produced a
spurious content_block_stop + content_block_start pair before the
post-tool model text streamed.

Fix:

  - llama_cpp.py: drop boundary=True from the post-tool status emit
    (line ~4768). Keep boundary=True only at the auto-continue site
    (line ~4500), which has no preceding tool_end to do the cursor
    work.
  - routes/inference.py OpenAI-compat tool stream: mirror the
    Anthropic semantics by resetting prev_text on BOTH tool_start AND
    tool_end, so the post-tool empty-status no longer needs to do it.

Add backend/tests/test_anthropic_messages.py::TestAnthropicStreamEmitter::
test_post_tool_empty_status_does_not_double_close as a regression
test: content -> tool_start -> tool_end -> empty status -> content
must not bump block_index past tool_end's increment, and the post-tool
content must land in the text block tool_end opened.

95 tests pass across the anthropic + trailing-plan suites.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 003ae8a3c6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

# not a plan signal -- excluded via negative lookahead.
_TRAILING_PLAN_INTENT = re.compile(
r"(?i)("
r"let me(?!\s+know\b)|now let me|i['’]ll now|next,?\s*i['’]ll|"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Exclude prefixed 'now let me know' closers

Fresh evidence in this revision is that only the bare let me alternative has the (?!\s+know\b) guard; the new now let me alternative still matches complete closers such as Now let me know if you want another example.. When a GGUF response ends with that common closing, _trailing_plan_hit() treats the turn as mid-plan and can issue up to three unnecessary Continue. generations, adding latency/cost and potentially appending unrelated text after an already-finished answer.

Useful? React with 👍 / 👎.

danielhanchen and others added 10 commits May 19, 2026 04:35
The existing intent-signal re-prompt fires only when tools are armed
and the response is short. Models often stop mid-plan in other shapes
too: a trailing "Let me clone the repo.", a "Let me ...:" header
followed by a numbered list, or a bare trailing colon. When this
happens the turn ends with the structured workload only partially
delivered.

Add a neutral "Continue." nudge that runs alongside the tool-coercive
re-prompt:

- _TRAILING_PLAN_INTENT, _TRAILING_PLAN_LIST, _TRAILING_PLAN_COLON
  cover the three observed shapes, scanned over the last 600 chars.
- _trailing_plan_hit() returns True if any of them match.
- _MAX_CONTINUES (3) is independent of _MAX_REPROMPTS so the two
  paths cannot starve each other.
- _continue_count threads through the agentic loop; auto-continue
  fires with "Continue." regardless of tool armament.

Regex tested against the patterns above plus negative controls
(complete sentences, benign "let me" earlier in the buffer) before
landing.
Codex review on #5549 flagged that endings like
"If you need anything else, let me know." match the trailing-plan
intent pattern (the regex matches "let me <anything>." at end of
buffer). On a finished, user-facing closing this fires the auto-
continue branch up to three times, costing latency / tokens and
appending unrelated text after the response.

Add a negative lookahead so "let me" only counts as a mid-plan signal
when it is NOT immediately followed by "know". Other intent phrases
("now let me", "i'll now", "i'm going to", "i will now", "let's now")
already require a planning verb so they are unaffected.

Verified against `scripts/r6_trailing_plan_regex_test.py`: closing
"let me know" variants no longer match; "let me clone/check/run …"
still does.
Codex P2 on #5549 flagged that the auto-continue branch yields only
a `{"type":"status","text":""}` event between turns; the Anthropic
streaming emitter (`AnthropicStreamEmitter`) and the non-streaming
tool path (`_anthropic_tool_non_streaming`) both ignore `status`
events, so their cumulative-text cursor still holds the previous
turn's full length when the continuation starts streaming. Shorter
continuations get dropped entirely and longer ones lose their prefix.

Treat the empty-text status as an auto-continue boundary in both
paths:

  - `AnthropicStreamEmitter`: close any open text block, open a fresh
    one (matches the `tool_end` reset pattern), and clear `_prev_text`.
  - `_anthropic_tool_non_streaming`: clear `prev_text` so the next
    `content` event's diff baseline is empty.

Non-empty status events (tool progress text) keep their existing
no-op semantics.
Codex P2 review on #5549 surfaced two related risks in the auto-continue
plumbing:

1. `_TRAILING_PLAN_LIST` was compiled with `(?ims)`. The `m` flag makes
   the terminal `\s*$` match end-of-line, so a complete answer like
   "Here's my plan:\n- a\n- b\n\nDone, that should work." still matched
   the list-block sub-pattern and tripped a spurious `Continue.` retry.
   Drop the `m` (and the unused `s`) flag and re-anchor with `\Z` so the
   list pattern only fires when the list is genuinely the last thing in
   the buffer.

2. The agent loop pre-reserved `_MAX_REPROMPTS + _MAX_CONTINUES` (= 6)
   extra iterations on top of the caller's `max_tool_iterations`
   unconditionally. That weakens the caller-provided budget: a turn
   that never trips the reprompt or continue path could still run up
   to N+6 full iterations and execute their tool calls.

   Switch the bound to a dynamic cap that grows only as reprompts /
   continues are actually consumed: `iteration < max_tool_iterations +
   _reprompt_count + _continue_count`. With both counters at zero the
   loop honors the caller cap exactly; once a continue or reprompt
   fires it earns its own slot back.

   Implemented with `itertools.count()` so the existing `continue`
   statements in the loop body keep their semantics.

Regex behaviour pinned by `scripts/r6_trailing_plan_regex_test.py`
(updated separately for the new list-tail case).
While drafting backend/tests/test_trailing_plan.py for the changes
landed in b4e0985, the new tests surfaced a deeper false-positive the
earlier regex tightening missed.

For input "Let me explain:\n1. The function returns 42.\n\nThat's the
answer." the previous `(?:\s*(?:[-*•]|\d+\.)\s+[^\n]+\n?)+` allowed the
regex engine to backtrack and treat the in-prose "42." substring as a
second list-item marker: iter 1 consumed "1. The function returns 4"
and iter 2 consumed "2.\n\nThat's the answer." (with `\s+` greedily
crossing the empty-line break). The pattern then satisfied `\s*\Z` and
the buffer fired a spurious `Continue.` retry on what was already a
fully-formed answer.

Tighten the per-item boundary:

  - `[ \t]*` before the marker (no newlines): forces the marker to sit
    at the start of its own line. A mid-prose "42." cannot satisfy this
    because the engine cannot rewind past the preceding `\n` without
    invalidating the previous iteration's `[^\n]+\n` close.
  - `[ \t]+` between the marker and content: blocks an `\s+`-driven
    cross-newline reach into a closing paragraph.
  - `(?:\n|\Z)` at end of item: a real line break OR end of buffer.
    Preserves the "list at EOB with no trailing newline" case while
    eliminating the backtrack route.

Land backend/tests/test_trailing_plan.py at the same time, covering:

  - `_TRAILING_PLAN_INTENT` "let me know" closer exclusion (cycle-3 fix).
  - `_TRAILING_PLAN_LIST` correctly fires on genuine trailing lists
    (dash, asterisk, unicode bullet, numeric).
  - `_TRAILING_PLAN_LIST` does NOT fire on list + closing paragraph,
    list + closing sentence, list embedded mid-text, or the "42."
    in-prose digit case.
  - `_TRAILING_PLAN_COLON` fires on bare trailing intent-colons only.
  - `_trailing_plan_hit` composite cases.
  - The 600-char window slicing.

31 cases, all pass. Pins the regex behaviour against future regressions
inside the repo (the prior pin script lived only in the probing
workspace, not the studio tree).
…am cleanly

Codex P2 on #5549 flagged that the cycle-5 cursor-reset (efa43c4) keyed
on every empty-status event, but generate_chat_completion_with_tools
emits empty status events in five places, only two of which are real
iteration boundaries:

  - Line 4497: emitted right before `continue` after a re-prompt /
    auto-continue. The next iteration starts a fresh assistant turn.
    BOUNDARY: reset cursor.
  - Line 4766: emitted right before `continue` after a tool call. The
    next iteration regenerates with tool results in history. BOUNDARY:
    reset cursor.
  - Line 4501: emitted at metadata-yield after normal streaming. Stream
    is about to end, no new iteration follows. NOT a boundary.
  - Line 4584: emitted in DRAINING-no-tool-call fallback path. Stream
    is about to end with buffered content_accum. NOT a boundary.
  - Line 4794: emitted at the final exit of the generator. NOT a
    boundary.

Treating all five as boundaries gave every Anthropic-streaming response
an extra content_block_stop + content_block_start pair around its final
text and around every tool call.

Fix by tagging the two real boundary sites with `"boundary": True` and
tightening both the Anthropic emitter (`anthropic_compat.py`) and the
OpenAI-compat tool path + Anthropic non-streaming path
(`routes/inference.py`) to reset the cumulative-text cursor only when
that flag is set. Plain empty-status events keep their existing badge-
clear semantics on the frontend (`tool_status` SSE with content "").

Add two regression tests in
`backend/tests/test_anthropic_messages.py::TestAnthropicStreamEmitter`:

  - test_boundary_flag_closes_block_and_resets_cursor: a boundary=True
    status closes the open text block and the next content delta
    streams from zero.
  - test_empty_status_without_boundary_does_not_close_block: a plain
    empty status leaves block_index unchanged and the next content
    delta is diffed against the previous text length.

107 tests pass across the four anthropic + trailing-plan test files.
… tool_end

Codex 02:42Z on #5549 caught that flagging the post-tool empty-status
event with boundary=True (cycle-12 commit 610c387) double-handles the
cursor reset in the Anthropic streaming path.

AnthropicStreamEmitter._handle_tool_end already:
  - closes the open tool_use block,
  - emits tool_result,
  - increments block_index,
  - opens a fresh text block, and
  - resets _prev_text = "".

When llama_cpp.py then yielded boundary=True on the very next event,
_handle_boundary fired _close_block + _open_text_block on that freshly
opened (still-empty) text block. Result: every tool call produced a
spurious content_block_stop + content_block_start pair before the
post-tool model text streamed.

Fix:

  - llama_cpp.py: drop boundary=True from the post-tool status emit
    (line ~4768). Keep boundary=True only at the auto-continue site
    (line ~4500), which has no preceding tool_end to do the cursor
    work.
  - routes/inference.py OpenAI-compat tool stream: mirror the
    Anthropic semantics by resetting prev_text on BOTH tool_start AND
    tool_end, so the post-tool empty-status no longer needs to do it.

Add backend/tests/test_anthropic_messages.py::TestAnthropicStreamEmitter::
test_post_tool_empty_status_does_not_double_close as a regression
test: content -> tool_start -> tool_end -> empty status -> content
must not bump block_index past tool_end's increment, and the post-tool
content must land in the text block tool_end opened.

95 tests pass across the anthropic + trailing-plan suites.
@danielhanchen danielhanchen force-pushed the studio/auto-continue-mid-plan branch from 003ae8a to a1d79b7 Compare May 19, 2026 04:40

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a1d79b761e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

if self._text_block_open:
events.append(self._close_block())
self.block_index += 1
events.extend(self._open_text_block())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Defer opening boundary text blocks until content arrives

When a server-tool Anthropic stream first emits intent text and the reprompt then produces a tool call, the boundary status emitted by llama_cpp.py line 4463 enters _handle_boundary(). Opening a new text block here means the following tool_start immediately closes that just-opened block before emitting the tool_use, so clients receive a zero-length text content block between the intent text and the tool call. Reset the cursor and close the current block at the boundary, but wait to open a text block until _handle_content() actually sees text.

Useful? React with 👍 / 👎.

…undary

PR 5549 cycle-15 codex review surfaced two P2 issues:

1. Trailing-plan regex matched "Now let me know if you want another
   example" (the closer "let me know" was negative-lookahead-guarded
   only on the bare alternative, not on "now let me"). Refactor to a
   single `(?:now\s+)?let me(?!\s+know\b)` alternative so both
   prefixes share the same guard.

2. Anthropic adapter opened a new text block immediately on the
   iteration boundary marker. If the reprompted iteration started
   with a tool_call rather than text, that just-opened text block was
   immediately closed, emitting a zero-length text content block
   between intent text and tool_use. Defer the open: close the
   current block + bump block_index + reset cursor at the boundary,
   and let `_handle_content` lazy-open when real text arrives.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants