Studio: don't re-prompt after model produced a complete answer#5714
Studio: don't re-prompt after model produced a complete answer#5714danielhanchen wants to merge 42 commits into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 648ec8cbe6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| r"|<!doctype\b" # HTML page | ||
| r"|<html\b" | ||
| r"|<svg\b[\s\S]*?</svg>" # complete SVG | ||
| r"|(?:^|\n)\s*\d+\.\s+\S.*?\n\s*\d+\.", # 2+ numbered list items |
There was a problem hiding this comment.
Exclude planning lists from answer-artifact guard
The new numbered-list branch in _HAS_ANSWER_ARTIFACT treats any two-item 1./2. list as a completed answer, which suppresses the re-prompt even for plan-only stalls like Here's my plan: 1. I'll search... 2. I'll summarize.... That directly conflicts with the existing _INTENT_SIGNAL logic (which explicitly matches “here’s my plan”) and allows the model to skip required tool calls and return only a plan when tools are available.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request introduces a _HAS_ANSWER_ARTIFACT regex to prevent the model from re-prompting when it has already provided a substantive answer, such as a closed code block, HTML, SVG, or a numbered list. This addresses an issue where the re-prompt logic would over-fire on complete responses that happened to include intent-only language. Feedback suggests refining the code fence regex to support a wider range of language identifiers and to handle indented closing fences, which are common in model outputs.
| # require ALL of (intent signal, length < _REPROMPT_MAX_CHARS, no | ||
| # answer artifact) to fire. | ||
| _HAS_ANSWER_ARTIFACT = re.compile( | ||
| r"```[a-zA-Z]*\n[\s\S]+?\n```" # closed code fence |
There was a problem hiding this comment.
The current regex for closed code fences is too restrictive. It fails to detect artifacts when:
- The language identifier contains numbers or symbols (e.g.,
python3,c++,c#). - The closing fence is indented (e.g.,
```), which is common in model outputs when code is nested in lists or blockquotes.
Using [^\n]* for the language identifier and allowing optional horizontal whitespace before the closing backticks makes the detection much more robust.
| r"```[a-zA-Z]*\n[\s\S]+?\n```" # closed code fence | |
| r"```[^\n]*\n[\s\S]+?\n[ \t]*```" |
Live validation on Qwen3.5-35B-A3B Q8Ran the focused 4-prompt set (5 seeds each) against the same Studio config that scored 0/5 on
Aggregate: 8/20 -> 11/20 OK on a prompt set that was previously broken by the over-fire. Flappy deltaFlappy is the long-code prompt where the model needs to emit ~1500+ chars of pygame code. Looking at the 5 post-fix wall times: 13s / 10s / 51s / 41s / 114s. The shortest two used 1 tool call -- meaning the model decided to skip the web search and write code directly, but the response was truncated before reaching the 1500-char threshold the judge enforces. That is a separate issue (response truncation when the model bypasses the search tool), not caused by this PR. The pre-fix passing seed for flappy had a long search loop that fed enough context to push past the length threshold; with this PR the model can stop searching earlier and write code, but the judge's Billboard sweep regression testPre-PR Billboard sweep on the same config was 17/25 OK, best=9. Re-running 5 Billboard seeds post-fix on this slot: Plan-only stall behaviour is preserved by the new unit tests; the live sweep confirms no regression on the original PR #4700 capability. |
Billboard regression test resultSame Qwen3.5-35B-A3B Q8 Studio, 5 fresh billboard seeds post-fix:
5/5 OK post-fix vs 17/25 (68%) pre-fix on this same config. Plan-only stalls still re-prompt and the model still drives multi-turn tool calls — no regression on the PR #4700 capability. Summary of validation:
The fix is doing what it should: when the model already produced complete code/markup, the re-prompt no longer wipes it; when the model has only described intent without any artifact, the re-prompt still fires. |
8-config matrix validation (post-fix)Ran the focused 4-prompt set (python_game / flappy / weather / sloth_svg, 5 seeds each) against 8 distinct configs after restart with the patched llama_cpp.py. Honest read: Where the fix shines clearly: 3.5-35B-A3B Q8Before-vs-after on the same Studio (only delta is this PR):
Pre-fix: 0/5 pass; the model emitted ~1500-2700 chars of "I'll create a Snake game!" intent prose, then the re-prompt fired, the conversation pivoted, and the code block never got written. Post-fix: 3/5 pass; on the seeds where the model decided to write the code first, the re-prompt no longer wipes it. The visible content jumps 2x median chars and now contains actual code fences and Where the fix is within seed noise3.6-27B Q4/Q8, 3.5-27B Q4/Q8: median chars unchanged (~800-1000 or 0). These configs already produced little content; the fix doesn't help when the model is bottlenecked elsewhere (max_tokens=2048 cap, OOMs in tool execution, etc.). Where there's seed-variance regression3.6-35B-A3B UD-Q4_K_XL: 7374 → 2396 median chars. Looking at individual seeds, this is one lucky pre-fix seed-1 hitting a long input()-driven game versus post-fix seeds picking up random-event games that hit different judge code paths. With 5 seeds at temperature=0.7 the signal-to-noise is too low to draw a conclusion. NetThe fix is directly proven on 3.5-35B-A3B Q8 (and the test_llama_cpp_reprompt_guard.py unit tests pin the regex semantics). The matrix-wide noise comes from a 5-seed sample at temperature=0.7 on prompts whose judge is sensitive to game style. The fix is correct; the prompt set just needs more seeds or a more style-tolerant judge for a tight A/B across all configs. I also confirmed no Billboard regression: 5/5 pass on the same post-fix Studio (pre-fix was 17/25 on that config in the original sweep). |
Corrigendum: 25-seed matrix shows smaller python_game gain than initial 5-seed sampleRe-ran python_game with 25 seeds per config across all 8 matrix configs (200 post-fix vs 40 pre-fix trials, using a more permissive judge that accepts game-driven randomness / class-based state in addition to
The earlier 5-seed focused validation hit 3/5 on Qwen3.5-35B-A3B Q8 -- with a larger sample that drops to 1/25 (4%). The 5-seed signal was within seed noise. What actually changed (verified by character length distribution)Pre-fix runs hit additional re-prompt iterations (the bug); each iteration appended more Post-fix runs short-circuit the re-prompt when a complete artifact is present. The model produces one good response and stops there; total content is shorter but the answer is no longer wiped. Concretely on Qwen3.5-35B-A3B Q8 python_game:
So the fix does what it's supposed to do, just with a smaller-than-advertised effect size on the python_game pass rate. The dramatic case is Qwen3.6-35B-A3B-MTP UD-Q4_K_XL: 1/5 -> 6/25 (24%), which is a real 5x ratio that's outside seed noise. What this PR is and isn'tIt IS: a targeted fix for the over-fire heuristic that was wiping completed answers. Unit tests pin both directions of the regex semantics, including no-regression on plan-only stalls. It ISN'T: a complete fix for the model emitting short python_game responses. The python_game prompt has other bottlenecks (max_tokens=2048 cap, the I'd still recommend merging -- the fix is correct, the unit tests prove it, and it does measurably help the configs that already produce some code. But the headline shouldn't be "0/5 -> 3/5" from the small-sample focused test. |
Final A/B: 800 post-fix trials vs 160 pre-fix trials (8 configs × 4 prompts)Filling out the matrix to 25 seeds per cell on all 4 code-emitting prompts. Final numbers:
At the 800-trial sample size, aggregate pass rates are essentially unchanged. Individual configs swing both ways (e.g. Qwen3.6-35B-A3B Q4: weather 60% -> 88% up, python_game 60% -> 24% down). The 60% -> 24% python_game drop on the 3.6-35B-A3B Q4 config is the largest single regression and worth flagging -- though it's still 5x noisier than the 200-trial signal. Why aggregate didn't moveThe judge rubric (
What the fix demonstrably does
RecommendationThe PR is correct and safe but the effect on aggregate pass-rate is smaller than the initial 5-seed sample suggested. Reviewers should focus on the regex semantics (unit-tested) rather than the headline numbers. For a real fix to I'll leave this PR open for review and not push more validation runs. |
Robustness pass — cross-platform & adversarial inputDid a full theoretical + empirical sweep of edge cases (Linux/Mac/Windows portability, browser-agnostic since the change is server-side, adversarial regex input). Found and fixed two real issues. Pushed 5f92e80. Issues found in the initial commit
Fix_HAS_ANSWER_ARTIFACT = re.compile(
r"```[a-zA-Z]*\r?\n[\s\S]+?\r?\n```" # closed code fence
r"|<!doctype\b" # HTML page
r"|<html\b"
r"|<svg\b[\s\S]*?</svg>" # complete SVG
r"|(?:^|\r?\n)[ \t]*\d+\.[ \t]+\S.*?\r?\n[ \t]*\d+\.", # 2+ numbered list items
re.IGNORECASE,
)
VerificationRan a thorough isolated simulation suite under fresh
Cross-platform / browser scope
Is this PR fixing a real issue?Yes -- the original bug is reproduced empirically across 900 sweep runs on 15 Qwen GGUF configs (see Will merging break anything?No -- the change is additive (new |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4d6a470789
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| # O(n^2) backtracking through embedded `\r\n` characters on adversarial | ||
| # inputs. | ||
| _HAS_ANSWER_ARTIFACT = re.compile( | ||
| r"```[a-zA-Z]*\r?\n[\s\S]+?\r?\n```" # closed code fence |
There was a problem hiding this comment.
Accept non-alphabetic fence info strings
Update _HAS_ANSWER_ARTIFACT so fenced code blocks are recognized when the info string contains symbols (for example c++, c#, or objective-c). The current pattern ```[a-zA-Z]*\r?\n...``` only matches pure letters, so complete answers like `First, let me write it...\n```c++ ... ```` are treated as having no artifact; with tools enabled, that still satisfies the intent check and triggers the STOP re-prompt path, wiping a valid final code answer.
Useful? React with 👍 / 👎.
The plan-without-action re-prompt at
`studio/backend/core/inference/llama_cpp.py` fires when the model
emits intent-only language ("first I'll ...", "let me ...") without
calling a tool. Previously the heuristic only checked an intent regex
and a 2000-char length cap. The same intent words occur in long
explanations that accompany REAL code or markup, so a complete reply
like "First, let me set up pygame. ```python ... ```" still tripped
the re-prompt, and the synthetic follow-up ("STOP. Do NOT write code
or explain.") wiped the user-visible answer.
Reproduced at scale in a 900-run sweep across 15 Qwen3.5/3.6 GGUF
configs: prompts that emit code or markup (Create a Python game,
Create a Flappy Bird game, weather dashboard HTML, sloth SVG)
landed empty `final_text` for the majority of seeds even on the
strongest configs.
Fix adds a `_HAS_ANSWER_ARTIFACT` regex covering:
- closed code fences (```...```)
- HTML pages (<!doctype, <html)
- complete SVG (<svg...</svg>)
- 2+ item numbered lists
and a `and not _HAS_ANSWER_ARTIFACT.search(_stripped)` guard on the
re-prompt condition. Plan-only stalls still re-prompt; complete
responses no longer do.
13 new unit tests in `test_llama_cpp_reprompt_guard.py` pin both
directions (artifact present -> no re-prompt; plan-only -> still
re-prompts).
for more information, see https://pre-commit.ci
…racking Two robustness fixes for the `_HAS_ANSWER_ARTIFACT` regex from the parent commit, both caught by a thorough simulation suite covering Linux/Mac/Windows line-ending portability and adversarial inputs. 1. **CRLF line endings.** The original `\n` literals missed Windows- authored or CRLF-converted content (model echoing a pasted prompt, etc.). Replaced with `\r?\n` everywhere a newline is required, so closed code fences, numbered lists, and end-to-end re-prompt decisions all work on `\r\n` as well as `\n`. 2. **Catastrophic backtracking on whitespace spam.** The numbered-list alternative `(?:^|\r?\n)\s*\d+\.\s+\S.*?\r?\n\s*\d+\.` was O(n^2) on long whitespace runs: `\s*` greedy + `\d+` failing + `\s` matching `\r\n` led to repeated backtracking through the newline characters. Measured at ~630ms for 10KB of `\r\n` repeats. Fix: restrict the post-newline indent to `[ \t]*` (spaces / tabs only). After `\r?\n` we are at column 0 and only spaces / tabs are a sensible leading indent for a list item; greedy whitespace was never needed. New worst case on the same input: <1ms (1000x speedup). Added 5 in-tree tests: - test_artifact_regex_handles_crlf_code_fence - test_artifact_regex_handles_crlf_numbered_list - test_artifact_regex_handles_mixed_lf_crlf - test_no_backtrack_on_crlf_spam (asserts <50ms on 10KB \r\n) - test_no_reprompt_on_crlf_complete_python_game All 18 reprompt-guard tests pass. All 253 llama_cpp-related tests pass. Out-of-tree simulation suite (84 tests) passes on both Python 3.12 and Python 3.13 inside isolated uv venvs.
for more information, see https://pre-commit.ci
4d6a470 to
6639a3b
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6639a3b31a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| r"|<!doctype\b" # HTML page | ||
| r"|<html\b" | ||
| r"|<svg\b[\s\S]*?</svg>" # complete SVG | ||
| r"|(?:^|\r?\n)[ \t]*\d+\.[ \t]+\S.*?\r?\n[ \t]*\d+\.", # 2+ numbered list items |
There was a problem hiding this comment.
Keep plan-only numbered lists eligible for re-prompt
The new numbered-list alternative in _HAS_ANSWER_ARTIFACT treats any two-item 1./2. list as a completed answer, so plan text like Here's my plan:\n1. I'll search...\n2. I'll summarize... now bypasses the tool-forcing re-prompt even though no tool was called. That defeats the purpose of the guard for a common stalling pattern when tools are available. Fresh evidence in this commit: test_artifact_regex_handles_crlf_numbered_list explicitly asserts "Here's the plan:\r\n1. one\r\n2. two\r\n" is an artifact, so this false-negative path is now locked in by tests.
Useful? React with 👍 / 👎.
| # O(n^2) backtracking through embedded `\r\n` characters on adversarial | ||
| # inputs. | ||
| _HAS_ANSWER_ARTIFACT = re.compile( | ||
| r"```[a-zA-Z]*\r?\n[\s\S]+?\r?\n```" # closed code fence |
There was a problem hiding this comment.
Recognize one-line fenced blocks before suppressing re-prompt
The closed-fence pattern requires a newline before the closing fence (...\r?\n```), so responses such as First, let me write it:\n```python\nprint('ok')```` are not detected as answer artifacts. In that case _INTENT_SIGNAL still matches (`First, let me...`), so with tools enabled the backend re-prompts and can overwrite a valid code answer that was already produced.
Useful? React with 👍 / 👎.
…ng lists, incomplete HTML
Addresses three follow-ups flagged on the first cut of this PR by static
reviewers and parallel reviewer runs:
1. Numbered plan-only stalls were treated as completed answers. A
response like `Here's my plan:\n1. Search the web\n2. Summarise`
matched both `_INTENT_SIGNAL` and the numbered-list branch of
`_HAS_ANSWER_ARTIFACT`, so the tool-forcing re-prompt was skipped.
That contradicted the PR's stated invariant that plan-only stalls
still re-prompt. The list now has to be paired with no plan framing
(no `Here's my plan` / `plan:` / `approach:`, no intent phrase
followed by a tool-action verb) to count as an artifact.
2. Closed code fences with non-alpha info strings (`python3`, `c++`,
`c#`, `objective-c`, `ts-node`, `bash-session`, `python linenums="1"`)
were not recognised by the `[a-zA-Z]*` info-string class. Complete
answers in those languages still re-prompted and could be wiped.
The info-string class is now `[^\r\n]{0,200}` and the closing fence
may be indented.
3. Bare `<!doctype` or `<html` text was treated as an artifact. A
plan-only response that mentions `<html>` in prose now no longer
bypasses the re-prompt; the HTML branch requires a closing
`</html>` (doctype prefix optional).
All `[\s\S]{...}?` runs are length-bounded so ReDoS-style adversarial
input stays linear. ReDoS guard tests cover CRLF spam and repeated
`<html ` openings without close.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
for more information, see https://pre-commit.ci
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
…ntemplative verbs
Three follow-up gaps surfaced by another reviewer sweep on the previous commit:
1. Tilde-fenced code (~~~lang ... ~~~) was not detected. CommonMark allows
it and several models emit it when the body itself contains backticks.
Add a tilde alternative to _HAS_ANSWER_ARTIFACT mirroring the backtick
form (any info string, optional indent on close, length-bounded body).
2. Bare "Plan:" / "Approach:" lines did not match _INTENT_SIGNAL, so a
"Plan:\n1. search\n2. summarise" stall slipped past the entry gate
entirely. Add the colon form to the step / plan framing alternative.
3. The plan-framing verb whitelist missed common contemplative verbs
(think / respond / answer / analy[sz]e / explore / outline / gather /
query / reason) so plan stalls phrased without explicit "Here's my
plan" framing were misclassified as completed answers. Keep the
whitelist conservative: write / create / make / build / read / list /
try are intentionally out because real answer lists use them
("1. Write a poem", "1. Read War and Peace").
Added regression tests for each fix plus an extra ReDoS budget test for
the doctype/<html alternation worst case (about 7 ms today; assert < 50
ms so a future quantifier change that drops the inner {0,4000} bound
fails loudly).
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
The follow-up commit used ``i['’]?ll`` (apostrophe optional) in ``_PLAN_LIST_FRAMING``. With the apostrophe optional the alternative also matches the word "ill" (sick), so a response like "She is ill. Here is the list:\n1. ...\n2. ..." plus an unrelated action verb within 80 chars was misclassified as a plan and re-prompted. Make the apostrophe required (``i['’]ll``) to mirror the original _INTENT_SIGNAL definition. Add a regression test that pins the distinction: "ill" as adjective does not trigger plan framing, but "I'll" / "I will" do.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
Two more gaps surfaced by another reviewer sweep on the previous commit:
1. _PLAN_LIST_FRAMING was missing several intent forms that
_INTENT_SIGNAL accepts, so numbered tool-action plans phrased with
"Allow me", "I'm going to", "I'm gonna", "I am gonna", or "I shall"
were silently classified as completed answers and skipped the
tool-call re-prompt. Mirror the full intent set from _INTENT_SIGNAL
so the two regexes stay in lock-step.
2. Bare \b(?:plan|approach): in _INTENT_SIGNAL / _PLAN_LIST_FRAMING
matched any in-text occurrence of "plan:" / "approach:", including
"lesson plan:" / "meal plan:" / "migration plan:". A direct answer
like "Here is a lesson plan:\n1. Warm-up\n2. Group practice" would
trip _INTENT_SIGNAL and risk wiping the response. Anchor the colon
marker to start of line and only allow generic determiners (my, the,
our, a, this, that) between the line start and the keyword.
3. Add "Here is the plan" / "Here are my steps" to both _INTENT_SIGNAL
and _PLAN_LIST_FRAMING so non-apostrophe phrasings of the same
framing pattern are caught.
Added regression tests covering every intent form against a numbered
action plan, and a line-anchor test that distinguishes generic plan
framings ("My plan:", "The approach:") from content noun phrases
("lesson plan:", "meal plan:").
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
…uct text does not re-prompt
After narrowing the colon marker to lines starting with a generic
determiner ("My plan:" / "The approach:" / ...), inline product or
pricing answers like "Your current Plan: Pro includes local chats",
"The plan: Basic is free, Pro is $10/month", or
"My plan: use dynamic programming" still slipped into the re-prompt
path and could wipe a valid answer.
Add a lookahead requiring a newline (with optional trailing horizontal
whitespace) after the colon, so only header-style framings like
"Plan:\n1. search\n2. summarise" or "My approach:\n1. fetch" count.
Inline "Plan: <text>" is now treated as ordinary prose.
Add eight regression samples (lesson plan, meal plan, marketing plan,
pricing plan, recommended approach, migration plan, dynamic-programming
plan, currently active plan) all of which previously re-prompted under
the unanchored matcher and now correctly do not.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
…ents
Reviewer round 8 surfaced a real false positive in the previous commit:
a final answer naturally titled "Plan:" / "My plan:" / "Approach:" with
numbered content items now slipped through _INTENT_SIGNAL and got
wiped by the synthetic STOP turn. Examples:
Plan:
1. Warm-up: Students review fractions.
2. Group practice.
3. Assessment.
My plan:
1. Breakfast: oatmeal and fruit.
2. Lunch: rice bowl.
3. Dinner: lentil soup.
Here is the plan you asked for. It is two pages long.
Add a lookahead requiring one of the conservative re-prompt action
verbs (search / fetch / verify / look up / call / compare / think /
respond / etc.) to appear within 120 chars after the "Plan:" /
"Approach:" / "Here is the plan" / "Here are my steps" marker. Plan
stalls whose items are tool actions ("Plan:\n1. search the docs\n2.
summarise the result") still match and re-prompt; prose plans whose
items are content do not.
Also mirror "first" in _PLAN_LIST_FRAMING so numbered action plans
that start with "First" stay disqualified even after the helper enters
the numbered-list branch.
Factor the action-verb set out as _REPROMPT_ACTION_VERBS so both
regexes share one source of truth.
Six new regression samples: three lesson / meal / weather plans that
must NOT wipe, three action-plan headers that must re-prompt, three
prose "Here is the plan" answers that must not wipe.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aba4824de2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if _HAS_ANSWER_ARTIFACT.search(text): | ||
| return True |
There was a problem hiding this comment.
Keep plan-framed replies re-promptable with fenced placeholders
_has_answer_artifact returns True as soon as _HAS_ANSWER_ARTIFACT finds any closed fence, before the numbered-list/plan-framing checks run. With tools enabled, a short stall like Here's my plan: 1. Search the web... 2. Summarize... plus a placeholder fenced block (for example a stub snippet) now bypasses the re-prompt even though no tool was called, so the model can return intent-only text and skip required tool execution.
Useful? React with 👍 / 👎.
…t, list-item verb cross-check
- _has_unclosed_markup_block() now compares open / close tag counts.
A response with one closed <html> followed by a second still-open
<html> (multi-page mid-stream) or <svg></svg><svg> is unbalanced,
so the artifact path returns False and the re-prompt fires. The
helper now runs BEFORE _HAS_ANSWER_ARTIFACT so an earlier complete
artifact cannot mask a later open block.
- _looks_like_real_artifact() rejects empty <html></html> /
<svg></svg> skeletons. Plan-only mentions ("First, I'll create an
<html></html> skeleton, then add CSS.") no longer suppress the
re-prompt.
- _NUMBERED_ACTION_ITEM + _STRONG_INTENT_BEFORE_LIST catches plans
where the work verbs sit in the list ITEMS rather than before the
list (e.g. "First, I'll:\n1. Load the CSV.\n2. Compute total").
The verb whitelist is intentionally narrow (load, parse, calculate,
compute, analyze, run, execute, fetch, download, query, inspect,
extract) so ordinary algorithm answers ("First, use binary search:
1. Search the left half") stay valid. The intent gate excludes
bare "First" / "Step N:" for the same reason - direct first-person
pronoun is required.
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7c08ade3a4
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| opens_html = len(re.findall(r"<html\b", text, re.IGNORECASE)) | ||
| closes_html = len(re.findall(r"</html>", text, re.IGNORECASE)) | ||
| if opens_html > closes_html: | ||
| return True |
There was a problem hiding this comment.
Skip markup-balance gating for fenced code content
_has_unclosed_markup_block counts raw <html>/<svg> tokens across the entire response before checking for fenced-code artifacts, so a valid closed code block like `````python\nT = ""\n``` `` is treated as "unclosed markup" and _has_answer_artifact returns `False`. In tool-enabled turns with intent phrasing (`"First, let me..."`), this re-enables the STOP re-prompt and can overwrite a complete code answer that merely contains HTML/SVG literals inside code.
Useful? React with 👍 / 👎.
| m = _HAS_ANSWER_ARTIFACT.search(text) | ||
| if not m: | ||
| return False | ||
| matched = m.group(0).strip() | ||
| if _EMPTY_MARKUP_SKELETON.fullmatch(matched): |
There was a problem hiding this comment.
Keep searching after matching an empty markup skeleton
_looks_like_real_artifact inspects only the first _HAS_ANSWER_ARTIFACT match and returns False if that match is an empty <html></html>/<svg></svg> skeleton. If the same response later contains a substantive artifact (for example, a closed code fence), it is never considered, so _has_answer_artifact can still be False and the model gets re-prompted despite already producing a complete answer.
Useful? React with 👍 / 👎.
…tches, bare-intent colon plan - _has_answer_artifact() now strips closed code fences before checking for unclosed markup, and strips closed markup before checking for unclosed code fences. A Python / JS snippet containing literal "<html>" / "<svg>" strings no longer trips the unclosed-markup cross-check, and complete HTML containing a JS string with literal backticks no longer trips the unclosed-fence cross-check. - _looks_like_real_artifact() iterates every artifact match. An empty <html></html> / <svg></svg> skeleton followed by a real complete page no longer hides the real artifact. - _is_empty_markup_skeleton() strips an optional <!doctype ...> prefix before testing the empty-skeleton pattern, so "<!doctype html><html></html>" plan-only mentions also re-prompt. - _BARE_INTENT_NUMBERED_PLAN catches the tight "I'll:\n1. Open ..." / "Let me:\n1. Parse ..." shape where bare first-person intent + colon + newline is immediately followed by numbered action items. No work verb is required between the intent and the list.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ae8f70f05e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| r"step \d+|first|" | ||
| r"i['’](?:ll|m going to|m gonna)|i am (?:going to|gonna)|" | ||
| r"i will|i shall|let me|allow me|now i|next i)\b" | ||
| r"[\s\S]{0,2000}?" | ||
| rf"\b(?:{_TOOL_ACTION_VERBS})\b", |
There was a problem hiding this comment.
Keep generic "Step 1" numbered plans re-promptable
The numbered-list fallback currently treats lists as completed artifacts unless _PLAN_LIST_FRAMING matches a tool-action verb, so intent-only plans like Step 1:\n1. Analyze the request.\n2. Draft the answer. are classified as final answers and skip the tool-call re-prompt. This is a regression from the prior intent guard for common stalling phrasing (first/step N) when tools are available, because no tool call happened but the response is no longer nudged.
Useful? React with 👍 / 👎.
| if _has_unclosed_markup_block(text_without_closed_fences): | ||
| return False |
There was a problem hiding this comment.
Do not fail artifact detection on literal HTML tag mentions
The unclosed-markup gate runs before accepting an already complete artifact and counts raw <html>/<svg> tokens anywhere outside stripped closed fences, so a valid fenced-code answer followed by prose like ...replace <html>... is marked as "unclosed markup" and re-prompted. In tool-enabled turns with intent phrasing, this can still overwrite a complete answer even though the code artifact is finished.
Useful? React with 👍 / 👎.
When a real complete artifact is already in the response, prose mentions of bare <html> / <svg> tags in explanatory text are common (for example "Use the <html> tag for the root"). The unbalanced- open/close count would falsely classify the response as mid-stream and wipe the valid answer. The artifact-counting cross-check now only runs when NO real artifact has been emitted yet; once a real artifact exists, mid-stream second markup is rare enough that the count-based detector is not worth the false-positive cost. This also unblocks complete <html> answers that nest <svg> children or contain JS string literals like "<svg width=10>", since those unmatched markup tokens were being flagged as unclosed.
for more information, see https://pre-commit.ci
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
1 similar comment
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
…en first-person plan verbs
- Re-prompt path calls _strip_tool_markup(final=True) on content_accum
before measuring intent / artifact / length. An orphan
``<tool_call>...</tool_call>`` block containing a code fence no
longer hides the intent-only visible answer from the artifact check.
- _DIRECT_NUMBERED_PLAN_FRAMING splits into two branches:
* First-person intent ("I'll", "Let me", "I will", etc.) accepts a
broader work-verb set (open, read, search, check, review, inspect,
examine, etc.). Direct first-person announcements are strong
plan-like signals.
* Bare "First, ..." / "Step N: ..." keeps the narrow verb set so
algorithmic answers ("First, use binary search:") stay valid.
Catches stalls like "I will check the docs:\n1. Gather..." and
"Let me read the uploaded file:\n1. Identify the columns..." that
previously slipped past the freshness-gated lookup verbs.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
for more information, see https://pre-commit.ci
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
- _has_unclosed_code_fence() ignores a fence run when the trailing text on the same line starts with a space (typical English prose like "Use \`\`\` to start a markdown fence."). Real fence openers either end the line right after the delimiters or carry an info string with no leading space (\`\`\`python, \`\`\`bash-session). - _DIRECT_NUMBERED_PLAN_FRAMING accepts "take these steps", "follow these steps", and "perform these actions" as first-person intent verbs. Plans like "I'll take these steps:\n1. Open URL\n 2. Read" still re-prompt instead of being read as final answers.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
for more information, see https://pre-commit.ci
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
…her verbs - _has_unclosed_code_fence() splits inline fence handling from column-0 handling. Inline fences (text before the delimiters on the same line) now require a clean info-string trailing (no internal whitespace, no leading space) to count. Prose mentions like "Use \`\`\` to start" or "Use \`\`\`python to open a block." no longer falsely flag the response as mid-stream. Column-0 fences keep their permissive info-string parsing. - _INTENT_SIGNAL + _DIRECT_NUMBERED_PLAN_FRAMING + _BARE_INTENT_NUMBERED_PLAN + _STRONG_INTENT_BEFORE_LIST all add "i need to" as a direct first-person intent phrase. - _DIRECT_NUMBERED_PLAN_FRAMING and _BARE_INTENT_NUMBERED_PLAN add visit / access / navigate / gather / collect / identify / update / edit so browser-navigation and data-gathering plans still re-prompt. - _LOCAL_ACTION_VERBS adds gather / collect / identify so numbered lists whose item verbs match these still trigger the intent-+-action-item cross-check.
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b6477eddb1
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| # Excludes "I can", "I should", "I want to", "let's" which | ||
| # appear frequently in direct answers / explanations. | ||
| r"\b(i['\u2019](ll|m going to|m gonna)|i am (going to|gonna)|i will|i shall|let me|allow me)\b" | ||
| r"\b(i['\u2019](ll|m going to|m gonna)|i am (going to|gonna)|i will|i shall|i need to|let me|allow me)\b" |
There was a problem hiding this comment.
Remove generic "I need to" from intent trigger
Adding i need to to _INTENT_SIGNAL makes many ordinary text answers look like plan-only stalls when tools are enabled. A response such as "I need to clarify one assumption before the final result" (with no code/HTML/SVG artifact) now satisfies the re-prompt guard and gets replaced by the synthetic STOP/tool-call turn, even though it can be a legitimate final explanation. This broad phrase was not previously part of the intent gate, so it introduces new false re-prompts and answer wiping for non-tool tasks.
Useful? React with 👍 / 👎.
…erbs
- _INTENT_SIGNAL, _DIRECT_NUMBERED_PLAN_FRAMING, _BARE_INTENT_NUMBERED_PLAN,
and _STRONG_INTENT_BEFORE_LIST all drop bare "i need to" from their
intent vocabulary. The phrase is too common in ordinary clarification
prose ("I need to know your operating system") and quoted answer text
("I need to leave early"), so adding it as a re-prompt trigger
produced too many false positives. Genuine "I need to X..." plans
are still caught when paired with "I'll", "Let me", or "first" /
"step N" framing elsewhere in the candidate.
- _DIRECT_NUMBERED_PLAN_FRAMING adds "complete these steps" /
"complete the following steps" to the verb whitelist, parallel
to the existing "take/follow these steps" / "perform these actions".
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d77b38d51d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if not real_artifact and _has_unclosed_markup_block(text_without_both): | ||
| return False |
There was a problem hiding this comment.
Defer unclosed-tag gate until after numbered-list fallback
_has_answer_artifact returns early on _has_unclosed_markup_block(...) before it evaluates _NUMBERED_LIST_ARTIFACT, so valid numbered-list answers that mention literal opening tags (for example 1. <html> / 2. <body>) are treated as non-artifacts and still trigger the tool-call re-prompt when intent phrasing like “First,” is present. In tool-enabled turns this wipes legitimate instructional answers about HTML/SVG syntax, even though the list is the final user-visible answer rather than a partial streamed markup block.
Useful? React with 👍 / 👎.
- _DIRECT_NUMBERED_PLAN_FRAMING collapses take/follow/complete steps and perform actions into the symmetric (take|follow|complete|perform) (these|the following) (steps|actions) pattern. Phrasings like "I'll perform these steps:" or "I will take the following actions:" are now caught alongside the existing variants. - _BARE_INTENT_NUMBERED_PLAN expands "look up" to "look (this|that|it|them)? up" so "I'll:\n1. Look this up." is treated as a plan stall, matching _DIRECT_NUMBERED_PLAN_FRAMING.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7f317b1e3a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| r"(?:open|read|search|" | ||
| r"look (?:this |that |it |them )?up|" | ||
| r"check|verify|create|build|add|set up|" | ||
| r"load|inspect|parse|calculate|compute|analy[sz]e|extract|run|execute|" | ||
| r"fetch|download|query|summari[sz]e|implement|generate|draft|write|" | ||
| r"visit|access|navigate|gather|collect|identify|update|edit)\b", |
There was a problem hiding this comment.
Keep bare "I will:" numbered plans re-promptable
_BARE_INTENT_NUMBERED_PLAN only treats a narrow verb list as plan actions, so replies like I will:\n1. review code.\n2. summarize. or I will:\n1. find information.\n2. summarize. are classified as answer artifacts and skip the tool-call re-prompt. In tool-enabled turns this is a regression from the previous guard behavior: the model can emit a forward-looking action plan without calling any tool, and the loop accepts it as final output.
Useful? React with 👍 / 👎.
| # unbalance the open/close count), so we rely on the closed-artifact | ||
| # path instead and skip the count check. | ||
| real_artifact = _looks_like_real_artifact(text) | ||
| if not real_artifact and _has_unclosed_markup_block(text_without_both): |
There was a problem hiding this comment.
Preserve unclosed-markup gating after prior artifact match
The if not real_artifact condition skips _has_unclosed_markup_block(...) whenever any earlier closed artifact exists, so a response like First, I'll search... + closed fence + trailing open <svg ...> is treated as complete and bypasses re-prompt. This accepts mid-stream/unfinished markup as final output in tool-enabled flows, even though the function docstring says unclosed markup should disqualify the artifact path.
Useful? React with 👍 / 👎.
Summary
The plan-without-action re-prompt at
studio/backend/core/inference/llama_cpp.pywas over-firing on long-but-complete responses. The heuristic previously checked only the intent regex_INTENT_SIGNALmatching "first" / "let me" / "I'll" / etc. and a 2000-char length cap.Those same intent words appear in long explanations that accompany REAL code or markup, so a complete reply like
First, let me set up pygame.followed by a closed Python code block still tripped the re-prompt, and the synthetic follow-up turn ("STOP. Do NOT write code or explain.") wiped the user-visible code from the conversation.Evidence
Reproduced at scale in a 900-run sweep across 15 Qwen3.5/3.6 GGUF configs on B200s. Prompts that emit code or markup landed empty
final_textfor the majority of seeds even on the strongest configs:Typical failure: empty visible response, finish_reason=stop, server log shows Re-prompt 1/3: model responded without calling tools (~1500 chars) immediately after a complete code block was streamed.
Fix
Adds a
_HAS_ANSWER_ARTIFACTregex matching closed code fences, HTML pages, complete SVG, and numbered lists. Addsand not _HAS_ANSWER_ARTIFACT.search(_stripped)to the re-prompt condition.Tests
studio/backend/tests/test_llama_cpp_reprompt_guard.pyadds 13 unit tests pinning both directions:All 13 tests pass locally.
Test plan