feat: add inference multi-turn chat interface#3723
Conversation
…toggle, thinking token stats
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR introduces an interactive multi-turn chat interface to axolotl. The implementation includes a complete REPL with slash-command support, parameter runtime adjustment, KV-cache reuse for causal models, diffusion model support, thinking block rendering, and CLI integration alongside existing inference modes. ChangesInteractive Chat Feature
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…sistant turns in parse_response format - Ctrl+C during a (diffusion) turn no longer crashes the REPL; the session survives and the user message is kept - exceptions in slash-command handlers no longer kill the session - consecutive user messages merge so strict templates never see two user turns after a failed generation - assistant turns are stored without special tokens, with thinking under reasoning_content (tokenizer parse_response schema when available, think-marker split otherwise); EOS markers no longer leak into the streamed display
|
📖 Documentation Preview: https://6a2fcee58d1652d06fe17e55--resonant-treacle-0fd729.netlify.app Deployed on Netlify from commit a17abca |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/axolotl/cli/chat.py`:
- Around line 981-984: When clearing or undoing the last assistant reply you
must also clear the cached hidden reasoning stored in session.last_think_text;
update cmd_new to set self.session.last_think_text = "" (or None) after
self.session.clear(), and similarly reset session.last_think_text in the handler
that removes the last assistant turn (the undo command around lines ~1047-1052,
e.g., cmd_undo or whichever function pops the last reply). Ensure any other code
paths that discard the last assistant turn also reset last_think_text so /expand
cannot reveal removed reasoning.
- Around line 217-225: tokenizer.parse_response() can return structured
content-parts but TurnGenerator.build_assistant_message leaves that unchanged,
while ChatSession.save_jsonl and ChatRepl.cmd_history assume message["content"]
is a plain string and will nest or fail; also _generate_turn sets
self.last_think_text but cmd_new and cmd_undo don’t reset it causing stale
hidden thinking for /expand. Fix by normalizing assistant messages to a single
internal schema in TurnGenerator.build_assistant_message (convert any
parse_response result to message["content"] being a plain string or a consistent
content-parts list), update ChatSession.save_jsonl and ChatRepl.cmd_history to
accept that normalized schema (extract text when content is string-like or pull
text from first content-part), and ensure cmd_new and cmd_undo clear
self.last_think_text so /expand never shows stale thinking.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0ca0369c-dcc6-4136-b9ec-b0a137689ddb
📒 Files selected for processing (12)
docs/cli.qmddocs/inference.qmdsrc/axolotl/cli/chat.pysrc/axolotl/cli/config.pysrc/axolotl/cli/inference.pysrc/axolotl/cli/main.pysrc/axolotl/cli/utils/__init__.pysrc/axolotl/cli/utils/load.pysrc/axolotl/logging_config.pytests/cli/test_chat_repl.pytests/cli/test_cli_inference.pytests/test_logging_config_file_capture.py
| def save_jsonl(self, path: str): | ||
| # content-parts format: text-only today, but matches the multimodal | ||
| # dataset format so saved sessions stay usable as training data | ||
| messages = [] | ||
| for message in self.conversation(): | ||
| out = { | ||
| "role": message["role"], | ||
| "content": [{"type": "text", "text": message.get("content") or ""}], | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate the target file
ls -la src/axolotl/cli/chat.py
# Show the relevant portions with line numbers
sed -n '150,280p' src/axolotl/cli/chat.py
sed -n '300,420p' src/axolotl/cli/chat.py
sed -n '980,1140p' src/axolotl/cli/chat.py
# Search for the functions mentioned in the original review comment
rg -n "def save_jsonl|def build_assistant_message|cmd_history|cmd_expand|last_think_text|escape\(" src/axolotl/cli/chat.py
# Dump the exact implementations of the cited functions/fields (when present)
# Use ranges around matches for context
for pat in "def save_jsonl" "def build_assistant_message" "def cmd_history" "def cmd_expand" "last_think_text"; do
line=$(rg -n "$pat" src/axolotl/cli/chat.py | head -n1 | cut -d: -f1 || true)
if [ -n "${line:-}" ]; then
start=$((line-40)); end=$((line+80))
sed -n "${start},${end}p" src/axolotl/cli/chat.py
fi
doneRepository: axolotl-ai-cloud/axolotl
Length of output: 39393
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file="src/axolotl/cli/chat.py"
rg -n "last_think_text" "$file"
echo "---- cmd_new/cmd_undo region ----"
# show around cmd_new and cmd_undo (approx based on earlier snippet)
sed -n '880,980p' "$file"
echo "---- cmd_history/cmd_expand region ----"
sed -n '1000,1135p' "$file"
echo "---- save_jsonl region ----"
sed -n '200,270p' "$file"
echo "---- build_assistant_message region ----"
sed -n '300,390p' "$file"Repository: axolotl-ai-cloud/axolotl
Length of output: 16029
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file="src/axolotl/cli/chat.py"
rg -n "def cmd_new|def cmd_undo" "$file"
for pat in "def cmd_new" "def cmd_undo"; do
line=$(rg -n "$pat" "$file" | head -n1 | cut -d: -f1)
start=$((line-25)); end=$((line+25))
echo "---- $pat around $line ----"
sed -n "${start},${end}p" "$file"
doneRepository: axolotl-ai-cloud/axolotl
Length of output: 4328
🌐 Web query:
Hugging Face Transformers tokenizer parse_response response_schema content shape list of content parts
💡 Result:
In the Hugging Face Transformers library, the parse_response method and its associated response_schema are used to convert raw model-generated text into a structured dictionary format [1][2]. The response_schema is a JSON schema that defines the expected structure of the output, augmented with custom fields—such as x-regex, x-regex-iterator, and x-parser—to control how string extraction and parsing are performed [1][3][4]. Regarding the specific query about a "list of content parts" in the schema, this is handled through schema definitions that utilize array types or iterative regex patterns [3][4]. When a schema field is defined with "type": "array", the tokenizer manages the extraction of multiple items into a list [4]. Specifically: 1. x-regex-iterator: This field allows you to provide a regular expression string that is applied to the input using re.finditer. The result is a list containing all matches extracted from the string [3]. 2. Array Items: When the schema type is "array", the "items" attribute defines the structure for each object in that list, allowing for complex parsing of multiple content blocks (such as tool calls or sequential message parts) [4]. 3. Parsing logic: Under the hood, these schemas map strings generated by the model to keys in a dictionary [1][3]. For example, a schema might use named regex groups to populate string fields or the "x-parser": "json" directive to parse a chunk of text as a JSON object [3][4]. To use this feature, you set the response_schema attribute on your tokenizer, which then persists the configuration so it can be saved and shared via save_pretrained or push_to_hub [1][3]. When you call tokenizer.parse_response(output_text), the tokenizer uses this schema to process the text into the structured format you defined [1][2].
Citations:
- 1: https://huggingface.co/docs/transformers/en/chat_response_parsing
- 2: https://huggingface.co/docs/transformers/en/internal/tokenization_utils
- 3: https://huggingface.co/docs/transformers/v5.5.4/chat_response_parsing
- 4: https://github.com/huggingface/transformers/blob/main/docs/source/en/chat_response_parsing.md
Normalize tokenizer.parse_response() assistant messages to a single internal schema before /save, /history, and /expand consume them.
TurnGenerator.build_assistant_message()returnstokenizer.parse_response(text)dict unchanged (around 330), butChatSession.save_jsonl()(217-225) andChatRepl.cmd_history()(1033-1038) assumemessage["content"]is string-like and wrap/escape it directly; ifcontentis structured (content-parts),/savewill nest it under"text"and/historycan fail.- Hidden thinking state is also inconsistent:
_generate_turn()setsself.last_think_text, butcmd_new()(981) andcmd_undo()(1047) don’t reset it, so/expand(1107-1109) can show stale hidden thinking after clear/undo.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/axolotl/cli/chat.py` around lines 217 - 225, tokenizer.parse_response()
can return structured content-parts but TurnGenerator.build_assistant_message
leaves that unchanged, while ChatSession.save_jsonl and ChatRepl.cmd_history
assume message["content"] is a plain string and will nest or fail; also
_generate_turn sets self.last_think_text but cmd_new and cmd_undo don’t reset it
causing stale hidden thinking for /expand. Fix by normalizing assistant messages
to a single internal schema in TurnGenerator.build_assistant_message (convert
any parse_response result to message["content"] being a plain string or a
consistent content-parts list), update ChatSession.save_jsonl and
ChatRepl.cmd_history to accept that normalized schema (extract text when content
is string-like or pull text from first content-part), and ensure cmd_new and
cmd_undo clear self.last_think_text so /expand never shows stale thinking.
| def cmd_new(self, _args: str) -> None: | ||
| self.session.clear() | ||
| self.console.print("[dim]Conversation cleared.[/dim]") | ||
| return None |
There was a problem hiding this comment.
Clear cached hidden reasoning when the last reply is removed.
/new and /undo mutate the conversation, but last_think_text survives, so /expand can still reveal reasoning for a reply the user just cleared. Reset that field anywhere the last assistant turn is discarded.
Also applies to: 1047-1052
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/axolotl/cli/chat.py` around lines 981 - 984, When clearing or undoing the
last assistant reply you must also clear the cached hidden reasoning stored in
session.last_think_text; update cmd_new to set self.session.last_think_text = ""
(or None) after self.session.clear(), and similarly reset
session.last_think_text in the handler that removes the last assistant turn (the
undo command around lines ~1047-1052, e.g., cmd_undo or whichever function pops
the last reply). Ensure any other code paths that discard the last assistant
turn also reset last_think_text so /expand cannot reveal removed reasoning.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
- /new now drops the cross-turn KV cache instead of leaving it on device until the next generation - throttle live thinking-tail rerenders to the 12 Hz repaint rate (was O(n^2) splitlines over the full think text per chunk) - split think markers once per turn and reuse for counts and the stored message, dropping the redundant full decode
Description
How to try it out, take a simple config:
Run:
Misc changes:
Motivation and Context
How has this been tested?
AI Usage Disclaimer
Screenshots (if appropriate)
Types of changes
Social Handles (Optional)
Summary by CodeRabbit
Release Notes
New Features
axolotl inference --chat) with streaming token output and runtime-adjustable generation parametersDocumentation
Bug Fixes