feat: add inference multi-turn chat interface by NanoCode012 · Pull Request #3723 · axolotl-ai-cloud/axolotl

NanoCode012 · 2026-06-11T10:26:03Z

Description

How to try it out, take a simple config:

base_model: Qwen/Qwen3-0.6B
chat_template: tokenizer_default
datasets:
  - path: mhenrichsen/alpaca_2k_test
    type: chat_template
output_dir: /tmp/qwen3-chat-out
sequence_len: 4096
micro_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1e-4

Run:

axolotl inference /tmp/qwen3-chat.yaml --chat

Misc changes:

Will suppress the HF hub token missing log

Motivation and Context

How has this been tested?

AI Usage Disclaimer

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

Release Notes

New Features
- Added interactive multi-turn chat mode (axolotl inference --chat) with streaming token output and runtime-adjustable generation parameters
- Support for thinking models with streaming reasoning blocks and display collapse options
- Support for diffusion models in chat mode
- Slash commands for conversation management: new chat, undo, retry, view history, adjust parameters, and save conversations
Documentation
- Added comprehensive interactive chat usage guide with session controls and parameter tuning
- Added dedicated sections for thinking and diffusion model chat support
Bug Fixes
- Improved FP8 support detection robustness
- Suppressed Hugging Face Hub authentication warnings

…nly torch

…al parts format

…toggle, thinking token stats

coderabbitai · 2026-06-11T10:26:11Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b6c312df-b17b-4215-9889-4b64c8585191

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR introduces an interactive multi-turn chat interface to axolotl. The implementation includes a complete REPL with slash-command support, parameter runtime adjustment, KV-cache reuse for causal models, diffusion model support, thinking block rendering, and CLI integration alongside existing inference modes.

Changes

Interactive Chat Feature

Layer / File(s)	Summary
Chat template resolution helper `src/axolotl/cli/utils/load.py`, `src/axolotl/cli/utils/__init__.py`	New `resolve_chat_template_str()` helper determines effective chat template by precedence: configured template, dataset type, or None. Re-exported for use by both chat and inference paths.
Thinking markers and parameter specs `src/axolotl/cli/chat.py` (lines 1–172)	Template-aware detection of thinking marker pairs and generation parameter specifications with bounds checking, aliasing, parsing, and nullable handling.
Session state and turn generation framework `src/axolotl/cli/chat.py` (lines 173–423)	`ChatSession` manages conversation history with message merging, undo/retry support, and JSONL export. Base `TurnGenerator` and helpers for stopping, EOS trimming, and token-level thinking block splitting.
Causal generation with KV-cache reuse `src/axolotl/cli/chat.py` (lines 425–577)	Causal-mode turn generation using token-prefix matching for cross-turn cache reuse, background streaming, Ctrl+C recovery, EOS trimming, and thinking token counting.
Diffusion generation and thinking rendering `src/axolotl/cli/chat.py` (lines 579–744)	Diffusion-mode completion denoising and `ThinkStreamRenderer` for collapsed or expanded thinking block display with marker-based detection and live region updates.
REPL command system and loop `src/axolotl/cli/chat.py` (lines 745–1118)	Slash commands with aliases, multi-line input joining, turn generation orchestration, keyboard interrupt recovery, parameter setting, system prompt management, retry/undo, history, session export, and command help/suggestions.
Chat entrypoint `src/axolotl/cli/chat.py` (lines 1120–1236)	`do_chat()` validates interactive terminal, loads model/tokenizer, resolves template/markers, selects generator, builds banner, and runs the REPL.
CLI integration `src/axolotl/cli/inference.py`, `src/axolotl/cli/main.py`	Adds `--chat` flag to `inference` command with mutual exclusivity check against `--gradio`. Routes to `do_chat()` or standard inference. Updates chat template resolution in both CLI and Gradio paths.
Supporting infrastructure `src/axolotl/logging_config.py`, `src/axolotl/cli/config.py`	New `HubUnauthenticatedNagFilter` suppresses Hugging Face Hub warnings. Extended FP8 capability probing to catch `AssertionError`.
Documentation `docs/cli.qmd`, `docs/inference.qmd`	Updated command description and added comprehensive "Interactive Chat" section with usage examples, slash-command reference, parameter adjustment, thinking/diffusion model guidance, and `--prompter` limitations.
Test suite `tests/cli/test_chat_repl.py`, `tests/cli/test_cli_inference.py`, `tests/test_logging_config_file_capture.py`	Comprehensive coverage: parameter parsing/validation, session state management, cache planning, REPL command handling, thinking rendering, EOS trimming, and CLI `--chat` integration with launcher forwarding and mutual-exclusivity validation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

ready to merge

Suggested reviewers

winglian
SalmanMohammadi

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 7.23% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat: add inference multi-turn chat interface' accurately summarizes the primary change: adding an interactive multi-turn chat feature to the inference CLI.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/inference-chat

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…sistant turns in parse_response format - Ctrl+C during a (diffusion) turn no longer crashes the REPL; the session survives and the user message is kept - exceptions in slash-command handlers no longer kill the session - consecutive user messages merge so strict templates never see two user turns after a failed generation - assistant turns are stored without special tokens, with thinking under reasoning_content (tokenizer parse_response schema when available, think-marker split otherwise); EOS markers no longer leak into the streamed display

github-actions · 2026-06-12T07:36:07Z

📖 Documentation Preview: https://6a2fcee58d1652d06fe17e55--resonant-treacle-0fd729.netlify.app

Deployed on Netlify from commit a17abca

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/axolotl/cli/chat.py`:
- Around line 981-984: When clearing or undoing the last assistant reply you
must also clear the cached hidden reasoning stored in session.last_think_text;
update cmd_new to set self.session.last_think_text = "" (or None) after
self.session.clear(), and similarly reset session.last_think_text in the handler
that removes the last assistant turn (the undo command around lines ~1047-1052,
e.g., cmd_undo or whichever function pops the last reply). Ensure any other code
paths that discard the last assistant turn also reset last_think_text so /expand
cannot reveal removed reasoning.
- Around line 217-225: tokenizer.parse_response() can return structured
content-parts but TurnGenerator.build_assistant_message leaves that unchanged,
while ChatSession.save_jsonl and ChatRepl.cmd_history assume message["content"]
is a plain string and will nest or fail; also _generate_turn sets
self.last_think_text but cmd_new and cmd_undo don’t reset it causing stale
hidden thinking for /expand. Fix by normalizing assistant messages to a single
internal schema in TurnGenerator.build_assistant_message (convert any
parse_response result to message["content"] being a plain string or a consistent
content-parts list), update ChatSession.save_jsonl and ChatRepl.cmd_history to
accept that normalized schema (extract text when content is string-like or pull
text from first content-part), and ensure cmd_new and cmd_undo clear
self.last_think_text so /expand never shows stale thinking.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0ca0369c-dcc6-4136-b9ec-b0a137689ddb

📥 Commits

Reviewing files that changed from the base of the PR and between 22bcb9a and 508a875.

📒 Files selected for processing (12)

docs/cli.qmd
docs/inference.qmd
src/axolotl/cli/chat.py
src/axolotl/cli/config.py
src/axolotl/cli/inference.py
src/axolotl/cli/main.py
src/axolotl/cli/utils/__init__.py
src/axolotl/cli/utils/load.py
src/axolotl/logging_config.py
tests/cli/test_chat_repl.py
tests/cli/test_cli_inference.py
tests/test_logging_config_file_capture.py

coderabbitai · 2026-06-12T07:38:59Z

+    def save_jsonl(self, path: str):
+        # content-parts format: text-only today, but matches the multimodal
+        # dataset format so saved sessions stay usable as training data
+        messages = []
+        for message in self.conversation():
+            out = {
+                "role": message["role"],
+                "content": [{"type": "text", "text": message.get("content") or ""}],
+            }


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Locate the target file ls -la src/axolotl/cli/chat.py # Show the relevant portions with line numbers sed -n '150,280p' src/axolotl/cli/chat.py sed -n '300,420p' src/axolotl/cli/chat.py sed -n '980,1140p' src/axolotl/cli/chat.py # Search for the functions mentioned in the original review comment rg -n "def save_jsonl|def build_assistant_message|cmd_history|cmd_expand|last_think_text|escape\(" src/axolotl/cli/chat.py # Dump the exact implementations of the cited functions/fields (when present) # Use ranges around matches for context for pat in "def save_jsonl" "def build_assistant_message" "def cmd_history" "def cmd_expand" "last_think_text"; do line=$(rg -n "$pat" src/axolotl/cli/chat.py | head -n1 | cut -d: -f1 || true) if [ -n "${line:-}" ]; then start=$((line-40)); end=$((line+80)) sed -n "${start},${end}p" src/axolotl/cli/chat.py fi done

Repository: axolotl-ai-cloud/axolotl

Length of output: 39393

🏁 Script executed:

#!/bin/bash set -euo pipefail file="src/axolotl/cli/chat.py" rg -n "last_think_text" "$file" echo "---- cmd_new/cmd_undo region ----" # show around cmd_new and cmd_undo (approx based on earlier snippet) sed -n '880,980p' "$file" echo "---- cmd_history/cmd_expand region ----" sed -n '1000,1135p' "$file" echo "---- save_jsonl region ----" sed -n '200,270p' "$file" echo "---- build_assistant_message region ----" sed -n '300,390p' "$file"

Repository: axolotl-ai-cloud/axolotl

Length of output: 16029

🏁 Script executed:

#!/bin/bash set -euo pipefail file="src/axolotl/cli/chat.py" rg -n "def cmd_new|def cmd_undo" "$file" for pat in "def cmd_new" "def cmd_undo"; do line=$(rg -n "$pat" "$file" | head -n1 | cut -d: -f1) start=$((line-25)); end=$((line+25)) echo "---- $pat around $line ----" sed -n "${start},${end}p" "$file" done

Repository: axolotl-ai-cloud/axolotl

Length of output: 4328

🌐 Web query:

Hugging Face Transformers tokenizer parse_response response_schema content shape list of content parts

💡 Result:

In the Hugging Face Transformers library, the parse_response method and its associated response_schema are used to convert raw model-generated text into a structured dictionary format [1][2]. The response_schema is a JSON schema that defines the expected structure of the output, augmented with custom fields—such as x-regex, x-regex-iterator, and x-parser—to control how string extraction and parsing are performed [1][3][4]. Regarding the specific query about a "list of content parts" in the schema, this is handled through schema definitions that utilize array types or iterative regex patterns [3][4]. When a schema field is defined with "type": "array", the tokenizer manages the extraction of multiple items into a list [4]. Specifically: 1. x-regex-iterator: This field allows you to provide a regular expression string that is applied to the input using re.finditer. The result is a list containing all matches extracted from the string [3]. 2. Array Items: When the schema type is "array", the "items" attribute defines the structure for each object in that list, allowing for complex parsing of multiple content blocks (such as tool calls or sequential message parts) [4]. 3. Parsing logic: Under the hood, these schemas map strings generated by the model to keys in a dictionary [1][3]. For example, a schema might use named regex groups to populate string fields or the "x-parser": "json" directive to parse a chunk of text as a JSON object [3][4]. To use this feature, you set the response_schema attribute on your tokenizer, which then persists the configuration so it can be saved and shared via save_pretrained or push_to_hub [1][3]. When you call tokenizer.parse_response(output_text), the tokenizer uses this schema to process the text into the structured format you defined [1][2].

Citations:

1: https://huggingface.co/docs/transformers/en/chat_response_parsing

2: https://huggingface.co/docs/transformers/en/internal/tokenization_utils

3: https://huggingface.co/docs/transformers/v5.5.4/chat_response_parsing

4: https://github.com/huggingface/transformers/blob/main/docs/source/en/chat_response_parsing.md

Normalize tokenizer.parse_response() assistant messages to a single internal schema before /save, /history, and /expand consume them.

TurnGenerator.build_assistant_message() returns tokenizer.parse_response(text) dict unchanged (around 330), but ChatSession.save_jsonl() (217-225) and ChatRepl.cmd_history() (1033-1038) assume message["content"] is string-like and wrap/escape it directly; if content is structured (content-parts), /save will nest it under "text" and /history can fail.

Hidden thinking state is also inconsistent: _generate_turn() sets self.last_think_text, but cmd_new() (981) and cmd_undo() (1047) don’t reset it, so /expand (1107-1109) can show stale hidden thinking after clear/undo.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/axolotl/cli/chat.py` around lines 217 - 225, tokenizer.parse_response() can return structured content-parts but TurnGenerator.build_assistant_message leaves that unchanged, while ChatSession.save_jsonl and ChatRepl.cmd_history assume message["content"] is a plain string and will nest or fail; also _generate_turn sets self.last_think_text but cmd_new and cmd_undo don’t reset it causing stale hidden thinking for /expand. Fix by normalizing assistant messages to a single internal schema in TurnGenerator.build_assistant_message (convert any parse_response result to message["content"] being a plain string or a consistent content-parts list), update ChatSession.save_jsonl and ChatRepl.cmd_history to accept that normalized schema (extract text when content is string-like or pull text from first content-part), and ensure cmd_new and cmd_undo clear self.last_think_text so /expand never shows stale thinking.

coderabbitai · 2026-06-12T07:38:59Z

+    def cmd_new(self, _args: str) -> None:
+        self.session.clear()
+        self.console.print("[dim]Conversation cleared.[/dim]")
+        return None


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clear cached hidden reasoning when the last reply is removed.

/new and /undo mutate the conversation, but last_think_text survives, so /expand can still reveal reasoning for a reply the user just cleared. Reset that field anywhere the last assistant turn is discarded.

Also applies to: 1047-1052

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/axolotl/cli/chat.py` around lines 981 - 984, When clearing or undoing the last assistant reply you must also clear the cached hidden reasoning stored in session.last_think_text; update cmd_new to set self.session.last_think_text = "" (or None) after self.session.clear(), and similarly reset session.last_think_text in the handler that removes the last assistant turn (the undo command around lines ~1047-1052, e.g., cmd_undo or whichever function pops the last reply). Ensure any other code paths that discard the last assistant turn also reset last_think_text so /expand cannot reveal removed reasoning.

codecov · 2026-06-12T07:46:08Z

Codecov Report

❌ Patch coverage is 71.56334% with 211 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/axolotl/cli/chat.py	71.84%	203 Missing ⚠️
src/axolotl/cli/utils/load.py	28.57%	5 Missing ⚠️
src/axolotl/cli/inference.py	66.66%	3 Missing ⚠️

📢 Thoughts on this report? Let us know!

- /new now drops the cross-turn KV cache instead of leaving it on device until the next generation - throttle live thinking-tail rerenders to the 12 Hz repaint rate (was O(n^2) splitlines over the full think text per chunk) - split think markers once per turn and reuse for counts and the stored message, dropping the redundant full decode

NanoCode012 added 7 commits June 11, 2026 16:23

feat: add interactive multi-turn chat mode (--chat) to inference CLI

7ff77e5

fix: apply_chat_template returns BatchEncoding in transformers v5

860d6b3

docs: document interactive chat mode for inference

1016533

feat: diffusion turn generation for chat mode; fix fp8 probe on CPU-o…

1c840d5

…nly torch

feat: suggest command aliases on typo; save chat sessions in multimod…

8f822a9

…al parts format

feat: collapse thinking blocks in chat with /expand, /think template …

c4f1f4c

…toggle, thinking token stats

fix: suppress unauthenticated HF Hub nag warning in logging config

f983fda

NanoCode012 changed the title ~~Feat/inference chat~~ feat: add inference multi-turn chat interface Jun 11, 2026

NanoCode012 marked this pull request as ready for review June 12, 2026 07:29

coderabbitai Bot reviewed Jun 12, 2026

View reviewed changes

NanoCode012 added 2 commits June 12, 2026 18:52

refactor(chat): share the live thinking-tail FPS as a class constant

0c33c78

NanoCode012 added the under review label Jun 12, 2026

fix: interrupt cache race condition and parse edge case

a17abca

NanoCode012 added scheduled_release This PR is slated for the upcoming release ready to merge and removed under review labels Jun 15, 2026

NanoCode012 merged commit e86163d into main Jun 16, 2026
17 of 18 checks passed

NanoCode012 deleted the feat/inference-chat branch June 16, 2026 09:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: add inference multi-turn chat interface#3723

feat: add inference multi-turn chat interface#3723
NanoCode012 merged 11 commits into
mainfrom
feat/inference-chat

NanoCode012 commented Jun 11, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 11, 2026 •

edited

Loading

Review skipped

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Uh oh!

coderabbitai Bot Jun 12, 2026

Uh oh!

codecov Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

NanoCode012 commented Jun 11, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation and Context

How has this been tested?

AI Usage Disclaimer

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

NanoCode012 commented Jun 11, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 11, 2026 •

edited

Loading

github-actions Bot commented Jun 12, 2026 •

edited

Loading

codecov Bot commented Jun 12, 2026 •

edited

Loading