Skip to content

feat: add inference multi-turn chat interface#3723

Merged
NanoCode012 merged 11 commits into
mainfrom
feat/inference-chat
Jun 16, 2026
Merged

feat: add inference multi-turn chat interface#3723
NanoCode012 merged 11 commits into
mainfrom
feat/inference-chat

Conversation

@NanoCode012

@NanoCode012 NanoCode012 commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Description

How to try it out, take a simple config:

base_model: Qwen/Qwen3-0.6B
chat_template: tokenizer_default
datasets:
  - path: mhenrichsen/alpaca_2k_test
    type: chat_template
output_dir: /tmp/qwen3-chat-out
sequence_len: 4096
micro_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1e-4

Run:

axolotl inference /tmp/qwen3-chat.yaml --chat
image

Misc changes:

  • Will suppress the HF hub token missing log

Motivation and Context

How has this been tested?

AI Usage Disclaimer

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

Release Notes

  • New Features

    • Added interactive multi-turn chat mode (axolotl inference --chat) with streaming token output and runtime-adjustable generation parameters
    • Support for thinking models with streaming reasoning blocks and display collapse options
    • Support for diffusion models in chat mode
    • Slash commands for conversation management: new chat, undo, retry, view history, adjust parameters, and save conversations
  • Documentation

    • Added comprehensive interactive chat usage guide with session controls and parameter tuning
    • Added dedicated sections for thinking and diffusion model chat support
  • Bug Fixes

    • Improved FP8 support detection robustness
    • Suppressed Hugging Face Hub authentication warnings

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b6c312df-b17b-4215-9889-4b64c8585191

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR introduces an interactive multi-turn chat interface to axolotl. The implementation includes a complete REPL with slash-command support, parameter runtime adjustment, KV-cache reuse for causal models, diffusion model support, thinking block rendering, and CLI integration alongside existing inference modes.

Changes

Interactive Chat Feature

Layer / File(s) Summary
Chat template resolution helper
src/axolotl/cli/utils/load.py, src/axolotl/cli/utils/__init__.py
New resolve_chat_template_str() helper determines effective chat template by precedence: configured template, dataset type, or None. Re-exported for use by both chat and inference paths.
Thinking markers and parameter specs
src/axolotl/cli/chat.py (lines 1–172)
Template-aware detection of thinking marker pairs and generation parameter specifications with bounds checking, aliasing, parsing, and nullable handling.
Session state and turn generation framework
src/axolotl/cli/chat.py (lines 173–423)
ChatSession manages conversation history with message merging, undo/retry support, and JSONL export. Base TurnGenerator and helpers for stopping, EOS trimming, and token-level thinking block splitting.
Causal generation with KV-cache reuse
src/axolotl/cli/chat.py (lines 425–577)
Causal-mode turn generation using token-prefix matching for cross-turn cache reuse, background streaming, Ctrl+C recovery, EOS trimming, and thinking token counting.
Diffusion generation and thinking rendering
src/axolotl/cli/chat.py (lines 579–744)
Diffusion-mode completion denoising and ThinkStreamRenderer for collapsed or expanded thinking block display with marker-based detection and live region updates.
REPL command system and loop
src/axolotl/cli/chat.py (lines 745–1118)
Slash commands with aliases, multi-line input joining, turn generation orchestration, keyboard interrupt recovery, parameter setting, system prompt management, retry/undo, history, session export, and command help/suggestions.
Chat entrypoint
src/axolotl/cli/chat.py (lines 1120–1236)
do_chat() validates interactive terminal, loads model/tokenizer, resolves template/markers, selects generator, builds banner, and runs the REPL.
CLI integration
src/axolotl/cli/inference.py, src/axolotl/cli/main.py
Adds --chat flag to inference command with mutual exclusivity check against --gradio. Routes to do_chat() or standard inference. Updates chat template resolution in both CLI and Gradio paths.
Supporting infrastructure
src/axolotl/logging_config.py, src/axolotl/cli/config.py
New HubUnauthenticatedNagFilter suppresses Hugging Face Hub warnings. Extended FP8 capability probing to catch AssertionError.
Documentation
docs/cli.qmd, docs/inference.qmd
Updated command description and added comprehensive "Interactive Chat" section with usage examples, slash-command reference, parameter adjustment, thinking/diffusion model guidance, and --prompter limitations.
Test suite
tests/cli/test_chat_repl.py, tests/cli/test_cli_inference.py, tests/test_logging_config_file_capture.py
Comprehensive coverage: parameter parsing/validation, session state management, cache planning, REPL command handling, thinking rendering, EOS trimming, and CLI --chat integration with launcher forwarding and mutual-exclusivity validation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Suggested labels

ready to merge


Suggested reviewers

  • winglian
  • SalmanMohammadi
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 7.23% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: add inference multi-turn chat interface' accurately summarizes the primary change: adding an interactive multi-turn chat feature to the inference CLI.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/inference-chat

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@NanoCode012 NanoCode012 changed the title Feat/inference chat feat: add inference multi-turn chat interface Jun 11, 2026
…sistant turns in parse_response format

- Ctrl+C during a (diffusion) turn no longer crashes the REPL; the session
  survives and the user message is kept
- exceptions in slash-command handlers no longer kill the session
- consecutive user messages merge so strict templates never see two user
  turns after a failed generation
- assistant turns are stored without special tokens, with thinking under
  reasoning_content (tokenizer parse_response schema when available,
  think-marker split otherwise); EOS markers no longer leak into the
  streamed display
@NanoCode012 NanoCode012 marked this pull request as ready for review June 12, 2026 07:29
@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

📖 Documentation Preview: https://6a2fcee58d1652d06fe17e55--resonant-treacle-0fd729.netlify.app

Deployed on Netlify from commit a17abca

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/axolotl/cli/chat.py`:
- Around line 981-984: When clearing or undoing the last assistant reply you
must also clear the cached hidden reasoning stored in session.last_think_text;
update cmd_new to set self.session.last_think_text = "" (or None) after
self.session.clear(), and similarly reset session.last_think_text in the handler
that removes the last assistant turn (the undo command around lines ~1047-1052,
e.g., cmd_undo or whichever function pops the last reply). Ensure any other code
paths that discard the last assistant turn also reset last_think_text so /expand
cannot reveal removed reasoning.
- Around line 217-225: tokenizer.parse_response() can return structured
content-parts but TurnGenerator.build_assistant_message leaves that unchanged,
while ChatSession.save_jsonl and ChatRepl.cmd_history assume message["content"]
is a plain string and will nest or fail; also _generate_turn sets
self.last_think_text but cmd_new and cmd_undo don’t reset it causing stale
hidden thinking for /expand. Fix by normalizing assistant messages to a single
internal schema in TurnGenerator.build_assistant_message (convert any
parse_response result to message["content"] being a plain string or a consistent
content-parts list), update ChatSession.save_jsonl and ChatRepl.cmd_history to
accept that normalized schema (extract text when content is string-like or pull
text from first content-part), and ensure cmd_new and cmd_undo clear
self.last_think_text so /expand never shows stale thinking.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0ca0369c-dcc6-4136-b9ec-b0a137689ddb

📥 Commits

Reviewing files that changed from the base of the PR and between 22bcb9a and 508a875.

📒 Files selected for processing (12)
  • docs/cli.qmd
  • docs/inference.qmd
  • src/axolotl/cli/chat.py
  • src/axolotl/cli/config.py
  • src/axolotl/cli/inference.py
  • src/axolotl/cli/main.py
  • src/axolotl/cli/utils/__init__.py
  • src/axolotl/cli/utils/load.py
  • src/axolotl/logging_config.py
  • tests/cli/test_chat_repl.py
  • tests/cli/test_cli_inference.py
  • tests/test_logging_config_file_capture.py

Comment thread src/axolotl/cli/chat.py
Comment on lines +217 to +225
def save_jsonl(self, path: str):
# content-parts format: text-only today, but matches the multimodal
# dataset format so saved sessions stay usable as training data
messages = []
for message in self.conversation():
out = {
"role": message["role"],
"content": [{"type": "text", "text": message.get("content") or ""}],
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate the target file
ls -la src/axolotl/cli/chat.py

# Show the relevant portions with line numbers
sed -n '150,280p' src/axolotl/cli/chat.py
sed -n '300,420p' src/axolotl/cli/chat.py
sed -n '980,1140p' src/axolotl/cli/chat.py

# Search for the functions mentioned in the original review comment
rg -n "def save_jsonl|def build_assistant_message|cmd_history|cmd_expand|last_think_text|escape\(" src/axolotl/cli/chat.py

# Dump the exact implementations of the cited functions/fields (when present)
# Use ranges around matches for context
for pat in "def save_jsonl" "def build_assistant_message" "def cmd_history" "def cmd_expand" "last_think_text"; do
  line=$(rg -n "$pat" src/axolotl/cli/chat.py | head -n1 | cut -d: -f1 || true)
  if [ -n "${line:-}" ]; then
    start=$((line-40)); end=$((line+80))
    sed -n "${start},${end}p" src/axolotl/cli/chat.py
  fi
done

Repository: axolotl-ai-cloud/axolotl

Length of output: 39393


🏁 Script executed:

#!/bin/bash
set -euo pipefail
file="src/axolotl/cli/chat.py"

rg -n "last_think_text" "$file"
echo "---- cmd_new/cmd_undo region ----"
# show around cmd_new and cmd_undo (approx based on earlier snippet)
sed -n '880,980p' "$file"
echo "---- cmd_history/cmd_expand region ----"
sed -n '1000,1135p' "$file"
echo "---- save_jsonl region ----"
sed -n '200,270p' "$file"
echo "---- build_assistant_message region ----"
sed -n '300,390p' "$file"

Repository: axolotl-ai-cloud/axolotl

Length of output: 16029


🏁 Script executed:

#!/bin/bash
set -euo pipefail
file="src/axolotl/cli/chat.py"
rg -n "def cmd_new|def cmd_undo" "$file"
for pat in "def cmd_new" "def cmd_undo"; do
  line=$(rg -n "$pat" "$file" | head -n1 | cut -d: -f1)
  start=$((line-25)); end=$((line+25))
  echo "---- $pat around $line ----"
  sed -n "${start},${end}p" "$file"
done

Repository: axolotl-ai-cloud/axolotl

Length of output: 4328


🌐 Web query:

Hugging Face Transformers tokenizer parse_response response_schema content shape list of content parts

💡 Result:

In the Hugging Face Transformers library, the parse_response method and its associated response_schema are used to convert raw model-generated text into a structured dictionary format [1][2]. The response_schema is a JSON schema that defines the expected structure of the output, augmented with custom fields—such as x-regex, x-regex-iterator, and x-parser—to control how string extraction and parsing are performed [1][3][4]. Regarding the specific query about a "list of content parts" in the schema, this is handled through schema definitions that utilize array types or iterative regex patterns [3][4]. When a schema field is defined with "type": "array", the tokenizer manages the extraction of multiple items into a list [4]. Specifically: 1. x-regex-iterator: This field allows you to provide a regular expression string that is applied to the input using re.finditer. The result is a list containing all matches extracted from the string [3]. 2. Array Items: When the schema type is "array", the "items" attribute defines the structure for each object in that list, allowing for complex parsing of multiple content blocks (such as tool calls or sequential message parts) [4]. 3. Parsing logic: Under the hood, these schemas map strings generated by the model to keys in a dictionary [1][3]. For example, a schema might use named regex groups to populate string fields or the "x-parser": "json" directive to parse a chunk of text as a JSON object [3][4]. To use this feature, you set the response_schema attribute on your tokenizer, which then persists the configuration so it can be saved and shared via save_pretrained or push_to_hub [1][3]. When you call tokenizer.parse_response(output_text), the tokenizer uses this schema to process the text into the structured format you defined [1][2].

Citations:


Normalize tokenizer.parse_response() assistant messages to a single internal schema before /save, /history, and /expand consume them.

  • TurnGenerator.build_assistant_message() returns tokenizer.parse_response(text) dict unchanged (around 330), but ChatSession.save_jsonl() (217-225) and ChatRepl.cmd_history() (1033-1038) assume message["content"] is string-like and wrap/escape it directly; if content is structured (content-parts), /save will nest it under "text" and /history can fail.
  • Hidden thinking state is also inconsistent: _generate_turn() sets self.last_think_text, but cmd_new() (981) and cmd_undo() (1047) don’t reset it, so /expand (1107-1109) can show stale hidden thinking after clear/undo.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/axolotl/cli/chat.py` around lines 217 - 225, tokenizer.parse_response()
can return structured content-parts but TurnGenerator.build_assistant_message
leaves that unchanged, while ChatSession.save_jsonl and ChatRepl.cmd_history
assume message["content"] is a plain string and will nest or fail; also
_generate_turn sets self.last_think_text but cmd_new and cmd_undo don’t reset it
causing stale hidden thinking for /expand. Fix by normalizing assistant messages
to a single internal schema in TurnGenerator.build_assistant_message (convert
any parse_response result to message["content"] being a plain string or a
consistent content-parts list), update ChatSession.save_jsonl and
ChatRepl.cmd_history to accept that normalized schema (extract text when content
is string-like or pull text from first content-part), and ensure cmd_new and
cmd_undo clear self.last_think_text so /expand never shows stale thinking.

Comment thread src/axolotl/cli/chat.py
Comment on lines +981 to +984
def cmd_new(self, _args: str) -> None:
self.session.clear()
self.console.print("[dim]Conversation cleared.[/dim]")
return None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clear cached hidden reasoning when the last reply is removed.

/new and /undo mutate the conversation, but last_think_text survives, so /expand can still reveal reasoning for a reply the user just cleared. Reset that field anywhere the last assistant turn is discarded.

Also applies to: 1047-1052

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/axolotl/cli/chat.py` around lines 981 - 984, When clearing or undoing the
last assistant reply you must also clear the cached hidden reasoning stored in
session.last_think_text; update cmd_new to set self.session.last_think_text = ""
(or None) after self.session.clear(), and similarly reset
session.last_think_text in the handler that removes the last assistant turn (the
undo command around lines ~1047-1052, e.g., cmd_undo or whichever function pops
the last reply). Ensure any other code paths that discard the last assistant
turn also reset last_think_text so /expand cannot reveal removed reasoning.

@codecov

codecov Bot commented Jun 12, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 71.56334% with 211 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/axolotl/cli/chat.py 71.84% 203 Missing ⚠️
src/axolotl/cli/utils/load.py 28.57% 5 Missing ⚠️
src/axolotl/cli/inference.py 66.66% 3 Missing ⚠️

📢 Thoughts on this report? Let us know!

- /new now drops the cross-turn KV cache instead of leaving it on
  device until the next generation
- throttle live thinking-tail rerenders to the 12 Hz repaint rate
  (was O(n^2) splitlines over the full think text per chunk)
- split think markers once per turn and reuse for counts and the
  stored message, dropping the redundant full decode
@NanoCode012 NanoCode012 added scheduled_release This PR is slated for the upcoming release ready to merge and removed under review labels Jun 15, 2026
@NanoCode012 NanoCode012 merged commit e86163d into main Jun 16, 2026
17 of 18 checks passed
@NanoCode012 NanoCode012 deleted the feat/inference-chat branch June 16, 2026 09:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready to merge scheduled_release This PR is slated for the upcoming release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant