Feat: add support for datasets with `str` saved `messages` field by brightwind26 · Pull Request #3607 · axolotl-ai-cloud/axolotl

brightwind26 · 2026-04-16T10:48:14Z

This PR supports datasets that saves messages field in str format. This happens for agentic datasets with large traces such as https://huggingface.co/datasets/allenai/Sera-4.6-Lite-T2

To reproduce, simply run axolotl preprocess config.yaml --debug with the config below:

base_model: Qwen/Qwen3.5-0.8B
# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name

datasets:
  - path: allenai/Sera-4.6-Lite-T2
    type: chat_template
    field_messages: messages
    split: "train[:1%]"
    train_on_eot: turn
    roles_to_train: ["assistant"]
    message_property_mappings:
      role: role
      content: content
    message_field_training: "train"
    roles:
      assistant:
        - gpt
        - model
        - assistant
      user:
        - human
        - tool
        - user

val_set_size: 0.1
output_dir: ./outputs/lora-out

adapter: lora
lora_model_dir:

sequence_len: 32768
sample_packing: true
eval_sample_packing: true


lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 2
micro_batch_size: 2
num_epochs: 1

optimizer: adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0002

bf16: auto
tf32: false

gradient_checkpointing: true
resume_from_checkpoint:
logging_steps: 1
flash_attention: true

loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3

warmup_ratio: 0.1
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.0
special_tokens:
  pad_token: "<|end_of_text|>"

# save_first_step: true  # uncomment this to validate checkpoint saving works with your config

With the fixes proposed in this PR, datasets such as Sera are properly loaded and tokenized

Summary by CodeRabbit

New Features
- Expanded input format support to accept JSON-encoded strings for tool and message specifications.
- Enhanced batch detection to recognize both string and list-formatted prompt values.

coderabbitai · 2026-04-16T10:48:32Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cb1d1b2a-7275-4f8a-adf6-ce164aec76b2

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

Modified input handling in chat_template.py to accept JSON-encoded strings for tools and messages, and to treat prompt values as either strings or lists. These changes expand the eligibility criteria for batched prompts and add JSON decoding support for tool and message parameters.

Changes

Cohort / File(s)	Summary
Input handling enhancements `src/axolotl/prompt_strategies/chat_template.py`	Extended `is_prompt_batched` to accept string or list values; added JSON string decoding for `_get_tools` and `_get_messages` with type assertions for decoded values.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main change: adding support for datasets with string-saved messages fields, which is the primary objective addressed by the PR.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

src/axolotl/prompt_strategies/chat_template.py (1)

1008-1010: Add contextual error handling for top-level tools JSON decoding.

If Line 1010 fails, the raw JSONDecodeError is raised without field context. A small wrapper here makes dataset debugging much easier.

Proposed refactor

         # Some datasets have tools set to str
         if isinstance(tools, str):
-            tools = json.loads(tools)
+            try:
+                tools = json.loads(tools)
+            except json.JSONDecodeError as e:
+                raise ValueError(
+                    f"Invalid JSON in `{self.prompter.field_tools}` field."
+                ) from e

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/prompt_strategies/chat_template.py` around lines 1008 - 1010,
Wrap the top-level tools JSON decoding (the json.loads call on the tools
variable) in a try/except that catches json.JSONDecodeError and re-raises a
clearer error (e.g., ValueError) that includes context about the field (the raw
tools string or dataset identifier) and the original exception as the cause;
this change should be implemented where tools is converted from str to JSON so
that failures provide dataset/field context for debugging rather than the raw
JSONDecodeError.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/axolotl/prompt_strategies/chat_template.py`:
- Around line 397-399: is_prompt_batched currently misclassifies a single-string
prompt as batched because iterating a str yields chars; fix the predicate so it
only treats prompts as batched when the relevant fields are actual lists: in
is_prompt_batched, require that prompt[self.prompter.field_messages] is an
instance of list (not str) and that all prompt.values() are lists (or more
precisely that fields that should be batched are lists), then validate each item
in prompt[self.prompter.field_messages] with all(isinstance(m, (str, list)) for
m in ...) to avoid iterating characters; update the condition around
prompt[self.prompter.field_messages] and any uses in tokenize_prompt
accordingly.
- Around line 1042-1049: The code block incorrectly uses assert (which can be
disabled) and references an undefined name type(message) before the loop;
replace the asserts with explicit runtime validation: after json.loads(messages)
(wrap in try/except to catch JSONDecodeError) validate that the result is a list
and if not raise a ValueError with a clear message (use type(messages) in that
message), then iterate over the list and validate each item is a dict, raising
ValueError that includes the index and the actual type of the offending element
(use the loop variable message for per-item checks); ensure no undefined
variables are referenced and that errors are descriptive for SFT dataset
parsing.

---

Nitpick comments:
In `@src/axolotl/prompt_strategies/chat_template.py`:
- Around line 1008-1010: Wrap the top-level tools JSON decoding (the json.loads
call on the tools variable) in a try/except that catches json.JSONDecodeError
and re-raises a clearer error (e.g., ValueError) that includes context about the
field (the raw tools string or dataset identifier) and the original exception as
the cause; this change should be implemented where tools is converted from str
to JSON so that failures provide dataset/field context for debugging rather than
the raw JSONDecodeError.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d06e34a5-5973-4378-b3aa-a0eccc51926f

📥 Commits

Reviewing files that changed from the base of the PR and between 323da79 and 6dc2c22.

📒 Files selected for processing (1)

src/axolotl/prompt_strategies/chat_template.py

coderabbitai · 2026-04-16T10:55:08Z

+            return all(isinstance(v, (str, list)) for v in prompt.values()) and all(
+                isinstance(v, (str, list)) for v in prompt[self.prompter.field_messages]
            )


⚠️ Potential issue | 🟠 Major

is_prompt_batched can misclassify single-string prompts as batched.

At Line 397 and Line 398, a single prompt with messages as str now passes both checks (because iterating a string yields str chars), so tokenize_prompt zips characters instead of examples.

Proposed fix

def is_prompt_batched(self, prompt: dict[str, Any]) -> bool: try: - return all(isinstance(v, (str, list)) for v in prompt.values()) and all( - isinstance(v, (str, list)) for v in prompt[self.prompter.field_messages] - ) + messages = prompt[self.prompter.field_messages] + if not isinstance(messages, list): + return False + return all(isinstance(v, list) for v in prompt.values()) and all( + isinstance(v, (str, list)) for v in messages + ) except KeyError: return False

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/axolotl/prompt_strategies/chat_template.py` around lines 397 - 399, is_prompt_batched currently misclassifies a single-string prompt as batched because iterating a str yields chars; fix the predicate so it only treats prompts as batched when the relevant fields are actual lists: in is_prompt_batched, require that prompt[self.prompter.field_messages] is an instance of list (not str) and that all prompt.values() are lists (or more precisely that fields that should be batched are lists), then validate each item in prompt[self.prompter.field_messages] with all(isinstance(m, (str, list)) for m in ...) to avoid iterating characters; update the condition around prompt[self.prompter.field_messages] and any uses in tokenize_prompt accordingly.

coderabbitai · 2026-04-16T10:55:08Z

+        if isinstance(messages, str):
+            messages = json.loads(messages)
+            assert isinstance(messages, list), f"For SFT datasets that are stored in `str` format, the turns must be saved in a list of dictionaries, got {type(message)}"
+
+            # Extra check here to make sure decoded json is a list of dicts.
+            for i, message in enumerate(messages):
+                assert isinstance(message, dict), f"For SFT datasets that are stored in `str` format, each turns must be saved in a dictionary, got {type(message)} for the turn {i}"
+


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verification: confirm the undefined name and assert-based runtime checks are present. rg -n "type\(message\)|assert isinstance\(messages, list\)|assert isinstance\(message, dict\)" src/axolotl/prompt_strategies/chat_template.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 1814

Replace runtime validation with explicit error handling and fix undefined variable name.

Line 1044 references type(message) which is not defined at that scope—the variable message is only introduced inside the loop at line 1047. This causes a NameError whenever messages is not a list. Additionally, using assert for runtime input validation is unreliable since assertions can be disabled with python -O, leaving malformed inputs undetected in production.

Proposed fix

if isinstance(messages, str): - messages = json.loads(messages) - assert isinstance(messages, list), f"For SFT datasets that are stored in `str` format, the turns must be saved in a list of dictionaries, got {type(message)}" - - # Extra check here to make sure decoded json is a list of dicts. - for i, message in enumerate(messages): - assert isinstance(message, dict), f"For SFT datasets that are stored in `str` format, each turns must be saved in a dictionary, got {type(message)} for the turn {i}" + try: + messages = json.loads(messages) + except json.JSONDecodeError as e: + raise ValueError( + f"Invalid JSON in `{self.prompter.field_messages}` field." + ) from e + + if not isinstance(messages, list): + raise ValueError( + "For SFT datasets stored as `str`, decoded `messages` must be a list[dict], " + f"got {type(messages)}." + ) + + # Extra check here to make sure decoded json is a list of dicts. + for i, message in enumerate(messages): + if not isinstance(message, dict): + raise ValueError( + "For SFT datasets stored as `str`, each turn must be a dict; " + f"got {type(message)} at turn {i}." + )

🧰 Tools

🪛 Ruff (0.15.10)

[error] 1044-1044: Undefined name message

(F821)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/axolotl/prompt_strategies/chat_template.py` around lines 1042 - 1049, The code block incorrectly uses assert (which can be disabled) and references an undefined name type(message) before the loop; replace the asserts with explicit runtime validation: after json.loads(messages) (wrap in try/except to catch JSONDecodeError) validate that the result is a list and if not raise a ValueError with a clear message (use type(messages) in that message), then iterate over the list and validate each item is a dict, raising ValueError that includes the index and the actual type of the offending element (use the loop variable message for per-item checks); ensure no undefined variables are referenced and that errors are descriptive for SFT dataset parsing.

ved1beta

looks good , nice work : )

NanoCode012 · 2026-04-17T04:28:12Z


+        # Some datasets have tools set to str
+        if isinstance(tools, str):
+            tools = json.loads(tools)


We should wrap in try except json.JSONDecodeError like above

NanoCode012 · 2026-04-17T04:28:28Z

            raise ValueError("Messages is null. Please check `field_messages`.")

+        if isinstance(messages, str):
+            messages = json.loads(messages)


Similar may need to be wrapped in try except

NanoCode012 · 2026-04-20T06:56:51Z

CI failure unrelated. Is fixed in #3545

codecov · 2026-04-22T07:21:31Z

Codecov Report

❌ Patch coverage is 18.75000% with 13 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/axolotl/prompt_strategies/chat_template.py	18.75%	13 Missing ⚠️

📢 Thoughts on this report? Let us know!

coderabbitai Bot reviewed Apr 16, 2026

View reviewed changes

ved1beta approved these changes Apr 16, 2026

View reviewed changes

NanoCode012 reviewed Apr 17, 2026

View reviewed changes

brightwind26 requested review from NanoCode012 and ved1beta April 17, 2026 06:27

NanoCode012 added ready to merge and removed ready to merge labels Apr 21, 2026

brightwind26 added 5 commits April 22, 2026 14:11

feat: support datasets saved in str format

312c0eb

add also str for tools

023e42f

format

dd44e2f

fix: address comments + add unit test

98a56ba

format

ee77c6e

NanoCode012 force-pushed the hybrid-think branch from 559336b to ee77c6e Compare April 22, 2026 07:11

winglian merged commit bcbe049 into axolotl-ai-cloud:main Apr 23, 2026
27 of 29 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Feat: add support for datasets with `str` saved `messages` field#3607

Feat: add support for datasets with `str` saved `messages` field#3607
winglian merged 5 commits into
axolotl-ai-cloud:mainfrom
brightwind26:hybrid-think

brightwind26 commented Apr 16, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 16, 2026 •

edited

Loading

Review skipped

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 16, 2026

Uh oh!

coderabbitai Bot Apr 16, 2026

Uh oh!

ved1beta left a comment

Uh oh!

NanoCode012 Apr 17, 2026

Uh oh!

NanoCode012 Apr 17, 2026

Uh oh!

NanoCode012 commented Apr 20, 2026

Uh oh!

codecov Bot commented Apr 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

brightwind26 commented Apr 16, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

ved1beta left a comment

Choose a reason for hiding this comment

Uh oh!

NanoCode012 Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

NanoCode012 Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

NanoCode012 commented Apr 20, 2026

Uh oh!

codecov Bot commented Apr 22, 2026

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

brightwind26 commented Apr 16, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 16, 2026 •

edited

Loading