Skip to content

Feat: add support for datasets with str saved messages field#3607

Merged
winglian merged 5 commits into
axolotl-ai-cloud:mainfrom
brightwind26:hybrid-think
Apr 23, 2026
Merged

Feat: add support for datasets with str saved messages field#3607
winglian merged 5 commits into
axolotl-ai-cloud:mainfrom
brightwind26:hybrid-think

Conversation

@brightwind26
Copy link
Copy Markdown
Contributor

@brightwind26 brightwind26 commented Apr 16, 2026

This PR supports datasets that saves messages field in str format. This happens for agentic datasets with large traces such as https://huggingface.co/datasets/allenai/Sera-4.6-Lite-T2

To reproduce, simply run axolotl preprocess config.yaml --debug with the config below:

base_model: Qwen/Qwen3.5-0.8B
# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name

datasets:
  - path: allenai/Sera-4.6-Lite-T2
    type: chat_template
    field_messages: messages
    split: "train[:1%]"
    train_on_eot: turn
    roles_to_train: ["assistant"]
    message_property_mappings:
      role: role
      content: content
    message_field_training: "train"
    roles:
      assistant:
        - gpt
        - model
        - assistant
      user:
        - human
        - tool
        - user

val_set_size: 0.1
output_dir: ./outputs/lora-out

adapter: lora
lora_model_dir:

sequence_len: 32768
sample_packing: true
eval_sample_packing: true


lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 2
micro_batch_size: 2
num_epochs: 1

optimizer: adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0002

bf16: auto
tf32: false

gradient_checkpointing: true
resume_from_checkpoint:
logging_steps: 1
flash_attention: true

loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3

warmup_ratio: 0.1
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.0
special_tokens:
  pad_token: "<|end_of_text|>"

# save_first_step: true  # uncomment this to validate checkpoint saving works with your config

With the fixes proposed in this PR, datasets such as Sera are properly loaded and tokenized

Summary by CodeRabbit

  • New Features
    • Expanded input format support to accept JSON-encoded strings for tool and message specifications.
    • Enhanced batch detection to recognize both string and list-formatted prompt values.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 16, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cb1d1b2a-7275-4f8a-adf6-ce164aec76b2

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Modified input handling in chat_template.py to accept JSON-encoded strings for tools and messages, and to treat prompt values as either strings or lists. These changes expand the eligibility criteria for batched prompts and add JSON decoding support for tool and message parameters.

Changes

Cohort / File(s) Summary
Input handling enhancements
src/axolotl/prompt_strategies/chat_template.py
Extended is_prompt_batched to accept string or list values; added JSON string decoding for _get_tools and _get_messages with type assertions for decoded values.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: adding support for datasets with string-saved messages fields, which is the primary objective addressed by the PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/axolotl/prompt_strategies/chat_template.py (1)

1008-1010: Add contextual error handling for top-level tools JSON decoding.

If Line 1010 fails, the raw JSONDecodeError is raised without field context. A small wrapper here makes dataset debugging much easier.

Proposed refactor
         # Some datasets have tools set to str
         if isinstance(tools, str):
-            tools = json.loads(tools)
+            try:
+                tools = json.loads(tools)
+            except json.JSONDecodeError as e:
+                raise ValueError(
+                    f"Invalid JSON in `{self.prompter.field_tools}` field."
+                ) from e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/prompt_strategies/chat_template.py` around lines 1008 - 1010,
Wrap the top-level tools JSON decoding (the json.loads call on the tools
variable) in a try/except that catches json.JSONDecodeError and re-raises a
clearer error (e.g., ValueError) that includes context about the field (the raw
tools string or dataset identifier) and the original exception as the cause;
this change should be implemented where tools is converted from str to JSON so
that failures provide dataset/field context for debugging rather than the raw
JSONDecodeError.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/axolotl/prompt_strategies/chat_template.py`:
- Around line 397-399: is_prompt_batched currently misclassifies a single-string
prompt as batched because iterating a str yields chars; fix the predicate so it
only treats prompts as batched when the relevant fields are actual lists: in
is_prompt_batched, require that prompt[self.prompter.field_messages] is an
instance of list (not str) and that all prompt.values() are lists (or more
precisely that fields that should be batched are lists), then validate each item
in prompt[self.prompter.field_messages] with all(isinstance(m, (str, list)) for
m in ...) to avoid iterating characters; update the condition around
prompt[self.prompter.field_messages] and any uses in tokenize_prompt
accordingly.
- Around line 1042-1049: The code block incorrectly uses assert (which can be
disabled) and references an undefined name type(message) before the loop;
replace the asserts with explicit runtime validation: after json.loads(messages)
(wrap in try/except to catch JSONDecodeError) validate that the result is a list
and if not raise a ValueError with a clear message (use type(messages) in that
message), then iterate over the list and validate each item is a dict, raising
ValueError that includes the index and the actual type of the offending element
(use the loop variable message for per-item checks); ensure no undefined
variables are referenced and that errors are descriptive for SFT dataset
parsing.

---

Nitpick comments:
In `@src/axolotl/prompt_strategies/chat_template.py`:
- Around line 1008-1010: Wrap the top-level tools JSON decoding (the json.loads
call on the tools variable) in a try/except that catches json.JSONDecodeError
and re-raises a clearer error (e.g., ValueError) that includes context about the
field (the raw tools string or dataset identifier) and the original exception as
the cause; this change should be implemented where tools is converted from str
to JSON so that failures provide dataset/field context for debugging rather than
the raw JSONDecodeError.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d06e34a5-5973-4378-b3aa-a0eccc51926f

📥 Commits

Reviewing files that changed from the base of the PR and between 323da79 and 6dc2c22.

📒 Files selected for processing (1)
  • src/axolotl/prompt_strategies/chat_template.py

Comment on lines +397 to 399
return all(isinstance(v, (str, list)) for v in prompt.values()) and all(
isinstance(v, (str, list)) for v in prompt[self.prompter.field_messages]
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

is_prompt_batched can misclassify single-string prompts as batched.

At Line 397 and Line 398, a single prompt with messages as str now passes both checks (because iterating a string yields str chars), so tokenize_prompt zips characters instead of examples.

Proposed fix
 def is_prompt_batched(self, prompt: dict[str, Any]) -> bool:
     try:
-            return all(isinstance(v, (str, list)) for v in prompt.values()) and all(
-                isinstance(v, (str, list)) for v in prompt[self.prompter.field_messages]
-            )
+            messages = prompt[self.prompter.field_messages]
+            if not isinstance(messages, list):
+                return False
+            return all(isinstance(v, list) for v in prompt.values()) and all(
+                isinstance(v, (str, list)) for v in messages
+            )
     except KeyError:
         return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/prompt_strategies/chat_template.py` around lines 397 - 399,
is_prompt_batched currently misclassifies a single-string prompt as batched
because iterating a str yields chars; fix the predicate so it only treats
prompts as batched when the relevant fields are actual lists: in
is_prompt_batched, require that prompt[self.prompter.field_messages] is an
instance of list (not str) and that all prompt.values() are lists (or more
precisely that fields that should be batched are lists), then validate each item
in prompt[self.prompter.field_messages] with all(isinstance(m, (str, list)) for
m in ...) to avoid iterating characters; update the condition around
prompt[self.prompter.field_messages] and any uses in tokenize_prompt
accordingly.

Comment on lines +1042 to +1049
if isinstance(messages, str):
messages = json.loads(messages)
assert isinstance(messages, list), f"For SFT datasets that are stored in `str` format, the turns must be saved in a list of dictionaries, got {type(message)}"

# Extra check here to make sure decoded json is a list of dicts.
for i, message in enumerate(messages):
assert isinstance(message, dict), f"For SFT datasets that are stored in `str` format, each turns must be saved in a dictionary, got {type(message)} for the turn {i}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verification: confirm the undefined name and assert-based runtime checks are present.
rg -n "type\(message\)|assert isinstance\(messages, list\)|assert isinstance\(message, dict\)" src/axolotl/prompt_strategies/chat_template.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 1814


Replace runtime validation with explicit error handling and fix undefined variable name.

Line 1044 references type(message) which is not defined at that scope—the variable message is only introduced inside the loop at line 1047. This causes a NameError whenever messages is not a list. Additionally, using assert for runtime input validation is unreliable since assertions can be disabled with python -O, leaving malformed inputs undetected in production.

Proposed fix
         if isinstance(messages, str):
-            messages = json.loads(messages)
-            assert isinstance(messages, list), f"For SFT datasets that are stored in `str` format, the turns must be saved in a list of dictionaries, got {type(message)}"
-
-            # Extra check here to make sure decoded json is a list of dicts.
-            for i, message in enumerate(messages):
-                assert isinstance(message, dict), f"For SFT datasets that are stored in `str` format, each turns must be saved in a dictionary, got {type(message)} for the turn {i}"
+            try:
+                messages = json.loads(messages)
+            except json.JSONDecodeError as e:
+                raise ValueError(
+                    f"Invalid JSON in `{self.prompter.field_messages}` field."
+                ) from e
+
+            if not isinstance(messages, list):
+                raise ValueError(
+                    "For SFT datasets stored as `str`, decoded `messages` must be a list[dict], "
+                    f"got {type(messages)}."
+                )
+
+            # Extra check here to make sure decoded json is a list of dicts.
+            for i, message in enumerate(messages):
+                if not isinstance(message, dict):
+                    raise ValueError(
+                        "For SFT datasets stored as `str`, each turn must be a dict; "
+                        f"got {type(message)} at turn {i}."
+                    )
🧰 Tools
🪛 Ruff (0.15.10)

[error] 1044-1044: Undefined name message

(F821)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/prompt_strategies/chat_template.py` around lines 1042 - 1049, The
code block incorrectly uses assert (which can be disabled) and references an
undefined name type(message) before the loop; replace the asserts with explicit
runtime validation: after json.loads(messages) (wrap in try/except to catch
JSONDecodeError) validate that the result is a list and if not raise a
ValueError with a clear message (use type(messages) in that message), then
iterate over the list and validate each item is a dict, raising ValueError that
includes the index and the actual type of the offending element (use the loop
variable message for per-item checks); ensure no undefined variables are
referenced and that errors are descriptive for SFT dataset parsing.

Copy link
Copy Markdown
Member

@ved1beta ved1beta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good , nice work : )


# Some datasets have tools set to str
if isinstance(tools, str):
tools = json.loads(tools)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should wrap in try except json.JSONDecodeError like above

raise ValueError("Messages is null. Please check `field_messages`.")

if isinstance(messages, str):
messages = json.loads(messages)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar may need to be wrapped in try except

@NanoCode012
Copy link
Copy Markdown
Collaborator

CI failure unrelated. Is fixed in #3545

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 18.75000% with 13 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/axolotl/prompt_strategies/chat_template.py 18.75% 13 Missing ⚠️

📢 Thoughts on this report? Let us know!

@winglian winglian merged commit bcbe049 into axolotl-ai-cloud:main Apr 23, 2026
27 of 29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants