Feat: add support for datasets with str saved messages field#3607
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughModified input handling in Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/axolotl/prompt_strategies/chat_template.py (1)
1008-1010: Add contextual error handling for top-leveltoolsJSON decoding.If Line 1010 fails, the raw
JSONDecodeErroris raised without field context. A small wrapper here makes dataset debugging much easier.Proposed refactor
# Some datasets have tools set to str if isinstance(tools, str): - tools = json.loads(tools) + try: + tools = json.loads(tools) + except json.JSONDecodeError as e: + raise ValueError( + f"Invalid JSON in `{self.prompter.field_tools}` field." + ) from e🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/axolotl/prompt_strategies/chat_template.py` around lines 1008 - 1010, Wrap the top-level tools JSON decoding (the json.loads call on the tools variable) in a try/except that catches json.JSONDecodeError and re-raises a clearer error (e.g., ValueError) that includes context about the field (the raw tools string or dataset identifier) and the original exception as the cause; this change should be implemented where tools is converted from str to JSON so that failures provide dataset/field context for debugging rather than the raw JSONDecodeError.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/axolotl/prompt_strategies/chat_template.py`:
- Around line 397-399: is_prompt_batched currently misclassifies a single-string
prompt as batched because iterating a str yields chars; fix the predicate so it
only treats prompts as batched when the relevant fields are actual lists: in
is_prompt_batched, require that prompt[self.prompter.field_messages] is an
instance of list (not str) and that all prompt.values() are lists (or more
precisely that fields that should be batched are lists), then validate each item
in prompt[self.prompter.field_messages] with all(isinstance(m, (str, list)) for
m in ...) to avoid iterating characters; update the condition around
prompt[self.prompter.field_messages] and any uses in tokenize_prompt
accordingly.
- Around line 1042-1049: The code block incorrectly uses assert (which can be
disabled) and references an undefined name type(message) before the loop;
replace the asserts with explicit runtime validation: after json.loads(messages)
(wrap in try/except to catch JSONDecodeError) validate that the result is a list
and if not raise a ValueError with a clear message (use type(messages) in that
message), then iterate over the list and validate each item is a dict, raising
ValueError that includes the index and the actual type of the offending element
(use the loop variable message for per-item checks); ensure no undefined
variables are referenced and that errors are descriptive for SFT dataset
parsing.
---
Nitpick comments:
In `@src/axolotl/prompt_strategies/chat_template.py`:
- Around line 1008-1010: Wrap the top-level tools JSON decoding (the json.loads
call on the tools variable) in a try/except that catches json.JSONDecodeError
and re-raises a clearer error (e.g., ValueError) that includes context about the
field (the raw tools string or dataset identifier) and the original exception as
the cause; this change should be implemented where tools is converted from str
to JSON so that failures provide dataset/field context for debugging rather than
the raw JSONDecodeError.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d06e34a5-5973-4378-b3aa-a0eccc51926f
📒 Files selected for processing (1)
src/axolotl/prompt_strategies/chat_template.py
| return all(isinstance(v, (str, list)) for v in prompt.values()) and all( | ||
| isinstance(v, (str, list)) for v in prompt[self.prompter.field_messages] | ||
| ) |
There was a problem hiding this comment.
is_prompt_batched can misclassify single-string prompts as batched.
At Line 397 and Line 398, a single prompt with messages as str now passes both checks (because iterating a string yields str chars), so tokenize_prompt zips characters instead of examples.
Proposed fix
def is_prompt_batched(self, prompt: dict[str, Any]) -> bool:
try:
- return all(isinstance(v, (str, list)) for v in prompt.values()) and all(
- isinstance(v, (str, list)) for v in prompt[self.prompter.field_messages]
- )
+ messages = prompt[self.prompter.field_messages]
+ if not isinstance(messages, list):
+ return False
+ return all(isinstance(v, list) for v in prompt.values()) and all(
+ isinstance(v, (str, list)) for v in messages
+ )
except KeyError:
return False🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/axolotl/prompt_strategies/chat_template.py` around lines 397 - 399,
is_prompt_batched currently misclassifies a single-string prompt as batched
because iterating a str yields chars; fix the predicate so it only treats
prompts as batched when the relevant fields are actual lists: in
is_prompt_batched, require that prompt[self.prompter.field_messages] is an
instance of list (not str) and that all prompt.values() are lists (or more
precisely that fields that should be batched are lists), then validate each item
in prompt[self.prompter.field_messages] with all(isinstance(m, (str, list)) for
m in ...) to avoid iterating characters; update the condition around
prompt[self.prompter.field_messages] and any uses in tokenize_prompt
accordingly.
| if isinstance(messages, str): | ||
| messages = json.loads(messages) | ||
| assert isinstance(messages, list), f"For SFT datasets that are stored in `str` format, the turns must be saved in a list of dictionaries, got {type(message)}" | ||
|
|
||
| # Extra check here to make sure decoded json is a list of dicts. | ||
| for i, message in enumerate(messages): | ||
| assert isinstance(message, dict), f"For SFT datasets that are stored in `str` format, each turns must be saved in a dictionary, got {type(message)} for the turn {i}" | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verification: confirm the undefined name and assert-based runtime checks are present.
rg -n "type\(message\)|assert isinstance\(messages, list\)|assert isinstance\(message, dict\)" src/axolotl/prompt_strategies/chat_template.pyRepository: axolotl-ai-cloud/axolotl
Length of output: 1814
Replace runtime validation with explicit error handling and fix undefined variable name.
Line 1044 references type(message) which is not defined at that scope—the variable message is only introduced inside the loop at line 1047. This causes a NameError whenever messages is not a list. Additionally, using assert for runtime input validation is unreliable since assertions can be disabled with python -O, leaving malformed inputs undetected in production.
Proposed fix
if isinstance(messages, str):
- messages = json.loads(messages)
- assert isinstance(messages, list), f"For SFT datasets that are stored in `str` format, the turns must be saved in a list of dictionaries, got {type(message)}"
-
- # Extra check here to make sure decoded json is a list of dicts.
- for i, message in enumerate(messages):
- assert isinstance(message, dict), f"For SFT datasets that are stored in `str` format, each turns must be saved in a dictionary, got {type(message)} for the turn {i}"
+ try:
+ messages = json.loads(messages)
+ except json.JSONDecodeError as e:
+ raise ValueError(
+ f"Invalid JSON in `{self.prompter.field_messages}` field."
+ ) from e
+
+ if not isinstance(messages, list):
+ raise ValueError(
+ "For SFT datasets stored as `str`, decoded `messages` must be a list[dict], "
+ f"got {type(messages)}."
+ )
+
+ # Extra check here to make sure decoded json is a list of dicts.
+ for i, message in enumerate(messages):
+ if not isinstance(message, dict):
+ raise ValueError(
+ "For SFT datasets stored as `str`, each turn must be a dict; "
+ f"got {type(message)} at turn {i}."
+ )🧰 Tools
🪛 Ruff (0.15.10)
[error] 1044-1044: Undefined name message
(F821)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/axolotl/prompt_strategies/chat_template.py` around lines 1042 - 1049, The
code block incorrectly uses assert (which can be disabled) and references an
undefined name type(message) before the loop; replace the asserts with explicit
runtime validation: after json.loads(messages) (wrap in try/except to catch
JSONDecodeError) validate that the result is a list and if not raise a
ValueError with a clear message (use type(messages) in that message), then
iterate over the list and validate each item is a dict, raising ValueError that
includes the index and the actual type of the offending element (use the loop
variable message for per-item checks); ensure no undefined variables are
referenced and that errors are descriptive for SFT dataset parsing.
|
|
||
| # Some datasets have tools set to str | ||
| if isinstance(tools, str): | ||
| tools = json.loads(tools) |
There was a problem hiding this comment.
We should wrap in try except json.JSONDecodeError like above
| raise ValueError("Messages is null. Please check `field_messages`.") | ||
|
|
||
| if isinstance(messages, str): | ||
| messages = json.loads(messages) |
There was a problem hiding this comment.
Similar may need to be wrapped in try except
|
CI failure unrelated. Is fixed in #3545 |
559336b to
ee77c6e
Compare
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
This PR supports datasets that saves
messagesfield instrformat. This happens for agentic datasets with large traces such as https://huggingface.co/datasets/allenai/Sera-4.6-Lite-T2To reproduce, simply run
axolotl preprocess config.yaml --debugwith the config below:With the fixes proposed in this PR, datasets such as Sera are properly loaded and tokenized
Summary by CodeRabbit