Require datasets>=4.7.0 for Json dtype to prevent insertion of None values#5376
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
| {"reasoning_effort": "low", "model_identity": "You are Tiny ChatGPT, a tiny language model."}, | ||
| ] | ||
| }) | ||
| }, on_mixed_types="use_json") |
There was a problem hiding this comment.
here it's because we mix turns with and without "thinking" 👍
| json.dumps([get_weather_forecast, get_wind_conditions]), | ||
| ] | ||
| }) | ||
| }, on_mixed_types="use_json") |
There was a problem hiding this comment.
here I'm not sure if it's need, json.dumps converts to str
There was a problem hiding this comment.
Oops, I forgot to remove json.dumps. 😅 I'm pushing a new commit for this.
qgallouedec
left a comment
There was a problem hiding this comment.
Looks good overall, just a few minor suggestions and comments.
I think bumping datasets to 4.7 is safe; I can’t think of any case where it would affect users.
For clarity in the release notes, could you update the title to explicitly mention this version bump?

Use datasets
Jsondtype to avoid the insertion of None values.IndexErrorin GRPOTrainer with Multimodal Models (Qwen3-VL) due toNoneValues in Chat Content #5364CC: @SolarWindRider, @qgallouedec
This PR updates dataset handling to improve compatibility with mixed data types and simplifies preprocessing in several trainer modules. The most significant changes include upgrading the
datasetslibrary to support theJsonnew feature, using theon_mixed_types="use_json"option when loading datasets, and removing theremove_none_valuescleaning step from all trainer classes: DPO, Reward, SFT.Related to:
Json()type for tool calling dataset format #5307Changes
Dataset compatibility and preprocessing improvements:
datasetslibrary version requirement from>=3.0.0to>=4.7.0inpyproject.tomlto enable support forJsontype fields and theon_mixed_types="use_json"option.scripts/generate_toolcall_dataset.pyto useon_mixed_types="use_json", ensuring robust handling of mixed-type columns.Trainer codebase simplification:
remove_none_valuesutility function fromDPOTrainer,RewardTrainer, andSFTTrainer, eliminating a redundant data cleaning step now handled by the updated dataset loading logic.Note
Medium Risk
Medium risk because it bumps the minimum
datasetsversion and changes trainer preprocessing behavior, which could affect users with older/saved datasets that still contain injectedNonevalues.Overview
Switches TRL to rely on
datasetsv4.7.0+ Json dtype handling for nested/mixed columns (viaon_mixed_types="use_json"in dataset generation scripts) to prevent Arrow/Parquet from inserting spuriousNonevalues.Removes the trainers’ automatic
remove_none_valuestransform inSFTTrainer,DPOTrainer, andRewardTrainer, and documents the new expectation + manual workaround inMIGRATION.mdfor users with legacy datasets that still containNone.Written by Cursor Bugbot for commit 27ba622. This will update automatically on new commits. Configure here.