Require datasets>=4.7.0 for Json dtype to prevent insertion of None values by albertvillanova · Pull Request #5376 · huggingface/trl

albertvillanova · 2026-03-26T09:29:33Z

Use datasets Json dtype to avoid the insertion of None values.

Note that this applies not only to tool-calling datasets, but also to multi-modal datasets. See:
- BUGfix: Fix image_grid_thw IndexError in GRPOTrainer with Multimodal Models (Qwen3-VL) due to None Values in Chat Content #5364

This PR updates dataset handling to improve compatibility with mixed data types and simplifies preprocessing in several trainer modules. The most significant changes include upgrading the datasets library to support the Json new feature, using the on_mixed_types="use_json" option when loading datasets, and removing the remove_none_values cleaning step from all trainer classes: DPO, Reward, SFT.

Related to:

Changes

Dataset compatibility and preprocessing improvements:

Upgraded the datasets library version requirement from >=3.0.0 to >=4.7.0 in pyproject.toml to enable support for Json type fields and the on_mixed_types="use_json" option.
Updated dataset loading calls in scripts/generate_toolcall_dataset.py to use on_mixed_types="use_json", ensuring robust handling of mixed-type columns.

Trainer codebase simplification:

Removed all imports and usage of the remove_none_values utility function from DPOTrainer, RewardTrainer, and SFTTrainer, eliminating a redundant data cleaning step now handled by the updated dataset loading logic.

Note

Medium Risk
Medium risk because it bumps the minimum datasets version and changes trainer preprocessing behavior, which could affect users with older/saved datasets that still contain injected None values.

Overview
Switches TRL to rely on datasets v4.7.0+ Json dtype handling for nested/mixed columns (via on_mixed_types="use_json" in dataset generation scripts) to prevent Arrow/Parquet from inserting spurious None values.

Removes the trainers’ automatic remove_none_values transform in SFTTrainer, DPOTrainer, and RewardTrainer, and documents the new expectation + manual workaround in MIGRATION.md for users with legacy datasets that still contain None.

^{Written by Cursor Bugbot for commit 27ba622. This will update automatically on new commits. Configure here.}

HuggingFaceDocBuilderDev · 2026-03-26T09:32:12Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

qgallouedec · 2026-03-26T17:00:23Z

            {"reasoning_effort": "low", "model_identity": "You are Tiny ChatGPT, a tiny language model."},
        ]
-    })
+    }, on_mixed_types="use_json")


here it's because we mix turns with and without "thinking" 👍

qgallouedec · 2026-03-26T17:00:58Z

            json.dumps([get_weather_forecast, get_wind_conditions]),
        ]
-    })
+    }, on_mixed_types="use_json")


here I'm not sure if it's need, json.dumps converts to str

Oops, I forgot to remove json.dumps. 😅 I'm pushing a new commit for this.

qgallouedec

Looks good overall, just a few minor suggestions and comments.

I think bumping datasets to 4.7 is safe; I can’t think of any case where it would affect users.

For clarity in the release notes, could you update the title to explicitly mention this version bump?

albertvillanova added 3 commits March 26, 2026 10:05

Pin datasets>=4.7.0 to support Json type

637d17f

Don't use remove_none_values in _prepare_dataset

f1334d3

Update generate_toolcall_dataset

4076574

albertvillanova changed the title ~~Use datasets Json dtype to avoid the insertion of None values~~ Use datasets Json dtype to prevent insertion of None values Mar 26, 2026

cursor Bot reviewed Mar 26, 2026

View reviewed changes

Comment thread trl/trainer/sft_trainer.py

albertvillanova added 2 commits March 26, 2026 11:20

Update generate_harmony_dataset

47f469a

Merge remote-tracking branch 'upstream/main' into rm-remove-none-values

1e911a9

cursor Bot reviewed Mar 26, 2026

View reviewed changes

Comment thread pyproject.toml

albertvillanova commented Mar 26, 2026

View reviewed changes

Comment thread trl/trainer/dpo_trainer.py

qgallouedec reviewed Mar 26, 2026

View reviewed changes

qgallouedec approved these changes Mar 26, 2026

View reviewed changes

albertvillanova added 5 commits March 27, 2026 14:47

Merge remote-tracking branch 'upstream/main' into rm-remove-none-values

2238f2a

Remove unnecessary json.dumps from generate_toolcall_dataset

8a167c0

Add comment explaining why we keep remove_none_values

3fd35ae

Add usage example to docstring of remove_none_values

05bc73f

Add entry to migration guide

27ba622

albertvillanova changed the title ~~Use datasets Json dtype to prevent insertion of None values~~ Require datasets>=4.7.0 for Json dtype to prevent insertion of None values Mar 27, 2026

albertvillanova merged commit ac5421b into huggingface:main Mar 27, 2026
11 of 12 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Require datasets>=4.7.0 for Json dtype to prevent insertion of None values#5376

Require datasets>=4.7.0 for Json dtype to prevent insertion of None values#5376
albertvillanova merged 10 commits into
huggingface:mainfrom
albertvillanova:rm-remove-none-values

albertvillanova commented Mar 26, 2026 •

edited

Loading

Uh oh!

HuggingFaceDocBuilderDev commented Mar 26, 2026

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Uh oh!

qgallouedec Mar 26, 2026

Uh oh!

qgallouedec Mar 26, 2026

Uh oh!

albertvillanova Mar 26, 2026

Uh oh!

qgallouedec left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

albertvillanova commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Uh oh!

HuggingFaceDocBuilderDev commented Mar 26, 2026

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

qgallouedec Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

qgallouedec Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

albertvillanova Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

qgallouedec left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

albertvillanova commented Mar 26, 2026 •

edited

Loading