Skip to content

Require datasets>=4.7.0 for Json dtype to prevent insertion of None values#5376

Merged
albertvillanova merged 10 commits into
huggingface:mainfrom
albertvillanova:rm-remove-none-values
Mar 27, 2026
Merged

Require datasets>=4.7.0 for Json dtype to prevent insertion of None values#5376
albertvillanova merged 10 commits into
huggingface:mainfrom
albertvillanova:rm-remove-none-values

Conversation

@albertvillanova

@albertvillanova albertvillanova commented Mar 26, 2026

Copy link
Copy Markdown
Member

Use datasets Json dtype to avoid the insertion of None values.

CC: @SolarWindRider, @qgallouedec

This PR updates dataset handling to improve compatibility with mixed data types and simplifies preprocessing in several trainer modules. The most significant changes include upgrading the datasets library to support the Json new feature, using the on_mixed_types="use_json" option when loading datasets, and removing the remove_none_values cleaning step from all trainer classes: DPO, Reward, SFT.

Related to:

Changes

Dataset compatibility and preprocessing improvements:

  • Upgraded the datasets library version requirement from >=3.0.0 to >=4.7.0 in pyproject.toml to enable support for Json type fields and the on_mixed_types="use_json" option.
  • Updated dataset loading calls in scripts/generate_toolcall_dataset.py to use on_mixed_types="use_json", ensuring robust handling of mixed-type columns.

Trainer codebase simplification:

  • Removed all imports and usage of the remove_none_values utility function from DPOTrainer, RewardTrainer, and SFTTrainer, eliminating a redundant data cleaning step now handled by the updated dataset loading logic.

Note

Medium Risk
Medium risk because it bumps the minimum datasets version and changes trainer preprocessing behavior, which could affect users with older/saved datasets that still contain injected None values.

Overview
Switches TRL to rely on datasets v4.7.0+ Json dtype handling for nested/mixed columns (via on_mixed_types="use_json" in dataset generation scripts) to prevent Arrow/Parquet from inserting spurious None values.

Removes the trainers’ automatic remove_none_values transform in SFTTrainer, DPOTrainer, and RewardTrainer, and documents the new expectation + manual workaround in MIGRATION.md for users with legacy datasets that still contain None.

Written by Cursor Bugbot for commit 27ba622. This will update automatically on new commits. Configure here.

@albertvillanova albertvillanova changed the title Use datasets Json dtype to avoid the insertion of None values Use datasets Json dtype to prevent insertion of None values Mar 26, 2026
@HuggingFaceDocBuilderDev

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment thread trl/trainer/sft_trainer.py

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Comment thread pyproject.toml
Comment thread trl/trainer/dpo_trainer.py
{"reasoning_effort": "low", "model_identity": "You are Tiny ChatGPT, a tiny language model."},
]
})
}, on_mixed_types="use_json")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here it's because we mix turns with and without "thinking" 👍

json.dumps([get_weather_forecast, get_wind_conditions]),
]
})
}, on_mixed_types="use_json")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here I'm not sure if it's need, json.dumps converts to str

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I forgot to remove json.dumps. 😅 I'm pushing a new commit for this.

@qgallouedec qgallouedec left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, just a few minor suggestions and comments.

I think bumping datasets to 4.7 is safe; I can’t think of any case where it would affect users.

For clarity in the release notes, could you update the title to explicitly mention this version bump?

@albertvillanova albertvillanova changed the title Use datasets Json dtype to prevent insertion of None values Require datasets>=4.7.0 for Json dtype to prevent insertion of None values Mar 27, 2026
@albertvillanova albertvillanova merged commit ac5421b into huggingface:main Mar 27, 2026
11 of 12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants