
tests: Add tests for dataset quality #3603

Merged
KennethEnevoldsen merged 3 commits into main from add-test-for-dataset-quality
Nov 22, 2025


@KennethEnevoldsen KennethEnevoldsen commented Nov 22, 2025

This test enforces a minimum level of task quality for future submissions.

Currently, it tests for:

  • duplicate texts
  • train-test leakage
  • documents that are too short

I suspect we can add many more checks to this in future PRs.
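For readers skimming the thread, the three checks listed above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function name `find_quality_issues`, the `splits` dictionary layout, and the `min_chars` threshold are all assumptions made for the example.

```python
from collections import Counter


def find_quality_issues(splits, min_chars=10):
    """Return a list of human-readable quality problems.

    `splits` is a hypothetical mapping of split name -> list of texts;
    `min_chars` is an assumed threshold, not a value taken from the PR.
    """
    errors = []
    for name, texts in splits.items():
        # Duplicate texts: count every extra occurrence of the same string.
        counts = Counter(texts)
        n_dupes = sum(c - 1 for c in counts.values() if c > 1)
        if n_dupes:
            errors.append(f"{name}: {n_dupes} duplicate text(s)")
        # Too-short documents: compare stripped length against the threshold.
        n_short = sum(1 for t in texts if len(t.strip()) < min_chars)
        if n_short:
            errors.append(f"{name}: {n_short} text(s) shorter than {min_chars} characters")
    # Train-test leakage: texts that appear verbatim in both splits.
    if "train" in splits and "test" in splits:
        leaked = set(splits["train"]) & set(splits["test"])
        if leaked:
            errors.append(f"train/test leakage: {len(leaked)} shared text(s)")
    return errors


example = {
    "train": ["a long enough sentence", "a long enough sentence", "hi"],
    "test": ["a long enough sentence", "another fine document"],
}
for problem in find_quality_issues(example):
    print(problem)
```

Returning a list of messages rather than raising on the first problem means a single test run can report every issue in a dataset at once.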

I would argue that this fixes #375 (by preventing duplicates in future datasets).

fixes #375
fixes #3370


errors = []
if desc_stats is None:
    return []
Member


Should this be a failed test?

Contributor Author


There is a different test that checks whether desc stats is filled.
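The division of labour described in this reply could look roughly like the sketch below: the quality check returns no errors when statistics are absent, and a separate check is responsible for flagging the absence. Both function names and the `num_duplicates` key are hypothetical and are not taken from the merged code.

```python
def quality_errors(desc_stats):
    """Hypothetical quality check: stays silent when statistics are missing."""
    errors = []
    if desc_stats is None:
        # Missing statistics are reported by the separate check below.
        return []
    # "num_duplicates" is an assumed key used only for illustration.
    if desc_stats.get("num_duplicates", 0) > 0:
        errors.append("dataset contains duplicate texts")
    return errors


def missing_stats_errors(desc_stats):
    """Hypothetical separate check: fails when statistics are not filled."""
    if desc_stats is None:
        return ["descriptive statistics are missing"]
    return []
```

Under this split, a task without statistics fails exactly once, in the check dedicated to that condition, rather than in every check that depends on the statistics.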

Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
@KennethEnevoldsen KennethEnevoldsen enabled auto-merge (squash) November 22, 2025 16:39
@KennethEnevoldsen KennethEnevoldsen merged commit 9b898ea into main Nov 22, 2025
9 checks passed
@KennethEnevoldsen KennethEnevoldsen deleted the add-test-for-dataset-quality branch November 22, 2025 16:48


Development

Successfully merging this pull request may close these issues.

  • Make test for duplicated texts
  • Tracking down dataset duplication

2 participants