
Add dataset task filter#3685

Merged
Samoed merged 5 commits into main from task_filter
Dec 8, 2025


@Samoed (Member) commented Dec 6, 2025

Ref #3672

For now, this is an MVP version of the cleaning. I've added it now because I needed to filter #3607. I'm not sure what the best way to implement it is.

  1. We can create a processing pipeline such as task.filter_data(["filter1", "filter2", ...]), but this would probably be hard to use.
  2. We can keep this as a separate file with a bunch of utility functions, maybe with some presets.
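
To make the two options concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the filter names, the `FILTERS` registry, and the use of plain lists of dicts instead of a `datasets.Dataset`. It is not the code from this PR.

```python
from typing import Callable

Row = dict[str, object]
Filter = Callable[[list[Row]], list[Row]]


# Option 2: standalone utility functions that can be called directly.
def deduplicate(rows: list[Row]) -> list[Row]:
    """Keep only the first occurrence of each text (hypothetical filter)."""
    seen: set[object] = set()
    kept: list[Row] = []
    for row in rows:
        if row["text"] not in seen:
            seen.add(row["text"])
            kept.append(row)
    return kept


def filter_empty(rows: list[Row]) -> list[Row]:
    """Drop rows whose text is empty or whitespace only (hypothetical filter)."""
    return [row for row in rows if str(row["text"]).strip()]


# Option 1: a name-based pipeline built on top of the same functions.
FILTERS: dict[str, Filter] = {
    "deduplicate": deduplicate,
    "filter_empty": filter_empty,
}


def filter_data(rows: list[Row], names: list[str]) -> list[Row]:
    """Apply the named filters in order; unknown names raise KeyError."""
    for name in names:
        rows = FILTERS[name](rows)
    return rows
```

In this sketch the pipeline is only a thin layer over the utility functions, so starting with option 2 would not rule out adding option 1 later.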

@Samoed Samoed marked this pull request as draft December 6, 2025 21:58
@KennethEnevoldsen (Contributor) left a comment

Great with a first iteration

Should this be a method on AbsTaskClassification? I could also see the argument for not doing it (keeping it hidden until we have a decent implementation across tasks).

Comment on lines 25 to 26
logger.info(
f"[deduplicate] kept={len(indices_to_keep)}, removed={len(dataset) - len(indices_to_keep)}"

Wouldn't you rather log something like "10/1000 removed" (10 duplicates removed out of 1000 documents in total)?
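
One way the suggested format could look is sketched below. `removal_summary` is a hypothetical helper introduced only for illustration; the PR itself builds the message inline in `logger.info`.

```python
import logging

logger = logging.getLogger(__name__)


def removal_summary(step: str, total: int, kept: int) -> str:
    """Format a log message as removed/total for a single filtering step."""
    removed = total - kept
    return f"[{step}] removed={removed}/{total}"


# Example: 990 of 1000 documents survive deduplication.
logger.info(removal_summary("deduplicate", total=1000, kept=990))
```

Reporting the removed count against the total keeps the message readable at a glance and works the same way for every filter.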


(would probably do this generally across

return test_dataset.select(indices)


def filter_controversial(

Rename to filter_unclear_label.

@KennethEnevoldsen (Contributor)

Not sure if this closes #3672

(I could see us merging this partial solution and expanding upon it. It is currently "private", so it shouldn't cause any issues.)

@Samoed Samoed marked this pull request as ready for review December 8, 2025 13:03
@Samoed Samoed enabled auto-merge (squash) December 8, 2025 13:14
@Samoed Samoed merged commit 3afd6f8 into main Dec 8, 2025
9 checks passed
@Samoed Samoed deleted the task_filter branch December 8, 2025 13:14