Add a threshold to DropColumnIfNull #1147

Open
rcap107 opened this issue Nov 19, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@rcap107
Contributor

rcap107 commented Nov 19, 2024

Problem Description

The current version of DropColumnIfNull drops only columns in which every value is null. It would be useful to have a threshold, so that any column whose fraction of null values exceeds the threshold is dropped as well.

Feature Description

The DropColumnIfNull object should be modified so that it can take a "threshold" parameter, and then drop columns whose null fraction is above this threshold.

The threshold should default to 1.0 (i.e., a column is dropped only if all of its values are null), and the TableVectorizer should expose a parameter that the user can modify to tweak it.
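
To make the proposal concrete, here is a minimal sketch of the thresholding logic in plain pandas. It is not the skrub implementation: the function name `drop_null_columns` is hypothetical, and special-casing 1.0 as "entirely null" is an assumption, made because a strict "above the threshold" comparison could never be true at 1.0.

```python
import pandas as pd


def drop_null_columns(df: pd.DataFrame, threshold: float = 1.0) -> pd.DataFrame:
    """Hypothetical helper: drop columns based on their fraction of nulls."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0.0, 1.0]")
    # Fraction of null entries in each column.
    null_fraction = df.isna().mean()
    if threshold == 1.0:
        # Assumption: the default keeps the current behavior and only
        # drops columns that are entirely null.
        to_drop = null_fraction == 1.0
    else:
        # Otherwise drop columns whose null fraction is strictly above it.
        to_drop = null_fraction > threshold
    return df.loc[:, ~to_drop]


df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4],
        "b": [None, None, None, None],
        "c": [1, None, None, None],
    }
)
print(list(drop_null_columns(df).columns))
print(list(drop_null_columns(df, threshold=0.5).columns))
```

With the default threshold only the all-null column `b` is removed; with `threshold=0.5`, column `c` (75% null) is removed too.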

Something to note is that the user may want to have different thresholds for different columns, which would complicate things.

Alternative Solutions

I am not sure whether we should have two separate objects like DropColumnIfNull and DropNullsAboveThreshold (tentative name), or if DropColumnIfNull should stay and get updated with the threshold.

We might want to have two different objects, one that takes a single value for the threshold, and one that allows for more granularity. Though, at that point it might just be better to go with a TableVectorizer with specific transformers for each column 🤔
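
For illustration, the more granular variant could look roughly like the sketch below. The function name `drop_null_columns_per_column`, the `thresholds` mapping, and the `default` argument are all hypothetical, and the sketch assumes unique column names and the same "1.0 means entirely null" convention as a single-threshold version would need.

```python
import pandas as pd


def drop_null_columns_per_column(
    df: pd.DataFrame, thresholds: dict, default: float = 1.0
) -> pd.DataFrame:
    """Hypothetical helper: per-column null-fraction thresholds."""
    null_fraction = df.isna().mean()
    kept = []
    for col in df.columns:
        # Columns without an explicit entry fall back to the default.
        threshold = thresholds.get(col, default)
        fraction = null_fraction[col]
        if threshold == 1.0:
            drop = fraction == 1.0  # assumption: only entirely null columns
        else:
            drop = fraction > threshold
        if not drop:
            kept.append(col)
    return df[kept]


df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4],
        "b": [None, None, None, None],
        "c": [1, None, None, None],
    }
)
print(list(drop_null_columns_per_column(df, {"c": 0.5}).columns))
```

Even in this small form the extra mapping and fallback add surface area, which is the complication mentioned above.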

Additional Context

No response

@rcap107 rcap107 added the enhancement New feature or request label Nov 19, 2024
@Vincent-Maladiere
Member

Hey @rcap107! We can add this parameter to the current DropNull object and rename it slightly IMO. Then, you could have a single parameter like drop_null_frac in the TableVectorizer, which is set to 1.0 by default (and None would mean no drop mechanism). Ideally, I would rather avoid having 2 separate input parameters in the TableVectorizer for this.

I guess having a single threshold for all columns is fine for now to keep things simple. WDYT?
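
A rough sketch of what a single `drop_null_frac` parameter with these semantics might look like, written as a plain class with `fit`/`transform` methods. The class name `NullFractionDropper` is hypothetical and this is not the TableVectorizer or skrub code; it only illustrates the default of 1.0 and `None` disabling the mechanism.

```python
import pandas as pd


class NullFractionDropper:
    """Hypothetical transformer-like object with a single threshold."""

    def __init__(self, drop_null_frac=1.0):
        self.drop_null_frac = drop_null_frac

    def fit(self, df: pd.DataFrame):
        if self.drop_null_frac is None:
            # None means no drop mechanism: keep every column.
            self.columns_to_keep_ = list(df.columns)
            return self
        null_fraction = df.isna().mean()
        if self.drop_null_frac == 1.0:
            # Assumption: the default only drops entirely null columns.
            to_drop = null_fraction == 1.0
        else:
            to_drop = null_fraction > self.drop_null_frac
        # Remember the decision made on the training data.
        self.columns_to_keep_ = list(null_fraction[~to_drop].index)
        return self

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        return df[self.columns_to_keep_]


df = pd.DataFrame(
    {
        "a": [1, 2, 3, 4],
        "b": [None, None, None, None],
        "c": [1, None, None, None],
    }
)
print(list(NullFractionDropper().fit(df).transform(df).columns))
print(list(NullFractionDropper(drop_null_frac=None).fit(df).transform(df).columns))
```

Deciding which columns to keep in `fit` and reusing that list in `transform` keeps the output columns stable between training and new data.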

@rcap107
Contributor Author

rcap107 commented Nov 21, 2024

Agreed, the first version should definitely be just a single threshold that's used for all columns, and that can take a single default value.

I still think that having an object that does the filtering with different thresholds for each column would be useful, but I don't see that as being a "default" part of the TV. Something for a different PR!

@rcap107
Contributor Author

rcap107 commented Nov 21, 2024

First draft in #1149
