You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The current version of DropColumnIfNull automatically drops any columns where all values are null. However, it would be good to have a threshold so that all columns that have a fraction of null values larger than a given threshold are dropped instead.
Feature Description
The DropColumnIfNull object should be modified so that it can take a "threshold" parameter, and then drop columns whose null fraction is above this threshold.
The value of the threshold should be 1.0 by default (i.e., all columns must be null to drop), and there should be a parameter for it in the TableVectorizer that the user can modify to tweak this.
Something to note is that the user may want to have different thresholds for different columns, which would complicate things.
Alternative Solutions
I am not sure whether we should have two separate objects like DropColumnIfNull and DropNullsAboveThreshold (tentative name), or if DropColumnIfNull should stay and get updated with the threshold.
We might want to have two different objects, one that takes a single value for the threshold, and one that allows for more granularity. Though, at that point it might just be better going with a TableVectorizer with specific transformers for each column 🤔
Additional Context
No response
The text was updated successfully, but these errors were encountered:
Hey @rcap107! We can add this parameter to the current DropNull object and rename it slightly IMO. Then, you could have a single parameter like drop_null_frac in the TableVectorizer, which is set to 1.0 by default (and None would mean no drop mechanism). Ideally, I would rather avoid having 2 separate input parameters in the TableVectorizer for this.
I guess having a single threshold for all columns is fine for now to keep things simple. WDYT?
Agreed, the first version should definitely be just a single threshold that's used for all columns, and that can take a single default value.
I still think that having an object that does the filtering with different thresholds for each column would be useful, but I don't see that as being a "default" part of the TV. Something for a different PR!
Problem Description
The current version of
DropColumnIfNull
automatically drops any columns where all values are null. However, it would be good to have a threshold so that all columns that have a fraction of null values larger than a given threshold are dropped instead.Feature Description
The
DropColumnIfNull
object should be modified so that it can take a "threshold" parameter, and then drop columns whose null fraction is above this threshold.The value of the threshold should be 1.0 by default (i.e., all columns must be null to drop), and there should be a parameter for it in the TableVectorizer that the user can modify to tweak this.
Something to note is that the user may want to have different thresholds for different columns, which would complicate things.
Alternative Solutions
I am not sure whether we should have two separate objects like
DropColumnIfNull
andDropNullsAboveThreshold
(tentative name), or ifDropColumnIfNull
should stay and get updated with the threshold.We might want to have two different objects, one that takes a single value for the threshold, and one that allows for more granularity. Though, at that point it might just be better going with a TableVectorizer with specific transformers for each column 🤔
Additional Context
No response
The text was updated successfully, but these errors were encountered: