[SPARK-16843][MLLIB] add the percentage ChiSquareSelector feature #14449
|
Can one of the admins verify this patch? |
|
I'm not sure it's worth a whole other API method for this. If you want to select 10% of features, you can trivially ask for 0.1 * numFeatures features from the selector. |
|
Hi @srowen, thanks for your comment. |
|
Hi @srowen, I also plan to submit some PRs for feature selection methods based on univariate statistical tests, like those in scikit-learn: SelectFpr (false positive rate), SelectFdr (false discovery rate), and SelectFwe (family-wise error). |
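For context on the three scikit-learn selectors named above, here is a rough pure-Python sketch of the selection rules they are generally understood to apply to per-feature p-values: a plain threshold for false positive rate, a Bonferroni-corrected threshold for family-wise error, and the Benjamini-Hochberg procedure for false discovery rate. The function names and the `alpha` default are hypothetical, chosen for illustration; this is not Spark or scikit-learn API and not code from this PR.

```python
def select_fpr(p_values, alpha=0.05):
    """Keep features whose p-value is below alpha (false positive rate)."""
    return [i for i, p in enumerate(p_values) if p < alpha]


def select_fwe(p_values, alpha=0.05):
    """Keep features below the Bonferroni threshold alpha / n (family-wise error)."""
    threshold = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p < threshold]


def select_fdr(p_values, alpha=0.05):
    """Keep features passing Benjamini-Hochberg (false discovery rate)."""
    n = len(p_values)
    ranked = sorted(p_values)
    # Find every sorted p-value p(k) that satisfies p(k) <= alpha * k / n.
    passing = [p for k, p in enumerate(ranked, start=1) if p <= alpha * k / n]
    if not passing:
        return []
    # Everything at or below the largest passing p-value is selected.
    cutoff = max(passing)
    return [i for i, p in enumerate(p_values) if p <= cutoff]


# Illustrative p-values only, not results from any real test.
p_values = [0.001, 0.02, 0.045, 0.3, 0.012]
fpr_selected = select_fpr(p_values)
fwe_selected = select_fwe(p_values)
fdr_selected = select_fdr(p_values)
```

On the same p-values the three rules get progressively stricter from FPR to FDR to FWE, which is why they would be exposed as separate selection modes rather than one.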
|
Percentage is a useful addition to ChiSquareSelector. It is a common and intuitive param for data scientists and statisticians, and scikit-learn offers it, but it may indeed not be worth a whole other API in MLlib. @srowen I suppose it could be implemented by adding a new Param in ML.ChiSqSelector? |
|
If anything it could be a Param in the |
|
As for other feature selection methods, feel free to create a JIRA to discuss. Some work has been done outside of Spark in packages, e.g. https://github.com/sramirez/spark-infotheoretic-feature-selection. Generally I think this is a good place for that kind of work to start; I don't think it necessarily must be in Spark itself. If usage and performance are high, it can always be considered for inclusion later on. |
|
I'd like to close this in favor of the changes in #14597 because I think it would actually lead towards making this functionality trivial to expose from the model class. |
Closes apache#10995 Closes apache#13658 Closes apache#14505 Closes apache#14536 Closes apache#12753 Closes apache#14449 Closes apache#12694 Closes apache#12695 Closes apache#14810
What changes were proposed in this pull request?
Currently, ChiSquareSelector has only the numTopFeatures Param. In practice, it is often more convenient to specify the number of features to keep as a percentage.
This PR adds a percentage Param to ChiSquareSelector.
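As a minimal sketch of what a percentage parameter amounts to, the pure-Python snippet below converts a fraction into a feature count and keeps the highest-scoring features. It is not the code from this PR and does not use Spark. The function names, the choice of a fraction in [0, 1], and truncation toward zero are all assumptions made for illustration.

```python
def num_top_features_from_percentage(percentage, num_features):
    """Convert a fraction in [0, 1] into a feature count.

    Assumption: the count is truncated toward zero; the PR may round differently.
    """
    if not 0.0 <= percentage <= 1.0:
        raise ValueError("percentage must be in [0, 1]")
    return int(percentage * num_features)


def select_top_percent(scores, percentage):
    """Return the sorted indices of the top `percentage` of features by score."""
    k = num_top_features_from_percentage(percentage, len(scores))
    # Rank feature indices from highest to lowest test statistic.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical chi-square statistics for four features.
selected = select_top_percent([0.5, 9.1, 3.3, 7.2], 0.5)
```

This is the same computation as asking the existing selector for `0.1 * numFeatures` features, which is why the thread debates whether it needs its own API method or only a Param.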
How was this patch tested?
Added Scala unit tests.