Skip to content

Conversation

@jkbradley
Copy link
Member

What changes were proposed in this pull request?

  • Renamed kbest to numTopFeatures
  • Renamed alpha to fpr
  • Added missing Since annotations
  • Doc cleanups

How was this patch tested?

Added new standardized unit tests for spark.ml.
Improved existing unit test coverage a bit.

* Renamed kbest to numTopFeatures
* Renamed alpha to fpr
* Added missing Since annotations
* Doc cleanups
* Added missing standard unit tests
@jkbradley
Copy link
Member Author

CC: @mpjlu @srowen @yanboliang

@SparkQA
Copy link

SparkQA commented Oct 26, 2016

Test build #67590 has finished for PR 15647 at commit 10f8b26.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 26, 2016

Test build #67593 has finished for PR 15647 at commit 1ead234.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 26, 2016

Test build #67596 has finished for PR 15647 at commit 77f05ef.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member Author

test this please

@SparkQA
Copy link

SparkQA commented Oct 26, 2016

Test build #67599 has finished for PR 15647 at commit 77f05ef.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 26, 2016

Test build #67597 has finished for PR 15647 at commit 77f05ef.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member Author

I'll fix these later tonight

@SparkQA
Copy link

SparkQA commented Oct 27, 2016

Test build #67626 has finished for PR 15647 at commit 8b3ed65.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
* `FPR` chooses all features whose false positive rate meets some threshold.
* `numTopFeatures` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use k here since KBest is changed to numTopFeatures?

By default, the selection method is `KBest`, the default number of top features is 50. User can use
`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.
By default, the selection method is `numTopFeatures`, with the default number of top features set to 50. User can use
`setNumTopFeatures`, `setPercentile` and `setFpr` to set different selection methods.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setNumTopFeatures, setPercentile, setFpr just set parameter, setSelectorType is used to set selection methods.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we use setSelectorType to switch between different method. Actually the document is incorrect before this PR.

* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
* `FPR` chooses all features whose false positive rate meets some threshold.
* `numTopFeatures` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

By default, the selection method is `KBest`, the default number of top features is 50. User can use
`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.
By default, the selection method is `numTopFeatures`, with the default number of top features set to 50. User can use
`setNumTopFeatures`, `setPercentile`, `fpr` to set different selection methods.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

*/
@Since("1.6.0")
final val numTopFeatures = new IntParam(this, "numTopFeatures",
"Number of features that selector will select, ordered by statistics value descending. If the" +
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ordered by pValue ascending. I missed this change in my PR.

*/
@Since("2.1.0")
final val percentile = new DoubleParam(this, "percentile",
"Percentile of features that selector will select, ordered by statistics value descending.",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ordered by pValue ascending.

* `fpr` chooses all features whose false positive rate meets some threshold.
* By default, the selection method is `kbest`, the default number of top features is 50.
* The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`.
* - `numTopFeatures` chooses the `k` top features according to a chi-squared test.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

* `fpr` chooses all features whose false positive rate meets some threshold.
* By default, the selection method is `kbest`, the default number of top features is 50.
* The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`.
* - `numTopFeatures` chooses the `k` top features according to a chi-squared test.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

}

test("ChiSqSelector by FPR transform test (sparse & dense vector)") {
test("ChiSqSelector by percentile transform test (sparse & dense vector)") {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be FPR

"all features.", typeConverter=TypeConverters.toInt)

percentile = Param(Params._dummy(), "percentile", "Percentile of features that selector " +
"will select, ordered by statistics value descending.",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ordered by pValue ascending

By default, the selection method is `KBest`, the default number of top features is 50. User can use
`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.
By default, the selection method is `numTopFeatures`, with the default number of top features set to 50. User can use
`setNumTopFeatures`, `setPercentile` and `setFpr` to set different selection methods.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we use setSelectorType to switch between different method. Actually the document is incorrect before this PR.

with HasFeaturesCol with HasOutputCol with HasLabelCol {

/**
* Number of features that selector will select (ordered by statistic value descending). If the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ordered by pValue ascending

@jkbradley
Copy link
Member Author

Updated---thanks!

@SparkQA
Copy link

SparkQA commented Oct 27, 2016

Test build #67672 has finished for PR 15647 at commit 7d3c74c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member Author

Any more comments, or shall I merge this?

@jkbradley
Copy link
Member Author

test this please

@SparkQA
Copy link

SparkQA commented Nov 1, 2016

Test build #67923 has finished for PR 15647 at commit 7d3c74c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member Author

I'll go ahead and merge this, but please comment if it needs any follow-ups.

Merging with master

Thanks for the review!

@asfgit asfgit closed this in 91c33a0 Nov 2, 2016
@jkbradley jkbradley deleted the chisqselector-follow-ups branch November 2, 2016 02:18
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?
- Renamed kbest to numTopFeatures
- Renamed alpha to fpr
- Added missing Since annotations
- Doc cleanups
## How was this patch tested?

Added new standardized unit tests for spark.ml.
Improved existing unit test coverage a bit.

Author: Joseph K. Bradley <[email protected]>

Closes apache#15647 from jkbradley/chisqselector-follow-ups.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants