Conversation

mpjlu commented Oct 12, 2016

What changes were proposed in this pull request?

The feature selection method ChiSqSelector ranks features by ChiSqTestResult.statistic (the chi-square value) and selects the features with the largest values. However, the statistics returned by Statistics.chiSqTest(RDD) can have different degrees of freedom (df) for different features, and chi-square values with different df are not directly comparable, so they cannot be used to rank features.

So this patch changes SelectKBest and SelectPercentile to rank features by pValue instead of statistic.
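To illustrate why the statistic alone is misleading, here is a minimal sketch in plain Python (not Spark code and not part of this patch). The feature names and test values are hypothetical, and `chi2_sf` is a helper written for this example that computes the chi-square upper-tail probability from closed-form series for integer df. It assumes a binary label, so a binary feature has df = 1 and a six-level feature has df = 5.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df >= 1."""
    t = x / 2.0
    if df % 2 == 0:
        # Even df = 2k: Q(k, t) = exp(-t) * sum_{i=0}^{k-1} t^i / i!
        k = df // 2
        return math.exp(-t) * sum(t ** i / math.factorial(i) for i in range(k))
    # Odd df = 2k + 1:
    # Q(k + 1/2, t) = erfc(sqrt(t)) + exp(-t) * sum_{i=1}^{k} t^(i - 1/2) / Gamma(i + 1/2)
    k = (df - 1) // 2
    tail = sum(t ** (i - 0.5) / math.gamma(i + 0.5) for i in range(1, k + 1))
    return math.erfc(math.sqrt(t)) + math.exp(-t) * tail

# Hypothetical test results: (feature name, chi-square statistic, degrees of freedom).
results = [("binary_feature", 6.0, 1), ("six_level_feature", 9.0, 5)]

# Old behaviour: pick the largest statistic. New behaviour: pick the smallest p-value.
top_by_statistic = max(results, key=lambda r: r[1])[0]
top_by_p_value = min(results, key=lambda r: chi2_sf(r[1], r[2]))[0]

for name, stat, df in results:
    print(f"{name}: statistic={stat}, df={df}, pValue={chi2_sf(stat, df):.4f}")
print("top by statistic:", top_by_statistic)
print("top by pValue:", top_by_p_value)
```

In this sketch the six-level feature has the larger statistic (9.0 versus 6.0), but its p-value is roughly 0.109 while the binary feature's is roughly 0.014, so the two rankings disagree and only the p-value ranking accounts for df.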

How was this patch tested?

Updated the existing tests.

SparkQA commented Oct 12, 2016

Test build #66783 has finished for PR 15444 at commit 59ee17d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 12, 2016

Test build #66787 has finished for PR 15444 at commit b98ccdf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

srowen (Member) commented Oct 14, 2016

Merged to master

asfgit closed this in c8b612d on Oct 14, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
…nd SelectPercentile because of DoF difference

Author: Peng

Closes apache#15444 from mpjlu/chisqure-bug.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…nd SelectPercentile because of DoF difference
