Skip to content

Conversation

@yanboliang
Copy link
Contributor

@yanboliang yanboliang commented Jun 14, 2016

What changes were proposed in this pull request?

RFormula will index label only when it is string type currently. If the label is numeric type and we use RFormula to present a classification model, there is no label attributes in label column metadata. The label attributes are useful when making prediction for classification, so we can force to index label by StringIndexer whether it is numeric or string type for classification. Then SparkR wrappers can extract label attributes from label column metadata successfully. This feature can help us to fix bug similar with SPARK-15153.
For regression, we will still to keep label as numeric type.
In this PR, we add a param indexLabel to control whether to force to index label for RFormula.

How was this patch tested?

Unit tests.

@SparkQA
Copy link

SparkQA commented Jun 15, 2016

Test build #60538 has finished for PR 13675 at commit 24ef9ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 15, 2016

Test build #60598 has finished for PR 13675 at commit a7015ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor Author

cc @mengxr @jkbradley

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

up this to 2.1.0 now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to "forceIndexLabel" since we can still index String labels when indexLabel=false

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, done.

@SparkQA
Copy link

SparkQA commented Oct 8, 2016

Test build #3301 has finished for PR 13675 at commit a7015ef.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 10, 2016

Test build #66661 has finished for PR 13675 at commit 3a8660f.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 10, 2016

Test build #66662 has finished for PR 13675 at commit dfea620.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Usually we index label only when it is string type.
* If the formula was used by classification algorithms,
* we can force to index label even it is numeric type by setting this param with true.
* Default: false.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@group param

@jkbradley
Copy link
Member

I just noticed that last item, but otherwise, this looks ready to me. Thanks!

@SparkQA
Copy link

SparkQA commented Oct 11, 2016

Test build #66693 has finished for PR 13675 at commit 3079996.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Copy link
Member

Does this affect R code - could we add some R tests for this?

@yanboliang
Copy link
Contributor Author

@felixcheung This PR does not affect R code, I will send another PR to fix issues like SPARK-15153 which need to add some R tests.

@yanboliang
Copy link
Contributor Author

I'll merge this into master, thanks for review! @jkbradley @felixcheung

@asfgit asfgit closed this in 19401a2 Oct 11, 2016
@yanboliang yanboliang deleted the spark-15957 branch October 11, 2016 05:54
ghost pushed a commit to dbtsai/spark that referenced this pull request Oct 14, 2016
…ceIndexLabel.

## What changes were proposed in this pull request?
Follow-up work of apache#13675, add Python API for ```RFormula forceIndexLabel```.

## How was this patch tested?
Unit test.

Author: Yanbo Liang <[email protected]>

Closes apache#15430 from yanboliang/spark-15957-python.
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
…ceIndexLabel.

## What changes were proposed in this pull request?
Follow-up work of apache#13675, add Python API for ```RFormula forceIndexLabel```.

## How was this patch tested?
Unit test.

Author: Yanbo Liang <[email protected]>

Closes apache#15430 from yanboliang/spark-15957-python.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?
```RFormula``` will index label only when it is string type currently. If the label is numeric type and we use ```RFormula``` to present a classification model, there is no label attributes in label column metadata. The label attributes are useful when making prediction for classification, so we can force to index label by ```StringIndexer``` whether it is numeric or string type for classification. Then SparkR wrappers can extract label attributes from label column metadata successfully. This feature can help us to fix bug similar with [SPARK-15153](https://issues.apache.org/jira/browse/SPARK-15153).
For regression, we will still to keep label as numeric type.
In this PR, we add a param ```indexLabel``` to control whether to force to index label for ```RFormula```.

## How was this patch tested?
Unit tests.

Author: Yanbo Liang <[email protected]>

Closes apache#13675 from yanboliang/spark-15957.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…ceIndexLabel.

## What changes were proposed in this pull request?
Follow-up work of apache#13675, add Python API for ```RFormula forceIndexLabel```.

## How was this patch tested?
Unit test.

Author: Yanbo Liang <[email protected]>

Closes apache#15430 from yanboliang/spark-15957-python.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants