@smurching
Contributor

@smurching smurching commented Nov 15, 2017

What changes were proposed in this pull request?

Breaks up #19433 to help unblock #19666; after this PR is merged, #19666 can be merged.

This PR contains the changes made to migrate functionality from RandomForest.scala into the following utility classes:

  • AggUpdateUtils
  • ImpurityUtils
  • SplitUtils

The PR also adds tests for split selection logic in TreeSplitUtilsSuite.

A follow-up PR will include the other changes from #19433:

  • Local decision tree data structures & tests
  • Local tree training logic & tests

How was this patch tested?

Adds unit tests for split selection logic in TreeSplitUtilsSuite

@SparkQA

SparkQA commented Nov 15, 2017

Test build #83910 has finished for PR 19758 at commit b93f9f3.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 15, 2017

Test build #83911 has finished for PR 19758 at commit b4a5f3b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 15, 2017

Test build #83914 has finished for PR 19758 at commit b6291e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@WeichenXu123 WeichenXu123 left a comment

Good work. I mainly reviewed the newly added test suite.

// label: 2 --> values: 2
// Expected split: feature value 1 on the left, values (0, 2) on the right
val values = Array(1, 1, 0, 2, 2)
val featureArity = values.max + 1

Contributor

To make the test stricter, can you increase featureArity, numExamples, and numClasses? For example, featureArity = 6, numExamples = 10, and numClasses = 5.

Contributor Author

@WeichenXu123 thanks for the feedback! Definitely agree that the test is a little weak right now.

IMO it's mainly weak due to the low feature arity (there are only three possible splits, so the right one could be picked by chance). I think increasing the number of classes/examples substantially might make the test harder to reason about, but I'm not opposed to that either. Let me know what you think.

What about something like:

    val values = Array(0, 1, 2, 3, 2, 2, 4)
    val labels = Array(0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 2.0)
    // label: 0 --> values: 0, 1
    // label: 1 --> values: 2, 3
    // label: 2 --> values: 2, 2, 4
    // Expected split: feature values (0, 1) on the left, values (2, 3, 4) on the right

This way we still test multiclass classification & test the split-selection logic more rigorously.
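
The expected split for the proposed data can be sanity-checked by brute force. The sketch below is not the code under review: it assumes Gini impurity as the split criterion and simply enumerates every binary partition of the categories. The names `CategoricalSplitSketch`, `gini`, `weightedImpurity`, `candidateSplits`, and `bestSplit` are hypothetical and exist only for this illustration.

```scala
object CategoricalSplitSketch {
  /** Gini impurity of a label set: 1 - sum over classes of p_c^2. */
  def gini(labels: Array[Double]): Double =
    if (labels.isEmpty) 0.0
    else {
      val n = labels.length.toDouble
      1.0 - labels.groupBy(identity).values.map { g =>
        val p = g.length / n
        p * p
      }.sum
    }

  /** Size-weighted Gini impurity of the two children when the categories
    * in `leftCategories` go left and all others go right. */
  def weightedImpurity(
      values: Array[Int],
      labels: Array[Double],
      leftCategories: Set[Int]): Double = {
    val (left, right) =
      values.zip(labels).partition { case (v, _) => leftCategories.contains(v) }
    val n = values.length.toDouble
    (left.length / n) * gini(left.map(_._2)) +
      (right.length / n) * gini(right.map(_._2))
  }

  /** All distinct binary partitions of categories 0 until featureArity.
    * Category 0 is pinned to the left side so mirrored duplicates are skipped,
    * which leaves 2^(featureArity - 1) - 1 candidates. */
  def candidateSplits(featureArity: Int): List[Set[Int]] = {
    val rest: Set[Int] = (1 until featureArity).toSet
    rest.subsets().map(s => s + 0).filter(_.size < featureArity).toList
  }

  /** Left-side categories and weighted impurity of the lowest-impurity split.
    * Assumes at least two categories, i.e. values.max >= 1. */
  def bestSplit(values: Array[Int], labels: Array[Double]): (Set[Int], Double) = {
    require(values.length == labels.length, "values and labels must align")
    val featureArity = values.max + 1
    candidateSplits(featureArity)
      .map(left => (left, weightedImpurity(values, labels, left)))
      .reduceLeft((a, b) => if (b._2 < a._2) b else a)
  }

  def main(args: Array[String]): Unit = {
    val values = Array(0, 1, 2, 3, 2, 2, 4)
    val labels = Array(0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 2.0)
    val (left, impurity) = bestSplit(values, labels)
    println(s"left categories: ${left.toSeq.sorted.mkString(", ")}")
    println(s"weighted impurity: $impurity")
  }
}
```

Under these assumptions the search picks categories (0, 1) for the left side out of 15 candidates, compared with only 3 candidates at arity 3, which is the reason the higher-arity data makes a chance pass less likely.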

Contributor

I think it is OK. Thanks!

@SparkQA

SparkQA commented Dec 1, 2017

Test build #84379 has finished for PR 19758 at commit 5bcccda.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

ping?
I'm mostly interested in SPARK-3162

@wenbochang

Any updates? This PR seems to address a critical issue: https://issues.apache.org/jira/browse/SPARK-3162

@asfgit asfgit closed this in 1a4fda8 Jul 19, 2018
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
Closes apache#17422
Closes apache#17619
Closes apache#18034
Closes apache#18229
Closes apache#18268
Closes apache#17973
Closes apache#18125
Closes apache#18918
Closes apache#19274
Closes apache#19456
Closes apache#19510
Closes apache#19420
Closes apache#20090
Closes apache#20177
Closes apache#20304
Closes apache#20319
Closes apache#20543
Closes apache#20437
Closes apache#21261
Closes apache#21726
Closes apache#14653
Closes apache#13143
Closes apache#17894
Closes apache#19758
Closes apache#12951
Closes apache#17092
Closes apache#21240
Closes apache#16910
Closes apache#12904
Closes apache#21731
Closes apache#21095

Added:
Closes apache#19233
Closes apache#20100
Closes apache#21453
Closes apache#21455
Closes apache#18477

Added:
Closes apache#21812
Closes apache#21787

Author: hyukjinkwon

Closes apache#21781 from HyukjinKwon/closing-prs.