[SPARK-22884][ML][TESTS] ML test for StructuredStreaming: spark.ml.clustering #20319
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

private[clustering] object Encoders {
  implicit val vectorEncoder = ExpressionEncoder[Vector]()
Is there a better solution to provide an implicit Encoder[Vector] for testTransformer?
Is it ok here, or is there a better place for it?
e.g. org.apache.spark.mllib.util.MLlibTestSparkContext.testImplicits
Thanks for asking; you shouldn't need to do this. I'll comment on BisectingKMeansSuite.scala
about using testImplicits instead. You basically just need to import testImplicits._ and use Tuple1 for the type param for testTransformer.
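A minimal sketch of what this suggestion could look like, assuming the `MLTest` trait from SPARK-22882 exposes `testImplicits` and a `testTransformer[A : Encoder](df, transformer, firstResultCol, otherResultCols*)(check: Row => Unit)` helper. The suite name and the toy data are hypothetical, and the file would have to live in Spark's own mllib test tree because `MLTest` is test-only infrastructure. The idea is that `testImplicits._` already provides encoders for tuples, so wrapping the vector in `Tuple1` avoids a hand-written `Encoder[Vector]`.

```scala
package org.apache.spark.ml.clustering

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.util.MLTest

// Hypothetical suite name, used for illustration only.
class TupleEncoderSketchSuite extends MLTest {

  // Brings the implicit encoders for tuples and case classes into scope.
  import testImplicits._

  test("transform is checked on batch and streaming input") {
    // Toy data (an assumption): two well-separated groups of points.
    val df = Seq(
      Tuple1(Vectors.dense(0.0, 0.0)),
      Tuple1(Vectors.dense(0.1, 0.1)),
      Tuple1(Vectors.dense(9.0, 9.0)),
      Tuple1(Vectors.dense(9.1, 9.1))
    ).toDF("features")

    val model = new BisectingKMeans().setK(2).setSeed(1).fit(df)

    // Tuple1[Vector] matches the single-column input schema, so no
    // custom Encoder[Vector] object is needed.
    testTransformer[Tuple1[Vector]](df, model, "features", "prediction") { row =>
      val prediction = row.getAs[Int]("prediction")
      assert(prediction >= 0 && prediction < 2)
    }
  }
}
```

With this pattern the separate `Encoders` object in the diff above becomes unnecessary.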
Jenkins, add to whitelist
Test build #86391 has finished for PR 20319 at commit
@jkbradley could you check out this change, please?
Test build #86479 has finished for PR 20319 at commit
@smurakozi Thanks for the PR! I have bandwidth to review this now. Do you have time to rebase this to fix the merge conflicts?
@smurakozi Thanks for the PR! Could you resolve the conflicts first? Then I will review it. If you're busy, I can also take it over.
Test build #89063 has finished for PR 20319 at commit
@jkbradley, @WeichenXu123 thanks for checking it out. I've resolved the conflicts, build is green.
Reviewing now!
jkbradley left a comment
Done with review; thanks!
extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest {
class BisectingKMeansSuite extends MLTest with DefaultReadWriteTest {

  import Encoders._
import testImplicits._ instead
// Verify we hit the edge case
assert(numClusters < k && numClusters > 1)

testTransformerByGlobalCheckFunc[Vector](sparseDataset.toDF(), model, "prediction") { rows =>
Use Tuple1[Vector] instead of Vector
val clusters = rows.map(_.getAs[Int](predictionColName)).toSet
assert(clusters.size === k)
assert(clusters === Set(0, 1, 2, 3, 4))
assert(model.computeCost(dataset) < 0.1)
These checks which do not use "rows" should go outside of testTransformerByGlobalCheckFunc
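To illustrate the point, here is a hedged sketch of how such a test might be arranged. It assumes `testTransformerByGlobalCheckFunc` takes a `Seq[Row] => Unit` closure that is invoked on the collected output of both the batch and the streaming run. The suite name and data are hypothetical, and the model-level assertions shown are stand-ins for checks like `computeCost`. Only assertions that read `rows` stay inside the closure; everything that depends solely on the fitted model is checked once, outside.

```scala
package org.apache.spark.ml.clustering

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.util.MLTest

// Hypothetical suite name, used for illustration only.
class GlobalCheckSketchSuite extends MLTest {

  import testImplicits._

  test("row-based checks inside the closure, model-level checks outside") {
    val k = 2
    // Toy data (an assumption): two well-separated groups of points.
    val df = Seq(
      Tuple1(Vectors.dense(0.0, 0.0)),
      Tuple1(Vectors.dense(0.1, 0.1)),
      Tuple1(Vectors.dense(9.0, 9.0)),
      Tuple1(Vectors.dense(9.1, 9.1))
    ).toDF("features")

    val model = new BisectingKMeans().setK(k).setSeed(1).fit(df)

    testTransformerByGlobalCheckFunc[Tuple1[Vector]](df, model,
      "features", "prediction") { rows =>
      // These assertions depend on the transformed rows, so they belong here.
      val clusters = rows.map(_.getAs[Int]("prediction")).toSet
      assert(clusters.forall(c => c >= 0 && c < k))
    }

    // These assertions never touch `rows`, so they are evaluated once,
    // outside the closure, instead of once per batch and streaming run.
    assert(model.clusterCenters.length <= k)
    assert(model.hasParent)
  }
}
```

Keeping model-level checks outside also means a failure there is reported directly rather than surfacing from inside the streaming harness.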
@smurakozi Do you have time to update this? I did a full review, though it now has a small merge conflict. Thanks!
I'm going to take this over to get it done, but @smurakozi, you'll be the primary author. I'll link the PR here in a minute.
Done! Here it is: #21358 @smurakozi Could you please close this issue and help review the new PR if you have time? Thanks!
## What changes were proposed in this pull request?

Converting clustering tests to also check code with structured streaming, using the ML testing infrastructure implemented in SPARK-22882.

This PR is a new version of #20319

Author: Sandor Murakozi <[email protected]>
Author: Joseph K. Bradley <[email protected]>

Closes #21358 from jkbradley/smurakozi-SPARK-22884.
Can one of the admins verify this patch?
Closes apache#17422 Closes apache#17619 Closes apache#18034 Closes apache#18229 Closes apache#18268 Closes apache#17973 Closes apache#18125 Closes apache#18918 Closes apache#19274 Closes apache#19456 Closes apache#19510 Closes apache#19420 Closes apache#20090 Closes apache#20177 Closes apache#20304 Closes apache#20319 Closes apache#20543 Closes apache#20437 Closes apache#21261 Closes apache#21726 Closes apache#14653 Closes apache#13143 Closes apache#17894 Closes apache#19758 Closes apache#12951 Closes apache#17092 Closes apache#21240 Closes apache#16910 Closes apache#12904 Closes apache#21731 Closes apache#21095 Added: Closes apache#19233 Closes apache#20100 Closes apache#21453 Closes apache#21455 Closes apache#18477 Added: Closes apache#21812 Closes apache#21787 Author: hyukjinkwon <[email protected]> Closes apache#21781 from HyukjinKwon/closing-prs.
What changes were proposed in this pull request?
Converting clustering tests to also check code with structured streaming, using the ML testing infrastructure implemented in SPARK-22882.
How was this patch tested?
N/A