[SPARK-20003] [ML] FPGrowthModel setMinConfidence should affect rules generation and transform#17336
[SPARK-20003] [ML] FPGrowthModel setMinConfidence should affect rules generation and transform#17336hhbyyh wants to merge 7 commits intoapache:masterfrom
Conversation
|
ping @jkbradley and @srowen to be aware of the issue. also cc @MLnick who's working on python API. |
| ).first().getAs[Seq[String]]("prediction") | ||
|
|
||
| assert(prediction === Seq("3")) | ||
| } |
There was a problem hiding this comment.
Didn't change this one, just move it to keep parameter and save/load check at the bottom.
|
Test build #74752 has finished for PR 17336 at commit
|
|
Test build #74995 has finished for PR 17336 at commit
|
|
ping @jkbradley as this is something we should fix before release. |
|
Thanks for this PR! Do you think it's worth adding the caching logic? I'm now wondering if we should change associationRules into a method which recomputes the DataFrame every time it is called. That will make it easier for the user to understand the semantics, and it would be easy for a user to define a val to hold onto the computed DataFrame if needed. Would you mind updating this accordingly? Thanks! |
|
The major thing I'm concerned is that |
|
Test build #75448 has finished for PR 17336 at commit
|
jkbradley
left a comment
There was a problem hiding this comment.
Thanks for mentioning about transform(). I'd been thinking about the user calling associationRules directly. But you're right about it being better to cache for transform(). Sorry to ask this, but would you mind reverting back to the cached version? Thanks a lot.
| model2.freqItemsets.sort("items").collect()) | ||
| assert(model.freqItemsets.collect().toSet.equals( | ||
| model2.freqItemsets.collect().toSet)) | ||
| assert(model.associationRules.collect().toSet.equals( |
There was a problem hiding this comment.
No need to add these 2 since they are values computed from the model data. Checking freqItemsets is sufficient.
There was a problem hiding this comment.
Thanks for comment. I added the check since we added some internal cache fields, I'd like to ensure it does not interfere with the model loading. Let me know it is still redundant.
| FPGrowthSuite.allParamSettings, checkModelData) | ||
| } | ||
|
|
||
| test("FPGrowth prediction should not contain duplicates") { |
There was a problem hiding this comment.
For the future, I'd prefer not to move stuff around unless it's necessary since it makes the diff larger. No need to revert this, though, since I already checked it.
|
Test build #75517 has finished for PR 17336 at commit
|
|
Thanks a lot for the second update! This LGTM |
What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-20003
I was doing some test and found the issue. ml.fpm.FPGrowthModel
setMinConfidenceshould always affect rules generation and transform.Currently associationRules in FPGrowthModel is a lazy val and
setMinConfidencein FPGrowthModel has no impact once associationRules got computed .I try to cache the associationRules to avoid re-computation if
minConfidenceis not changed, but this makes FPGrowthModel somehow stateful. Let me know if there's any concern.How was this patch tested?
new unit test and I strength the unit test for model save/load to ensure the cache mechanism.