[SPARK-14503][ML] spark.ml API for FPGrowth #15415
hhbyyh wants to merge 34 commits into apache:master from
Conversation
Test build #66630 has finished for PR 15415 at commit

Test build #67995 has finished for PR 15415 at commit

@yanboliang @jkbradley This should be ready for review. Thanks.

Test build #68337 has finished for PR 15415 at commit

Test build #70283 has finished for PR 15415 at commit
 * Param for minimum confidence, range [0.0, 1.0].
 * @group param
 */
final val minConfidence: DoubleParam = new DoubleParam(this, "minConfidence", "min confidence")
There should be a ParamValidators.inRange(...) here.
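For concreteness, a minimal sketch of the param with the validator the reviewer is asking for, assuming the [0.0, 1.0] range from the doc comment above (an illustration, not the merged code):

final val minConfidence: DoubleParam = new DoubleParam(this, "minConfidence",
  "minimal confidence for generating association rules, range [0.0, 1.0]",
  ParamValidators.inRange(0.0, 1.0))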
/**
 * Computes the association rules with confidence above [[minConfidence]].
 * @param freqItemsets DataFrame containing frequent itemsets obtained from algorithms like
 *                     [[FPGrowth]]. Users can set itemsCol (frequent itemset, Array[String])
Array[String] conflicts with Array[Int] in https://github.com/apache/spark/pull/15415/files#diff-0a641720038f962d333ef38402a02207R41
And is there some way to support general types?
val freqItemSetRdd = freqItemsets.select($(itemsCol), $(freqCol)).rdd
  .map(row => new FreqItemset(row.getSeq[String](0).toArray, row.getLong(1)))

val sqlContext = SparkSession.builder().getOrCreate()
Since val sqlContext is of type SparkSession, what about renaming it to spark?
 */
@Since("2.1.0")
@Experimental
class AssociationRules(override val uid: String) extends Params {
Since AssociationRules transforms the freqItemsets DataFrame into a rules DataFrame, could it be a subclass of Transformer?

freqItemsets and rules do not have a one-to-one mapping, which would probably violate the semantics of Transformer.
 *
 */
@Since("2.1.0")
def run(freqItemsets: Dataset[_]): DataFrame = {
If inheriting Transformer, this should be override def transform(dataset: Dataset[_]): DataFrame.
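For reference, a compilable sketch of what the Transformer-shaped class would look like (hypothetical; per the reply above, the lack of a row-by-row mapping is why this shape was rejected):

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

class AssociationRules(override val uid: String) extends Transformer {
  override def transform(dataset: Dataset[_]): DataFrame = {
    // Rules are aggregated from all freqItemsets rows, not mapped row-by-row,
    // so the output is unrelated in shape to the input.
    ???
  }
  override def transformSchema(schema: StructType): StructType = ???
  override def copy(extra: ParamMap): AssociationRules = defaultCopy(extra)
}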
 * Generates association rules from frequent itemsets ("items", "freq"). This method only generates
 * association rules which have a single item as the consequent.
 */
@Since("2.1.0")
 */
@Since("2.2.0")
val minSupport: DoubleParam = new DoubleParam(this, "minSupport",
  "the minimal support level of the frequent pattern (Default: 0.3)")
This also needs a ParamValidators.inRange(...), as with minConfidence above.
/** @group getParam */
@Since("2.2.0")
def getMinSupport: Double = $(minSupport)
MLlib's FPGrowth has a numPartitions param; will it be included here?
@jkbradley Sent an update to refine the transform code and address the comments. Regarding the behavior-change concern: a different partition strategy will only affect overall efficiency and perhaps the order of the frequent itemsets and association rules. As long as the result set does not change, I don't think it will disturb users. Let me know if I missed anything.
jkbradley
left a comment
Thanks for the updates!

Regarding the behavior-change concern: I think you're right. I'd never read through the algorithm carefully, but now that I have, I agree.

Regarding refining the transform code: I'd prefer we hold off on this. If we're changing the transform code (rather than wrapping the old code), then we'll need to spend more time writing new tests and QAing it. We don't really have time to do that before the 2.2 release. Also, when we get around to optimizing it, I'd prefer to do everything using DataFrame operations if possible. Could you please save your code but revert the transform() code to wrap the old AssociationRules code? Thanks!
test("FPGrowth fit and transform with different data types") {
  Array(IntegerType, StringType, ShortType, LongType, ByteType).foreach { dt =>
    val intData = dataset.withColumn("features", col("features").cast(ArrayType(dt)))
assert(checkDF.count() == 3 && checkDF.filter(col("freq") === col("freqExp")).count() == 3)
}

test("FPGrowth getFreqItems with Null") {
In FPGrowth, document that null values are treated as empty sequences.

Btw, I could imagine us wanting to change this later. If we're recommending items a user could add to their basket, then we might want to suggest the most frequent item rather than nothing.
Hi @jkbradley
Do you mean to make

Sorry, please ignore that comment of mine; I was thinking that something like FPGrowthModel.transform had already been implemented, but it's new. Btw, I'm adding more comments now, so please hold off on changes for an hour. Thanks!
jkbradley
left a comment
OK, done with another pass. I'll think more about genericTransform too. Thanks!
 * @return a DataFrame("antecedent", "consequent", "confidence") containing the association
 *         rules.
 */
def getAssociationRulesFromFP[T: ClassTag](dataset: Dataset[_],
 * :: Experimental ::
 * A parallel FP-growth algorithm to mine frequent itemsets.
 *
 * @see [[http://dx.doi.org/10.1145/1454008.1454027 Li et al., PFP: Parallel FP-Growth for Query
Could you please go ahead and copy the relevant text and links from the Scaladoc string for mllib.fpm.FPGrowth?
 */
@Since("2.2.0")
val minSupport: DoubleParam = new DoubleParam(this, "minSupport",
  "the minimal support level of the frequent pattern (Default: 0.3)",
"of the frequent pattern" -> "of a frequent pattern"
 */
@Since("2.2.0")
val minConfidence: DoubleParam = new DoubleParam(this, "minConfidence",
  "minimal confidence for generating Association Rule (Default: 0.8)",
Don't state the default value in the built-in Param doc.
def setNumPartitions(value: Int): this.type = set(numPartitions, value)

/** @group setParam
 * Note that minConfidence has no effect during fitting.
Remove this line; it's already in the minConfidence doc
// Save metadata and Params
DefaultParamsWriter.saveMetadata(instance, path, sc)
val dataPath = new Path(path, "data").toString
instance.freqItemsets.write.save(dataPath)
override def load(path: String): FPGrowthModel = {
  val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
  val dataPath = new Path(path, "data").toString
  val frequentItems = sparkSession.read.load(dataPath)
 *         rules.
 */
def getAssociationRulesFromFP[T: ClassTag](dataset: Dataset[_],
    itemsCol: String = "items",
No need to specify default argument values. If we make this public, we will need to avoid default args to keep it Java-friendly.
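A sketch of the Java-friendly alternative: explicit overloads instead of Scala default arguments. The freqCol and minConfidence parameters here are assumptions based on the surrounding diff, not the actual signature:

// Hypothetical overload pair; Java callers cannot use Scala default arguments.
def getAssociationRulesFromFP[T: ClassTag](dataset: Dataset[_]): DataFrame =
  getAssociationRulesFromFP[T](dataset, "items", "freq", 0.8)

def getAssociationRulesFromFP[T: ClassTag](
    dataset: Dataset[_],
    itemsCol: String,
    freqCol: String,
    minConfidence: Double): DataFrame = ???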
 * :: Experimental ::
 * Model fitted by FPGrowth.
 *
 * @param freqItemsets frequent items in the format of DataFrame("items", "freq")
private def genericFit[T: ClassTag](dataset: Dataset[_]): FPGrowthModel = {
  val data = dataset.select($(featuresCol))
  val items = data.where(col($(featuresCol)).isNotNull).rdd.map(r => r.getSeq[T](0).toArray)
  val parentModel = new MLlibFPGrowth().setMinSupport($(minSupport)).run(items)
Test build #73354 has finished for PR 15415 at commit
jkbradley
left a comment
Thanks for the updates! I think this is almost done. We should try to optimize transform in the future, but I think it's OK for an MVP. It'd be interesting to write it using DataFrames instead of RDDs and benchmark (in the future).
    Seq.empty
  }
  (id, consequents)
}.aggregateByKey(new ArrayBuffer[T])((ar, seq) => ar ++= seq, (ar, seq) => ar ++= seq)
How about using an OpenHashSet here to avoid collecting duplicates during aggregation?

I tried OpenHashSet and it's about 15% slower than ArrayBuffer:

.aggregateByKey(new OpenHashSet[T])((set, seq) => {
  seq.foreach(t => set.add(t))
  set
}, (set1, set2) => set1.union(set2))

Huh, that's surprising to me. Maybe it depends on how many duplicates are introduced. Let's leave it as is then.
Just curious: what dataset did you test it on?

Random numbers... the number of duplicates should be small.
.map { case (index, cons) => (index, cons.distinct) }

val rowAndConsequents = dataset.toDF().rdd.zipWithIndex().map(_.swap)
  .join(indexToConsequents).sortByKey(ascending = true, dataset.rdd.getNumPartitions)
Shall we try to keep the original order of the input dataset? The time cost is about 5% of the total transform time.

DataFrames don't really have a row ordering, so we don't need to maintain one.

If that's the case, does zipWithIndex() guarantee the same result for multiple calls?
And if we can ignore the ordering, perhaps I have a faster way to write the transform.

I think zipWithIndex guarantees the order if the data have not been shuffled between load/creation and the zipWithIndex. AFAIK, shuffling can reorder rows in a partition arbitrarily.
We can definitely ignore the ordering for DataFrames.

Checked again, and the current implementation is just as quick. I will just remove the sort.
 * independent group of mining tasks. The FP-Growth algorithm is described in
 * <a href="http://dx.doi.org/10.1145/335191.335372">Han et al., Mining frequent patterns without
 * candidate generation</a>.
 *
Here or elsewhere, comment that null featuresCol values are ignored during fit() and are treated as empty sets during transform().
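A possible phrasing for that note in the class Scaladoc (illustrative wording only, not the merged doc):

/**
 * @note null values in the featuresCol are ignored during fit(), and are
 *       treated as empty itemsets during transform().
 */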
I tried a few different ways to implement the transform: https://gist.github.com/hhbyyh/889b88ae2176d1263fdc9dd3e29d1c2d. Performance is actually similar across them, while the current one can maintain the original order of the input dataset. I would be glad to see a more optimized version.
Test build #73384 has finished for PR 15415 at commit
Hi @jkbradley After further performance comparison, I found that using broadcast gives much better performance for the transform. I tested with some public data from http://fimi.ua.ac.be/data/. The test can be verified by the code:

Thinking again, the broadcast implementation may have much better performance in any case, especially when the number of rules is large. The major issue is how to support generic types with the UDF; the only way I found is to enumerate the data types in a match clause. Any suggestions?
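For clarity, a minimal sketch of the "enumerate the data types in a match clause" idea; genericTransform and transformDispatch are hypothetical names standing in for the actual implementation:

import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types._

// Hypothetical typed implementation the dispatcher delegates to.
def genericTransform[T](dataset: Dataset[_]): DataFrame = ???

// Dispatch on the array element type so each branch fixes the type
// parameter of the generic implementation at compile time.
def transformDispatch(dataset: Dataset[_], itemsCol: String): DataFrame = {
  dataset.schema(itemsCol).dataType match {
    case ArrayType(IntegerType, _) => genericTransform[Int](dataset)
    case ArrayType(LongType, _) => genericTransform[Long](dataset)
    case ArrayType(StringType, _) => genericTransform[String](dataset)
    case dt => throw new IllegalArgumentException(s"Unsupported item type: $dt")
  }
}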
I agree that, if the set of rules is small (1-2 GB max), then collecting and broadcasting it is best. But for larger sets of rules, we'd have to keep it distributed. I'm very surprised by the time difference in your comparison. I'll experiment a little myself and get back soon.
jkbradley
left a comment
Thinking about it more, I'm OK with keeping the broadcast. Later on, we could write this using DataFrames and give a broadcast hint. Can you please make the warning for transform() more distinct (maybe with a "WARNING" label) and specific about broadcasting if we're going with this option?
I got this to work:
private def genericTransform4(dataset: Dataset[_]): DataFrame = {
  val rules: Array[(Seq[Any], Seq[Any])] = associationRules.rdd.map(r =>
    (r.getSeq(0), r.getSeq(1))
  ).collect().asInstanceOf[Array[(Seq[Any], Seq[Any])]]
  val brRules = dataset.sparkSession.sparkContext.broadcast(rules)
  val dt = dataset.schema($(featuresCol)).dataType
  // For each rule, examine the input items and summarize the consequents
  val predictUDF = udf((items: Seq[_]) => brRules.value.flatMap(rule =>
    if (items != null && rule._1.forall(item => items.contains(item))) {
      rule._2.filter(item => !items.contains(item))
    } else {
      Seq.empty
    }
  ).distinct, dt)
  dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol))))
}
Thanks!
 */
@Since("2.2.0")
@transient lazy val associationRules: DataFrame = {
  val freqItems = freqItemsets
Thanks @jkbradley for contributing the code. That helps a lot. I'll merge and send an update now.
@jkbradley I changed your version a little to first convert the Seq to a Set, to speed up the contains operations.
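A sketch of that change, assuming the same UDF shape as the snippet above (converting to a Set makes each contains check O(1) on average instead of O(n)):

val predictUDF = udf((items: Seq[_]) => {
  if (items != null) {
    val itemset = items.toSet  // fast average-case membership tests
    brRules.value.flatMap { rule =>
      if (rule._1.forall(itemset.contains)) {
        rule._2.filter(item => !itemset.contains(item))
      } else {
        Seq.empty
      }
    }.distinct
  } else {
    Seq.empty
  }
}, dt)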
Do we need to support this now? That may not be expected in all scenarios. And it seems an Imputer could help with the issue.
Test build #73452 has finished for PR 15415 at commit

Test build #73453 has finished for PR 15415 at commit
I don't think we need to support the default prediction (for empty/null inputs) now. I agree we could use an Imputer or add something as an option later on. Will take a final look now.
jkbradley
left a comment
Looks good, just small comments
if (items != null) {
  val itemset = items.toSet
  brRules.value.flatMap(rule =>
    if (items != null && rule._1.forall(item => itemset.contains(item))) {
No need to check items != null here since you put that above.
val predictUDF = udf((items: Seq[_]) => {
  if (items != null) {
    val itemset = items.toSet
    brRules.value.flatMap(rule =>
style:
brRules.value.flatMap { rule =>
...
}
test this please
I'm going to go ahead and merge this after tests to make sure it's in 2.2, but can you please send a follow-up for my last 2 comments? Thanks!
Sorry to have missed your comments. I can send a follow-up together with the documentation.
Test build #73616 has finished for PR 15415 at commit
No problem, thanks! Could you please create a subtask for docs? Merging with master.
What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-14503
Function parity: Add FPGrowth and AssociationRules to ML.
design doc: https://docs.google.com/document/d/1bVhABn5DiEj8bw0upqGMJT2L4nvO_0_cXdwu4uMT6uU/pub
Currently I make FPGrowthModel a Transformer. For each association rule, it will just examine the input items against the antecedent and summarize the consequents.
Update:
Thinking again, FPGrowth is only the algorithm for finding the frequent itemsets, and it can be replaced by other algorithms. The frequent itemsets are used by AssociationRules to generate the association rules. Then we can use the association rules to make predictions on other records.
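To make the proposed API concrete, a minimal usage sketch (column and parameter names follow this thread; treat it as an illustration of the proposal rather than the final merged API):

import org.apache.spark.ml.fpm.FPGrowth
import spark.implicits._

// Each row is one transaction: an array of items.
val dataset = spark.createDataset(Seq(
  "1 2 5",
  "1 2 3 5",
  "1 2")
).map(_.split(" ")).toDF("items")

val model = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.5)
  .setMinConfidence(0.6)
  .fit(dataset)

model.freqItemsets.show()      // DataFrame("items", "freq")
model.associationRules.show()  // DataFrame("antecedent", "consequent", "confidence")

// transform examines each row's items against the rules and adds the
// summarized consequents as the prediction column.
model.transform(dataset).show()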
For reviewers: let's first decide whether the current transform function meets your expectations. Current options:

1. Current implementation: use the Estimator and Transformer pattern in ML. The transform function will examine the input items against all the association rules and summarize the consequents. Users can also access frequent items and association rules via other model members.

2. Keep the Estimator and Transformer pattern, but AssociationRulesModel and FPGrowthModel will have an empty transform function, meaning the DataFrame is unchanged after transform. Users can still access frequent items and association rules via other model members.

3. (mentioned by @zhengruifeng) Keep the Estimator and Transformer pattern, but FPGrowthModel and AssociationRulesModel will just return the frequent itemsets and association rules DataFrames from the transform function, meaning the resulting DataFrame after transform will not be related to the input DataFrame.

4. Discard the Estimator and Transformer pattern. Both FPGrowth and FPGrowthModel will directly extend PipelineStage, so we don't need a transform function.

I'd like to hear more concrete suggestions. I would prefer option 1 or 2.
Update 2:
As discussed in the jira, we will not expose AssociationRules as a public API for now.
How was this patch tested?
New unit test suites.