[SPARK-5832][Mllib] Add Affinity Propagation clustering algorithm #4622
viirya wants to merge 18 commits into apache:master
|
Test build #27546 has finished for PR 4622 at commit
|
|
Test build #27556 has finished for PR 4622 at commit
|
|
@viirya For new algorithms, we need to discuss whether we want to include it in MLlib on the JIRA page first (before implementing it). Could you describe more about this algorithm on the JIRA page, e.g., scalability, usage, and alternatives? |
|
@mengxr okay. |
|
@mengxr I updated the JIRA page. Please take a look. Thanks! |
|
Test build #27622 has finished for PR 4622 at commit
|
|
Test build #27628 has finished for PR 4622 at commit
|
|
@mengxr have we decided not to include this algorithm? If so, please let me know. Then I will close this PR and maintain it as a third-party package. Thanks! |
|
I didn't see |
Please only import mutable and use mutable.Map and mutable.Set in the code. Sometimes, mixing the default Map with mutable Map causes problems that are hard to debug.
|
Test build #30031 has finished for PR 4622 at commit
|
|
@mengxr I have addressed all your comments. Please take a look. Thanks. |
|
Test build #30037 has finished for PR 4622 at commit
|
|
Test build #30067 has finished for PR 4622 at commit
|
Is it reasonable to assume that k is small but the number of vertices is large? Then storing members as Array[Long] may run out of memory. We can store id and exemplar on the driver and cluster assignments distributively as RDD[(Long, Int)] (vertex id, cluster id). Lookup becomes less expensive in this setup. Or we can store (id, exemplar) as an RDD too, which may not be necessary.
Btw, is it sufficient to use Int for cluster id? It won't provide much information if AP outputs more than Int.MaxValue clusters.
|
General question. In Euclidean space, the negative squared error is used as similarity. If we want to use affinity propagation to cluster lots of samples in Euclidean space, it's impossible to create all the pairs of similarity data, even though it's symmetric. What are the criteria to filter out those pairs which have very low similarity? Also, it's impossible to compute all the pairs of [...] I really like this algorithm, but I still have concerns about how people can use it in practice. Thanks. |
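To make the concern concrete, here is a minimal single-machine sketch in plain Python (not the PR's Scala code). It uses the negative squared Euclidean distance as similarity and keeps only pairs above a cutoff. The function names and the `threshold` parameter are hypothetical, and a fixed cutoff is just one possible filter; the thread does not say which criterion the PR uses. Note that this sketch still visits every pair, which is exactly the cost the question is about.

```python
from itertools import combinations


def neg_sq_euclidean(a, b):
    """Similarity in Euclidean space: negative squared distance."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))


def sparse_similarities(points, threshold):
    """Return {(i, j): similarity} for i < j, dropping pairs below threshold.

    `threshold` is a hypothetical cutoff chosen for illustration only.
    All O(n^2) pairs are still enumerated here; only storage is reduced.
    """
    sims = {}
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        s = neg_sq_euclidean(p, q)
        if s >= threshold:
            sims[(i, j)] = s
    return sims


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
sims = sparse_similarities(points, threshold=-2.0)
print(sims)
```

Because the similarity is symmetric, only pairs with i < j are stored.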
|
@dbtsai Thanks for the comments and the question. |
|
@viirya Maybe you can comment on this in the documentation and in the code comments; it will be useful for people trying to understand the use case. |
persist(StorageLevel.MEMORY_AND_DISK)
PS, I don't know if we can use an approximate median here, since sorting is very expensive.
An approximate median might be good for reducing computation time, but it looks like there is no corresponding algorithm in Spark. I only saw an approximate mean, not an approximate median. Maybe we can use it.
|
Test build #32017 has finished for PR 4622 at commit
|
This is going to do a lookup for a fraction because (count % 2 !== 0). Is that how it's supposed to work?
We will get the largest integer less than or equal to the fraction, i.e., the result of Math.floor().
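A small illustration of that index arithmetic, written in plain Python rather than the PR's Scala; the helper name `median_index` and the sample values are made up for the example. For an odd count, count / 2 is fractional and the floor is the 0-based middle position. For an even count this rule picks the upper of the two middle elements; whether the PR treats even counts differently is not shown in the thread.

```python
import math


def median_index(count):
    """0-based position used for the median lookup: floor(count / 2)."""
    return math.floor(count / 2)


# Hypothetical, already sorted similarity values (odd count).
sorted_values = [-9.0, -4.0, -3.0, -1.0, -0.5]
idx = median_index(len(sorted_values))  # 5 / 2 = 2.5, floored to 2
median = sorted_values[idx]
print(idx, median)
```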
|
Test build #32103 has finished for PR 4622 at commit
|
I still think this will not work well in a real big-data setting. Doing a total sort will require a lot of shuffling and will be extremely slow.
I recommend that we first implement streaming quantiles and median in org.apache.spark.util.StatCounter. They compute an estimate of the median, with the benefit of not requiring the data to be sorted.
Mahout and Pig DataFu have implementations of this. We may port the logic to Spark. Please open a JIRA for this.
http://datafu.incubator.apache.org/docs/datafu/getting-started.html
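For readers unfamiliar with the idea, the sketch below estimates a median in one pass without sorting the full data set. It is plain Python and uses reservoir sampling, a simpler stand-in technique: it is not the streaming-quantile algorithm from DataFu or Mahout, and it is not a Spark API. The function name, `sample_size`, and `seed` are assumptions made for the example.

```python
import random
import statistics


def approx_median(stream, sample_size=501, seed=0):
    """Estimate the median from a fixed-size uniform sample of the stream.

    Only `sample_size` values are held in memory, and only that small
    sample is sorted (inside statistics.median).
    """
    rng = random.Random(seed)
    reservoir = []
    for n, value in enumerate(stream):
        if n < sample_size:
            reservoir.append(value)
        else:
            # Keep each later element with probability sample_size / (n + 1).
            j = rng.randint(0, n)
            if j < sample_size:
                reservoir[j] = value
    if not reservoir:
        raise ValueError("cannot estimate the median of an empty stream")
    return statistics.median(reservoir)


values = [float(v) for v in range(10001)]  # true median is 5000.0
estimate = approx_median(values)
print(estimate)
```

The estimate gets tighter as `sample_size` grows; when the stream is shorter than the sample size, the result is the exact median.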
@dbtsai Thanks for the suggestion. I opened a JIRA at https://issues.apache.org/jira/browse/SPARK-7486. I will take a look at the DataFu implementation.
|
@mengxr any plan to revisit this and merge it? |
|
I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks! |
Affinity Propagation (AP) is a graph clustering algorithm based on the concept of "message passing" between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not require the number of clusters to be determined or estimated before running it. AP was developed by Frey and Dueck. Please refer to the paper.
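As a rough illustration of the message passing described above, here is a minimal dense, single-machine sketch in plain Python. It follows the standard responsibility and availability updates with damping; it is not the distributed implementation proposed in this PR. The function name, the default `damping` and `iterations`, the toy data, and the preference value of -10 are all assumptions chosen for the example.

```python
def affinity_propagation(S, damping=0.5, iterations=100):
    """Dense affinity propagation on an n x n similarity matrix S (n >= 2).

    S[k][k] holds the "preference" of point k to become an exemplar.
    Returns (exemplars, labels); labels[i] is the exemplar index chosen for i.
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        # r(i, k) = s(i, k) - max over k' != k of (a(i, k') + s(i, k'))
        for i in range(n):
            for k in range(n):
                rival = max(A[i][j] + S[i][j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        # a(i, k) = min(0, r(k, k) + sum of positive r(i', k), i' not in {i, k})
        # a(k, k) = sum of positive r(i', k), i' != k
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            support = sum(pos) - pos[k]
            for i in range(n):
                if i == k:
                    new = support
                else:
                    new = min(0.0, R[k][k] + support - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    # A point is an exemplar when its self-responsibility plus
    # self-availability is positive.
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:
        return [], [None] * n
    labels = [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
              for i in range(n)]
    return exemplars, labels


# Toy 1-D data: two well-separated groups of three points each.
xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
preference = -10.0  # arbitrary value for this example
S = [[preference if i == j else -(a - b) ** 2 for j, b in enumerate(xs)]
     for i, a in enumerate(xs)]
exemplars, labels = affinity_propagation(S)
print(exemplars, labels)
```

On this toy input the number of clusters is never passed in; it emerges from the similarities and the preference value. A lower preference generally yields fewer exemplars, a higher one yields more.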