[SPARK-5832][Mllib] Add Affinity Propagation clustering algorithm by viirya · Pull Request #4622 · apache/spark

viirya · 2015-02-16T08:44:27Z

Affinity Propagation (AP) is a graph clustering algorithm based on the concept of "message passing" between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not require the number of clusters to be determined or estimated before running it. AP was developed by Frey and Dueck. Please refer to their paper for details.
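For readers new to the algorithm, here is a minimal single-machine sketch of the message passing, written in plain Python rather than the Scala of this patch. It uses the standard damped responsibility and availability updates on a small dense similarity matrix; the function name, the toy data, and the choice of the median similarity as the preference are assumptions made for illustration and are not taken from this PR's code.

```python
from statistics import median


def affinity_propagation(s, iterations=200, damping=0.5):
    """Cluster n >= 2 points from a dense n x n similarity matrix.

    s[i][k] is the similarity of point i to point k, and s[k][k] is the
    "preference" of point k for being an exemplar. Returns, for each
    point, the index of the exemplar it ends up choosing.
    """
    n = len(s)
    r = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    a = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        # r(i, k) = s(i, k) - max over k' != k of (a(i, k') + s(i, k'))
        for i in range(n):
            for k in range(n):
                best_other = max(a[i][j] + s[i][j] for j in range(n) if j != k)
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - best_other)
        # a(i, k) = min(0, r(k, k) + sum of positive r(j, k) for j not in {i, k})
        # a(k, k) = sum of positive r(j, k) for j != k
        for k in range(n):
            positive = [max(0.0, r[j][k]) for j in range(n)]
            total = sum(positive) - positive[k]
            for i in range(n):
                if i == k:
                    new_a = total
                else:
                    new_a = min(0.0, r[k][k] + total - positive[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * new_a
    # Each point picks the candidate exemplar with the largest a + r.
    return [max(range(n), key=lambda k: a[i][k] + r[i][k]) for i in range(n)]


# Toy data: two well-separated groups on a line (hypothetical example).
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
n = len(points)
s = [[-(points[i] - points[j]) ** 2 for j in range(n)] for i in range(n)]
preference = median(s[i][j] for i in range(n) for j in range(n) if i != j)
for k in range(n):
    s[k][k] = preference
labels = affinity_propagation(s)
print(labels)
```

Note that the number of clusters is never passed in; it falls out of the preference values on the diagonal. This dense version costs O(n^2) per iteration and is only meant to show the updates, not how the patch distributes them.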

SparkQA · 2015-02-16T10:03:50Z

Test build #27546 has finished for PR 4622 at commit d469d12.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class AffinityPropagationModel(

SparkQA · 2015-02-16T17:34:37Z

Test build #27556 has finished for PR 4622 at commit 99d812a.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class AffinityPropagationModel(

mengxr · 2015-02-16T19:24:54Z

@viirya For new algorithms, we need to discuss whether we want to include it in MLlib on the JIRA page first (before implementing it). Could you describe more about this algorithm on the JIRA page, e.g., scalability, usage, and alternatives?

viirya · 2015-02-17T06:16:35Z

@mengxr okay.

viirya · 2015-02-17T09:21:28Z

@mengxr I updated the JIRA page. Please take a look. Thanks!

SparkQA · 2015-02-17T09:26:23Z

Test build #27622 has finished for PR 4622 at commit 6dbec7d.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class AffinityPropagationModel(

SparkQA · 2015-02-17T11:22:52Z

Test build #27628 has finished for PR 4622 at commit 6cddeb2.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class AffinityPropagationModel(

viirya · 2015-02-20T17:38:21Z

@mengxr Have we decided not to include this algorithm? If so, please let me know. Then I will close this PR and maintain it as a third-party package. Thanks!

mengxr · 2015-02-20T18:00:02Z

I didn't see cartesian in your code. So the complexity is really O(nnz * k) but not O(n^2 * k), correct? This is the same complexity as PIC/PageRank. If this is the case, let's check the code.

mengxr · 2015-02-20T18:47:39Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala

Please only import mutable and use mutable.Map and mutable.Set in the code. Sometimes, mixing the default Map with mutable Map causes problems that are hard to debug.

SparkQA · 2015-04-10T15:54:24Z

Test build #30031 has finished for PR 4622 at commit a485422.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class JavaAffinityPropagation
- case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long])
- class AffinityPropagationModel(
This patch does not change any dependencies.

viirya · 2015-04-10T16:03:09Z

@mengxr I have addressed all your comments. Please take a look. Thanks.

SparkQA · 2015-04-10T18:11:34Z

Test build #30037 has finished for PR 4622 at commit ffe06c3.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class JavaAffinityPropagation
- case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long])
- class AffinityPropagationModel(
This patch does not change any dependencies.

SparkQA · 2015-04-11T08:58:42Z

Test build #30067 has finished for PR 4622 at commit 97cef01.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class JavaAffinityPropagation
- case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long])
- class AffinityPropagationModel(
This patch does not change any dependencies.

mengxr · 2015-05-05T15:08:14Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala

Is it reasonable to assume that k is small but the number of vertices is large? Then storing members as Array[Long] may run out of memory. We can store id and exemplar on the driver and the cluster assignments distributively as RDD[(Long, Int)] (vertex id, cluster id). Lookup becomes less expensive in this setup. Or we can store (id, exemplar) as an RDD too, which may not be necessary.

Btw, is it sufficient to use Int for cluster id? It won't provide much information if AP outputs more than Int.MaxValue clusters.
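To make the suggested layout concrete, here is a small sketch in plain Python, with ordinary lists standing in for RDDs. The helper names `to_flat_assignments` and `members_of` are hypothetical and do not come from the patch; the point is only that the small (exemplar, cluster id) table can live on the driver while the per-vertex assignments stay as flat (vertex id, cluster id) pairs.

```python
def to_flat_assignments(vertex_exemplar_pairs):
    """Split (vertex id, exemplar id) pairs into two structures.

    Returns a small dict mapping exemplar id -> integer cluster id (cheap
    to keep on the driver when the number of clusters is small) and a flat
    list of (vertex id, cluster id) pairs (the part that would stay
    distributed in a real job).
    """
    exemplars = sorted({exemplar for _, exemplar in vertex_exemplar_pairs})
    exemplar_to_cluster = {e: cid for cid, e in enumerate(exemplars)}
    assignments = [(v, exemplar_to_cluster[e]) for v, e in vertex_exemplar_pairs]
    return exemplar_to_cluster, assignments


def members_of(assignments, cluster_id):
    """Look up the members of one cluster by filtering the flat pairs."""
    return [v for v, c in assignments if c == cluster_id]


pairs = [(0, 1), (1, 1), (2, 1), (3, 4), (4, 4), (5, 4)]
exemplar_to_cluster, assignments = to_flat_assignments(pairs)
print(exemplar_to_cluster)
print(members_of(assignments, 1))
```

With this shape no single record grows with the cluster size, which is the memory concern raised about `members: Array[Long]`.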

dbtsai · 2015-05-06T07:29:53Z

General question. In Euclidean space, the negative squared error is used as the similarity. If we want to use affinity propagation to cluster lots of samples in Euclidean space, it's impossible to create all pairs of similarity data, even if the similarity is symmetric. What are the criteria for filtering out pairs that have very low similarity? Also, it's impossible to compute all pairs from an RDD[Vector] since it's an O(N^2) operation, so how do people address this in practice?

I really like this algorithm, but still have concern about how people can use it in practice.

Thanks.

viirya · 2015-05-06T17:14:31Z

@dbtsai Thanks for comments and the question.
I don't have a very good answer to the question. We have used a threshold to filter out the insignificant similarities in large-scale data. As you said, it is impossible to compute all the pairs. However, the data we process is very high dimensional and very sparse, so the all-pairs computation can be greatly reduced by only considering pairs that share at least one dimension with a value greater than zero (or greater than a defined threshold).
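A rough sketch of that idea in plain Python: build an inverted index from dimension to the vectors that are nonzero there, generate candidate pairs only within each dimension's posting list, and drop pairs whose similarity does not exceed a threshold. The dot product is used as the similarity purely as an assumption for the example, and the function name `sparse_similarities` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def sparse_similarities(vectors, threshold=0.0):
    """Compute similarities only for pairs that share a nonzero dimension.

    vectors maps an integer id to a sparse vector stored as {dimension: value}.
    Returns {(i, j): similarity} for i < j whose dot product exceeds threshold.
    Pairs with no dimension in common are never generated at all.
    """
    # Inverted index: dimension -> ids of vectors that are nonzero there.
    index = defaultdict(list)
    for vid, vec in vectors.items():
        for dim, value in vec.items():
            if value != 0:
                index[dim].append(vid)
    # Accumulate the dot product one shared dimension at a time.
    sims = defaultdict(float)
    for dim, ids in index.items():
        for i, j in combinations(sorted(ids), 2):
            sims[(i, j)] += vectors[i][dim] * vectors[j][dim]
    return {pair: value for pair, value in sims.items() if value > threshold}


vectors = {
    0: {"a": 1.0, "b": 2.0},
    1: {"b": 3.0},
    2: {"c": 5.0},
    3: {"a": 0.1},
}
print(sparse_similarities(vectors, threshold=0.5))
```

The work is proportional to the sum of squared posting-list lengths rather than to N^2, so it helps only when no single dimension is shared by a large fraction of the vectors.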

dbtsai · 2015-05-06T18:32:59Z

@viirya Maybe you can comment on this in the documentation and in the comment of the code, and it will be useful for people trying to understand the use-case.

dbtsai · 2015-05-06T19:08:50Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala

persist(StorageLevel.MEMORY_AND_DISK)

PS, I don't know if we can use approximated Median here since sort is very expensive.

An approximate median might be good for reducing computation time. But it looks like there is no corresponding algorithm in Spark? I only saw an approximate mean, not an approximate median. Maybe we can use it.

SparkQA · 2015-05-06T19:11:02Z

Test build #32017 has finished for PR 4622 at commit e062a94.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class JavaAffinityPropagation
- case class AffinityPropagationAssignment(val id: Long, val exemplar: Long, val member: Long)
- case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long])
- class AffinityPropagationModel(
- class JoinedRow6 extends Row
- case class WindowSpecDefinition(
- case class WindowSpecReference(name: String) extends WindowSpec
- sealed trait FrameBoundary
- case class ValuePreceding(value: Int) extends FrameBoundary
- case class ValueFollowing(value: Int) extends FrameBoundary
- case class SpecifiedWindowFrame(
- trait WindowFunction extends Expression
- case class UnresolvedWindowFunction(
- case class UnresolvedWindowExpression(
- case class WindowExpression(
- case class WithWindowDefinition(
- case class Window(
- case class Window(
- case class ComputedWindow(

duncanfinney · 2015-05-07T04:18:40Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala

This is going to do a lookup for a fraction because (count % 2 !== 0). Is that how it's supposed to work?

We will get the largest integer less than or equal to the fraction, i.e., the result of Math.floor().
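The exact lookup code under review is not quoted in this thread, so the following is only an assumed reconstruction of the index arithmetic being discussed, in plain Python: for an odd count, half the count is fractional, and flooring it gives the zero-based position of the middle element.

```python
def median_of_sorted(sorted_values):
    """Median of an already sorted, non-empty list.

    For an odd count, count / 2 is fractional; floor division gives the
    zero-based index of the middle element. For an even count, the two
    middle elements are averaged.
    """
    count = len(sorted_values)
    if count == 0:
        raise ValueError("median of empty input")
    middle = count // 2  # floor(count / 2)
    if count % 2 != 0:
        return sorted_values[middle]
    return (sorted_values[middle - 1] + sorted_values[middle]) / 2


print(median_of_sorted([1, 3, 5, 7, 9]))
print(median_of_sorted([1, 3, 5, 7]))
```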

SparkQA · 2015-05-07T12:07:14Z

Test build #32103 has finished for PR 4622 at commit 0c7a26f.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class JavaAffinityPropagation
- case class AffinityPropagationAssignment(val id: Long, val exemplar: Long, val member: Long)
- case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long])
- class AffinityPropagationModel(

dbtsai · 2015-05-07T21:17:31Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala

I still think this will not work well in a real big-data setting. Doing a total sort will require a lot of shuffle and will be extremely slow.

I would recommend that we implement streaming quantiles and median in org.apache.spark.util.StatCounter. This computes an estimate of the median and has the benefit of not requiring the data to be sorted first.

Mahout and Pig DataFu have this implementation. We may port the logic to Spark. Please open a JIRA for this.
http://datafu.incubator.apache.org/docs/datafu/getting-started.html

@dbtsai Thanks for the pointer. I opened a JIRA at https://issues.apache.org/jira/browse/SPARK-7486. I will take a look at the DataFu implementation.

@dbtsai I submitted support for approximate quantiles in #6042. This PR can use it to find an approximate median.

@viirya Sounds great. I'll take a look at #6042. Thanks.
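To illustrate the general idea of estimating a median in one pass without a total sort, here is a sketch that uses reservoir sampling. This is a deliberately simpler stand-in: it is not the streaming-quantile algorithm from DataFu and not the approach in #6042, and the function name and parameters are hypothetical.

```python
import random


def approximate_median(values, sample_size=1000, seed=0):
    """Estimate the median of an iterable in a single pass.

    Keeps a uniform random sample of at most sample_size items (reservoir
    sampling, "Algorithm R") and returns the middle element of the sorted
    sample. Only the small sample is sorted, never the full input.
    """
    rng = random.Random(seed)
    reservoir = []
    for seen, value in enumerate(values, start=1):
        if len(reservoir) < sample_size:
            reservoir.append(value)
        else:
            # Keep the new item with probability sample_size / seen.
            slot = rng.randrange(seen)
            if slot < sample_size:
                reservoir[slot] = value
    if not reservoir:
        raise ValueError("median of empty input")
    reservoir.sort()
    return reservoir[len(reservoir) // 2]


estimate = approximate_median(range(100001))
print(estimate)
```

The estimate is exact when the input fits in the reservoir, and otherwise its error shrinks roughly with the square root of the sample size. In a distributed setting, per-partition samples would also have to be merged with appropriate weighting, which this sketch does not attempt.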

viirya · 2015-07-15T04:54:01Z

@mengxr any plan to revisit this and merge it?

rxin · 2015-12-31T02:42:38Z

I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!

Add Affinity Propagation clustering algorithm.

d469d12

Choose exemplars in distributed way. Add exemplars to model.

99d812a

viirya added 2 commits February 17, 2015 16:03

Ap doesn't require similarity to be negative. Fix normalization bug.

6dbec7d

Add preferences to unit test data. Don't duplicate preferences in sym…

6cddeb2

…metric mode.

mengxr reviewed Feb 20, 2015

Add Java-friendly determinePreferences and example.

ffe06c3

Add function to manually set up preferences.

97cef01

mengxr reviewed May 5, 2015

viirya added 2 commits May 7, 2015 02:51

Merge remote-tracking branch 'upstream/master' into ap_clustering

6c283ce

Address comments.

e062a94

dbtsai reviewed May 6, 2015

duncanfinney reviewed May 7, 2015

Fix style and use StorageLevel.MEMORY_AND_DISK for persisting.

0c7a26f

dbtsai reviewed May 7, 2015

asfgit closed this in 7b4452b Dec 31, 2015

viirya deleted the ap_clustering branch December 27, 2023 18:17

7 participants