
[SPARK-5832][Mllib] Add Affinity Propagation clustering algorithm #4622

Closed

viirya wants to merge 18 commits into apache:master from viirya:ap_clustering

Conversation

@viirya
Member

@viirya viirya commented Feb 16, 2015

Affinity Propagation (AP) is a graph clustering algorithm based on the concept of "message passing" between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not require the number of clusters to be determined or estimated before it is run. AP was developed by Frey and Dueck; please refer to their paper for details.
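For readers unfamiliar with the algorithm, the following is a minimal plain-Python sketch of the responsibility/availability message updates from Frey and Dueck's paper. It is illustrative only and is not the PR's distributed Scala implementation:

```python
def affinity_propagation(S, damping=0.5, iters=100):
    """Plain-Python sketch of AP's message-passing updates.

    S is an n x n similarity matrix (list of lists); S[k][k] is point
    k's "preference", which controls how many exemplars emerge.
    Returns, for each point i, the index of its chosen exemplar.
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities  a(i, k)
    for _ in range(iters):
        # r(i,k) <- s(i,k) - max_{k' != k} (a(i,k') + s(i,k'))
        for i in range(n):
            for k in range(n):
                m = max(A[i][kp] + S[i][kp] for kp in range(n) if kp != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - m)
        # a(i,k) <- min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        # a(k,k) <- sum_{i' != k} max(0, r(i',k))
        for k in range(n):
            pos = [max(0.0, R[ip][k]) for ip in range(n)]
            tot = sum(pos)
            for i in range(n):
                if i == k:
                    new = tot - pos[k]
                else:
                    new = min(0.0, R[k][k] + tot - pos[i] - pos[k])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    # each point's exemplar maximizes a(i,k) + r(i,k)
    return [max(range(n), key=lambda k: A[i][k] + R[i][k]) for i in range(n)]
```

The diagonal preference S[k][k] is the knob that replaces the explicit number of clusters: larger (less negative) preferences yield more exemplars.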

@SparkQA

SparkQA commented Feb 16, 2015

Test build #27546 has finished for PR 4622 at commit d469d12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class AffinityPropagationModel(

@SparkQA

SparkQA commented Feb 16, 2015

Test build #27556 has finished for PR 4622 at commit 99d812a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class AffinityPropagationModel(

@mengxr
Contributor

mengxr commented Feb 16, 2015

@viirya For new algorithms, we need to discuss whether we want to include it in MLlib on the JIRA page first (before implementing it). Could you describe more about this algorithm on the JIRA page, e.g., scalability, usage, and alternatives?

@viirya
Member Author

viirya commented Feb 17, 2015

@mengxr okay.

@viirya
Member Author

viirya commented Feb 17, 2015

@mengxr I updated the JIRA page. Please take a look. Thanks!

@SparkQA

SparkQA commented Feb 17, 2015

Test build #27622 has finished for PR 4622 at commit 6dbec7d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class AffinityPropagationModel(

@SparkQA

SparkQA commented Feb 17, 2015

Test build #27628 has finished for PR 4622 at commit 6cddeb2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class AffinityPropagationModel(

@viirya
Member Author

viirya commented Feb 20, 2015

@mengxr have we decided not to include this algorithm? If so, please let me know, and I will close this PR and maintain it as a third-party package. Thanks!

@mengxr
Contributor

mengxr commented Feb 20, 2015

I didn't see a cartesian in your code, so the complexity is really O(nnz * k) rather than O(n^2 * k), correct? That is the same complexity as PIC/PageRank. If this is the case, let's check the code.

Contributor

Please import only mutable and use mutable.Map and mutable.Set in the code. Mixing the default Map with the mutable Map sometimes causes problems that are hard to debug.

@SparkQA

SparkQA commented Apr 10, 2015

Test build #30031 has finished for PR 4622 at commit a485422.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class JavaAffinityPropagation
    • case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long])
    • class AffinityPropagationModel(
  • This patch does not change any dependencies.

@viirya
Member Author

viirya commented Apr 10, 2015

@mengxr I have addressed all your comments. Please take a look. Thanks.

@SparkQA

SparkQA commented Apr 10, 2015

Test build #30037 has finished for PR 4622 at commit ffe06c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class JavaAffinityPropagation
    • case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long])
    • class AffinityPropagationModel(
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 11, 2015

Test build #30067 has finished for PR 4622 at commit 97cef01.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class JavaAffinityPropagation
    • case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long])
    • class AffinityPropagationModel(
  • This patch does not change any dependencies.

Contributor

Is it reasonable to assume that k is small but the number of vertices is large? Then storing members as Array[Long] may run out of memory. We can store id and exemplar on the driver and the cluster assignments distributed as RDD[(Long, Int)] (vertex id, cluster id). Lookup becomes less expensive in this setup. Or we can store (id, exemplar) as an RDD too, though that may not be necessary.

Btw, is it sufficient to use Int for the cluster id? It won't provide much information if AP outputs more than Int.MaxValue clusters.
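To make the suggested layout concrete, here is a small plain-Python analogue (names are hypothetical, and an in-memory list stands in for the RDD[(Long, Int)]):

```python
def members_of(assignments, cluster_id):
    # assignments plays the role of RDD[(Long, Int)]: one (vertex id,
    # cluster id) pair per vertex. A cluster's members are recovered
    # by a filter over the pairs, so no single Array[Long] on the
    # driver has to hold every member id.
    return [v for v, c in assignments if c == cluster_id]

# the per-cluster (id, exemplar) summary stays small (k entries),
# so it can comfortably live on the driver
exemplar_of_cluster = {0: 10, 1: 12}
assignments = [(10, 0), (11, 0), (12, 1), (13, 1), (14, 0)]
```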

@dbtsai
Member

dbtsai commented May 6, 2015

General question: in Euclidean space, the negative squared error is used as the similarity. If we want to use affinity propagation to cluster a large number of samples in Euclidean space, it's impossible to create all the pairwise similarity data, even though it's symmetric. What is the criterion for filtering out the pairs that have very low similarity? Also, computing all pairs of an RDD[Vector] is an O(N^2) operation, so how do people address this in practice?

I really like this algorithm, but I still have concerns about how people can use it in practice.

Thanks.

@viirya
Member Author

viirya commented May 6, 2015

@dbtsai Thanks for the comments and the question.
I don't have a very good answer. We have used a threshold to filter out the insignificant similarities in large-scale data. As you said, it is impossible to compute all the pairs. However, the data we process is very high-dimensional and very sparse, so the all-pairs computation can be greatly reduced by considering only the pairs that have corresponding dimensions with values greater than zero (or above a defined threshold).
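As an illustration of that pruning idea, here is a plain-Python sketch with hypothetical names (using a dot-product similarity for concreteness; the thresholding idea is the same for other measures, and this code is not part of the PR):

```python
from collections import defaultdict

def sparse_similarity_pairs(vectors, threshold):
    """Only pairs of points that share at least one active dimension
    can have a nonzero dot-product similarity, so candidates are found
    through an inverted index instead of enumerating all O(N^2) pairs.

    vectors: list of {dimension: value} sparse maps.
    Returns {(i, j): similarity} for i < j with similarity >= threshold.
    """
    index = defaultdict(list)          # dimension -> ids of points active there
    for i, v in enumerate(vectors):
        for dim in v:
            index[dim].append(i)
    candidates = set()
    for ids in index.values():
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                candidates.add((ids[a], ids[b]))
    sims = {}
    for i, j in candidates:
        vi, vj = vectors[i], vectors[j]
        s = sum(val * vj[d] for d, val in vi.items() if d in vj)  # dot product
        if s >= threshold:              # drop insignificant similarities
            sims[(i, j)] = s
    return sims
```

The sparser the data, the fewer candidate pairs the inverted index produces, which is why this works for very high-dimensional, very sparse inputs.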

@dbtsai
Member

dbtsai commented May 6, 2015

@viirya Maybe you can comment on this in the documentation and in the code comments; it would be useful for people trying to understand the use case.

Member

persist(StorageLevel.MEMORY_AND_DISK)

Member

PS, I don't know if we can use an approximate median here, since sorting is very expensive.

Member Author

An approximate median might be good for reducing computation time, but it looks like there is no corresponding algorithm in Spark. I only saw an approximate mean, not an approximate median. If we had one, we could use it.

@SparkQA

SparkQA commented May 6, 2015

Test build #32017 has finished for PR 4622 at commit e062a94.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class JavaAffinityPropagation
    • case class AffinityPropagationAssignment(val id: Long, val exemplar: Long, val member: Long)
    • case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long])
    • class AffinityPropagationModel(
    • class JoinedRow6 extends Row
    • case class WindowSpecDefinition(
    • case class WindowSpecReference(name: String) extends WindowSpec
    • sealed trait FrameBoundary
    • case class ValuePreceding(value: Int) extends FrameBoundary
    • case class ValueFollowing(value: Int) extends FrameBoundary
    • case class SpecifiedWindowFrame(
    • trait WindowFunction extends Expression
    • case class UnresolvedWindowFunction(
    • case class UnresolvedWindowExpression(
    • case class WindowExpression(
    • case class WithWindowDefinition(
    • case class Window(
    • case class Window(
    • case class ComputedWindow(


This is going to do a lookup for a fraction because (count % 2 != 0). Is that how it's supposed to work?

Member Author

We will get the largest integer less than or equal to the fraction, i.e., the result of Math.floor().
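A tiny worked example of that indexing (illustrative Python with hypothetical values, not the PR's Scala code):

```python
import math

def median_index(count):
    # For an odd count, floor(count / 2) is the 0-based index of the
    # middle element in a sorted sequence, which is what the fractional
    # lookup resolves to after Math.floor().
    return math.floor(count / 2)

sorted_vals = [3, 7, 9, 12, 20]        # count = 5, so count % 2 != 0
mid = median_index(len(sorted_vals))   # floor(5 / 2) = 2
```

Here sorted_vals[mid] is 9, the true median of the five values.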

@SparkQA

SparkQA commented May 7, 2015

Test build #32103 has finished for PR 4622 at commit 0c7a26f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class JavaAffinityPropagation
    • case class AffinityPropagationAssignment(val id: Long, val exemplar: Long, val member: Long)
    • case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long])
    • class AffinityPropagationModel(

Member

I still think this will not work well in a real big-data setting. Doing a total sort requires a lot of shuffle and will be extremely slow.

I would recommend that we first implement streaming quantiles and median in org.apache.spark.util.StatCounter; they compute an estimate of the median with the benefit of not requiring the data to be sorted.

Mahout and Pig's DataFu have this implementation. We may port the logic to Spark. Please open a JIRA for this.
http://datafu.incubator.apache.org/docs/datafu/getting-started.html
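For intuition, a reservoir-sampling estimate is one simple way to approximate a median without a total sort; the streaming-quantile sketches in DataFu are more sophisticated, but the trade-off is the same. This is an illustrative plain-Python sketch with hypothetical names, not a proposal for the Spark API:

```python
import random

def approx_median(stream, sample_size=1001, seed=0):
    """Estimate the median of a stream in one pass: maintain a
    fixed-size uniform reservoir sample, then take the exact median
    of the sample. Exactness is traded for avoiding a full sort."""
    rng = random.Random(seed)
    reservoir = []
    for n, x in enumerate(stream, start=1):
        if len(reservoir) < sample_size:
            reservoir.append(x)
        else:
            j = rng.randrange(n)       # classic reservoir sampling step
            if j < sample_size:
                reservoir[j] = x
    reservoir.sort()                   # sorting only the small sample
    return reservoir[len(reservoir) // 2]
```

Only the sample (here at most 1001 elements) is ever sorted, so the cost is one pass over the data plus O(sample_size * log(sample_size)).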

Member Author

@dbtsai Thanks for the pointers. I opened the JIRA at https://issues.apache.org/jira/browse/SPARK-7486. I will take a look at the DataFu implementation.

Member Author

@dbtsai I submitted support for approximate quantiles in #6042. This PR can use it to find an approximate median.

Member

@viirya Sounds great. I'll take a look at #6042. Thanks.

@viirya
Member Author

viirya commented Jul 15, 2015

@mengxr any plan to revisit this and merge it?

@rxin
Contributor

rxin commented Dec 31, 2015

I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!

@asfgit asfgit closed this in 7b4452b Dec 31, 2015
@viirya viirya deleted the ap_clustering branch December 27, 2023 18:17


7 participants