[SEDONA-655] DBSCAN (#1589)

* add dbscan scala * add dbscan python * add dbscan tests, pom file changes, pip changes * disable broadcast joins for all dbscan tests * disable non-sedona broadcast joins for all dbscan tests * unpersist dbscan result assuming that graphframes PR will eventually be merged. * revisions from Paweł * add documentation * styling in docs Co-authored-by: Kelly-Ann Dolor <[email protected]> * reword stats documentation Co-authored-by: Kelly-Ann Dolor <[email protected]> * clean up --------- Co-authored-by: jameswillis <[email protected]> Co-authored-by: Kelly-Ann Dolor <[email protected]>
apache · Oct 8, 2024 · f83915e · f83915e
1 parent 05245fa
commit f83915e
Show file tree

Hide file tree

Showing 14 changed files with 657 additions and 0 deletions.
diff --git a/docs/api/stats/sql.md b/docs/api/stats/sql.md
@@ -0,0 +1,31 @@
+## Overview
+
+Sedona's stats module provides Scala and Python functions for conducting geospatial
+statistical analysis on dataframes with spatial columns.
+The stats module is built on top of the core module and provides a set of functions
+that can be used to perform spatial analysis on these dataframes. The stats module
+is designed to be used with the core module and the viz module to provide a
+complete set of geospatial analysis tools.
+
+## Using DBSCAN
+
+The DBSCAN function is provided at `org.apache.sedona.stats.DBSCAN.dbscan` in scala/java and `sedona.stats.dbscan.dbscan` in python.
+
+The function annotates a dataframe with a cluster label for each data record using the DBSCAN algorithm.
+The dataframe should contain at least one `GeometryType` column. Rows must be unique. If one
+geometry column is present it will be used automatically. If two are present, the one named
+'geometry' will be used. If more than one are present and none are named 'geometry', the
+column name must be provided. The new column will be named 'cluster'.
+
+### Parameters
+
+names in parentheses are python variable names
+
+- dataframe - dataframe to cluster. Must contain at least one GeometryType column
+- epsilon - minimum distance parameter of DBSCAN algorithm
+- minPts (min_pts) - minimum number of points parameter of DBSCAN algorithm
+- geometry - name of the geometry column
+- includeOutliers (include_outliers) - whether to include outliers in the output. Default is false
+- useSpheroid (use_spheroid) - whether to use a cartesian or spheroidal distance calculation. Default is false
+
+The output is the input DataFrame with the cluster label added to each row. Outlier will have a cluster value of -1 if included.
diff --git a/docs/tutorial/sql.md b/docs/tutorial/sql.md
@@ -739,6 +739,60 @@ The coordinates of polygons have been changed. The output will be like this:
 
 ```
 
+## Cluster with DBSCAN
+
+Sedona provides an implementation of the [DBSCAN](https://en.wikipedia.org/wiki/Dbscan) algorithm to cluster spatial data.
+
+The algorithm is available as a Scala and Python function called on a spatial dataframe. The returned dataframe has an additional column added containing the unique identifier of the cluster that record is a member of and a boolean column indicating if the record is a core point.
+
+The first parameter is the dataframe, the next two are the epsilon and min_points parameters of the DBSCAN algorithm.
+
+=== "Scala"
+
+	```scala
+	import org.apache.sedona.stats.DBSCAN.dbscan
+
+	dbscan(df, 0.1, 5).show()
+	```
+
+=== "Java"
+
+	```java
+	import org.apache.sedona.stats.DBSCAN;
+
+	DBSCAN.dbscan(df, 0.1, 5).show();
+	```
+
+=== "Python"
+
+	```python
+	from sedona.stats.dbscan import dbscan
+
+	dbscan(df, 0.1, 5).show()
+	```
+
+The output will look like this:
+
+```
++----------------+---+------+-------+
+|        geometry| id|isCore|cluster|
++----------------+---+------+-------+
+|   POINT (2.5 4)|  3| false|      1|
+|     POINT (3 4)|  2| false|      1|
+|     POINT (3 5)|  5| false|      1|
+|     POINT (1 3)|  9|  true|      0|
+| POINT (2.5 4.5)|  7|  true|      1|
+|     POINT (1 2)|  1|  true|      0|
+| POINT (1.5 2.5)|  4|  true|      0|
+| POINT (1.2 2.5)|  8|  true|      0|
+|   POINT (1 2.5)| 11|  true|      0|
+|     POINT (1 5)| 10| false|     -1|
+|     POINT (5 6)| 12| false|     -1|
+|POINT (12.8 4.5)|  6| false|     -1|
+|     POINT (4 3)| 13| false|     -1|
++----------------+---+------+-------+
+```
+
 ## Run spatial queries
 
 After creating a Geometry type column, you are able to run spatial queries.

diff --git a/mkdocs.yml b/mkdocs.yml
@@ -83,6 +83,8 @@ nav:
               - Parameter: api/sql/Parameter.md
           - RDD (core):
               - Scala/Java doc: api/java-api.md
+          - Stats:
+              - DataFrame: api/stats/sql.md
           - Viz:
               - DataFrame/SQL: api/viz/sql.md
               - RDD: api/viz/java-api.md

diff --git a/pom.xml b/pom.xml
@@ -85,6 +85,7 @@
         <spark.version>3.3.0</spark.version>
         <spark.compat.version>3.3</spark.compat.version>
         <log4j.version>2.17.2</log4j.version>
+        <graphframe.version>0.8.3-spark3.4</graphframe.version>
 
         <flink.version>1.19.0</flink.version>
         <slf4j.version>1.7.36</slf4j.version>
@@ -394,6 +395,10 @@
                 <enabled>true</enabled>
             </releases>
         </repository>
+        <repository>
+            <id>Spark Packages</id>
+            <url>https://repos.spark-packages.org/</url>
+        </repository>
     </repositories>
     <build>
         <pluginManagement>
@@ -578,6 +583,8 @@
                         <scala.compat.version>${scala.compat.version}</scala.compat.version>
                         <spark.version>${spark.version}</spark.version>
                         <scala.version>${scala.version}</scala.version>
+                        <log4j.version>${log4j.version}</log4j.version>
+                        <graphframe.version>${graphframe.version}</graphframe.version>
                     </properties>
                 </configuration>
                 <executions>
@@ -686,6 +693,7 @@
                 <spark.version>3.0.3</spark.version>
                 <spark.compat.version>3.0</spark.compat.version>
                 <log4j.version>2.17.2</log4j.version>
+                <graphframe.version>0.8.1-spark3.0</graphframe.version>
                 <!-- Skip deploying parent module. it will be deployed with sedona-spark-3.3 -->
                 <skip.deploy.common.modules>true</skip.deploy.common.modules>
             </properties>
@@ -703,6 +711,7 @@
                 <spark.version>3.1.2</spark.version>
                 <spark.compat.version>3.1</spark.compat.version>
                 <log4j.version>2.17.2</log4j.version>
+                <graphframe.version>0.8.2-spark3.1</graphframe.version>
                 <!-- Skip deploying parent module. it will be deployed with sedona-spark-3.3 -->
                 <skip.deploy.common.modules>true</skip.deploy.common.modules>
             </properties>
@@ -720,6 +729,7 @@
                 <spark.version>3.2.0</spark.version>
                 <spark.compat.version>3.2</spark.compat.version>
                 <log4j.version>2.17.2</log4j.version>
+                <graphframe.version>0.8.2-spark3.2</graphframe.version>
                 <!-- Skip deploying parent module. it will be deployed with sedona-spark-3.3 -->
                 <skip.deploy.common.modules>true</skip.deploy.common.modules>
             </properties>
@@ -738,6 +748,7 @@
                 <spark.version>3.3.0</spark.version>
                 <spark.compat.version>3.3</spark.compat.version>
                 <log4j.version>2.17.2</log4j.version>
+                <graphframe.version>0.8.3-spark3.4</graphframe.version>
             </properties>
         </profile>
         <profile>
@@ -752,6 +763,7 @@
                 <spark.version>3.4.0</spark.version>
                 <spark.compat.version>3.4</spark.compat.version>
                 <log4j.version>2.19.0</log4j.version>
+                <graphframe.version>0.8.3-spark3.4</graphframe.version>
                 <!-- Skip deploying parent module. it will be deployed with sedona-spark-3.3 -->
                 <skip.deploy.common.modules>true</skip.deploy.common.modules>
             </properties>
@@ -768,6 +780,7 @@
                 <spark.version>3.5.0</spark.version>
                 <spark.compat.version>3.5</spark.compat.version>
                 <log4j.version>2.20.0</log4j.version>
+                <graphframe.version>0.8.3-spark3.5</graphframe.version>
                 <!-- Skip deploying parent module. it will be deployed with sedona-spark-3.3 -->
                 <skip.deploy.common.modules>true</skip.deploy.common.modules>
             </properties>

diff --git a/python/Pipfile b/python/Pipfile
@@ -10,6 +10,8 @@ jupyter="*"
 mkdocs="*"
 pytest-cov = "*"
 
+scikit-learn = "*"
+
 [packages]
 pandas="<=1.5.3"
 numpy="<2"

diff --git a/python/sedona/stats/__init__.py b/python/sedona/stats/__init__.py
@@ -0,0 +1,16 @@
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing,
+#  software distributed under the License is distributed on an
+#  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+#  KIND, either express or implied.  See the License for the
+#  specific language governing permissions and limitations
+#  under the License.
diff --git a/python/sedona/stats/clustering/__init__.py b/python/sedona/stats/clustering/__init__.py
@@ -0,0 +1,21 @@
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing,
+#  software distributed under the License is distributed on an
+#  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+#  KIND, either express or implied.  See the License for the
+#  specific language governing permissions and limitations
+#  under the License.
+
+"""The clustering module contains spark based implementations of popular geospatial clustering algorithms.
+
+These implementations are designed to scale to larger datasets and support various geometric feature types.
+"""
diff --git a/python/sedona/stats/clustering/dbscan.py b/python/sedona/stats/clustering/dbscan.py
@@ -0,0 +1,68 @@
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing,
+#  software distributed under the License is distributed on an
+#  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+#  KIND, either express or implied.  See the License for the
+#  specific language governing permissions and limitations
+#  under the License.
+
+"""DBSCAN is a popular clustering algorithm for spatial data.
+
+It identifies groups of data where enough records are close enough to each other. This implementation leverages spark,
+sedona and graphframes to support large scale datasets and various, heterogeneous geometric feature types.
+"""
+from typing import Optional
+
+from pyspark.sql import DataFrame, SparkSession
+
+ID_COLUMN_NAME = "__id"
+DEFAULT_MAX_SAMPLE_SIZE = 1000000  # 1 million
+
+
+def dbscan(
+    dataframe: DataFrame,
+    epsilon: float,
+    min_pts: int,
+    geometry: Optional[str] = None,
+    include_outliers: bool = True,
+    use_spheroid=False,
+):
+    """Annotates a dataframe with a cluster label for each data record using the DBSCAN algorithm.
+
+    The dataframe should contain at least one GeometryType column. Rows must be unique. If one geometry column is
+    present it will be used automatically. If two are present, the one named 'geometry' will be used. If more than one
+    are present and neither is named 'geometry', the column name must be provided.
+
+    Args:
+        dataframe: spark dataframe containing the geometries
+        epsilon: minimum distance parameter of DBSCAN algorithm
+        min_pts: minimum number of points parameter of DBSCAN algorithm
+        geometry: name of the geometry column
+        include_outliers: whether to return outlier points. If True, outliers are returned with a cluster value of -1.
+            Default is False
+        use_spheroid: whether to use a cartesian or spheroidal distance calculation. Default is false
+
+    Returns:
+        A PySpark DataFrame containing the cluster label for each row
+    """
+    sedona = SparkSession.getActiveSession()
+
+    result_df = sedona._jvm.org.apache.sedona.stats.clustering.DBSCAN.dbscan(
+        dataframe._jdf,
+        float(epsilon),
+        min_pts,
+        geometry,
+        include_outliers,
+        use_spheroid,
+    )
+
+    return DataFrame(result_df, sedona)
diff --git a/python/tests/stats/__init__.py b/python/tests/stats/__init__.py