GeoSpark is a cluster computing system for processing large-scale spatial data. GeoSpark extends Apache Spark with a set of out-of-the-box Spatial Resilient Distributed Datasets (SRDDs) that efficiently load, process, and analyze large-scale spatial data across machines. GeoSpark provides APIs for Apache Spark programmer to easily develop their spatial analysis programs with Spatial Resilient Distributed Datasets (SRDDs) which have in house support for geometrical and distance operations.
GeoSpark is listed as Infrastructure Project in Apache Spark Third Party Project Wiki Page (Link)
GeoSpark artifacts are hosted in Maven Central. You can add a Maven dependency with the following coordinates:
groupId: org.datasyslab
artifactId: geospark
version: 0.3.2
The following artifact supports Apache Spark 1.X versions:
groupId: org.datasyslab
artifactId: geospark
version: 0.3.2-spark-1.x
Version information (Full List)
Version | Summary |
---|---|
0.3.2 | Functionality enhancement: 1. JTSplus Spatial Objects now carry the original input data. Each object stores "UserData" and provides getter and setter. 2. Add a new SpatialRDD constructor to transform a regular data RDD to a spatial partitioned SpatialRDD. |
0.3.1 | Bug fix: Support Apache Spark 2.X version, fix a bug which results in inaccurate results when doing join query, add more unit test cases |
0.3 | Major updates: Significantly shorten query time on spatial join for skewed data; Support load balanced spatial partitioning methods (also serve as the global index); Optimize code for iterative spatial data mining |
Master branch | even with 0.3.2 |
Spark 1.X branch | even with 0.3.2 but only supports Apache Spark 1.X |
- Apache Spark 2.X releases (Apache Spark 1.X releases support available in GeoSpark for Spark 1.X branch)
- JDK 1.7
- You might need to modify the dependencies in "POM.xml" and make it consistent with your environment.
Note: GeoSpark Master branch supports Apache Spark 2.X releases and GeoSpark for Spark 1.X branch supports Apache Spark 1.X releases. Please refer to the proper branch you need.
- Have your Spark cluster ready.
- Download pre-compiled GeoSpark jar under "Release" tag.
- Run Spark shell with GeoSpark as a dependency.
./bin/spark-shell --jars GeoSpark_COMPILED.jar
- You can now call GeoSpark APIs directly in your Spark shell!
- Create your own Apache Spark project in Scala or Java
- Add GeoSpark Maven coordinates into your project dependencies.
- You can now use GeoSpark APIs in your Spark program!
- Use spark-submit to submit your compiled self-contained Spark program.
Please refer GeoSpark Scala and Java API Usage
GeoSpark extends RDDs to form Spatial RDDs (SRDDs) and efficiently partitions SRDD data elements across machines and introduces novel parallelized spatial (geometric operations that follows the Open Geosptial Consortium (OGC) standard) transformations and actions (for SRDD) that provide a more intuitive interface for users to write spatial data analytics programs. Moreover, GeoSpark extends the SRDD layer to execute spatial queries (e.g., Range query, KNN query, and Join query) on large-scale spatial datasets. After geometrical objects are retrieved in the Spatial RDD layer, users can invoke spatial query processing operations provided in the Spatial Query Processing Layer of GeoSpark which runs over the in-memory cluster, decides how spatial object-relational tuples could be stored, indexed, and accessed using SRDDs, and returns the spatial query results required by user.
(column, column,..., Longitude, Latitude, column, column,...)
(column, column,...,Longitude 1, Longitude 2, Latitude 1, Latitude 2,column, column,...)
Two pairs of longitude and latitude present the vertexes lie on the diagonal of one rectangle.
(column, column,...,Longitude 1, Latitude 1, Longitude 2, Latitude 2, ...)
Each tuple contains unlimited points.
GeoSpark supports Comma-Separated Values ("csv"), Tab-separated values ("tsv"), Well-Known Text ("wkt"), and GeoJSON ("geojson") as the input formats. Users only need to specify input format as Splitter and the start column (if necessary) of spatial info in one tuple as Offset when call Constructors.
GeoSpark supports equal size ("equalgrid"), R-Tree ("rtree") and Voronoi diagram ("voronoi") spatial partitioning methods. Spatial partitioning is to repartition RDD according to objects' spatial locations. Spatial join on spatial paritioned RDD will be very fast.
GeoSpark supports two Spatial Indexes, Quad-Tree and R-Tree.
GeoSpark currently provides native support for Inside, Overlap, DatasetBoundary, Minimum Bounding Rectangle and Polygon Union in SRDDS following Open Geospatial Consortium (OGC) standard.
GeoSpark so far provides spatial range query, join query and KNN query in SRDDs.
Jia Yu, Jinxuan Wu, Mohamed Sarwat. "A Demonstration of GeoSpark: A Cluster Computing Framework for Processing Big Spatial Data". (demo paper) In Proceeding of IEEE International Conference on Data Engineering ICDE 2016, Helsinki, FI, May 2016
Jia Yu, Jinxuan Wu, Mohamed Sarwat. "GeoSpark: A Cluster Computing Framework for Processing Large-Scale Spatial Data". (short paper) In Proceeding of the ACM International Conference on Advances in Geographic Information Systems ACM SIGSPATIAL GIS 2015, Seattle, WA, USA November 2015
GeoSaprk makes use of JTS Plus (An extended JTS Topology Suite Version 1.14) for some geometrical computations.
Please refer to JTS Topology Suite website and JTS Plus for more details.
We appreciate the help and suggestions from the following GeoSpark users (List is increasing..):
- @gaufung
- @lrojas94
- @mdespriee
- @sabman
- @samchorlton
- @Tsarazin
- @TBuc
- ...
-
Jia Yu (Email: [email protected])
-
Jinxuan Wu (Email: [email protected])
-
Mohamed Sarwat (Email: [email protected])
Please visit GeoSpark project wesbite for latest news and releases.
GeoSpark is one of the projects under DataSys Lab at Arizona State University. The mission of DataSys Lab is designing and developing experimental data management systems (e.g., database systems).