[Part 1] Add geo support #5654
plz ignore this, seems some search/replace error
kishoreg left a comment
Nicely done! Looks much simpler than I thought. Will the BYTES column for StPoint be dictionary encoded?
what does this function do? please add java docs
Added. This is an abstract class for implementing the geo constructor functions like StGeomFromText and StGeogFromText
hmm, then I cannot define scalar function in this package for inbuilt transformation functions.
Add sample queries to the description and also update the java docs.
@yupeng9, I would like to review this PR. Please give a day or two to go over the code.
pinot-core/pom.xml
Is Eclipse license ok to add? So far we have taken Apache/MIT/Gnu.
My understanding is that it is okay to include.
Per https://www.apache.org/legal/resolved.html#category-a, Eclipse Distribution License 1.0 can be included. And JTS is dual-licensed under Eclipse Public License 2.0 and Eclipse Distribution License 1.0 (https://github.com/locationtech/jts#license)
Tab space indicates not following Pinot code-styling.
Initialize list with size if known.
Please use Pinot code style (names of member variables start with _ to avoid qualifying with this.).
Consider using a Static Map, if this list has a chance to grow.
This list is unlikely to grow, given that OGC geo is a well-defined standard per https://www.ogc.org/standards/sfa
Utils.rethrow will preserve the original exception.
Good to see this util
Would be good to add the benchmark results in the PR description.
Added https://gist.github.com/yupeng9/8e2b081ffb372593492ebb6a41da97fd to the description
Unsure if LOGGER should be used here?
Removed. It was for debugging purposes.
This seems unrelated to this PR? Would be good to call it out in the description, along with the motivation for the change.
You are right. Reverted this since it was for local debugging purposes.
@kishoreg @mayankshriv thanks for the review. Will address the comments.
yupeng9 left a comment
Comments addressed
Please follow the Pinot coding convention of using underscore as the prefix for member variables. Same for other classes
As an example of Pinot coding convention:
- this.multitype = multitype;
+ _multitype = multitype;
Thanks for taking a pass, updated the style
@Jackie-Jiang any reason we set the checkstyle severity to warning rather than error? I saw we have such a rule for member variables, but the maven checkstyle does not fail the build.
Good suggestion. We can switch it to error once we have fixed all the existing code that violates the code style. Opened issue #5675 to track this.
Jackie-Jiang left a comment
High level question, why are we using JTS library to handle both geometry as well as geography? Shouldn't we use ESRI for geography?
Good call.
The tradeoff of taking this approach is that JTS is a library for Euclidean planar linear geometry, so all the geography-related operations have to be implemented using JTS's primitives. That's why there is some lengthy logic in the geography measurement functions. Those implementations are similar to what Presto is doing. Will update the design doc to reflect this change.
For ser-de, we are using a customized serializer, so I don't think there will be a performance difference between these 2 libraries?
Though we use a customized serializer, there could be some difference due to the internal representation of the fields and their accessor implementations. The PR I linked above shows about a 20% difference. Another notable reason is that JTS conforms to the ISO standards better. I believe this is the primary reason that the Presto community decided to move from ESRI to JTS. I suggest we take the lessons learned from them. Lastly, many users query Pinot via the Presto connector, so it's also a desirable property that Pinot geo functions return the same or similar results as Presto's for better unification.
There are not too many geographical functions, so I believe their implementation is still manageable.
- add geo-spatial data model
- add serde
- add benchmark
- add geospatial functions
@Jackie-Jiang I changed the relationship functions to work only for the Geometry objects, to align with Presto's behavior. PTAL
  // Special type for annotation based scalar functions
- SCALAR("scalar");
+ SCALAR("scalar"),
+ // geo constructors
(nit) Add an empty line in front, and capitalize the comment
pinot-common/src/main/java/org/apache/pinot/common/function/TransformFunctionType.java
  // geo measurements
  ST_AREA("ST_Area"),
  ST_DISTANCE("ST_Distance"),
  ST_GEOMETRY_TYPE("ST_GEOMETRY_TYPE"),
- ST_GEOMETRY_TYPE("ST_GEOMETRY_TYPE"),
+ ST_GEOMETRY_TYPE("ST_GeometryType"),
Is there a standard function to return the SRID of the geometry? (Identify whether it is geometry or geography)
Yes, it returns the type of the geometry as a string, e.g. 'ST_LineString', 'ST_Polygon', 'ST_MultiPolygon', etc.
   * <ul>
-  *   <li>For dimension, time, date time fields, support {@link DataType}: INT, LONG, FLOAT, DOUBLE, STRING</li>
+  *   <li>For dimension, time, date time fields, support {@link DataType}: INT, LONG, FLOAT, DOUBLE, STRING, BYTES</li>
   *   <li>For non-derived metric fields, support {@link DataType}: INT, LONG, FLOAT, DOUBLE</li>
Thanks for updating the javadoc. Add BYTES here as well
  <groupId>org.openjdk.jmh</groupId>
  <artifactId>jmh-core</artifactId>
- <version>1.21</version>
+ <version>${jmh.version}</version>
Move these to the root pom and specify the version there
      return null;
    }

    validateGeographyType("ST_Distance", leftGeometry, EnumSet.of(GeometryType.POINT));
Can be simplified to geometry instanceof Point
Trying to reuse the same error message template across several functions.
There will be quite a big performance difference, especially for a per-value check.
    return _results;
  }

  public static void checkLatitude(double latitude) {
Make all helper methods private
    for (int i = 0; i < projectionBlock.getNumDocs(); i++) {
      Geometry firstGeometry = GeometrySerializer.deserialize(firstValues[i]);
      Geometry secondGeometry = GeometrySerializer.deserialize(secondValues[i]);
      if (GeometryUtils.isGeography(firstGeometry) || GeometryUtils.isGeography(secondGeometry)) {
Equals should work on geography as well?
/**
 * Constructor function for polygon object from text.
 */
public class StPolygonFunction extends ConstructFromTextFunction {
This doesn't seem right that St_Polygon is the same as ST_GeomFromText
Yup, added a constraint that checks for the polygon type.
  protected static final TransformResultMetadata BYTES_SV_NO_DICTIONARY_METADATA =
      new TransformResultMetadata(DataType.BYTES, true, false);

  private boolean[] _booleanValuesSV;
yupeng9 left a comment
@Jackie-Jiang Thanks for the detailed review. Comments addressed
pinot-core/pom.xml
    <artifactId>lucene-analyzers-common</artifactId>
    <version>${lucene.version}</version>
  </dependency>
  <dependency>
Moved it to the pinot-perf project.
  POINT(false, "ST_Point"),
  MULTI_POINT(true, "ST_MultiPoint"),
  LINE_STRING(false, "ST_LineString"),
LINEAR_RING is a subtype of LINE_STRING
    _name = name;
  }

  public boolean isMultitype() {
Not in this PR. It's useful in functions like https://postgis.net/docs/ST_GeometryN.html
/**
 * Provides methods to efficiently serialize and deserialize geometry types.
 */
public class GeometrySerde extends Serializer {
They are not exactly the same. In particular, the differences are:
- Presto uses the schema to indicate geometry vs geography info, while we encode this in the type byte.
- Presto serializes additional information such as the envelope to be compatible with ESRI serialization, but the serde here does not, which is simpler and faster.
Added this to the comments.
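For readers unfamiliar with the idea of carrying the geometry/geography flag in the type byte, here is a minimal sketch. The bit layout (high bit as the geography flag, low seven bits as the type id) and the class name TypeByteSketch are assumptions made for illustration; the thread does not state the actual layout used by the serde in this PR.

```java
/**
 * Hypothetical illustration only: packs a geometry type id and a
 * geography flag into a single byte. The layout is an assumption,
 * not the layout used by the PR.
 */
final class TypeByteSketch {
  // Assumed layout: high bit marks geography, low 7 bits hold the type id.
  private static final int GEOGRAPHY_FLAG = 0x80;
  private static final int TYPE_ID_MASK = 0x7F;

  private TypeByteSketch() {
  }

  static byte encode(int typeId, boolean isGeography) {
    if (typeId < 0 || typeId > TYPE_ID_MASK) {
      throw new IllegalArgumentException("Type id out of range: " + typeId);
    }
    return (byte) (typeId | (isGeography ? GEOGRAPHY_FLAG : 0));
  }

  static int typeId(byte typeByte) {
    return typeByte & TYPE_ID_MASK;
  }

  static boolean isGeography(byte typeByte) {
    return (typeByte & GEOGRAPHY_FLAG) != 0;
  }
}
```

With a layout like this, a deserializer can tell geometry from geography by reading the first byte, without consulting any schema.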
   * This assumes a spherical Earth, and uses the Vincenty formula.
   * (https://en.wikipedia.org/wiki/Great-circle_distance)
   */
  public static double greatCircleDistance(double latitude1, double longitude1, double latitude2, double longitude2) {
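As a rough illustration of what a method with this signature might compute, the following sketch applies the spherical special case of the Vincenty formula described on the linked Wikipedia page. The class name GreatCircleSketch, the Earth radius value, and the choice of kilometers as the unit are assumptions; the actual body and units in the PR may differ.

```java
/**
 * Hypothetical sketch of a great-circle distance on a spherical Earth,
 * using the spherical special case of the Vincenty formula.
 */
final class GreatCircleSketch {
  // Assumed mean Earth radius in kilometers.
  static final double EARTH_RADIUS_KM = 6371.01;

  private GreatCircleSketch() {
  }

  static double greatCircleDistance(double latitude1, double longitude1, double latitude2, double longitude2) {
    double phi1 = Math.toRadians(latitude1);
    double phi2 = Math.toRadians(latitude2);
    double deltaLambda = Math.toRadians(longitude2 - longitude1);

    double sinPhi1 = Math.sin(phi1);
    double cosPhi1 = Math.cos(phi1);
    double sinPhi2 = Math.sin(phi2);
    double cosPhi2 = Math.cos(phi2);
    double sinDelta = Math.sin(deltaLambda);
    double cosDelta = Math.cos(deltaLambda);

    // Numerator and denominator of the central angle's tangent.
    double t1 = cosPhi2 * sinDelta;
    double t2 = cosPhi1 * sinPhi2 - sinPhi1 * cosPhi2 * cosDelta;
    double numerator = Math.sqrt(t1 * t1 + t2 * t2);
    double denominator = sinPhi1 * sinPhi2 + cosPhi1 * cosPhi2 * cosDelta;

    // atan2 keeps the result well-conditioned for both very small and antipodal angles.
    return Math.atan2(numerator, denominator) * EARTH_RADIUS_KM;
  }
}
```

The atan2 form is generally preferred over the plain arccosine form because it avoids precision loss when the two points are very close together.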
Jackie-Jiang left a comment
LGTM with some comments.
Let me know when you have addressed the comments and are ready to merge.
 */
public enum GeometryType {

  POINT(false, 0,"ST_Point"),

  MULTI_POLYGON(true, 5,"ST_MultiPolygon"),
  GEOMETRY_COLLECTION(true, 6,"ST_GeomCollection");

  private final boolean _multitype;
(nit) _multiType? (IDE identify multitype as typo)
/**
 * The geometry type used in serialization
 */
public enum GeometrySerializationType {

   * @return the serialization type
   */
  public static GeometryType fromID(int id) {
    switch (id) {
Keep a static GeometryType array
private static final GeometryType[] ID_TO_TYPE_MAP = new GeometryType[] {POINT, MULTI_POINT, ...};
Then you can avoid the switch branching for better performance
return ID_TO_TYPE_MAP[id];
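A self-contained sketch of this suggestion might look like the following. The enum name GeometryTypeSketch is hypothetical, and the ids 1 through 4 are assumed (the diff in this thread only shows 0, 5 and 6); the bounds check is an addition for safety and not part of the reviewer's suggestion.

```java
/**
 * Hypothetical stand-in for the PR's GeometryType enum, showing an
 * array-based id lookup instead of a switch statement.
 */
enum GeometryTypeSketch {
  POINT(0),
  MULTI_POINT(1),         // id assumed
  LINE_STRING(2),         // id assumed
  MULTI_LINE_STRING(3),   // id assumed
  POLYGON(4),             // id assumed
  MULTI_POLYGON(5),
  GEOMETRY_COLLECTION(6);

  // Indexed by id; filled once at class-load time.
  private static final GeometryTypeSketch[] ID_TO_TYPE_MAP = new GeometryTypeSketch[values().length];

  static {
    for (GeometryTypeSketch type : values()) {
      ID_TO_TYPE_MAP[type._id] = type;
    }
  }

  private final int _id;

  GeometryTypeSketch(int id) {
    _id = id;
  }

  int id() {
    return _id;
  }

  static GeometryTypeSketch fromId(int id) {
    if (id < 0 || id >= ID_TO_TYPE_MAP.length) {
      throw new IllegalArgumentException("Invalid type id: " + id);
    }
    return ID_TO_TYPE_MAP[id];
  }
}
```

Filling the array from each constant's own id, rather than listing constants by hand, keeps the lookup correct even if the declaration order changes, as long as the ids stay dense.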
 * - The envelope info is not serialized
 */
public class GeometrySerde {
  private static final Logger LOGGER = LoggerFactory.getLogger(GeometrySerde.class);
(nit) Remove the unused LOGGER (we don't want to log within serde as it is per-value based and can easily flood the log)
  @ScalarFunction
  public static byte[] stPoint(double longitude, double latitude) {
    return GeometrySerializer
        .serialize(GeometryUtils.GEOMETRY_FACTORY.createPoint(new Coordinate(longitude, latitude)));
(Major) Should this be GEOGRAPHY_FACTORY for longitude and latitude?
Good point. Changed the arguments to x, y.
    Geometry geometry;
    for (int i = 0; i < projectionBlock.getNumDocs(); i++) {
      geometry = GeometrySerializer.deserialize(values[i]);
Not necessary. I did some benchmark on this and there is performance difference
        _results[i] = sphericalDistance(firstGeometry, secondGeometry);
      } else {
        _results[i] =
            firstGeometry.isEmpty() || secondGeometry.isEmpty() ? null : firstGeometry.distance(secondGeometry);
I think you can return Double.NaN here to indicate empty geometry
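To make the suggestion concrete: a primitive double[] result buffer cannot hold null, so NaN can act as the sentinel for "no distance". The sketch below is a simplified assumption, not the PR's code. It models a point as a double array {x, y} and an empty geometry as a zero-length array, and the class name EmptyDistanceSketch is hypothetical.

```java
/**
 * Hypothetical sketch: planar distance that reports Double.NaN when
 * either input is empty, so the result fits in a primitive double.
 */
final class EmptyDistanceSketch {
  private EmptyDistanceSketch() {
  }

  static double distance(double[] first, double[] second) {
    // A zero-length array stands in for an empty geometry (assumption).
    if (first.length == 0 || second.length == 0) {
      return Double.NaN;
    }
    return Math.hypot(first[0] - second[0], first[1] - second[1]);
  }
}
```

Callers need Double.isNaN(value) to detect the sentinel, since NaN never compares equal to itself.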
        Utils.rethrowException(
            new RuntimeException(String.format("Failed to parse geometry from string: %s", argumentValues[i])));
- Utils.rethrowException(
-     new RuntimeException(String.format("Failed to parse geometry from string: %s", argumentValues[i])));
+ throw new RuntimeException(String.format("Failed to parse geometry from string: %s", argumentValues[i]));
@Jackie-Jiang thanks for taking another pass. Comments addressed, and feel free to merge
Description
First part of #5280. Design doc
This PR adds the following:
add geo-spatial data model
The data model includes both geometry and geography, which are differentiated by a spatial reference identifier (SRID). Notably, it uses SRID=4326 as the coordinate system for lat/lng per https://epsg.io/4326.
add serde
Added the serialization/deserialization from geo-spatial value to bytes with the Kryo library. Also added a benchmark for performance evaluation.
Benchmark result: https://gist.github.com/yupeng9/8e2b081ffb372593492ebb6a41da97fd
add geospatial functions
- geo constructors
- geo measurements
- geo outputs
- geo relationship
Updates to MeetupRsvp quickstart example
Added a new location field from the longitude and latitude of the event, using an inbuilt stPoint transform function.
Upgrade Notes
Does this PR prevent a zero down-time upgrade? (Assume upgrade order: Controller, Broker, Server, Minion)
Does this PR fix a zero-downtime upgrade introduced earlier?
Does this PR otherwise need attention when creating release notes? Things to consider:
release-notes and complete the section on Release Notes)
Release Notes
Yes, added a new experimental feature
Documentation
If you have introduced a new feature or configuration, please add it to the documentation as well.
See https://docs.pinot.apache.org/developers/developers-and-contributors/update-document