Mixed concerns: Encoding + Geometry Type #207
Comments
There was quite some discussion about this on the PR #189 (see for example the thread above and below this comment: #189 (comment)). The PR initially started with an But in the end we moved away from this tight coupling with geoarrow for the naming in the spec here, although we are still using a subset of the geoarrow specification. The format spec itself is still very clear about this being based on GeoArrow: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#native-encodings-based-on-geoarrow (and yes, that means you might need to read the spec when encountering a file with such a column, but IMO that would still be the case anyway even if the encoding said "geoarrow"). It's true that the
Although I see we only link to the GeoArrow document with the names, not the document with the memory layouts (https://geoarrow.org/format.html#native-encoding). That's something we should improve.
It could be useful to know in some circumstances that the encoding is multi polygon but the column includes both polygons and multi polygons.
I think that it is impossible not to mix some concerns with the single geometry-type encodings...the solution we settled on does have some overlap between the encoding name and the geometry type, but avoids mixing some other concerns and concepts to more accurately convey the relationships among the single-geometry layouts, Parquet and Arrow, for example.
I was under the impression that any features in a
Technically it will tell you if you have |
In the future, GeoParquet should really support mixed geometry types in the same column. The Sedona community will propose some solutions soon. For example, in the latest release of Overture Maps data (https://docs.overturemaps.org/schema/),
I think it may be too late, as we've already got some implementations released with this, but I was also wondering why we didn't put some 'prefix' on the encodings. Like if not Someone recently pointed out to me that there's potentially more efficient geometry encoding techniques, like 'zigzag encoding coordinate deltas in varints', and I had been thinking that geoparquet spec is 'open' to having a newer encoding, but the current arrow ones seem to take up the 'default namespace'. Like we could add some cool new encoding, and they'll likely have the same geometry types. I don't think it's a huge deal to have like |
That's the risk of implementing an unreleased spec, I'd say... It should still be possible to make a change.
While we didn't add prefixes now, that doesn't mean we can't add suffixes later if we want to add other encodings, like
I wouldn't say the "right" way, but I think it is the simplest way. I personally think it is fine to use the implicit "default" namespace for that. While I am personally happy with what we have right now, I agree we should still allow ourselves to change things (although if we think we want to do that, I wouldn't wait too long, so GDAL/geopandas can be updated to follow this as fast as possible). (That's generally the trade-off we have between the desire to have some implementations to test things before calling the 1.1 spec final, and then actually thinking about changing things before doing that...)
FWIW I also want to clarify that I find using "columnar" in this context ambiguous / confusing. When using "columnar" in the context of Parquet as a "columnar file format" or Arrow's "Columnar (in-memory) Format" specification, everything we discuss here is columnar. Also with the WKB encoding, all those WKB values in the geometry column are stored in a "columnar" fashion (all the binary blobs of the WKB values are stored together in one buffer).
Somewhat tangential - I've been pretty eager to test out new encodings like the varint/delta/zigzag Chris mentioned (used in OSM PBF) and also Parquet's native bit-packed integer encodings as well. Although these would also require x,y columns to be integers, not doubles - which I thought might be a non-starter.
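For readers unfamiliar with the varint/delta/zigzag technique mentioned here, a minimal sketch in Python follows. It is not part of GeoParquet or GeoArrow; the function names and the `scale` factor (1e7, i.e. seven decimal places) are assumptions chosen for illustration. It shows why the coordinates need to be integers first: the deltas are only small and exactly reversible once the doubles have been quantized.

```python
# Illustrative sketch only: hypothetical helpers, not an API of any library.

def zigzag(n: int) -> int:
    """Map signed ints to unsigned so small magnitudes stay small:
    0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def unzigzag(z: int) -> int:
    """Inverse of zigzag()."""
    return (z >> 1) if (z & 1) == 0 else -((z + 1) >> 1)


def write_varint(z: int, out: bytearray) -> None:
    """Append z as a base-128 varint (7 payload bits per byte, low bits first)."""
    while z >= 0x80:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)


def encode_coords(values, scale=10_000_000) -> bytes:
    """Quantize doubles to integers (scale is an assumed precision),
    then store zigzag-encoded deltas as varints."""
    out = bytearray()
    prev = 0
    for v in values:
        q = round(v * scale)
        write_varint(zigzag(q - prev), out)
        prev = q
    return bytes(out)


def decode_coords(data: bytes, scale=10_000_000) -> list:
    """Reverse encode_coords(): read varints, undo zigzag, accumulate deltas."""
    values = []
    prev = 0
    acc = 0
    shift = 0
    for byte in data:
        acc |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += unzigzag(acc)
            values.append(prev / scale)
            acc = 0
            shift = 0
    return values


xs = [13.4050, 13.4051, 13.4049]
encoded = encode_coords(xs)
decoded = decode_coords(encoded)
```

Nearby coordinates produce small deltas, so most values after the first fit in one or two bytes instead of eight. The cost is the one raised in the replies: a reader needs encoding-specific logic to get the coordinates back.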
I am assuming that's from/inspired by twkb? https://github.com/TWKB/Specification/blob/master/twkb.md#zigzag-encode . It would be derailing this thread to talk about that, but I think that would undo the main benefit of this encoding which is that no Geo-specific parsing is required.
This particular change was an open PR for months that had some pretty significant discussion around how to name the encodings. It was not easy to come up with a consensus...I don't think the result is perfect, but I also don't think it has to perfectly separate concerns (just be specific enough that implementations are able to work with this, which it seems that they are). If/when a better encoding comes along, the name can be updated.
I agree and don't want to derail the conversation here. However, if we could enable x and y to be fixed-precision decimals, we could utilize Parquet's native integer encodings (e.g. delta binary packed) and not need geo-specific parsing to read the coordinates properly... but let's leave that for another discussion :)
Yeah, I felt bad that I didn't read it closely enough / didn't think about there being new potential encodings.
Ah, that's a good point - I was sorta thinking we'd be 'stuck' with these names. But I guess the client logic would just have to be that at version 1.4 or whatever it would need to check the version number.
Cool, I like this idea.
Yeah, that sounds good. I'll look into tweaks that may make it clearer. It's good in the full description (one of the geometry types, or WKB), but in the single line it just looks like one of many options.
Sounds great. Thanks to everyone for chiming in, and I agree. I definitely didn't want to try to be pushing for a change at this late time. I mostly just wanted to be sure we had a path to other encodings, and weren't backing ourselves into a corner with declaring this one. But it sounds like we have lots of options. And it's good to have the discussion, to point at in the future if/when people want to propose other encodings.
Yeah, I was pretty sure that was a poor name - I just didn't want to spend a lot of time crafting an 'ideal' name when all I was trying to do was to make the point. If we had overwhelming enthusiasm to change (which to be clear I personally don't) then we could figure out the 'right' name.
Closing this issue, as it looks like it was well discussed. There were some interesting things in this discussion, like #207 (comment), but we should just have cleaner issues / PRs.
I stumbled across the new geometry types mentioned in the encoding. It seems the encoding is GeoArrow, but why are the concerns mixed here?
Why are geometry types in the encoding, and what happens with the geometry type then? Are they always the same for GeoArrow?
From an external perspective I'd have expected something like this:
encoding = geoarrow
geometry_types = [Point]
Also, the schema allows for example:
encoding = point
geometry_types = [Polygon]
The empty array for geometry_types also doesn't make sense for GeoArrow encoding.
In case of geoarrow the geometry_types can have a maximum of one array item, but that could be encoded...
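To make the mismatch concrete, here is a rough Python sketch of a consistency check a reader could run on a column's metadata. The helper `check_column` is hypothetical and not part of any GeoParquet library. The rule it enforces (a single-geometry encoding must be paired with exactly the matching entry in geometry_types) is the expectation described in this issue, taken as an assumption rather than as something the spec requires.

```python
# Hypothetical helper for illustration; not an API of any GeoParquet library.

# Single-geometry encoding names and the geometry type each one implies.
SINGLE_TYPE_ENCODINGS = {
    "point": "Point",
    "linestring": "LineString",
    "polygon": "Polygon",
    "multipoint": "MultiPoint",
    "multilinestring": "MultiLineString",
    "multipolygon": "MultiPolygon",
}


def check_column(meta: dict) -> list[str]:
    """Return a list of problems found in one column's metadata dict.

    Assumption: for a single-geometry encoding, geometry_types should
    contain exactly the matching type (an optional " Z" suffix is ignored).
    """
    encoding = meta.get("encoding")
    types = meta.get("geometry_types", [])

    if encoding == "WKB":
        return []  # WKB can hold any mix of types, so nothing to cross-check

    expected = SINGLE_TYPE_ENCODINGS.get(encoding)
    if expected is None:
        return [f"unknown encoding: {encoding!r}"]

    base_types = {t.split(" ")[0] for t in types}  # "Point Z" -> "Point"
    if not base_types:
        return [f"encoding {encoding!r} implies {expected!r}, but geometry_types is empty"]
    if base_types != {expected}:
        return [f"encoding {encoding!r} implies {expected!r}, but geometry_types is {sorted(base_types)}"]
    return []


# The combination from the example above is flagged under this assumption.
problems = check_column({"encoding": "point", "geometry_types": ["Polygon"]})
```

A looser variant could allow, say, both Polygon and MultiPolygon under the multipolygon encoding; whether that should be permitted is one of the open questions in this thread.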
(Without reading the spec, I also can't guess from the value that it's a GeoArrow encoding. Might make things just a little simpler to grasp by default.)
Wouldn't this also be more future-proof with regards to apache/parquet-format#44 ?