-
Notifications
You must be signed in to change notification settings - Fork 3k
Spec: Support geo type #10981
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spec: Support geo type #10981
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -219,6 +219,8 @@ Supported primitive types are defined in the table below. Primitive types added | |
| | | **`uuid`** | Universally unique identifiers | Should use 16-byte fixed | | ||
| | | **`fixed(L)`** | Fixed-length byte array of length L | | | ||
| | | **`binary`** | Arbitrary-length byte array | | | ||
| | [v3](#version-3) | **`geometry(C)`** | Geospatial features from [OGC – Simple feature access][1001]. Edge-interpolation is always linear/planar. See [Appendix G](#appendix-g-geospatial-notes). Parameterized by CRS C. If not specified, C is `OGC:CRS84`. | | | ||
| | [v3](#version-3) | **`geography(C, A)`** | Geospatial features from [OGC – Simple feature access][1001]. See [Appendix G](#appendix-g-geospatial-notes). Parameterized by CRS C and edge-interpolation algoritm A. If not specified, C is `OGC:CRS84` and A is `spherical`. | | ||
|
|
||
| Notes: | ||
|
|
||
|
|
@@ -228,6 +230,30 @@ Notes: | |
|
|
||
| For details on how to serialize a schema to JSON, see Appendix C. | ||
|
|
||
| [1001]: <https://portal.ogc.org/files/?artifact_id=25355> "OGC Simple feature access" | ||
|
|
||
| ##### CRS | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What does Can values of CRS be of format format
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, |
||
|
|
||
| For `geometry` and `geography` types, the parameter C refers to the CRS (coordinate reference system), a mapping of how coordinates refer to locations on Earth. | ||
|
|
||
| The default CRS value `OGC:CRS84` means that the objects must be stored in longitude, latitude based on the WGS84 datum. | ||
|
|
||
| Custom CRS values can be specified by a string of the format `type:identifier`, where `type` is one of the following values: | ||
|
|
||
| * `srid`: [Spatial reference identifier](https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier), `identifier` is the SRID itself. | ||
|
Comment on lines
+239
to
+243
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There is some discrepency in the description here. CRS values are mentioned to be specified in two format:
The default value of
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @redblackcoder It is also equivalent to srid:4326 but flip the axis order to lon/lat There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Isn't SRID database specific? From the open specification, it is like following:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Not exactly. SRID usually means "the primary key in the context of those data". In a spatial database, this is the primary key of the For mapping SRID to EPSG code in a spatial database, we need to look at the other columns. If and only if the value of Note that even when There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the explanation. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jiayuasu Why is there a need to flip the axis order? According to projjson spec you linked:
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @stefankandic In Iceberg Geo spec (as well as Parquet Geo spec), we explicitly enforce the x/y axis order to be With that in mind, OGC:CRS84 fits better as a default value. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think that it may be safer to only said that the default is OGC:CRS84, without saying that it is equivalent to EPSG:4326 with axis order overridden. Overriding the axis order is deprecated by OGC directive 14 because it is ambiguous. It works for the most common CRS, but is an impediment to interoperability with less common CRS such as South-orientated or polar projections. The practice of overriding axis order in a specification was established in old standards such as GeoTIFF and Simple Features. Then, it became a self-sustaining problem, with some (fortunately not all) newer standards citing Simple Features as a precedent, despite OGC directive 14 explicitly asking us to stop doing so. Note that the problem is not the desire to have (x, y) axis order. The problem is in the way to achieve that goal, when that order is enforced by the specification. OGC directive 14 proposes a choice between 3 other ways to obtain the same result without the ambiguity of the "specification overrides authoritative definition" approach. In the case of OGC:CRS84, that definition is non-ambiguous regarding axis order, so it is okay as a default value. OGC:CRS84 is ambiguous regarding the datum however, which means that (longitude, latitude) coordinates in OGC:CRS84 have an uncertainty of about 2 meters. I think this uncertainty would have been worth a note, and I think that the sentence "EPSG:4326 and OGC:CRS84 are equivalent with respect to this specification" should be omitted in order to be more quiet about this deprecated practice. |
||
| * `projjson`: [PROJJSON](https://proj.org/en/stable/specifications/projjson.html), `identifier` is the name of a table property where the projjson string is stored. | ||
|
|
||
| For `geography` types, the custom CRS must be geographic, with longitudes bound by [-180, 180] and latitudes bound by [-90, 90]. | ||
|
|
||
| ##### Edge-Interpolation Algorithm | ||
|
|
||
| For `geography` types, an additional parameter A specifies an algorithm for interpolating edges, and is one of the following values: | ||
|
|
||
| * `spherical`: edges are interpolated as geodesics on a sphere. | ||
| * `vincenty`: [https://en.wikipedia.org/wiki/Vincenty%27s_formulae](https://en.wikipedia.org/wiki/Vincenty%27s_formulae) | ||
| * `thomas`: Thomas, Paul D. Spheroidal geodesics, reference systems, & local geometry. US Naval Oceanographic Office, 1970. | ||
| * `andoyer`: Thomas, Paul D. Mathematical models for navigation systems. US Naval Oceanographic Office, 1965. | ||
| * `karney`: [Karney, Charles FF. "Algorithms for geodesics." Journal of Geodesy 87 (2013): 43-55](https://link.springer.com/content/pdf/10.1007/s00190-012-0578-z.pdf), and [GeographicLib](https://geographiclib.sourceforge.io/) | ||
|
|
||
| #### Default values | ||
|
|
||
|
|
@@ -468,7 +494,7 @@ Partition field IDs must be reused if an existing partition spec contains an equ | |
|
|
||
| | Transform name | Description | Source types | Result type | | ||
| |-------------------|--------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-------------| | ||
| | **`identity`** | Source value, unmodified | Any except for `variant` | Source type | | ||
| | **`identity`** | Source value, unmodified | Any except for `geometry`, `geography`, and `variant` | Source type | | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks! |
||
| | **`bucket[N]`** | Hash of value, mod `N` (see below) | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int` | | ||
| | **`truncate[W]`** | Value truncated to width `W` (see below) | `int`, `long`, `decimal`, `string`, `binary` | Source type | | ||
| | **`year`** | Extract a date or timestamp year, as years from 1970 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` | | ||
|
|
@@ -623,6 +649,8 @@ Notes: | |
| 5. The `content_offset` and `content_size_in_bytes` fields are used to reference a specific blob for direct access to a deletion vector. For deletion vectors, these values are required and must exactly match the `offset` and `length` stored in the Puffin footer for the deletion vector blob. | ||
| 6. The following field ids are reserved on `data_file`: 141. | ||
|
|
||
| For `geometry` and `geography` types, `lower_bounds` and `upper_bounds` are both points of the following coordinates X, Y, Z, and M (see [Appendix G](#appendix-g-geospatial-notes)) which are the lower / upper bound of all objects in the file. For the X values only, xmin may be greater than xmax, in which case an object in this bounding box may match if it contains an X such that `x >= xmin` OR`x <= xmax`. In geographic terminology, the concepts of `xmin`, `xmax`, `ymin`, and `ymax` are also known as `westernmost`, `easternmost`, `southernmost` and `northernmost`, respectively. For `geography` types, these points are further restricted to the canonical ranges of [-180 180] for X and [-90 90] for Y. | ||
|
|
||
|
Comment on lines
+652
to
+653
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do |
||
| The `partition` struct stores the tuple of partition values for each file. Its type is derived from the partition fields of the partition spec used to write the manifest file. In v2, the partition struct's field ids must match the ids from the partition spec. | ||
|
|
||
| The column metrics maps are used when filtering to select both data and delete files. For delete files, the metrics must store bounds and counts for all deleted rows, or must be omitted. Storing metrics for deleted rows ensures that the values can be used during job planning to find delete files that must be merged during a scan. | ||
|
|
@@ -1179,6 +1207,8 @@ Maps with non-string keys must use an array representation with the `map` logica | |
| |**`list`**|`array`|| | ||
| |**`map`**|`array` of key-value records, or `map` when keys are strings (optional).|Array storage must use logical type name `map` and must store elements that are 2-field records. The first field is a non-null key and the second field is the value.| | ||
| |**`variant`**|`record` with `metadata` and `value` fields. `metadata` and `value` must not be assigned field IDs and the fields are accessed through names. |Shredding is not supported in Avro.| | ||
| |**`geometry`**|`bytes`|WKB format, see [Appendix G](#appendix-g-geospatial-notes)| | ||
| |**`geography`**|`bytes`|WKB format, see [Appendix G](#appendix-g-geospatial-notes)| | ||
|
|
||
| Notes: | ||
|
|
||
|
|
@@ -1234,6 +1264,8 @@ Lists must use the [3-level representation](https://github.com/apache/parquet-fo | |
| | **`list`** | `3-level list` | `LIST` | See Parquet docs for 3-level representation. | | ||
| | **`map`** | `3-level map` | `MAP` | See Parquet docs for 3-level representation. | | ||
| | **`variant`** | `group` with `metadata` and `value` fields. `metadata` and `value` must not be assigned field IDs and the fields are accessed through names.| `VARIANT` | See Parquet docs for [Variant encoding](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md) and [Variant shredding encoding](https://github.com/apache/parquet-format/blob/master/VariantShredding.md). | | ||
| | **`geometry`** | `binary` | `GEOMETRY` | WKB format, see [Appendix G](#appendix-g-geospatial-notes). | | ||
| | **`geography`** | `binary` | `GEOGRAPHY` | WKB format, see [Appendix G](#appendix-g-geospatial-notes). | | ||
|
|
||
|
|
||
| When reading an `unknown` column, any corresponding column must be ignored and replaced with `null` values. | ||
|
|
@@ -1266,6 +1298,9 @@ When reading an `unknown` column, any corresponding column must be ignored and r | |
| | **`list`** | `array` | | | | ||
| | **`map`** | `map` | | | | ||
| | **`variant`** | `struct` with `metadata` and `value` fields. `metadata` and `value` must not be assigned field IDs. | `iceberg.struct-type`=`VARIANT` | Shredding is not supported in ORC. | | ||
| | **`geometry`** | `binary` | `iceberg.binary-type`=`GEOMETRY` | WKB format, see [Appendix G](#appendix-g-geospatial-notes). | | ||
| | **`geography`** | `binary` | `iceberg.binary-type`=`GEOMETRY` | WKB format, see [Appendix G](#appendix-g-geospatial-notes). | | ||
|
|
||
|
|
||
| Notes: | ||
|
|
||
|
|
@@ -1361,6 +1396,8 @@ Types are serialized according to this table: | |
| |**`list`**|`JSON object: {`<br /> `"type": "list",`<br /> `"element-id": <id int>,`<br /> `"element-required": <bool>`<br /> `"element": <type JSON>`<br />`}`|`{`<br /> `"type": "list",`<br /> `"element-id": 3,`<br /> `"element-required": true,`<br /> `"element": "string"`<br />`}`| | ||
| |**`map`**|`JSON object: {`<br /> `"type": "map",`<br /> `"key-id": <key id int>,`<br /> `"key": <type JSON>,`<br /> `"value-id": <val id int>,`<br /> `"value-required": <bool>`<br /> `"value": <type JSON>`<br />`}`|`{`<br /> `"type": "map",`<br /> `"key-id": 4,`<br /> `"key": "string",`<br /> `"value-id": 5,`<br /> `"value-required": false,`<br /> `"value": "double"`<br />`}`| | ||
| | **`variant`**| `JSON string: "variant"`|`"variant"`| | ||
| | **`geometry(C)`** | `JSON object: {`<br /> `"type": "geometry",`<br /> `"crs": <C>`<br />`}` | `{`<br /> `"type": "geometry",`<br /> `"crs": "OGC:CRS84"`<br />`}` | | ||
| | **`geography(C, A)`** | `JSON object: {`<br /> `"type": "geography",`<br /> `"crs": <C>,`<br /> `"algorithm": <A>`<br />}` | `{`<br /> `"type": "geography",`<br /> `"crs": "OGC:CRS84",`<br /> `"algorithm": "spherical"` <br /> `}` | | ||
|
|
||
| Note that default values are serialized using the JSON single-value serialization in [Appendix D](#appendix-d-single-value-serialization). | ||
|
|
||
|
|
@@ -1486,7 +1523,7 @@ Example | |
|
|
||
| ### Binary single-value serialization | ||
|
|
||
| This serialization scheme is for storing single values as individual binary values in the lower and upper bounds maps of manifest files. | ||
| This serialization scheme is for storing single values as individual binary values. | ||
|
|
||
| | Type | Binary serialization | | ||
| |------------------------------|--------------------------------------------------------------------------------------------------------------| | ||
|
|
@@ -1511,6 +1548,17 @@ This serialization scheme is for storing single values as individual binary valu | |
| | **`list`** | Not supported | | ||
| | **`map`** | Not supported | | ||
| | **`variant`** | Not supported | | ||
| | **`geometry`** | WKB format, see [Appendix G](#appendix-g-geospatial-notes) | | ||
| | **`geography`** | WKB format, see [Appendix G](#appendix-g-geospatial-notes) | | ||
|
|
||
| ### Bound serialization | ||
|
|
||
| The binary single-value serialization can be used to store the lower and upper bounds maps of manifest files, except as specified by the following table. | ||
|
|
||
| | Type | Binary serialization | | ||
| |------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | ||
| | **`geometry`** | A single point, encoded as a x:y:z:m concatenation of its 8-byte little-endian IEEE 754 coordinate values. x and y are mandatory. This becomes x:y if z and m are both unset, x:y:z if only m is unset, and x:y:NaN:m if only z is unset. | | ||
| | **`geography`** | A single point, encoded as a x:y:z:m concatenation of its 8-byte little-endian IEEE 754 coordinate values. x and y are mandatory. This becomes x:y if z and m are both unset, x:y:z if only m is unset, and x:y:NaN:m if only z is unset. | | ||
|
|
||
| ### JSON single-value serialization | ||
|
|
||
|
|
@@ -1537,6 +1585,8 @@ This serialization scheme is for storing single values as individual binary valu | |
| | **`struct`** | **`JSON object by field ID`** | `{"1": 1, "2": "bar"}` | Stores struct fields using the field ID as the JSON field name; field values are stored using this JSON single-value format | | ||
| | **`list`** | **`JSON array of values`** | `[1, 2, 3]` | Stores a JSON array of values that are serialized using this JSON single-value format | | ||
| | **`map`** | **`JSON object of key and value arrays`** | `{ "keys": ["a", "b"], "values": [1, 2] }` | Stores arrays of keys and values; individual keys and values are serialized using this JSON single-value format | | ||
| | **`geometry`** | **`JSON string`** | `POINT (30 10)` | Stored using WKT representation, see [Appendix G](#appendix-g-geospatial-notes) | | ||
| | **`geography`** | **`JSON string`** | `POINT (30 10)` | Stored using WKT representation, see [Appendix G](#appendix-g-geospatial-notes) | | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Like above, this table covers arbitrary |
||
|
|
||
|
|
||
|
|
||
|
|
@@ -1709,3 +1759,11 @@ Snapshot summary can include metrics fields to track numeric stats of the snapsh | |
| | **`source-snapshot-id`** | "12345678" | The original id of a cherry-picked snapshot | | ||
| | **`engine-name`** | "spark" | Name of the engine that created the snapshot | | ||
| | **`engine-version`** | "3.5.4" | Version of the engine that created the snapshot | | ||
|
|
||
| ## Appendix G: Geospatial Notes | ||
|
|
||
| The Geometry and Geography class hierarchy and its Well-known text (WKT) and Well-known binary (WKB) serializations (ISO supporting XY, XYZ, XYM, XYZM) are defined by [OpenGIS Implementation Specification for Geographic information – Simple feature access – Part 1: Common architecture](https://portal.ogc.org/files/?artifact_id=25355), from [OGC (Open Geospatial Consortium)](https://www.ogc.org/standard/sfa/). | ||
|
|
||
szehon-ho marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| Points are always defined by the coordinates X, Y, Z (optional), and M (optional), in this order. X is the longitude/easting, Y is the latitude/northing, and Z is usually the height, or elevation. M is a fourth optional dimension, for example a linear reference value (e.g., highway milepost value), a timestamp, or some other value as defined by the CRS. | ||
|
|
||
| The version of the OGC standard first used here is 1.2.1, but future versions may also used if the WKB representation remains wire-compatible. | ||
Uh oh!
There was an error while loading. Please reload this page.