Contributor

@ljw9111 ljw9111 commented Mar 6, 2025

Description

This PR implements a native reader for the Esri JSON format, which can be used for geospatial queries. (NOTE: this port supports only the UTC time zone.)

Users can now run geospatial queries on a table that uses the ESRI SerDe.

DDL example

CREATE EXTERNAL TABLE earthquakes
(
 earthquake_date string,
 latitude double,
 longitude double,
 depth double,
 magnitude double,
 magtype string,
 mbstations string,
 gap string,
 distance string,
 rms string,
 source string,
 eventid string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION 's3://amzn-s3-demo-bucket/my-query-log/csv/';

CREATE EXTERNAL TABLE IF NOT EXISTS counties
(
 Name string,
 BoundaryShape binary
)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.EsriJsonSerDe'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedEsriJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://amzn-s3-demo-bucket/my-query-log/json/';

Example data is from https://docs.aws.amazon.com/athena/latest/ug/geospatial-example-queries.html
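
For readers unfamiliar with the format, here is a minimal sketch of what an enclosed Esri JSON file looks like and how a record lines up with the `counties` columns. It assumes the usual Esri feature-set layout (a top-level `features` array whose entries carry `attributes` and `geometry`). The sample record, its coordinates, and the `read_features` helper are made up for illustration and are not taken from this PR or from the Athena data set.

```python
import json

# Hypothetical sample: one feature with a square polygon ring.
# The name and coordinates are invented for illustration only.
SAMPLE = """
{
  "features": [
    {
      "attributes": {"Name": "Example County"},
      "geometry": {
        "rings": [
          [[-120.0, 35.0], [-120.0, 36.0], [-119.0, 36.0],
           [-119.0, 35.0], [-120.0, 35.0]]
        ]
      }
    }
  ]
}
"""


def read_features(text):
    """Return a list of (attributes, geometry) pairs from an enclosed Esri JSON document."""
    document = json.loads(text)
    return [
        (feature.get("attributes", {}), feature.get("geometry"))
        for feature in document.get("features", [])
    ]


features = read_features(SAMPLE)
for attributes, geometry in features:
    # "Name" would feed the Name column; the geometry would feed BoundaryShape.
    print(attributes["Name"], len(geometry["rings"][0]), "points in outer ring")
```

In this sketch the `attributes` object corresponds to the ordinary columns (`Name`), while `geometry` is what the reader would expose through the binary `BoundaryShape` column for `geometry_from_hadoop_shape` to decode.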

DML example

trino:esri> SELECT c.name,
         ->         COUNT(*) cnt
         -> FROM esri.counties as c
         -> CROSS JOIN esri.earthquakes
         -> WHERE ST_CONTAINS (geometry_from_hadoop_shape(c.boundaryshape), ST_POINT(earthquakes.longitude, earthquakes.latitude))
         -> GROUP BY  c.name
         -> ORDER BY  cnt DESC;
      name       | cnt 
-----------------+-----
 Kern            | 288 
 San Bernardino  | 280 
 Imperial        | 224 
 Inyo            | 160 
 Los Angeles     | 144 
 Monterey        | 112 
 Riverside       | 112 
 Santa Clara     |  96 
 Fresno          |  88 
 San Benito      |  88 
 San Diego       |  56 
 Santa Cruz      |  40 
 San Luis Obispo |  24 
 Ventura         |  24 
 Orange          |  16 
 San Mateo       |   8 
(16 rows)

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Hive connector
* Add support for reading tables using the ESRI JSON format ({issue}`25241`)

@cla-bot cla-bot bot added the cla-signed label Mar 6, 2025
@github-actions github-actions bot added the hive Hive connector label Mar 6, 2025
@ljw9111 ljw9111 force-pushed the native-esri-reader branch 3 times, most recently from 1361efe to 7f29bc9 Compare March 7, 2025 00:31
@ljw9111 ljw9111 self-assigned this Mar 7, 2025
@ljw9111 ljw9111 force-pushed the native-esri-reader branch 2 times, most recently from 5b41a78 to 16784ed Compare March 10, 2025 17:52
Contributor

@findinpath findinpath left a comment

After skimming the code and trying TestEsri I definitely understand the purpose of this contribution.

The referenced library, geometry-api-java, no longer seems to be actively maintained.

It would be useful to have a test reading all types.

However, before making any further changes, I think it is worth asking the maintainers @wendigo and @dain whether this contribution is a functional fit for inclusion in the Trino project.

@ljw9111 ljw9111 force-pushed the native-esri-reader branch from 16784ed to f02ca1b Compare March 10, 2025 19:44
@ljw9111 ljw9111 force-pushed the native-esri-reader branch 3 times, most recently from 06d5677 to d28634f Compare March 17, 2025 14:00
@ljw9111 ljw9111 force-pushed the native-esri-reader branch 2 times, most recently from 8de204a to 87d9d72 Compare March 17, 2025 19:12
@ljw9111 ljw9111 force-pushed the native-esri-reader branch from 87d9d72 to 2a28c44 Compare March 19, 2025 18:30
@ljw9111 ljw9111 force-pushed the native-esri-reader branch from 2a28c44 to a7df4a8 Compare March 21, 2025 16:20
dain previously requested changes Mar 30, 2025
Member

@dain dain left a comment

I spent a good while reviewing this. Overall I think the approach is sound, but the code is missing defenses against bad data files. In this code we should strive to be bug-for-bug compatible with Hive, and this includes the handling of "bad" files, because users often rely on these undocumented behaviors.

Additionally, Jackson has some unexpected behaviors when recursing into nested structures, and this code falls into that trap. Specifically, the code isn't properly skipping nested data, which can result in unexpectedly processing the contents of nested objects (I had to learn this the hard way a couple of years back). In general, I used (copied) the framework laid out in the JSON reader, which handles these issues.
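
The nested-data pitfall described here can be sketched without Jackson. The following is a language-neutral illustration in Python, not the code from this PR or from the review commit: `tokens`, `skip_children`, and the two `top_level_fields` helpers are hypothetical names, and the token stream only loosely mimics a streaming parser. In Jackson itself, `JsonParser.skipChildren()` plays the role of `skip_children`.

```python
import json

CONTAINER_START = ("START_OBJECT", "START_ARRAY")
CONTAINER_END = ("END_OBJECT", "END_ARRAY")


def tokens(value):
    """Yield a flat (kind, payload) token stream for a parsed JSON value."""
    if isinstance(value, dict):
        yield ("START_OBJECT", None)
        for key, child in value.items():
            yield ("FIELD_NAME", key)
            yield from tokens(child)
        yield ("END_OBJECT", None)
    elif isinstance(value, list):
        yield ("START_ARRAY", None)
        for child in value:
            yield from tokens(child)
        yield ("END_ARRAY", None)
    else:
        yield ("VALUE", value)


def skip_children(stream, kind):
    """If `kind` opened a container, consume tokens through its matching close."""
    if kind not in CONTAINER_START:
        return
    depth = 1
    for inner_kind, _ in stream:
        if inner_kind in CONTAINER_START:
            depth += 1
        elif inner_kind in CONTAINER_END:
            depth -= 1
            if depth == 0:
                return


def top_level_fields_naive(doc):
    """Buggy: reports every field name it sees, including nested ones."""
    return [payload for kind, payload in tokens(doc) if kind == "FIELD_NAME"]


def top_level_fields(doc):
    """Report only top-level field names by skipping each field's value."""
    stream = tokens(doc)
    next(stream)  # consume the outer START_OBJECT
    names = []
    for kind, payload in stream:
        if kind == "END_OBJECT":
            break  # end of the outer object
        names.append(payload)  # kind is FIELD_NAME at this point
        value_kind, _ = next(stream)
        skip_children(stream, value_kind)
    return names


# A record with an unexpected field that happens to contain a nested "geometry" key.
record = json.loads(
    '{"unknown": {"geometry": {"x": 1}},'
    ' "attributes": {"Name": "Kern"},'
    ' "geometry": {"x": 2, "y": 3}}'
)

print(top_level_fields_naive(record))
print(top_level_fields(record))
```

The naive version sees the `geometry` key buried inside `unknown` and would treat it as the record's geometry; the version that skips children only ever sees `unknown`, `attributes`, and the real top-level `geometry`.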

Instead of adding a lot of mundane comments, I just applied them to the code which you can find in this commit dain@8b95a73

Finally, the tests seem to be missing cases for some of the supported attribute types. I see this when running the tests with coverage.

@github-actions

This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack.

@github-actions github-actions bot added the stale label Apr 21, 2025
@github-actions

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions bot closed this May 13, 2025
@electrum electrum reopened this May 13, 2025
@github-actions github-actions bot removed the stale label May 14, 2025
@ljw9111 ljw9111 force-pushed the native-esri-reader branch from a7df4a8 to 51a6492 Compare May 19, 2025 20:57
Contributor Author

ljw9111 commented May 19, 2025

@dain Thank you so much for the review and the update! I pulled in your commit and pushed a new revision that adds a unit test, testSupportedAttributeTypes, and makes a small fix to assertGeometry in TestEsriDeserializer.java. Could you review it when you get a chance?

@dain dain self-requested a review June 11, 2025 23:15
@dain dain dismissed their stale review June 11, 2025 23:16

My comments have been integrated. James will take over for the final review.

@dain dain removed their request for review June 11, 2025 23:16
@ljw9111 ljw9111 force-pushed the native-esri-reader branch from 51a6492 to 1ac5357 Compare June 24, 2025 15:55
@ljw9111 ljw9111 force-pushed the native-esri-reader branch 2 times, most recently from fe322fc to f89a2d6 Compare June 25, 2025 19:38
Co-authored-by: Dain Sundstrom <[email protected]>
@ljw9111 ljw9111 force-pushed the native-esri-reader branch from f89a2d6 to 44a8841 Compare June 25, 2025 20:48
@github-actions github-actions bot added the docs label Jun 25, 2025
Member

@pettyjamesm pettyjamesm left a comment

Approved, thanks @ljw9111!

@pettyjamesm pettyjamesm merged commit f819b92 into trinodb:master Jun 26, 2025
62 checks passed
@github-actions github-actions bot added this to the 477 milestone Jun 26, 2025
@ljw9111 ljw9111 deleted the native-esri-reader branch June 30, 2025 14:38