Skip to content

Avoid creating Hive tables with double slashed location#17958

Merged
findepi merged 1 commit intomasterfrom
findepi/hive-restore-normalization
Jun 20, 2023
Merged

Avoid creating Hive tables with double slashed location#17958
findepi merged 1 commit intomasterfrom
findepi/hive-restore-normalization

Conversation

@findepi
Copy link
Copy Markdown
Member

@findepi findepi commented Jun 19, 2023

Hive connector does not support table locations containing double slashes. On S3 this leads to correctness issues (e.g. INSERT works, but SELECT does not find any data).

This commit

  • restores normalization of implicit table location during CREATE TABLE. There used to be such normalization until 8bd9f75.
  • rejects explicit table locations containing double slash during CREATE TABLE .. WITH (external_location = ...). Before 8bd9f75 there used to be normalization also during this flow, but rejecting such unsupported locations is deemed more correct.

Alternative to #17947
Stop gap solution until we fix all problems related to double slashes in Hive connector (#17803).

After merging this, #17803 is no longer a release-blocker.

@cla-bot cla-bot bot added the cla-signed label Jun 19, 2023
@findepi findepi force-pushed the findepi/hive-restore-normalization branch from 69692fc to d0e7c99 Compare June 19, 2023 15:30
Comment thread plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java Outdated
Comment thread plugin/trino-hive/src/main/java/io/trino/plugin/hive/util/HiveWriteUtils.java Outdated
Copy link
Copy Markdown
Member

@losipiuk losipiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @findinpath @alexjo2144 PTAL.

@github-actions github-actions bot added hive Hive connector tests:hive labels Jun 19, 2023
Copy link
Copy Markdown
Contributor

@findinpath findinpath Jun 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trino> create schema hive.doubleslashes with (location='s3://trino-test/double_slash//location');
CREATE SCHEMA
trino> show create schema hive.doubleslashes;
                         Create Schema                          
----------------------------------------------------------------
 CREATE SCHEMA hive.doubleslashes                               
 WITH (                                                         
    location = 's3://trino-test/double_slash//location' 
 )                                                              
(1 row)

trino> create table hive.doubleslashes.test1 (c integer);
CREATE TABLE
trino> insert into hive.doubleslashes.test1 values 1,2,3;
INSERT: 3 rows
sele
Query 20230620_060740_00002_8c4bz, FINISHED, 1 node
http://localhost:8080/ui/query.html?20230620_060740_00002_8c4bz
Splits: 14 total, 14 done (100.00%)
CPU Time: 0.3s total,     0 rows/s,     0B/s, 28% active
Per Node: 0.1 parallelism,     0 rows/s,     0B/s
Parallelism: 0.1
Peak Memory: 38.2KB
3.11 [0 rows, 0B] [0 rows/s, 0B/s]

trino> select "$path"  from hive.doubleslashes.test1;
                                                        $path                                                         
----------------------------------------------------------------------------------------------------------------------
 s3://trino-test/double_slash/location/test1/20230620_060740_00002_8c4bz_e91a2ac5-4bc1-40ba-ab3f-2145eaa6bf97 
 s3://trino-test/double_slash/location/test1/20230620_060740_00002_8c4bz_e91a2ac5-4bc1-40ba-ab3f-2145eaa6bf97 
 s3://trino-test/double_slash/location/test1/20230620_060740_00002_8c4bz_e91a2ac5-4bc1-40ba-ab3f-2145eaa6bf97 
(3 rows)

Note that this strategy may eventually cause issues with managed tables which get created into schemas having a double-slashed location.

schema location: s3://trino-test/double_slash//location
table location: s3://trino-test/double_slash/location/test1

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

schema location: s3://trino-test/double_slash//location
table location: s3://trino-test/double_slash/location/test1

well, that's the whole point of this PR

thanks for calling this out!

@findepi findepi force-pushed the findepi/hive-restore-normalization branch from d0e7c99 to b509013 Compare June 20, 2023 08:24
@github-actions github-actions bot added delta-lake Delta Lake connector iceberg Iceberg connector labels Jun 20, 2023
Comment on lines 1286 to 1293
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrote the logic to accept locations such as file:///tmp/6383092576615971655, file:///tmp/6383092576615971655/, s3://bucket/foo/.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as found in #17964 , two trailing slashes are not OK either.
i will not add a test case for that, but will block such locations.
test case coming in the other PR

Copy link
Copy Markdown
Member

@ebyhr ebyhr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good except for CI failure:

Error:  Failures: 
Error:    TestHiveS3AndGlueMetastoreTest.testBasicOperationsWithProvidedTableLocation:137 
expected: "s3://trino-ci-test/test_glue_s3_icyjziw7ri/trailing_slash/test_basic_operations_kjxzwqj9yr"
 but was: "s3://trino-ci-test/test_glue_s3_icyjziw7ri/trailing_slash/test_basic_operations_kjxzwqj9yr/"
Error:    TestHiveS3AndGlueMetastoreTest.testBasicOperationsWithProvidedTableLocation:137 
expected: "s3://trino-ci-test/test_glue_s3_icyjziw7ri/trailing_slash/test_basic_operations_tuoqpbqkww"
 but was: "s3://trino-ci-test/test_glue_s3_icyjziw7ri/trailing_slash/test_basic_operations_tuoqpbqkww/"

Hive connector does not support table locations containing double
slashes. On S3 this leads to correctness issues (e.g. INSERT works, but
SELECT does not find any data).

This commit

- restores normalization of implicit table location during CREATE TABLE.
  There used to be such normalization until 8bd9f75.
- rejects explicit table locations containing double slash during
  `CREATE TABLE .. WITH (external_location = ...)`.
  Before 8bd9f75 there used to be
  normalization also during this flow, but rejecting such unsupported
  locations is deemed more correct.
@findepi findepi force-pushed the findepi/hive-restore-normalization branch from 38a2eda to 9151dc8 Compare June 20, 2023 14:09
@findepi findepi merged commit 431f3e0 into master Jun 20, 2023
@findepi findepi deleted the findepi/hive-restore-normalization branch June 20, 2023 21:23
@github-actions github-actions bot added this to the 420 milestone Jun 20, 2023
@colebow
Copy link
Copy Markdown
Member

colebow commented Jun 21, 2023

Does this need release notes? @findepi

@findepi
Copy link
Copy Markdown
Member Author

findepi commented Jun 22, 2023

yes. it fixes #17803

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed delta-lake Delta Lake connector hive Hive connector iceberg Iceberg connector

Development

Successfully merging this pull request may close these issues.

5 participants