Skip to content

Fix reading from Hive table with location being S3 bucket itself#17848

Merged
findepi merged 3 commits intotrinodb:masterfrom
findinpath:findinpath/hive-external-table-top-of-bucket
Jun 15, 2023
Merged

Fix reading from Hive table with location being S3 bucket itself#17848
findepi merged 3 commits intotrinodb:masterfrom
findinpath:findinpath/hive-external-table-top-of-bucket

Conversation

@findinpath
Copy link
Copy Markdown
Contributor

@findinpath findinpath commented Jun 12, 2023

Description

In case that a user decides to dedicate an S3 bucket for a Hive table, this PR adds the required handling to support reading from such a table.

CREATE TABLE myhivetable (c integer) WITH (external_location='s3://myhivetable/');

The hadoop Path corresponding to 's3://myhivetable/' has the string representation 's3://myhivetable/'.
The hadoop Path corresponding to 's3://myhivetable/somedir/' has the string representation 's3://myhivetable/somedir' (note that in this case the / at the end of the string representation is not present).

Specific handling has to be done in such a case to find the data files at the top level of the bucket.

Not present in the tests provided with this PR, but still a valid scenario is the usage of hive.recursive-directories=true hive setting. Tested manually so far.

Additional context and related issues

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Fix reading from Hive table with location being S3 bucket itself. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Jun 12, 2023
@findinpath findinpath self-assigned this Jun 12, 2023
@findinpath findinpath added the no-release-notes This pull request does not require release notes entry label Jun 12, 2023
@github-actions github-actions bot added hive Hive connector tests:hive labels Jun 12, 2023
@findepi findepi changed the title Test reading from Hive external table place on top of S3 bucket Fix reading from Hive table with location being S3 bucket itself Jun 12, 2023
@findinpath findinpath force-pushed the findinpath/hive-external-table-top-of-bucket branch from 0e172dd to f472bf9 Compare June 12, 2023 22:14
@findinpath findinpath requested review from electrum and findepi June 12, 2023 22:14
@findinpath findinpath force-pushed the findinpath/hive-external-table-top-of-bucket branch from f472bf9 to e7c9a04 Compare June 12, 2023 22:22
@findinpath findinpath requested a review from ebyhr June 13, 2023 06:47
@findinpath findinpath force-pushed the findinpath/hive-external-table-top-of-bucket branch 2 times, most recently from 9dae8cd to d80159f Compare June 13, 2023 13:15
@github-actions github-actions bot added the delta-lake Delta Lake connector label Jun 13, 2023
@findinpath findinpath force-pushed the findinpath/hive-external-table-top-of-bucket branch 5 times, most recently from 31362c8 to 15a34ef Compare June 14, 2023 11:12
@findepi findepi force-pushed the findinpath/hive-external-table-top-of-bucket branch from 15a34ef to 628c763 Compare June 14, 2023 12:53
@findepi findepi force-pushed the findinpath/hive-external-table-top-of-bucket branch from 628c763 to 8ebb3e6 Compare June 14, 2023 12:57
@findepi
Copy link
Copy Markdown
Member

findepi commented Jun 14, 2023

/test-with-secrets sha=8ebb3e6a06a635958dcf826c40a35f514328817d

https://github.com/trinodb/trino/actions/runs/5271753669

@ebyhr
Copy link
Copy Markdown
Member

ebyhr commented Jun 15, 2023

Some CI jobs about Hive were cancelled due to timeout. Let me retrigger.

@ebyhr
Copy link
Copy Markdown
Member

ebyhr commented Jun 15, 2023

/test-with-secrets sha=8ebb3e6a06a635958dcf826c40a35f514328817d

https://github.com/trinodb/trino/actions/runs/5275190201

@findepi findepi merged commit 6a4ebd4 into trinodb:master Jun 15, 2023
String transactionLogDir = getTransactionLogDir(tableLocation);
TrinoFileSystem fileSystem = fileSystemFactory.create(session);
String commonPathPrefix = tableLocation + "/";
String commonPathPrefix = tableLocation.endsWith("/") ? tableLocation : tableLocation + "/";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding clarifying parentheses?
tableLocation.endsWith("/") ? tableLocation : (tableLocation + "/");

assertThat(getAllDataFilesFromTableDirectory(tableName)).isEqualTo(union(initialFiles, updatedFiles));

// vacuum with low retention period
MILLISECONDS.sleep(1_000 - timeSinceUpdate.elapsed(MILLISECONDS) + 1);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't be divided to two tests?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed delta-lake Delta Lake connector hive Hive connector no-release-notes This pull request does not require release notes entry RELEASE-BLOCKER

Development

Successfully merging this pull request may close these issues.

5 participants