Conversation

@Praveen2112
Member

Description

Add support for a varchar to timestamp coercer in Hive tables. For partitioned tables this is supported by most formats; for unpartitioned tables only the ORC format is supported as of now.

The coercions supported on current master for unpartitioned tables are handled inherently by the ColumnReader. This PR introduces a framework that maps an OrcTypeKind to a corresponding Trino Type and re-uses the TypeCoercer used by partitioned tables. For an invalid String we append null instead of failing.
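
A minimal sketch of the null-on-invalid behavior described above, with hypothetical names (the actual coercer class and its wiring live in the Hive connector):

```java
import java.sql.Timestamp;
import java.time.LocalDateTime;

// Hypothetical sketch: coerce a varchar value to a timestamp, producing
// null for strings that don't parse instead of failing the query.
public class VarcharToTimestampSketch
{
    static LocalDateTime coerce(String value)
    {
        try {
            // Timestamp.valueOf expects "yyyy-[m]m-[d]d hh:mm:ss[.f...]"
            return Timestamp.valueOf(value).toLocalDateTime();
        }
        catch (IllegalArgumentException e) {
            return null; // invalid string -> null, not a failure
        }
    }

    public static void main(String[] args)
    {
        System.out.println(coerce("2023-05-08 12:34:56.123"));
        System.out.println(coerce("not a timestamp")); // null
    }
}
```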

Additional context and related issues

This is on top of #16869

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Section
* Add support for varchar to timestamp coercer in Hive tables

@cla-bot cla-bot bot added the cla-signed label Apr 17, 2023
@Praveen2112 Praveen2112 force-pushed the praveen/string_timestamp_coercion branch from 2d40635 to 8480728 Compare April 17, 2023 16:59
@github-actions github-actions bot added hive Hive connector tests:hive labels Apr 17, 2023
@findepi
Member

findepi commented Apr 19, 2023

let's finish #16869 first.

@Praveen2112 Praveen2112 force-pushed the praveen/string_timestamp_coercion branch from 8480728 to 4141ec6 Compare April 28, 2023 06:05
@github-actions github-actions bot added the iceberg Iceberg connector label Apr 28, 2023
@Praveen2112 Praveen2112 force-pushed the praveen/string_timestamp_coercion branch from 4141ec6 to 6d6223d Compare May 2, 2023 13:30
@Praveen2112
Member Author

@findepi #16869 was finished so this is ready for a new round of review

@huberty89 (Contributor) left a comment

LGTM

Member

.orElse(VARCHAR);

Member

it looks like we could always use VARCHAR here.
i don't think fromOrcType.getLength() matters.
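
A minimal sketch of the code under discussion, assuming `OrcType#getLength` exposes the footer length as an `Optional<Integer>` (as in Trino's ORC reader):

```java
import io.trino.spi.type.Type;
import io.trino.spi.type.VarcharType;

import java.util.Optional;

import static io.trino.spi.type.VarcharType.VARCHAR;

public class VarcharMappingSketch
{
    // what the diff does: honor a bounded length captured in the ORC footer
    static Type fromFooterLength(Optional<Integer> length)
    {
        return length.map(VarcharType::createVarcharType).orElse(VARCHAR);
    }

    public static void main(String[] args)
    {
        System.out.println(fromFooterLength(Optional.of(20)).getDisplayName());  // varchar(20)
        // the reviewer's point: unbounded VARCHAR is always wide enough on read
        System.out.println(fromFooterLength(Optional.empty()).getDisplayName()); // varchar
    }
}
```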

Member Author

What if the schema of a column changes from unbounded varchar to a bounded varchar and then to timestamp, won't it affect the result?

Member Author

I don't have test coverage for this, but when we try to append data to an existing ORC file, the footer could capture the new schema.

Member

i thought this is being used on read, not on write?

also, i think orc files are immutable, you cannot append to an existing data file

Member

insert-only tables mean that you can insert into them, but you cannot update or delete from them
this is a hive concept

yes, i vaguely remember some mention of some streaming-ingest for ORC where files are built incrementally, but Trino does not do that and will never do that. Object storage doesn't support that either.

Member Author

Okay!! We can assume that the varchar length specified in the footer will always be greater than or equal to the length of the Slice in the VariableWidthBlock, i.e. we won't require truncating the varchar during reads.

Member Author

I'll remove this code related to fromOrcType.getLength()

Member

will this also obsolete the "Capture OrcType in OrcColumn" commit?

Member Author

Yes!! So we will remove that commit.

Member

add cases with micros and nanos precision:
.123499
.123500
.123501
.123499999
.123500000
.123500001
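
These values exercise the rounding boundaries at milli and micro precision. A minimal sketch of the expected behavior, assuming round-half-up semantics for timestamp casts:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class FractionRoundingSketch
{
    public static void main(String[] args)
    {
        // rounding the fractional seconds to timestamp(3), i.e. millis
        for (String fraction : new String[] {"0.123499", "0.123500", "0.123501"}) {
            System.out.println(fraction + " -> " + new BigDecimal(fraction).setScale(3, RoundingMode.HALF_UP));
        }
        // rounding to timestamp(6), i.e. micros
        for (String fraction : new String[] {"0.123499999", "0.123500000", "0.123500001"}) {
            System.out.println(fraction + " -> " + new BigDecimal(fraction).setScale(6, RoundingMode.HALF_UP));
        }
    }
}
```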

Member Author

But these cases cannot be applied at the same time, right?

Member

i don't understand the question.
do you mean we're limited to just one test data point?

Member Author

The existing test is limited to running with the default precision. I'll try to add additional test coverage with a dedicated table with varying precisions, or we could try setting the default precision to NANO_SECONDS.

Member

good point. we should probably exercise this logic with all supported precisions
let's have a dedicated test
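
A rough outline of what such a dedicated test could look like; the names and the flow here are hypothetical (the real coverage would go through the Hive test framework):

```java
import java.util.stream.IntStream;

// Hypothetical outline: run the same coercion scenario once per supported
// timestamp precision rather than only with the session default.
public class VaryingPrecisionTestSketch
{
    public static void main(String[] args)
    {
        String input = "2023-05-05 01:02:03.123500001";
        IntStream.rangeClosed(0, 9).forEach(precision ->
                // hypothetical step: create a table with a timestamp(precision)
                // column, coerce `input` from varchar, then assert the result
                System.out.printf("would verify coercion of %s into timestamp(%d)%n", input, precision));
    }
}
```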

@Praveen2112 Praveen2112 force-pushed the praveen/string_timestamp_coercion branch from 6d6223d to 1061ca2 Compare May 5, 2023 07:39
@Praveen2112 Praveen2112 force-pushed the praveen/string_timestamp_coercion branch 2 times, most recently from cfe2a4f to 1aa53e5 Compare May 5, 2023 17:11
@Praveen2112
Member Author

@findepi Squashed the commits. One major change: we now use Timestamp.valueOf to parse the String instead of a DateTimeFormatter. The ORC version reference I took was a recent one, while Hive 3.1.3-rc uses java.sql.Timestamp. Ref: https://github.com/apache/orc/blob/2880540d8caf18ccbbe3442fca47be150116a235/java/core/src/java/org/apache/orc/impl/ConvertTreeReaderFactory.java#L231
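
For reference, a minimal sketch of how the two parsers differ, using a strict ISO-style formatter as the alternative:

```java
import java.sql.Timestamp;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class ParsingComparisonSketch
{
    public static void main(String[] args)
    {
        // java.sql.Timestamp.valueOf accepts single-digit month and day
        // ("yyyy-[m]m-[d]d hh:mm:ss[.f...]"), matching Hive's behavior
        System.out.println(Timestamp.valueOf("2023-5-8 01:02:03"));

        // a strict DateTimeFormatter rejects the same string
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
        try {
            System.out.println(LocalDateTime.parse("2023-5-8 01:02:03", formatter));
        }
        catch (DateTimeParseException e) {
            System.out.println("strict formatter rejects it: " + e.getMessage());
        }
    }
}
```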

@findepi
Member

findepi commented May 8, 2023

the orc version reference I took was a recent one while hive 3.1.3-rc uses java.sql.Timestamp

Does this mean we have to choose which Hive version we're going to be compatible with?

@Praveen2112
Member Author

For a few edge cases, it depends on the Hive version (or future versions).

@ebyhr ebyhr removed their request for review May 16, 2023 05:11
@Praveen2112 Praveen2112 force-pushed the praveen/string_timestamp_coercion branch 3 times, most recently from 6348577 to 07212e9 Compare May 17, 2023 15:22
Hive 2.x and Hive 3.x use `java.sql.Timestamp#toString` for coercing Timestamp
to Varchar types. `java.sql.Timestamp#toString` doesn't render historical dates correctly.
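
A minimal sketch of the historical-date issue the commit message refers to, assuming a UTC default zone: `java.sql.Timestamp#toString` renders through the legacy hybrid Julian/Gregorian calendar (cutover in October 1582), while `java.time` uses the proleptic Gregorian calendar, so the same instant can print as a different calendar date before the cutover.

```java
import java.sql.Timestamp;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.TimeZone;

public class HistoricalTimestampSketch
{
    public static void main(String[] args)
    {
        TimeZone.setDefault(TimeZone.getTimeZone("UTC"));
        // java.time interprets this in the proleptic Gregorian calendar
        LocalDateTime historical = LocalDateTime.of(1500, 1, 1, 0, 0, 0);
        long epochMillis = historical.toInstant(ZoneOffset.UTC).toEpochMilli();

        System.out.println(historical);                 // 1500-01-01T00:00
        // rendered via the hybrid calendar: shifted by several days
        System.out.println(new Timestamp(epochMillis)); // not 1500-01-01 00:00:00.0
    }
}
```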
@Praveen2112 Praveen2112 force-pushed the praveen/string_timestamp_coercion branch from 07212e9 to 00c08af Compare May 23, 2023 14:43
@Praveen2112
Member Author

Due to #17604, I'll have to rebase this PR onto a different base. Sorry for the inconvenience.
