Add support for varchar to timestamp coercer in hive tables #17071
Conversation
Let's finish #16869 first.
huberty89
left a comment
LGTM
`.orElse(VARCHAR);`
It looks like we could always use VARCHAR here.
I don't think `fromOrcType.getLength()` matters.
What if the schema of a column changes from unbounded varchar to bounded varchar and then to timestamp? Won't that affect the result?
I don't have test coverage for this, but when we try to append data to an existing ORC file, the footer could capture the new schema.
I thought this is being used on read, not on write?
Also, I think ORC files are immutable; you cannot append to an existing data file.
Insert-only tables mean that you can insert into them, but you cannot update or delete from them.
This is a Hive concept.
Yes, I vaguely remember some mention of streaming ingest for ORC where files are built incrementally, but Trino does not do that and never will. Object storage doesn't support that either.
Okay! We can assume that the varchar length specified in the footer will always be greater than or equal to the length of the Slice in a VariableWidthBlock, i.e. we won't require truncating the varchar during reads.
I'll remove the code related to `fromOrcType.getLength()`.
Will this also obsolete the "Capture OrcType in OrcColumn" commit?
Yes! So we will remove that commit.
Add cases with micro and nano precision:
.123499
.123500
.123501
.123499999
.123500000
.123500001
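These values sit just below, exactly at, and just above a rounding boundary, so together they pin down the rounding mode used when fractional-second precision is narrowed. A minimal sketch, assuming half-up rounding on precision narrowing (the helper name and signature here are illustrative, not Trino's actual API):

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Sketch: narrowing a fractional-second value from one digit count to a
// smaller one with HALF_UP rounding. For example, micros -> millis:
// 123499 -> 123, 123500 -> 124, 123501 -> 124, so all three inputs are
// needed to distinguish half-up from truncation or half-even.
public class FractionRounding
{
    public static long narrow(long fraction, int fromDigits, int toDigits)
    {
        BigDecimal divisor = BigDecimal.TEN.pow(fromDigits - toDigits);
        return BigDecimal.valueOf(fraction)
                .divide(divisor, 0, RoundingMode.HALF_UP)
                .longValueExact();
    }
}
```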
But these cases cannot all be applied at the same time, right?
I don't understand the question.
Do you mean we're limited to just one test data point?
The existing test we have is limited to running with the default precision. I'll try to add additional test coverage by having a dedicated table with varying precision, or we could try setting the default precision to `NANO_SECONDS`.
Good point. We should probably exercise this logic with all supported precisions.
Let's have a dedicated test.
@findepi Squashed the commits. One major change we made is that we used
Does this mean we have to choose which Hive version we're going to be compatible with?
For a few edge cases, it depends on the Hive version or future versions.
Use == for enum comparison
Hive 2.x and Hive 3.x use `java.sql.Timestamp#toString` for coercing Timestamp to Varchar types, and `java.sql.Timestamp#toString` doesn't render historical dates correctly.
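The mismatch comes from the calendars involved: `java.sql.Timestamp` is backed by the hybrid Julian/Gregorian calendar (`java.util.GregorianCalendar`, which switches calendars at the 1582 cutover), while `java.time` and Trino timestamps use the proleptic Gregorian (ISO) calendar. A small illustration of the skew for a pre-1582 date; this is not Trino's code, just a demonstration of the two calendar systems disagreeing:

```java
import java.time.LocalDate;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Sketch: the same wall-clock date "1500-01-01" denotes different
// instants under the hybrid Julian/Gregorian calendar (backing
// java.sql.Timestamp) and the proleptic Gregorian calendar (backing
// java.time). Around the year 1500 the two disagree by 9 days.
public class HistoricalDateSkew
{
    private static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // Epoch day of the given date as the hybrid calendar sees it
    public static long hybridEpochDay(int year, int month, int dayOfMonth)
    {
        GregorianCalendar calendar = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        calendar.clear();
        calendar.set(year, month - 1, dayOfMonth); // Calendar months are 0-based
        return Math.floorDiv(calendar.getTimeInMillis(), MILLIS_PER_DAY);
    }

    // Days of disagreement between the hybrid and ISO calendars
    public static long skewDays(int year, int month, int dayOfMonth)
    {
        return hybridEpochDay(year, month, dayOfMonth)
                - LocalDate.of(year, month, dayOfMonth).toEpochDay();
    }
}
```

For modern dates the skew is zero, which is why the problem only shows up for historical timestamps.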
Due to #17604, I'll have to rebase this PR onto a different base. Sorry for the inconvenience.
Description
Add support for varchar to timestamp coercer in Hive tables. For partitioned tables this coercion is supported by most formats; for unpartitioned tables only the ORC format is supported as of now.
The coercions supported on current master for unpartitioned tables are inherently supported by the ColumnReader. This PR introduces a framework which maps an OrcTypeKind to a corresponding Trino Type and re-uses the TypeCoercer used by partitioned tables. For an invalid string we append null instead of failing.
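The parse-or-null behavior can be sketched as follows. This is an illustration only: the class name, method, and format pattern below are assumptions for the sketch, not the PR's actual `TimestampCoercer` API, which operates on Trino Blocks rather than single values:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoField;

// Sketch of a varchar -> timestamp coercion that yields null for
// unparseable input instead of failing the query.
public class VarcharToTimestampSketch
{
    // Assumed Hive-style rendering: "uuuu-MM-dd HH:mm:ss" with an
    // optional fractional-second part of up to nine digits
    private static final DateTimeFormatter HIVE_TIMESTAMP = new DateTimeFormatterBuilder()
            .appendPattern("uuuu-MM-dd HH:mm:ss")
            .appendFraction(ChronoField.NANO_OF_SECOND, 0, 9, true)
            .toFormatter();

    public static LocalDateTime coerceOrNull(String value)
    {
        try {
            return LocalDateTime.parse(value.trim(), HIVE_TIMESTAMP);
        }
        catch (DateTimeParseException e) {
            return null; // append null instead of failing the read
        }
    }
}
```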
Additional context and related issues
This is on top of #16869
Release notes
( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text: