Fix timestamp to varchar coercer in hive tables #17604
Conversation
Use == for enum comparison
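The commit above applies the reviewer's "use `==` for enum comparison" suggestion. A minimal self-contained illustration of why `==` is preferred for enums (the `Engine` enum here is a hypothetical stand-in for the product-test enum, not the actual class):

```java
public class EnumComparison {
    // Hypothetical stand-in for the product-test Engine enum
    enum Engine { TRINO, HIVE }

    public static boolean isTrino(Engine engine) {
        // == compares identity, which enum constants guarantee, and is
        // null-safe: a null `engine` yields false instead of the
        // NullPointerException that engine.equals(...) would throw
        return engine == Engine.TRINO;
    }

    public static void main(String[] args) {
        System.out.println(isTrino(Engine.TRINO)); // true
        System.out.println(isTrino(null));         // false, no NPE
    }
}
```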
Force-pushed from 11f847e to e3ac258
plugin/trino-hive/src/main/java/io/trino/plugin/hive/coercions/TimestampCoercer.java (outdated conversation, resolved)
Force-pushed from e3ac258 to 744bcd3
@findepi Can you please take a look at this PR in your spare time?
could be private?
It allows us to override the values for a specific distribution of Hive or a customized engine
testing/trino-product-tests/src/main/java/io/trino/tests/product/hive/BaseTestHiveCoercion.java (two outdated conversations, resolved)
Force-pushed from 744bcd3 to f204670
    // TODO: These expected failures should be fixed.
    return ImmutableMap.<ColumnContext, String>builder()
            // Expected failures from BaseTestHiveCoercion
            .putAll(super.expectedExceptionsWithTrinoContext())
this is meaningless since super is empty right now.
squash with commit that changes super
this is now squashed, but remains meaningless, as super remains empty.
maybe make super abstract instead?
I'll remove that method instead; for now we don't have any exceptions in the base class that have to be skipped in each implementation.
we should return same results as hive, not "somewhat similar"
the column is varchar. if hive returns 2121-07-15 15:30:12.123499999 (29 characters), Trino should also return 2121-07-15 15:30:12.123499999 (29 characters).
remove this method
So should it be irrespective of the timestamp precision specified as part of a session or config property?
Hive timestamp doesn't have the concept of precision - how do we handle them in Trino?
So should it be irrespective of the timestamp precision specified as part of a session or config property?
i believe so
Hive timestamp doesn't have the concept of precision -
i know. they are always nanoseconds precision
how do we handle them in trino
regular timestamps get truncated to selected precision for performance reasons
for columns that were timestamps and are now varchars, we don't have these performance considerations and we should be compatible with hive.
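The compatibility requirement above (Trino should return the same 29-character string as Hive, e.g. `2121-07-15 15:30:12.123499999`) can be checked without Hive itself, since Hive 2.x/3.x formats the value through `java.sql.Timestamp#toString`. A self-contained sketch, not the PR's actual coercer code:

```java
import java.sql.Timestamp;
import java.time.LocalDateTime;

public class TimestampVarcharDemo {
    // Render a nanosecond-precision timestamp the way Hive 2.x/3.x does,
    // via java.sql.Timestamp#toString, as discussed in the review thread
    static String renderLikeHive(LocalDateTime value) {
        return Timestamp.valueOf(value).toString();
    }

    public static void main(String[] args) {
        // The value from the review discussion, with full nanosecond precision
        String rendered = renderLikeHive(LocalDateTime.of(2121, 7, 15, 15, 30, 12, 123_499_999));
        System.out.println(rendered);          // 2121-07-15 15:30:12.123499999
        System.out.println(rendered.length()); // 29 characters, as the reviewer notes
    }
}
```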
findepi left a comment
Force-pushed from f204670 to f6b307a
Get type directly from HiveColumnHandle instead of generating them from HiveType.
This will be irrespective of the precision configured or specified as session property.
Force-pushed from f6b307a to 5427d60
@findepi I have addressed the comments - we now return the same values as Hive, i.e. we treat the timestamp as nanoseconds irrespective of the precision specified in config or session property.
If the overall PR is okay - I could extract them into two
-        Type fromType = fromHiveType.getType(typeManager, timestampPrecision);
+        // Hive treats TIMESTAMP with NANOSECONDS precision and when we try to coerce from a timestamp column,
+        // we read it as TIMESTAMP(9) column and coerce accordingly.
+        Type fromType = fromHiveType.getType(typeManager, HiveTimestampPrecision.NANOSECONDS);
instead of changing here, let's inject this on the `return Optional.of(new LongTimestampToVarcharCoercer(timestampType, varcharType));` line, replacing timestampType with TIMESTAMP_NANOS
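The suggestion above is to leave the general Hive-to-Trino type mapping untouched and pin nanosecond precision only where the timestamp-to-varchar coercer is constructed. A self-contained sketch of that shape (the type and coercer classes here are simplified stand-ins, not Trino's actual SPI classes):

```java
import java.util.Optional;

public class CoercerSelection {
    // Simplified stand-ins for Trino's type and coercer machinery
    record TimestampType(int precision) {}
    record VarcharType(int length) {}
    record TimestampToVarcharCoercer(TimestampType from, VarcharType to) {}

    static final TimestampType TIMESTAMP_NANOS = new TimestampType(9);

    static Optional<TimestampToVarcharCoercer> createCoercer(TimestampType declaredFromType, VarcharType toType) {
        // Inject nanosecond precision here, at coercer construction, instead of
        // changing how fromHiveType maps to a Trino type globally; the declared
        // precision of the source column is deliberately ignored in this sketch
        return Optional.of(new TimestampToVarcharCoercer(TIMESTAMP_NANOS, toType));
    }

    public static void main(String[] args) {
        var coercer = createCoercer(new TimestampType(3), new VarcharType(30)).orElseThrow();
        System.out.println(coercer.from().precision()); // 9, regardless of the declared precision
    }
}
```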
-                fromHiveType.getType(typeManager));
+                // Hive treats TIMESTAMP with NANOSECONDS precision and when we try to coerce from a timestamp column,
+                // we read it as TIMESTAMP(9) column and coerce accordingly.
+                fromHiveType.getType(typeManager, HiveTimestampPrecision.NANOSECONDS));
what if fromHiveType is a Row/Struct and timestamp field isn't being coerced.
can injecting NANOS leak here?
in fact, i don't know why change here at all
Changing it here allows us to read the coerced column as NANOS from the underlying file - is there any extension point we could use to force the column to be read as NANOS?
fromHiveType.getType(typeManager) doesn't consider the precision from the session or config - so aren't we injecting MILLIS - which is mapped to DEFAULT precision here ?
fromHiveType.getType(typeManager) doesn't consider the precision from the session or config
that's why it was deprecated. the code before the changes wasn't good
but i don't know why NANOSECONDS is always good here.
is this applied for coerced columns only?
If we didn't apply NANOSECONDS, the data from the underlying ConnectorPageSource would be read as MILLISECONDS, and we can't get a LongTimestamp object from the block. IIUC this applies only if the coerced column is a timestamp column; I'll double check for complex columns like STRUCT. But without this change we would be leaking MILLIS - that is not intentional, right?
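The "can't get a LongTimestamp from the block" point comes from Trino's two timestamp representations: precisions 0-6 are "short" (a single long of epoch micros), while 7-12 use `LongTimestamp`, a two-field value of epoch micros plus picos within the micro. A simplified self-contained model of that split (not Trino's actual classes):

```java
public class TimestampRepresentation {
    // Simplified model of io.trino.spi.type.LongTimestamp:
    // epoch micros plus the picosecond fraction within that micro
    record LongTimestampModel(long epochMicros, int picosOfMicro) {}

    static boolean isShort(int precision) {
        // Trino stores precisions 0..6 in a single long of epoch micros
        return precision <= 6;
    }

    static LongTimestampModel fromEpochNanos(long epochNanos) {
        // A nanosecond value needs the long representation:
        // a lone micros value cannot carry the sub-microsecond digits
        long epochMicros = Math.floorDiv(epochNanos, 1_000);
        int picosOfMicro = (int) Math.floorMod(epochNanos, 1_000) * 1_000;
        return new LongTimestampModel(epochMicros, picosOfMicro);
    }

    public static void main(String[] args) {
        System.out.println(isShort(3)); // true  - TIMESTAMP(3) fits in a single long
        System.out.println(isShort(9)); // false - TIMESTAMP(9) needs the long form
        System.out.println(fromEpochNanos(1_123_499_999L));
    }
}
```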
         columnHandle.getBaseHiveColumnIndex(),
         fromHiveTypeBase,
-        fromHiveTypeBase.getType(typeManager),
+        fromHiveTypeBase.getType(typeManager, HiveTimestampPrecision.NANOSECONDS),
(same here)
I think we can continue the discussion here - #17604 (comment)
    }
    // Hive treats TIMESTAMP with NANOSECONDS precision and when we try to coerce from a timestamp column,
    // we read it as TIMESTAMP(9) column and coerce accordingly.
    TimestampType timestampType = createTimestampType(HiveTimestampPrecision.NANOSECONDS.getPrecision());
use TIMESTAMP_NANOS
    }
}

protected Map<String, List<Object>> expectedRowsForEngineProvider(Engine engine, HiveTimestampPrecision hiveTimestampPrecision)
expectedRowsForEngineProvider no longer uses its parameters (which is good)
inline
and replace expectedTrinoResults and expectedHiveResults with one variable
add 0001-01-01 00:00 test case. this is users' favorite date
For historical dates we don't support coercion, so this is the closest favorite date
This will be irrespective of the precision configured or specified as session property.
Hive 2.+ and Hive 3.+ use `java.sql.Timestamp#toString` for coercing Timestamp to Varchar types. `java.sql.Timestamp#toString` doesn't capture historical dates correctly
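The historical-date problem can be reproduced without Hive: modern file formats store epoch values computed with the proleptic Gregorian calendar, while `java.sql.Timestamp` formats through the legacy hybrid Julian/Gregorian calendar, so dates before the 1582 cutover shift. A minimal demonstration (the JVM is pinned to UTC so the output is deterministic):

```java
import java.sql.Timestamp;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.TimeZone;

public class HistoricalTimestampDemo {
    static String renderUtc(long epochMillis) {
        // Pin the default zone: Timestamp#toString formats in the default time zone
        TimeZone.setDefault(TimeZone.getTimeZone("UTC"));
        // java.sql.Timestamp interprets epoch millis with the hybrid
        // Julian/Gregorian calendar, so pre-1582 instants get Julian labels
        return new Timestamp(epochMillis).toString();
    }

    public static void main(String[] args) {
        // Epoch millis computed with the proleptic Gregorian calendar,
        // the way columnar file formats store the instant
        long epochMillis = LocalDateTime.of(1500, 1, 1, 0, 0)
                .toInstant(ZoneOffset.UTC)
                .toEpochMilli();
        System.out.println(renderUtc(epochMillis)); // not "1500-01-01 00:00:00.0"
    }
}
```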
Force-pushed from 5427d60 to 030c40c
Description
Timestamp to varchar coercion, which was introduced in #16869, has a few edge cases that are not covered.
In case of partitioned tables - we don't consider
Hive 2.+ and Hive 3.+ use `java.sql.Timestamp#toString` for coercing Timestamp to Varchar types. `java.sql.Timestamp#toString` doesn't capture historical dates correctly.

Additional context and related issues
Release notes
( ) This is not user-visible or docs only and no release notes are required.
(x) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text: