
Conversation

@Praveen2112
Member

Description

Timestamp to varchar coercion, which was introduced in #16869, has a few edge cases which are not covered:

  • Support non-default precision for partitioned tables.
    For partitioned tables we don't consider the configured timestamp precision.
  • Restrict support for historical dates.
    Hive 2.+ and Hive 3.+ use java.sql.Timestamp#toString for coercing Timestamp
    to Varchar types, and java.sql.Timestamp#toString doesn't render historical dates correctly
    (see the sketch below).
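
A minimal, illustrative sketch (not code from this PR) of why `java.sql.Timestamp#toString` breaks down for historical dates: Trino models timestamps with the proleptic Gregorian calendar, while `java.sql.Timestamp` renders instants with the legacy hybrid Julian/Gregorian calendar and the JVM time zone, so very old dates no longer round-trip.

```java
import java.sql.Timestamp;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class HistoricalTimestampToString
{
    public static void main(String[] args)
    {
        // Epoch millis for 0001-01-01 00:00:00 computed with the proleptic Gregorian
        // calendar, which is how Trino models timestamps internally.
        long epochMillis = LocalDateTime.of(1, 1, 1, 0, 0)
                .toInstant(ZoneOffset.UTC)
                .toEpochMilli();

        // java.sql.Timestamp interprets the same instant with the legacy hybrid calendar
        // and the default JVM time zone, so the rendered string is not "0001-01-01 00:00:00.0".
        System.out.println(new Timestamp(epochMillis));

        // Recent dates are unaffected and round-trip as expected.
        System.out.println(Timestamp.valueOf("2121-07-15 15:30:12.123499999"));
    }
}
```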

Additional context and related issues

Release notes

( ) This is not user-visible or docs only and no release notes are required.
(x) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

Use == for enum comparison
@cla-bot cla-bot bot added the cla-signed label May 23, 2023
@github-actions github-actions bot added hive Hive connector tests:hive labels May 23, 2023
@Praveen2112 Praveen2112 force-pushed the praveen/timestamp_string_coercion_fix branch 5 times, most recently from 11f847e to e3ac258 Compare May 24, 2023 05:10
@Praveen2112 Praveen2112 force-pushed the praveen/timestamp_string_coercion_fix branch from e3ac258 to 744bcd3 Compare May 24, 2023 07:01
@Praveen2112
Member Author

Praveen2112 commented May 29, 2023

@findepi Can you please take a look at this PR in your spare time?

Contributor

could be private?

Member Author

It allows us to override the values for a specific distribution of Hive or a customized engine.
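
For illustration, a hedged sketch of the kind of subclass this enables. The class name and the extra entry are hypothetical; `BaseTestHiveCoercion`, `ColumnContext`, and `expectedExceptionsWithTrinoContext` appear elsewhere in this PR and are used here as stand-ins for the method under discussion.

```java
import com.google.common.collect.ImmutableMap;

import java.util.Map;

// Hypothetical sketch: a distribution-specific test class overriding the protected
// provider, which would not be possible if the method were private.
public class TestSomeVendorHiveCoercion
        extends BaseTestHiveCoercion
{
    @Override
    protected Map<ColumnContext, String> expectedExceptionsWithTrinoContext()
    {
        return ImmutableMap.<ColumnContext, String>builder()
                .putAll(super.expectedExceptionsWithTrinoContext())
                // vendor-specific expected failures would be added here
                .buildOrThrow();
    }
}
```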

@Praveen2112 Praveen2112 force-pushed the praveen/timestamp_string_coercion_fix branch from 744bcd3 to f204670 Compare May 31, 2023 11:39
// TODO: These expected failures should be fixed.
return ImmutableMap.<ColumnContext, String>builder()
// Expected failures from BaseTestHiveCoercion
.putAll(super.expectedExceptionsWithTrinoContext())
Member

this is meaningless since super is empty right now.
squash with commit that changes super

Member

this is now squashed, but remains meaningless, as super remains empty.

maybe make super abstract instead?

Member Author

I'll remove that method instead; for now we don't have any exceptions in the base class that have to be skipped in each implementation.

Member

we should return the same results as Hive, not "somewhat similar"

the column is varchar. if Hive returns 2121-07-15 15:30:12.123499999 (29 characters), Trino should also return 2121-07-15 15:30:12.123499999 (29 characters).

remove this method

Member Author

So should it be irrespective of the timestamp precision specified as part of a session or config property?

Member Author

Hive timestamp doesn't have the concept of precision - how do we handle them in Trino?

Member

> So should it be irrespective of the timestamp precision specified as part of a session or config property?

i believe so

> Hive timestamp doesn't have the concept of precision

i know. they are always nanoseconds precision

> how do we handle them in trino

regular timestamps get truncated to the selected precision for performance reasons.
for columns that were timestamps and are now varchars, we don't have these performance considerations and we should be compatible with Hive.
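
To make the contrast concrete: the 29-character value comes from the comment above, while the truncated value is a hypothetical illustration of what a precision-limited read would produce.

```java
public class CoercedTimestampRendering
{
    public static void main(String[] args)
    {
        // What Hive returns for the coerced varchar column: the full nanosecond text.
        String hiveRendered = "2121-07-15 15:30:12.123499999";
        // What a read truncated to MILLIS precision would yield (hypothetical).
        String millisTruncated = "2121-07-15 15:30:12.123";

        // Trino should return hiveRendered, regardless of the configured or session
        // timestamp precision, because the column is now varchar.
        System.out.println(hiveRendered.length());     // 29
        System.out.println(millisTruncated.length());  // 23
    }
}
```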

@findepi findepi left a comment

@Praveen2112 Praveen2112 requested a review from hashhar June 1, 2023 08:54
@Praveen2112 Praveen2112 force-pushed the praveen/timestamp_string_coercion_fix branch from f204670 to f6b307a Compare June 2, 2023 11:43
Get the type directly from HiveColumnHandle instead of generating it
from HiveType.
This will be irrespective of the precision configured or specified as a
session property.
@Praveen2112 Praveen2112 force-pushed the praveen/timestamp_string_coercion_fix branch from f6b307a to 5427d60 Compare June 2, 2023 13:56
@Praveen2112
Copy link
Member Author

@findepi I have addressed the comments - now we are returning the same values as Hive, i.e. we are treating it as nanoseconds irrespective of the precision specified in the config or session property.

@Praveen2112
Copy link
Member Author

If the overall PR is okay - I could extract them into two

Type fromType = fromHiveType.getType(typeManager, timestampPrecision);
// Hive treats TIMESTAMP with NANOSECONDS precision and when we try to coerce from a timestamp column,
// we read it as TIMESTAMP(9) column and coerce accordingly.
Type fromType = fromHiveType.getType(typeManager, HiveTimestampPrecision.NANOSECONDS);
Member

instead of changing here, let's inject this on the `return Optional.of(new LongTimestampToVarcharCoercer(timestampType, varcharType));` line, replacing timestampType with TIMESTAMP_NANOS
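
A sketch of what that suggestion could look like. The surrounding condition is paraphrased, not the exact file contents; `TIMESTAMP_NANOS` is the constant from `io.trino.spi.type.TimestampType`.

```java
// Paraphrased sketch of the suggested change: keep the rest of the coercion logic
// untouched and inject TIMESTAMP(9) only where the coercer is created, so the
// varchar output matches Hive's nanosecond rendering.
if (toType instanceof VarcharType varcharType) {
    return Optional.of(new LongTimestampToVarcharCoercer(TIMESTAMP_NANOS, varcharType));
}
```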

fromHiveType.getType(typeManager));
// Hive treats TIMESTAMP with NANOSECONDS precision and when we try to coerce from a timestamp column,
// we read it as TIMESTAMP(9) column and coerce accordingly.
fromHiveType.getType(typeManager, HiveTimestampPrecision.NANOSECONDS));
Member

what if fromHiveType is a Row/Struct and the timestamp field isn't being coerced?
can injecting NANOS leak here?

in fact, i don't know why change here at all

Member Author

Changing it here allows us to read the coerced column as NANO from the underlying file - is there any extension point we could use to force the column to be read as NANO?

Member Author

fromHiveType.getType(typeManager) doesn't consider the precision from the session or config - so aren't we injecting MILLIS, which is mapped to the DEFAULT precision, here?

Member

> fromHiveType.getType(typeManager) doesn't consider the precision from the session or config

that's why it was deprecated. the code before the changes wasn't good

but i don't know why NANOSECONDS is always good here.
is this applied to coerced columns only?

Member Author

If we didn't apply NANOSECONDS then the data from the underlying ConnectorPageSource is read as MILLISECONDS, and we can't get a LongTimestamp object from the block. IIUC this is applied only if the coerced column is a timestamp column; I'll double check for complex columns like STRUCT. But without this change we would be leaking MILLIS - that is not intentional, right?
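
Background for this exchange, stated as general Trino SPI behavior rather than code from this PR: `TIMESTAMP(p)` with `p <= 6` is a short type encoded as a long of epoch micros, while `p >= 7` (e.g. `TIMESTAMP(9)`) is encoded as `LongTimestamp`; a column read at MILLIS therefore produces blocks from which a `LongTimestamp` cannot be extracted.

```java
import io.trino.spi.type.TimestampType;

public class TimestampEncodingNote
{
    public static void main(String[] args)
    {
        // Short encoding (a long holding epoch micros) for precision <= 6.
        System.out.println(TimestampType.TIMESTAMP_MILLIS.isShort()); // true
        // LongTimestamp encoding for precision >= 7, which the coercer needs here.
        System.out.println(TimestampType.TIMESTAMP_NANOS.isShort());  // false
    }
}
```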

columnHandle.getBaseHiveColumnIndex(),
fromHiveTypeBase,
fromHiveTypeBase.getType(typeManager),
fromHiveTypeBase.getType(typeManager, HiveTimestampPrecision.NANOSECONDS),
Member

(same here)

Member Author

I think we can continue the discussion here - #17604 (comment)

}
// Hive treats TIMESTAMP with NANOSECONDS precision and when we try to coerce from a timestamp column,
// we read it as TIMESTAMP(9) column and coerce accordingly.
TimestampType timestampType = createTimestampType(HiveTimestampPrecision.NANOSECONDS.getPrecision());
Member

use TIMESTAMP_NANOS

}
}

protected Map<String, List<Object>> expectedRowsForEngineProvider(Engine engine, HiveTimestampPrecision hiveTimestampPrecision)
Member

expectedRowsForEngineProvider no longer uses its parameters (which is good)

inline
and replace expectedTrinoResults and expectedHiveResults with one variable
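
A hypothetical sketch of that refactoring. The map contents, the column name, and the assertion helper are made up for illustration; the engine constants are assumed, and only the single-variable idea follows the review.

```java
// Hypothetical sketch: one expected-rows map shared by both engines, since Trino
// and Hive should now return identical varchar values for the coerced column.
Map<String, List<Object>> expectedRows = ImmutableMap.of(
        "timestamp_to_string", ImmutableList.of("2121-07-15 15:30:12.123499999"));
assertQueryResults(Engine.TRINO, expectedRows); // hypothetical helper
assertQueryResults(Engine.HIVE, expectedRows);  // hypothetical helper
```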

Member

add a 0001-01-01 00:00 test case. this is users' favorite date

Member Author

For historical dates we don't support coercion, so this is the closest favorite date.

This will be irrespective of the precision configured or specified as a
session property.
Hive 2.+ and Hive 3.+ use `java.sql.Timestamp#toString` for coercing Timestamp
to Varchar types. `java.sql.Timestamp#toString` doesn't render historical dates correctly.
@Praveen2112 Praveen2112 force-pushed the praveen/timestamp_string_coercion_fix branch from 5427d60 to 030c40c Compare June 5, 2023 11:33

Labels

cla-signed hive Hive connector
