Skip to content

Conversation

@Praveen2112
Copy link
Member

Description

Add support for timestamp to varchar coercer in hive tables. In case of partitioned table it is supported by most of the format and in case of unpartitioned tables only ORC format is supported as of now.

The coercion that was supported as of current master for unpartitioned tables are inherently supported by the ColumnReader and this PR introduces a framework which maps a OrcTypeKind to a corresponding TrinoType and also re-uses the TypeCoercer used by partitioned tables.

Additional context and related issues

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Section
* Add support for timestamp to varchar coercer in hive tables

@Praveen2112
Copy link
Member Author

Will be adding some additional test coverage for TimestampCoercer but wanted to have a early feedback here for the design/approach.

Copy link
Member

@skrzypo987 skrzypo987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok

@Praveen2112 Praveen2112 force-pushed the praveen/timestamp_coercion branch from 53b81db to 5468428 Compare April 4, 2023 11:44
@github-actions github-actions bot added hive Hive connector tests:hive labels Apr 4, 2023
@Praveen2112 Praveen2112 force-pushed the praveen/timestamp_coercion branch from 5468428 to 17eb591 Compare April 11, 2023 13:02
@Praveen2112 Praveen2112 marked this pull request as ready for review April 11, 2023 13:03
@Praveen2112
Copy link
Member Author

@skrzypo987 AC

@Praveen2112 Praveen2112 force-pushed the praveen/timestamp_coercion branch 2 times, most recently from 67ef13a to dcee921 Compare April 12, 2023 16:52
Copy link
Member

@skrzypo987 skrzypo987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have an IDE warning on this line

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you share the warning here - Mine is not showing any warning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When doing implicit coercion to timestamp, hive cast to a zone specific String - so wanted it to be in sync with UTC (like we do in Trino). This is only specific to ORC file for other formats it doesn't consider the timezone of the HiveServer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quick google search suggests that TZ env variable means time zone name. /usr/share/zoneinfo/UTC looks like a file path, not a time zone name. How does it work?

When doing implicit coercion to timestamp, hive cast to a zone specific String - so wanted it to be in sync with UTC (like we do in Trino).

would this information be useful for future code maintainers?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick google search suggested me to set TZ to this file and it worked for me. Is there any workaround I could try here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick google search suggested me to set TZ to this file

interesting. i should have taken a note what was the search terms in my quick google search...

what happens if you

container.withEnv("TZ", "UTC")

(instead of the file)?

and what happens if you

container.withEnv("TZ", "/dev/null")

or

container.withEnv("TZ", "absolutely-random-string")

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like specifying a timezone also works like passing the file

container.withEnv("TZ", "UTC")

If we pass

container.withEnv("TZ", "/dev/null")

It switches to default value UTC

For

container.withEnv("TZ", "absolutely-random-string")

It switches to default value UTC.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we pass

container.withEnv("TZ", "/dev/null")

It switches to default value UTC

For

container.withEnv("TZ", "absolutely-random-string")

It switches to default value UTC.

so /usr/share/zoneinfo/UTC looks like something meaningful, but actually is "garbage time zone" and causes something (the JVM?) to switch to UTC.

funny

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually /usr/share/zoneinfo/UTC is not a garbage timezone - pointing to a different file like usr/share/zoneinfo/EET changes to EEST timezone unlike UTC

@Praveen2112
Copy link
Member Author

@findepi AC

@Praveen2112
Copy link
Member Author

Will fix a few test failures

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be clearer to define a BlockTypeCoercion interface for public Block apply(Block block) and then use that in TypeCoercer instead of Function<Block, Block> and here instead of UnaryOperator<Block>.
It will also be easier to track possible implementations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The from and to types of the coercion would be useful info in the toString

@Praveen2112 Praveen2112 force-pushed the praveen/timestamp_coercion branch from a3246bd to 174032d Compare April 20, 2023 08:15
@Praveen2112
Copy link
Member Author

@findepi / @raunaqmorarka AC

@Praveen2112 Praveen2112 force-pushed the praveen/timestamp_coercion branch 2 times, most recently from 0bc0616 to 3c55156 Compare April 20, 2023 08:18
@Praveen2112
Copy link
Member Author

@raunaqmorarka AC (first one)

@Praveen2112
Copy link
Member Author

@findepi / @raunaqmorarka Can you PTAL ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: move above all other members

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the other members are static final so do we need to move the the default constructor ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively we could support to-varchar directly in io.trino.orc.reader.TimestampColumnReader,

no change requested (i like the current approach)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could but orc is being used in iceberg connector also so we might be applying coercion (IIRC it is not supported by iceberg)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick google search suggested me to set TZ to this file

interesting. i should have taken a note what was the search terms in my quick google search...

what happens if you

container.withEnv("TZ", "UTC")

(instead of the file)?

and what happens if you

container.withEnv("TZ", "/dev/null")

or

container.withEnv("TZ", "absolutely-random-string")

?

@Praveen2112 Praveen2112 force-pushed the praveen/timestamp_coercion branch from 05556a4 to 367940c Compare April 26, 2023 14:32
@Praveen2112
Copy link
Member Author

@findepi AC

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we pass

container.withEnv("TZ", "/dev/null")

It switches to default value UTC

For

container.withEnv("TZ", "absolutely-random-string")

It switches to default value UTC.

so /usr/share/zoneinfo/UTC looks like something meaningful, but actually is "garbage time zone" and causes something (the JVM?) to switch to UTC.

funny

@Praveen2112 Praveen2112 force-pushed the praveen/timestamp_coercion branch from 367940c to 420ef58 Compare April 26, 2023 16:21
@Praveen2112
Copy link
Member Author

@findepi AC

@Praveen2112 Praveen2112 force-pushed the praveen/timestamp_coercion branch from 420ef58 to af5b3d2 Compare April 27, 2023 06:36
@Praveen2112
Copy link
Member Author

Did a few minor changes on TestTimestampCoercer which could help us in #17071

@Praveen2112 Praveen2112 force-pushed the praveen/timestamp_coercion branch from af5b3d2 to 5e298c7 Compare April 27, 2023 08:38
@Praveen2112 Praveen2112 force-pushed the praveen/timestamp_coercion branch from 5e298c7 to c888e65 Compare April 27, 2023 10:12
@Praveen2112
Copy link
Member Author

@huberty89 AC

@Praveen2112
Copy link
Member Author

@raunaqmorarka Gentle ping

@Praveen2112 Praveen2112 requested a review from raunaqmorarka May 2, 2023 08:37
Copy link
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm
I'm wondering why this is specifically supported for ORC and not parquet or other file formats for unpartitioned columns. Does Apache Hive support this for non-orc formats ? If this coercion is to be supported regardless of file format, then we might want the coercion to be done in HivePageSource rather than OrcPageSource

@Praveen2112 Praveen2112 force-pushed the praveen/timestamp_coercion branch from c888e65 to 60285de Compare May 2, 2023 10:11
@Praveen2112
Copy link
Member Author

why this is specifically supported for ORC and not parquet or other file formats for unpartitioned columns

Parquet doesn't support all types of coercion like ORC - For instance Parquet supports timestamp to varchar coercion but it doesn't support varchar to timestamp coercion, while both are supported by ORC formats. So to start with we are planning for orc and then we could extend it for other file formats.

then we might want the coercion to be done in HivePageSource rather than OrcPageSource

In that case we might need to introduce some sort of an abstraction that allows to Reader to provide the schema and compare with the HivePageSource in case of partitioned tables we compare the global schema with a local partition schema and we apply coercion policy.

Implementing for ORC allows us to create some sort of a testing framework which could be utilized for other formats in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed hive Hive connector

Development

Successfully merging this pull request may close these issues.

5 participants