[New Transformation function] Presto compatible DateTrunc #4740

Merged
Jackie-Jiang merged 1 commit into apache:master from agrawaldevesh:presto_date_trunc
Oct 29, 2019
@agrawaldevesh
Contributor

This will only be called by the Presto connector (prestodb/presto#13504), and it has identical semantics
to Presto's SQL date_trunc, albeit with a timezone specialization.

This is needed so that Presto's date_trunc invocations can be
faithfully translated as is to this new function. It's a new function so
that it is trivial to roll out harmlessly without a lot of regression
testing.

Without this function, we can handle neither timezones nor week truncations
with Pinot's existing dateTimeConvert function.

It basically copies the PrestoDB code:
https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/DateTimeFunctions.java
(the specializations of date_trunc for TIMESTAMP and
TIMESTAMP_WITH_TIME_ZONE).

I am even checking in the zone-index.properties used by Presto to ensure that
even the time zones are 1:1 between Presto and this function (synced to the
latest prestodb repo).

Understanding this UDF requires knowledge of the joda-time API. I am not
documenting this heavily since it is a copy of the Presto UDF.

@agrawaldevesh
Contributor Author

@snleee @mayankshriv can you please review this PR.

The context is that we would like Presto's use of date_trunc (via PR prestodb/presto#13504 in prestodb) to be translated to the new Presto-compatible date_trunc that I am adding to Pinot. This will ensure faithful translation supporting both timezones and week truncations.

Currently the Pinot dateTimeConvert function truncates to the week starting at Thursday and that is incorrect from presto's standpoint.
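
One likely explanation for the Thursday week start is that bucketing by fixed 7-day windows counted from the Unix epoch inherits the weekday of 1970-01-01, which was a Thursday. The sketch below is only an illustration of that effect, not Pinot or Presto code; the function names are hypothetical, and the Monday-based variant assumes ISO weeks, which is what Joda-Time's ISO chronology uses.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)  # 1970-01-01 was a Thursday
WEEK_MS = 7 * 24 * 60 * 60 * 1000


def to_utc(ts_ms: int) -> datetime:
    """Convert epoch milliseconds to an aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ts_ms)


def epoch_week_floor(ts_ms: int) -> int:
    """Bucket into fixed 7-day windows counted from the epoch (hypothetical helper)."""
    return ts_ms - ts_ms % WEEK_MS


def iso_week_floor_utc(ts_ms: int) -> int:
    """Truncate to Monday 00:00 UTC of the containing ISO week (hypothetical helper)."""
    dt = to_utc(ts_ms)
    midnight = dt.replace(hour=0, minute=0, second=0, microsecond=0)
    monday = midnight - timedelta(days=midnight.weekday())  # weekday(): Monday == 0
    return (monday - EPOCH) // timedelta(milliseconds=1)


ts = 1572325440000  # 2019-10-29T05:04:00Z, a Tuesday
print(to_utc(epoch_week_floor(ts)).date())    # 2019-10-24 (a Thursday)
print(to_utc(iso_week_floor_utc(ts)).date())  # 2019-10-28 (a Monday)
```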

@kishoreg kishoreg requested a review from npawar October 23, 2019 23:26
Member

@kishoreg kishoreg left a comment

LGTM. This is adding the functionality described here, right? https://mode.com/blog/date-trunc-sql-timestamp-function-count-on

If this is generic enough and we can use this without Presto, let's drop the Presto prefix.

@agrawaldevesh
Contributor Author

Thanks for the review @Jackie-Jiang and @kishoreg. Updated per your feedback.

@codecov-io

codecov-io commented Oct 24, 2019

Codecov Report

Merging #4740 into master will decrease coverage by 0.03%.
The diff coverage is 55.55%.

@@             Coverage Diff              @@
##             master    #4740      +/-   ##
============================================
- Coverage     57.79%   57.75%   -0.04%     
  Complexity        4        4              
============================================
  Files          1207     1209       +2     
  Lines         64744    64933     +189     
  Branches       9413     9456      +43     
============================================
+ Hits          37419    37503      +84     
- Misses        24493    24586      +93     
- Partials       2832     2844      +12
Impacted Files | Coverage Δ | Complexity Δ
...r/transform/function/TransformFunctionFactory.java | 69.64% <100%> (+0.55%) | 0 <0> (ø) ⬇️
.../core/operator/transform/function/TimeZoneKey.java | 38.93% <38.93%> (ø) | 0 <0> (?)
...transform/function/DateTruncTransformFunction.java | 80% <80%> (ø) | 0 <0> (?)
...he/pinot/core/query/pruner/ValidSegmentPruner.java | 57.14% <0%> (-28.58%) | 0% <0%> (ø)
...altime/ServerSegmentCompletionProtocolHandler.java | 35.11% <0%> (-15.27%) | 0% <0%> (ø)
...e/operator/dociditerators/MVScanDocIdIterator.java | 46.03% <0%> (-14.29%) | 0% <0%> (ø)
...impl/dictionary/FloatOffHeapMutableDictionary.java | 60.21% <0%> (-12.91%) | 0% <0%> (ø)
...impl/dictionary/DoubleOnHeapMutableDictionary.java | 37.8% <0%> (-8.54%) | 0% <0%> (ø)
...e/impl/dictionary/LongOnHeapMutableDictionary.java | 56.09% <0%> (-7.32%) | 0% <0%> (ø)
...e/pinot/common/utils/FileUploadDownloadClient.java | 63.25% <0%> (-6.63%) | 0% <0%> (ø)
... and 30 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 716a3b2...a6e01b2.

@agrawaldevesh
Contributor Author

I am not sure what led to the massive decrease in code coverage. The report looks fishy and lists several unrelated files. I did add unit tests for these new classes :). Can you help me figure out whether the code coverage decrease is legit or not? Thanks!

@Jackie-Jiang
Contributor

@agrawaldevesh You don't need to worry about the code coverage. It could be that some coverage files are not sent to the server.

@agrawaldevesh
Contributor Author

@Jackie-Jiang Thanks for your comments earlier. I have incorporated all of your feedback. Thank you !

@agrawaldevesh
Contributor Author

Hi @siddharthteotia ... take a look at the PR now. I added the requested e2e unit test and also added documentation. So this UDF is now "unhidden" and can be used by anyone.

I believe this should allay your concerns and allow this PR to be merged in. Let me know if you think something else needs to be done here.

Thanks for the review !

@siddharthteotia
Contributor

Thanks, @agrawaldevesh. LGTM

Contributor

@Jackie-Jiang Jackie-Jiang left a comment

If the input value is simply a long value (no timezone info included), how do we decide the time zone for the output value? Or do we always assume the input value is in UTC time?
By reading the documentation for Presto, PostgreSQL and Redshift, it seems all of them take only 2 arguments, and that is much easier to use (which is very similar to the current TimeConvert). What is missing on the Pinot side is TimeZoneConvert. Am I missing something here?

@agrawaldevesh
Contributor Author

Hi @Jackie-Jiang,

First some context: This diff didn't start out as (and it is not my intention for it to be) a general time truncation UDF. Instead, it was meant to be called only by the presto-pinot connector and thus generate a value that Presto can understand natively. Thus it returned "long milliseconds since epoch in UTC" because that is what Presto natively understands as time.

The typical way for this function to be invoked is:

Pinot lacks a proper timestamp type, that is, a type that can also encode the other bits of information associated with a timestamp: the timezone and the granularity.

Presto achieves this with two timestamp types:

  • timestamp -> Stores the milliseconds since epoch and is always in UTC
  • timestamp_with_tz -> Stores the 'timestamp above' and a couple of bits to denote the timezone.

Given this:

  • date_trunc('hour', timestamp) -> returns a timestamp
  • date_trunc('hour', timestamp_with_tz) -> returns a timestamp with timezone.

They also have functions to change timezones, ie to go from a timestamp to a timestamp_with_tz and to change the timezone burned into a timestamp_with_tz.

In the absence of a proper pinot type representing time or timezones, we need to inline that into the return type and thus have arguments to configure them. For example, we can have additional input arguments saying what the output timezone and the output time granularity are in.

We similarly need additional arguments to specify what the input granularity and timezone are.

And thus the need for 6 or so arguments:

  • Input long value
  • Truncation interval: hour, day, quarter etc
  • Input granularity (ms or seconds etc)
  • Input timezone
  • Output granularity. (When called from Presto this is always in Milliseconds)
  • Output timezone. (When called from Presto this is always UTC)

(For full disclosure, I don't think we should allow the output timezone to be changed. What would a user do with that ? Because now they need to remember what the output timezone is)

What do you think? Should we make this function fully general with these 6 arguments, or should we keep it very custom and specific to Presto and thereby reduce it to three free arguments (the input value, the truncation interval and the input timezone), with the rest hardcoded:

  • Input long value
  • Truncation interval: hour, day, quarter etc
  • Input granularity (Hardcoded to seconds when called via Presto)
  • Input timezone
  • Output granularity. (Hardcoded to milliseconds when called via Presto)
  • Output timezone. (When called from Presto this is always UTC)

I like your idea of thinking about this more fully and introducing more time UDFs like timezone conversion, or something to print timestamps with timezones. But I think that is widening the scope of this PR: I need something minimal such that our presto-pinot customers can translate their presto expressions like below to pinot:

date_trunc('month', from_unixtime(ts_in_millis / 1000, 'America/Los_Angeles'))

If it unblocks this PR: I can create an issue for better time handling in Pinot and narrow this PR to be very Presto specific and remove this function from the Pinot documentation (ie let it remain hidden).

Please let me know what you would need to see in this PR so that it can be merged in.

@Jackie-Jiang
Contributor

@agrawaldevesh Thanks for the explanation.
IMO, date_trunc() should not have any knowledge of the timezone, but only truncate the time based on the given timestamp. Given the input value, date_trunc() should be able to figure out the input granularity automatically, and truncate it to the specified truncation interval. With these, we can achieve the standard date_trunc() semantics with only 2 arguments, e.g. date_trunc('DAYS', timestamp).
Then the missing part is a timezone conversion function from an arbitrary timezone & value to a timestamp in another timezone. The time value can be of numeric type or string type (e.g. DateTimeFormat).
With these 2 functions decoupled and chainable, it becomes much more flexible. We can support timestamp/DateTimeFormat input in any timezone, and return a truncated timestamp in another timezone. Also, we have the standard semantics for the date_trunc() function.
What do you think? Does this proposal meet all the requirements for the Presto connector?

@agrawaldevesh
Contributor Author

Hi @Jackie-Jiang ... I understand that splitting up the functions and making them composable is flexible and great. But I am afraid that the approach you have proposed won't work:

In joda-time, the library we are using, truncation depends on the timezone. Which makes sense: How do you truncate to a quarter, when the quarter starts at different timestamps across the world?

We cannot work around the fact that Pinot does not have a "Timestamp with TZ" type, so there is no way to know the input timezone without specifying it as an argument. So at a minimum we need three arguments: the input timezone, the truncation granularity ('hour', 'week', etc.), and the input value.

We need the timezone for truncation: either as an input field or as a whole new type. The latter is obviously a much bigger change :-)
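
To make the timezone dependence concrete, here is a minimal sketch that truncates the same epoch instant to the start of its day in two different zones. It uses Python's standard `zoneinfo` purely as a stand-in for Joda-Time; `trunc_day` is a hypothetical helper, not the function added in this PR.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def trunc_day(ts_ms: int, tz_name: str) -> int:
    """Return epoch millis of local midnight (in tz_name) for the day containing ts_ms."""
    local = datetime.fromtimestamp(ts_ms // 1000, tz=ZoneInfo(tz_name))
    start = local.replace(hour=0, minute=0, second=0, microsecond=0, fold=0)
    return int(start.timestamp()) * 1000


def iso_utc(ts_ms: int) -> str:
    """Render epoch millis as an ISO-8601 UTC string, for display only."""
    return datetime.fromtimestamp(ts_ms // 1000, tz=timezone.utc).isoformat()


ts = 1572325440000  # 2019-10-29T05:04:00Z, which is still Oct 28 (22:04) in Los Angeles
print(iso_utc(trunc_day(ts, "UTC")))                  # 2019-10-29T00:00:00+00:00
print(iso_utc(trunc_day(ts, "America/Los_Angeles")))  # 2019-10-28T07:00:00+00:00
```

The input is the same long in both calls; only the truncation zone differs, and the two results are 17 hours apart. A two-argument date_trunc over a bare long has no way to choose between them.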

@Jackie-Jiang
Contributor

@agrawaldevesh I think I might have missed something? Within the same timezone (both input and output value in the same timezone), will date_trunc always return the same value?
E.g.

  • Input value in timezone1, truncate the time in timezone1, output value in timezone 2: timezone_convert(date_trunc('DAYS', input), timezone1, timezone2)
  • Input value in timezone1, output value in timezone2, truncate the time in timezone2:
    date_trunc('DAYS', timezone_convert(input, timezone1, timezone2))

If this does not work, and we need to have the timezone info, then I agree we should put it in as an input field. We can make it optional and default to UTC.

@agrawaldevesh
Contributor Author

  • Input value in timezone1, truncate the time in timezone1, output value in timezone 2: timezone_convert(date_trunc('DAYS', input), timezone1, timezone2)

The problem is that "input" is just a long. How would date_trunc know that 'input' is in timezone1 ?

@Jackie-Jiang
Contributor

@agrawaldevesh I see where I was missing. So both the input and output values are unix time (in UTC), and we just want to truncate the time based on some timezone right?
Then the current approach is good. My suggestion would be making timezone (default UTC), input granularity (default millis) and output granularity (default millis) optional. The function can take 2-5 arguments, and by default using UTC to truncate in millis.

@Jackie-Jiang
Contributor

Also, based on the input value, input granularity can also be automatically figured out

@agrawaldevesh agrawaldevesh force-pushed the presto_date_trunc branch 2 times, most recently from 2276b92 to a112f08 Compare October 29, 2019 05:04
@agrawaldevesh
Contributor Author

Hi @Jackie-Jiang ... thanks for discussing the shape of this function with me on Slack. I have updated it per what we discussed:

The new argument format is:

  • truncation granularity
  • Input value that is always since UTC epoch
  • input granularity: milliseconds, seconds etc
  • optional: Truncation TZ defaulting to UTC
  • optional: Output granularity defaulting to Input granularity

Thanks for helping me nail this interface.
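
As a rough illustration of the argument format listed above, the following sketch models a function with the same five arguments and the same defaults. It is not the Pinot implementation: it uses Python's `zoneinfo` instead of Joda-Time, and the unit names, the granularity names and the `MS_PER_UNIT` table are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Assumed granularity names; the real function may accept a different set.
MS_PER_UNIT = {"MILLISECONDS": 1, "SECONDS": 1000, "MINUTES": 60_000, "HOURS": 3_600_000}


def date_trunc(unit, value, input_granularity, tz="UTC", output_granularity=None):
    """Truncate an epoch value to `unit` boundaries as observed in timezone `tz`."""
    out = output_granularity or input_granularity  # default: same as input
    ts_ms = value * MS_PER_UNIT[input_granularity]
    local = datetime.fromtimestamp(ts_ms // 1000, tz=ZoneInfo(tz))
    if unit == "hour":
        floored = local.replace(minute=0, second=0, microsecond=0)
    else:
        day = local.replace(hour=0, minute=0, second=0, microsecond=0, fold=0)
        if unit == "day":
            floored = day
        elif unit == "week":  # ISO week, Monday start
            floored = day - timedelta(days=day.weekday())
        elif unit == "month":
            floored = day.replace(day=1)
        elif unit == "quarter":
            floored = day.replace(month=3 * ((day.month - 1) // 3) + 1, day=1)
        elif unit == "year":
            floored = day.replace(month=1, day=1)
        else:
            raise ValueError(f"unsupported unit: {unit}")
    return int(floored.timestamp()) * 1000 // MS_PER_UNIT[out]


ts_in_millis = 1572577200000  # 2019-11-01T03:00:00Z, still Oct 31 in Los Angeles
# Month start in Los Angeles is 2019-10-01T07:00:00Z; in UTC it is 2019-11-01T00:00:00Z.
print(date_trunc("month", ts_in_millis, "MILLISECONDS", "America/Los_Angeles"))
print(date_trunc("month", ts_in_millis, "MILLISECONDS"))
```

Under these assumptions, the first call plays the role of the Presto expression date_trunc('month', from_unixtime(ts_in_millis / 1000, 'America/Los_Angeles')) discussed earlier in the thread: input and output are both epoch values, and the timezone only selects where the month boundary falls.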

Contributor

@Jackie-Jiang Jackie-Jiang left a comment

LGTM with minor comments

…o's date_trunc

This will only be called by the Presto connector (prestodb/presto#13504), and it has identical semantics
to Presto's SQL date_trunc, albeit with a timezone specialization.

Please see details on Presto's date_trunc function here:
https://prestodb.github.io/docs/current/functions/datetime.html

This is needed so that Presto's date_trunc invocations can be
faithfully translated as is to this new function. It's a new function so
that it is trivial to roll out harmlessly without a lot of regression
testing.

Without this function, we can handle neither timezones nor week truncations
with Pinot's existing dateTimeConvert function.

It basically copies the PrestoDB code:
https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/DateTimeFunctions.java
(the specializations of date_trunc for TIMESTAMP and
TIMESTAMP_WITH_TIME_ZONE).

I am even checking in the zone-index.properties used by Presto to ensure that
even the time zones are 1:1 between Presto and this function (synced to the
latest prestodb/presto repo).

Understanding this UDF requires knowledge of the joda-time API. I am not
documenting this heavily since it is a copy of the Presto UDF.
@Jackie-Jiang Jackie-Jiang merged commit 4f849ea into apache:master Oct 29, 2019
chenboat pushed a commit to chenboat/incubator-pinot that referenced this pull request Nov 9, 2019
chenboat pushed a commit to chenboat/incubator-pinot that referenced this pull request Nov 15, 2019
5 participants