[New Transformation function] Presto compatible DateTrunc by agrawaldevesh · Pull Request #4740 · apache/pinot

agrawaldevesh · 2019-10-23T23:07:25Z

This function will only be called by the Presto connector (prestodb/presto#13504), and its semantics are identical
to Presto's SQL date_trunc, albeit with a timezone specialization.

This is needed so that Presto's date_trunc invocations can be
faithfully translated as-is to this new function. It is a new function so
that it is trivial to roll out harmlessly, without a lot of regression
testing.

Without this function, we can handle neither timezones nor week truncations
with Pinot's existing dateTimeConvert function.

It basically copies the PrestoDB code:
https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/DateTimeFunctions.java
(the specializations of date_trunc for TIMESTAMP and
TIMESTAMP_WITH_TIME_ZONE).

I am even checking in the zone-index.properties used by Presto to ensure that
even the time zones are 1:1 between Presto and this function (synced to the
latest prestodb repo).

Understanding this UDF requires knowledge of the joda-time API. I am not
documenting this heavily since it is a copy of the Presto UDF.

agrawaldevesh · 2019-10-23T23:12:12Z

@snleee @mayankshriv can you please review this PR.

The context is that we would like Presto's use of date_trunc (via PR prestodb/presto#13504 in prestodb) to be translated to this new Presto-compatible date_trunc that I am adding to Pinot. This ensures a faithful translation that supports both timezones and week truncations.

Currently the Pinot dateTimeConvert function truncates to the week starting on Thursday, which is incorrect from Presto's standpoint.
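
A minimal sketch of why a week bucket can land on Thursday, assuming the bucketing is done with fixed-width arithmetic from the Unix epoch (1970-01-01 was a Thursday). This is an illustration in plain java.time, not Pinot's dateTimeConvert code; the class and method names are hypothetical. The second method shows the Monday-based week that a calendar-aware truncation produces.

```java
import java.time.DayOfWeek;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.temporal.TemporalAdjusters;

// Hypothetical illustration; not part of Pinot or Presto.
final class WeekBucketSketch {
  static final long WEEK_MILLIS = 7L * 24 * 60 * 60 * 1000;

  // Fixed-width bucketing: weeks are counted from the epoch, which fell on a Thursday.
  static long epochAlignedWeek(long epochMillis) {
    return Math.floorDiv(epochMillis, WEEK_MILLIS) * WEEK_MILLIS;
  }

  // Calendar-aware truncation: start of the Monday-based week, evaluated in UTC.
  static long mondayAlignedWeek(long epochMillis) {
    return Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC).toLocalDate()
        .with(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY))
        .atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
  }

  // Day of week (in UTC) of the given instant, used to inspect the bucket start.
  static DayOfWeek dayOfWeekUtc(long epochMillis) {
    return Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC).getDayOfWeek();
  }

  public static void main(String[] args) {
    long ts = Instant.parse("2019-10-23T23:07:25Z").toEpochMilli();
    long epochWeek = epochAlignedWeek(ts);
    long mondayWeek = mondayAlignedWeek(ts);
    System.out.println(Instant.ofEpochMilli(epochWeek) + " " + dayOfWeekUtc(epochWeek));
    System.out.println(Instant.ofEpochMilli(mondayWeek) + " " + dayOfWeekUtc(mondayWeek));
  }
}
```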

kishoreg

LGTM. This is adding the functionality described here, right? https://mode.com/blog/date-trunc-sql-timestamp-function-count-on

If this is generic enough and we can use it without Presto, let's drop the Presto prefix.

pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/PrestoDateTrunc.java

...src/main/java/org/apache/pinot/core/operator/transform/function/PrestoDateTruncFunction.java

pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/TimeZoneKey.java

...src/main/java/org/apache/pinot/core/operator/transform/function/PrestoDateTruncFunction.java

...rc/main/java/org/apache/pinot/core/operator/transform/function/TransformFunctionFactory.java

...src/main/java/org/apache/pinot/core/operator/transform/function/PrestoDateTruncFunction.java

agrawaldevesh · 2019-10-24T03:35:01Z

Thanks for the review @Jackie-Jiang and @kishoreg. Updated per your feedback.

codecov-io · 2019-10-24T04:19:04Z

Codecov Report

Merging #4740 into master will decrease coverage by 0.03%.
The diff coverage is 55.55%.

@@             Coverage Diff              @@
##             master    #4740      +/-   ##
============================================
- Coverage     57.79%   57.75%   -0.04%     
  Complexity        4        4              
============================================
  Files          1207     1209       +2     
  Lines         64744    64933     +189     
  Branches       9413     9456      +43     
============================================
+ Hits          37419    37503      +84     
- Misses        24493    24586      +93     
- Partials       2832     2844      +12

Impacted Files	Coverage Δ	Complexity Δ
...r/transform/function/TransformFunctionFactory.java	`69.64% <100%> (+0.55%)`	`0 <0> (ø)`	⬇️
.../core/operator/transform/function/TimeZoneKey.java	`38.93% <38.93%> (ø)`	`0 <0> (?)`
...transform/function/DateTruncTransformFunction.java	`80% <80%> (ø)`	`0 <0> (?)`
...he/pinot/core/query/pruner/ValidSegmentPruner.java	`57.14% <0%> (-28.58%)`	`0% <0%> (ø)`
...altime/ServerSegmentCompletionProtocolHandler.java	`35.11% <0%> (-15.27%)`	`0% <0%> (ø)`
...e/operator/dociditerators/MVScanDocIdIterator.java	`46.03% <0%> (-14.29%)`	`0% <0%> (ø)`
...impl/dictionary/FloatOffHeapMutableDictionary.java	`60.21% <0%> (-12.91%)`	`0% <0%> (ø)`
...impl/dictionary/DoubleOnHeapMutableDictionary.java	`37.8% <0%> (-8.54%)`	`0% <0%> (ø)`
...e/impl/dictionary/LongOnHeapMutableDictionary.java	`56.09% <0%> (-7.32%)`	`0% <0%> (ø)`
...e/pinot/common/utils/FileUploadDownloadClient.java	`63.25% <0%> (-6.63%)`	`0% <0%> (ø)`
... and 30 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 716a3b2...a6e01b2.

agrawaldevesh · 2019-10-24T04:58:06Z

I am not sure what led to the massive decrease in code coverage. The report looks fishy and lists several unrelated files. I did add unit tests for these new classes :). Can you help me figure out whether the code coverage decrease is legit or not? Thanks!

Jackie-Jiang · 2019-10-24T20:43:02Z

@agrawaldevesh You don't need to worry about the code coverage. It could be that some coverage files are not sent to the server.

...-core/src/main/java/org/apache/pinot/core/operator/transform/function/DateTruncFunction.java

agrawaldevesh · 2019-10-25T05:58:24Z

@Jackie-Jiang Thanks for your comments earlier. I have incorporated all of your feedback. Thank you !

...t/java/org/apache/pinot/core/operator/transform/function/DateTruncTransformFunctionTest.java

agrawaldevesh · 2019-10-26T06:46:20Z

Hi @siddharthteotia ... take a look at the PR now. I added the requested e2e unit test and also added documentation. So this UDF is now "unhidden" and can be used by anyone.

I believe this should allay your concerns and allow this PR to be merged in. Let me know if you think something else needs to be done here.

Thanks for the review !

siddharthteotia · 2019-10-26T08:53:23Z

Thanks, @agrawaldevesh. LGTM

Jackie-Jiang

If the input value is simply a long value (no timezone info included), how do we decide the time zone for the output value? Or do we always assume the input value is in UTC time?
From reading the documentation for Presto, PostgreSQL and Redshift, it seems all of them take only 2 arguments, and that is much easier to use (and very similar to the current TimeConvert). What is missing on the Pinot side is TimeZoneConvert. Am I missing something here?

docs/pql_examples.rst

agrawaldevesh · 2019-10-28T20:57:05Z

Hi @Jackie-Jiang,

First some context: this diff didn't start out as (and I don't intend it to be) a general time truncation UDF. Instead, it was meant to be called only by the presto-pinot connector and thus to generate a value that Presto can understand natively. That is why it returned "long milliseconds since epoch in UTC": that is what Presto natively understands as time.

The typical way for this function to be invoked is:

Pinot lacks a proper timestamp type: one that can encode the other bits of information associated with a time value, namely the timezone and the granularity.

Presto achieves this with two timestamp types:

timestamp -> Stores the milliseconds since epoch and is always in UTC
timestamp_with_tz -> Stores the 'timestamp above' and a couple of bits to denote the timezone.

Given this:

date_trunc('hour', timestamp) -> returns a timestamp
date_trunc('hour', timestamp_with_tz) -> returns a timestamp with timezone.

They also have functions to change timezones, ie to go from a timestamp to a timestamp_with_tz and to change the timezone burned into a timestamp_with_tz.

In the absence of a proper pinot type representing time or timezones, we need to inline that into the return type and thus have arguments to configure them. For example, we can have additional input arguments saying what the output timezone and the output time granularity are in.

We similarly need additional arguments to specify what the input granularity and timezone are.

And thus the need for 6 or so arguments:

Input long value
Truncation interval: hour, day, quarter etc
Input granularity (ms or seconds etc)
Input timezone
Output granularity. (When called from Presto this is always in Milliseconds)
Output timezone. (When called from Presto this is always UTC)

(For full disclosure, I don't think we should allow the output timezone to be changed. What would a user do with that ? Because now they need to remember what the output timezone is)

What do you think ? Should we make this function be fully general with these 6 arguments or should we keep it very custom and specific to Presto and thereby reduce it down to three arguments:

Input long value
Truncation interval: hour, day, quarter etc
~~Input granularity (Hardcoded to seconds when called via Presto)~~
Input timezone
~~Output granularity. (Hardcoded to milliseconds when called via Presto)~~
~~Output timezone. (When called from Presto this is always UTC)~~

I like your idea of thinking about this more fully and introducing more time UDFs like timezone conversion, or something to print timestamps with timezones. But I think that is widening the scope of this PR: I need something minimal such that our presto-pinot customers can translate their presto expressions like below to pinot:

date_trunc('month', from_unixtime(ts_in_millis / 1000, 'America/Los_Angeles'))
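
As a rough illustration of what that expression computes, here is a sketch in plain java.time (the PR itself uses joda-time, so this is a swapped-in library, and the class and method names are hypothetical). It assumes from_unixtime takes seconds since the epoch plus a zone name, as in the expression above, and that the month is truncated in that zone.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

// Hypothetical illustration of the Presto expression above; not the PR's code.
final class PrestoExprSketch {
  // Mirrors date_trunc('month', from_unixtime(tsInMillis / 1000, zoneId)).
  static ZonedDateTime monthTrunc(long tsInMillis, String zoneId) {
    ZoneId zone = ZoneId.of(zoneId);
    // from_unixtime(seconds, zone): interpret the instant in the given zone.
    ZonedDateTime local = Instant.ofEpochSecond(tsInMillis / 1000).atZone(zone);
    // date_trunc('month', ...): midnight on the first day of the local month.
    return local.toLocalDate().withDayOfMonth(1).atStartOfDay(zone);
  }

  public static void main(String[] args) {
    long ts = Instant.parse("2019-11-01T03:00:00Z").toEpochMilli();
    // The same instant falls in different months depending on the zone.
    System.out.println(monthTrunc(ts, "America/Los_Angeles"));
    System.out.println(monthTrunc(ts, "UTC"));
  }
}
```

The instant used in main is already in November in UTC but still in October in Los Angeles, so the two calls truncate to different months.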

If it unblocks this PR: I can create an issue for better time handling in Pinot and narrow this PR to be very Presto specific and remove this function from the Pinot documentation (ie let it remain hidden).

Please let me know what would you need to see in this PR such that it can be merged in.

Jackie-Jiang · 2019-10-28T21:44:06Z

@agrawaldevesh Thanks for the explanation.
IMO, date_trunc() should not have any knowledge of the timezone, but only truncate the time based on the given timestamp. Given the input value, date_trunc() should be able to figure out the input granularity automatically, and truncate it to the specified truncation interval. With these, we can achieve the standard date_trunc() semantic with only 2 arguments, e.g. date_trunc('DAYS', timestamp).
Then the missing part is a timezone conversion function from arbitrary timezone & value to a timestamp in another timezone. The time value can be of numeric type or string type (e.g. DateTimeFormat).
With these 2 functions de-coupled and chain-able, it becomes much more flexible. We can support timestamp/DateTimeFormat input in any timezone, and return truncated timestamp in another timezone. Also, we have the standard semantic for date_trunc() function.
What do you think? Does this proposal meet all the requirements for the Presto connector?

agrawaldevesh · 2019-10-28T22:29:01Z

Hi @Jackie-Jiang ... I understand that splitting up the functions and making them composable is flexible and great. But I am afraid that the approach you have pointed out won't work:

In the library joda-time we are using: Truncation depends on the timezone. Which makes sense: How do you truncate to a quarter, when the quarter starts at different timestamps across the world ?
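
To make the timezone dependence concrete, here is a small sketch using java.time rather than joda-time (a swapped-in library; the class and method names are hypothetical). The same epoch value truncates to different instants depending on the zone in which the day or quarter boundary is evaluated.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZonedDateTime;

// Hypothetical illustration; not the joda-time code used in the PR.
final class ZoneTruncSketch {
  // Start of the local day that contains the instant, as epoch millis.
  static long truncateToDay(long epochMillis, ZoneId zone) {
    return Instant.ofEpochMilli(epochMillis).atZone(zone)
        .toLocalDate().atStartOfDay(zone).toInstant().toEpochMilli();
  }

  // Start of the local calendar quarter that contains the instant, as epoch millis.
  static long truncateToQuarter(long epochMillis, ZoneId zone) {
    ZonedDateTime local = Instant.ofEpochMilli(epochMillis).atZone(zone);
    int firstMonthOfQuarter = ((local.getMonthValue() - 1) / 3) * 3 + 1;
    return LocalDate.of(local.getYear(), firstMonthOfQuarter, 1)
        .atStartOfDay(zone).toInstant().toEpochMilli();
  }

  public static void main(String[] args) {
    long ts = Instant.parse("2019-10-23T23:07:25Z").toEpochMilli();
    for (String id : new String[] {"UTC", "America/Los_Angeles", "Asia/Tokyo"}) {
      ZoneId zone = ZoneId.of(id);
      System.out.println(id + " day=" + Instant.ofEpochMilli(truncateToDay(ts, zone))
          + " quarter=" + Instant.ofEpochMilli(truncateToQuarter(ts, zone)));
    }
  }
}
```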

We cannot work around the fact that Pinot does not have a "Timestamp with TZ" type, so there is no way to know the input timezone without specifying it as an argument. So at the minimum we need three arguments: the input timezone, the truncation granularity ('hour', 'week', etc.), and the input value.

We need the timezone for truncation: either as an input field or as a whole new type. The latter is obviously a much bigger change :-)

Jackie-Jiang · 2019-10-28T23:53:53Z

@agrawaldevesh I think I might have missed something? Within the same timezone (both input and output value in the same timezone), will date_trunc always return the same value?
E.g.

Input value in timezone1, truncate the time in timezone1, output value in timezone 2: timezone_convert(date_trunc('DAYS', input), timezone1, timezone2)
Input value in timezone1, output value in timezone2, truncate the time in timezone2:
date_trunc('DAYS', timezone_convert(input, timezone1, timezone2))

If this does not work, and we need to have the timezone info, then I agree we should put it in as an input field. We can make it optional and default to UTC.

agrawaldevesh · 2019-10-29T00:45:48Z

Input value in timezone1, truncate the time in timezone1, output value in timezone 2: timezone_convert(date_trunc('DAYS', input), timezone1, timezone2)

The problem is that "input" is just a long. How would date_trunc know that 'input' is in timezone1 ?

Jackie-Jiang · 2019-10-29T00:58:18Z

@agrawaldevesh I see what I was missing. So both the input and output values are Unix time (in UTC), and we just want to truncate the time based on some timezone, right?
Then the current approach is good. My suggestion would be making timezone (default UTC), input granularity (default millis) and output granularity (default millis) optional. The function can take 2-5 arguments, and by default using UTC to truncate in millis.

Jackie-Jiang · 2019-10-29T01:01:02Z

Also, based on the input value, input granularity can also be automatically figured out

agrawaldevesh · 2019-10-29T05:11:51Z

Hi @Jackie-Jiang ... thanks for discussing the shape of this function with me on Slack. I have updated it per what we discussed:

The new argument format is:

truncation granularity
Input value that is always since UTC epoch
input granularity: milliseconds, seconds etc
optional: Truncation TZ defaulting to UTC
optional: Output granularity defaulting to Input granularity
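
The argument list above could be sketched roughly as follows. This is an assumption-laden illustration in plain java.time and java.util.concurrent.TimeUnit, not the merged DateTruncTransformFunction: the class name, the overloads used to model the optional arguments, the small set of supported units, and the Monday-based week are all choices made for this sketch.

```java
import java.time.DayOfWeek;
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;
import java.time.temporal.TemporalAdjusters;
import java.util.Locale;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of the argument shape; not the PR's implementation.
final class DateTruncSketch {
  // Full form: unit, value since UTC epoch, input granularity, truncation zone, output granularity.
  static long dateTrunc(String unit, long value, TimeUnit inputGranularity,
      ZoneId zone, TimeUnit outputGranularity) {
    long millis = inputGranularity.toMillis(value);
    ZonedDateTime local = Instant.ofEpochMilli(millis).atZone(zone);
    ZonedDateTime truncated;
    switch (unit.toLowerCase(Locale.ROOT)) {
      case "hour":
        truncated = local.truncatedTo(ChronoUnit.HOURS);
        break;
      case "day":
        truncated = local.toLocalDate().atStartOfDay(zone);
        break;
      case "week": // assumed Monday-based week
        truncated = local.toLocalDate()
            .with(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY)).atStartOfDay(zone);
        break;
      case "month":
        truncated = local.toLocalDate().withDayOfMonth(1).atStartOfDay(zone);
        break;
      default:
        throw new IllegalArgumentException("Unsupported unit: " + unit);
    }
    return outputGranularity.convert(truncated.toInstant().toEpochMilli(), TimeUnit.MILLISECONDS);
  }

  // Optional zone given; output granularity defaults to the input granularity.
  static long dateTrunc(String unit, long value, TimeUnit inputGranularity, ZoneId zone) {
    return dateTrunc(unit, value, inputGranularity, zone, inputGranularity);
  }

  // Both optional arguments omitted: truncate in UTC, keep the input granularity.
  static long dateTrunc(String unit, long value, TimeUnit inputGranularity) {
    return dateTrunc(unit, value, inputGranularity, ZoneOffset.UTC, inputGranularity);
  }

  public static void main(String[] args) {
    long seconds = Instant.parse("2019-10-23T23:07:25Z").getEpochSecond();
    System.out.println(dateTrunc("day", seconds, TimeUnit.SECONDS));
    System.out.println(dateTrunc("week", seconds, TimeUnit.SECONDS, ZoneId.of("America/Los_Angeles")));
    System.out.println(dateTrunc("hour", seconds, TimeUnit.SECONDS, ZoneOffset.UTC, TimeUnit.MILLISECONDS));
  }
}
```

Both the input and the result stay as plain longs since the UTC epoch; the zone only decides where the truncation boundary falls.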

Thanks for helping me nail this interface.

Jackie-Jiang

LGTM with minor comments

.../main/java/org/apache/pinot/core/operator/transform/function/DateTruncTransformFunction.java

pinot-core/src/test/java/org/apache/pinot/queries/TransformQueriesTest.java

…o's date_trunc

This function will only be called by the Presto connector (prestodb/presto#13504), and its semantics are identical to Presto's SQL date_trunc, albeit with a timezone specialization. Please see details on Presto's date_trunc function here: https://prestodb.github.io/docs/current/functions/datetime.html

This is needed so that Presto's date_trunc invocations can be faithfully translated as-is to this new function. It is a new function so that it is trivial to roll out harmlessly, without a lot of regression testing.

Without this function, we can handle neither timezones nor week truncations with Pinot's existing dateTimeConvert function.

It basically copies the PrestoDB code: https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/DateTimeFunctions.java (the specializations of date_trunc for TIMESTAMP and TIMESTAMP_WITH_TIME_ZONE).

I am even checking in the zone-index.properties used by Presto to ensure that even the time zones are 1:1 between Presto and this function (synced to the latest prestodb/presto repo).

Understanding this UDF requires knowledge of the joda-time API. I am not documenting this heavily since it is a copy of the Presto UDF.

kishoreg requested a review from npawar October 23, 2019 23:26

kishoreg reviewed Oct 23, 2019

pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/PrestoDateTrunc.java

Jackie-Jiang reviewed Oct 24, 2019

agrawaldevesh force-pushed the presto_date_trunc branch from 896b912 to c1ad5d3 Compare October 24, 2019 03:34

Jackie-Jiang reviewed Oct 24, 2019

...-core/src/main/java/org/apache/pinot/core/operator/transform/function/DateTruncFunction.java

agrawaldevesh force-pushed the presto_date_trunc branch from c1ad5d3 to 968d928 Compare October 25, 2019 02:56

siddharthteotia reviewed Oct 25, 2019

...t/java/org/apache/pinot/core/operator/transform/function/DateTruncTransformFunctionTest.java

siddharthteotia reviewed Oct 25, 2019

...t/java/org/apache/pinot/core/operator/transform/function/DateTruncTransformFunctionTest.java

agrawaldevesh force-pushed the presto_date_trunc branch from 968d928 to 4ebaaba Compare October 26, 2019 06:36

Jackie-Jiang reviewed Oct 28, 2019

docs/pql_examples.rst

docs/pql_examples.rst

agrawaldevesh force-pushed the presto_date_trunc branch 2 times, most recently from 2276b92 to a112f08 Compare October 29, 2019 05:04

Jackie-Jiang approved these changes Oct 29, 2019

agrawaldevesh force-pushed the presto_date_trunc branch from a112f08 to a6e01b2 Compare October 29, 2019 18:17

Jackie-Jiang merged commit 4f849ea into apache:master Oct 29, 2019

agrawaldevesh mentioned this pull request Sep 14, 2020

DATETIMECONVERT udf does not work for customized timezone and bucket size > 1 day #3513

Closed

5 participants