[New Transformation function] Presto compatible DateTrunc#4740
[New Transformation function] Presto compatible DateTrunc#4740Jackie-Jiang merged 1 commit intoapache:masterfrom
Conversation
|
@snleee @mayankshriv can you please review this PR. The context is that we would like presto's use of the date_trunc (using PR prestodb/presto#13504 in the prestodb) to be translated to this new presto compatible date_trunc I am adding in pinot. This will ensure faithful translation supporting both timezones and week truncations. Currently the Pinot dateTimeConvert function truncates to the week starting at Thursday and that is incorrect from presto's standpoint. |
kishoreg
left a comment
There was a problem hiding this comment.
LGTM. This is adding the functionality described here rt? https://mode.com/blog/date-trunc-sql-timestamp-function-count-on
If this is generic enough and we can use this without Presto, lets drop the Presto prefix.
pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/PrestoDateTrunc.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/pinot/core/operator/transform/function/PrestoDateTruncFunction.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/pinot/core/operator/transform/function/PrestoDateTruncFunction.java
Outdated
Show resolved
Hide resolved
pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/TimeZoneKey.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/pinot/core/operator/transform/function/PrestoDateTruncFunction.java
Outdated
Show resolved
Hide resolved
...rc/main/java/org/apache/pinot/core/operator/transform/function/TransformFunctionFactory.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/pinot/core/operator/transform/function/PrestoDateTruncFunction.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/pinot/core/operator/transform/function/PrestoDateTruncFunction.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/pinot/core/operator/transform/function/PrestoDateTruncFunction.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/pinot/core/operator/transform/function/PrestoDateTruncFunction.java
Outdated
Show resolved
Hide resolved
896b912 to
c1ad5d3
Compare
|
Thanks for the review @Jackie-Jiang and @kishoreg. Updated per your feedback. |
Codecov Report
@@ Coverage Diff @@
## master #4740 +/- ##
============================================
- Coverage 57.79% 57.75% -0.04%
Complexity 4 4
============================================
Files 1207 1209 +2
Lines 64744 64933 +189
Branches 9413 9456 +43
============================================
+ Hits 37419 37503 +84
- Misses 24493 24586 +93
- Partials 2832 2844 +12
Continue to review full report at Codecov.
|
|
I am not sure what lead to the massive decrease in code coverage. The report looks fishy and lists several unrelated files. I did add unit tests for these new classes :). Can you help me figure out if the code coverage decrease is legit or not. Thanks ! |
|
@agrawaldevesh You don't need to worry about the code coverage. It could be that some coverage files are not sent to the server. |
...-core/src/main/java/org/apache/pinot/core/operator/transform/function/DateTruncFunction.java
Outdated
Show resolved
Hide resolved
c1ad5d3 to
968d928
Compare
|
@Jackie-Jiang Thanks for your comments earlier. I have incorporated all of your feedback. Thank you ! |
...t/java/org/apache/pinot/core/operator/transform/function/DateTruncTransformFunctionTest.java
Outdated
Show resolved
Hide resolved
...t/java/org/apache/pinot/core/operator/transform/function/DateTruncTransformFunctionTest.java
Outdated
Show resolved
Hide resolved
968d928 to
4ebaaba
Compare
|
Hi @siddharthteotia ... take a look at the PR now. I added the requested e2e unit test and also added documentation. So this UDF is now "unhidden" and can be used by anyone. I believe this should allay your concerns and allow this PR to be merged in. Let me know if you think something else needs to be done here. Thanks for the review ! |
Thanks, @agrawaldevesh. LGTM |
Jackie-Jiang
left a comment
There was a problem hiding this comment.
If the input value is simply a long value (no timezone info included), how do we decide the time zone for the output value? Or we always assume the input value is in UTC time?
By reading the documentation for Presto, PostgreSQL and Redshift, seems all of them take only 2 arguments, and that is much easier to use (which is very similar to the current TimeConvert. What is missing on Pinot side is TimeZoneConvert. Am I missing something here?
|
HI @Jackie-Jiang .. First some context: This diff didn't start its life out (and it is not my intention) to be a general time truncation UDF. Instead, it was meant to be called only by the presto-pinot connector and thus generate a value that presto can understand natively. Thus it returned "long milliseconds since epoch in UTC" because that is what presto natively understands as time. The typical way for this function to be invoked is: Pinot lacks a proper timestamp type. A type that can encode other bits of information associated with it: the timezone and the granularity. Presto achieves this with two timestamp types:
Given this:
They also have functions to change timezones, ie to go from a timestamp to a timestamp_with_tz and to change the timezone burned into a timestamp_with_tz. In the absence of a proper pinot type representing time or timezones, we need to inline that into the return type and thus have arguments to configure them. For example, we can have additional input arguments saying what the output timezone and the output time granularity are in. We similarly need additional arguments to specify what the input granularity and timezone are. And thus the need for 6 or so arguments:
(For full disclosure, I don't think we should allow the output timezone to be changed. What would a user do with that ? Because now they need to remember what the output timezone is) What do you think ? Should we make this function be fully general with these 6 arguments or should we keep it very custom and specific to Presto and thereby reduce it down to three arguments:
I like your idea of thinking about this more fully and introducing more time UDFs like timezone conversion, or something to print timestamps with timezones. But I think that is widening the scope of this PR: I need something minimal such that our presto-pinot customers can translate their presto expressions like below to pinot: If it unblocks this PR: I can create an issue for better time handling in Pinot and narrow this PR to be very Presto specific and remove this function from the Pinot documentation (ie let it remain hidden). Please let me know what would you need to see in this PR such that it can be merged in. |
|
@agrawaldevesh Thanks for the explanation. |
|
HI @Jackie-Jiang ... I understand that splitting up the functions and making them be composable is flexible and great. But I am afraid that the approach you have pointed, won't work: In the library joda-time we are using: Truncation depends on the timezone. Which makes sense: How do you truncate to a quarter, when the quarter starts at different timestamps across the world ? We cannot work around the fact that Pinot does not have a "Timestamp with TZ" type: so there is no way to known of the input timezone without specifying that as an argument. So at the minimum we need to have three arguments: Namely the input timezone, the truncation granularity 'hour'/'week' etc, and the input value. We need the timezone from truncation: either as an input field or as a whole new type. The latter is obviously a much bigger change :-) |
|
@agrawaldevesh I think I might have missed something? Within the same timezone (both input and output value in the same timezone), will
If this does not work, and we need to have the timezone info, then I agree we should put at as an input field. We can make it optional and default to UTC. |
The problem is that "input" is just a long. How would date_trunc know that 'input' is in timezone1 ? |
|
@agrawaldevesh I see where I was missing. So both the input and output values are unix time (in UTC), and we just want to truncate the time based on some timezone right? |
|
Also, based on the input value, input granularity can also be automatically figured out |
2276b92 to
a112f08
Compare
|
Hi @Jackie-Jiang ... thanks for discussing the shape of this function with me on Slack. I have updated it per what we discussed: The new argument format is:
Thanks for helping me nail this interface. |
Jackie-Jiang
left a comment
There was a problem hiding this comment.
LGTM with minor comments
.../main/java/org/apache/pinot/core/operator/transform/function/DateTruncTransformFunction.java
Outdated
Show resolved
Hide resolved
.../main/java/org/apache/pinot/core/operator/transform/function/DateTruncTransformFunction.java
Outdated
Show resolved
Hide resolved
.../main/java/org/apache/pinot/core/operator/transform/function/DateTruncTransformFunction.java
Outdated
Show resolved
Hide resolved
.../main/java/org/apache/pinot/core/operator/transform/function/DateTruncTransformFunction.java
Outdated
Show resolved
Hide resolved
.../main/java/org/apache/pinot/core/operator/transform/function/DateTruncTransformFunction.java
Outdated
Show resolved
Hide resolved
.../main/java/org/apache/pinot/core/operator/transform/function/DateTruncTransformFunction.java
Outdated
Show resolved
Hide resolved
pinot-core/src/test/java/org/apache/pinot/queries/TransformQueriesTest.java
Outdated
Show resolved
Hide resolved
pinot-core/src/test/java/org/apache/pinot/queries/TransformQueriesTest.java
Outdated
Show resolved
Hide resolved
pinot-core/src/test/java/org/apache/pinot/queries/TransformQueriesTest.java
Outdated
Show resolved
Hide resolved
pinot-core/src/test/java/org/apache/pinot/queries/TransformQueriesTest.java
Outdated
Show resolved
Hide resolved
…o's date_trunc This will only be called by Presto connector (prestodb/presto#13504) and it has identically semantics to presto's SQL date_trunc, albeit with a timezone specialization. Please see details on presto's date_trunc function here: https://prestodb.github.io/docs/current/functions/datetime.html This is needed so that the presto's date_trunc invocations can be faithfully translated as is to this new function. Its a new function so that it is trivial to roll out to harmlessly without a lot of regression testing. Without this function, we cannot handle timezones nor week truncations in the existing pinot's dateTimeConvert function. It basically copies the PrestoDB code: https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/DateTimeFunctions.java, (specializations of date_trunc for TIMESTAMP and TIMESTAMP_WITH_TIME_ZONE I am even checking in the zone-index.properties used by presto to ensure that even the time zones are 1:1 b/w presto and this function. (sync'd to the latest prestodb/presto repo) Understanding this UDF requires knowledge of the joda-time API. I am not documenting this heavily since it is a copy of the Presto UDF.
a112f08 to
a6e01b2
Compare
…o's date_trunc (apache#4740) This will only be called by Presto connector (prestodb/presto#13504) and it has identically semantics to presto's SQL date_trunc, albeit with a timezone specialization. Please see details on presto's date_trunc function here: https://prestodb.github.io/docs/current/functions/datetime.html This is needed so that the presto's date_trunc invocations can be faithfully translated as is to this new function. Its a new function so that it is trivial to roll out to harmlessly without a lot of regression testing. Without this function, we cannot handle timezones nor week truncations in the existing pinot's dateTimeConvert function. It basically copies the PrestoDB code: https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/DateTimeFunctions.java, (specializations of date_trunc for TIMESTAMP and TIMESTAMP_WITH_TIME_ZONE I am even checking in the zone-index.properties used by presto to ensure that even the time zones are 1:1 b/w presto and this function. (sync'd to the latest prestodb/presto repo) Understanding this UDF requires knowledge of the joda-time API. I am not documenting this heavily since it is a copy of the Presto UDF.
…o's date_trunc (apache#4740) This will only be called by Presto connector (prestodb/presto#13504) and it has identically semantics to presto's SQL date_trunc, albeit with a timezone specialization. Please see details on presto's date_trunc function here: https://prestodb.github.io/docs/current/functions/datetime.html This is needed so that the presto's date_trunc invocations can be faithfully translated as is to this new function. Its a new function so that it is trivial to roll out to harmlessly without a lot of regression testing. Without this function, we cannot handle timezones nor week truncations in the existing pinot's dateTimeConvert function. It basically copies the PrestoDB code: https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/DateTimeFunctions.java, (specializations of date_trunc for TIMESTAMP and TIMESTAMP_WITH_TIME_ZONE I am even checking in the zone-index.properties used by presto to ensure that even the time zones are 1:1 b/w presto and this function. (sync'd to the latest prestodb/presto repo) Understanding this UDF requires knowledge of the joda-time API. I am not documenting this heavily since it is a copy of the Presto UDF.
This will only be called by Presto connector (prestodb/presto#13504) and it has identically semantics
to presto's SQL date_trunc, albeit with a timezone specialization.
This is needed so that the presto's date_trunc invocations can be
faithfully translated as is to this new function. Its a new function so
that it is trivial to roll out to harmlessly without a lot of regression
testing.
Without this function, we cannot handle timezones nor week truncations
in the existing pinot's dateTimeConvert function.
It basically copies the PrestoDB code:
https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/scalar/DateTimeFunctions.java,
(specializations of date_trunc for TIMESTAMP and
TIMESTAMP_WITH_TIME_ZONE
I am even checking in the zone-index.properties used by presto to ensure that
even the time zones are 1:1 b/w presto and this function. (sync'd to the
latest prestodb repo)
Understanding this UDF requires knowledge of the joda-time API. I am not
documenting this heavily since it is a copy of the Presto UDF.