ARROW-10817: [Rust] [DataFusion] Implement TypedString and DATE coercion #8892
Codecov Report
@@             Coverage Diff              @@
##           master    #8892       +/-   ##
===========================================
+ Coverage   52.98%   77.04%   +24.05%
===========================================
  Files         172      173        +1
  Lines       30750    40172     +9422
===========================================
+ Hits        16294    30950    +14656
+ Misses      14456     9222     -5234
|
|
@andygrove could you please have a look?
andygrove left a comment
Thanks @seddonm1 this looks great 🚀
|
I should note that these changes make the test more realistic and will likely reduce performance, so we will need to bear this in mind when comparing benchmark results with earlier ones.
|
Thanks Andy. I will resolve the merge conflict.
Force-pushed from 536bb1b to 5ce4408
|
Rebased, so this should be good to merge.
alamb left a comment
I agree -- looks great @seddonm1 !
and l_discount between 0.06 - 0.01 and 0.06 + 0.01
l_shipdate >= date '1994-01-01'
and l_shipdate < date '1995-01-01'
and l_discount > 0.06 - 0.01 and l_discount < 0.06 + 0.01
Shouldn't this use BETWEEN, which was also added recently? @seddonm1
Yes, sorry, I will fix this.
Actually, no. The raw query as per TPC-H does not use between for that clause:
where l_shipdate >= date '[DATE]'
and l_shipdate < date '[DATE]' + interval '1' year
and l_discount between [DISCOUNT] - 0.01 and [DISCOUNT] + 0.01
Which is different as BETWEEN is inclusive (>= AND <=)
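To make the inclusive/exclusive point concrete, here is a minimal sketch (not DataFusion code). The function names `between` and `strictly_between` are made up for illustration, and discounts are held as integer hundredths (an assumption, to keep floating-point rounding out of the picture), so 0.06 - 0.01 becomes 5 and 0.06 + 0.01 becomes 7.

```rust
/// SQL `x BETWEEN lo AND hi`: both bounds are inclusive.
/// (Hypothetical helper for illustration only.)
fn between(x: i64, lo: i64, hi: i64) -> bool {
    x >= lo && x <= hi
}

/// The strict form `x > lo AND x < hi`: both bounds are excluded.
/// (Hypothetical helper for illustration only.)
fn strictly_between(x: i64, lo: i64, hi: i64) -> bool {
    x > lo && x < hi
}

fn main() {
    // Discounts in hundredths: [DISCOUNT] = 6, so the bounds are 5 and 7.
    let (lo, hi) = (6 - 1, 6 + 1);
    for discount in [4, 5, 6, 7, 8] {
        println!(
            "discount {}: between={} strict={}",
            discount,
            between(discount, lo, hi),
            strictly_between(discount, lo, hi)
        );
    }
}
```

The two predicates only disagree on the boundary values (5 and 7 here), which is exactly where rows with a discount of 0.05 or 0.07 would be kept or dropped.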
I see a between on the last line?
Oh yes, sorry, I thought you meant adding it to the l_shipdate component. Yes, I will fix that.
|
@andygrove I would actually expect comparison on dates (in ms since epoch / int64) to be faster than on strings (barring an odd implementation)?
@seddonm1 Unfortunately I am getting an error on master, as the CSV reader doesn't support dates yet.
|
@Dandandan I was thinking about the overhead of converting the strings to dates, but you're right, if the data is stored natively in date format in Parquet then it should be faster. It would be slower than before against CSV though, probably.
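The trade-off described here (pay once to parse, then compare integers) can be sketched roughly as follows. This is not how DataFusion or the Arrow CSV reader does it; `parse_date_ms` and `days_from_civil` are hypothetical helpers, and the day-count arithmetic is the widely used civil-date formula for the proleptic Gregorian calendar.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date.
/// (Hypothetical helper; standard civil-date arithmetic.)
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    // Treat January and February as months 11 and 12 of the previous year.
    let y = if m <= 2 { y - 1 } else { y };
    let era = (if y >= 0 { y } else { y - 399 }) / 400;
    let yoe = y - era * 400; // year of era, 0..=399
    let mp = (m + 9) % 12; // March = 0, ..., February = 11
    let doy = (153 * mp + 2) / 5 + d - 1; // day of (shifted) year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146_097 + doe - 719_468
}

/// Hypothetical helper: parse "YYYY-MM-DD" into milliseconds since the epoch.
/// Validation is deliberately minimal (month 1..=12, day 1..=31).
fn parse_date_ms(s: &str) -> Option<i64> {
    let mut parts = s.split('-');
    let y: i64 = parts.next()?.parse().ok()?;
    let m: i64 = parts.next()?.parse().ok()?;
    let d: i64 = parts.next()?.parse().ok()?;
    if parts.next().is_some() || !(1..=12).contains(&m) || !(1..=31).contains(&d) {
        return None;
    }
    Some(days_from_civil(y, m, d) * 86_400_000)
}

fn main() {
    // The literals are parsed once, up front.
    let lower = parse_date_ms("1994-01-01").unwrap();
    let upper = parse_date_ms("1995-01-01").unwrap();

    // With CSV input every value must be parsed (the extra cost mentioned above);
    // with a native date column this step would not be needed.
    let shipdates = ["1993-12-31", "1994-06-15", "1995-01-01"];
    let parsed: Vec<i64> = shipdates
        .iter()
        .map(|s| parse_date_ms(s).unwrap())
        .collect();

    // The filter itself is then a pair of integer comparisons per row.
    let hits = parsed.iter().filter(|&&t| t >= lower && t < upper).count();
    println!("{} row(s) in range", hits);
}
```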
|
@andygrove makes sense
|
@andygrove I checked it in #8913: parsing is indeed a bit slower, but queries are faster.
…WEEN
@Dandandan Fixes per #8892 (comment)
Closes #8906 from seddonm1/update-tpch-queries
Authored-by: Mike Seddon <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
That is the right tradeoff in my opinion
|
Sorry I reopened this by accident.
This PR adds support for what the sqlparser crate calls TypedString, which is basically syntactic sugar for an inline cast. As this was an effort to get the TPC-H queries behaving correctly, I then went a step further and added support for Date (temporal) coercion. I can split this PR if needed.
For example, the typed string date '1994-01-01' is equivalent to an explicit cast of the string literal '1994-01-01' to a date.
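As a rough illustration of the "syntactic sugar for a cast" idea, the sketch below rewrites a typed-string node into a cast over a plain string literal. The `Expr` and `DataType` enums and the `desugar` function are hypothetical, simplified stand-ins written for this example; they are not the sqlparser or DataFusion types, and the PR's actual code may be structured differently.

```rust
// Hypothetical, simplified types for illustration only;
// NOT the sqlparser or DataFusion definitions.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Date,
}

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// A plain string literal such as '1994-01-01'.
    StringLiteral(String),
    /// A typed string such as date '1994-01-01'.
    TypedString { data_type: DataType, value: String },
    /// An explicit cast of an inner expression to a data type.
    Cast { expr: Box<Expr>, data_type: DataType },
}

/// Rewrite every typed string into a cast of the underlying string literal.
fn desugar(expr: Expr) -> Expr {
    match expr {
        Expr::TypedString { data_type, value } => Expr::Cast {
            expr: Box::new(Expr::StringLiteral(value)),
            data_type,
        },
        // Recurse so typed strings nested inside casts are rewritten too.
        Expr::Cast { expr, data_type } => Expr::Cast {
            expr: Box::new(desugar(*expr)),
            data_type,
        },
        other => other,
    }
}

fn main() {
    let typed = Expr::TypedString {
        data_type: DataType::Date,
        value: "1994-01-01".to_string(),
    };
    println!("{:?}", desugar(typed));
}
```

Once the literal is a cast to a date, comparing it with a date column is an ordinary typed comparison, which is where the Date coercion part of the PR comes in.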
FYI I am planning to tackle INTERVAL next.