Skip to content

Conversation

@seddonm1
Copy link
Contributor

@seddonm1 seddonm1 commented Dec 10, 2020

This PR adds support for what the sqlparser crate calls TypedString which is basically syntactic sugar for an inline-cast. As this was an effort to get the TPC-H queries behaving correctly I then went a step further and added support for Date (temporal) coercion. I can split this PR if needed.

where
    l_shipdate <= date '1998-09-02'

is equivalent to

where
    l_shipdate <= CAST('1998-09-02' AS DATE)

FYI I am planning to tackle INTERVAL next.

@github-actions
Copy link

@codecov-io
Copy link

codecov-io commented Dec 10, 2020

Codecov Report

Merging #8892 (536bb1b) into master (3deae8d) will increase coverage by 24.05%.
The diff coverage is 77.41%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #8892       +/-   ##
===========================================
+ Coverage   52.98%   77.04%   +24.05%     
===========================================
  Files         172      173        +1     
  Lines       30750    40172     +9422     
===========================================
+ Hits        16294    30950    +14656     
+ Misses      14456     9222     -5234     
Impacted Files Coverage Δ
rust/datafusion/src/physical_plan/expressions.rs 0.00% <ø> (ø)
rust/datafusion/src/sql/planner.rs 0.00% <0.00%> (ø)
rust/arrow/src/compute/kernels/cast.rs 96.33% <96.00%> (+0.12%) ⬆️
...ntegration-testing/src/bin/arrow-file-to-stream.rs
...ion-testing/src/bin/arrow-json-integration-test.rs
...ntegration-testing/src/bin/arrow-stream-to-file.rs
rust/parquet/src/util/hash_util.rs 95.77% <0.00%> (ø)
rust/parquet/src/util/bit_packing.rs 99.96% <0.00%> (ø)
rust/parquet/src/file/footer.rs 96.22% <0.00%> (ø)
rust/parquet/src/util/test_common/file_util.rs 77.77% <0.00%> (ø)
... and 49 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b0b9269...536bb1b. Read the comment docs.

@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Dec 11, 2020
@seddonm1 seddonm1 changed the title ARROW-10817: [Rust] [DataFusion] Implement TypedString ARROW-10817: [Rust] [DataFusion] Implement TypedString and DATE coercion Dec 12, 2020
@seddonm1
Copy link
Contributor Author

@andygrove could you please have a look?

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @seddonm1 this looks great 🚀

@andygrove
Copy link
Member

I should note that these changes make the test more realistic and will likely reduce performance so we will need to bear this in mind when comparing benchmark results to previous results.

@seddonm1
Copy link
Contributor Author

Thanks Andy. I will resolve the merge conflict.

@github-actions github-actions bot removed the needs-rebase A PR that needs to be rebased by the author label Dec 13, 2020
@seddonm1
Copy link
Contributor Author

Rebased so should be good to merge.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree -- looks great @seddonm1 !

@alamb alamb closed this in e980ef8 Dec 13, 2020
and l_discount between 0.06 - 0.01 and 0.06 + 0.01
l_shipdate >= date '1994-01-01'
and l_shipdate < date '1995-01-01'
and l_discount > 0.06 - 0.01 and l_discount < 0.06 + 0.01
Copy link
Contributor

@Dandandan Dandandan Dec 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this have between that was also added recently? @seddonm1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, sorry i will fix this.

Copy link
Contributor Author

@seddonm1 seddonm1 Dec 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, no. The raw query as per TPC-H does not use between for that clause:

where l_shipdate >= date '[DATE]' 
and l_shipdate < date '[DATE]' + interval '1' year
and l_discount between [DISCOUNT] - 0.01 and [DISCOUNT] + 0.01

Which is different as BETWEEN is inclusive (>= AND <=)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a between on the last line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, sorry I thought you meant add it to the l_shipdate component. Yes, i will fix that.

@Dandandan
Copy link
Contributor

Dandandan commented Dec 13, 2020

@andygrove I am actually expecting that comparison on date (in ms since epoch / int64) should be faster than on strings (if no weird implementation)?

@seddonm1 unfortunately I am getting an error on master, as the CSV reader doesn't support dates yet.

Running benchmarks with the following options: BenchmarkOpt { query: 10, debug: false, iterations: 10, concurrency: 10, batch_size: 20000, path: "/home/danielheres/Code/gdd/arrow/rust/benchmarks/tpch-dbgen", file_format: "tbl", mem_table: true }
Loading table 'part' into memory
Loaded table 'part' into memory in 157 ms
Loading table 'supplier' into memory
Loaded table 'supplier' into memory in 12 ms
Loading table 'partsupp' into memory
Loaded table 'partsupp' into memory in 553 ms
Loading table 'customer' into memory
Loaded table 'customer' into memory in 139 ms
Loading table 'orders' into memory
Error: ArrowError(ParseError("Unsupported data type Date32(Day)"))

@andygrove
Copy link
Member

@Dandandan I was thinking about the overhead of converting the strings to dates, but you're right, if the data is stored natively in date format in Parquet then it should be faster. It would be slower than before against CSV though, probably.

@Dandandan
Copy link
Contributor

@andygrove makes sense

@Dandandan
Copy link
Contributor

@andygrove checked it, in #8913 , indeed parsing is a bit slower, queries are faster

alamb pushed a commit that referenced this pull request Dec 14, 2020
…WEEN

@Dandandan
Fixes per #8892 (comment)

Closes #8906 from seddonm1/update-tpch-queries

Authored-by: Mike Seddon <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
@alamb
Copy link
Contributor

alamb commented Dec 14, 2020

@andygrove checked it, in #8913 , indeed parsing is a bit slower, queries are faster

That is the right tradeoff in my opinion

@alamb alamb reopened this Dec 14, 2020
@alamb alamb closed this Dec 14, 2020
@alamb
Copy link
Contributor

alamb commented Dec 14, 2020

Sorry I reopened this by accident.

@seddonm1 seddonm1 deleted the typed_string branch January 4, 2021 22:48
alamb pushed a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
…WEEN

@Dandandan
Fixes per apache/arrow#8892 (comment)

Closes #8906 from seddonm1/update-tpch-queries

Authored-by: Mike Seddon <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants