ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader #8611

andizimmerer · 2020-11-08T17:28:22Z

This PR partly handles the issue https://issues.apache.org/jira/browse/ARROW-4804.

Things to discuss:

Date32 is currently assumed to use DateUnit::Day, and Date64 is assumed to use DateUnit::Millisecond.
Dates have to be in ISO format, YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS. I'm not sure if this is a reasonable assumption. It should also be noted that the C++ implementation does not strictly follow the ISO format because it uses YYYY-MM-DD HH:MM:SS instead (without the T). This difference also has to be discussed before merging.
I'm using std::mem::transmute_copy to convert i32 and i64 to T::Native. I struggled with Rust's type system here and I'd appreciate suggestions on how to do this in a less unsafe way.
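To make the separator question concrete, here is a minimal sketch of accepting both the ISO `T` and the space used by the C++ reader. `split_date_time` is a hypothetical helper written for illustration, not code from this PR, and it only splits the string; it does not validate the date or time parts.

```rust
/// Splits a date-time string into its date part and optional time part.
/// Accepts both the ISO 8601 `T` separator and a single space.
/// Hypothetical helper for illustration; not part of this PR.
fn split_date_time(s: &str) -> (&str, Option<&str>) {
    // `split_once` splits at the first character matching the closure.
    match s.split_once(|c: char| c == 'T' || c == ' ') {
        Some((date, time)) => (date, Some(time)),
        // No separator found: treat the whole input as a date.
        None => (s, None),
    }
}
```

Splitting first would let the date part be parsed the same way regardless of which separator the input uses.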

Things that are not part of this PR:

Handling of other temporal types like timestamp and time.
No list of columns for temporal types (reference: "To keep inference performant, user should provide a Vec of which columns to try convert to a temporal array")

github-actions · 2020-11-08T17:32:05Z

https://issues.apache.org/jira/browse/ARROW-4804

andizimmerer · 2020-11-08T17:34:00Z

Just saw #8609 and num_cast() might actually be the thing I was looking for to convert i64 to T::Native?

vertexclique · 2020-11-09T22:48:53Z

Truly it is. Otherwise platform native primitives mightn't fit into the return types.

jorgecarleitao

Hi @Jibbow Thanks a lot for this PR: I can see from the issue number that this is a really old issue. Thanks for taking the time to implement it!

I went through this, and my only concern is the transmute, which IMO causes undefined behavior in this case. I see you already know how to address it, though :)

jorgecarleitao · 2020-11-10T06:26:40Z

rust/arrow/src/csv/reader.rs

+        DataType::Date32(DateUnit::Day) => {
+            let days = chrono::NaiveDate::parse_from_str(s, "%Y-%m-%d")
+                .map(|t| since(t, from_ymd(1970, 1, 1)).num_days() as i32);
+            days.map(|t| unsafe { std::mem::transmute_copy::<i32, T::Native>(&t) })


What @vertexclique said: transmute is one of the most unsafe operations in Rust, and this can easily lead to undefined behavior if it overflows.

jorgecarleitao · 2020-11-10T06:28:01Z

rust/arrow/src/csv/reader.rs

        .case_insensitive(true)
        .build()
        .unwrap();
+    static ref DATE_RE: Regex = Regex::new(r"^\d\d\d\d-\d\d-\d\d$").unwrap();


Isn't there a \d{4} or something like that? It may make it a bit easier to read and more expressive, IMO.

andizimmerer · 2020-11-10T10:16:39Z

Thanks for the feedback @vertexclique and @jorgecarleitao!
I'll update the PR when #8609 is merged.

alamb · 2020-11-10T12:38:35Z

rust/arrow/src/csv/reader.rs


+/// Parses a string into the specified `ArrowPrimitiveType`.
+fn parse_field<T: ArrowPrimitiveType>(s: &str) -> Result<T::Native> {
+    let from_ymd = chrono::NaiveDate::from_ymd;


Note there is also code to convert strings to nanosecond timestamps (string_to_timestamp_nanos) here: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/datetime_expressions.rs#L30

In the long run, this is motivation to centralise this into temporal kernels; so we can share this with the JSON reader.
One of the things I'll submit a PR for in the coming days/weeks, is a crate that parses strings to numbers faster than what libcore does.

alamb · 2020-11-10T13:06:12Z

rust/arrow/src/csv/reader.rs

+        DataType::Date32(DateUnit::Day) => {
+            let days = chrono::NaiveDate::parse_from_str(s, "%Y-%m-%d")
+                .map(|t| since(t, from_ymd(1970, 1, 1)).num_days() as i32);
+            days.map(|t| unsafe { std::mem::transmute_copy::<i32, T::Native>(&t) })


Another alternative would be to extend the ArrowNativeType with from_i32 and from_i64, following the model of from_usize and then implement those functions for i32 and i64 respectively (as those are the underlying native types)

I tried this approach out on a branch in case you are interested / want to take the change:
Commit with change: alamb@cc61e7a

The branch (with your change) is here: https://github.com/alamb/arrow/tree/alamb/less-unsafe
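The from_i32 / from_i64 idea might look roughly like the following. This is a minimal sketch, not the code on the linked branch: `NativeFromInt` is a hypothetical stand-in for ArrowNativeType, `days_to_native` is a made-up helper, and the `Option`-returning shape with a `None` default is an assumption.

```rust
/// Hypothetical stand-in for Arrow's `ArrowNativeType`; the real trait
/// has more methods and supertraits.
trait NativeFromInt: Sized {
    /// Default: this native type is not backed by `i32`.
    fn from_i32(_: i32) -> Option<Self> {
        None
    }
    /// Default: this native type is not backed by `i64`.
    fn from_i64(_: i64) -> Option<Self> {
        None
    }
}

// Only the types that really are `i32` / `i64` override the defaults.
impl NativeFromInt for i32 {
    fn from_i32(v: i32) -> Option<Self> {
        Some(v)
    }
}

impl NativeFromInt for i64 {
    fn from_i64(v: i64) -> Option<Self> {
        Some(v)
    }
}

// Any other native type keeps the `None` defaults.
impl NativeFromInt for f64 {}

/// Generic code can build a `T` from a day count without `unsafe`.
fn days_to_native<T: NativeFromInt>(days: i32) -> Option<T> {
    T::from_i32(days)
}
```

With this shape, a mismatch between the parsed integer and the target native type shows up as `None` that the caller can turn into an error, instead of reinterpreting bytes.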

alamb · 2020-11-10T13:07:09Z

> Date32 is currently assumed to use DateUnit::Day, and Date64 is assumed to use DateUnit::Millisecond.

I think this is fine as long as reasonable errors (unsupported XXX) are produced if alternate types are used
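One possible shape for such an error is sketched below. The `DateUnit` and `DataType` enums here are simplified stand-ins defined only for this example, and `check_supported` and its message text are hypothetical.

```rust
/// Simplified stand-ins for the Arrow enums discussed in this thread.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DateUnit {
    Day,
    Millisecond,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Date32(DateUnit),
    Date64(DateUnit),
}

/// Returns an error naming the unsupported combination instead of
/// parsing it with the wrong unit.
fn check_supported(data_type: DataType) -> Result<(), String> {
    match data_type {
        DataType::Date32(DateUnit::Day) | DataType::Date64(DateUnit::Millisecond) => Ok(()),
        other => Err(format!("Unsupported date type for CSV parsing: {:?}", other)),
    }
}
```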

nevi-me · 2020-11-10T14:24:27Z

@alamb I can't respond to your comment.
I see you're using T::Native::from_i32(since(days, from_ymd(1970, 1, 1)).num_days() as i32). Shouldn't there be a way for us to get the desired value from chrono without introducing a fallible cast/conversion?

alamb · 2020-11-10T14:59:39Z

@nevi-me -- good call -- I was simply copy/pasting what was in this PR in terms of as i32 without looking carefully enough. I updated https://github.com/alamb/arrow/tree/alamb/less-unsafe with a new commit that doesn't use as i32 and instead uses try_into()

86741de
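The difference between `as i32` and a fallible conversion can be shown with the standard library alone. This is a sketch under assumptions, not the code from the commit above: `days_to_date32` is a hypothetical helper, and the `i64` input mirrors the fact that chrono's `Duration::num_days` returns `i64`.

```rust
use std::convert::TryFrom;

/// Converts an `i64` day count to the `i32` that Date32 stores,
/// failing instead of silently wrapping on overflow.
/// Hypothetical helper for illustration.
fn days_to_date32(days: i64) -> Result<i32, String> {
    i32::try_from(days)
        .map_err(|e| format!("day count {} does not fit in Date32: {}", days, e))
}
```

By contrast, an out-of-range value cast with `as i32` wraps around without any error, so a far-future date could silently become a date in the distant past.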

nevi-me · 2020-11-21T09:02:56Z

Hi @Jibbow, I think we need not wait for #8609 before you can update this PR.

Dandandan · 2020-11-21T12:20:32Z

rust/arrow/src/csv/reader.rs

+    let from_ymd = chrono::NaiveDate::from_ymd;
+    let since = chrono::NaiveDate::signed_duration_since;
+
+    match T::DATA_TYPE {


I think this will benefit from changes in this PR to include a trait.
#8714

jorgecarleitao · 2020-12-08T06:08:40Z

@Jibbow, are you blocked or do you need help here? Let us know if that is the case so that we can help.

Dandandan · 2020-12-13T17:15:04Z

Benchmarks on master are failing as the csv parser doesn't support date32 / date64 related to this addition: #8892

@Jibbow any plans on finishing the PR? Would be really nice to add!

seddonm1 · 2020-12-13T23:24:44Z

> Benchmarks on master are failing as the csv parser doesn't support date32 / date64 related to this addition: #8892
>
> @Jibbow any plans on finishing the PR? Would be really nice to add!

Sorry that I broke this. I had been running my benchmarks with Parquet so did not notice. I am working with Andy to uplift the tests to catch this kind of issue.

@alamb

This is based on #8611 by @Jibbow and some suggestions by @alamb.
Adds date32 / date64 to the csv reader. This also fixes the benchmark, which now includes date types which were added by @seddonm1.
There are some missing parts in the date format support (such as actual ms support) but those can be implemented I think as separate PRs.
Closes #8913 from Dandandan/date_csv
Authored-by: Heres, Daniel <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>

alamb · 2020-12-15T21:48:16Z

FYI we have incorporated this code via #8913 - thanks for the help @Jibbow. I am going to close this PR for now as it has a conflict and we are trying to clean up the PR queue -- please let me know if that was a mistake.

andizimmerer · 2020-12-23T13:47:10Z

Sorry for being so quiet the last few weeks. I had a research paper deadline coming up. Thanks @Dandandan for taking over!

ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader

f035c40

github-actions bot added the Component: Rust label Nov 8, 2020

jorgecarleitao reviewed Nov 10, 2020

alamb reviewed Nov 10, 2020

nevi-me mentioned this pull request Nov 19, 2020

ARROW-10649: [Rust] Parse manually in infer_field_schema, remove lazy static dependency #8710

Closed

Dandandan reviewed Nov 21, 2020

github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 25, 2020

Dandandan mentioned this pull request Dec 14, 2020

ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader #8913

Closed

alamb closed this Dec 15, 2020

asfimport mentioned this pull request Dec 27, 2020

[Rust] Read temporal values from CSV - Parse Date32 and Date64 in CSV reader #21322

Closed
