-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-4804: [Rust] Parse Date32 and Date64 in CSV reader #8611
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Just saw #8609 and |
|
Truly it is. Otherwise platform native primitives mightn't fit into the return types. |
jorgecarleitao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Jibbow Thanks a lot for this PR: I can see from the issue number that this is a really old issue. Thanks for taking the time to implement it!
I went through this, and my only concern is the transmute, that IMO causes undefined behavior in this case. I see you already know how to address it, though :)
| DataType::Date32(DateUnit::Day) => { | ||
| let days = chrono::NaiveDate::parse_from_str(s, "%Y-%m-%d") | ||
| .map(|t| since(t, from_ymd(1970, 1, 1)).num_days() as i32); | ||
| days.map(|t| unsafe { std::mem::transmute_copy::<i32, T::Native>(&t) }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what @vertexclique said: transmute is one of the most unsafe operations in rust, and this can easily lead to undefined behavior if it overflows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another alternative would be to extend the ArrowNativeType with from_i32 and from_i64, following the model of from_usize and then implement those functions for i32 and i64 respectively (as those are the underlying native types)
I tried this approach out on a branch in case you are interested / want to take the change:
Commit with change: alamb@cc61e7a
The branch (with your change) is here: https://github.com/alamb/arrow/tree/alamb/less-unsafe)
| .case_insensitive(true) | ||
| .build() | ||
| .unwrap(); | ||
| static ref DATE_RE: Regex = Regex::new(r"^\d\d\d\d-\d\d-\d\d$").unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't there a \d{4} or something like that? May make it a bit easier to read and more expressive, IMO
|
Thanks for the feedback @vertexclique and @jorgecarleitao! |
|
|
||
| /// Parses a string into the specified `ArrowPrimitiveType`. | ||
| fn parse_field<T: ArrowPrimitiveType>(s: &str) -> Result<T::Native> { | ||
| let from_ymd = chrono::NaiveDate::from_ymd; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note there is also code to convert strings to nanosecond timestamps (string_to_timestamp_nanos) here: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/datetime_expressions.rs#L30
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the long run, this is motivation to centralise this into temporal kernels; so we can share this with the JSON reader.
One of the things I'll submit a PR for in the coming days/weeks, is a crate that parses strings to numbers faster than what libcore does.
| DataType::Date32(DateUnit::Day) => { | ||
| let days = chrono::NaiveDate::parse_from_str(s, "%Y-%m-%d") | ||
| .map(|t| since(t, from_ymd(1970, 1, 1)).num_days() as i32); | ||
| days.map(|t| unsafe { std::mem::transmute_copy::<i32, T::Native>(&t) }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another alternative would be to extend the ArrowNativeType with from_i32 and from_i64, following the model of from_usize and then implement those functions for i32 and i64 respectively (as those are the underlying native types)
I tried this approach out on a branch in case you are interested / want to take the change:
Commit with change: alamb@cc61e7a
The branch (with your change) is here: https://github.com/alamb/arrow/tree/alamb/less-unsafe)
I think this is fine as long as reasonable errors (unsupported XXX) are produced if alternate types are used |
|
@alamb I can't respond to your comment. |
|
@nevi-me -- good call -- I was simply copy/pasting what was in this PR in terms of |
|
Hi @Jibbow , I think we need not wait for #8609 before you can update this PR |
| let from_ymd = chrono::NaiveDate::from_ymd; | ||
| let since = chrono::NaiveDate::signed_duration_since; | ||
|
|
||
| match T::DATA_TYPE { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will benefit from changes in this PR to include a trait.
#8714
|
@Jibbow, are you blocked or need help here? Let us know if that is the case so that we can help. |
|
Benchmarks on master are failing as the csv parser doesn't support date32 / date64 related to this addition: #8892 @Jibbow any plans on finishing the PR? Would be really nice to add! |
Sorry that I broke this. I had been running my benchmarks with Parquet so did not notice. I am working with Andy to uplift the tests to catch this kind of issue. |
This is based on #8611 by @Jibbow and some suggestions by @alamb Adds date32 / date64 to the csv reader. This also fixes the benchmark which now includes date types which were added by @seddonm1 There are some missing parts in the date format support (such as actual ms support) but those can be implemented I think as separate PRs. Closes #8913 from Dandandan/date_csv Authored-by: Heres, Daniel <[email protected]> Signed-off-by: Andrew Lamb <[email protected]>
|
FYI we have incorporated this code in via #8913 - thanks for the help @Jibbow . I am going to close this PR for now as it has a conflict and we are trying to clean up the PR queue -- please let me know if that was a mistake. |
|
Sorry for being so quiet the last weeks. I had a research paper deadline coming up. Thanks @Dandandan for taking over! |
This is based on apache/arrow#8611 by @Jibbow and some suggestions by @alamb Adds date32 / date64 to the csv reader. This also fixes the benchmark which now includes date types which were added by @seddonm1 There are some missing parts in the date format support (such as actual ms support) but those can be implemented I think as separate PRs. Closes #8913 from Dandandan/date_csv Authored-by: Heres, Daniel <[email protected]> Signed-off-by: Andrew Lamb <[email protected]>
This PR partly handles the issue https://issues.apache.org/jira/browse/ARROW-4804.
Things to discuss:
Date32currently assumes to haveDateUnit::DayandDate64assumesDateUnit::Millisecondrespectively.YYY-MM-DDorYYY-MM-DDTHH:MM:SS. I'm not sure if this is a reasonable assumption. It should also be noted that the C++ implementation does not strictly follow the ISO format because it usesYYY-MM-DD HH:MM:SSinstead (without theT). The difference is also a point that has to be discussed before merging.std::mem::transmute_copyto converti32andi64toT::Native. I struggled with Rust's type system here and I'd appreciate suggestions on how to do this in a less-unsafe way.Things that are not part of this PR: