@andizimmerer
Contributor

This PR partly handles the issue https://issues.apache.org/jira/browse/ARROW-4804.

Things to discuss:

  1. Date32 currently assumes to have DateUnit::Day and Date64 assumes DateUnit::Millisecond respectively.
  2. Dates have to be in ISO format YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS. I'm not sure whether this is a reasonable assumption. Note also that the C++ implementation does not strictly follow the ISO format: it uses YYYY-MM-DD HH:MM:SS (a space instead of the T). This difference should be discussed before merging.
  3. I'm using std::mem::transmute_copy to convert i32 and i64 to T::Native. I struggled with Rust's type system here and I'd appreciate suggestions on how to do this in a less-unsafe way.

Things that are not part of this PR:

  1. Handling of other temporal types like timestamp and time.
  2. No list of columns for temporal types (reference: "To keep inference performant. user should provide a Vec of which columns to try convert to a temporal array")
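
For readers unfamiliar with the two encodings: Date32 stores days since 1970-01-01 and Date64 stores milliseconds since the same epoch. The PR relies on chrono for parsing; the sketch below is only a dependency-free illustration of the arithmetic, using the well-known civil-date-to-day-count computation instead of chrono. The function names `days_from_civil`, `parse_date32` and `parse_date64` are hypothetical and not part of the arrow API, and only the YYYY-MM-DD form is handled.

```rust
use std::convert::TryFrom;

/// Days between 1970-01-01 and a proleptic Gregorian date.
/// Hypothetical helper; the PR itself uses chrono for this step.
fn days_from_civil(year: i64, month: u32, day: u32) -> i64 {
    // Treat March as the first month so the leap day falls at the end of the year.
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400); // 400-year cycle index
    let yoe = y.rem_euclid(400); // year within the cycle, 0..=399
    let mp = (i64::from(month) + 9) % 12; // March = 0, ..., February = 11
    let doy = (153 * mp + 2) / 5 + i64::from(day) - 1; // day within the shifted year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day within the cycle
    era * 146_097 + doe - 719_468 // 719_468 shifts the origin to 1970-01-01
}

/// Parses `YYYY-MM-DD` into a Date32-style value (days since the epoch).
/// Assumption: years 0..=9999 only; anything else yields `None`.
pub fn parse_date32(s: &str) -> Option<i32> {
    let mut parts = s.splitn(3, '-');
    let year: i64 = parts.next()?.parse().ok()?;
    let month: u32 = parts.next()?.parse().ok()?;
    let day: u32 = parts.next()?.parse().ok()?;
    let leap = year % 4 == 0 && (year % 100 != 0 || year % 400 == 0);
    let days_in_month = match month {
        2 => if leap { 29 } else { 28 },
        4 | 6 | 9 | 11 => 30,
        1..=12 => 31,
        _ => return None, // month out of range
    };
    if !(0..=9999).contains(&year) || day == 0 || day > days_in_month {
        return None;
    }
    i32::try_from(days_from_civil(year, month, day)).ok()
}

/// Parses `YYYY-MM-DD` into a Date64-style value (milliseconds since the epoch).
pub fn parse_date64(s: &str) -> Option<i64> {
    parse_date32(s).map(|days| i64::from(days) * 86_400_000)
}
```

Under these assumptions, `parse_date32("1970-01-01")` is `Some(0)` and `parse_date64("1970-01-02")` is `Some(86_400_000)`. The sketch is looser than the regex-plus-chrono path in the PR: it does not require zero-padded fields.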

@github-actions

github-actions bot commented Nov 8, 2020

@andizimmerer
Contributor Author

Just saw #8609 and num_cast() might actually be the thing I was looking for to convert i64 to T::Native?

@vertexclique
Contributor

Truly it is. Otherwise platform native primitives mightn't fit into the return types.

Member

@jorgecarleitao jorgecarleitao left a comment

Hi @Jibbow Thanks a lot for this PR: I can see from the issue number that this is a really old issue. Thanks for taking the time to implement it!

I went through this, and my only concern is the transmute, that IMO causes undefined behavior in this case. I see you already know how to address it, though :)

DataType::Date32(DateUnit::Day) => {
    let days = chrono::NaiveDate::parse_from_str(s, "%Y-%m-%d")
        .map(|t| since(t, from_ymd(1970, 1, 1)).num_days() as i32);
    days.map(|t| unsafe { std::mem::transmute_copy::<i32, T::Native>(&t) })
Member

what @vertexclique said: transmute is one of the most unsafe operations in rust, and this can easily lead to undefined behavior if it overflows.

Contributor

@alamb alamb Nov 10, 2020

Another alternative would be to extend the ArrowNativeType with from_i32 and from_i64, following the model of from_usize and then implement those functions for i32 and i64 respectively (as those are the underlying native types)

I tried this approach out on a branch in case you are interested / want to take the change:
Commit with change: alamb@cc61e7a

The branch (with your change) is here: https://github.com/alamb/arrow/tree/alamb/less-unsafe
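
A minimal sketch of the idea described above, for readers who do not want to open the branch. It is not the code from the linked commit: `NativeFromInt` is a hypothetical stand-in for `ArrowNativeType`, and the `Option`-returning default methods are assumed to mirror how `from_usize` is modelled.

```rust
/// Hypothetical stand-in for arrow's `ArrowNativeType`; only the two
/// constructors discussed in this comment are modelled.
pub trait NativeFromInt: Sized {
    /// Default: the type cannot be built from an `i32`.
    fn from_i32(_v: i32) -> Option<Self> {
        None
    }
    /// Default: the type cannot be built from an `i64`.
    fn from_i64(_v: i64) -> Option<Self> {
        None
    }
}

// `i32` backs Date32, so only `from_i32` is overridden.
impl NativeFromInt for i32 {
    fn from_i32(v: i32) -> Option<Self> {
        Some(v)
    }
}

// `i64` backs Date64, so only `from_i64` is overridden.
impl NativeFromInt for i64 {
    fn from_i64(v: i64) -> Option<Self> {
        Some(v)
    }
}

// Any other native type keeps both defaults and always yields `None`.
impl NativeFromInt for f64 {}

/// Generic caller: no `unsafe`, and a mismatched `T` surfaces as `None`
/// instead of reinterpreting bytes the way `transmute_copy` would.
pub fn days_to_native<T: NativeFromInt>(days: i32) -> Option<T> {
    T::from_i32(days)
}
```

With this shape, `days_to_native::<i32>(18574)` returns `Some(18574)`, while `days_to_native::<i64>(18574)` returns `None`, which the caller can turn into a parse error.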

.case_insensitive(true)
.build()
.unwrap();
static ref DATE_RE: Regex = Regex::new(r"^\d\d\d\d-\d\d-\d\d$").unwrap();
Member

isn't there a \d{4} or something like that? May make it a bit easier to read and more expressive, IMO
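
For reference, the regex crate does support counted repetition, so the pattern could be written as ^\d{4}-\d{2}-\d{2}$. The sketch below spells out the same shape check without any dependency, purely to show what the pattern accepts; `looks_like_date` is a hypothetical name, and it matches ASCII digits only, whereas `\d` in the regex crate is Unicode-aware by default.

```rust
/// Dependency-free check for the shape `dddd-dd-dd` (ASCII digits only).
/// Hypothetical helper, not part of the PR.
pub fn looks_like_date(s: &str) -> bool {
    let bytes = s.as_bytes();
    bytes.len() == 10
        && bytes.iter().enumerate().all(|(i, c)| match i {
            // Positions 4 and 7 hold the separators.
            4 | 7 => *c == b'-',
            // Every other position must be a digit.
            _ => c.is_ascii_digit(),
        })
}
```

Like the regex, this only checks the shape: "2020-99-99" passes here and is rejected later by the actual date parser.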

@andizimmerer
Contributor Author

Thanks for the feedback @vertexclique and @jorgecarleitao!
I'll update the PR when #8609 is merged.


/// Parses a string into the specified `ArrowPrimitiveType`.
fn parse_field<T: ArrowPrimitiveType>(s: &str) -> Result<T::Native> {
let from_ymd = chrono::NaiveDate::from_ymd;
Contributor

Note there is also code to convert strings to nanosecond timestamps (string_to_timestamp_nanos) here: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/datetime_expressions.rs#L30

Contributor

In the long run, this is motivation to centralise this into temporal kernels; so we can share this with the JSON reader.
One of the things I'll submit a PR for in the coming days/weeks, is a crate that parses strings to numbers faster than what libcore does.

@alamb
Contributor

alamb commented Nov 10, 2020

Date32 currently assumes to have DateUnit::Day and Date64 assumes DateUnit::Millisecond respectively.

I think this is fine as long as reasonable errors (unsupported XXX) are produced if alternate types are used

@nevi-me
Contributor

nevi-me commented Nov 10, 2020

@alamb I can't respond to your comment.
I see you're using T::Native::from_i32(since(days, from_ymd(1970, 1, 1)).num_days() as i32). Shouldn't there be a way for us to get the desired value from chrono without introducing a fallible cast/conversion?

@alamb
Contributor

alamb commented Nov 10, 2020

@nevi-me -- good call -- I was simply copy/pasting what was in this PR in terms of as i32 without looking carefully enough. I updated https://github.com/alamb/arrow/tree/alamb/less-unsafe with a new commit that doesn't use as i32 and instead uses try_into()

86741de
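
To make the difference concrete, here is a small sketch (not the code from the commit above) contrasting the two conversions. The function names are hypothetical; `days` stands for the `i64` that chrono's `num_days()` returns.

```rust
use std::convert::TryFrom;

/// `as` silently keeps only the low 32 bits, so an out-of-range
/// day count wraps around instead of failing.
pub fn days_lossy(days: i64) -> i32 {
    days as i32
}

/// `try_from` (the trait behind `try_into`) reports the overflow,
/// which can then be surfaced as a parse error.
pub fn days_checked(days: i64) -> Result<i32, String> {
    i32::try_from(days).map_err(|_| format!("day count {} does not fit in an i32", days))
}
```

For any real calendar date the day count fits comfortably in an `i32`, so the checked version costs little and only matters for malformed or extreme input.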

@nevi-me
Contributor

nevi-me commented Nov 21, 2020

Hi @Jibbow , I think we need not wait for #8609 before you can update this PR

let from_ymd = chrono::NaiveDate::from_ymd;
let since = chrono::NaiveDate::signed_duration_since;

match T::DATA_TYPE {
Contributor

I think this will benefit from changes in this PR to include a trait.
#8714

@github-actions github-actions bot added the needs-rebase label Nov 25, 2020
@jorgecarleitao
Member

@Jibbow, are you blocked, or do you need help here? Let us know if that is the case so that we can help.

@Dandandan
Contributor

Benchmarks on master are failing as the csv parser doesn't support date32 / date64 related to this addition: #8892

@Jibbow any plans on finishing the PR? Would be really nice to add!

@seddonm1
Contributor

Benchmarks on master are failing as the csv parser doesn't support date32 / date64 related to this addition: #8892

@Jibbow any plans on finishing the PR? Would be really nice to add!

Sorry that I broke this. I had been running my benchmarks with Parquet so did not notice. I am working with Andy to uplift the tests to catch this kind of issue.

alamb pushed a commit that referenced this pull request Dec 15, 2020
This is based on #8611 by @Jibbow and some suggestions by @alamb

Adds date32 / date64 to the csv reader. This also fixes the benchmark which now includes date types which were added by @seddonm1

There are some missing parts in the date format support (such as actual ms support) but those can be implemented I think as separate PRs.

Closes #8913 from Dandandan/date_csv

Authored-by: Heres, Daniel <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
@alamb
Contributor

alamb commented Dec 15, 2020

FYI we have incorporated this code in via #8913 - thanks for the help @Jibbow . I am going to close this PR for now as it has a conflict and we are trying to clean up the PR queue -- please let me know if that was a mistake.

@alamb alamb closed this Dec 15, 2020
@andizimmerer
Contributor Author

Sorry for being so quiet over the last few weeks. I had a research paper deadline coming up. Thanks @Dandandan for taking over!

alamb pushed a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
This is based on apache/arrow#8611 by @Jibbow and some suggestions by @alamb

Adds date32 / date64 to the csv reader. This also fixes the benchmark which now includes date types which were added by @seddonm1

There are some missing parts in the date format support (such as actual ms support) but those can be implemented I think as separate PRs.

Closes #8913 from Dandandan/date_csv

Authored-by: Heres, Daniel <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
Labels

Component: Rust, needs-rebase

7 participants