-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-10654: [Rust] Specialize parsing of floats / bools in CSV Reader #8714
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Some benchmark/context of string -> f64 is here (note: log scale) https://github.com/Alexhuszagh/rust-lexical/ |
|
Did some benchmarking on this. Seems like a small win. Master: This version: |
nevi-me
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I've tried using lexical before, and I'm happy that it's faster than the stdlib
| } | ||
| } | ||
|
|
||
| impl Parser for UInt64Type {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would there be plans to support these at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently they are supported by passing a schema to the csv reader? And here they keep using the standard string.parse::<T> method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps @nevi-me was asking if there are plans to improve the parsing performance of these types as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. It could special case on those as well. Difference between lexical core and standard lib don't seem too big here though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I agree there is no reason to special case other types at this time
| } | ||
| } | ||
|
|
||
| impl Parser for UInt64Type {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps @nevi-me was asking if there are plans to improve the parsing performance of these types as well
| } | ||
|
|
||
| impl Parser for BooleanType { | ||
| fn parse(string: &str) -> Option<bool> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wondered how this related to the rust standard boolean parsing: https://doc.rust-lang.org/src/core/str/traits.rs.html#590
Seems like it anything it would be slightly slower, but also support mixed case (true and True). Seems like a good improvement to me, though adding a test to encode the expected behavior would probably be a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added https://issues.apache.org/jira/browse/ARROW-10677 to track extra parsing tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the mixed case is to keep compatibility with the previous implementation (it used to_lower)
Could as well have specific cases for all caps / capitalized booleans instead maybe? That would avoid accepting tRue etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the code in this PR is good as is -- I just think it would be nice to document what the behavior is more explicitly -- so I did so in #8733
|
The CI failures don't seem related to this PR -- I am going to retrigger them |
|
CI is green, so merging |
| impl Parser for BooleanType { | ||
| fn parse(string: &str) -> Option<bool> { | ||
| if string == "false" || string == "FALSE" || string == "False" { | ||
| return Some(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I totally missed this, but this line has a bug parsing in this field appears to be backwards -- "false" seems to return true -- it looks like it got added in 52bf0e8 -- FYI @Dandandan and @nevi-me -- I have a fix for it shortly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
O haha stupid bug, next time I'll make sure to add some unit tests directly in the PR.
Thanks for fixing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was my bad for not catching it in review as well
Internal rust float parser is known to be slow.
This change allows to have specialized implementations rather than relying on FromStr::parse.
Also avoids calling
to_lowercasefor booleans.Would be nice to benchmark this.