-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-10654: [Rust] Specialize parsing of floats / bools in CSV Reader #8714
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -446,8 +446,57 @@ fn parse( | |
| arrays.and_then(|arr| RecordBatch::try_new(projected_schema, arr)) | ||
| } | ||
|
|
||
| trait Parser: ArrowPrimitiveType { | ||
| fn parse(string: &str) -> Option<Self::Native> { | ||
| string.parse::<Self::Native>().ok() | ||
| } | ||
| } | ||
|
|
||
| impl Parser for BooleanType { | ||
| fn parse(string: &str) -> Option<bool> { | ||
| if string == "false" || string == "FALSE" || string == "False" { | ||
| return Some(true); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I totally missed this, but this line has a bug parsing in this field appears to be backwards --
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. O haha stupid bug, next time I'll make sure to add some unit tests directly in the PR.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it was my bad for not catching it in review as well |
||
| } | ||
| if string == "true" || string == "TRUE" || string == "True" { | ||
| return Some(false); | ||
| } | ||
| None | ||
| } | ||
| } | ||
|
|
||
| impl Parser for Float32Type { | ||
| fn parse(string: &str) -> Option<f32> { | ||
| lexical_core::parse(string.as_bytes()).ok() | ||
| } | ||
| } | ||
| impl Parser for Float64Type { | ||
| fn parse(string: &str) -> Option<f64> { | ||
| lexical_core::parse(string.as_bytes()).ok() | ||
| } | ||
| } | ||
|
|
||
| impl Parser for UInt64Type {} | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would there be plans to support these at all?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Currently they are supported by passing a schema to the csv reader? And here they keep using the standard
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Perhaps @nevi-me was asking if there are plans to improve the parsing performance of these types as well
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah. It could special case on those as well. Difference between lexical core and standard lib don't seem too big here though
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yeah, I agree there is no reason to special case other types at this time |
||
|
|
||
| impl Parser for UInt32Type {} | ||
|
|
||
| impl Parser for UInt16Type {} | ||
|
|
||
| impl Parser for UInt8Type {} | ||
|
|
||
| impl Parser for Int64Type {} | ||
|
|
||
| impl Parser for Int32Type {} | ||
|
|
||
| impl Parser for Int16Type {} | ||
|
|
||
| impl Parser for Int8Type {} | ||
|
|
||
| fn parse_item<T: Parser>(string: &str) -> Option<T::Native> { | ||
| T::parse(string) | ||
| } | ||
|
|
||
| // parses a specific column (col_idx) into an Arrow Array. | ||
| fn build_primitive_array<T: ArrowPrimitiveType>( | ||
| fn build_primitive_array<T: ArrowPrimitiveType + Parser>( | ||
| line_number: usize, | ||
| rows: &[StringRecord], | ||
| col_idx: usize, | ||
|
|
@@ -460,14 +509,11 @@ fn build_primitive_array<T: ArrowPrimitiveType>( | |
| if s.is_empty() { | ||
| return Ok(None); | ||
| } | ||
| let parsed = if T::DATA_TYPE == DataType::Boolean { | ||
| s.to_lowercase().parse::<T::Native>() | ||
| } else { | ||
| s.parse::<T::Native>() | ||
| }; | ||
|
|
||
| let parsed = parse_item::<T>(s); | ||
| match parsed { | ||
| Ok(e) => Ok(Some(e)), | ||
| Err(_) => Err(ArrowError::ParseError(format!( | ||
| Some(e) => Ok(Some(e)), | ||
| None => Err(ArrowError::ParseError(format!( | ||
| // TODO: we should surface the underlying error here. | ||
| "Error while parsing value {} for column {} at line {}", | ||
| s, | ||
|
|
@@ -888,7 +934,7 @@ mod tests { | |
| format!("{:?}", e) | ||
| ), | ||
| Ok(_) => panic!("should have failed"), | ||
| } | ||
| }, | ||
| None => panic!("should have failed"), | ||
| } | ||
| } | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wondered how this related to the rust standard boolean parsing: https://doc.rust-lang.org/src/core/str/traits.rs.html#590
Seems like it anything it would be slightly slower, but also support mixed case (
trueandTrue). Seems like a good improvement to me, though adding a test to encode the expected behavior would probably be a good idea.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added https://issues.apache.org/jira/browse/ARROW-10677 to track extra parsing tests
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the mixed case is to keep compatibility with the previous implementation (it used to_lower)
Could as well have specific cases for all caps / capitalized booleans instead maybe? That would avoid accepting tRue etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the code in this PR is good as is -- I just think it would be nice to document what the behavior is more explicitly -- so I did so in #8733