Closed
1 change: 1 addition & 0 deletions rust/arrow/Cargo.toml
@@ -50,6 +50,7 @@ chrono = "0.4"
 flatbuffers = "0.6"
 hex = "0.4"
 prettytable-rs = { version = "0.8.0", optional = true }
+lexical-core = "^0.7"

 [features]
 default = []
64 changes: 55 additions & 9 deletions rust/arrow/src/csv/reader.rs
@@ -446,8 +446,57 @@ fn parse(
     arrays.and_then(|arr| RecordBatch::try_new(projected_schema, arr))
 }

+trait Parser: ArrowPrimitiveType {
+    fn parse(string: &str) -> Option<Self::Native> {
+        string.parse::<Self::Native>().ok()
+    }
+}
+
+impl Parser for BooleanType {
+    fn parse(string: &str) -> Option<bool> {
Contributor

I wondered how this relates to the Rust standard boolean parsing: https://doc.rust-lang.org/src/core/str/traits.rs.html#590

If anything, it seems this would be slightly slower, but it also supports mixed case (true and True). Seems like a good improvement to me, though adding a test to encode the expected behavior would probably be a good idea.

Contributor

@alamb alamb Nov 21, 2020

Added https://issues.apache.org/jira/browse/ARROW-10677 to track extra parsing tests

Contributor Author

@Dandandan Dandandan Nov 21, 2020

Yes, the mixed case is to keep compatibility with the previous implementation (it used to_lowercase).

Maybe it could have specific cases for all-caps / capitalized booleans instead? That would avoid accepting tRue etc.

Contributor

I think the code in this PR is good as is -- I just think it would be nice to document what the behavior is more explicitly -- so I did so in #8733
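To make the behavior discussed in this thread concrete, here is a minimal sketch contrasting the two approaches. Both function names are hypothetical and neither is the code from the PR: `parse_bool_strict` accepts only the three spellings of each value, while `parse_bool_lowercase` mirrors the earlier to_lowercase approach and therefore also accepts spellings such as tRue.

```rust
/// Hypothetical helper (not the PR's implementation): accepts only the
/// lowercase, all-caps, and capitalized spellings of each boolean.
fn parse_bool_strict(s: &str) -> Option<bool> {
    match s {
        "true" | "TRUE" | "True" => Some(true),
        "false" | "FALSE" | "False" => Some(false),
        _ => None,
    }
}

/// Hypothetical helper mirroring the earlier approach: lowercase the input,
/// then defer to the standard library's `FromStr` impl for `bool`.
/// This allocates a new `String` and accepts any casing, e.g. "tRue".
fn parse_bool_lowercase(s: &str) -> Option<bool> {
    s.to_lowercase().parse::<bool>().ok()
}
```

The standard library's own `bool` parsing accepts only the exact strings "true" and "false", so both variants are more lenient than `str::parse::<bool>`.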

+        if string == "false" || string == "FALSE" || string == "False" {
+            return Some(true);
Contributor

I totally missed this, but this line has a bug: parsing in this field appears to be backwards -- "false" seems to return true. It looks like it got added in 52bf0e8 -- FYI @Dandandan and @nevi-me -- I will have a fix for it shortly.

Contributor Author

Oh haha, stupid bug. Next time I'll make sure to add some unit tests directly in the PR.
Thanks for fixing.

Contributor

It was my bad for not catching it in review as well.

+        }
+        if string == "true" || string == "TRUE" || string == "True" {
+            return Some(false);
+        }
+        None
+    }
+}
+
+impl Parser for Float32Type {
+    fn parse(string: &str) -> Option<f32> {
+        lexical_core::parse(string.as_bytes()).ok()
+    }
+}
+impl Parser for Float64Type {
+    fn parse(string: &str) -> Option<f64> {
+        lexical_core::parse(string.as_bytes()).ok()
+    }
+}
+
+impl Parser for UInt64Type {}
Contributor

Would there be plans to support these at all?

Contributor Author

Currently they are supported by passing a schema to the CSV reader? And here they keep using the standard string.parse::<T> method.

Contributor

Perhaps @nevi-me was asking if there are plans to improve the parsing performance of these types as well

Contributor Author

Ah. It could special-case those as well. The difference between lexical-core and the standard library doesn't seem too big here, though.

Contributor

yeah, I agree there is no reason to special case other types at this time
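The pattern this thread describes, a trait whose default method falls back to the standard string parsing while selected types override it, can be sketched without any Arrow types. Everything below is illustrative: `NativeParser` and `parse_native` are hypothetical names, and the `f64` override only trims whitespace as a stand-in for a specialised parser such as lexical-core, which is not used here.

```rust
use std::str::FromStr;

/// Hypothetical stand-in for the PR's `Parser` trait, implemented directly
/// on native Rust types instead of Arrow primitive types.
trait NativeParser: Sized + FromStr {
    /// Default: defer to the standard library's `FromStr` implementation.
    fn parse_native(s: &str) -> Option<Self> {
        s.parse::<Self>().ok()
    }
}

// Integer types opt in with an empty impl and inherit the default method.
impl NativeParser for u64 {}
impl NativeParser for i8 {}

// A type overrides the default only where a different parser is wanted.
impl NativeParser for f64 {
    fn parse_native(s: &str) -> Option<Self> {
        // Assumption for illustration: trimming stands in for a specialised
        // float parser; the PR itself calls lexical_core::parse here.
        s.trim().parse::<f64>().ok()
    }
}

/// Generic entry point, analogous in shape to `parse_item` in the diff.
fn parse_item<T: NativeParser>(s: &str) -> Option<T> {
    T::parse_native(s)
}
```

With this shape, special-casing another type later only needs a non-empty impl for that type; callers of `parse_item` stay unchanged.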


+impl Parser for UInt32Type {}
+
+impl Parser for UInt16Type {}
+
+impl Parser for UInt8Type {}
+
+impl Parser for Int64Type {}
+
+impl Parser for Int32Type {}
+
+impl Parser for Int16Type {}
+
+impl Parser for Int8Type {}
+
+fn parse_item<T: Parser>(string: &str) -> Option<T::Native> {
+    T::parse(string)
+}
+
 // parses a specific column (col_idx) into an Arrow Array.
-fn build_primitive_array<T: ArrowPrimitiveType>(
+fn build_primitive_array<T: ArrowPrimitiveType + Parser>(
     line_number: usize,
     rows: &[StringRecord],
     col_idx: usize,
@@ -460,14 +509,11 @@ fn build_primitive_array<T: ArrowPrimitiveType>(
 if s.is_empty() {
     return Ok(None);
 }
-let parsed = if T::DATA_TYPE == DataType::Boolean {
-    s.to_lowercase().parse::<T::Native>()
-} else {
-    s.parse::<T::Native>()
-};
+
+let parsed = parse_item::<T>(s);
 match parsed {
-    Ok(e) => Ok(Some(e)),
-    Err(_) => Err(ArrowError::ParseError(format!(
+    Some(e) => Ok(Some(e)),
+    None => Err(ArrowError::ParseError(format!(
         // TODO: we should surface the underlying error here.
         "Error while parsing value {} for column {} at line {}",
         s,
@@ -888,7 +934,7 @@ mod tests {
                 format!("{:?}", e)
             ),
             Ok(_) => panic!("should have failed"),
-        }
+        },
         None => panic!("should have failed"),
     }
 }