@Dandandan
Contributor

Rust's built-in float parser is known to be slow.

This change allows specialized parser implementations instead of relying on FromStr::parse for every type.

Also avoids calling to_lowercase for booleans.

Would be nice to benchmark this.
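
To illustrate the shape of the change, here is a minimal, dependency-free sketch. `NativeType` is a hypothetical stand-in for Arrow's primitive-type trait, and the `Float64Type` override simply delegates to the standard library at the point where a faster parser such as lexical would be called. Only the trait layout is the point here, not the exact code in this PR.

```rust
use std::str::FromStr;

/// Hypothetical stand-in for Arrow's primitive-type trait: maps a marker
/// type to the native Rust value it parses into.
trait NativeType {
    type Native: FromStr;
}

/// The default implementation falls back to `FromStr`; individual types
/// can override `parse` with a specialized routine.
trait Parser: NativeType {
    fn parse(string: &str) -> Option<Self::Native> {
        string.parse::<Self::Native>().ok()
    }
}

struct UInt64Type;
struct Float64Type;

impl NativeType for UInt64Type {
    type Native = u64;
}

impl NativeType for Float64Type {
    type Native = f64;
}

// Keeps the default `FromStr`-based parser.
impl Parser for UInt64Type {}

// Specialized: this is where a faster float parser would be plugged in.
// Assumption: the sketch delegates to std so that it needs no extra crate.
impl Parser for Float64Type {
    fn parse(string: &str) -> Option<f64> {
        string.parse::<f64>().ok()
    }
}
```

Callers use the same entry point either way, e.g. `Float64Type::parse("2.5")` or `UInt64Type::parse("42")`, so a type can be specialized later without touching the reader code.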

@Dandandan Dandandan changed the title ARROW-10654: Specialize parsers ARROW-10654: [Rust] Specialize parsing of floats / bools Nov 19, 2020
@Dandandan
Contributor Author

Dandandan commented Nov 19, 2020

Some benchmark/context of string -> f64 is here (note: log scale) https://github.com/Alexhuszagh/rust-lexical/

@Dandandan
Contributor Author

Did some benchmarking on this. Seems like a small win: the mean query time drops from about 4105 ms to about 3965 ms, roughly 3%.

Master:

Running benchmarks with the following options: Opt { debug: false, iterations: 10, concurrency: 1, batch_size: 4096, path: "./yellow_tripdata_2020-01.csv", file_format: "csv" }

Query 'fare_amt_by_passenger' iteration 0 took 4114 ms
Query 'fare_amt_by_passenger' iteration 1 took 4087 ms
Query 'fare_amt_by_passenger' iteration 2 took 4094 ms
Query 'fare_amt_by_passenger' iteration 3 took 4118 ms
Query 'fare_amt_by_passenger' iteration 4 took 4091 ms
Query 'fare_amt_by_passenger' iteration 5 took 4099 ms
Query 'fare_amt_by_passenger' iteration 6 took 4115 ms
Query 'fare_amt_by_passenger' iteration 7 took 4129 ms
Query 'fare_amt_by_passenger' iteration 8 took 4105 ms
Query 'fare_amt_by_passenger' iteration 9 took 4095 ms

This version:

Running benchmarks with the following options: Opt { debug: false, iterations: 10, concurrency: 1, batch_size: 4096, path: "./yellow_tripdata_2020-01.csv", file_format: "csv" }
Executing 'fare_amt_by_passenger'
Query 'fare_amt_by_passenger' iteration 0 took 3985 ms
Query 'fare_amt_by_passenger' iteration 1 took 3954 ms
Query 'fare_amt_by_passenger' iteration 2 took 3961 ms
Query 'fare_amt_by_passenger' iteration 3 took 3959 ms
Query 'fare_amt_by_passenger' iteration 4 took 3963 ms
Query 'fare_amt_by_passenger' iteration 5 took 3963 ms
Query 'fare_amt_by_passenger' iteration 6 took 3958 ms
Query 'fare_amt_by_passenger' iteration 7 took 3959 ms
Query 'fare_amt_by_passenger' iteration 8 took 3971 ms
Query 'fare_amt_by_passenger' iteration 9 took 3977 ms
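
To put a number on the difference between the two runs, the timings above can be averaged and compared. The sketch below copies the ten per-iteration values from each run; the names `MASTER_MS`, `BRANCH_MS`, `mean`, and `relative_speedup` are hypothetical helpers for this calculation, not part of the benchmark tool.

```rust
/// Per-iteration timings in milliseconds, copied from the master run.
const MASTER_MS: [f64; 10] = [
    4114.0, 4087.0, 4094.0, 4118.0, 4091.0, 4099.0, 4115.0, 4129.0, 4105.0, 4095.0,
];

/// Per-iteration timings in milliseconds, copied from the run with this change.
const BRANCH_MS: [f64; 10] = [
    3985.0, 3954.0, 3961.0, 3959.0, 3963.0, 3963.0, 3958.0, 3959.0, 3971.0, 3977.0,
];

/// Arithmetic mean of a non-empty slice of samples.
fn mean(samples: &[f64]) -> f64 {
    samples.iter().sum::<f64>() / samples.len() as f64
}

/// Fractional reduction in mean run time going from `before` to `after`.
fn relative_speedup(before: &[f64], after: &[f64]) -> f64 {
    let baseline = mean(before);
    (baseline - mean(after)) / baseline
}
```

With these values the means are 4104.7 ms and 3965.0 ms, so `relative_speedup(&MASTER_MS, &BRANCH_MS)` comes out at about 0.034, a reduction of roughly 3.4% for the whole query (which includes more than just parsing).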

Contributor

@nevi-me nevi-me left a comment

LGTM

I've tried using lexical before, and I'm happy that it's faster than the stdlib

}
}

impl Parser for UInt64Type {}
Contributor

Would there be plans to support these at all?

Contributor Author

Currently they are supported by passing a schema to the CSV reader. Here they keep using the standard string.parse::<T>() method.

Contributor

Perhaps @nevi-me was asking if there are plans to improve the parsing performance of these types as well

Contributor Author

Ah. We could special-case those as well. The difference between lexical-core and the standard library doesn't seem too big here, though.

Contributor

Yeah, I agree there is no reason to special-case other types at this time.

@alamb alamb changed the title ARROW-10654: [Rust] Specialize parsing of floats / bools ARROW-10654: [Rust] Specialize parsing of floats / bools iin CSV Reader Nov 21, 2020
@alamb alamb changed the title ARROW-10654: [Rust] Specialize parsing of floats / bools iin CSV Reader ARROW-10654: [Rust] Specialize parsing of floats / bools in CSV Reader Nov 21, 2020
}

impl Parser for BooleanType {
    fn parse(string: &str) -> Option<bool> {
Contributor

I wondered how this relates to Rust's standard boolean parsing: https://doc.rust-lang.org/src/core/str/traits.rs.html#590

It seems like, if anything, it would be slightly slower, but it also supports mixed case (true and True). Seems like a good improvement to me, though adding a test to encode the expected behavior would probably be a good idea.
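
To make the comparison concrete: the standard library's bool parser accepts only the exact strings true and false, while a case-insensitive comparison also accepts True, TRUE, and tRue. The sketch below uses two hypothetical helpers, `parse_bool_std` and `parse_bool_ignore_case`; the latter is built on `str::eq_ignore_ascii_case`, which compares without allocating a lowercased copy. It is one way to get this behavior, not necessarily the code in this PR.

```rust
/// Hypothetical wrapper around the standard library parser, which only
/// accepts the exact strings "true" and "false".
fn parse_bool_std(string: &str) -> Option<bool> {
    string.parse::<bool>().ok()
}

/// Hypothetical helper: case-insensitive boolean parsing without the
/// allocation that `to_lowercase` would incur.
fn parse_bool_ignore_case(string: &str) -> Option<bool> {
    if string.eq_ignore_ascii_case("true") {
        Some(true)
    } else if string.eq_ignore_ascii_case("false") {
        Some(false)
    } else {
        None
    }
}
```

A test that encodes the expected behavior would assert, for example, that `parse_bool_std("True")` is `None` while `parse_bool_ignore_case("True")` and `parse_bool_ignore_case("tRue")` are both `Some(true)`.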

Contributor

@alamb alamb Nov 21, 2020

Added https://issues.apache.org/jira/browse/ARROW-10677 to track extra parsing tests

Contributor Author

@Dandandan Dandandan Nov 21, 2020

Yes, the mixed case is to keep compatibility with the previous implementation (it used to_lowercase).

Alternatively, we could have specific cases for all-caps and capitalized booleans instead. That would avoid accepting tRue etc.

Contributor

I think the code in this PR is good as is -- I just think it would be nice to document what the behavior is more explicitly -- so I did so in #8733

@alamb
Contributor

alamb commented Nov 21, 2020

The CI failures don't seem related to this PR -- I am going to retrigger them

@alamb
Contributor

alamb commented Nov 21, 2020

CI is green, so merging

@alamb alamb closed this in d873657 Nov 21, 2020
impl Parser for BooleanType {
    fn parse(string: &str) -> Option<bool> {
        if string == "false" || string == "FALSE" || string == "False" {
            return Some(true);
Contributor

I totally missed this, but this line has a bug: the parsing appears to be backwards -- "false" returns true. It looks like it got added in 52bf0e8. FYI @Dandandan and @nevi-me -- I will have a fix for it shortly.
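
For reference, the intended behavior of the branch quoted above might look like the sketch below. It is written as a hypothetical free function `parse_bool` rather than the `Parser` impl so that it stands alone, and the `true` arm is an assumption by symmetry, since the excerpt shows only the `false` comparison. It is not the actual fix.

```rust
/// Sketch of the intended behavior: each accepted spelling maps to the
/// boolean it names; anything else (including "tRue") is rejected.
fn parse_bool(string: &str) -> Option<bool> {
    match string {
        "false" | "FALSE" | "False" => Some(false),
        // Assumption: the `true` spellings mirror the `false` ones.
        "true" | "TRUE" | "True" => Some(true),
        _ => None,
    }
}
```

A one-line regression test such as `assert_eq!(parse_bool("false"), Some(false))` would have caught the inverted return value.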

Contributor Author

Oh haha, stupid bug. Next time I'll make sure to add some unit tests directly in the PR.
Thanks for fixing.

Contributor

It was my bad for not catching it in review as well.
