ARROW-10747: [Rust]: CSV reader optimization #8781
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename the pull request title in the following format? See also:
I found some further opportunities for optimization by also reusing the StringRecord items, which gives another speedup.
This is ready for review now.
jorgecarleitao
left a comment
Nice work, @Dandandan, really cool speedup for such an important op.
nevi-me
left a comment
LGTM
@Dandandan I'm not sure why the Windows build failed; I assume it is unrelated, but the logs are not available. Could you push an empty commit to trigger CI again?
Just did. Let's see what happens!
We can rerun failed CI jobs from the UI, which is often better as it doesn't trigger AppVeyor and Travis CI.
Now some other jobs failed. Maybe we can rerun those?
CI seems to be misbehaving: it's not letting me cancel the workflow, even though the tests have failed. I'll leave this tab open and retry the Rust jobs in about an hour. I'll merge this after CI passes.
I also removed the buffered iterator, as it is now unused and I think it will not lead to efficient code in general.
alamb
left a comment
Sorry I didn't get a chance to review this carefully before merge. It looks nice to me. Nice work @Dandandan
  ) -> Result<()> {
      let mut queries = HashMap::new();
-     queries.insert("fare_amt_by_passenger", "SELECT passenger_count, MIN(fare_amount), MIN(fare_amount), SUM(fare_amount) FROM tripdata GROUP BY passenger_count");
+     queries.insert("fare_amt_by_passenger", "SELECT passenger_count, MIN(fare_amount), MAX(fare_amount), SUM(fare_amount) FROM tripdata GROUP BY passenger_count");
👍
This PR makes CSV reading (quite a bit) faster by reusing allocations and doing things a bit more manually.
It removes the usage of BufReader, which causes overhead because rust-csv already does its own buffering.
The nytaxi benchmark (entire job, including reading 1 year of CSV data) speeds up from ~4500 ms to ~1900 ms.
Loading the line item CSV into memory for the TPC-H benchmark goes from ~9800 ms to ~6000 ms.
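To illustrate the kind of allocation reuse described above, here is a minimal sketch using only the standard library. It is not the code from this PR: the function name `sum_second_column` and the two-column input are made up for the example. The idea is the same, though: one buffer is allocated up front and cleared between rows instead of allocating a fresh one per row. With rust-csv, the analogous pattern would be to pass a single reused StringRecord to `Reader::read_record`.

```rust
use std::io::BufRead;

/// Sums the second comma-separated field of every line.
/// Hypothetical helper for illustration only (not part of Arrow or rust-csv).
fn sum_second_column<R: BufRead>(mut reader: R) -> std::io::Result<f64> {
    // Allocated once and reused for every row.
    let mut line = String::new();
    let mut total = 0.0;
    loop {
        // `clear` drops the contents but keeps the capacity,
        // so later rows usually need no new allocation.
        line.clear();
        if reader.read_line(&mut line)? == 0 {
            break; // end of input
        }
        // The field is borrowed from the reused buffer; no per-field String.
        if let Some(field) = line.trim_end().split(',').nth(1) {
            // Assumption for the sketch: unparsable fields count as 0.
            total += field.parse::<f64>().unwrap_or(0.0);
        }
    }
    Ok(total)
}

fn main() -> std::io::Result<()> {
    // `&[u8]` implements `BufRead`, so no extra buffering wrapper is needed.
    let data = "1,2.5\n2,4.0\n1,3.5\n";
    let total = sum_second_column(data.as_bytes())?;
    println!("total = {}", total);
    Ok(())
}
```

The per-row work here only borrows from `line`, which is what makes the reuse safe; anything that must outlive the row (for example values appended to an array builder) still has to be copied out before the next `clear`.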
I think a further optimization would be to stop using StringRecords altogether (e.g. by using the underlying https://docs.rs/csv-core/0.1.10/csv_core/ library instead), but that could be a next step.
FYI @alamb @nevi-me @jorgecarleitao