Skip to content

Conversation

@Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Nov 26, 2020

this PR makes CSV reading (quite a bit) faster by reusing allocations, and doing things a bit more manually.
It removes usage of BufReader, which is done in rust-csv already and causes overhead.

The nytaxi (entire job, with reading 1 year csv) benchmark speeds up from ~4500ms to ~1900ms.
Loading the line item csv in memory for the tpch benchmark for goes from ~9800ms -> ~6000 ms.

I think a further optimization would be to stop using the StringRecords altogether (e.g. by using the underlying https://docs.rs/csv-core/0.1.10/csv_core/ library instead) but that could be a next step.

FYI @alamb @nevi-me @jorgecarleitao

@github-actions
Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@Dandandan Dandandan changed the title ARROW-10747: [DataFusion]: WIP CSV reader optimization ARROW-10747: [Rust]: WIP CSV reader optimization Nov 26, 2020
@Dandandan Dandandan changed the title ARROW-10747: [Rust]: WIP CSV reader optimization ARROW-10747: [Rust]: WIP CSV reader optimization Nov 26, 2020
@github-actions
Copy link

@Dandandan Dandandan changed the title ARROW-10747: [Rust]: WIP CSV reader optimization ARROW-10747: [Rust]: CSV reader optimization Nov 26, 2020
@Dandandan
Copy link
Contributor Author

I found some further opportunities for optimizing by also reusing the stringrecord items, for another speed up.

@Dandandan
Copy link
Contributor Author

Is ready for review now.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, @Dandandan , really cool speedup for such an important op.

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@andygrove
Copy link
Member

@Dandandan I'm not sure why the Windows build failed and I assume it is unrelated but the logs are not available. Could you push an empty commit to trigger CI again?

@Dandandan
Copy link
Contributor Author

@Dandandan I'm not sure why the Windows build failed and I assume it is unrelated but the logs are not available. Could you push an empty commit to trigger CI again?

Just did. Let's see what happens!

@nevi-me
Copy link
Contributor

nevi-me commented Nov 27, 2020

Could you push an empty commit to trigger CI again?

We can rerun failed CI jobs from the UI, which is often better as it doesn't trigger AppVeyor and Travis CI

@Dandandan
Copy link
Contributor Author

Now some other jobs failed. Maybe we can rerun those?

@nevi-me
Copy link
Contributor

nevi-me commented Nov 27, 2020

Now some other jobs failed. Maybe we can rerun those?

CI seems to be misbehaving, it's not letting me cancel the workflow, even though the tests have failed. I'll leave this tab open, and retry the Rust jobs in about an hour. I'll merge this after CI passes

@Dandandan
Copy link
Contributor Author

I also removed the now unused buffered iterator as it is unused by now, and I think will not lead to efficient code in general.

@nevi-me nevi-me closed this in 95a497f Nov 28, 2020
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't get a chance to review this carefully before merge. It looks nice to me. Nice work @Dandandan

) -> Result<()> {
let mut queries = HashMap::new();
queries.insert("fare_amt_by_passenger", "SELECT passenger_count, MIN(fare_amount), MIN(fare_amount), SUM(fare_amount) FROM tripdata GROUP BY passenger_count");
queries.insert("fare_amt_by_passenger", "SELECT passenger_count, MIN(fare_amount), MAX(fare_amount), SUM(fare_amount) FROM tripdata GROUP BY passenger_count");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants