pitrou commented Oct 15, 2020

This improves CSV string conversion performance by about 30%.


pitrou commented Oct 15, 2020

This also improves the ARROW-10308 benchmark by about 9%.

pitrou requested a review from bkietz on October 15, 2020, 16:27

pitrou commented Oct 15, 2020

Another possibility would be to compute and store ASCIIness of values while parsing CSV (ideally this should not cost anything CPU-wise). Then we can reuse that information to skip UTF8 validation for most values.

Edit: actually, a quick attempt shows a significant decrease in CSV parsing speed. Too bad.
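
For illustration, the general idea of an ASCII fast path can be sketched at validation time rather than inside the CSV parser. This is a minimal sketch under stated assumptions, not Arrow's implementation: the names `IsAscii`, `ValidateUtf8Slow` and `ValidateUtf8` are hypothetical, and the slow path is a plain byte-by-byte checker chosen for readability. A value whose bytes all have the high bit clear is pure ASCII and therefore valid UTF-8, so full validation is only needed for the remaining values.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

// Hypothetical helper: true if every byte has its high bit clear.
inline bool IsAscii(std::string_view s) {
  const char* p = s.data();
  std::size_t n = s.size();
  // Test 8 bytes per iteration; memcpy sidesteps unaligned-load issues.
  while (n >= 8) {
    std::uint64_t word;
    std::memcpy(&word, p, 8);
    if (word & 0x8080808080808080ULL) return false;
    p += 8;
    n -= 8;
  }
  // Handle the remaining tail (fewer than 8 bytes) one byte at a time.
  while (n > 0) {
    if (static_cast<unsigned char>(*p) & 0x80) return false;
    ++p;
    --n;
  }
  return true;
}

// Hypothetical slow path: byte-by-byte check that rejects truncated
// sequences, overlong encodings, surrogates and code points > U+10FFFF.
inline bool ValidateUtf8Slow(std::string_view s) {
  const std::size_t n = s.size();
  auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
  auto is_cont = [&](std::size_t k) {
    return k < n && (byte(k) & 0xC0) == 0x80;
  };
  std::size_t i = 0;
  while (i < n) {
    const unsigned char b = byte(i);
    if (b < 0x80) {
      i += 1;  // single-byte (ASCII)
    } else if (b >= 0xC2 && b <= 0xDF) {
      if (!is_cont(i + 1)) return false;
      i += 2;  // two-byte sequence (0xC0 and 0xC1 would be overlong)
    } else if ((b & 0xF0) == 0xE0) {
      if (!is_cont(i + 1) || !is_cont(i + 2)) return false;
      const unsigned char b1 = byte(i + 1);
      if (b == 0xE0 && b1 < 0xA0) return false;   // overlong
      if (b == 0xED && b1 >= 0xA0) return false;  // UTF-16 surrogate
      i += 3;
    } else if (b >= 0xF0 && b <= 0xF4) {
      if (!is_cont(i + 1) || !is_cont(i + 2) || !is_cont(i + 3)) return false;
      const unsigned char b1 = byte(i + 1);
      if (b == 0xF0 && b1 < 0x90) return false;   // overlong
      if (b == 0xF4 && b1 >= 0x90) return false;  // above U+10FFFF
      i += 4;
    } else {
      return false;  // stray continuation byte or invalid lead byte
    }
  }
  return true;
}

// ASCII values skip the slow path entirely.
inline bool ValidateUtf8(std::string_view s) {
  return IsAscii(s) || ValidateUtf8Slow(s);
}
```

In this sketch the ASCII check costs one extra pass over non-ASCII values, which is the trade-off to measure; whether it pays off depends on how much of the data is pure ASCII.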


bkietz left a comment

LGTM, just one nit


bkietz left a comment

Whoops

pitrou and others added 2 commits October 15, 2020 21:14
Co-authored-by: Benjamin Kietzman <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
pitrou closed this in 2510f4f on Oct 16, 2020
pitrou deleted the ARROW-10313-faster-utf8-validate branch on October 16, 2020, 07:32
kszucs pushed a commit that referenced this pull request Oct 19, 2020
This improves CSV string conversion performance by about 30%.

Closes #8470 from pitrou/ARROW-10313-faster-utf8-validate

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>