ARROW-6131: [C++] Optimize the Arrow Non-Ascii-string-validation #5038
23e090a to 0b36f0d
|
What's the point of including this in Arrow if it's not used in the codebase? |
|
I had been thinking that this would be exposed in some way to be invoked easily on If this patch is accepted we need to add something to LICENSE.txt |
|
This also doesn't have any unit tests |
I'd like to add the unit test case for |
3fb2fa2 to 01fff3a
|
Add |
|
I'm working on adding a more capable Aarch64 machine to the Ursabot cluster (48-core ThunderX) so that could help with providing more timely CI |
|
Cool. But I still don't think we should hamper the full-ASCII case (which will be very common in e.g. CSV or JSON files). |
|
I agree |
I'd like to propose a full-ASCII optimization for Arm64 later. |
01fff3a to ec8b241
|
Add ASCII validation: ~1.9x speedup for large ASCII input
Algorithm and code come from: https://github.com/cyb70289/utf8 (MIT LICENSE)
From discussion in ARROW-6131: The patch introduces a fast non-ASCII validation method, based on this new algorithm, into Apache Arrow and provides an option for validating the non-ASCII case.
Change-Id: Idf7e13b71a91d5aa85fe47dd271b2159d19fde6a
Signed-off-by: Yuqi Gu <[email protected]>
Change-Id: I306c15b60ed657a7de4167191425b0d2a6046d04
Before:
ValidateTinyAscii      6.04 ns    6.03 ns    112754472   bytes_per_second=1.54331G/s
ValidateSmallAscii     22.8 ns    22.7 ns     29583112   bytes_per_second=5.60858G/s
ValidateLargeAscii    15166 ns   15164 ns        46164   bytes_per_second=6.14226G/s
After:
ValidateTinyAscii      8.06 ns    8.06 ns     87046764   bytes_per_second=1.15497G/s
ValidateSmallAscii     14.5 ns    14.5 ns     48115118   bytes_per_second=8.77193G/s
ValidateLargeAscii     7840 ns    7839 ns        89218   bytes_per_second=11.8821G/s
Get ~1.9x speedup for Large Ascii
Change-Id: Ia98e3f836dad259e9c4711f82b6af8de90bb9d77
Signed-off-by: Yuqi Gu <[email protected]>
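For readers unfamiliar with the full-ASCII fast path discussed above, the general idea can be sketched as follows. This is a minimal illustration, not the code from this patch: the function name `IsAsciiSketch` is hypothetical, and the sketch uses plain 64-bit words rather than the SIMD instructions a tuned implementation would likely use. It ORs the input together eight bytes at a time and checks once at the end whether any byte had its high bit set.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper for illustration only (not Arrow's API).
// Returns true if every byte in [data, data + size) is below 0x80.
inline bool IsAsciiSketch(const uint8_t* data, size_t size) {
  uint64_t acc = 0;
  size_t i = 0;
  // Accumulate 8 bytes per iteration; memcpy avoids unaligned-load UB.
  for (; i + 8 <= size; i += 8) {
    uint64_t word;
    std::memcpy(&word, data + i, sizeof(word));
    acc |= word;
  }
  // Handle the remaining 0..7 bytes one at a time.
  uint8_t tail = 0;
  for (; i < size; ++i) {
    tail |= data[i];
  }
  // Any non-ASCII byte leaves a high bit set in the accumulators.
  return (acc & 0x8080808080808080ULL) == 0 && (tail & 0x80) == 0;
}
```

Because the loop body has no early exit, it does a fixed amount of work per byte, which tends to favor long inputs; very short strings may see little benefit, or a slight cost, from the extra setup.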
ec8b241 to 34ed1e7
|
Rebased the code to resolve file conflicts and also added the full-ASCII validation optimization here. |
Change-Id: Id2bcfde0d15bda004223fa476bb85fd119c417b8
|
@guyuqi this fails to build on Windows due to type truncation issues: https://github.com/apache/arrow/pull/5038/checks?check_run_id=380918799#step:4:510 |
|
It would be useful to have an ASCII validation function (versus UTF-8 / non-UTF-8), would you be able to break out the ASCII validation changes into a new PR? |
Yes, of course. |
Algorithm and code come from: https://github.com/cyb70289/utf8 (MIT LICENSE)
The patch introduces a fast non-ASCII validation method, based on this new algorithm, into Apache Arrow and provides an option for validating the non-ASCII case.
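As context for what non-ASCII validation has to check, here is a scalar reference sketch. It is not the algorithm from https://github.com/cyb70289/utf8 and not the code in this patch; the name `ValidateUtf8Scalar` is hypothetical. It only spells out the well-formedness rules (no overlong encodings, no surrogates, nothing above U+10FFFF) that any faster validator is expected to agree with.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference validator, for illustration only.
// Returns true if [s, s + n) is well-formed UTF-8.
inline bool ValidateUtf8Scalar(const uint8_t* s, size_t n) {
  size_t i = 0;
  while (i < n) {
    const uint8_t b0 = s[i];
    if (b0 < 0x80) {  // ASCII byte
      ++i;
      continue;
    }
    size_t len = 0;
    uint8_t lo = 0x80, hi = 0xBF;  // allowed range of the second byte
    if (b0 >= 0xC2 && b0 <= 0xDF) {
      len = 2;  // 0xC0 and 0xC1 would be overlong encodings
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
      len = 3;
      if (b0 == 0xE0) lo = 0xA0;  // reject overlong 3-byte forms
      if (b0 == 0xED) hi = 0x9F;  // reject UTF-16 surrogates
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
      len = 4;
      if (b0 == 0xF0) lo = 0x90;  // reject overlong 4-byte forms
      if (b0 == 0xF4) hi = 0x8F;  // reject code points above U+10FFFF
    } else {
      return false;  // stray continuation byte or invalid lead byte
    }
    if (n - i < len) return false;  // truncated sequence
    if (s[i + 1] < lo || s[i + 1] > hi) return false;
    for (size_t k = 2; k < len; ++k) {
      if ((s[i + k] & 0xC0) != 0x80) return false;
    }
    i += len;
  }
  return true;
}
```

A byte-at-a-time loop like this is easy to audit and is a convenient oracle when unit-testing an optimized validator against the same inputs.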
Arm64 platform
Benchmark
Origin:
The new algorithm:
x86 platform
Benchmark
Origin:
The new algorithm: