@guyuqi
Member

@guyuqi guyuqi commented Aug 8, 2019

Algorithm and code come from: https://github.com/cyb70289/utf8 (MIT LICENSE)

This patch introduces a fast non-ASCII validation method, based on this new algorithm, into Apache Arrow and provides an option for validating the non-ASCII case.
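For context on what UTF-8 validation has to check, here is a minimal scalar sketch based on the well-formed byte ranges in the Unicode standard. It is not the SIMD algorithm from cyb70289/utf8 and not Arrow's existing validator; the function name `ValidateUtf8Scalar` is hypothetical and used only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reference validator (illustration only, not the PR's code).
// Returns true if [data, data + size) is well-formed UTF-8.
inline bool ValidateUtf8Scalar(const uint8_t* data, size_t size) {
  size_t i = 0;
  while (i < size) {
    const uint8_t b0 = data[i];
    if (b0 < 0x80) {  // single-byte (ASCII) code point
      ++i;
      continue;
    }
    size_t len = 0;
    // Allowed range for the second byte; narrowed for some lead bytes to
    // reject overlong encodings, surrogates, and values above U+10FFFF.
    uint8_t lo = 0x80, hi = 0xBF;
    if (b0 >= 0xC2 && b0 <= 0xDF) {
      len = 2;
    } else if (b0 == 0xE0) {
      len = 3; lo = 0xA0;  // rejects overlong 3-byte forms
    } else if (b0 >= 0xE1 && b0 <= 0xEC) {
      len = 3;
    } else if (b0 == 0xED) {
      len = 3; hi = 0x9F;  // rejects UTF-16 surrogates
    } else if (b0 == 0xEE || b0 == 0xEF) {
      len = 3;
    } else if (b0 == 0xF0) {
      len = 4; lo = 0x90;  // rejects overlong 4-byte forms
    } else if (b0 >= 0xF1 && b0 <= 0xF3) {
      len = 4;
    } else if (b0 == 0xF4) {
      len = 4; hi = 0x8F;  // rejects code points above U+10FFFF
    } else {
      return false;  // 0x80..0xC1 and 0xF5..0xFF can never start a sequence
    }
    if (size - i < len) return false;  // truncated sequence
    if (data[i + 1] < lo || data[i + 1] > hi) return false;
    for (size_t k = 2; k < len; ++k) {
      if ((data[i + k] & 0xC0) != 0x80) return false;  // not a continuation
    }
    i += len;
  }
  return true;
}
```

A byte-at-a-time loop like this branches on every non-ASCII character, which is the kind of cost a vectorized validator tries to avoid.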

Arm64 platform

Benchmark

Original:

----------------------------------------------------------------
Benchmark                         Time           CPU Iterations
----------------------------------------------------------------
ValidateTinyAscii                 7 ns          7 ns  107435339   1.42978GB/s
ValidateTinyNonAscii             16 ns         16 ns   42655054   639.503MB/s
ValidateSmallAscii               29 ns         29 ns   24516945    4.4671GB/s
ValidateSmallAlmostAscii         91 ns         91 ns    7677848   1.51182GB/s
ValidateSmallNonAscii           175 ns        175 ns    4009837    731.98MB/s
ValidateLargeAscii            18821 ns      18814 ns      37194   4.95077GB/s
ValidateLargeAlmostAscii      64056 ns      64025 ns      10929   1.45533GB/s
ValidateLargeNonAscii        130321 ns     130249 ns       5375   732.909MB/s

The new algorithm:

----------------------------------------------------------------
Benchmark                         Time           CPU Iterations
----------------------------------------------------------------
ValidateTinyAscii                 6 ns          6 ns  116427650   1.59527GB/s
ValidateTinyNonAscii             17 ns         17 ns   41897276   628.046MB/s
ValidateSmallAscii              117 ns        117 ns    5964896   1113.14MB/s
ValidateSmallAlmostAscii        145 ns        145 ns    4819232    971.76MB/s
ValidateSmallNonAscii           118 ns        118 ns    5947924   1085.68MB/s
ValidateLargeAscii            82297 ns      82247 ns       8511   1.13246GB/s
ValidateLargeAlmostAscii      81145 ns      81138 ns       8627   1.14838GB/s
ValidateLargeNonAscii         81221 ns      81202 ns       8621   1.14805GB/s

x86 platform

Benchmark

Original:

----------------------------------------------------------------
Benchmark                         Time           CPU Iterations
----------------------------------------------------------------
ValidateTinyAscii                 3 ns          3 ns  180870752   3.02228GB/s
ValidateTinyNonAscii             10 ns         10 ns   73066734    1061.9MB/s
ValidateSmallAscii               11 ns         11 ns   61844060   11.2604GB/s
ValidateSmallAlmostAscii         53 ns         53 ns   10000000   2.60296GB/s
ValidateSmallNonAscii            88 ns         88 ns    7948330   1.41807GB/s
ValidateLargeAscii             4887 ns       4886 ns     140027   19.0622GB/s
ValidateLargeAlmostAscii      31479 ns      31474 ns      22286    2.9604GB/s
ValidateLargeNonAscii         64243 ns      64234 ns      10771   1.45132GB/s

The new algorithm:

----------------------------------------------------------------
Benchmark                         Time           CPU Iterations
----------------------------------------------------------------
ValidateTinyAscii                 4 ns          4 ns  137163202   2.29073GB/s
ValidateTinyNonAscii             10 ns         10 ns   71646804   1068.54MB/s
ValidateSmallAscii               46 ns         46 ns   16725701   2.80417GB/s
ValidateSmallAlmostAscii         63 ns         63 ns   11153069   2.17771GB/s
ValidateSmallNonAscii            55 ns         55 ns   12845765   2.25957GB/s
ValidateLargeAscii            28394 ns      28390 ns      24765   3.28084GB/s
ValidateLargeAlmostAscii      27530 ns      27526 ns      21613    3.3851GB/s
ValidateLargeNonAscii         27492 ns      27488 ns      24760   3.39145GB/s
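To make the trade-off in these tables easier to read, the sketch below computes the ratio of original time to new time for the "Large" rows. The helper name `Speedup` and the constant names are hypothetical; the times are copied from the tables above.

```cpp
// Speedup of the new algorithm relative to the original: values above 1.0
// mean the new code is faster, values below 1.0 mean it is slower.
constexpr double Speedup(double original_ns, double new_ns) {
  return original_ns / new_ns;
}

// Wall-clock times (ns) copied from the "Large" rows of the tables above.
constexpr double kArm64LargeAscii = Speedup(18821, 82297);      // roughly 0.23
constexpr double kArm64LargeNonAscii = Speedup(130321, 81221);  // roughly 1.6
constexpr double kX86LargeAscii = Speedup(4887, 28394);         // roughly 0.17
constexpr double kX86LargeNonAscii = Speedup(64243, 27492);     // roughly 2.3
```

By these numbers the new algorithm is about 1.6x (Arm64) and 2.3x (x86) faster on large non-ASCII input, but roughly 4x to 6x slower on large pure-ASCII input.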

@guyuqi guyuqi force-pushed the utf8-validation-ARROW-6131 branch from 23e090a to 0b36f0d Compare August 8, 2019 08:31
@pitrou
Member

pitrou commented Aug 12, 2019

What's the point of including this in Arrow if it's not used in the codebase?

@wesm
Member

wesm commented Aug 13, 2019

I had been thinking that this would be exposed in some way to be invoked easily on BinaryArray.

If this patch is accepted we need to add something to LICENSE.txt

@wesm
Member

wesm commented Aug 13, 2019

This also doesn't have any unit tests

@guyuqi
Member Author

guyuqi commented Aug 16, 2019

> This also doesn't have any unit tests

I'll add a unit test case for ValidateNonAscii.

@guyuqi guyuqi force-pushed the utf8-validation-ARROW-6131 branch 2 times, most recently from 3fb2fa2 to 01fff3a Compare August 28, 2019 03:01
@guyuqi
Member Author

guyuqi commented Aug 28, 2019

Added ValidateNonAscii to the unit tests in arrow-utility-test.
Arm64 unit test results:

Test project /home/builder/arrow/cpp/bld
      Start  2: arrow-array-test
      Start 52: arrow-io-compressed-test
      Start 35: arrow-ipc-read-write-test
      Start 26: arrow-compute-aggregate-test
 1/55 Test #26: arrow-compute-aggregate-test .......   Passed    0.90 sec
      Start 63: arrow-logging-test
 2/55 Test #35: arrow-ipc-read-write-test ..........   Passed    1.51 sec
 3/55 Test #63: arrow-logging-test .................   Passed    0.43 sec
      Start 31: arrow-compute-filter-test
      Start 66: arrow-thread-pool-test
 4/55 Test #66: arrow-thread-pool-test .............   Passed    0.47 sec
 5/55 Test #31: arrow-compute-filter-test ..........   Passed    0.47 sec
 6/55 Test #52: arrow-io-compressed-test ...........   Passed    2.22 sec
      Start 65: arrow-task-group-test
      Start 28: arrow-compute-compare-test
      Start 64: arrow-rle-encoding-test
 7/55 Test #65: arrow-task-group-test ..............   Passed    0.28 sec
      Start 55: arrow-io-memory-test
 8/55 Test #64: arrow-rle-encoding-test ............   Passed    0.38 sec
 9/55 Test #28: arrow-compute-compare-test .........   Passed    0.38 sec
      Start 24: arrow-compute-hash-test
      Start 30: arrow-compute-take-test
10/55 Test #24: arrow-compute-hash-test ............   Passed    0.18 sec
11/55 Test #55: arrow-io-memory-test ...............   Passed    0.38 sec
      Start 53: arrow-io-file-test
      Start 44: arrow-csv-parser-test
12/55 Test #30: arrow-compute-take-test ............   Passed    0.28 sec
13/55 Test #44: arrow-csv-parser-test ..............   Passed    0.20 sec
      Start 61: arrow-compression-test
      Start 56: arrow-io-readahead-test
14/55 Test #53: arrow-io-file-test .................   Passed    0.31 sec
15/55 Test #61: arrow-compression-test .............   Passed    0.20 sec
      Start  4: arrow-extension_type-test
      Start 43: arrow-csv-converter-test
16/55 Test #56: arrow-io-readahead-test ............   Passed    0.21 sec
17/55 Test #43: arrow-csv-converter-test ...........   Passed    0.07 sec
18/55 Test  #4: arrow-extension_type-test ..........   Passed    0.18 sec
      Start  1: arrow-allocator-test
      Start  3: arrow-buffer-test
      Start  8: arrow-result-test
19/55 Test  #8: arrow-result-test ..................   Passed    0.08 sec
20/55 Test  #3: arrow-buffer-test ..................   Passed    0.08 sec
21/55 Test  #1: arrow-allocator-test ...............   Passed    0.18 sec
      Start 41: arrow-csv-chunker-test
      Start 36: arrow-ipc-json-simple-test
      Start 38: arrow-json-integration-test
22/55 Test #38: arrow-json-integration-test ........   Passed    0.08 sec
23/55 Test #36: arrow-ipc-json-simple-test .........   Passed    0.08 sec
24/55 Test #41: arrow-csv-chunker-test .............   Passed    0.18 sec
      Start 18: arrow-compute-test
      Start 25: arrow-compute-util-internal-test
      Start 34: arrow-feather-test
25/55 Test #34: arrow-feather-test .................   Passed    0.09 sec
26/55 Test #25: arrow-compute-util-internal-test ...   Passed    0.09 sec
27/55 Test #18: arrow-compute-test .................   Passed    0.19 sec
      Start 22: arrow-compute-boolean-test
      Start 60: arrow-bit-util-test
      Start  7: arrow-public-api-test
28/55 Test  #7: arrow-public-api-test ..............   Passed    0.08 sec
29/55 Test #60: arrow-bit-util-test ................   Passed    0.08 sec
30/55 Test #22: arrow-compute-boolean-test .........   Passed    0.18 sec
      Start 11: arrow-stl-test
      Start 49: arrow-json-test
      Start 10: arrow-status-test
31/55 Test #10: arrow-status-test ..................   Passed    0.08 sec
32/55 Test #49: arrow-json-test ....................   Passed    0.10 sec
33/55 Test #11: arrow-stl-test .....................   Passed    0.20 sec
      Start 20: arrow-compute-operations-test
      Start 48: arrow-localfs-test
      Start 59: arrow-utility-test
34/55 Test #48: arrow-localfs-test .................   Passed    0.10 sec
35/55 Test #20: arrow-compute-operations-test ......   Passed    0.20 sec
      Start 13: arrow-table-test
      Start 16: arrow-sparse_tensor-test
36/55 Test #59: arrow-utility-test .................   Passed    0.21 sec
37/55 Test #16: arrow-sparse_tensor-test ...........   Passed    0.08 sec
38/55 Test #13: arrow-table-test ...................   Passed    0.18 sec
      Start 14: arrow-table_builder-test
      Start  6: arrow-pretty_print-test
      Start 47: arrow-filesystem-test
39/55 Test #47: arrow-filesystem-test ..............   Passed    0.08 sec
40/55 Test  #6: arrow-pretty_print-test ............   Passed    0.08 sec
41/55 Test #14: arrow-table_builder-test ...........   Passed    0.18 sec
      Start 62: arrow-decimal-test
      Start 78: arrow-dataset-file_test
      Start  5: arrow-memory_pool-test
42/55 Test  #5: arrow-memory_pool-test .............   Passed    0.08 sec
43/55 Test #78: arrow-dataset-file_test ............   Passed    0.08 sec
44/55 Test #62: arrow-decimal-test .................   Passed    0.18 sec
      Start 23: arrow-compute-cast-test
      Start 42: arrow-csv-column-builder-test
      Start 37: arrow-ipc-json-test
45/55 Test #37: arrow-ipc-json-test ................   Passed    0.08 sec
46/55 Test #42: arrow-csv-column-builder-test ......   Passed    0.09 sec
47/55 Test #23: arrow-compute-cast-test ............   Passed    0.19 sec
      Start 54: arrow-io-hdfs-test
      Start 40: arrow-concatenate-test
      Start 12: arrow-type-test
48/55 Test #12: arrow-type-test ....................   Passed    0.08 sec
49/55 Test #40: arrow-concatenate-test .............   Passed    0.08 sec
50/55 Test #54: arrow-io-hdfs-test .................   Passed    0.18 sec
      Start 51: arrow-io-buffered-test
      Start  9: arrow-scalar-test
      Start 15: arrow-tensor-test
51/55 Test #15: arrow-tensor-test ..................   Passed    0.08 sec
52/55 Test  #9: arrow-scalar-test ..................   Passed    0.08 sec
53/55 Test #51: arrow-io-buffered-test .............   Passed    0.19 sec
      Start 19: arrow-compute-expression-test
54/55 Test #19: arrow-compute-expression-test ......   Passed    0.10 sec
55/55 Test  #2: arrow-array-test ...................   Passed    6.17 sec

100% tests passed, 0 tests failed out of 55

Label Time Summary:
arrow-tests      =  19.49 sec*proc (54 tests)
arrow_dataset    =   0.08 sec*proc (1 test)
unittest         =  19.57 sec*proc (55 tests)

Total Test time (real) =   6.18 sec
[100%] Built target unittest

@guyuqi
Member Author

guyuqi commented Sep 18, 2019

@hrw is trying to add Arm support to the Arrow CI matrix in PR #5024.

@wesm
Member

wesm commented Sep 18, 2019

I'm working on adding a more capable Aarch64 machine to the Ursabot cluster (48-core ThunderX), which could help provide more timely CI.

@pitrou
Member

pitrou commented Sep 18, 2019

Cool. But I still don't think we should hamper the full-ASCII case (which will be very common in e.g. CSV or JSON files).

@wesm
Member

wesm commented Sep 18, 2019

I agree

@guyuqi
Member Author

guyuqi commented Sep 19, 2019

> Cool. But I still don't think we should hamper the full-ASCII case (which will be very common in e.g. CSV or JSON files).

I'd like to propose a full-ASCII optimization for Arm64 later.

@guyuqi
Member Author

guyuqi commented Oct 11, 2019

Added ASCII validation:
Before:

ValidateTinyAscii 6.04 ns 6.03 ns 112754472 bytes_per_second=1.54331G/s
ValidateSmallAscii 22.8 ns 22.7 ns 29583112 bytes_per_second=5.60858G/s
ValidateLargeAscii 15166 ns 15164 ns 46164 bytes_per_second=6.14226G/s

After:

ValidateTinyAscii 8.06 ns 8.06 ns 87046764 bytes_per_second=1.15497G/s
ValidateSmallAscii 14.5 ns 14.5 ns 48115118 bytes_per_second=8.77193G/s
ValidateLargeAscii 7840 ns 7839 ns 89218 bytes_per_second=11.8821G/s

This gives a ~1.9x speedup for ValidateLargeAscii (15166 ns down to 7840 ns).
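As an illustration of why a dedicated ASCII check can be fast, one common technique is to OR many bytes together and test the high bit once at the end, processing eight bytes per step. The sketch below shows that idea in portable C++. It is an assumption-laden example, not the code added in this PR, and the name `IsAsciiSketch` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch (not the PR's implementation).
// Returns true if every byte in [data, data + size) is below 0x80.
inline bool IsAsciiSketch(const uint8_t* data, size_t size) {
  constexpr uint64_t kHighBits = 0x8080808080808080ULL;
  uint64_t acc = 0;
  size_t i = 0;
  // Accumulate 8 bytes per iteration; memcpy keeps unaligned loads well
  // defined, and compilers typically turn it into a single load.
  for (; i + 8 <= size; i += 8) {
    uint64_t word;
    std::memcpy(&word, data + i, sizeof(word));
    acc |= word;
  }
  // Handle the remaining 0..7 bytes one at a time.
  uint8_t tail = 0;
  for (; i < size; ++i) {
    tail |= data[i];
  }
  // Any non-ASCII byte leaves its high bit set in one of the accumulators.
  return (acc & kHighBits) == 0 && (tail & 0x80) == 0;
}
```

Because the loop has no data-dependent branch, its cost per iteration is fixed, but the setup overhead can outweigh the gain on very short inputs. That is consistent with the Tiny case above, which did not speed up.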

guyuqi added 3 commits January 9, 2020 07:05
Algorithm and code come from: https://github.com/cyb70289/utf8 (MIT LICENSE)

From discussion in ARROW-6131:
This patch introduces a fast non-ASCII validation method, based on
this new algorithm, into Apache Arrow and provides an option for validating the non-ASCII case.

Change-Id: Idf7e13b71a91d5aa85fe47dd271b2159d19fde6a
Signed-off-by: Yuqi Gu <[email protected]>
Change-Id: I306c15b60ed657a7de4167191425b0d2a6046d04
Before:
ValidateTinyAscii 6.04 ns 6.03 ns 112754472 bytes_per_second=1.54331G/s
ValidateSmallAscii 22.8 ns 22.7 ns 29583112 bytes_per_second=5.60858G/s
ValidateLargeAscii 15166 ns 15164 ns 46164 bytes_per_second=6.14226G/s

After:
ValidateTinyAscii 8.06 ns 8.06 ns 87046764 bytes_per_second=1.15497G/s
ValidateSmallAscii 14.5 ns 14.5 ns 48115118 bytes_per_second=8.77193G/s
ValidateLargeAscii 7840 ns 7839 ns 89218 bytes_per_second=11.8821G/s

Get ~ 1.9x speedup for Large Ascii

Change-Id: Ia98e3f836dad259e9c4711f82b6af8de90bb9d77
Signed-off-by: Yuqi Gu <[email protected]>
@guyuqi guyuqi force-pushed the utf8-validation-ARROW-6131 branch from ec8b241 to 34ed1e7 Compare January 9, 2020 07:08
@guyuqi
Member Author

guyuqi commented Jan 9, 2020

Rebased the code to resolve file conflicts and also added the full-ASCII validation optimization here.

Change-Id: Id2bcfde0d15bda004223fa476bb85fd119c417b8
@fsaintjacques
Contributor

@guyuqi this fails to build on Windows due to type truncation issues: https://github.com/apache/arrow/pull/5038/checks?check_run_id=380918799#step:4:510

@wesm
Member

wesm commented Apr 29, 2020

It would be useful to have an ASCII validation function (versus UTF-8 / non-UTF-8), would you be able to break out the ASCII validation changes into a new PR?

@guyuqi
Member Author

guyuqi commented Apr 30, 2020

> It would be useful to have an ASCII validation function (versus UTF-8 / non-UTF-8), would you be able to break out the ASCII validation changes into a new PR?

Yes, of course.
