Conversation

@pitrou
Member

@pitrou pitrou commented Oct 1, 2020

Use a lookup table to emulate PEXT 5 bits at a time.
Remove the slow scalar path.
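
For readers unfamiliar with the technique, here is a minimal sketch of a table-driven PEXT emulation that consumes 5 mask bits per lookup. It is not the code from this PR: the names (`sketch::ExtractBitsLookup`, `PextTable`, `ExtractBitsSlow`) are hypothetical, and building the table lazily at first use is an assumption made to keep the example self-contained.

```cpp
#include <cstdint>

namespace sketch {

constexpr int kTableBits = 5;                 // mask/value bits consumed per lookup
constexpr int kTableSize = 1 << kTableBits;   // 32 possible 5-bit patterns
constexpr uint64_t kChunkMask = kTableSize - 1;

// Bit-by-bit reference: gather the bits of `value` found at the positions set
// in `mask` into the low bits of the result. Slow, but easy to verify.
inline uint64_t ExtractBitsSlow(uint64_t value, uint64_t mask) {
  uint64_t out = 0;
  int out_pos = 0;
  for (int i = 0; i < 64; ++i) {
    if ((mask >> i) & 1) {
      out |= ((value >> i) & 1) << out_pos;
      ++out_pos;
    }
  }
  return out;
}

// Precomputed answers for every (5-bit mask, 5-bit value) pair.
struct PextTable {
  uint8_t extracted[kTableSize][kTableSize];  // indexed as [mask][value]
  uint8_t mask_popcount[kTableSize];          // number of set bits in each mask

  PextTable() {
    for (int mask = 0; mask < kTableSize; ++mask) {
      int bits = 0;
      for (int i = 0; i < kTableBits; ++i) bits += (mask >> i) & 1;
      mask_popcount[mask] = static_cast<uint8_t>(bits);
      for (int value = 0; value < kTableSize; ++value) {
        extracted[mask][value] = static_cast<uint8_t>(ExtractBitsSlow(value, mask));
      }
    }
  }
};

// Emulated PEXT: process the mask 5 bits at a time, appending each chunk's
// extracted bits above the bits already produced.
inline uint64_t ExtractBitsLookup(uint64_t bitmap, uint64_t select_bitmap) {
  static const PextTable table;  // hypothetical: filled once, on first call
  uint64_t result = 0;
  int result_len = 0;
  while (select_bitmap != 0) {
    const uint64_t mask = select_bitmap & kChunkMask;
    const uint64_t value = bitmap & kChunkMask;
    result |= static_cast<uint64_t>(table.extracted[mask][value]) << result_len;
    result_len += table.mask_popcount[mask];
    bitmap >>= kTableBits;
    select_bitmap >>= kTableBits;
  }
  return result;
}

}  // namespace sketch
```

With this sketch, `sketch::ExtractBitsLookup(0xABCD, 0xFF00)` yields `0xAB`. A 64-bit mask needs at most 13 lookups, and the loop exits early once the remaining mask bits are all zero.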

@pitrou
Member Author

pitrou commented Oct 1, 2020

See the JIRA issue for benchmarks. It would be nice to have benchmarks on other machines. @emkornfield

@pitrou
Member Author

pitrou commented Oct 1, 2020

I also notice that we call internal::GreaterThanBitmap for each 64 levels, which always goes through the dynamic dispatch indirection (meaning two function calls, I think). We could call GreaterThanBitmapImpl but that requires compiling a specialized version of level_conversion_inc.h for AVX2, otherwise we lose performance.
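
To illustrate the indirection being discussed, here is a rough sketch of a dispatched entry point next to the direct implementation. The signatures and the resolver (`ResolveGreaterThan`) are assumptions for illustration, not the actual Arrow code; a real resolver would inspect CPU features and might return an AVX2 build of the same function.

```cpp
#include <cstdint>

namespace sketch {

// Direct implementation: bit i of the result is set iff levels[i] > rhs.
// Assumes num_levels <= 64 so that every level maps to one result bit.
inline uint64_t GreaterThanBitmapImpl(const int16_t* levels, int64_t num_levels,
                                      int16_t rhs) {
  uint64_t mask = 0;
  for (int64_t i = 0; i < num_levels; ++i) {
    mask |= static_cast<uint64_t>(levels[i] > rhs) << i;
  }
  return mask;
}

using GreaterThanFn = uint64_t (*)(const int16_t*, int64_t, int16_t);

// Hypothetical resolver: here it always picks the portable implementation.
inline GreaterThanFn ResolveGreaterThan() { return &GreaterThanBitmapImpl; }

// Dispatched entry point: one call into this wrapper, then a second, indirect
// call through the function pointer chosen at first use.
inline uint64_t GreaterThanBitmap(const int16_t* levels, int64_t num_levels,
                                  int16_t rhs) {
  static const GreaterThanFn fn = ResolveGreaterThan();
  return fn(levels, num_levels, rhs);
}

}  // namespace sketch
```

Calling `GreaterThanBitmapImpl` directly skips the pointer hop and lets the compiler inline it, but the caller's translation unit then has to be compiled with the matching instruction set to keep the vectorized speed, which is the trade-off described above.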

@emkornfield
Contributor

@pitrou I'm devoting most of my bandwidth this week to finishing the Parquet read component; is it ok if I take a closer look next week (hopefully with enough time before an RC is cut)?

@emkornfield
Contributor

> I also notice that we call internal::GreaterThanBitmap for each 64 levels, which always goes through the dynamic dispatch indirection (meaning two function calls, I think). We could call GreaterThanBitmapImpl but that requires compiling a specialized version of level_conversion_inc.h for AVX2, otherwise we lose performance.

Yeah, it isn't ideal. There may be a better factoring in there, but it seemed hard to do while keeping the BMI2-specific instructions isolated. I guess if this isn't too much slower than BMI2 on Intel we could potentially collapse everything, but I would not expect that to be the case.

@pitrou
Member Author

pitrou commented Oct 2, 2020

> is it ok if I take a closer look next week

No problem.

> but I would not expect that to be the case.

Right. The emulation is probably much slower.

@pitrou
Member Author

pitrou commented Oct 6, 2020

Updated benchmarks on AMD Ryzen:

                         benchmark         baseline         contender  change %                                                                                                                                           counters
0     BM_ReadListOfStructColumn/50  392.881 MiB/sec   564.029 MiB/sec    43.562     {'run_name': 'BM_ReadListOfStructColumn/50', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 23}
10            BM_ReadListColumn/50  485.560 MiB/sec   675.023 MiB/sec    39.019             {'run_name': 'BM_ReadListColumn/50', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 42}
7     BM_ReadStructOfListColumn/50  341.782 MiB/sec   462.097 MiB/sec    35.202     {'run_name': 'BM_ReadStructOfListColumn/50', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 20}
3       BM_ReadListOfListColumn/50  447.657 MiB/sec   566.594 MiB/sec    26.569       {'run_name': 'BM_ReadListOfListColumn/50', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 39}
23            BM_ReadListColumn/99    1.168 GiB/sec     1.365 GiB/sec    16.883            {'run_name': 'BM_ReadListColumn/99', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 102}
4     BM_ReadListOfStructColumn/99  975.429 MiB/sec     1.095 GiB/sec    14.925     {'run_name': 'BM_ReadListOfStructColumn/99', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 56}
9     BM_ReadStructOfListColumn/99  798.058 MiB/sec   896.789 MiB/sec    12.371     {'run_name': 'BM_ReadStructOfListColumn/99', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 46}
22      BM_ReadListOfListColumn/99    1.050 GiB/sec     1.168 GiB/sec    11.159       {'run_name': 'BM_ReadListOfListColumn/99', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 94}
1      BM_ReadListOfStructColumn/1  654.576 MiB/sec   725.676 MiB/sec    10.862      {'run_name': 'BM_ReadListOfStructColumn/1', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 38}
19             BM_ReadListColumn/1  919.949 MiB/sec  1005.740 MiB/sec     9.326              {'run_name': 'BM_ReadListColumn/1', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 81}
11     BM_ReadListOfStructColumn/0  835.259 MiB/sec   908.920 MiB/sec     8.819      {'run_name': 'BM_ReadListOfStructColumn/0', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 49}
17     BM_ReadStructOfListColumn/1  605.129 MiB/sec   649.556 MiB/sec     7.342      {'run_name': 'BM_ReadStructOfListColumn/1', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 35}
8              BM_ReadListColumn/0    1.067 GiB/sec     1.145 GiB/sec     7.334              {'run_name': 'BM_ReadListColumn/0', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 92}
5      BM_ReadStructOfListColumn/0  700.157 MiB/sec   740.414 MiB/sec     5.750      {'run_name': 'BM_ReadStructOfListColumn/0', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 41}
6        BM_ReadListOfListColumn/0  929.109 MiB/sec   966.896 MiB/sec     4.067        {'run_name': 'BM_ReadListOfListColumn/0', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 82}
14  BM_ReadStructOfStructColumn/50    1.537 GiB/sec     1.595 GiB/sec     3.772   {'run_name': 'BM_ReadStructOfStructColumn/50', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 45}
13          BM_ReadStructColumn/99    4.211 GiB/sec     4.330 GiB/sec     2.835          {'run_name': 'BM_ReadStructColumn/99', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 252}
20          BM_ReadStructColumn/50    1.155 GiB/sec     1.187 GiB/sec     2.755           {'run_name': 'BM_ReadStructColumn/50', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 69}
15   BM_ReadStructOfStructColumn/1    1.802 GiB/sec     1.849 GiB/sec     2.566    {'run_name': 'BM_ReadStructOfStructColumn/1', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 53}
12           BM_ReadStructColumn/1    1.798 GiB/sec     1.843 GiB/sec     2.521           {'run_name': 'BM_ReadStructColumn/1', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 110}
2   BM_ReadStructOfStructColumn/99    3.464 GiB/sec     3.530 GiB/sec     1.898  {'run_name': 'BM_ReadStructOfStructColumn/99', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 100}
16   BM_ReadStructOfStructColumn/0    6.021 GiB/sec     6.065 GiB/sec     0.724   {'run_name': 'BM_ReadStructOfStructColumn/0', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 179}
21           BM_ReadStructColumn/0    6.821 GiB/sec     6.812 GiB/sec    -0.137           {'run_name': 'BM_ReadStructColumn/0', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 410}
18       BM_ReadListOfListColumn/1  805.644 MiB/sec   801.480 MiB/sec    -0.517        {'run_name': 'BM_ReadListOfListColumn/1', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 70}
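
As a reading aid for the table, the "change %" column appears to be the relative throughput difference between contender and baseline. The helper names below are hypothetical and this is a sketch of that arithmetic, not the benchmark tool's code. Rows that mix MiB/sec and GiB/sec need a unit conversion first, and recomputing from the rounded figures shown can differ slightly from the reported value.

```cpp
#include <cmath>

namespace sketch {

// Convert GiB/sec to MiB/sec so both columns use the same unit.
inline double GiBToMiB(double gib_per_sec) { return gib_per_sec * 1024.0; }

// Relative change of the contender versus the baseline, in percent.
// Positive means the contender is faster.
inline double ChangePercent(double baseline, double contender) {
  return (contender - baseline) / baseline * 100.0;
}

}  // namespace sketch
```

For BM_ReadListOfStructColumn/50, `sketch::ChangePercent(392.881, 564.029)` comes out at roughly 43.56, matching the first row.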

@pitrou pitrou force-pushed the ARROW-10058-faster-sw-pext branch from cd01f19 to 482797c October 6, 2020 18:19
@emkornfield
Contributor

Sorry, some personal issues came up. I hope to have time tonight to review this and other Parquet-related CLs.

@pitrou
Member Author

pitrou commented Oct 6, 2020

For the record, if I profile BM_ReadStructOfListColumn/50, I get the following hot spots (in cycles spent):

  • ~19% in DefRepLevelsToListInfo
  • ~15% in DelimitRecords
  • ~11% in BitRunReader::NextRun
  • ~10% in SpacedExpand
  • ~6% in DictionaryConverter<int>::Copy
  • ~5% in PathWriteContext::AppendRepLevels

And ExtractBitsSoftware (the PEXT emulation) only takes ~1.1%, which seems good enough for now.

@emkornfield
Contributor

+1. Thanks.

@pitrou pitrou deleted the ARROW-10058-faster-sw-pext branch October 7, 2020 09:13