Speed up oriented bounding box (OBB) overlap test #21526

SeanCurtis-TRI · 2024-06-05T17:09:28Z

Is your feature request related to a problem? Please describe.
Because we advocate using compliant hydroelastics as the preferred mode of contact, the tet-tet pair culling is incredibly important. We use an oriented bounding box (OBB) bounding volume hierarchy (BVH) to "quickly" cull as many tet-pair candidates as possible. As such, we test pairs of OBBs for overlapping a lot.

Even if we make no algorithmic changes to how we cull tet pairs, simply speeding up the overlapping test should yield dividends.

Describe the solution you'd like
The OBB overlap test lends itself well to SIMD performance improvements. (Its search for separating planes naturally partitions into parallel lumps of computation suitable for SIMD). We should offer up a SIMD implementation.

The SIMD implementation needs the same kind of fallback that the rotation matrix multiplication has (i.e., support for when run on a system where SIMD is not available).

In addition, we need some supporting infrastructure:

We need some benchmarking so that we can track performance changes and prevent regression (or, at the very least, feel good about changes).
The unit tests on the OBB overlap code are currently implicit in the test for a different compilation unit. They need to be brought into their own dedicated unit tests.
- Move boxes_overlap tests from obb_test to boxes_overlap_test #21546

Describe alternatives you've considered
We could consider different bounding volume types (spheres, etc.) and different traversal algorithms (parallel, etc.). However, those are much larger endeavors with uncertain return. Generally, OBB overlap is a valuable operation and making it faster is an unfettered good. Even if we end up accelerating compliant hydro contact using a different bounding volume type, this function should still be faster.

jwnimmer-tri · 2024-06-05T17:12:15Z

FYI my proof of concept branch is here: https://github.com/jwnimmer-tri/drake/commits/boxes-overlap/

jwnimmer-tri · 2024-06-05T18:26:07Z

Oh, one other thing I forgot to mention: if we really don't like the how this looks using raw intrinsics, conceivably we could bring in https://github.com/google/highway to write it in a more normal C++ style that would compile down to efficient intrinsics (even better, with cpu-based dispatch to the best routine for that flavor). Hopefully we can skate by without this time, but as we start writing more SIMD, I think it's probable we'll need to add highway eventually.

jwnimmer-tri · 2024-06-12T21:55:46Z

@SeanCurtis-TRI if possible, it would be nice to cave off the "make a googlebench" part of this to @rpoyner-tri. (Even if you need to massage the space of input OBB domains afterwards.) Maybe you two can talk offline about it.

SeanCurtis-TRI · 2024-06-13T13:31:30Z

I'm fully in favor of the carving.

need to massage the space of input oBB domains afterwards

Not sure what you mean exactly.

jwnimmer-tri · 2024-06-13T13:48:46Z

Not sure what you mean exactly.

Since the function has a couple of early-exit paths, the average performance will be a function of the distribution (and predictability) of input OBBs. Part of making a good microbenchmark will be feeding in OBBs that have similar statistics to the ones seen in a real application.

SeanCurtis-TRI · 2024-06-13T14:35:02Z

I expect the easiest way to do that would be to take one of the representative scenarios, add some counters to the boxes overlap test and run things with the counters. Probably need to run across multiple manipulands.

jwnimmer-tri · 2024-06-13T15:28:47Z

Great. A nice starting scenario for that might be the anzu#13014 replay, which Rico is already going to re-profile soon. We can run it with some OBB case 1 / case 2 / case 3 static counters.

jwnimmer-tri · 2024-06-13T15:29:48Z

(But in any case, even a benchmark with only vanilla input statistics would still be a nice benefit. We can always dial in better statistics over time.)

rpoyner-tri · 2024-06-20T16:05:29Z

My thought is to start with focused benchmarks for each of the cases, then collect statistics, then implement some statistical-mix benchmarks.

As I understand things so far, looks like I'll want to shuffle some common setup code out of boxes_overlap_test for use in benchmarks.

rpoyner-tri · 2024-06-20T23:18:24Z

Here are some branch execution count statistics from the anzu profiling target mentioned above. The statistics collection hackery is shown at #21598.

A axis separation returns: [3944039, 8807791, 14267170]
B axis separation returns: [1487990, 2207006, 3686886]
Cross product axis separation returns [2137676, 1165955, 447086, 683011, 343790, 145649, 120081, 73198, 21485]
Overlap returns [76206736]

jwnimmer-tri · 2024-06-20T23:23:37Z

Good stuff! To help us grok it, could you baseline it to percentages? (I think that means summing to get the total calls, then diving that across the numbers?)

rpoyner-tri · 2024-06-20T23:30:13Z

Here's a google sheet with the same data, prettied up a bit.: https://docs.google.com/spreadsheets/d/1TMtvXn3JanMObBe3tXTyTXfe3FcGoWcrhA5CApF9T9w/edit?usp=sharing

rpoyner-tri · 2024-06-20T23:32:05Z

Surprising (at least to me) that 65% of the queries are overlap (full execution of the whole function). Does that match others' intuition?

sherm1 · 2024-06-20T23:32:25Z

That's an interesting result -- looks like broad phase is doing a good job so most of the returns are overlaps. I wonder if that means it would be faster to look for overlap rather than looking for separation.

jwnimmer-tri · 2024-06-20T23:41:22Z

That's one possibility, yes. Is there a fewer-flops math to check for overlap instead of non-overlap?

Another (SIMD) tactic would be to use branchless -- unconditionally compute all 15 overlap queries into a mask, and return it directly (casting mask int to C++ bool). That might avoid some branch mispredict penalties.

rpoyner-tri · 2024-06-21T15:34:36Z

Memorialized the anzu branch use to measure here: anzu#13099

SeanCurtis-TRI · 2024-06-24T13:33:35Z

There are two pieces of information necessary to gain a greater understanding of the value of the broadphase culling:

You need some sense of how many possible obb-obb pairs are not being evaluated. I.e., for every "these obbs are separated" result you get, you'd want to know the size of the sub-tree of the bounding volume traversal tree lies underneath it.
You'd also like to know what fraction of the obb-obb pairs marked as "overlapping" actually produce overlapping "primitives". For example it would be easy to say true far too much, leading to unhelpful and expensive primitive-primitive tests.

That said, the efficacy of the BVH code is not at stake here. Merely the question of given what the code does can we get it to do it faster. The question of whether or not the BVH code is doing a "good" thing is for a later date.

jwnimmer-tri · 2024-06-24T19:37:52Z

FYI @rpoyner-tri my hack branch was boxes-overlap.

SeanCurtis-TRI · 2024-06-25T18:24:44Z

Fun update.

I ran the overlap benchmark in #21596 three times. Once against the BoxesOverlap() in master, once against the intrinsics-based implementation in boxes-overlap, and once against the highway version of the same (located here). The benchmark results are down below, but here's a summary of what jumped out at me:

Master -> SIMD reduces the cost to almost 1/3 of the original. Not surprising as we are taking the 15 calculations and doing it in five bands of three.
intrinsics -> highway reduces the cost a bit more.
- Highway isn't quite as fast in a couple of cases -- the absolute fastest cases for intrinsics (12.5 ns becomes 13.6 ns).
- However, in all other cases, it takes only 3/4 of the time of the intrinsics. Yowza. That means the ratio from master to highway is almost 1/4.

Master version

------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU    Allocs   Iterations
------------------------------------------------------------------------------------------------
BoxesOverlapBenchmark/ParallelContainedCase        129 ns          129 ns     0.125      4968178
BoxesOverlapBenchmark/PosedCase/0/0/-1            28.9 ns         28.9 ns    0.1875     24954658
BoxesOverlapBenchmark/PosedCase/1/0/-1             113 ns          113 ns    0.1875      6287359
BoxesOverlapBenchmark/PosedCase/0/1/-1            32.9 ns         32.9 ns    0.1875     21461271
BoxesOverlapBenchmark/PosedCase/1/1/-1             119 ns          119 ns    0.1875      6101520
BoxesOverlapBenchmark/PosedCase/0/2/-1            35.4 ns         35.4 ns    0.1875     19722903
BoxesOverlapBenchmark/PosedCase/1/2/-1             111 ns          111 ns    0.1875      6349643
BoxesOverlapBenchmark/PosedCase/0/0/0             52.4 ns         52.4 ns    0.1875     13089050
BoxesOverlapBenchmark/PosedCase/1/0/0              112 ns          112 ns    0.1875      6309225
BoxesOverlapBenchmark/PosedCase/0/1/0             68.7 ns         68.6 ns    0.1875     10190161
BoxesOverlapBenchmark/PosedCase/1/1/0              113 ns          113 ns    0.1875      6130558
BoxesOverlapBenchmark/PosedCase/0/2/0             83.3 ns         83.3 ns    0.1875      8344837
BoxesOverlapBenchmark/PosedCase/1/2/0              110 ns          110 ns    0.1875      6341236
BoxesOverlapBenchmark/PosedCase/0/0/1             66.7 ns         66.7 ns    0.1875     10517465
BoxesOverlapBenchmark/PosedCase/1/0/1              110 ns          110 ns    0.1875      6322427
BoxesOverlapBenchmark/PosedCase/0/1/1             81.4 ns         81.4 ns    0.1875      8648747
BoxesOverlapBenchmark/PosedCase/1/1/1              111 ns          111 ns    0.1875      6329373
BoxesOverlapBenchmark/PosedCase/0/2/1             95.7 ns         95.7 ns    0.1875      7211109
BoxesOverlapBenchmark/PosedCase/1/2/1              112 ns          112 ns    0.1875      6359837
BoxesOverlapBenchmark/PosedCase/0/0/2             81.9 ns         81.9 ns    0.1875      8431608
BoxesOverlapBenchmark/PosedCase/1/0/2              112 ns          112 ns    0.1875      6409212
BoxesOverlapBenchmark/PosedCase/0/1/2             97.9 ns         97.9 ns    0.1875      7112036
BoxesOverlapBenchmark/PosedCase/1/1/2              113 ns          113 ns    0.1875      6076795
BoxesOverlapBenchmark/PosedCase/0/2/2              113 ns          113 ns    0.1875      6211708
BoxesOverlapBenchmark/PosedCase/1/2/2              113 ns          113 ns    0.1875      6204990

Intrinsics

------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU    Allocs   Iterations
------------------------------------------------------------------------------------------------
BoxesOverlapBenchmark/ParallelContainedCase       61.4 ns         61.4 ns     0.125      9166908
BoxesOverlapBenchmark/PosedCase/0/0/-1            12.5 ns         12.5 ns    0.1875     53171712
BoxesOverlapBenchmark/PosedCase/1/0/-1            40.8 ns         40.8 ns    0.1875     17289468
BoxesOverlapBenchmark/PosedCase/0/1/-1            12.3 ns         12.3 ns    0.1875     59670400
BoxesOverlapBenchmark/PosedCase/1/1/-1            41.4 ns         41.4 ns    0.1875     16710841
BoxesOverlapBenchmark/PosedCase/0/2/-1            12.1 ns         12.1 ns    0.1875     58791224
BoxesOverlapBenchmark/PosedCase/1/2/-1            41.2 ns         41.2 ns    0.1875     17114134
BoxesOverlapBenchmark/PosedCase/0/0/0             28.2 ns         28.2 ns    0.1875     24030831
BoxesOverlapBenchmark/PosedCase/1/0/0             41.3 ns         41.3 ns    0.1875     17129613
BoxesOverlapBenchmark/PosedCase/0/1/0             32.0 ns         32.0 ns    0.1875     21884716
BoxesOverlapBenchmark/PosedCase/1/1/0             41.5 ns         41.5 ns    0.1875     15562419
BoxesOverlapBenchmark/PosedCase/0/2/0             34.9 ns         34.9 ns    0.1875     20026718
BoxesOverlapBenchmark/PosedCase/1/2/0             41.0 ns         41.0 ns    0.1875     17022138
BoxesOverlapBenchmark/PosedCase/0/0/1             31.6 ns         31.6 ns    0.1875     22007645
BoxesOverlapBenchmark/PosedCase/1/0/1             41.6 ns         41.6 ns    0.1875     16646245
BoxesOverlapBenchmark/PosedCase/0/1/1             34.9 ns         34.9 ns    0.1875     19944926
BoxesOverlapBenchmark/PosedCase/1/1/1             41.0 ns         40.8 ns    0.1875     17098019
BoxesOverlapBenchmark/PosedCase/0/2/1             37.6 ns         37.4 ns    0.1875     18830446
BoxesOverlapBenchmark/PosedCase/1/2/1             41.4 ns         41.2 ns    0.1875     16807910
BoxesOverlapBenchmark/PosedCase/0/0/2             35.0 ns         34.9 ns    0.1875     20040863
BoxesOverlapBenchmark/PosedCase/1/0/2             41.0 ns         40.8 ns    0.1875     16670585
BoxesOverlapBenchmark/PosedCase/0/1/2             37.5 ns         37.4 ns    0.1875     18546764
BoxesOverlapBenchmark/PosedCase/1/1/2             40.9 ns         40.7 ns    0.1875     17225832
BoxesOverlapBenchmark/PosedCase/0/2/2             41.0 ns         40.8 ns    0.1875     17366779
BoxesOverlapBenchmark/PosedCase/1/2/2             40.4 ns         40.3 ns    0.1875     17244243

Highway

------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU    Allocs   Iterations
------------------------------------------------------------------------------------------------
BoxesOverlapBenchmark/ParallelContainedCase       28.5 ns         28.5 ns     0.125     19737188
BoxesOverlapBenchmark/PosedCase/0/0/-1            13.6 ns         13.6 ns    0.1875     52039155
BoxesOverlapBenchmark/PosedCase/1/0/-1            29.1 ns         29.1 ns    0.1875     24285073
BoxesOverlapBenchmark/PosedCase/0/1/-1            13.7 ns         13.7 ns    0.1875     49932847
BoxesOverlapBenchmark/PosedCase/1/1/-1            30.1 ns         30.1 ns    0.1875     23355036
BoxesOverlapBenchmark/PosedCase/0/2/-1            13.6 ns         13.6 ns    0.1875     50565652
BoxesOverlapBenchmark/PosedCase/1/2/-1            28.9 ns         28.9 ns    0.1875     23586123
BoxesOverlapBenchmark/PosedCase/0/0/0             22.9 ns         22.9 ns    0.1875     30625961
BoxesOverlapBenchmark/PosedCase/1/0/0             29.0 ns         29.0 ns    0.1875     24223770
BoxesOverlapBenchmark/PosedCase/0/1/0             24.4 ns         24.4 ns    0.1875     28330168
BoxesOverlapBenchmark/PosedCase/1/1/0             29.4 ns         29.4 ns    0.1875     23944716
BoxesOverlapBenchmark/PosedCase/0/2/0             26.4 ns         26.4 ns    0.1875     26287474
BoxesOverlapBenchmark/PosedCase/1/2/0             29.1 ns         29.1 ns    0.1875     23929282
BoxesOverlapBenchmark/PosedCase/0/0/1             24.5 ns         24.5 ns    0.1875     27674972
BoxesOverlapBenchmark/PosedCase/1/0/1             29.7 ns         29.7 ns    0.1875     23606747
BoxesOverlapBenchmark/PosedCase/0/1/1             26.6 ns         26.6 ns    0.1875     26635718
BoxesOverlapBenchmark/PosedCase/1/1/1             31.0 ns         31.0 ns    0.1875     22288692
BoxesOverlapBenchmark/PosedCase/0/2/1             28.3 ns         28.3 ns    0.1875     23564967
BoxesOverlapBenchmark/PosedCase/1/2/1             30.0 ns         30.0 ns    0.1875     23136369
BoxesOverlapBenchmark/PosedCase/0/0/2             26.3 ns         26.3 ns    0.1875     26526882
BoxesOverlapBenchmark/PosedCase/1/0/2             30.2 ns         30.2 ns    0.1875     22954542
BoxesOverlapBenchmark/PosedCase/0/1/2             28.0 ns         28.0 ns    0.1875     24790100
BoxesOverlapBenchmark/PosedCase/1/1/2             30.2 ns         30.2 ns    0.1875     23085781
BoxesOverlapBenchmark/PosedCase/0/2/2             30.2 ns         30.2 ns    0.1875     23164136
BoxesOverlapBenchmark/PosedCase/1/2/2             29.4 ns         29.4 ns    0.1875     23627965

sherm1 · 2024-06-25T19:05:42Z

Wow! Great result.

rpoyner-tri · 2024-06-25T19:34:24Z

Correction: branch execution statistics anzu branch is now anzu#13134, superseding the prior attempt.

SeanCurtis-TRI · 2024-07-22T15:09:31Z

Hmmmm....

In adding the highway overhead for multiple compilation and runtime selection, the benchmark performance took a huge hit.

The worst cases have essentially doubled and the best cases have tripled in duration. As a reality check, I tried it again against the initial highway implementation and was able to observe the results documented above. To check it out yourself, look at the PR #21733.

------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU    Allocs   Iterations
------------------------------------------------------------------------------------------------
BoxesOverlapBenchmark/ParallelContainedCase       60.8 ns         60.8 ns     0.125      9833171
BoxesOverlapBenchmark/PosedCase/0/0/-1            40.5 ns         40.5 ns    0.1875     16907408
BoxesOverlapBenchmark/PosedCase/1/0/-1            52.4 ns         52.4 ns    0.1875     13169531
BoxesOverlapBenchmark/PosedCase/0/1/-1            40.1 ns         40.1 ns    0.1875     17885429
BoxesOverlapBenchmark/PosedCase/1/1/-1            55.3 ns         55.3 ns    0.1875     12229552
BoxesOverlapBenchmark/PosedCase/0/2/-1            41.4 ns         41.4 ns    0.1875     17048447
BoxesOverlapBenchmark/PosedCase/1/2/-1            56.4 ns         56.4 ns    0.1875     12077904
BoxesOverlapBenchmark/PosedCase/0/0/0             49.5 ns         49.5 ns    0.1875     13715203
BoxesOverlapBenchmark/PosedCase/1/0/0             54.8 ns         54.8 ns    0.1875     12218963
BoxesOverlapBenchmark/PosedCase/0/1/0             49.6 ns         49.6 ns    0.1875     13547289
BoxesOverlapBenchmark/PosedCase/1/1/0             53.2 ns         53.2 ns    0.1875     13072635
BoxesOverlapBenchmark/PosedCase/0/2/0             50.4 ns         50.4 ns    0.1875     10000000
BoxesOverlapBenchmark/PosedCase/1/2/0             57.1 ns         57.1 ns    0.1875     13320933
BoxesOverlapBenchmark/PosedCase/0/0/1             54.0 ns         54.0 ns    0.1875     13436138
BoxesOverlapBenchmark/PosedCase/1/0/1             55.0 ns         55.0 ns    0.1875     10632611
BoxesOverlapBenchmark/PosedCase/0/1/1             52.1 ns         52.1 ns    0.1875     12016522
BoxesOverlapBenchmark/PosedCase/1/1/1             55.1 ns         55.1 ns    0.1875     12805206
BoxesOverlapBenchmark/PosedCase/0/2/1             53.9 ns         53.9 ns    0.1875     13031730
BoxesOverlapBenchmark/PosedCase/1/2/1             55.2 ns         55.2 ns    0.1875     13063296
BoxesOverlapBenchmark/PosedCase/0/0/2             52.1 ns         52.1 ns    0.1875     13117466
BoxesOverlapBenchmark/PosedCase/1/0/2             55.6 ns         55.6 ns    0.1875     12824349
BoxesOverlapBenchmark/PosedCase/0/1/2             54.4 ns         54.4 ns    0.1875     12830509
BoxesOverlapBenchmark/PosedCase/1/1/2             55.4 ns         55.4 ns    0.1875     12067189
BoxesOverlapBenchmark/PosedCase/0/2/2             56.0 ns         56.0 ns    0.1875     12886335
BoxesOverlapBenchmark/PosedCase/1/2/2             55.9 ns         55.9 ns    0.1875     12708442

SeanCurtis-TRI added type: feature request component: geometry proximity Contact, distance, signed distance queries and related properties labels Jun 5, 2024

SeanCurtis-TRI added this to #dynamics (Drake board) Jun 5, 2024

jwnimmer-tri assigned SeanCurtis-TRI Jun 5, 2024

jwnimmer-tri added type: performance priority: high and removed type: feature request labels Jun 5, 2024

jwnimmer-tri moved this to Todo in #dynamics (Drake board) Jun 5, 2024

SeanCurtis-TRI mentioned this issue Jun 7, 2024

Move boxes_overlap tests from obb_test to boxes_overlap_test #21546

Merged

jwnimmer-tri mentioned this issue Jun 14, 2024

[math] Use highway instead of intrinsics #21571

Closed

jwnimmer-tri moved this from Todo to In Progress in #dynamics (Drake board) Jun 17, 2024

rpoyner-tri mentioned this issue Jun 20, 2024

[geometry] Add benchmark for internal::BoxesOverlap #21596

Merged

jwnimmer-tri mentioned this issue Jun 26, 2024

[math] Use dynamic dispatch for highway SIMD #21626

Merged

SeanCurtis-TRI mentioned this issue Jul 22, 2024

Reimplement OBB box overlap test using highway (simd) #21733

Merged

SeanCurtis-TRI closed this as completed in #21733 Jul 26, 2024

github-project-automation bot moved this from In Progress to Done in #dynamics (Drake board) Jul 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up oriented bounding box (OBB) overlap test #21526

Speed up oriented bounding box (OBB) overlap test #21526

SeanCurtis-TRI commented Jun 5, 2024 •

edited by jwnimmer-tri

Loading

jwnimmer-tri commented Jun 5, 2024

jwnimmer-tri commented Jun 5, 2024

jwnimmer-tri commented Jun 12, 2024

SeanCurtis-TRI commented Jun 13, 2024

jwnimmer-tri commented Jun 13, 2024 •

edited

Loading

SeanCurtis-TRI commented Jun 13, 2024

jwnimmer-tri commented Jun 13, 2024

jwnimmer-tri commented Jun 13, 2024

rpoyner-tri commented Jun 20, 2024

rpoyner-tri commented Jun 20, 2024

jwnimmer-tri commented Jun 20, 2024

rpoyner-tri commented Jun 20, 2024

rpoyner-tri commented Jun 20, 2024

sherm1 commented Jun 20, 2024

jwnimmer-tri commented Jun 20, 2024

rpoyner-tri commented Jun 21, 2024

SeanCurtis-TRI commented Jun 24, 2024

jwnimmer-tri commented Jun 24, 2024

SeanCurtis-TRI commented Jun 25, 2024

sherm1 commented Jun 25, 2024

rpoyner-tri commented Jun 25, 2024

SeanCurtis-TRI commented Jul 22, 2024

Speed up oriented bounding box (OBB) overlap test #21526

Speed up oriented bounding box (OBB) overlap test #21526

Comments

SeanCurtis-TRI commented Jun 5, 2024 • edited by jwnimmer-tri Loading

jwnimmer-tri commented Jun 5, 2024

jwnimmer-tri commented Jun 5, 2024

jwnimmer-tri commented Jun 12, 2024

SeanCurtis-TRI commented Jun 13, 2024

jwnimmer-tri commented Jun 13, 2024 • edited Loading

SeanCurtis-TRI commented Jun 13, 2024

jwnimmer-tri commented Jun 13, 2024

jwnimmer-tri commented Jun 13, 2024

rpoyner-tri commented Jun 20, 2024

rpoyner-tri commented Jun 20, 2024

jwnimmer-tri commented Jun 20, 2024

rpoyner-tri commented Jun 20, 2024

rpoyner-tri commented Jun 20, 2024

sherm1 commented Jun 20, 2024

jwnimmer-tri commented Jun 20, 2024

rpoyner-tri commented Jun 21, 2024

SeanCurtis-TRI commented Jun 24, 2024

jwnimmer-tri commented Jun 24, 2024

SeanCurtis-TRI commented Jun 25, 2024

Master version

Intrinsics

Highway

sherm1 commented Jun 25, 2024

rpoyner-tri commented Jun 25, 2024

SeanCurtis-TRI commented Jul 22, 2024

SeanCurtis-TRI commented Jun 5, 2024 •

edited by jwnimmer-tri

Loading

jwnimmer-tri commented Jun 13, 2024 •

edited

Loading