Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up oriented bounding box (OBB) overlap test #21526

Closed
3 tasks done
SeanCurtis-TRI opened this issue Jun 5, 2024 · 22 comments · Fixed by #21733
Closed
3 tasks done

Speed up oriented bounding box (OBB) overlap test #21526

SeanCurtis-TRI opened this issue Jun 5, 2024 · 22 comments · Fixed by #21733
Assignees
Labels
component: geometry proximity Contact, distance, signed distance queries and related properties priority: high type: performance

Comments

@SeanCurtis-TRI
Copy link
Contributor

SeanCurtis-TRI commented Jun 5, 2024

Is your feature request related to a problem? Please describe.
Because we advocate using compliant hydroelastics as the preferred mode of contact, the tet-tet pair culling is incredibly important. We use an oriented bounding box (OBB) bounding volume hierarchy (BVH) to "quickly" cull as many tet-pair candidates as possible. As such, we test pairs of OBBs for overlapping a lot.

Even if we make no algorithmic changes to how we cull tet pairs, simply speeding up the overlapping test should yield dividends.

Describe the solution you'd like
The OBB overlap test lends itself well to SIMD performance improvements. (Its search for separating planes naturally partitions into parallel lumps of computation suitable for SIMD). We should offer up a SIMD implementation.

The SIMD implementation needs the same kind of fallback that the rotation matrix multiplication has (i.e., support for when run on a system where SIMD is not available).

In addition, we need some supporting infrastructure:

  • We need some benchmarking so that we can track performance changes and prevent regression (or, at the very least, feel good about changes).
  • The unit tests on the OBB overlap code are currently implicit in the test for a different compilation unit. They need to be brought into their own dedicated unit tests.

Describe alternatives you've considered
We could consider different bounding volume types (spheres, etc.) and different traversal algorithms (parallel, etc.). However, those are much larger endeavors with uncertain return. Generally, OBB overlap is a valuable operation and making it faster is an unfettered good. Even if we end up accelerating compliant hydro contact using a different bounding volume type, this function should still be faster.

@SeanCurtis-TRI SeanCurtis-TRI added type: feature request component: geometry proximity Contact, distance, signed distance queries and related properties labels Jun 5, 2024
@jwnimmer-tri
Copy link
Collaborator

FYI my proof of concept branch is here: https://github.com/jwnimmer-tri/drake/commits/boxes-overlap/

@jwnimmer-tri
Copy link
Collaborator

Oh, one other thing I forgot to mention: if we really don't like the how this looks using raw intrinsics, conceivably we could bring in https://github.com/google/highway to write it in a more normal C++ style that would compile down to efficient intrinsics (even better, with cpu-based dispatch to the best routine for that flavor). Hopefully we can skate by without this time, but as we start writing more SIMD, I think it's probable we'll need to add highway eventually.

@jwnimmer-tri
Copy link
Collaborator

@SeanCurtis-TRI if possible, it would be nice to cave off the "make a googlebench" part of this to @rpoyner-tri. (Even if you need to massage the space of input OBB domains afterwards.) Maybe you two can talk offline about it.

@SeanCurtis-TRI
Copy link
Contributor Author

I'm fully in favor of the carving.

need to massage the space of input oBB domains afterwards

Not sure what you mean exactly.

@jwnimmer-tri
Copy link
Collaborator

jwnimmer-tri commented Jun 13, 2024

Not sure what you mean exactly.

Since the function has a couple of early-exit paths, the average performance will be a function of the distribution (and predictability) of input OBBs. Part of making a good microbenchmark will be feeding in OBBs that have similar statistics to the ones seen in a real application.

@SeanCurtis-TRI
Copy link
Contributor Author

I expect the easiest way to do that would be to take one of the representative scenarios, add some counters to the boxes overlap test and run things with the counters. Probably need to run across multiple manipulands.

@jwnimmer-tri
Copy link
Collaborator

Great. A nice starting scenario for that might be the anzu#13014 replay, which Rico is already going to re-profile soon. We can run it with some OBB case 1 / case 2 / case 3 static counters.

@jwnimmer-tri
Copy link
Collaborator

(But in any case, even a benchmark with only vanilla input statistics would still be a nice benefit. We can always dial in better statistics over time.)

@rpoyner-tri
Copy link
Contributor

My thought is to start with focused benchmarks for each of the cases, then collect statistics, then implement some statistical-mix benchmarks.

As I understand things so far, looks like I'll want to shuffle some common setup code out of boxes_overlap_test for use in benchmarks.

@rpoyner-tri
Copy link
Contributor

Here are some branch execution count statistics from the anzu profiling target mentioned above. The statistics collection hackery is shown at #21598.

A axis separation returns: [3944039, 8807791, 14267170]
B axis separation returns: [1487990, 2207006, 3686886]
Cross product axis separation returns [2137676, 1165955, 447086, 683011, 343790, 145649, 120081, 73198, 21485]
Overlap returns [76206736]

@jwnimmer-tri
Copy link
Collaborator

Good stuff! To help us grok it, could you baseline it to percentages? (I think that means summing to get the total calls, then diving that across the numbers?)

@rpoyner-tri
Copy link
Contributor

Here's a google sheet with the same data, prettied up a bit.: https://docs.google.com/spreadsheets/d/1TMtvXn3JanMObBe3tXTyTXfe3FcGoWcrhA5CApF9T9w/edit?usp=sharing

@rpoyner-tri
Copy link
Contributor

Surprising (at least to me) that 65% of the queries are overlap (full execution of the whole function). Does that match others' intuition?

@sherm1
Copy link
Member

sherm1 commented Jun 20, 2024

That's an interesting result -- looks like broad phase is doing a good job so most of the returns are overlaps. I wonder if that means it would be faster to look for overlap rather than looking for separation.

@jwnimmer-tri
Copy link
Collaborator

That's one possibility, yes. Is there a fewer-flops math to check for overlap instead of non-overlap?

Another (SIMD) tactic would be to use branchless -- unconditionally compute all 15 overlap queries into a mask, and return it directly (casting mask int to C++ bool). That might avoid some branch mispredict penalties.

@rpoyner-tri
Copy link
Contributor

Memorialized the anzu branch use to measure here: anzu#13099

@SeanCurtis-TRI
Copy link
Contributor Author

There are two pieces of information necessary to gain a greater understanding of the value of the broadphase culling:

  1. You need some sense of how many possible obb-obb pairs are not being evaluated. I.e., for every "these obbs are separated" result you get, you'd want to know the size of the sub-tree of the bounding volume traversal tree lies underneath it.
  2. You'd also like to know what fraction of the obb-obb pairs marked as "overlapping" actually produce overlapping "primitives". For example it would be easy to say true far too much, leading to unhelpful and expensive primitive-primitive tests.

That said, the efficacy of the BVH code is not at stake here. Merely the question of given what the code does can we get it to do it faster. The question of whether or not the BVH code is doing a "good" thing is for a later date.

@jwnimmer-tri
Copy link
Collaborator

FYI @rpoyner-tri my hack branch was boxes-overlap.

@SeanCurtis-TRI
Copy link
Contributor Author

Fun update.

I ran the overlap benchmark in #21596 three times. Once against the BoxesOverlap() in master, once against the intrinsics-based implementation in boxes-overlap, and once against the highway version of the same (located here). The benchmark results are down below, but here's a summary of what jumped out at me:

  1. Master -> SIMD reduces the cost to almost 1/3 of the original. Not surprising as we are taking the 15 calculations and doing it in five bands of three.
  2. intrinsics -> highway reduces the cost a bit more.
    • Highway isn't quite as fast in a couple of cases -- the absolute fastest cases for intrinsics (12.5 ns becomes 13.6 ns).
    • However, in all other cases, it takes only 3/4 of the time of the intrinsics. Yowza. That means the ratio from master to highway is almost 1/4.

Master version

------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU    Allocs   Iterations
------------------------------------------------------------------------------------------------
BoxesOverlapBenchmark/ParallelContainedCase        129 ns          129 ns     0.125      4968178
BoxesOverlapBenchmark/PosedCase/0/0/-1            28.9 ns         28.9 ns    0.1875     24954658
BoxesOverlapBenchmark/PosedCase/1/0/-1             113 ns          113 ns    0.1875      6287359
BoxesOverlapBenchmark/PosedCase/0/1/-1            32.9 ns         32.9 ns    0.1875     21461271
BoxesOverlapBenchmark/PosedCase/1/1/-1             119 ns          119 ns    0.1875      6101520
BoxesOverlapBenchmark/PosedCase/0/2/-1            35.4 ns         35.4 ns    0.1875     19722903
BoxesOverlapBenchmark/PosedCase/1/2/-1             111 ns          111 ns    0.1875      6349643
BoxesOverlapBenchmark/PosedCase/0/0/0             52.4 ns         52.4 ns    0.1875     13089050
BoxesOverlapBenchmark/PosedCase/1/0/0              112 ns          112 ns    0.1875      6309225
BoxesOverlapBenchmark/PosedCase/0/1/0             68.7 ns         68.6 ns    0.1875     10190161
BoxesOverlapBenchmark/PosedCase/1/1/0              113 ns          113 ns    0.1875      6130558
BoxesOverlapBenchmark/PosedCase/0/2/0             83.3 ns         83.3 ns    0.1875      8344837
BoxesOverlapBenchmark/PosedCase/1/2/0              110 ns          110 ns    0.1875      6341236
BoxesOverlapBenchmark/PosedCase/0/0/1             66.7 ns         66.7 ns    0.1875     10517465
BoxesOverlapBenchmark/PosedCase/1/0/1              110 ns          110 ns    0.1875      6322427
BoxesOverlapBenchmark/PosedCase/0/1/1             81.4 ns         81.4 ns    0.1875      8648747
BoxesOverlapBenchmark/PosedCase/1/1/1              111 ns          111 ns    0.1875      6329373
BoxesOverlapBenchmark/PosedCase/0/2/1             95.7 ns         95.7 ns    0.1875      7211109
BoxesOverlapBenchmark/PosedCase/1/2/1              112 ns          112 ns    0.1875      6359837
BoxesOverlapBenchmark/PosedCase/0/0/2             81.9 ns         81.9 ns    0.1875      8431608
BoxesOverlapBenchmark/PosedCase/1/0/2              112 ns          112 ns    0.1875      6409212
BoxesOverlapBenchmark/PosedCase/0/1/2             97.9 ns         97.9 ns    0.1875      7112036
BoxesOverlapBenchmark/PosedCase/1/1/2              113 ns          113 ns    0.1875      6076795
BoxesOverlapBenchmark/PosedCase/0/2/2              113 ns          113 ns    0.1875      6211708
BoxesOverlapBenchmark/PosedCase/1/2/2              113 ns          113 ns    0.1875      6204990

Intrinsics

------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU    Allocs   Iterations
------------------------------------------------------------------------------------------------
BoxesOverlapBenchmark/ParallelContainedCase       61.4 ns         61.4 ns     0.125      9166908
BoxesOverlapBenchmark/PosedCase/0/0/-1            12.5 ns         12.5 ns    0.1875     53171712
BoxesOverlapBenchmark/PosedCase/1/0/-1            40.8 ns         40.8 ns    0.1875     17289468
BoxesOverlapBenchmark/PosedCase/0/1/-1            12.3 ns         12.3 ns    0.1875     59670400
BoxesOverlapBenchmark/PosedCase/1/1/-1            41.4 ns         41.4 ns    0.1875     16710841
BoxesOverlapBenchmark/PosedCase/0/2/-1            12.1 ns         12.1 ns    0.1875     58791224
BoxesOverlapBenchmark/PosedCase/1/2/-1            41.2 ns         41.2 ns    0.1875     17114134
BoxesOverlapBenchmark/PosedCase/0/0/0             28.2 ns         28.2 ns    0.1875     24030831
BoxesOverlapBenchmark/PosedCase/1/0/0             41.3 ns         41.3 ns    0.1875     17129613
BoxesOverlapBenchmark/PosedCase/0/1/0             32.0 ns         32.0 ns    0.1875     21884716
BoxesOverlapBenchmark/PosedCase/1/1/0             41.5 ns         41.5 ns    0.1875     15562419
BoxesOverlapBenchmark/PosedCase/0/2/0             34.9 ns         34.9 ns    0.1875     20026718
BoxesOverlapBenchmark/PosedCase/1/2/0             41.0 ns         41.0 ns    0.1875     17022138
BoxesOverlapBenchmark/PosedCase/0/0/1             31.6 ns         31.6 ns    0.1875     22007645
BoxesOverlapBenchmark/PosedCase/1/0/1             41.6 ns         41.6 ns    0.1875     16646245
BoxesOverlapBenchmark/PosedCase/0/1/1             34.9 ns         34.9 ns    0.1875     19944926
BoxesOverlapBenchmark/PosedCase/1/1/1             41.0 ns         40.8 ns    0.1875     17098019
BoxesOverlapBenchmark/PosedCase/0/2/1             37.6 ns         37.4 ns    0.1875     18830446
BoxesOverlapBenchmark/PosedCase/1/2/1             41.4 ns         41.2 ns    0.1875     16807910
BoxesOverlapBenchmark/PosedCase/0/0/2             35.0 ns         34.9 ns    0.1875     20040863
BoxesOverlapBenchmark/PosedCase/1/0/2             41.0 ns         40.8 ns    0.1875     16670585
BoxesOverlapBenchmark/PosedCase/0/1/2             37.5 ns         37.4 ns    0.1875     18546764
BoxesOverlapBenchmark/PosedCase/1/1/2             40.9 ns         40.7 ns    0.1875     17225832
BoxesOverlapBenchmark/PosedCase/0/2/2             41.0 ns         40.8 ns    0.1875     17366779
BoxesOverlapBenchmark/PosedCase/1/2/2             40.4 ns         40.3 ns    0.1875     17244243

Highway

------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU    Allocs   Iterations
------------------------------------------------------------------------------------------------
BoxesOverlapBenchmark/ParallelContainedCase       28.5 ns         28.5 ns     0.125     19737188
BoxesOverlapBenchmark/PosedCase/0/0/-1            13.6 ns         13.6 ns    0.1875     52039155
BoxesOverlapBenchmark/PosedCase/1/0/-1            29.1 ns         29.1 ns    0.1875     24285073
BoxesOverlapBenchmark/PosedCase/0/1/-1            13.7 ns         13.7 ns    0.1875     49932847
BoxesOverlapBenchmark/PosedCase/1/1/-1            30.1 ns         30.1 ns    0.1875     23355036
BoxesOverlapBenchmark/PosedCase/0/2/-1            13.6 ns         13.6 ns    0.1875     50565652
BoxesOverlapBenchmark/PosedCase/1/2/-1            28.9 ns         28.9 ns    0.1875     23586123
BoxesOverlapBenchmark/PosedCase/0/0/0             22.9 ns         22.9 ns    0.1875     30625961
BoxesOverlapBenchmark/PosedCase/1/0/0             29.0 ns         29.0 ns    0.1875     24223770
BoxesOverlapBenchmark/PosedCase/0/1/0             24.4 ns         24.4 ns    0.1875     28330168
BoxesOverlapBenchmark/PosedCase/1/1/0             29.4 ns         29.4 ns    0.1875     23944716
BoxesOverlapBenchmark/PosedCase/0/2/0             26.4 ns         26.4 ns    0.1875     26287474
BoxesOverlapBenchmark/PosedCase/1/2/0             29.1 ns         29.1 ns    0.1875     23929282
BoxesOverlapBenchmark/PosedCase/0/0/1             24.5 ns         24.5 ns    0.1875     27674972
BoxesOverlapBenchmark/PosedCase/1/0/1             29.7 ns         29.7 ns    0.1875     23606747
BoxesOverlapBenchmark/PosedCase/0/1/1             26.6 ns         26.6 ns    0.1875     26635718
BoxesOverlapBenchmark/PosedCase/1/1/1             31.0 ns         31.0 ns    0.1875     22288692
BoxesOverlapBenchmark/PosedCase/0/2/1             28.3 ns         28.3 ns    0.1875     23564967
BoxesOverlapBenchmark/PosedCase/1/2/1             30.0 ns         30.0 ns    0.1875     23136369
BoxesOverlapBenchmark/PosedCase/0/0/2             26.3 ns         26.3 ns    0.1875     26526882
BoxesOverlapBenchmark/PosedCase/1/0/2             30.2 ns         30.2 ns    0.1875     22954542
BoxesOverlapBenchmark/PosedCase/0/1/2             28.0 ns         28.0 ns    0.1875     24790100
BoxesOverlapBenchmark/PosedCase/1/1/2             30.2 ns         30.2 ns    0.1875     23085781
BoxesOverlapBenchmark/PosedCase/0/2/2             30.2 ns         30.2 ns    0.1875     23164136
BoxesOverlapBenchmark/PosedCase/1/2/2             29.4 ns         29.4 ns    0.1875     23627965

@sherm1
Copy link
Member

sherm1 commented Jun 25, 2024

Wow! Great result.

@rpoyner-tri
Copy link
Contributor

Correction: branch execution statistics anzu branch is now anzu#13134, superseding the prior attempt.

@SeanCurtis-TRI
Copy link
Contributor Author

Hmmmm....

In adding the highway overhead for multiple compilation and runtime selection, the benchmark performance took a huge hit.

The worst cases have essentially doubled and the best cases have tripled in duration. As a reality check, I tried it again against the initial highway implementation and was able to observe the results documented above. To check it out yourself, look at the PR #21733.

------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU    Allocs   Iterations
------------------------------------------------------------------------------------------------
BoxesOverlapBenchmark/ParallelContainedCase       60.8 ns         60.8 ns     0.125      9833171
BoxesOverlapBenchmark/PosedCase/0/0/-1            40.5 ns         40.5 ns    0.1875     16907408
BoxesOverlapBenchmark/PosedCase/1/0/-1            52.4 ns         52.4 ns    0.1875     13169531
BoxesOverlapBenchmark/PosedCase/0/1/-1            40.1 ns         40.1 ns    0.1875     17885429
BoxesOverlapBenchmark/PosedCase/1/1/-1            55.3 ns         55.3 ns    0.1875     12229552
BoxesOverlapBenchmark/PosedCase/0/2/-1            41.4 ns         41.4 ns    0.1875     17048447
BoxesOverlapBenchmark/PosedCase/1/2/-1            56.4 ns         56.4 ns    0.1875     12077904
BoxesOverlapBenchmark/PosedCase/0/0/0             49.5 ns         49.5 ns    0.1875     13715203
BoxesOverlapBenchmark/PosedCase/1/0/0             54.8 ns         54.8 ns    0.1875     12218963
BoxesOverlapBenchmark/PosedCase/0/1/0             49.6 ns         49.6 ns    0.1875     13547289
BoxesOverlapBenchmark/PosedCase/1/1/0             53.2 ns         53.2 ns    0.1875     13072635
BoxesOverlapBenchmark/PosedCase/0/2/0             50.4 ns         50.4 ns    0.1875     10000000
BoxesOverlapBenchmark/PosedCase/1/2/0             57.1 ns         57.1 ns    0.1875     13320933
BoxesOverlapBenchmark/PosedCase/0/0/1             54.0 ns         54.0 ns    0.1875     13436138
BoxesOverlapBenchmark/PosedCase/1/0/1             55.0 ns         55.0 ns    0.1875     10632611
BoxesOverlapBenchmark/PosedCase/0/1/1             52.1 ns         52.1 ns    0.1875     12016522
BoxesOverlapBenchmark/PosedCase/1/1/1             55.1 ns         55.1 ns    0.1875     12805206
BoxesOverlapBenchmark/PosedCase/0/2/1             53.9 ns         53.9 ns    0.1875     13031730
BoxesOverlapBenchmark/PosedCase/1/2/1             55.2 ns         55.2 ns    0.1875     13063296
BoxesOverlapBenchmark/PosedCase/0/0/2             52.1 ns         52.1 ns    0.1875     13117466
BoxesOverlapBenchmark/PosedCase/1/0/2             55.6 ns         55.6 ns    0.1875     12824349
BoxesOverlapBenchmark/PosedCase/0/1/2             54.4 ns         54.4 ns    0.1875     12830509
BoxesOverlapBenchmark/PosedCase/1/1/2             55.4 ns         55.4 ns    0.1875     12067189
BoxesOverlapBenchmark/PosedCase/0/2/2             56.0 ns         56.0 ns    0.1875     12886335
BoxesOverlapBenchmark/PosedCase/1/2/2             55.9 ns         55.9 ns    0.1875     12708442

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component: geometry proximity Contact, distance, signed distance queries and related properties priority: high type: performance
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants