Speed up oriented bounding box (OBB) overlap test #21526
Comments
FYI my proof of concept branch is here: https://github.com/jwnimmer-tri/drake/commits/boxes-overlap/ |
Oh, one other thing I forgot to mention: if we really don't like how this looks using raw intrinsics, conceivably we could bring in https://github.com/google/highway to write it in a more normal C++ style that would compile down to efficient intrinsics (even better, with CPU-based dispatch to the best routine for that flavor). Hopefully we can skate by without it this time, but as we start writing more SIMD, I think it's probable we'll need to add |
@SeanCurtis-TRI if possible, it would be nice to carve off the "make a googlebench" part of this to @rpoyner-tri. (Even if you need to massage the space of input OBB domains afterwards.) Maybe you two can talk offline about it. |
I'm fully in favor of the carving.
Not sure what you mean exactly. |
Since the function has a couple of early-exit paths, the average performance will be a function of the distribution (and predictability) of input OBBs. Part of making a good microbenchmark will be feeding in OBBs that have similar statistics to the ones seen in a real application. |
I expect the easiest way to do that would be to take one of the representative scenarios, add some counters to the boxes overlap test and run things with the counters. Probably need to run across multiple manipulands. |
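A minimal sketch of what such counters could look like, assuming one slot per early-exit path plus one slot for "ran to completion". `ExitCounters`, `kNumExits`, and the slot layout are hypothetical names for illustration; they are not taken from Drake or from the statistics hack in #21598.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 15 separating-axis exits plus one slot for the "no axis
// separated, boxes overlap" path. Adjust to match the real function.
constexpr std::size_t kNumExits = 16;

// Thread-safe tally of which return path each overlap query took.
class ExitCounters {
 public:
  // Call this right before each `return` in the instrumented function.
  void Record(std::size_t exit_index) {
    assert(exit_index < kNumExits);
    counts_[exit_index].fetch_add(1, std::memory_order_relaxed);
  }

  std::uint64_t count(std::size_t exit_index) const {
    assert(exit_index < kNumExits);
    return counts_[exit_index].load(std::memory_order_relaxed);
  }

  // Total number of recorded queries across all exits.
  std::uint64_t total() const {
    std::uint64_t sum = 0;
    for (const auto& c : counts_) sum += c.load(std::memory_order_relaxed);
    return sum;
  }

 private:
  std::array<std::atomic<std::uint64_t>, kNumExits> counts_{};
};
```

Relaxed atomics keep the instrumentation overhead low while still giving correct totals if queries run on several threads.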
Great. A nice starting scenario for that might be the |
(But in any case, even a benchmark with only vanilla input statistics would still be a nice benefit. We can always dial in better statistics over time.) |
My thought is to start with focused benchmarks for each of the cases, then collect statistics, then implement some statistical-mix benchmarks. As I understand things so far, looks like I'll want to shuffle some common setup code out of boxes_overlap_test for use in benchmarks. |
Here are some branch execution count statistics from the anzu profiling target mentioned above. The statistics collection hackery is shown at #21598.
|
Good stuff! To help us grok it, could you baseline it to percentages? (I think that means summing to get the total calls, then dividing each number by that total?) |
Here's a google sheet with the same data, prettied up a bit: https://docs.google.com/spreadsheets/d/1TMtvXn3JanMObBe3tXTyTXfe3FcGoWcrhA5CApF9T9w/edit?usp=sharing |
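The normalization being asked for is just "count divided by total calls, times 100". As a small sketch (the function name `ToPercentages` is made up for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Converts raw per-branch execution counts into percentages of all calls.
// Returns all zeros if nothing was recorded.
std::vector<double> ToPercentages(const std::vector<std::uint64_t>& counts) {
  assert(!counts.empty());
  const std::uint64_t total =
      std::accumulate(counts.begin(), counts.end(), std::uint64_t{0});
  std::vector<double> result(counts.size(), 0.0);
  if (total == 0) return result;
  for (std::size_t i = 0; i < counts.size(); ++i) {
    result[i] =
        100.0 * static_cast<double>(counts[i]) / static_cast<double>(total);
  }
  return result;
}
```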
Surprising (at least to me) that 65% of the queries are overlap (full execution of the whole function). Does that match others' intuition? |
That's an interesting result -- looks like broad phase is doing a good job so most of the returns are overlaps. I wonder if that means it would be faster to look for overlap rather than looking for separation. |
That's one possibility, yes. Is there a formulation with fewer flops that checks for overlap instead of non-overlap? Another (SIMD) tactic would be to go branchless: unconditionally compute all 15 overlap queries into a mask, and return it directly (casting mask int to C++ bool). That might avoid some branch mispredict penalties. |
Memorialized the anzu branch used to measure here: |
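To make the branchless idea concrete, here is a scalar sketch that evaluates all 15 separating-axis queries with no data-dependent early exit and packs the results into a bit mask. It is not the SIMD version and not Drake's code: `Vec3`, `Mat3`, `SeparatingAxisMask`, the bit layout, and `kEps` are assumptions for illustration, and whether the compiler actually emits branch-free code is up to the optimizer.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical minimal types; real code would likely use Eigen.
using Vec3 = std::array<double, 3>;
using Mat3 = std::array<Vec3, 3>;  // R[i][j]: row i, column j.

// Box A has half extents `a` and is axis-aligned at the origin of frame A.
// Box B has half extents `b`; the columns of R are B's axes expressed in A,
// and t is B's center in A. Bit k of the result is set when axis k separates
// the boxes: bits 0-2 are A's faces, 3-5 are B's faces, 6-14 are A_i x B_j.
std::uint32_t SeparatingAxisMask(const Vec3& a, const Vec3& b, const Mat3& R,
                                 const Vec3& t) {
  assert(a[0] >= 0 && a[1] >= 0 && a[2] >= 0);
  constexpr double kEps = 1e-14;  // Padding for near-parallel edge pairs.
  Mat3 absR{};
  for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 3; ++j) absR[i][j] = std::abs(R[i][j]) + kEps;

  std::uint32_t mask = 0;
  for (int i = 0; i < 3; ++i) {  // A's face normals.
    const double rb = b[0] * absR[i][0] + b[1] * absR[i][1] + b[2] * absR[i][2];
    mask |= static_cast<std::uint32_t>(std::abs(t[i]) > a[i] + rb) << i;
  }
  for (int j = 0; j < 3; ++j) {  // B's face normals.
    const double ra = a[0] * absR[0][j] + a[1] * absR[1][j] + a[2] * absR[2][j];
    const double d = t[0] * R[0][j] + t[1] * R[1][j] + t[2] * R[2][j];
    mask |= static_cast<std::uint32_t>(std::abs(d) > ra + b[j]) << (3 + j);
  }
  for (int i = 0; i < 3; ++i) {  // Edge-edge axes A_i x B_j.
    const int i1 = (i + 1) % 3, i2 = (i + 2) % 3;
    for (int j = 0; j < 3; ++j) {
      const int j1 = (j + 1) % 3, j2 = (j + 2) % 3;
      const double ra = a[i1] * absR[i2][j] + a[i2] * absR[i1][j];
      const double rb = b[j1] * absR[i][j2] + b[j2] * absR[i][j1];
      const double d = std::abs(t[i2] * R[i1][j] - t[i1] * R[i2][j]);
      mask |= static_cast<std::uint32_t>(d > ra + rb) << (6 + 3 * i + j);
    }
  }
  return mask;
}

// The boxes overlap exactly when no axis separates them.
bool BoxesOverlapBranchless(const Vec3& a, const Vec3& b, const Mat3& R,
                            const Vec3& t) {
  return SeparatingAxisMask(a, b, R, t) == 0;
}
```

The trade-off is the one raised in the thread: this always pays for all 15 queries, which only wins if most calls would have run to completion anyway or if mispredicted early exits are expensive.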
There are two pieces of information necessary to gain a greater understanding of the value of the broadphase culling:
That said, the efficacy of the BVH code is not at stake here. The only question is: given what the code does, can we get it to do it faster? Whether the BVH code is doing a "good" thing is a question for a later date. |
FYI @rpoyner-tri my hack branch was boxes-overlap. |
Fun update. I ran the overlap benchmark in #21596 three times. Once against the
Master version
Intrinsics
Highway
|
Wow! Great result. |
Correction: the anzu branch with the branch execution statistics is now |
Hmmmm.... In adding the highway overhead for multiple compilation and runtime selection, the benchmark performance took a huge hit. The worst cases have essentially doubled and the best cases have tripled in duration. As a reality check, I tried it again against the initial highway implementation and was able to observe the results documented above. To check it out yourself, look at the PR #21733.
|
Is your feature request related to a problem? Please describe.
Because we advocate using compliant hydroelastics as the preferred mode of contact, the tet-tet pair culling is incredibly important. We use an oriented bounding box (OBB) bounding volume hierarchy (BVH) to "quickly" cull as many tet-pair candidates as possible. As such, we test pairs of OBBs for overlap a lot.
Even if we make no algorithmic changes to how we cull tet pairs, simply speeding up the overlapping test should yield dividends.
Describe the solution you'd like
The OBB overlap test lends itself well to SIMD performance improvements. (Its search for separating planes naturally partitions into parallel lumps of computation suitable for SIMD). We should offer up a SIMD implementation.
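For reference, the textbook separating-axis formulation splits into three lumps: three axes from box A's faces, three from box B's faces, and nine from edge cross products. The scalar sketch below shows that structure with early exits. It is an illustration under assumptions, not Drake's implementation: `Vec3`, `Mat3`, `BoxesOverlapSketch`, and `kEps` are hypothetical names and values.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical minimal types; real code would likely use Eigen.
using Vec3 = std::array<double, 3>;
using Mat3 = std::array<Vec3, 3>;  // R[i][j]: row i, column j.

// Box A has half extents `a` and is axis-aligned at the origin of frame A.
// Box B has half extents `b`; the columns of R are B's axes expressed in A,
// and t is B's center in A. Returns false as soon as a separating axis is
// found, true if none of the 15 candidate axes separates the boxes.
bool BoxesOverlapSketch(const Vec3& a, const Vec3& b, const Mat3& R,
                        const Vec3& t) {
  assert(a[0] >= 0 && a[1] >= 0 && a[2] >= 0);
  constexpr double kEps = 1e-14;  // Padding for near-parallel edge pairs.
  Mat3 absR{};
  for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 3; ++j) absR[i][j] = std::abs(R[i][j]) + kEps;

  // Lump 1: A's three face normals.
  for (int i = 0; i < 3; ++i) {
    const double rb = b[0] * absR[i][0] + b[1] * absR[i][1] + b[2] * absR[i][2];
    if (std::abs(t[i]) > a[i] + rb) return false;
  }
  // Lump 2: B's three face normals.
  for (int j = 0; j < 3; ++j) {
    const double ra = a[0] * absR[0][j] + a[1] * absR[1][j] + a[2] * absR[2][j];
    const double d = t[0] * R[0][j] + t[1] * R[1][j] + t[2] * R[2][j];
    if (std::abs(d) > ra + b[j]) return false;
  }
  // Lump 3: nine edge-edge axes A_i x B_j.
  for (int i = 0; i < 3; ++i) {
    const int i1 = (i + 1) % 3, i2 = (i + 2) % 3;
    for (int j = 0; j < 3; ++j) {
      const int j1 = (j + 1) % 3, j2 = (j + 2) % 3;
      const double ra = a[i1] * absR[i2][j] + a[i2] * absR[i1][j];
      const double rb = b[j1] * absR[i][j2] + b[j2] * absR[i][j1];
      const double d = std::abs(t[i2] * R[i1][j] - t[i1] * R[i2][j]);
      if (d > ra + rb) return false;
    }
  }
  return true;
}
```

Each lump applies the same arithmetic to three (or nine) independent axes, which is what makes it a natural fit for SIMD lanes.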
The SIMD implementation needs the same kind of fallback that the rotation matrix multiplication has (i.e., support for when run on a system where SIMD is not available).
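One possible shape for that fallback, sketched with a function pointer chosen once at first use. Everything here is an assumption for illustration: the kernel is a stand-in (an axis-aligned overlap check rather than the full OBB test), `OverlapSimd` simply forwards to the portable code so the sketch builds anywhere, and the CPU check uses the GCC/Clang builtin `__builtin_cpu_supports` on x86 only. It is not how Drake's rotation matrix code does it.

```cpp
#include <cassert>
#include <cmath>

// Stand-in kernel signature (hypothetical): overlap of two axis-aligned boxes
// with half extents a and b whose centers are offset by t.
using OverlapKernel = bool (*)(const double* a, const double* b,
                               const double* t);

// Portable scalar implementation; always available.
bool OverlapPortable(const double* a, const double* b, const double* t) {
  for (int i = 0; i < 3; ++i) {
    if (std::abs(t[i]) > a[i] + b[i]) return false;
  }
  return true;
}

// Placeholder for a SIMD build of the same kernel. A real version would use
// intrinsics compiled for the target instruction set; here it forwards to the
// portable code so that the sketch compiles on any platform.
bool OverlapSimd(const double* a, const double* b, const double* t) {
  return OverlapPortable(a, b, t);
}

// Runtime CPU check. Only GCC/Clang on x86 are handled; everything else
// reports "no SIMD" and therefore takes the portable path.
bool CpuSupportsAvx2() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  return __builtin_cpu_supports("avx2") != 0;
#else
  return false;
#endif
}

OverlapKernel ChooseKernel() {
  return CpuSupportsAvx2() ? &OverlapSimd : &OverlapPortable;
}

// Public entry point: selects the kernel once, then reuses it.
bool BoxesOverlapDispatch(const double* a, const double* b, const double* t) {
  static const OverlapKernel kernel = ChooseKernel();
  assert(kernel != nullptr);
  return kernel(a, b, t);
}
```

The key property to preserve is that both kernels return identical answers, so callers never need to know which one ran.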
In addition, we need some supporting infrastructure:
Describe alternatives you've considered
We could consider different bounding volume types (spheres, etc.) and different traversal algorithms (parallel, etc.). However, those are much larger endeavors with uncertain return. Generally, OBB overlap is a valuable operation and making it faster is an unalloyed good. Even if we end up accelerating compliant hydro contact using a different bounding volume type, this function should still be faster.