-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[math] Use dynamic dispatch for highway SIMD #21626
Conversation
660620f
to
8bbbba9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 10 of 12 files at r1, 2 of 3 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: needs platform reviewer assigned, needs at least two assigned reviewers, labeled "do not merge"
math/fast_pose_composition_functions_avx2_fma.cc
line 614 at r1 (raw file):
#else // HWY_MAX_BYTES /* The portable versions are always defined. They should be written to maximize
FYI For reviewers -- the content of this section is copied from the (now-deleted) file named fast_pose_composition_functions_wrapper.cc
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 12 files at r1, 6 of 6 files at r4, all commit messages.
Reviewable status: needs platform reviewer assigned, needs at least two assigned reviewers, labeled "do not merge" (waiting on @jwnimmer-tri)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: needs platform reviewer assigned, needs at least two assigned reviewers, labeled "do not merge" (waiting on @jwnimmer-tri)
common/hwy_dynamic_impl.h
line 35 at r4 (raw file):
be wrapped. */ template <typename ChooseFunctor> class LateBoundFunction {
FYI We don't strictly need the LateBoundFunction helper. The built-in dispatcher in highway is only like 5 instructions or so. This just brings it down to 1 instruction, like we already had.
If we like this idea, my goal would be to submit this code upstream so it becomes part of highway itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, +@sherm1 this is probably ready to start feature review.
Reviewable status: 1 unresolved discussion, LGTM missing from assignee sherm1(platform), needs at least two assigned reviewers, labeled "do not merge" (waiting on @jwnimmer-tri)
a discussion (no related file):
Working
Drake should obey numpy's environment variable command that opts-out of CPU features (towards reproducible science). Not sure if I will do that here, or in a follow-up PR.
81b5e04
to
6e02f33
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: 1 unresolved discussion, LGTM missing from assignee sherm1(platform), needs at least two assigned reviewers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 unresolved discussion, LGTM missing from assignee sherm1(platform), needs at least two assigned reviewers
math/fast_pose_composition_functions_avx2_fma.cc
line 718 at r6 (raw file):
// Create the lookup tables for the per-CPU hwy implementation functions, and // required functors that select from the lookup tables.
FYI I also considered making some sugar macros (or something) for this boilerplate, but I didn't want to go overboard. Happy to take input on what would help. My currently thought is we let Sean's code come into play as a second data point, and then try to clean it up once we have that additional experience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow that assembly code is beautiful!
Feature review checkpoint.
Reviewed 5 of 12 files at r1, 4 of 6 files at r4, 2 of 3 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: 4 unresolved discussions, LGTM missing from assignee sherm1(platform), needs at least two assigned reviewers
common/BUILD.bazel
line 358 at r6 (raw file):
# Highway automatically compiles our functions for multiple CPU # targets, so that the code is tuned as best as possible for that # target. However, some of that specialization is unncessary for our
typo unncessary
common/hwy_dynamic.h
line 10 at r6 (raw file):
/* Anywhere in Drake that uses Highway for dynamic CPU dispatch should call this helper function to register a reset handler to clear the latched CPU detction.
typo detction
common/hwy_dynamic_impl.h
line 18 at r6 (raw file):
In a given process, the first time our Call() function is called, it will latch-initialize the wrapped function pointer as follows: - call the ChooseFunctor to select a which function pointer to use,
typo "a which" -> which
math/fast_pose_composition_functions_avx2_fma.cc
line 718 at r6 (raw file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
FYI I also considered making some sugar macros (or something) for this boilerplate, but I didn't want to go overboard. Happy to take input on what would help. My currently thought is we let Sean's code come into play as a second data point, and then try to clean it up once we have that additional experience.
Fine as is IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r7, all commit messages.
Reviewable status: 1 unresolved discussion, LGTM missing from assignee sherm1(platform), needs at least two assigned reviewers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 2 unresolved discussions, LGTM missing from assignee sherm1(platform), needs at least two assigned reviewers
common/BUILD.bazel
line 361 at r7 (raw file):
# use of SIMD in Drake, so we opt-out of targets that do not offer any # advantages vs their build time / code size. "HWY_DISABLED_TARGETS=(" + "|".join([
Working
I'm going to swap this from a deny-list to an allow-list. I think having a known list of targets will be easier to maintain and ensure we have test coverage for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r8, all commit messages.
Reviewable status: 1 unresolved discussion, LGTM missing from assignee sherm1(platform), needs at least two assigned reviewers
common/BUILD.bazel
line 361 at r7 (raw file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
Working
I'm going to swap this from a deny-list to an allow-list. I think having a known list of targets will be easier to maintain and ensure we have test coverage for.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feature with a few minor comments and questions
Reviewed 1 of 12 files at r1, 1 of 6 files at r4, 2 of 3 files at r7, 1 of 1 files at r8, all commit messages.
Reviewable status: 5 unresolved discussions, needs at least two assigned reviewers
common/BUILD.bazel
line 366 at r8 (raw file):
# # Highway only offers a deny-list for targets, so we'll need to deny # everything and then un-deny what we want to allow.
BTW consider noting that each of the architectures is represented by a single bit and that the deny list is the "or" of the losers. I had to stare at this awhile to figure out what was happening here.
(But I do like the "allow" list better)
common/hwy_dynamic_impl.h
line 59 at r8 (raw file):
} /* (For testing only) Clears the latched target detection. This allows for
nit: this says "for testing only" but it is used just above in ChooseThenCall.
common/hwy_dynamic_impl.h
line 74 at r8 (raw file):
// All operations on this pointer must use memory_order::relaxed, which is // zero-cost on the platforms we care about. Note that the default value for
BTW is "zero cost" an overstatement? There is still the cost of an atomic access even if the compiler can still do reordering. Can those ever be zero cost?
math/test/fast_pose_composition_functions_test.cc
line 171 at r8 (raw file):
// Check the answer. EXPECT_TRUE(CompareMatrices(arg3->GetAsMatrix4(), expected));
BTW this test code is very nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status: 4 unresolved discussions, needs at least two assigned reviewers
common/BUILD.bazel
line 366 at r8 (raw file):
Previously, sherm1 (Michael Sherman) wrote…
BTW consider noting that each of the architectures is represented by a single bit and that the deny list is the "or" of the losers. I had to stare at this awhile to figure out what was happening here.
(But I do like the "allow" list better)
Excellent point. In writing the comment, I realized I could slightly simplify the code, too.
common/hwy_dynamic_impl.h
line 59 at r8 (raw file):
Previously, sherm1 (Michael Sherman) wrote…
nit: this says "for testing only" but it is used just above in ChooseThenCall.
Hmm. The use below only takes the address of this function (so that tests can indirectly call it) -- ChooseThenCall doesn't actually call this. Should I clarify some comments somewhere to improve the situation?
common/hwy_dynamic_impl.h
line 74 at r8 (raw file):
Previously, sherm1 (Michael Sherman) wrote…
BTW is "zero cost" an overstatement? There is still the cost of an atomic access even if the compiler can still do reordering. Can those ever be zero cost?
That's the point of the comment -- this isn't actually atomic. For our targets, this is just a vanilla MOV on both the both load side and the store side. Maybe I should rephrase it somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r9, all commit messages.
Reviewable status: 3 unresolved discussions, needs at least two assigned reviewers
common/BUILD.bazel
line 366 at r8 (raw file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
Excellent point. In writing the comment, I realized I could slightly simplify the code, too.
Nice!
common/hwy_dynamic_impl.h
line 59 at r8 (raw file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
Hmm. The use below only takes the address of this function (so that tests can indirectly call it) -- ChooseThenCall doesn't actually call this. Should I clarify some comments somewhere to improve the situation?
Yes, just add a comment on the first line of ChooseThenCall noting that Reset is just being installed for future test use rather than called. I didn't realize that.
common/hwy_dynamic_impl.h
line 74 at r8 (raw file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
That's the point of the comment -- this isn't actually atomic. For our targets, this is just a vanilla MOV on both the both load side and the store side. Maybe I should rephrase it somehow?
Doesn't that have to be a MOV to/from un-cached memory? If not, how does a write in one CPU get seen by another CPU? Hard to believe that coordination can be done at zero cost. I suppose the cost can be limited to the store operation if there is hardware support for invalidating the cached loads in other CPUs? Do you have a handy reference link that explains why this is free? If so that would be a good addition to the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r9, 2 of 2 files at r10, all commit messages.
Reviewable status: 3 unresolved discussions, needs at least two assigned reviewers
common/hwy_dynamic_impl.h
line 74 at r8 (raw file):
Previously, sherm1 (Michael Sherman) wrote…
Doesn't that have to be a MOV to/from un-cached memory? If not, how does a write in one CPU get seen by another CPU? Hard to believe that coordination can be done at zero cost. I suppose the cost can be limited to the store operation if there is hardware support for invalidating the cached loads in other CPUs? Do you have a handy reference link that explains why this is free? If so that would be a good addition to the comment.
I updated the code to somewhat better match what we're aiming for.
I suspect what wasn't clear to you was that in order to avoid any xchg
store instructions (with cache locking), we accept the possibility that multiple threads could all be redundantly choosing the function pointer at the same time.
Since this happens at most once per cpu, and choosing is pretty quick anyway, this is a fair trade-off.
common/hwy_dynamic_impl.h
line 73 at r10 (raw file):
// "release" memory order here because we only care about memory consistency // as viewed from the current thread. function_.store(impl, std::memory_order::release);
Working
I forgot that I had planned to write unit tests in common
for this new code.
At the very least, I should add a check that our "choose" code is effective -- it would be a travesty if we accidentally removed this line and all of our "fast functions" ended up taking the slow path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r10, all commit messages.
Reviewable status: 2 unresolved discussions, needs at least two assigned reviewers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+@SeanCurtis-TRI for bonus feature review, and also platform review, please.
Reviewed 4 of 4 files at r11, all commit messages.
Reviewable status: LGTM missing from assignee SeanCurtis-TRI(platform)
a discussion (no related file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
Working
Drake should obey numpy's environment variable command that opts-out of CPU features (towards reproducible science). Not sure if I will do that here, or in a follow-up PR.
OK For now, I just turned off AVX3. I'll work on the configuration in a follow-up. For now I just want to unblock Sean.
-(release notes: fix) +(release notes: none)
common/hwy_dynamic_impl.h
line 73 at r10 (raw file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
Working
I forgot that I had planned to write unit tests in
common
for this new code.At the very least, I should add a check that our "choose" code is effective -- it would be a travesty if we accidentally removed this line and all of our "fast functions" ended up taking the slow path.
Done. Tests added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 unresolved discussion, LGTM missing from assignee SeanCurtis-TRI(platform)
a discussion (no related file):
Working
Final sanity check benchmarking before we merge.
Also FYI @sherm1 -- I added a bunch more code (tests), so please take one more look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can't provide some macros that would help make further instantiations more compact and bullet-proof. But I'll wait to see what it looks like when I apply it to box overlap.
Reviewed 5 of 12 files at r1, 3 of 6 files at r4, 2 of 3 files at r5, 1 of 1 files at r6, 1 of 3 files at r7, 1 of 2 files at r9, 1 of 2 files at r10, 4 of 4 files at r11, all commit messages.
Reviewable status: 7 unresolved discussions
a discussion (no related file):
Would it be worth writing up a checklist for the necessary steps to use hwy? Cargo cult is grand and glorious, but directing programmer's attention might be helpful. There's some arcane stuff taking place here and a bit of hand holding might not go amiss.
common/hwy_dynamic.h
line 9 at r11 (raw file):
with unit tests that want to probe code that uses "hwy/highway.h" for SIMD. */ /* Anywhere in Drake that uses Highway for dynamic CPU dispatch should call this
nit: This phrase suggests that it could be used outside of tests (which contradicts the documented purpose of this function above).
Code quote:
Anywhere in Drake
common/hwy_dynamic.h
line 13 at r11 (raw file):
(Most developers never need to worry about this, because it's automatic when you use the tools in `hwy_dynamic_impl.h`.) This function is safe to call from multiple threads currently. */
nit typo
Suggestion:
concurrently
common/hwy_dynamic.h
line 16 at r11 (raw file):
void HwyDynamicRegisterResetFunction(void (*)()); /* (For testing only) Clears the latched CPU target detection by calling all of
nit: Given the comment at the top that the functions "purpose is to assist with unit tests", it seems labeling one of these functions as "For testing only" and not the other is either redundant or inconsistent.
Code quote:
(For testing only)
common/hwy_dynamic_impl.h
line 81 at r11 (raw file):
testing multiple different targets from the same test program. This function is NOT thread-safe; only use this in single-threaded tests. */ __attribute__((cold)) static void Reset() {
BTW If I read this right, everytime we call Reset()
and then Call()
we add a pointer to LateBoundFunction::Reset
to our global resets. So, if we call HwyDynamicReset()
we'll repeatedly call Reset()
. Given that Reset()
is only for tests, the expectation is that will only happen in tests, so not a problem?
math/fast_pose_composition_functions_avx2_fma.cc
line 33 at r11 (raw file):
// or larger. When we have smaller registers (e.g., SSE2's 2-wide lanes, or // SVE's variable-length vectors) we will fall back to non-SIMD code. #if HWY_MAX_BYTES >= 32 && HWY_HAVE_SCALABLE == 0
BTW We might want to alias this and hoist it into hwy_dynamic_impl.h
so we don't have this boilerplate everywhere we use hwy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 4 files at r11, all commit messages.
Reviewable status: 8 unresolved discussions
common/test/hwy_dynamic_test_array_mul.cc
line 96 at r11 (raw file):
std::map<std::string, int>& GetMutableArrayMulCounters() { static drake::never_destroyed<std::map<std::string, int>> singleton;
BTW you could use drake::string_map
here. Of course there is no performance reason to do so in this context but it might help encourage best practices by example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I didn't want to try adding sugar until we have more examples. Once we have a nice set of users of this code, we can refactor to the then-apparent needs for what is consistent versus what varies each time.
Reviewed 6 of 6 files at r12, all commit messages.
Reviewable status: 6 unresolved discussions
a discussion (no related file):
Previously, SeanCurtis-TRI (Sean Curtis) wrote…
Would it be worth writing up a checklist for the necessary steps to use hwy? Cargo cult is grand and glorious, but directing programmer's attention might be helpful. There's some arcane stuff taking place here and a bit of hand holding might not go amiss.
To my mind, the clearest exposition of such a checklist is the sample code we have in our tree, i.e., cargo culting. A separate list seems like it would be literal a copy-paste of one of these files (just without function bodies). If we find acute traps that developers are hitting, I think we should aim to solve that with better abstractions, rather than more documentation.
a discussion (no related file):
Working
I just remembered that the Highway docs recommend to use -O2
on Highway code, even in Debug builds. The abstractions that give things nice names lead to a bunch of trivial function calls, which are easily inlined into nothingness, but -O0 defeats that. I think I'll need to add one more build flag to align with that advice.
common/hwy_dynamic.h
line 9 at r11 (raw file):
Previously, SeanCurtis-TRI (Sean Curtis) wrote…
nit: This phrase suggests that it could be used outside of tests (which contradicts the documented purpose of this function above).
The first function is supposed to be called from outside of tests. I tried to tune up the overview to present a clearer story.
common/hwy_dynamic.h
line 16 at r11 (raw file):
Previously, SeanCurtis-TRI (Sean Curtis) wrote…
nit: Given the comment at the top that the functions "purpose is to assist with unit tests", it seems labeling one of these functions as "For testing only" and not the other is either redundant or inconsistent.
The first function is supposed to be called from outside of tests. I tried to tune up the overview to present a clearer story.
common/hwy_dynamic_impl.h
line 81 at r11 (raw file):
Previously, SeanCurtis-TRI (Sean Curtis) wrote…
BTW If I read this right, everytime we call
Reset()
and thenCall()
we add a pointer toLateBoundFunction::Reset
to our global resets. So, if we callHwyDynamicReset()
we'll repeatedly callReset()
. Given thatReset()
is only for tests, the expectation is that will only happen in tests, so not a problem?
Good point. Not a perf problem, but still confusing. Fixed to ignore duplicates.
math/fast_pose_composition_functions_avx2_fma.cc
line 33 at r11 (raw file):
Previously, SeanCurtis-TRI (Sean Curtis) wrote…
BTW We might want to alias this and hoist it into
hwy_dynamic_impl.h
so we don't have this boilerplate everywhere we use hwy?
It wasn't clear to me that every call site would have the same situation as we do here.
In the pose composition code, we are computing extra flops in the SIMD spelling because we know they will net out as a positive, but only under certain conditions (the ones in this ifdef). I tried to communicate that in the comment right above the ifdef, but let me know if we can do it better.
A different use of SIMD -- e.g. an AutoDiff linear combination of partials -- would be improved by really any target and vector sized that highway supports. Similar in spirit to the ArrayMul example in the unit test -- no data transposition, just an elementwise pile of flops. All of the highway targets can do that well, even SSE2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 5 unresolved discussions
a discussion (no related file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
Working
I just remembered that the Highway docs recommend to use
-O2
on Highway code, even in Debug builds. The abstractions that give things nice names lead to a bunch of trivial function calls, which are easily inlined into nothingness, but -O0 defeats that. I think I'll need to add one more build flag to align with that advice.
Done. I turned it on for all of the pose composition stuff. There are ways to take a narrower swipe, but their spelling ends up being complicated and this seems like an OK compromise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 6 files at r12, 1 of 1 files at r13, all commit messages.
Reviewable status: 1 unresolved discussion
a discussion (no related file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
To my mind, the clearest exposition of such a checklist is the sample code we have in our tree, i.e., cargo culting. A separate list seems like it would be literal a copy-paste of one of these files (just without function bodies). If we find acute traps that developers are hitting, I think we should aim to solve that with better abstractions, rather than more documentation.
That's fine.
math/fast_pose_composition_functions_avx2_fma.cc
line 33 at r11 (raw file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
It wasn't clear to me that every call site would have the same situation as we do here.
In the pose composition code, we are computing extra flops in the SIMD spelling because we know they will net out as a positive, but only under certain conditions (the ones in this ifdef). I tried to communicate that in the comment right above the ifdef, but let me know if we can do it better.
A different use of SIMD -- e.g. an AutoDiff linear combination of partials -- would be improved by really any target and vector sized that highway supports. Similar in spirit to the ArrayMul example in the unit test -- no data transposition, just an elementwise pile of flops. All of the highway targets can do that well, even SSE2.
That's fine. I'm pretty sure boxes_overlap is going to go the same path as this, where there's a bespoke fallback implementation. So, I had two samples that seemed to require the same pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r13, all commit messages.
Reviewable status:complete! all discussions resolved, LGTM from assignees SeanCurtis-TRI(platform),sherm1(platform)
a discussion (no related file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
Working
Final sanity check benchmarking before we merge.
Done. Confirmed that the CassieDouble
benchmark suite has no change from this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 6 of 6 files at r12, 1 of 1 files at r13, all commit messages.
Reviewable status:complete! all discussions resolved, LGTM from assignees SeanCurtis-TRI(platform),sherm1(platform)
Towards #21526.
As a sample of what's new, here's the disassembly of
ComposeXXImpl
compiled for "AVX3" using GCC on Jammy:To see everything, after building this you can peek at
drake/bazel-bin/math/_objs/_geometric_transform_compiled_cc_impl/fast_pose_composition_functions_avx2_fma.pic.o
.This change is![Reviewable](https://camo.githubusercontent.com/1541c4039185914e83657d3683ec25920c672c6c5c7ab4240ee7bff601adec0b/68747470733a2f2f72657669657761626c652e696f2f7265766965775f627574746f6e2e737667)