Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[math] Use dynamic dispatch for highway SIMD #21626

Merged
merged 1 commit into from
Jul 2, 2024

Conversation

jwnimmer-tri
Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri commented Jun 26, 2024

Towards #21526.


As a sample of what's new, here's the disassembly of ComposeXXImpl compiled for "AVX3" using GCC on Jammy:

endbr64 
mov    eax,0x7
vmovupd ymm3,YMMWORD PTR [rdi]
vmovupd ymm2,YMMWORD PTR [rdi+0x18]
kmovb  k1,eax
vmovupd ymm0{k1}{z},YMMWORD PTR [rdi+0x48]
vmovupd ymm1,YMMWORD PTR [rdi+0x30]
vmulpd ymm5,ymm3,QWORD PTR [rsi]{1to4}
vmulpd ymm4,ymm3,QWORD PTR [rsi+0x18]{1to4}
vfmadd231pd ymm0,ymm3,QWORD PTR [rsi+0x48]{1to4}
vmulpd ymm3,ymm3,QWORD PTR [rsi+0x30]{1to4}
vfmadd231pd ymm5,ymm2,QWORD PTR [rsi+0x8]{1to4}
vfmadd231pd ymm0,ymm2,QWORD PTR [rsi+0x50]{1to4}
vfmadd231pd ymm4,ymm2,QWORD PTR [rsi+0x20]{1to4}
vfmadd132pd ymm2,ymm3,QWORD PTR [rsi+0x38]{1to4}
vfmadd231pd ymm5,ymm1,QWORD PTR [rsi+0x10]{1to4}
vfmadd231pd ymm0,ymm1,QWORD PTR [rsi+0x58]{1to4}
vfmadd231pd ymm4,ymm1,QWORD PTR [rsi+0x28]{1to4}
vfmadd132pd ymm1,ymm2,QWORD PTR [rsi+0x40]{1to4}
vmovupd YMMWORD PTR [rdx],ymm5
vmovupd YMMWORD PTR [rdx+0x18],ymm4
vmovupd YMMWORD PTR [rdx+0x30],ymm1
vmovupd YMMWORD PTR [rdx+0x48]{k1},ymm0
vzeroupper 
ret    

To see everything, after building this you can peek at drake/bazel-bin/math/_objs/_geometric_transform_compiled_cc_impl/fast_pose_composition_functions_avx2_fma.pic.o.


This change is Reviewable

@jwnimmer-tri jwnimmer-tri added status: do not merge status: do not review release notes: fix This pull request contains fixes (no new features) labels Jun 26, 2024
@jwnimmer-tri jwnimmer-tri force-pushed the highway2 branch 2 times, most recently from 660620f to 8bbbba9 Compare June 26, 2024 01:27
Copy link
Collaborator Author

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 10 of 12 files at r1, 2 of 3 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: needs platform reviewer assigned, needs at least two assigned reviewers, labeled "do not merge"


math/fast_pose_composition_functions_avx2_fma.cc line 614 at r1 (raw file):

#else  // HWY_MAX_BYTES

/* The portable versions are always defined. They should be written to maximize

FYI For reviewers -- the content of this section is copied from the (now-deleted) file named fast_pose_composition_functions_wrapper.cc.

Copy link
Collaborator Author

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 12 files at r1, 6 of 6 files at r4, all commit messages.
Reviewable status: needs platform reviewer assigned, needs at least two assigned reviewers, labeled "do not merge" (waiting on @jwnimmer-tri)

Copy link
Collaborator Author

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: needs platform reviewer assigned, needs at least two assigned reviewers, labeled "do not merge" (waiting on @jwnimmer-tri)


common/hwy_dynamic_impl.h line 35 at r4 (raw file):

be wrapped. */
template <typename ChooseFunctor>
class LateBoundFunction {

FYI We don't strictly need the LateBoundFunction helper. The built-in dispatcher in highway is only like 5 instructions or so. This just brings it down to 1 instruction, like we already had.

If we like this idea, my goal would be to submit this code upstream so it becomes part of highway itself.

@jwnimmer-tri jwnimmer-tri marked this pull request as ready for review June 26, 2024 04:18
Copy link
Collaborator Author

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, +@sherm1 this is probably ready to start feature review.

Reviewable status: 1 unresolved discussion, LGTM missing from assignee sherm1(platform), needs at least two assigned reviewers, labeled "do not merge" (waiting on @jwnimmer-tri)

a discussion (no related file):
Working

Drake should obey numpy's environment variable command that opts-out of CPU features (towards reproducible science). Not sure if I will do that here, or in a follow-up PR.


Copy link
Collaborator Author

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: 1 unresolved discussion, LGTM missing from assignee sherm1(platform), needs at least two assigned reviewers

Copy link
Collaborator Author

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 unresolved discussion, LGTM missing from assignee sherm1(platform), needs at least two assigned reviewers


math/fast_pose_composition_functions_avx2_fma.cc line 718 at r6 (raw file):

// Create the lookup tables for the per-CPU hwy implementation functions, and
// required functors that select from the lookup tables.

FYI I also considered making some sugar macros (or something) for this boilerplate, but I didn't want to go overboard. Happy to take input on what would help. My currently thought is we let Sean's code come into play as a second data point, and then try to clean it up once we have that additional experience.

Copy link
Member

@sherm1 sherm1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow that assembly code is beautiful!

Feature review checkpoint.

Reviewed 5 of 12 files at r1, 4 of 6 files at r4, 2 of 3 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: 4 unresolved discussions, LGTM missing from assignee sherm1(platform), needs at least two assigned reviewers


common/BUILD.bazel line 358 at r6 (raw file):

        # Highway automatically compiles our functions for multiple CPU
        # targets, so that the code is tuned as best as possible for that
        # target. However, some of that specialization is unncessary for our

typo unncessary


common/hwy_dynamic.h line 10 at r6 (raw file):

/* Anywhere in Drake that uses Highway for dynamic CPU dispatch should call this
helper function to register a reset handler to clear the latched CPU detction.

typo detction


common/hwy_dynamic_impl.h line 18 at r6 (raw file):

In a given process, the first time our Call() function is called, it will
latch-initialize the wrapped function pointer as follows:
- call the ChooseFunctor to select a which function pointer to use,

typo "a which" -> which


math/fast_pose_composition_functions_avx2_fma.cc line 718 at r6 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

FYI I also considered making some sugar macros (or something) for this boilerplate, but I didn't want to go overboard. Happy to take input on what would help. My currently thought is we let Sean's code come into play as a second data point, and then try to clean it up once we have that additional experience.

Fine as is IMO

Copy link
Collaborator Author

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r7, all commit messages.
Reviewable status: 1 unresolved discussion, LGTM missing from assignee sherm1(platform), needs at least two assigned reviewers

Copy link
Collaborator Author

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 unresolved discussions, LGTM missing from assignee sherm1(platform), needs at least two assigned reviewers


common/BUILD.bazel line 361 at r7 (raw file):

        # use of SIMD in Drake, so we opt-out of targets that do not offer any
        # advantages vs their build time / code size.
        "HWY_DISABLED_TARGETS=(" + "|".join([

Working

I'm going to swap this from a deny-list to an allow-list. I think having a known list of targets will be easier to maintain and ensure we have test coverage for.

Copy link
Collaborator Author

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r8, all commit messages.
Reviewable status: 1 unresolved discussion, LGTM missing from assignee sherm1(platform), needs at least two assigned reviewers


common/BUILD.bazel line 361 at r7 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Working

I'm going to swap this from a deny-list to an allow-list. I think having a known list of targets will be easier to maintain and ensure we have test coverage for.

Done

Copy link
Member

@sherm1 sherm1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feature :lgtm: with a few minor comments and questions

Reviewed 1 of 12 files at r1, 1 of 6 files at r4, 2 of 3 files at r7, 1 of 1 files at r8, all commit messages.
Reviewable status: 5 unresolved discussions, needs at least two assigned reviewers


common/BUILD.bazel line 366 at r8 (raw file):

        #
        # Highway only offers a deny-list for targets, so we'll need to deny
        # everything and then un-deny what we want to allow.

BTW consider noting that each of the architectures is represented by a single bit and that the deny list is the "or" of the losers. I had to stare at this awhile to figure out what was happening here.

(But I do like the "allow" list better)


common/hwy_dynamic_impl.h line 59 at r8 (raw file):

  }

  /* (For testing only) Clears the latched target detection. This allows for

nit: this says "for testing only" but it is used just above in ChooseThenCall.


common/hwy_dynamic_impl.h line 74 at r8 (raw file):

  // All operations on this pointer must use memory_order::relaxed, which is
  // zero-cost on the platforms we care about. Note that the default value for

BTW is "zero cost" an overstatement? There is still the cost of an atomic access even if the compiler can still do reordering. Can those ever be zero cost?


math/test/fast_pose_composition_functions_test.cc line 171 at r8 (raw file):

  // Check the answer.
  EXPECT_TRUE(CompareMatrices(arg3->GetAsMatrix4(), expected));

BTW this test code is very nice!

Copy link
Collaborator Author

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: 4 unresolved discussions, needs at least two assigned reviewers


common/BUILD.bazel line 366 at r8 (raw file):

Previously, sherm1 (Michael Sherman) wrote…

BTW consider noting that each of the architectures is represented by a single bit and that the deny list is the "or" of the losers. I had to stare at this awhile to figure out what was happening here.

(But I do like the "allow" list better)

Excellent point. In writing the comment, I realized I could slightly simplify the code, too.


common/hwy_dynamic_impl.h line 59 at r8 (raw file):

Previously, sherm1 (Michael Sherman) wrote…

nit: this says "for testing only" but it is used just above in ChooseThenCall.

Hmm. The use below only takes the address of this function (so that tests can indirectly call it) -- ChooseThenCall doesn't actually call this. Should I clarify some comments somewhere to improve the situation?


common/hwy_dynamic_impl.h line 74 at r8 (raw file):

Previously, sherm1 (Michael Sherman) wrote…

BTW is "zero cost" an overstatement? There is still the cost of an atomic access even if the compiler can still do reordering. Can those ever be zero cost?

That's the point of the comment -- this isn't actually atomic. For our targets, this is just a vanilla MOV on both the both load side and the store side. Maybe I should rephrase it somehow?

Copy link
Member

@sherm1 sherm1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r9, all commit messages.
Reviewable status: 3 unresolved discussions, needs at least two assigned reviewers


common/BUILD.bazel line 366 at r8 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Excellent point. In writing the comment, I realized I could slightly simplify the code, too.

Nice!


common/hwy_dynamic_impl.h line 59 at r8 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Hmm. The use below only takes the address of this function (so that tests can indirectly call it) -- ChooseThenCall doesn't actually call this. Should I clarify some comments somewhere to improve the situation?

Yes, just add a comment on the first line of ChooseThenCall noting that Reset is just being installed for future test use rather than called. I didn't realize that.


common/hwy_dynamic_impl.h line 74 at r8 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

That's the point of the comment -- this isn't actually atomic. For our targets, this is just a vanilla MOV on both the both load side and the store side. Maybe I should rephrase it somehow?

Doesn't that have to be a MOV to/from un-cached memory? If not, how does a write in one CPU get seen by another CPU? Hard to believe that coordination can be done at zero cost. I suppose the cost can be limited to the store operation if there is hardware support for invalidating the cached loads in other CPUs? Do you have a handy reference link that explains why this is free? If so that would be a good addition to the comment.

Copy link
Collaborator Author

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r9, 2 of 2 files at r10, all commit messages.
Reviewable status: 3 unresolved discussions, needs at least two assigned reviewers


common/hwy_dynamic_impl.h line 74 at r8 (raw file):

Previously, sherm1 (Michael Sherman) wrote…

Doesn't that have to be a MOV to/from un-cached memory? If not, how does a write in one CPU get seen by another CPU? Hard to believe that coordination can be done at zero cost. I suppose the cost can be limited to the store operation if there is hardware support for invalidating the cached loads in other CPUs? Do you have a handy reference link that explains why this is free? If so that would be a good addition to the comment.

I updated the code to somewhat better match what we're aiming for.

I suspect what wasn't clear to you was that in order to avoid any xchg store instructions (with cache locking), we accept the possibility that multiple threads could all be redundantly choosing the function pointer at the same time.

Since this happens at most once per cpu, and choosing is pretty quick anyway, this is a fair trade-off.


common/hwy_dynamic_impl.h line 73 at r10 (raw file):

    // "release" memory order here because we only care about memory consistency
    // as viewed from the current thread.
    function_.store(impl, std::memory_order::release);

Working

I forgot that I had planned to write unit tests in common for this new code.

At the very least, I should add a check that our "choose" code is effective -- it would be a travesty if we accidentally removed this line and all of our "fast functions" ended up taking the slow path.

Copy link
Member

@sherm1 sherm1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r10, all commit messages.
Reviewable status: 2 unresolved discussions, needs at least two assigned reviewers

@jwnimmer-tri jwnimmer-tri added release notes: none This pull request should not be mentioned in the release notes and removed release notes: fix This pull request contains fixes (no new features) labels Jul 1, 2024
Copy link
Collaborator Author

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+@SeanCurtis-TRI for bonus feature review, and also platform review, please.

Reviewed 4 of 4 files at r11, all commit messages.
Reviewable status: LGTM missing from assignee SeanCurtis-TRI(platform)

a discussion (no related file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Working

Drake should obey numpy's environment variable command that opts-out of CPU features (towards reproducible science). Not sure if I will do that here, or in a follow-up PR.

OK For now, I just turned off AVX3. I'll work on the configuration in a follow-up. For now I just want to unblock Sean.

-(release notes: fix) +(release notes: none)



common/hwy_dynamic_impl.h line 73 at r10 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Working

I forgot that I had planned to write unit tests in common for this new code.

At the very least, I should add a check that our "choose" code is effective -- it would be a travesty if we accidentally removed this line and all of our "fast functions" ended up taking the slow path.

Done. Tests added.

Copy link
Collaborator Author

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 unresolved discussion, LGTM missing from assignee SeanCurtis-TRI(platform)

a discussion (no related file):
Working

Final sanity check benchmarking before we merge.


@jwnimmer-tri
Copy link
Collaborator Author

Also FYI @sherm1 -- I added a bunch more code (tests), so please take one more look.

Copy link
Contributor

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:LGTM: With some editorial comments.

I wonder if we can't provide some macros that would help make further instantiations more compact and bullet-proof. But I'll wait to see what it looks like when I apply it to box overlap.

Reviewed 5 of 12 files at r1, 3 of 6 files at r4, 2 of 3 files at r5, 1 of 1 files at r6, 1 of 3 files at r7, 1 of 2 files at r9, 1 of 2 files at r10, 4 of 4 files at r11, all commit messages.
Reviewable status: 7 unresolved discussions

a discussion (no related file):
Would it be worth writing up a checklist for the necessary steps to use hwy? Cargo cult is grand and glorious, but directing programmer's attention might be helpful. There's some arcane stuff taking place here and a bit of hand holding might not go amiss.



common/hwy_dynamic.h line 9 at r11 (raw file):

with unit tests that want to probe code that uses "hwy/highway.h" for SIMD. */

/* Anywhere in Drake that uses Highway for dynamic CPU dispatch should call this

nit: This phrase suggests that it could be used outside of tests (which contradicts the documented purpose of this function above).

Code quote:

Anywhere in Drake

common/hwy_dynamic.h line 13 at r11 (raw file):

(Most developers never need to worry about this, because it's automatic when you
use the tools in `hwy_dynamic_impl.h`.) This function is safe to call from
multiple threads currently. */

nit typo

Suggestion:

concurrently

common/hwy_dynamic.h line 16 at r11 (raw file):

void HwyDynamicRegisterResetFunction(void (*)());

/* (For testing only) Clears the latched CPU target detection by calling all of

nit: Given the comment at the top that the functions "purpose is to assist with unit tests", it seems labeling one of these functions as "For testing only" and not the other is either redundant or inconsistent.

Code quote:

(For testing only)

common/hwy_dynamic_impl.h line 81 at r11 (raw file):

  testing multiple different targets from the same test program. This function
  is NOT thread-safe; only use this in single-threaded tests. */
  __attribute__((cold)) static void Reset() {

BTW If I read this right, everytime we call Reset() and then Call() we add a pointer to LateBoundFunction::Reset to our global resets. So, if we call HwyDynamicReset()we'll repeatedly call Reset(). Given that Reset() is only for tests, the expectation is that will only happen in tests, so not a problem?


math/fast_pose_composition_functions_avx2_fma.cc line 33 at r11 (raw file):

// or larger. When we have smaller registers (e.g., SSE2's 2-wide lanes, or
// SVE's variable-length vectors) we will fall back to non-SIMD code.
#if HWY_MAX_BYTES >= 32 && HWY_HAVE_SCALABLE == 0

BTW We might want to alias this and hoist it into hwy_dynamic_impl.h so we don't have this boilerplate everywhere we use hwy?

Copy link
Member

@sherm1 sherm1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r11, all commit messages.
Reviewable status: 8 unresolved discussions


common/test/hwy_dynamic_test_array_mul.cc line 96 at r11 (raw file):

std::map<std::string, int>& GetMutableArrayMulCounters() {
  static drake::never_destroyed<std::map<std::string, int>> singleton;

BTW you could use drake::string_map here. Of course there is no performance reason to do so in this context but it might help encourage best practices by example.

Copy link
Collaborator Author

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I didn't want to try adding sugar until we have more examples. Once we have a nice set of users of this code, we can refactor to the then-apparent needs for what is consistent versus what varies each time.

Reviewed 6 of 6 files at r12, all commit messages.
Reviewable status: 6 unresolved discussions

a discussion (no related file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

Would it be worth writing up a checklist for the necessary steps to use hwy? Cargo cult is grand and glorious, but directing programmer's attention might be helpful. There's some arcane stuff taking place here and a bit of hand holding might not go amiss.

To my mind, the clearest exposition of such a checklist is the sample code we have in our tree, i.e., cargo culting. A separate list seems like it would be literal a copy-paste of one of these files (just without function bodies). If we find acute traps that developers are hitting, I think we should aim to solve that with better abstractions, rather than more documentation.


a discussion (no related file):
Working

I just remembered that the Highway docs recommend to use -O2 on Highway code, even in Debug builds. The abstractions that give things nice names lead to a bunch of trivial function calls, which are easily inlined into nothingness, but -O0 defeats that. I think I'll need to add one more build flag to align with that advice.



common/hwy_dynamic.h line 9 at r11 (raw file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

nit: This phrase suggests that it could be used outside of tests (which contradicts the documented purpose of this function above).

The first function is supposed to be called from outside of tests. I tried to tune up the overview to present a clearer story.


common/hwy_dynamic.h line 16 at r11 (raw file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

nit: Given the comment at the top that the functions "purpose is to assist with unit tests", it seems labeling one of these functions as "For testing only" and not the other is either redundant or inconsistent.

The first function is supposed to be called from outside of tests. I tried to tune up the overview to present a clearer story.


common/hwy_dynamic_impl.h line 81 at r11 (raw file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

BTW If I read this right, everytime we call Reset() and then Call() we add a pointer to LateBoundFunction::Reset to our global resets. So, if we call HwyDynamicReset()we'll repeatedly call Reset(). Given that Reset() is only for tests, the expectation is that will only happen in tests, so not a problem?

Good point. Not a perf problem, but still confusing. Fixed to ignore duplicates.


math/fast_pose_composition_functions_avx2_fma.cc line 33 at r11 (raw file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

BTW We might want to alias this and hoist it into hwy_dynamic_impl.h so we don't have this boilerplate everywhere we use hwy?

It wasn't clear to me that every call site would have the same situation as we do here.

In the pose composition code, we are computing extra flops in the SIMD spelling because we know they will net out as a positive, but only under certain conditions (the ones in this ifdef). I tried to communicate that in the comment right above the ifdef, but let me know if we can do it better.

A different use of SIMD -- e.g. an AutoDiff linear combination of partials -- would be improved by really any target and vector sized that highway supports. Similar in spirit to the ArrayMul example in the unit test -- no data transposition, just an elementwise pile of flops. All of the highway targets can do that well, even SSE2.

Copy link
Collaborator Author

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 unresolved discussions

a discussion (no related file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Working

I just remembered that the Highway docs recommend to use -O2 on Highway code, even in Debug builds. The abstractions that give things nice names lead to a bunch of trivial function calls, which are easily inlined into nothingness, but -O0 defeats that. I think I'll need to add one more build flag to align with that advice.

Done. I turned it on for all of the pose composition stuff. There are ways to take a narrower swipe, but their spelling ends up being complicated and this seems like an OK compromise.


Copy link
Contributor

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r12, 1 of 1 files at r13, all commit messages.
Reviewable status: 1 unresolved discussion

a discussion (no related file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

To my mind, the clearest exposition of such a checklist is the sample code we have in our tree, i.e., cargo culting. A separate list seems like it would be literal a copy-paste of one of these files (just without function bodies). If we find acute traps that developers are hitting, I think we should aim to solve that with better abstractions, rather than more documentation.

That's fine.



math/fast_pose_composition_functions_avx2_fma.cc line 33 at r11 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

It wasn't clear to me that every call site would have the same situation as we do here.

In the pose composition code, we are computing extra flops in the SIMD spelling because we know they will net out as a positive, but only under certain conditions (the ones in this ifdef). I tried to communicate that in the comment right above the ifdef, but let me know if we can do it better.

A different use of SIMD -- e.g. an AutoDiff linear combination of partials -- would be improved by really any target and vector sized that highway supports. Similar in spirit to the ArrayMul example in the unit test -- no data transposition, just an elementwise pile of flops. All of the highway targets can do that well, even SSE2.

That's fine. I'm pretty sure boxes_overlap is going to go the same path as this, where there's a bespoke fallback implementation. So, I had two samples that seemed to require the same pattern.

Copy link
Collaborator Author

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r13, all commit messages.
Reviewable status: :shipit: complete! all discussions resolved, LGTM from assignees SeanCurtis-TRI(platform),sherm1(platform)

a discussion (no related file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Working

Final sanity check benchmarking before we merge.

Done. Confirmed that the CassieDouble benchmark suite has no change from this PR.


Copy link
Member

@sherm1 sherm1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r12, 1 of 1 files at r13, all commit messages.
Reviewable status: :shipit: complete! all discussions resolved, LGTM from assignees SeanCurtis-TRI(platform),sherm1(platform)

@jwnimmer-tri jwnimmer-tri merged commit 74ad572 into RobotLocomotion:master Jul 2, 2024
8 of 9 checks passed
@jwnimmer-tri jwnimmer-tri deleted the highway2 branch July 2, 2024 22:54
RussTedrake pushed a commit to RussTedrake/drake that referenced this pull request Dec 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: low release notes: none This pull request should not be mentioned in the release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants