-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add LPP provider #10303
Add LPP provider #10303
Conversation
Should you rename your provider to reflect the NIC vendor? I think many providers like EFA utilize PCIe to realize Peer direct features. |
We have been using this name for a long time, and it would require substantial effort to change it at this point. |
Where can I see the results from AWS CI or Jenkins? |
@tstruk For AWS CI:
There may be more fatal errors (output is large), so I'll comment what I see |
192c9e1
to
551873d
Compare
Thank you @darrylabbate I have pushed a new version. Let's wait and see if it will work any better. |
prov/lpp/src/lpp_hmem.c
Outdated
// Might wan to unset later as this can affect performance. | ||
CUresult cures; | ||
int flag = 1; | ||
cures = ofi_cuPointerSetAttribute(&flag, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a FYI that we have problem with this sync memops set when running with NCCL >= 2.19. Basically it's not safe to set CU_POINTER_ATTRIBUTE_SYNC_MEMOPS
inside libfabric for NCCL application
I guess you set this attribute to enforce a synchronous cuda memcpy inside libfabric, while cuda memcpy is not allowed inside Libfabric either for NCCL application - #6124
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our device uses GPUDirectRDMA, which avoid the need for cudaMemcpy. We're setting the SYNC_MEMOPS attr out of an abundance of caution, as outlined in GpuDirectRDMA docs (https://docs.nvidia.com/cuda/gpudirect-rdma/#synchronization-and-memory-ordering).
Poking through the code, the EFA rdm calls efa_rdm_attempt_to_sycm_memops
before a lot of transfers. How are you guys handing this issue then? Is there some nuance to when 'efa_mr->needs_sync` is set?
As for NCCL >=2.19, is this related to the changes with device memory allocation? I ran into a problem testing out newer NCCL version, but was able to mitigate it with the NCCL_CUMEM_ENABLE=0
env var.
Also, now that I'm noticing it, this could probably be reworked to use cuda_set_sync_memops
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for NCCL >=2.19, is this related to the changes with device memory allocation? I ran into a problem testing out newer NCCL version, but was able to mitigate it with the NCCL_CUMEM_ENABLE=0 env var.
Yes, the problem can be mitigated by that env exactly. We have a boolean as FI_OPT_CUDA_API_PERMITTED
to toggle whether CUDA API can be permitted to call inside Libfabric in data transfer. NCCL plugin set this boolean to false to disable cuda api usage in Libfabric.
https://ofiwg.github.io/libfabric/main/man/fi_endpoint.3.html
FI_OPT_CUDA_API_PERMITTED - bool
This option only applies to the fi_setopt call. It is used to control endpoint’s behavior in making calls to CUDA API. By default, an endpoint is permitted to call CUDA API. If user wish to prohibit an endpoint from making such calls, user can achieve that by set this option to false. If an endpoint’s support of CUDA memory relies on making calls to CUDA API, it will return -FI_EOPNOTSUPP for the call to fi_setopt. If either CUDA library or CUDA device is not available, endpoint will return -FI_EINVAL. All providers that support FI_HMEM capability implement this option.
The problem is that boolean is set in the EP level via fi_setopt. while the MR reg can happen before that, so we cannot reuse that boolean so we can only do the toggle in the transmission call.
Throwing the Intel CI failures here as well - two different builds failed and had two different errors:
|
|
include/ofi_prov.h
Outdated
@@ -304,6 +304,21 @@ UCX_INI ; | |||
# define UCX_INIT NULL | |||
#endif | |||
|
|||
#if defined(_WIN32) && (HAVE_EFA) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is LPP somewhat related to EFA, or is it a typo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a typo.
prov/lpp/include/lpp_cntr.h
Outdated
ofi_atomic64_t last_errcounter; | ||
struct dlist_entry ep_list; | ||
ofi_mutex_t lock; | ||
} Lpp_cntr_t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use struct directly instead of using typedef. Same for many other instances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
prov/lpp/include/lpp.h
Outdated
// | ||
// Per fi_getinfo(3): | ||
// | ||
// Capabilities may be grouped into two general categories: primary and | ||
// secondary. Primary capabilities must explicitly be requested by an | ||
// application, and a provider must enable support for only those primary | ||
// capabilities which were selected. Secondary capabilities may optionally be | ||
// requested by an application. If requested, a provider must support the | ||
// capability or fail the fi_getinfo request (FI_ENODATA). A provider may | ||
// optionally report non-selected secondary capabilities if doing so would not | ||
// compromise performance or security. | ||
// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use /*
*/
for multiline comments. Same for many other places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
prov/lpp/include/lpp_tagged.h
Outdated
#define _LPP_TAGGED_H_ | ||
|
||
ssize_t lpp_fi_trecv(struct fid_ep *ep, void *buf, size_t len, void *desc, | ||
fi_addr_t src_addr, uint64_t tag, uint64_t ignore, void *context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Align the second line with the 'struct' in the first line. Same for other places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
if (posix_memalign(&buf, KLPP_MQ_CELL_SIZE, size) != 0) { | ||
return NULL; | ||
} else { | ||
return buf; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't use '{
and }
when the body has a single line. Same for other places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
if (klpp_getdevice(klpp_fd, &klpp_devinfo) < 0) { | ||
FI_WARN(&lpp_prov, FI_LOG_FABRIC, "failed to query KLPP device %ld\n", i); | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should klpp_fd be closed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Added a close() before continue.
prov/lpp/src/lpp_nosys.c
Outdated
// | ||
// Common | ||
// | ||
int lpp_fi_no_bind(struct fid *fid, struct fid *bfid, uint64_t flags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can use the functions defined in src/enosys.c
instead of defining here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Resolved. Thanks! |
Seeing the same error as before in AWS CI:
|
421fa4e
to
8ae6134
Compare
@darrylabbate could you please check what AWS CI is not happy about? |
It seems the latest push succeeded in compilation |
It takes hours for the AWS CI to finish. I can see that it's still in progress. I don't have visibility into what stage it is at. |
For security reasons, this is unfortunately a limitation of our current CI. We'll do our best to communicate CI failures that aren't caught by Actions, etc. |
Eventually it failed again. Could you please check what was wrong? |
It's still showing "in progress" on our end (and in GitHub). Where do you see it's failing? |
Because I pushed the latest updates, but it failed before that. So let's wait and see. |
Looks like it failed to build on ARM-based EC2 instance types. I'll need to dig further to provide a root cause. |
Error from previous build:
|
Thank you for digging into this. I guess I need to limit the build to x86 only for now. |
So what is the latest build failure in AWS? |
What is the actual protocol, though? That should still be added. |
Added a new FI_PROTO_LPP enum. |
754f1e5
to
184eb12
Compare
Add FI_PROTO_LPP enum for the lpp provider. Signed-off-by: Tadeusz Struk <[email protected]>
Libfabric PCIe Provider (LPP) is a provider, which runs on top of Gigaio FabreX(™) fabric. Co-authored-by: Abhishek Goyanka <[email protected]> Co-authored-by: Benjamin Kitor <[email protected]> Co-authored-by: David Dai <[email protected]> Co-authored-by: Eric Badger <[email protected]> Co-authored-by: Eric Pilmore <[email protected]> Co-authored-by: John Ihnotic <[email protected]> Co-authored-by: Thayne Harbaugh <[email protected]> Signed-off-by: Tadeusz Struk <[email protected]>
From AWS CI:
|
Thank you for checking. I've added a new AM_CONDITIONAL for HMEM. Let's see if it will help. |
What does the |
@tstruk The summary stage summarizes the pass/fail/skip counts of all the previous stages. It fails if the total fail count is not zero. Due to the way some of the tests are constructed, a stage may return "success" status even if some of the tests in that stage have failed. Those fails are always captured in the summary stage. For this specific CI run, the failures are caused by a previous infrastructure issue. Just restarted the CI run, will see how it goes. |
@j-xiong could you please share the results of the two intel-ofi checks that failed? |
Thank you for info. |
@darrylabbate what's the latest status of AWS testing for this PR? |
|
Add LPP specific fabtests. Co-authored-by: Abhishek Goyanka <[email protected]> Co-authored-by: Benjamin Kitor <[email protected]> Co-authored-by: David Dai <[email protected]> Co-authored-by: Eric Badger <[email protected]> Co-authored-by: Eric Pilmore <[email protected]> Co-authored-by: John Ihnotic <[email protected]> Co-authored-by: Thayne Harbaugh <[email protected]> Signed-off-by: Tadeusz Struk <[email protected]>
Signed-off-by: Tadeusz Struk <[email protected]>
@darrylabbate could you check the results now please. |
@tstruk this time is ok, there are some efa test failure (we are fixing that) which is irrelevant to this PR |
@shefty so based on the latest comments from @j-xiong and @shijin-aws this PR is ready to go. |
thanks! |
I should have put a do not merge label. I didn't because there is a work in progress label there. |
Sorry about that. I checked for a do not merge label, but didn't see one, and I added the WIP one. Hopefully that doesn't mess up the release. If so, it can be reverted, then re-applied. |
@shefty No worry. I have updated the v2.0.0alpha PR with the new changes included. Simpler to re-do the packaging and validation than un-doing the changes. |
Add libfabric PCIe Provider (LPP) is a provider.
LPP is a a provider, which runs on top of Gigaio FabreX(™) fabric.