Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add LPP provider #10303

Merged
merged 4 commits into from
Aug 30, 2024
Merged

Add LPP provider #10303

merged 4 commits into from
Aug 30, 2024

Conversation

tstruk
Copy link
Contributor

@tstruk tstruk commented Aug 12, 2024

Add libfabric PCIe Provider (LPP) is a provider.
LPP is a a provider, which runs on top of Gigaio FabreX(™) fabric.

@shijin-aws
Copy link
Contributor

Should you rename your provider to reflect the NIC vendor? I think many providers like EFA utilize PCIe to realize Peer direct features.

@tstruk
Copy link
Contributor Author

tstruk commented Aug 13, 2024

Should you rename your provider to reflect the NIC vendor? I think many providers like EFA utilize PCIe to realize Peer direct features.

We have been using this name for a long time, and it would require substantial effort to change it at this point.
Since there is no clash with any other provider name, or no conflict with naming conventions we would like to keep it that way.

@tstruk
Copy link
Contributor Author

tstruk commented Aug 13, 2024

Where can I see the results from AWS CI or Jenkins?
The builds passed fine for me in Github Actions https://github.com/tstruk/libfabric/actions/runs/10357960203/job/28671102705
Looks like the results from these tools are not available to the public, are they?

@darrylabbate
Copy link
Member

@tstruk For AWS CI:

Command make -j failed with error:
Makefile:3742: Extraneous text after `endif' directive
Makefile:3742: Extraneous text after `endif' directive

There may be more fatal errors (output is large), so I'll comment what I see

@tstruk tstruk force-pushed the add_lpp branch 2 times, most recently from 192c9e1 to 551873d Compare August 13, 2024 19:54
@tstruk
Copy link
Contributor Author

tstruk commented Aug 13, 2024

Thank you @darrylabbate I have pushed a new version. Let's wait and see if it will work any better.

// Might wan to unset later as this can affect performance.
CUresult cures;
int flag = 1;
cures = ofi_cuPointerSetAttribute(&flag,
Copy link
Contributor

@shijin-aws shijin-aws Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a FYI that we have problem with this sync memops set when running with NCCL >= 2.19. Basically it's not safe to set CU_POINTER_ATTRIBUTE_SYNC_MEMOPS inside libfabric for NCCL application

I guess you set this attribute to enforce a synchronous cuda memcpy inside libfabric, while cuda memcpy is not allowed inside Libfabric either for NCCL application - #6124

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our device uses GPUDirectRDMA, which avoid the need for cudaMemcpy. We're setting the SYNC_MEMOPS attr out of an abundance of caution, as outlined in GpuDirectRDMA docs (https://docs.nvidia.com/cuda/gpudirect-rdma/#synchronization-and-memory-ordering).

Poking through the code, the EFA rdm calls efa_rdm_attempt_to_sycm_memops before a lot of transfers. How are you guys handing this issue then? Is there some nuance to when 'efa_mr->needs_sync` is set?

As for NCCL >=2.19, is this related to the changes with device memory allocation? I ran into a problem testing out newer NCCL version, but was able to mitigate it with the NCCL_CUMEM_ENABLE=0 env var.

Also, now that I'm noticing it, this could probably be reworked to use cuda_set_sync_memops...

Copy link
Contributor

@shijin-aws shijin-aws Aug 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for NCCL >=2.19, is this related to the changes with device memory allocation? I ran into a problem testing out newer NCCL version, but was able to mitigate it with the NCCL_CUMEM_ENABLE=0 env var.

Yes, the problem can be mitigated by that env exactly. We have a boolean as FI_OPT_CUDA_API_PERMITTED to toggle whether CUDA API can be permitted to call inside Libfabric in data transfer. NCCL plugin set this boolean to false to disable cuda api usage in Libfabric.

https://ofiwg.github.io/libfabric/main/man/fi_endpoint.3.html

FI_OPT_CUDA_API_PERMITTED - bool
This option only applies to the fi_setopt call. It is used to control endpoint’s behavior in making calls to CUDA API. By default, an endpoint is permitted to call CUDA API. If user wish to prohibit an endpoint from making such calls, user can achieve that by set this option to false. If an endpoint’s support of CUDA memory relies on making calls to CUDA API, it will return -FI_EOPNOTSUPP for the call to fi_setopt. If either CUDA library or CUDA device is not available, endpoint will return -FI_EINVAL. All providers that support FI_HMEM capability implement this option.

The problem is that boolean is set in the EP level via fi_setopt. while the MR reg can happen before that, so we cannot reuse that boolean so we can only do the toggle in the transmission call.

@aingerson
Copy link
Contributor

Throwing the Intel CI failures here as well - two different builds failed and had two different errors:

12:58:46  prov/lpp/src/hmem_cuda.o: In function `hmem_cuda_init':
12:58:46  .../libfabric/fabtests/prov/lpp/src/hmem_cuda.c:112: undefined reference to `count_of'
12:58:46  .../libfabric/fabtests/prov/lpp/src/hmem_cuda.c:113: undefined reference to `count_of'
12:58:46  .../libfabric/fabtests/prov/lpp/src/hmem_cuda.c:114: undefined reference to `count_of'
12:58:46  .../libfabric/fabtests/prov/lpp/src/hmem_cuda.c:115: undefined reference to `count_of'
12:58:46  .../libfabric/fabtests/prov/lpp/src/hmem_cuda.c:116: undefined reference to `count_of'
12:59:47  /usr/bin/ld: cannot find /usr/lib64/libatomic.so.1.2.0
12:59:47  collect2: error: ld returned 1 exit status

@j-xiong
Copy link
Contributor

j-xiong commented Aug 13, 2024

count_of was removed from fabric.h recently (see 6ac8242#diff-6c89f68bf67f752f1c4034d054016bb948ce8f178cb6b3550b910b51cc4bc045). Please add the definition to your test code.

@@ -304,6 +304,21 @@ UCX_INI ;
# define UCX_INIT NULL
#endif

#if defined(_WIN32) && (HAVE_EFA)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is LPP somewhat related to EFA, or is it a typo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a typo.

ofi_atomic64_t last_errcounter;
struct dlist_entry ep_list;
ofi_mutex_t lock;
} Lpp_cntr_t;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use struct directly instead of using typedef. Same for many other instances.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 100 to 111
//
// Per fi_getinfo(3):
//
// Capabilities may be grouped into two general categories: primary and
// secondary. Primary capabilities must explicitly be requested by an
// application, and a provider must enable support for only those primary
// capabilities which were selected. Secondary capabilities may optionally be
// requested by an application. If requested, a provider must support the
// capability or fail the fi_getinfo request (FI_ENODATA). A provider may
// optionally report non-selected secondary capabilities if doing so would not
// compromise performance or security.
//
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use /* */ for multiline comments. Same for many other places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

#define _LPP_TAGGED_H_

ssize_t lpp_fi_trecv(struct fid_ep *ep, void *buf, size_t len, void *desc,
fi_addr_t src_addr, uint64_t tag, uint64_t ignore, void *context);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Align the second line with the 'struct' in the first line. Same for other places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +95 to +99
if (posix_memalign(&buf, KLPP_MQ_CELL_SIZE, size) != 0) {
return NULL;
} else {
return buf;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use '{ and } when the body has a single line. Same for other places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}
if (klpp_getdevice(klpp_fd, &klpp_devinfo) < 0) {
FI_WARN(&lpp_prov, FI_LOG_FABRIC, "failed to query KLPP device %ld\n", i);
continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should klpp_fd be closed here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Added a close() before continue.

//
// Common
//
int lpp_fi_no_bind(struct fid *fid, struct fid *bfid, uint64_t flags)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use the functions defined in src/enosys.c instead of defining here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@tstruk
Copy link
Contributor Author

tstruk commented Aug 14, 2024

Throwing the Intel CI failures here as well - two different builds failed and had two different errors:

12:58:46  prov/lpp/src/hmem_cuda.o: In function `hmem_cuda_init':
12:58:46  .../libfabric/fabtests/prov/lpp/src/hmem_cuda.c:112: undefined reference to `count_of'
12:58:46  .../libfabric/fabtests/prov/lpp/src/hmem_cuda.c:113: undefined reference to `count_of'
12:58:46  .../libfabric/fabtests/prov/lpp/src/hmem_cuda.c:114: undefined reference to `count_of'
12:58:46  .../libfabric/fabtests/prov/lpp/src/hmem_cuda.c:115: undefined reference to `count_of'
12:58:46  .../libfabric/fabtests/prov/lpp/src/hmem_cuda.c:116: undefined reference to `count_of'
12:59:47  /usr/bin/ld: cannot find /usr/lib64/libatomic.so.1.2.0
12:59:47  collect2: error: ld returned 1 exit status

Resolved. Thanks!

@darrylabbate
Copy link
Member

Seeing the same error as before in AWS CI:

Command make -j failed with error:
Makefile:3742: Extraneous text after `endif' directive
Makefile:3742: Extraneous text after `endif' directive

@tstruk tstruk changed the title Add LPP provider Add LPP provider - WIP Aug 15, 2024
@tstruk tstruk force-pushed the add_lpp branch 2 times, most recently from 421fa4e to 8ae6134 Compare August 15, 2024 10:22
@tstruk
Copy link
Contributor Author

tstruk commented Aug 15, 2024

@darrylabbate could you please check what AWS CI is not happy about?

@shijin-aws
Copy link
Contributor

@darrylabbate could you please check what AWS CI is not happy about?

It seems the latest push succeeded in compilation

@tstruk
Copy link
Contributor Author

tstruk commented Aug 15, 2024

@darrylabbate could you please check what AWS CI is not happy about?

It seems the latest push succeeded in compilation

It takes hours for the AWS CI to finish. I can see that it's still in progress. I don't have visibility into what stage it is at.

@darrylabbate
Copy link
Member

@darrylabbate could you please check what AWS CI is not happy about?

It seems the latest push succeeded in compilation

It takes hours for the AWS CI to finish. I can see that it's still in progress. I don't have visibility into what stage it is at.

For security reasons, this is unfortunately a limitation of our current CI. We'll do our best to communicate CI failures that aren't caught by Actions, etc.

@tstruk
Copy link
Contributor Author

tstruk commented Aug 15, 2024

@darrylabbate could you please check what AWS CI is not happy about?

It seems the latest push succeeded in compilation

Eventually it failed again. Could you please check what was wrong?

@darrylabbate
Copy link
Member

It's still showing "in progress" on our end (and in GitHub). Where do you see it's failing?

@tstruk
Copy link
Contributor Author

tstruk commented Aug 15, 2024

It's still showing "in progress" on our end (and in GitHub). Where do you see it's failing?

Because I pushed the latest updates, but it failed before that. So let's wait and see.

@darrylabbate
Copy link
Member

Looks like it failed to build on ARM-based EC2 instance types. I'll need to dig further to provide a root cause.

@darrylabbate
Copy link
Member

Error from previous build:

prov/lpp/src/lpp_memcpy.c:25:10: fatal error: immintrin.h: No such file or directory

 #include <immintrin.h>

          ^~~~~~~~~~~~~

compilation terminated.

make[1]: *** [prov/lpp/src/src_libfabric_la-lpp_memcpy.lo] Error 1

make[1]: *** Waiting for unfinished jobs....

/tmp/ccOsPotZ.s: Assembler messages:

/tmp/ccOsPotZ.s:995: Error: unknown mnemonic `sfence' -- `sfence'

make[1]: *** [prov/lpp/src/src_libfabric_la-lpp_umc.lo] Error 1

make: *** [all] Error 2

@tstruk
Copy link
Contributor Author

tstruk commented Aug 16, 2024

Error from previous build:

Thank you for digging into this. I guess I need to limit the build to x86 only for now.

@tstruk
Copy link
Contributor Author

tstruk commented Aug 19, 2024

bot:aws:retest

So what is the latest build failure in AWS?

@shefty
Copy link
Member

shefty commented Aug 19, 2024

What is the actual protocol, though? That should still be added.

@tstruk
Copy link
Contributor Author

tstruk commented Aug 28, 2024

I did not see where the PR makes any changes to the libfabric core. Those should be in a separate patch, but I expected to see new FI_ADDR and FI_PROTO values, at the very least, along with updates to the core handling for the new values (e.g. tostr functions).

Edited: It looks like the addressing is sockaddr. I didn't notice protocol details, but I didn't look closely into the provider implementation.

Added a new FI_PROTO_LPP enum.

@tstruk tstruk force-pushed the add_lpp branch 2 times, most recently from 754f1e5 to 184eb12 Compare August 28, 2024 08:32
tstruk and others added 2 commits August 28, 2024 12:30
Add FI_PROTO_LPP enum for the lpp provider.

Signed-off-by: Tadeusz Struk <[email protected]>
Libfabric PCIe Provider (LPP) is a provider,
which runs on top of Gigaio FabreX(™) fabric.

Co-authored-by: Abhishek Goyanka <[email protected]>
Co-authored-by: Benjamin Kitor <[email protected]>
Co-authored-by: David Dai <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Eric Pilmore <[email protected]>
Co-authored-by: John Ihnotic <[email protected]>
Co-authored-by: Thayne Harbaugh <[email protected]>
Signed-off-by: Tadeusz Struk <[email protected]>
@tstruk tstruk changed the title Add LPP provider - WIP Add LPP provider Aug 28, 2024
@darrylabbate
Copy link
Member

From AWS CI:

prov/lpp/src/hmem_util.o: In function `hmem_memcpy_d2h':
/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10303-dso/source/libfabric/fabtests/prov/lpp/src/hmem_util.c:62: undefined reference to `hmem_rocm_memcpy_d2h'
prov/lpp/src/hmem_util.o: In function `hmem_memcpy_h2d':
/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10303-dso/source/libfabric/fabtests/prov/lpp/src/hmem_util.c:80: undefined reference to `hmem_rocm_memcpy_h2d'
prov/lpp/src/hmem_util.o: In function `hmem_alloc':
/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10303-dso/source/libfabric/fabtests/prov/lpp/src/hmem_util.c:98: undefined reference to `hmem_rocm_alloc'
prov/lpp/src/hmem_util.o: In function `hmem_free':
/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10303-dso/source/libfabric/fabtests/prov/lpp/src/hmem_util.c:114: undefined reference to `hmem_rocm_free'
collect2: error: ld returned 1 exit status

@tstruk
Copy link
Contributor Author

tstruk commented Aug 28, 2024

From AWS CI:

prov/lpp/src/hmem_util.o: In function `hmem_memcpy_d2h':
/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10303-dso/source/libfabric/fabtests/prov/lpp/src/hmem_util.c:62: undefined reference to `hmem_rocm_memcpy_d2h'
prov/lpp/src/hmem_util.o: In function `hmem_memcpy_h2d':
/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10303-dso/source/libfabric/fabtests/prov/lpp/src/hmem_util.c:80: undefined reference to `hmem_rocm_memcpy_h2d'
prov/lpp/src/hmem_util.o: In function `hmem_alloc':
/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10303-dso/source/libfabric/fabtests/prov/lpp/src/hmem_util.c:98: undefined reference to `hmem_rocm_alloc'
prov/lpp/src/hmem_util.o: In function `hmem_free':
/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10303-dso/source/libfabric/fabtests/prov/lpp/src/hmem_util.c:114: undefined reference to `hmem_rocm_free'
collect2: error: ld returned 1 exit status

Thank you for checking. I've added a new AM_CONDITIONAL for HMEM. Let's see if it will help.

@tstruk
Copy link
Contributor Author

tstruk commented Aug 29, 2024

What does the ci/cloudbees/stage/Summary do?

@j-xiong
Copy link
Contributor

j-xiong commented Aug 29, 2024

@tstruk The summary stage summarizes the pass/fail/skip counts of all the previous stages. It fails if the total fail count is not zero. Due to the way some of the tests are constructed, a stage may return "success" status even if some of the tests in that stage have failed. Those fails are always captured in the summary stage.

For this specific CI run, the failures are caused by a previous infrastructure issue. Just restarted the CI run, will see how it goes.

@tstruk
Copy link
Contributor Author

tstruk commented Aug 29, 2024

@j-xiong could you please share the results of the two intel-ofi checks that failed?

@j-xiong
Copy link
Contributor

j-xiong commented Aug 29, 2024

The errors are from fabtests over ucx due to the change in #10326. For example:

    fi_getopt(FI_OPT_INJECT_MSG_SIZE)(): benchmarks/benchmark_shared.c:97, ret=-22 (Invalid argument) 

They should go away with #10341.

@tstruk
Copy link
Contributor Author

tstruk commented Aug 29, 2024

They should go away with #10341.

Thank you for info.

@tstruk
Copy link
Contributor Author

tstruk commented Aug 29, 2024

@darrylabbate what's the latest status of AWS testing for this PR?

@darrylabbate
Copy link
Member

prov/lpp/src/test_util.c: In function ‘util_create_mr’:

prov/lpp/src/test_util.c:348:39: error: ‘PAGE_SIZE’ undeclared (first use in this function); did you mean ‘HAVE_ZE’?

   if (posix_memalign(&mr_info->uaddr, PAGE_SIZE,

                                       ^~~~~~~~~

                                       HAVE_ZE

prov/lpp/src/test_util.c:348:39: note: each undeclared identifier is reported only once for each function it appears in

make[1]: *** [prov/lpp/src/test_util.o] Error 1

tstruk and others added 2 commits August 29, 2024 20:47
Add LPP specific fabtests.

Co-authored-by: Abhishek Goyanka <[email protected]>
Co-authored-by: Benjamin Kitor <[email protected]>
Co-authored-by: David Dai <[email protected]>
Co-authored-by: Eric Badger <[email protected]>
Co-authored-by: Eric Pilmore <[email protected]>
Co-authored-by: John Ihnotic <[email protected]>
Co-authored-by: Thayne Harbaugh <[email protected]>
Signed-off-by: Tadeusz Struk <[email protected]>
@tstruk
Copy link
Contributor Author

tstruk commented Aug 30, 2024

@darrylabbate could you check the results now please.

@shijin-aws
Copy link
Contributor

@tstruk this time is ok, there are some efa test failure (we are fixing that) which is irrelevant to this PR

@tstruk
Copy link
Contributor Author

tstruk commented Aug 30, 2024

@shefty so based on the latest comments from @j-xiong and @shijin-aws this PR is ready to go.

@shefty shefty merged commit 65856bc into ofiwg:main Aug 30, 2024
11 of 14 checks passed
@shefty
Copy link
Member

shefty commented Aug 30, 2024

thanks!

@j-xiong
Copy link
Contributor

j-xiong commented Aug 30, 2024

I should have put a do not merge label. I didn't because there is a work in progress label there.

@shefty
Copy link
Member

shefty commented Aug 30, 2024

Sorry about that. I checked for a do not merge label, but didn't see one, and I added the WIP one. Hopefully that doesn't mess up the release. If so, it can be reverted, then re-applied.

@j-xiong
Copy link
Contributor

j-xiong commented Aug 30, 2024

@shefty No worry. I have updated the v2.0.0alpha PR with the new changes included. Simpler to re-do the packaging and validation than un-doing the changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants