Add LPP provider #10303

tstruk · 2024-08-12T19:27:54Z

Add libfabric PCIe Provider (LPP) is a provider.
LPP is a a provider, which runs on top of Gigaio FabreX(™) fabric.

shijin-aws · 2024-08-12T19:37:17Z

Should you rename your provider to reflect the NIC vendor? I think many providers like EFA utilize PCIe to realize Peer direct features.

tstruk · 2024-08-13T02:23:32Z

Should you rename your provider to reflect the NIC vendor? I think many providers like EFA utilize PCIe to realize Peer direct features.

We have been using this name for a long time, and it would require substantial effort to change it at this point.
Since there is no clash with any other provider name, or no conflict with naming conventions we would like to keep it that way.

tstruk · 2024-08-13T02:28:44Z

Where can I see the results from AWS CI or Jenkins?
The builds passed fine for me in Github Actions https://github.com/tstruk/libfabric/actions/runs/10357960203/job/28671102705
Looks like the results from these tools are not available to the public, are they?

darrylabbate · 2024-08-13T19:14:03Z

@tstruk For AWS CI:

Command make -j failed with error:
Makefile:3742: Extraneous text after `endif' directive
Makefile:3742: Extraneous text after `endif' directive

There may be more fatal errors (output is large), so I'll comment what I see

tstruk · 2024-08-13T20:11:18Z

Thank you @darrylabbate I have pushed a new version. Let's wait and see if it will work any better.

shijin-aws · 2024-08-13T20:35:46Z

prov/lpp/src/lpp_hmem.c

+		// Might wan to unset later as this can affect performance.
+		CUresult cures;
+		int flag = 1;
+		cures = ofi_cuPointerSetAttribute(&flag,


Just a FYI that we have problem with this sync memops set when running with NCCL >= 2.19. Basically it's not safe to set CU_POINTER_ATTRIBUTE_SYNC_MEMOPS inside libfabric for NCCL application

I guess you set this attribute to enforce a synchronous cuda memcpy inside libfabric, while cuda memcpy is not allowed inside Libfabric either for NCCL application - #6124

Our device uses GPUDirectRDMA, which avoid the need for cudaMemcpy. We're setting the SYNC_MEMOPS attr out of an abundance of caution, as outlined in GpuDirectRDMA docs (https://docs.nvidia.com/cuda/gpudirect-rdma/#synchronization-and-memory-ordering).

Poking through the code, the EFA rdm calls efa_rdm_attempt_to_sycm_memops before a lot of transfers. How are you guys handing this issue then? Is there some nuance to when 'efa_mr->needs_sync` is set?

As for NCCL >=2.19, is this related to the changes with device memory allocation? I ran into a problem testing out newer NCCL version, but was able to mitigate it with the NCCL_CUMEM_ENABLE=0 env var.

Also, now that I'm noticing it, this could probably be reworked to use cuda_set_sync_memops...

As for NCCL >=2.19, is this related to the changes with device memory allocation? I ran into a problem testing out newer NCCL version, but was able to mitigate it with the NCCL_CUMEM_ENABLE=0 env var.

Yes, the problem can be mitigated by that env exactly. We have a boolean as FI_OPT_CUDA_API_PERMITTED to toggle whether CUDA API can be permitted to call inside Libfabric in data transfer. NCCL plugin set this boolean to false to disable cuda api usage in Libfabric.

https://ofiwg.github.io/libfabric/main/man/fi_endpoint.3.html

FI_OPT_CUDA_API_PERMITTED - bool
This option only applies to the fi_setopt call. It is used to control endpoint’s behavior in making calls to CUDA API. By default, an endpoint is permitted to call CUDA API. If user wish to prohibit an endpoint from making such calls, user can achieve that by set this option to false. If an endpoint’s support of CUDA memory relies on making calls to CUDA API, it will return -FI_EOPNOTSUPP for the call to fi_setopt. If either CUDA library or CUDA device is not available, endpoint will return -FI_EINVAL. All providers that support FI_HMEM capability implement this option.

The problem is that boolean is set in the EP level via fi_setopt. while the MR reg can happen before that, so we cannot reuse that boolean so we can only do the toggle in the transmission call.

aingerson · 2024-08-13T20:42:57Z

Throwing the Intel CI failures here as well - two different builds failed and had two different errors:

12:58:46  prov/lpp/src/hmem_cuda.o: In function `hmem_cuda_init':
12:58:46  .../libfabric/fabtests/prov/lpp/src/hmem_cuda.c:112: undefined reference to `count_of'
12:58:46  .../libfabric/fabtests/prov/lpp/src/hmem_cuda.c:113: undefined reference to `count_of'
12:58:46  .../libfabric/fabtests/prov/lpp/src/hmem_cuda.c:114: undefined reference to `count_of'
12:58:46  .../libfabric/fabtests/prov/lpp/src/hmem_cuda.c:115: undefined reference to `count_of'
12:58:46  .../libfabric/fabtests/prov/lpp/src/hmem_cuda.c:116: undefined reference to `count_of'

12:59:47  /usr/bin/ld: cannot find /usr/lib64/libatomic.so.1.2.0
12:59:47  collect2: error: ld returned 1 exit status

j-xiong · 2024-08-13T20:56:33Z

count_of was removed from fabric.h recently (see 6ac8242#diff-6c89f68bf67f752f1c4034d054016bb948ce8f178cb6b3550b910b51cc4bc045). Please add the definition to your test code.

j-xiong · 2024-08-13T21:09:25Z

include/ofi_prov.h

@@ -304,6 +304,21 @@ UCX_INI ;
 #  define UCX_INIT NULL
 #endif

+#if defined(_WIN32) && (HAVE_EFA)


Is LPP somewhat related to EFA, or is it a typo?

That's a typo.

j-xiong · 2024-08-13T21:19:17Z

prov/lpp/include/lpp_cntr.h

+	ofi_atomic64_t		last_errcounter;
+	struct dlist_entry	ep_list;
+	ofi_mutex_t		lock;
+} Lpp_cntr_t;


Please use struct directly instead of using typedef. Same for many other instances.

j-xiong · 2024-08-13T21:21:23Z

prov/lpp/include/lpp.h

+//
+// Per fi_getinfo(3):
+//
+// Capabilities may be grouped into two general categories: primary and
+// secondary. Primary capabilities must explicitly be requested by an
+// application, and a provider must enable support for only those primary
+// capabilities which were selected. Secondary capabilities may optionally be
+// requested by an application. If requested, a provider must support the
+// capability or fail the fi_getinfo request (FI_ENODATA). A provider may
+// optionally report non-selected secondary capabilities if doing so would not
+// compromise performance or security.
+//


Please use /* */ for multiline comments. Same for many other places.

j-xiong · 2024-08-13T21:30:20Z

prov/lpp/include/lpp_tagged.h

+#define _LPP_TAGGED_H_
+
+ssize_t lpp_fi_trecv(struct fid_ep *ep, void *buf, size_t len, void *desc,
+	fi_addr_t src_addr, uint64_t tag, uint64_t ignore, void *context);


Align the second line with the 'struct' in the first line. Same for other places.

j-xiong · 2024-08-13T21:32:14Z

prov/lpp/src/lpp_atomic.c

+	if (posix_memalign(&buf, KLPP_MQ_CELL_SIZE, size) != 0) {
+		return NULL;
+	} else {
+		return buf;
+	}


don't use '{ and } when the body has a single line. Same for other places.

j-xiong · 2024-08-13T21:45:12Z

prov/lpp/src/lpp_getinfo.c

+		}
+		if (klpp_getdevice(klpp_fd, &klpp_devinfo) < 0) {
+			FI_WARN(&lpp_prov, FI_LOG_FABRIC, "failed to query KLPP device %ld\n", i);
+			continue;


should klpp_fd be closed here?

Yes. Added a close() before continue.

j-xiong · 2024-08-13T21:49:04Z

prov/lpp/src/lpp_nosys.c

+//
+// Common
+//
+int lpp_fi_no_bind(struct fid *fid, struct fid *bfid, uint64_t flags)


Can use the functions defined in src/enosys.c instead of defining here.

tstruk · 2024-08-14T15:32:01Z

Throwing the Intel CI failures here as well - two different builds failed and had two different errors:

12:58:46  prov/lpp/src/hmem_cuda.o: In function `hmem_cuda_init':
12:58:46  .../libfabric/fabtests/prov/lpp/src/hmem_cuda.c:112: undefined reference to `count_of'
12:58:46  .../libfabric/fabtests/prov/lpp/src/hmem_cuda.c:113: undefined reference to `count_of'
12:58:46  .../libfabric/fabtests/prov/lpp/src/hmem_cuda.c:114: undefined reference to `count_of'
12:58:46  .../libfabric/fabtests/prov/lpp/src/hmem_cuda.c:115: undefined reference to `count_of'
12:58:46  .../libfabric/fabtests/prov/lpp/src/hmem_cuda.c:116: undefined reference to `count_of'

12:59:47  /usr/bin/ld: cannot find /usr/lib64/libatomic.so.1.2.0
12:59:47  collect2: error: ld returned 1 exit status

Resolved. Thanks!

darrylabbate · 2024-08-14T18:23:03Z

Seeing the same error as before in AWS CI:

Command make -j failed with error:
Makefile:3742: Extraneous text after `endif' directive
Makefile:3742: Extraneous text after `endif' directive

tstruk · 2024-08-15T15:03:29Z

@darrylabbate could you please check what AWS CI is not happy about?

shijin-aws · 2024-08-15T17:37:59Z

@darrylabbate could you please check what AWS CI is not happy about?

It seems the latest push succeeded in compilation

tstruk · 2024-08-15T17:42:14Z

@darrylabbate could you please check what AWS CI is not happy about?

It seems the latest push succeeded in compilation

It takes hours for the AWS CI to finish. I can see that it's still in progress. I don't have visibility into what stage it is at.

darrylabbate · 2024-08-15T17:45:23Z

@darrylabbate could you please check what AWS CI is not happy about?

It seems the latest push succeeded in compilation

It takes hours for the AWS CI to finish. I can see that it's still in progress. I don't have visibility into what stage it is at.

For security reasons, this is unfortunately a limitation of our current CI. We'll do our best to communicate CI failures that aren't caught by Actions, etc.

tstruk · 2024-08-15T21:15:00Z

@darrylabbate could you please check what AWS CI is not happy about?

It seems the latest push succeeded in compilation

Eventually it failed again. Could you please check what was wrong?

darrylabbate · 2024-08-15T21:16:51Z

It's still showing "in progress" on our end (and in GitHub). Where do you see it's failing?

tstruk · 2024-08-15T21:20:53Z

It's still showing "in progress" on our end (and in GitHub). Where do you see it's failing?

Because I pushed the latest updates, but it failed before that. So let's wait and see.

darrylabbate · 2024-08-15T21:26:46Z

Looks like it failed to build on ARM-based EC2 instance types. I'll need to dig further to provide a root cause.

darrylabbate · 2024-08-15T21:32:04Z

Error from previous build:

prov/lpp/src/lpp_memcpy.c:25:10: fatal error: immintrin.h: No such file or directory

 #include <immintrin.h>

          ^~~~~~~~~~~~~

compilation terminated.

make[1]: *** [prov/lpp/src/src_libfabric_la-lpp_memcpy.lo] Error 1

make[1]: *** Waiting for unfinished jobs....

/tmp/ccOsPotZ.s: Assembler messages:

/tmp/ccOsPotZ.s:995: Error: unknown mnemonic `sfence' -- `sfence'

make[1]: *** [prov/lpp/src/src_libfabric_la-lpp_umc.lo] Error 1

make: *** [all] Error 2

tstruk · 2024-08-16T03:56:47Z

Error from previous build:

Thank you for digging into this. I guess I need to limit the build to x86 only for now.

tstruk · 2024-08-19T20:38:51Z

bot:aws:retest

So what is the latest build failure in AWS?

shefty · 2024-08-19T20:44:52Z

What is the actual protocol, though? That should still be added.

tstruk · 2024-08-28T06:42:06Z

I did not see where the PR makes any changes to the libfabric core. Those should be in a separate patch, but I expected to see new FI_ADDR and FI_PROTO values, at the very least, along with updates to the core handling for the new values (e.g. tostr functions).

Edited: It looks like the addressing is sockaddr. I didn't notice protocol details, but I didn't look closely into the provider implementation.

Added a new FI_PROTO_LPP enum.

Add FI_PROTO_LPP enum for the lpp provider. Signed-off-by: Tadeusz Struk <[email protected]>

Libfabric PCIe Provider (LPP) is a provider, which runs on top of Gigaio FabreX(™) fabric. Co-authored-by: Abhishek Goyanka <[email protected]> Co-authored-by: Benjamin Kitor <[email protected]> Co-authored-by: David Dai <[email protected]> Co-authored-by: Eric Badger <[email protected]> Co-authored-by: Eric Pilmore <[email protected]> Co-authored-by: John Ihnotic <[email protected]> Co-authored-by: Thayne Harbaugh <[email protected]> Signed-off-by: Tadeusz Struk <[email protected]>

darrylabbate · 2024-08-28T17:58:46Z

From AWS CI:

prov/lpp/src/hmem_util.o: In function `hmem_memcpy_d2h':
/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10303-dso/source/libfabric/fabtests/prov/lpp/src/hmem_util.c:62: undefined reference to `hmem_rocm_memcpy_d2h'
prov/lpp/src/hmem_util.o: In function `hmem_memcpy_h2d':
/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10303-dso/source/libfabric/fabtests/prov/lpp/src/hmem_util.c:80: undefined reference to `hmem_rocm_memcpy_h2d'
prov/lpp/src/hmem_util.o: In function `hmem_alloc':
/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10303-dso/source/libfabric/fabtests/prov/lpp/src/hmem_util.c:98: undefined reference to `hmem_rocm_alloc'
prov/lpp/src/hmem_util.o: In function `hmem_free':
/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10303-dso/source/libfabric/fabtests/prov/lpp/src/hmem_util.c:114: undefined reference to `hmem_rocm_free'
collect2: error: ld returned 1 exit status

tstruk · 2024-08-28T18:49:14Z

From AWS CI:

prov/lpp/src/hmem_util.o: In function `hmem_memcpy_d2h':
/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10303-dso/source/libfabric/fabtests/prov/lpp/src/hmem_util.c:62: undefined reference to `hmem_rocm_memcpy_d2h'
prov/lpp/src/hmem_util.o: In function `hmem_memcpy_h2d':
/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10303-dso/source/libfabric/fabtests/prov/lpp/src/hmem_util.c:80: undefined reference to `hmem_rocm_memcpy_h2d'
prov/lpp/src/hmem_util.o: In function `hmem_alloc':
/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10303-dso/source/libfabric/fabtests/prov/lpp/src/hmem_util.c:98: undefined reference to `hmem_rocm_alloc'
prov/lpp/src/hmem_util.o: In function `hmem_free':
/home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10303-dso/source/libfabric/fabtests/prov/lpp/src/hmem_util.c:114: undefined reference to `hmem_rocm_free'
collect2: error: ld returned 1 exit status

Thank you for checking. I've added a new AM_CONDITIONAL for HMEM. Let's see if it will help.

tstruk · 2024-08-29T02:32:44Z

What does the ci/cloudbees/stage/Summary do?

j-xiong · 2024-08-29T04:47:37Z

@tstruk The summary stage summarizes the pass/fail/skip counts of all the previous stages. It fails if the total fail count is not zero. Due to the way some of the tests are constructed, a stage may return "success" status even if some of the tests in that stage have failed. Those fails are always captured in the summary stage.

For this specific CI run, the failures are caused by a previous infrastructure issue. Just restarted the CI run, will see how it goes.

tstruk · 2024-08-29T06:22:54Z

@j-xiong could you please share the results of the two intel-ofi checks that failed?

j-xiong · 2024-08-29T15:20:36Z

The errors are from fabtests over ucx due to the change in #10326. For example:

    fi_getopt(FI_OPT_INJECT_MSG_SIZE)(): benchmarks/benchmark_shared.c:97, ret=-22 (Invalid argument)

They should go away with #10341.

tstruk · 2024-08-29T15:22:41Z

They should go away with #10341.

Thank you for info.

tstruk · 2024-08-29T15:23:37Z

@darrylabbate what's the latest status of AWS testing for this PR?

darrylabbate · 2024-08-29T17:58:29Z

prov/lpp/src/test_util.c: In function ‘util_create_mr’:

prov/lpp/src/test_util.c:348:39: error: ‘PAGE_SIZE’ undeclared (first use in this function); did you mean ‘HAVE_ZE’?

   if (posix_memalign(&mr_info->uaddr, PAGE_SIZE,

                                       ^~~~~~~~~

                                       HAVE_ZE

prov/lpp/src/test_util.c:348:39: note: each undeclared identifier is reported only once for each function it appears in

make[1]: *** [prov/lpp/src/test_util.o] Error 1

Add LPP specific fabtests. Co-authored-by: Abhishek Goyanka <[email protected]> Co-authored-by: Benjamin Kitor <[email protected]> Co-authored-by: David Dai <[email protected]> Co-authored-by: Eric Badger <[email protected]> Co-authored-by: Eric Pilmore <[email protected]> Co-authored-by: John Ihnotic <[email protected]> Co-authored-by: Thayne Harbaugh <[email protected]> Signed-off-by: Tadeusz Struk <[email protected]>

Signed-off-by: Tadeusz Struk <[email protected]>

tstruk · 2024-08-30T18:42:00Z

@darrylabbate could you check the results now please.

shijin-aws · 2024-08-30T18:44:49Z

@tstruk this time is ok, there are some efa test failure (we are fixing that) which is irrelevant to this PR

tstruk · 2024-08-30T18:48:31Z

@shefty so based on the latest comments from @j-xiong and @shijin-aws this PR is ready to go.

shefty · 2024-08-30T19:51:09Z

thanks!

j-xiong · 2024-08-30T20:06:05Z

I should have put a do not merge label. I didn't because there is a work in progress label there.

shefty · 2024-08-30T21:58:58Z

Sorry about that. I checked for a do not merge label, but didn't see one, and I added the WIP one. Hopefully that doesn't mess up the release. If so, it can be reverted, then re-applied.

j-xiong · 2024-08-30T22:18:30Z

@shefty No worry. I have updated the v2.0.0alpha PR with the new changes included. Simpler to re-do the packaging and validation than un-doing the changes.

tstruk force-pushed the add_lpp branch 2 times, most recently from 192c9e1 to 551873d Compare August 13, 2024 19:54

shijin-aws requested a review from j-xiong August 13, 2024 20:25

shijin-aws reviewed Aug 13, 2024

View reviewed changes

j-xiong reviewed Aug 13, 2024

View reviewed changes

tstruk force-pushed the add_lpp branch from 551873d to eb16020 Compare August 15, 2024 04:53

tstruk changed the title ~~Add LPP provider~~ Add LPP provider - WIP Aug 15, 2024

tstruk force-pushed the add_lpp branch 2 times, most recently from 421fa4e to 8ae6134 Compare August 15, 2024 10:22

tstruk force-pushed the add_lpp branch from 8ae6134 to 40daac1 Compare August 15, 2024 15:31

tstruk force-pushed the add_lpp branch from 40daac1 to ce1e07a Compare August 15, 2024 20:03

tstruk force-pushed the add_lpp branch 2 times, most recently from 754f1e5 to 184eb12 Compare August 28, 2024 08:32

tstruk and others added 2 commits August 28, 2024 12:30

core: Add FI_PROTO_LPP

aa1a31c

Add FI_PROTO_LPP enum for the lpp provider. Signed-off-by: Tadeusz Struk <[email protected]>

tstruk changed the title ~~Add LPP provider - WIP~~ Add LPP provider Aug 28, 2024

tstruk force-pushed the add_lpp branch from 184eb12 to 78c8bd0 Compare August 28, 2024 13:02

tstruk force-pushed the add_lpp branch from 78c8bd0 to 14cd5ec Compare August 28, 2024 18:46

tstruk and others added 2 commits August 29, 2024 20:47

github/workflow: Enable lpp in pr-ci

9caee46

Signed-off-by: Tadeusz Struk <[email protected]>

tstruk force-pushed the add_lpp branch from 14cd5ec to 9caee46 Compare August 29, 2024 18:48

shefty merged commit 65856bc into ofiwg:main Aug 30, 2024
11 of 14 checks passed

Add LPP provider #10303

Add LPP provider #10303

Conversation

tstruk commented Aug 12, 2024

shijin-aws commented Aug 12, 2024

tstruk commented Aug 13, 2024

tstruk commented Aug 13, 2024

darrylabbate commented Aug 13, 2024

tstruk commented Aug 13, 2024

shijin-aws Aug 13, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shijin-aws Aug 15, 2024 • edited Loading

Choose a reason for hiding this comment

aingerson commented Aug 13, 2024

j-xiong commented Aug 13, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tstruk commented Aug 14, 2024

darrylabbate commented Aug 14, 2024

tstruk commented Aug 15, 2024

shijin-aws commented Aug 15, 2024

tstruk commented Aug 15, 2024

darrylabbate commented Aug 15, 2024

tstruk commented Aug 15, 2024

darrylabbate commented Aug 15, 2024

tstruk commented Aug 15, 2024

darrylabbate commented Aug 15, 2024

darrylabbate commented Aug 15, 2024

tstruk commented Aug 16, 2024

tstruk commented Aug 19, 2024

shefty commented Aug 19, 2024

tstruk commented Aug 28, 2024 • edited Loading

darrylabbate commented Aug 28, 2024

tstruk commented Aug 28, 2024

tstruk commented Aug 29, 2024

j-xiong commented Aug 29, 2024

tstruk commented Aug 29, 2024

j-xiong commented Aug 29, 2024

tstruk commented Aug 29, 2024

tstruk commented Aug 29, 2024

darrylabbate commented Aug 29, 2024

tstruk commented Aug 30, 2024

shijin-aws commented Aug 30, 2024

tstruk commented Aug 30, 2024

shefty commented Aug 30, 2024

j-xiong commented Aug 30, 2024

shefty commented Aug 30, 2024

j-xiong commented Aug 30, 2024 • edited Loading

shijin-aws Aug 13, 2024 •

edited

Loading

shijin-aws Aug 15, 2024 •

edited

Loading

tstruk commented Aug 28, 2024 •

edited

Loading

j-xiong commented Aug 30, 2024 •

edited

Loading