Add queue sizes to endpoint attribute #10

Closed
shefty opened this issue Sep 3, 2014 · 9 comments


shefty commented Sep 3, 2014

The endpoint attribute structure should be expanded to expose the sizes of the underlying queues. Now that the EP attribute structure exists, we can simplify things for the user and avoid needing to use control interfaces to override the default values. But default values should still be available to the user, with the actual values returned when an endpoint is created.
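For discussion purposes, here is a rough sketch of what the expanded attributes could look like. The field names below are placeholders of mine, not a proposed final API:

```c
#include <stddef.h>

/*
 * Placeholder sketch only -- these field names are illustrative, not
 * the final API.  The idea: an app may fill in requested sizes before
 * endpoint creation, and the provider writes back the actual values.
 */
struct fi_ep_attr_sizes_sketch {
	size_t	tx_queue_size;		/* send queue depth */
	size_t	rx_queue_size;		/* receive queue depth */
	size_t	tx_iov_limit;		/* max SGL entries per send */
	size_t	rx_iov_limit;		/* max SGL entries per receive */
	size_t	max_inject_size;	/* max inline data per send */
};
```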

@shefty shefty added this to the alpha release milestone Sep 3, 2014
@dledford

@shefty For this, are you referring to adding fields to struct fi_info (which is used when creating an endpoint and therefore could be preset to requested values rather than requiring a modify sequence after the endpoint is created)? And if so, do you want to get as detailed as IB gets here, with options for send queue size, receive queue size, send and recv queue maximum SG entries, and possibly a request for maximum inline data too? Or do you think that's getting too fabric specific and defeating the purpose of abstracting the fabric out?


shefty commented Sep 16, 2014

I was thinking of adding the fields to fi_ep_attr, but I don't know what fields to add, if any. I was thinking along the lines of send/recv queue size and SGL sizes. But if we expand the endpoint to include the concept of sessions for multi-threaded purposes, then there may be multiple sizes, corresponding to different HW work queues. So a single send queue size value may not work. Personally, I like the idea of trying to keep things abstract and using return codes to keep the user from overrunning any lower level queues. I'm not sure that works for all apps though. And I'm not sure what to do with SGL limits. SGL limits seem easier to expose through fi_ep_attr.

@dledford

On 09/16/2014 01:58 PM, Sean Hefty wrote:

I was thinking of adding the fields to fi_ep_attr, but I don't know what
fields to add, if any. I was thinking along the lines of send/recv queue
size and SGL sizes.

OK.

But if we expand the endpoint to include the concept
of sessions

Definition please. By sessions do you mean multiple connections between
two hosts not following the same path (like one over ib0 and one over
ib1, or say one over eth1 and one over ib0 where eth1 is RoCE enabled
and ib0 is InfiniBand)?

for multi-threaded purposes, then there may be multiple
sizes, corresponding to different HW work queues.

This will probably go beyond the scope of libfabric. Or at least I
would think beyond the scope of the bottom layer of libfabric. We've
talked multiple times about the difference between the libfabric that
MPIs and other apps wanting really low level, "get out of my way" type
access to the underlying fabric need, and the libfabric for apps that
want to "abstract away all that fabric stuff and give me a simple, but
performant, interface". The overhead associated with sessions seems
like it would pre-emptively force support for sessions up to that higher
layer abstraction. As such, I'm not sure you want to build that into
the lower layer data structures versus handling it entirely at a higher
layer.

At a minimum though, I can see that if you are going to support the
notion of sessions, then not only would the queue size and other
parameters need to be in fi_ep_attr, but I think you would need to move
the src_addr and dst_addr from fi_info to fi_ep_attr as well since the
addresses of each session would likely be unique.

So a single send queue
size value may not work. Personally, I like the idea of trying to keep
things abstract and using return codes to keep the user from overrunning
any lower level queues.

I had thought about that. But that is decidedly performance unfriendly
in the IB case. And it would prevent the app from implementing any sort
of credit mechanism itself. But, for some providers, there is no
concept of a queue depth (the sockets provider immediately comes to
mind). So I was thinking of adding it, but defining it in the API such
that a user can specify a requested queue depth in the fi_ep_attr
struct, and depending on the provider the endpoint is created on, the
following matrix of values will be placed in the fi_ep_attr struct on
return:

User fills in queue size   Provider has notion of queue size    Provider is queue deficient
Yes                        return min(max queue depth,          return -1
                           requested queue depth)
No                         return default queue depth           return -1
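
For illustration, a minimal sketch in C of the negotiation this matrix describes. The function and parameter names are hypothetical, not part of any existing API:

```c
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

/*
 * Sketch of the matrix above; all names are hypothetical.  Returns the
 * value a provider would place back into the fi_ep_attr struct.
 */
static ssize_t resolve_queue_size(size_t requested, size_t provider_max,
				  size_t provider_default, int has_queues)
{
	if (!has_queues)
		return -1;			/* queue-deficient provider */
	if (requested)				/* user filled in a size */
		return (ssize_t)(requested < provider_max ?
				 requested : provider_max);
	return (ssize_t)provider_default;	/* user left the field at 0 */
}
```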

I'm not sure that works for all apps though. And
I'm not sure what to do with SGL limits. SGL limits seem easier to
expose through fi_ep_attr.

I would agree with that. I think there are a number of things that are
in fi_info that can go to fi_ep_attr if you are going to support
sessions, and a number of things that can come back if you aren't.
However, since there's no harm in them being in fi_ep_attr, we can put
stuff there and plan for the possible future that way.


shefty commented Sep 16, 2014

For 'sessions', what I mean are multiple HW command queues mapped to the application. The command queues have the same transport and network level address. If the queues can receive data, they may have a different session level address, which ideally would be exposed to the app as an index. A very simple use case would be an app using different sessions to communicate with different sets of remote processes. (I haven't thought through this concept, so my ideas are just up in the air at the moment.)

I agree that we will need to expose a size for application credit schemes. Maybe the answer is in the definition. (Note that I'm lousy at coming up with names.)

min_outstanding_send - The minimum number of data transfers that a provider will queue to an endpoint.

This still allows for returning EBUSY. A provider may be able to queue more requests.

I also want to consider software providers that enhance the capabilities of a HW provider. E.g., there could be a provider that supports transfers larger than 4 GB by breaking up a large request into multiple smaller requests. I don't think this causes any issues for a reported queue size, but I haven't thought it through. A rough sketch of that splitting is below.
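
As a rough sketch of that layering (the 4 GB limit and names here are purely illustrative), the software provider only needs to know how many HW-sized chunks a large request expands into and charge that many entries against the underlying queue, or stage the extras in its own small queue:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only: a software provider splitting one large request
 * into chunks the underlying HW can handle.  Each chunk consumes one
 * entry in the HW queue.
 */
#define HW_MAX_XFER_SIZE	((size_t)UINT32_MAX)	/* ~4 GB HW limit */

static size_t chunks_needed(size_t total_len)
{
	return (total_len + HW_MAX_XFER_SIZE - 1) / HW_MAX_XFER_SIZE;
}
```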

Btw, it's kind of arbitrary which fields go into fi_info versus fi_ep_attr. I wanted to keep all mandatory fields in fi_info, and only require those apps that want to deal at the lower level fill out fi_ep_attr.

@dledford

On 09/16/2014 02:45 PM, Sean Hefty wrote:

For 'sessions', what I mean are multiple HW command queues mapped to the
application. The command queues have the same transport and network
level address. If the queues can receive data, they may have a different
session level address, which ideally would be exposed to the app as an
index. A very simple use case would be an app using different sessions
to communicate with different sets of remote processes. (I haven't
thought through this concept, so my ideas are just up in the air at the
moment.)

I think I get what you mean (but I doubt that sets of different
processes are reasonable; you will likely need a whole new EP for each
different process you talk to, due to the requirement of having to
listen/connect to different ports/services). However, an example that
does make sense to me, and something I've been looking at doing as an
optimization to conserve memory use in IB communications, is the idea of
having multiple queue pairs between two apps where the queue pairs use
different maximum message sizes and queue depths in order to allow you
to send lots of small messages without wasting huge amounts of space.
Such as a queue pair with a max message size of 256 bytes, another at
1k, another at 4k, another at 16k, and one at 64k, with each queue pair
having progressively fewer entries as the size gets larger. For apps
that send lots of small messages with some medium and large size
messages mixed in, this would make a lot of sense (ordering issues not
being considered here; the app would either need to take care of that or
there would need to be a layered ordering provider on top of this
scheme).

I agree that we will need to expose a size for application credit
schemes. Maybe the answer is in the definition. (Note that I'm lousy
coming up with names.)

min_outstanding_send - The minimum number of data transfers that a
provider will queue to an endpoint.

Except that most credit schemes are based on the opposite of this: a
maximum that the app knows and can plan for minus the currently
in-flight number.

This still allows for returning EBUSY.

When it comes to applications that want to manage their credits, if we
ever return EBUSY, we've failed.

A provider may be able to queue
more requests.

I think it's fair to say that, if an app wants to manage its own
in-flight counts and credits, the maximum queue depth plus sends posted
minus completions received should allow it to do so deterministically,
and that only applications that don't bother to track queue state should
ever hit EBUSY. But for them it should exist, and the tracking of queue
depth versus sent versus completed should be an optional optimization
left up to the application. Fair enough?
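
As a concrete illustration of that bookkeeping, a minimal sketch (the names are mine, not part of the API): the app treats the reported queue depth as its credit pool and posts only while posted minus completed stays below it.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the app-side credit tracking described above; names are
 * illustrative.  With a known maximum queue depth, an app that counts
 * posted sends and reaped completions never has to see EBUSY.
 */
struct send_credits {
	uint64_t max_depth;	/* queue depth reported in fi_ep_attr */
	uint64_t posted;	/* sends posted so far */
	uint64_t completed;	/* send completions reaped from the CQ */
};

static bool can_post_send(const struct send_credits *c)
{
	return (c->posted - c->completed) < c->max_depth;
}
```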

I also want to consider software providers that enhance the capabilities
of a HW provider. E.g. there could be a provider that supports transfers
larger than 4 GB, by breaking up a large request into multiple smaller
requests. I don't think this causes any issues to a reported queue size,
but I haven't thought through it.

It shouldn't, but it would mean that the software provider will have to
provide a minimal queue of its own to compensate for split packets. But
that's OK; since it already has to split and recombine packets, a small
queue is nothing major to add to that.

Btw, it's kind of arbitrary which fields go into fi_info versus
fi_ep_attr. I wanted to keep all mandatory fields in fi_info, and only
require those apps that want to deal at the lower level fill out fi_ep_attr.

OK, I can understand that. I'll make a note to that effect in the
header file ;-)


shefty commented Sep 16, 2014

I think I get what you mean (but I doubt that sets of different
processes is reasonable, you will likely need a whole new EP for each
different process you talk to due to the requirement of having to
listen/connect to different ports/services).

Ah - I was thinking more of unconnected endpoints. HPC apps in general want reliable unconnected endpoints. There are at least a couple of vendors that support this (including Intel). The Mellanox XRC and dynamic connection features are steps in this direction.

does make sense to me, and something I've been looking at doing as an
optimization to conserve memory use in IB communications, is the idea of
having multiple queue pairs between two apps where the queue pairs
utilized different maximum message sizes and queue depths in order to

The libfabric feature to do this is the FI_MULTI_RECV flag, which is support for 'slab-based' memory buffering. I.e., the user posts a single large buffer, and multiple receives simply fill in the buffer. This would be more for future HW or non-offload HW. IB could simulate this by using RDMA writes with immediate in place of sending messages.
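
For reference, a sketch of how a slab buffer is posted with FI_MULTI_RECV, roughly as it looks in later libfabric releases; the exact calls may postdate this discussion, so treat it as illustrative:

```c
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

/*
 * Slab-based buffering sketch: one large buffer is posted once, the
 * provider packs successive receives into it, and a completion flagged
 * with FI_MULTI_RECV indicates the buffer is used up and should be
 * replaced.
 */
static int post_slab(struct fid_ep *ep, void *slab, size_t slab_size,
		     void *desc, void *context)
{
	struct iovec iov = { .iov_base = slab, .iov_len = slab_size };
	struct fi_msg msg = {
		.msg_iov   = &iov,
		.desc      = &desc,
		.iov_count = 1,
		.addr      = FI_ADDR_UNSPEC,
		.context   = context,
	};

	return (int) fi_recvmsg(ep, &msg, FI_MULTI_RECV);
}
```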

I agree that we will need to expose a size for application credit
schemes. Maybe the answer is in the definition. (Note that I'm lousy
coming up with names.)

min_outstanding_send - The minimum number of data transfers that a
provider will queue to an endpoint.

Except that most credit schemes are based on the opposite of this: a
maximum that the app knows and can plan for minus the currently
in-flight number.

The app can set its starting max_credits to the min_outstanding. I used min instead of max, since the app may be able to post more. E.g., for iWARP to support RDMA write with immediate, it would consume 2 queue entries (RDMA write + send message). So it would set min_outstanding = 1/2 * (queue size). If the app posts nothing but writes with immediate, it will block at min_outstanding. But if it only does sends, it can queue twice that amount.

I think this meets the intent that you want. The only issue is really the name.
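
A quick worked example of that sizing, with made-up numbers:

```c
#include <assert.h>

/*
 * Worked example of the sizing above, with illustrative numbers: a
 * provider whose worst-case operation (RDMA write with immediate
 * emulated as an RDMA write plus a send) consumes two queue entries.
 */
int main(void)
{
	unsigned hw_queue_size = 256;		/* raw HW send queue depth */
	unsigned worst_case_entries = 2;	/* write-with-immediate cost */
	unsigned min_outstanding = hw_queue_size / worst_case_entries;

	assert(min_outstanding == 128);
	/* All writes-with-immediate: the app blocks at 128 outstanding.
	 * All plain sends (1 entry each): up to 256 may actually fit. */
	return 0;
}
```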

@shefty shefty self-assigned this Oct 1, 2014

shefty commented Oct 1, 2014

A general proposal to expose this is described here:

http://lists.openfabrics.org/pipermail/ofiwg/2014-September/000354.html

I will post a patch for this idea for further discussion.

@shefty shefty closed this as completed Oct 2, 2014
@shefty shefty reopened this Oct 2, 2014

shefty commented Oct 2, 2014

A patch has been developed, but has not been committed.


shefty commented Oct 10, 2014

An initial patch for this was committed as 5cb07ab. Discussions are continuing on the mailing list to enhance this, but I'm closing this issue, since the queue sizes are now available.

@shefty shefty closed this as completed Oct 10, 2014
hppritcha added a commit to hppritcha/libfabric that referenced this issue Feb 10, 2015
Honggang-LI added a commit to Honggang-LI/libfabric that referenced this issue Dec 17, 2020
ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fff4c61e7e0 at pc 0x14f2cb7ae0b9 bp 0x7fff4c61e650 sp 0x7fff4c61ddd8
WRITE of size 17 at 0x7fff4c61e7e0 thread T0
    #0 0x14f2cb7ae0b8  (/lib64/libasan.so.5+0xb40b8)
    #1 0x14f2cb7aedd2 in vsscanf (/lib64/libasan.so.5+0xb4dd2)
    #2 0x14f2cb7aeede in __interceptor_sscanf (/lib64/libasan.so.5+0xb4ede)
    #3 0x14f2cb230766 in ofi_addr_format src/common.c:401
    #4 0x14f2cb233238 in ofi_str_toaddr src/common.c:780
    #5 0x14f2cb314332 in vrb_handle_ib_ud_addr prov/verbs/src/verbs_info.c:1670
    #6 0x14f2cb314332 in vrb_get_match_infos prov/verbs/src/verbs_info.c:1787
    #7 0x14f2cb314332 in vrb_getinfo prov/verbs/src/verbs_info.c:1841
    #8 0x14f2cb21fc28 in fi_getinfo_ src/fabric.c:1010
    #9 0x14f2cb25fcc0 in ofi_get_core_info prov/util/src/util_attr.c:298
    #10 0x14f2cb269b20 in ofix_getinfo prov/util/src/util_attr.c:321
    #11 0x14f2cb3e29fd in rxd_getinfo prov/rxd/src/rxd_init.c:122
    #12 0x14f2cb21fc28 in fi_getinfo_ src/fabric.c:1010
    #13 0x407150 in ft_getinfo common/shared.c:794
    #14 0x414917 in ft_init_fabric common/shared.c:1042
    #15 0x402f40 in run functional/bw.c:155
    #16 0x402f40 in main functional/bw.c:252
    #17 0x14f2ca1b28e2 in __libc_start_main (/lib64/libc.so.6+0x238e2)
    #18 0x401d1d in _start (/root/libfabric/fabtests/functional/fi_bw+0x401d1d)

Address 0x7fff4c61e7e0 is located in stack of thread T0 at offset 48 in frame
    #0 0x14f2cb2306f3 in ofi_addr_format src/common.c:397

  This frame has 1 object(s):
    [32, 48) 'fmt' <== Memory access at offset 48 overflows this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism or swapcontext
      (longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-buffer-overflow (/lib64/libasan.so.5+0xb40b8)
Shadow bytes around the buggy address:
  0x1000698bbca0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1000698bbcb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1000698bbcc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1000698bbcd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1000698bbce0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x1000698bbcf0: 00 00 00 00 00 00 f1 f1 f1 f1 00 00[f2]f2 f3 f3
  0x1000698bbd00: f3 f3 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1
  0x1000698bbd10: f1 f1 00 f2 f2 f2 f2 f2 f2 f2 00 f2 f2 f2 f2 f2
  0x1000698bbd20: f2 f2 00 f2 f2 f2 f2 f2 f2 f2 00 f2 f2 f2 f2 f2
  0x1000698bbd30: f2 f2 00 00 00 00 00 06 f2 f2 f2 f2 f2 f2 00 00
  0x1000698bbd40: 00 00 00 06 f2 f2 f2 f2 f2 f2 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb

Fixes: 5d31276 ("common: Redo address string conversions")
Signed-off-by: Honggang Li <[email protected]>
shefty pushed a commit that referenced this issue Dec 19, 2020
ooststep pushed a commit to ooststep/libfabric that referenced this issue Feb 10, 2023
If a posted receive matches with a saved receive, we may need to
increment the rx counter.  Set the rx counter increment callback
to match that of the posted receive.  This fixes an assert in
xnet_cntr_inc() accessing a NULL cntr_inc function pointer.

Program received signal SIGABRT, Aborted.
0x0000155552d4d37f in raise () from /lib64/libc.so.6
#0  0x0000155552d4d37f in raise () from /lib64/libc.so.6
#1  0x0000155552d37db5 in abort () from /lib64/libc.so.6
#2  0x0000155552d37c89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x0000155552d45a76 in __assert_fail () from /lib64/libc.so.6
#4  0x00001555522967f9 in xnet_cntr_inc (ep=0x6e4c70, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:347
#5  0x0000155552296836 in xnet_report_cntr_success (ep=0x6e4c70, cq=0x6ca930, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:354
#6  0x000015555229970d in xnet_complete_saved (saved_entry=0x6f7a30) at prov/tcp/src/xnet_progress.c:153
#7  0x0000155552299961 in xnet_recv_saved (saved_entry=0x6f7a30, rx_entry=0x6f7840) at prov/tcp/src/xnet_progress.c:188
#8  0x00001555522946f8 in xnet_srx_tag (srx=0x6dd1c0, recv_entry=0x6f7840) at prov/tcp/src/xnet_srx.c:445
#9  0x0000155552294bb1 in xnet_srx_trecv (ep_fid=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_srx.c:558
#10 0x000015555228f60e in fi_trecv (ep=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at ./include/rdma/fi_tagged.h:91
#11 0x00001555522900a7 in xnet_rdm_trecv (ep_fid=0x6d9fe0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_rdm.c:212

Signed-off-by: Sean Hefty <[email protected]>
ooststep pushed a commit to ooststep/libfabric that referenced this issue Feb 10, 2023
shefty added a commit that referenced this issue Feb 10, 2023