man: clarify fi_cancel as time bounded #7

sayantansur · 2014-09-02T21:53:59Z

fi_cancel should return within a bounded period of time,
i.e. it cannot block indefinitely. Some providers may need
to perform additional actions that may not be "immediate".

Based on OFIWG discussion on 9/2.

shefty · 2014-09-02T22:13:51Z

Rather than stating that cancel should return within a bounded time, it would be better to say that the cancel operation will complete within a bounded time.

sayantansur · 2014-09-02T22:18:29Z

Sounds good. Is cancel a synchronous or asynchronous op? I thought it was synchronous, so I used the term return to imply complete.

jsquyres · 2014-09-02T22:20:23Z

I'd prefer async. Everything should be async.

I.e., who knows if some hardware will need to do something wonky / time-consuming to cancel something out of its receive queue.

Just my $0.02...

sayantansur · 2014-09-02T22:26:43Z

Seems like a reasonable request to me.

I would request, though, that it be clear where the completions of such things appear. Mixing these completions with data transfer completions makes a provider writer's job that much harder (especially where performance is concerned).

shefty · 2014-09-02T23:01:40Z

There are 2 operations of concern here. The original request being canceled and the cancel operation itself. I envision the original request completing with FI_ECANCELED (assuming it didn't complete first). I don't know that cancel actually needs to generate a completion entry at all, or if the cancel call needs to generate an error relative to the process of canceling the request.

sayantansur · 2014-09-02T23:06:45Z

Ah. Gotcha.

Github noob. Now what happens? I retract this pull request and submit a new one?

shefty · 2014-09-02T23:18:35Z

I believe you just need to update your repo with the changes and the existing pull request should update. But I've never done this yet. :)

jsquyres · 2014-09-02T23:23:25Z

It'll be a good experiment. Update us and let us know what happens. :-)

Better wording. There are 2 operations of concern here. The original request being canceled and the cancel operation itself.

sayantansur · 2014-09-02T23:26:10Z

I updated my branch. Let's hope it updated the pull request too.

man: clarify fi_cancel as time bounded

jsquyres · 2014-09-03T00:07:08Z

@sayantansur So you added another commit to the branch that made the end state be the way we wanted it, right?

I.e., ea50b71 was the original commit, and 1e82ec6 was the second commit to make the change.

I wonder if there's a way to change the original commit (so that there would only be 1 commit, not 2) and still have the pull request be valid...?

shefty · 2014-09-03T00:10:13Z

> I wonder if there's a way to change the original commit (so that there would only be 1 commit, not 2) and still have the pull request be valid...?

Locally, he should be able to modify the original commit and do a force push. I agree that there should only be a single commit.
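
A minimal sketch of that amend-and-force-push flow, run in a throwaway sandbox. It assumes a local bare repository standing in for the contributor's GitHub fork; the file name, file contents, and user identity are placeholders, and only the branch name manpage-updates comes from this pull request.

```shell
#!/bin/sh
set -eu

# Sandbox: a local bare repository stands in for the contributor's fork.
work=$(mktemp -d)
git init -q --bare "$work/fork.git"
git init -q "$work/clone"
cd "$work/clone"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
git config commit.gpgsign false
git remote add origin "$work/fork.git"

# Base commit, then the pull request branch with its single commit.
git commit -q --allow-empty -m "base"
base=$(git rev-parse HEAD)
git checkout -q -b manpage-updates
echo "fi_cancel returns within a bounded time" > fi_cancel.txt   # hypothetical file
git add fi_cancel.txt
git commit -q -m "man: clarify fi_cancel as time bounded"
git push -q origin manpage-updates
before=$(git rev-parse HEAD)

# Review feedback arrives: rewrite the original commit instead of adding one.
echo "the cancel operation completes within a bounded time" > fi_cancel.txt
git commit -q -a --amend --no-edit

# The branch history was rewritten, so the push has to be forced.
git push -q --force origin manpage-updates

after=$(git rev-parse HEAD)
remote_head=$(git --git-dir="$work/fork.git" rev-parse manpage-updates)
remote_count=$(git --git-dir="$work/fork.git" rev-list --count "$base..manpage-updates")
echo "commit changed: $before -> $after; commits on fork branch: $remote_count"
```

The amended commit gets a new hash, which is why the push must be forced; the branch on the fork still carries exactly one commit on top of the base.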

sayantansur · 2014-09-03T00:11:17Z

Yes.

I liked doing it this way. It seemed to me that GitHub wants you to have a
branch per pull request pretty much. So, it's not a big deal to have
another commit. Although that's not really answering your question :)

sayantansur · 2014-09-03T00:12:30Z

I'll force push next time with one commit.

jsquyres · 2014-09-03T00:19:43Z

I think you could:

do another commit
git rebase -i
squash the 2 commits together
git push --mirror

and that would have done it, right?

Once the PR is done, then you can just kill the branch. I.e., the "never modify public history" rule doesn't apply to a branch whose sole purpose is to be pulled, right?
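
The steps above can be sketched as a script, under some assumptions: a local bare repository stands in for the GitHub fork, the file name and contents are placeholders, and the interactive rebase is driven by a scripted sequence editor so that it runs unattended. The sketch uses `fixup` (squash, keeping the first commit message) and force-pushes only the one branch rather than using `git push --mirror`, since `--mirror` pushes every local ref and removes remote refs that have no local counterpart.

```shell
#!/bin/sh
set -eu

# Sandbox: a local bare repository stands in for the contributor's fork.
work=$(mktemp -d)
git init -q --bare "$work/fork.git"
git init -q "$work/clone"
cd "$work/clone"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
git config commit.gpgsign false
git config rebase.abbreviateCommands false # keep "pick" spelled out in the todo list
git remote add origin "$work/fork.git"

git commit -q --allow-empty -m "base"
base=$(git rev-parse HEAD)
git checkout -q -b manpage-updates

# Two commits on the pull request branch, as happened in this thread.
echo "first wording" > fi_cancel.txt       # hypothetical file and contents
git add fi_cancel.txt
git commit -q -m "man: clarify fi_cancel as time bounded"
echo "better wording" > fi_cancel.txt
git commit -q -a -m "man: update fi_cancel completion"
git push -q origin manpage-updates

# Scripted "git rebase -i": change line 2 of the todo list from pick to fixup,
# which folds the second commit into the first and keeps the first message.
GIT_SEQUENCE_EDITOR='sed -i.bak "2s/^pick/fixup/"' git rebase -q -i "$base"

# Publish the rewritten branch.
git push -q --force origin manpage-updates

local_count=$(git rev-list --count "$base..HEAD")
remote_count=$(git --git-dir="$work/fork.git" rev-list --count "$base..manpage-updates")
subject=$(git log -1 --format=%s)
content=$(cat fi_cancel.txt)
echo "local commits: $local_count, fork commits: $remote_count, subject: $subject"
```

Done by hand, the same rebase opens the todo list in an editor, where the second line is changed from `pick` to `squash` or `fixup`.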

jsquyres · 2014-09-03T00:34:55Z

Just for giggles, it might be worth trying with your next man page update, i.e., issue a PR with 1 commit, then make another commit, rebase/squash it, and push --mirror, and then see what happens to the PR.

sayantansur · 2014-09-03T00:38:47Z

Will do.

dledford · 2014-09-03T03:56:32Z

On 09/02/2014 08:19 PM, Jeff Squyres wrote:

> I think you could:
>
> do another commit
> git rebase -i
> squash the 2 commits together
> git push --mirror
>
> and that would have done it, right?
>
> Once the PR is done, then you can just kill the branch. I.e., the "never modify public history" rule doesn't apply to a branch whose sole purpose is to be pulled, right?

That's a matter of policy. If you have a branch you wish people to
test, then maybe it's best to apply main stream branch policy to those
temporary branches.

The git post I linked in the wiki page talked about "sausage making".
That's what's relevant here.

shefty · 2014-09-03T17:41:26Z

I use stgit to manage my patches (similar to quilt). What those convert to in git terms, I don't know, but squash is probably it. For development trees, I rework my patches constantly. Trying to apply mainstream branch policy to development branches would likely introduce commits which clearly add errors, making it difficult to bisect. And the ability to cherry-pick patches is often necessary, in order to reduce the number of patches that someone needs to maintain while they develop the code. So... in some cases, modifying the patch is the correct approach. But there are times when a requested change could be made using a follow-on patch without any issues.

Here is the deadlock scenario:

#0 0x00007fed3a439495 in pthread_spin_lock ()
#1 0x00007fed37ad7cfd in fastlock_acquire ()
#2 0x00007fed37ad80a4 in psmx2_lock ()
#3 0x00007fed37ad8361 in psmx2_am_trx_ctxt_handler_ext ()
#4 0x00007fed37b084e7 in psmx2_am_trx_ctxt_handler_0 ()
#5 0x00007fed373c08c5 in self_am_short_request ()
#6 0x00007fed3739bf83 in __psm2_am_request_short ()
#7 0x00007fed37ad84ee in psmx2_trx_ctxt_disconnect_peers ()

A lock has been held in psmx2_trx_ctxt_disconnect_peers before psm2_am_request_short is called. While making progress inside this function, the execution is redirected to the AM handler due to the arrival of an incoming disconnection request. The AM handler tries to acquire the same lock that has already been held and reaches a deadlock.

Fix by avoiding calling psm2_am_request_short while holding the lock.

Signed-off-by: Jianxin Xiong <[email protected]>

Problem reported by Address Sanitizer:

=================================================================
==25220==ERROR: AddressSanitizer: heap-use-after-free on address 0x6270000072e0 at pc 0x00010b926a3c bp 0x700001bd1c30 sp 0x700001bd1c28
READ of size 4 at 0x6270000072e0 thread T4
    #0 0x10b926a3b in sock_conn_listener_thread (libfabric.1.dylib:x86_64+0xdca3b)
    #1 0x7fff7e2d5660 in _pthread_body (libsystem_pthread.dylib:x86_64+0x3660)
    #2 0x7fff7e2d550c in _pthread_start (libsystem_pthread.dylib:x86_64+0x350c)
    #3 0x7fff7e2d4bf8 in thread_start (libsystem_pthread.dylib:x86_64+0x2bf8)

0x6270000072e0 is located 480 bytes inside of 12944-byte region [0x627000007100,0x62700000a390)
freed by thread T0 here:
    #0 0x10baf1a9d in wrap_free (libclang_rt.asan_osx_dynamic.dylib:x86_64+0x56a9d)
    #1 0x10b9016bf in sock_ep_close (libfabric.1.dylib:x86_64+0xb76bf)
    #2 0x10b7f4a8f in fi_close fabric.h:593
    #3 0x10b7f4209 in main shared_ctx.c:649
    #4 0x7fff7dfbd014 in start (libdyld.dylib:x86_64+0x1014)

previously allocated by thread T0 here:
    #0 0x10baf1e27 in wrap_calloc (libclang_rt.asan_osx_dynamic.dylib:x86_64+0x56e27)
    #1 0x10b906df4 in sock_alloc_endpoint (libfabric.1.dylib:x86_64+0xbcdf4)
    #2 0x10b8f7fdb in sock_msg_ep (libfabric.1.dylib:x86_64+0xadfdb)
    #3 0x10b7f7c93 in fi_endpoint fi_endpoint.h:164
    #4 0x10b7f5e40 in server_connect shared_ctx.c:471
    #5 0x10b7f49ba in run shared_ctx.c:573
    #6 0x10b7f411b in main shared_ctx.c:647
    #7 0x7fff7dfbd014 in start (libdyld.dylib:x86_64+0x1014)

Thread T4 created by T0 here:
    #0 0x10bae999d in wrap_pthread_create (libclang_rt.asan_osx_dynamic.dylib:x86_64+0x4e99d)
    #1 0x10b925f9b in sock_conn_start_listener_thread (libfabric.1.dylib:x86_64+0xdbf9b)
    #2 0x10b8e7eb2 in sock_domain (libfabric.1.dylib:x86_64+0x9deb2)
    #3 0x10b7f87d3 in fi_domain fi_domain.h:306
    #4 0x10b7f5c9f in server_connect shared_ctx.c:460
    #5 0x10b7f49ba in run shared_ctx.c:573
    #6 0x10b7f411b in main shared_ctx.c:647
    #7 0x7fff7dfbd014 in start (libdyld.dylib:x86_64+0x1014)

The issue shows up more frequently on OS X, which emulates epoll. However, I believe the problem could occur on any platform.

In sock_ep_close, we remove the socket from the epoll fd, then free the endpoint. However, if the listener thread has received an event on the socket, but has not yet started processing it, then a race can occur. The listener thread could have returned from ofi_epoll_wait, but suspended trying to acquire the signal_lock. The signal_lock is acquired from sock_ep_close, where ofi_epoll_del is called, then released. The endpoint is then freed. The listener thread can now acquire the signal_lock, where it will attempt to access the freed endpoint data.

To avoid the race, we add a change boolean to the listener. That boolean is only changed while holding the signal_lock. When a socket is removed from the epollfd, we mark the listener state as 'changed'. The listener thread checks the changed state prior to processing any events. If set, it clears the state, and calls ofi_epoll_wait again to get a new set of events to process.

Note that this works for epoll set to level-triggered (poll semantics). Sockets that reported events will report those same events when wait is called a second time. Sockets which were removed from the epoll set would have their events removed, as they are no longer being monitored.

This fix is applied both to the listener thread and cm thread.

Signed-off-by: Sean Hefty <[email protected]>

ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fff4c61e7e0 at pc 0x14f2cb7ae0b9 bp 0x7fff4c61e650 sp 0x7fff4c61ddd8
WRITE of size 17 at 0x7fff4c61e7e0 thread T0
    #0 0x14f2cb7ae0b8 (/lib64/libasan.so.5+0xb40b8)
    #1 0x14f2cb7aedd2 in vsscanf (/lib64/libasan.so.5+0xb4dd2)
    #2 0x14f2cb7aeede in __interceptor_sscanf (/lib64/libasan.so.5+0xb4ede)
    #3 0x14f2cb230766 in ofi_addr_format src/common.c:401
    #4 0x14f2cb233238 in ofi_str_toaddr src/common.c:780
    #5 0x14f2cb314332 in vrb_handle_ib_ud_addr prov/verbs/src/verbs_info.c:1670
    #6 0x14f2cb314332 in vrb_get_match_infos prov/verbs/src/verbs_info.c:1787
    #7 0x14f2cb314332 in vrb_getinfo prov/verbs/src/verbs_info.c:1841
    #8 0x14f2cb21fc28 in fi_getinfo_ src/fabric.c:1010
    #9 0x14f2cb25fcc0 in ofi_get_core_info prov/util/src/util_attr.c:298
    #10 0x14f2cb269b20 in ofix_getinfo prov/util/src/util_attr.c:321
    #11 0x14f2cb3e29fd in rxd_getinfo prov/rxd/src/rxd_init.c:122
    #12 0x14f2cb21fc28 in fi_getinfo_ src/fabric.c:1010
    #13 0x407150 in ft_getinfo common/shared.c:794
    #14 0x414917 in ft_init_fabric common/shared.c:1042
    #15 0x402f40 in run functional/bw.c:155
    #16 0x402f40 in main functional/bw.c:252
    #17 0x14f2ca1b28e2 in __libc_start_main (/lib64/libc.so.6+0x238e2)
    #18 0x401d1d in _start (/root/libfabric/fabtests/functional/fi_bw+0x401d1d)

Address 0x7fff4c61e7e0 is located in stack of thread T0 at offset 48 in frame
    #0 0x14f2cb2306f3 in ofi_addr_format src/common.c:397

  This frame has 1 object(s):
    [32, 48) 'fmt' <== Memory access at offset 48 overflows this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism or swapcontext
      (longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-buffer-overflow (/lib64/libasan.so.5+0xb40b8)
Shadow bytes around the buggy address:
  0x1000698bbca0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1000698bbcb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1000698bbcc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1000698bbcd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1000698bbce0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x1000698bbcf0: 00 00 00 00 00 00 f1 f1 f1 f1 00 00[f2]f2 f3 f3
  0x1000698bbd00: f3 f3 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1
  0x1000698bbd10: f1 f1 00 f2 f2 f2 f2 f2 f2 f2 00 f2 f2 f2 f2 f2
  0x1000698bbd20: f2 f2 00 f2 f2 f2 f2 f2 f2 f2 00 f2 f2 f2 f2 f2
  0x1000698bbd30: f2 f2 00 00 00 00 00 06 f2 f2 f2 f2 f2 f2 00 00
  0x1000698bbd40: 00 00 00 06 f2 f2 f2 f2 f2 f2 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb

Fixes: 5d31276 ("common: Redo address string conversions")
Signed-off-by: Honggang Li <[email protected]>

If a posted receive matches with a saved receive, we may need to increment the rx counter. Set the rx counter increment callback to match that of the posted receive.

This fixes an assert in xnet_cntr_inc() accessing a NULL cntr_inc function pointer.

Program received signal SIGABRT, Aborted.
0x0000155552d4d37f in raise () from /lib64/libc.so.6
#0 0x0000155552d4d37f in raise () from /lib64/libc.so.6
#1 0x0000155552d37db5 in abort () from /lib64/libc.so.6
#2 0x0000155552d37c89 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3 0x0000155552d45a76 in __assert_fail () from /lib64/libc.so.6
#4 0x00001555522967f9 in xnet_cntr_inc (ep=0x6e4c70, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:347
#5 0x0000155552296836 in xnet_report_cntr_success (ep=0x6e4c70, cq=0x6ca930, xfer_entry=0x6f7a30) at prov/tcp/src/xnet_cq.c:354
#6 0x000015555229970d in xnet_complete_saved (saved_entry=0x6f7a30) at prov/tcp/src/xnet_progress.c:153
#7 0x0000155552299961 in xnet_recv_saved (saved_entry=0x6f7a30, rx_entry=0x6f7840) at prov/tcp/src/xnet_progress.c:188
#8 0x00001555522946f8 in xnet_srx_tag (srx=0x6dd1c0, recv_entry=0x6f7840) at prov/tcp/src/xnet_srx.c:445
#9 0x0000155552294bb1 in xnet_srx_trecv (ep_fid=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_srx.c:558
#10 0x000015555228f60e in fi_trecv (ep=0x6dd1c0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at ./include/rdma/fi_tagged.h:91
#11 0x00001555522900a7 in xnet_rdm_trecv (ep_fid=0x6d9fe0, buf=0x6990c4, len=4, desc=0x0, src_addr=0, tag=21474836494, ignore=3458764513820540928, context=0x7ffffffeb180) at prov/tcp/src/xnet_rdm.c:212

Signed-off-by: Sean Hefty <[email protected]>

man: clarify fi_cancel as time bounded

1e82ec6

fi_cancel should return within a bounded period of time, i.e. it cannot block indefinitely. Some providers may need to perform additional actions that may not be "immediate". Based on OFIWG discussion on 9/2.

man: update fi_cancel completion

ea50b71

Better wording. There are 2 operations of concern here. The original request being canceled and the cancel operation itself.

shefty added a commit that referenced this pull request Sep 2, 2014

Merge pull request #7 from sayantansur/manpage-updates

6872711

man: clarify fi_cancel as time bounded

shefty merged commit 6872711 into ofiwg:master Sep 2, 2014

sayantansur deleted the manpage-updates branch September 3, 2014 17:49

shefty mentioned this pull request Nov 5, 2014

Including usnic in build, but without using it results in crash #270

Closed

shefty mentioned this pull request Feb 26, 2015

sockets provider occasionally hangs #701

Closed

shefty mentioned this pull request Mar 6, 2015

prov/sockets: fi_cmatose hangs #725

Closed

shefty mentioned this pull request Sep 26, 2015

crash in sockets provider during finalize of fi_rdm_multi_recv #1309

Closed

tenbrugg mentioned this pull request Jun 27, 2016

running SNAP on 1k ranks with OpenMPI causes seg fault #2162

Closed

aingerson referenced this pull request in aingerson/libfabric Jan 25, 2017

Merge pull request #7 from shefty/master

793ef3c

Update fabtests to match with latest libfabric API changes

shefty pushed a commit that referenced this pull request Aug 29, 2017

OFI/MLX: fixed warnings reported by GCC 7.1 and updated documentation.

ee512f3

fixed HPCX <=v1.9.7 support (#7) Signed-off-by: Sannikov, Alexander <[email protected]> Signed-off-by: Dmitry Gladkov <[email protected]>

j-xiong mentioned this pull request Dec 12, 2017

prov/psm2: Fix a deadlock in connection cleanup handler #3613

Merged

frostedcmos mentioned this pull request Dec 4, 2019

hfd5 parallel test suite segfaults in psm2 provider #5478

Closed

krehm mentioned this pull request Jan 28, 2020

program deadlocks with ofi_uffd_handler() #5580

Closed

frostedcmos mentioned this pull request Feb 19, 2020

ofi+verbs;ofi_rxm segfault in vrb_poll_cq() #5653

Closed

frostedcmos mentioned this pull request Feb 28, 2020

Verbs hangs in ibv_reg_mr() / fi mr caching issue #5687

Closed

bsbernd mentioned this pull request Apr 30, 2020

prov/tcp dead stall - generic ofi_write_socket buffer size issue #5897

Closed

swelch mentioned this pull request Aug 28, 2020

prov/verbs: account for off-by-one credit initialization #6212

Merged

Honggang-LI mentioned this pull request Dec 17, 2020

src/common.c: fix a stack-buffer-overflow issue #6466

Merged

shefty mentioned this pull request Dec 18, 2020

src/common.c: fix a stack-buffer-overflow issue #6471

Merged

frostedcmos mentioned this pull request Mar 30, 2021

DAOS: rxm crash in rxm_conn_close() on the server when client exits during rdma transfer #6665

Closed

frostedcmos mentioned this pull request Aug 5, 2021

DAOS: verbs;rxm - latest ofi main causes mem corruption when running at scale #6973

Closed

This was referenced Dec 8, 2021

DAOS: verbs;rxm - fi_cancel() error handling issue #7287

Closed

segfault in rxm_open_conn on master branch (NULL provider name) #7300

Closed

chien-intel mentioned this pull request Jun 6, 2022

prov/efa: fi_info crash in a system with mlnx but no efa defice #7805

Closed

bfaccini mentioned this pull request Jul 11, 2022

prov/verbs;ofi_rxm: rxm_handle_error():793<warn> fi_eq_readerr: err: Connection refused (111), prov_err: Unknown error -8 (-8) #7880

Closed

aingerson mentioned this pull request Oct 11, 2022

prov/psm3: race causing hangs in fi_multinode test #8090

Closed

aingerson mentioned this pull request Dec 5, 2022

fi_rdm_tagged_peek failures on occasional CI runs #8249

Closed

finjulhich mentioned this pull request May 13, 2023

prov/psm3: illegal instruction #8933

Closed

Juee14Desai mentioned this pull request Sep 15, 2023

prov/verbs: Few fabtests failing after setting FI_OFI_RXM_USE_SRX=true for verbs;ofi_rxm #9336

Closed

zachdworkin mentioned this pull request Jun 25, 2024

prov/psm3: "munmap_chunk(): invalid pointer" on cleanup of fi_rdm_tagged_peek with OOB #10123

Open

This pull request was closed.
