@hjelmn
Member

@hjelmn hjelmn commented Jan 5, 2015

Fix a couple of outstanding bugs and clean up the send/sendi code.

@miked-mellanox Please take a look.

@hjelmn
Member Author

hjelmn commented Jan 6, 2015

I should also mention that this fixes a long-outstanding bug that occurs when the openib btl attempts to coalesce fragments but runs out of descriptors. Looking for the bug id now.

@mellanox-github

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/160/

Build Log
last 50 lines

[...truncated 6557 lines...]
Making install in mca/btl/openib
make[2]: Entering directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/opal/mca/btl/openib'
  CC       btl_openib.lo
  CC       btl_openib_component.lo
  CC       btl_openib_endpoint.lo
  CC       btl_openib_frag.lo
  CC       btl_openib_proc.lo
  LEX      btl_openib_lex.c
  CC       btl_openib_mca.lo
  CC       btl_openib_ini.lo
  CC       btl_openib_async.lo
  CC       btl_openib_xrc.lo
  CC       btl_openib_fd.lo
  CC       btl_openib_ip.lo
  CC       connect/btl_openib_connect_base.lo
  CC       connect/btl_openib_connect_empty.lo
  CC       connect/btl_openib_connect_rdmacm.lo
  CC       connect/btl_openib_connect_udcm.lo
  CC       connect/btl_openib_connect_sl.lo
  CC       btl_openib_lex.lo
btl_openib_endpoint.c: In function 'mca_btl_openib_endpoint_post_send':
btl_openib_endpoint.c:89: error: 'mca_btl_base_descriptor_t' has no member named 'des_segments'
btl_openib_endpoint.c: In function 'mca_btl_openib_endpoint_destruct':
btl_openib_endpoint.c:392: warning: 'ibv_unreg_xrc_rcv_qp' is deprecated (declared at /usr/include/infiniband/ofa_verbs.h:204)
btl_openib.c: In function 'create_srq':
btl_openib.c:328: warning: 'ibv_create_xrc_srq' is deprecated (declared at /usr/include/infiniband/ofa_verbs.h:89)
btl_openib_xrc.c: In function 'mca_btl_openib_open_xrc_domain':
btl_openib_xrc.c:65: warning: 'ibv_open_xrc_domain' is deprecated (declared at /usr/include/infiniband/ofa_verbs.h:72)
btl_openib_xrc.c: In function 'mca_btl_openib_close_xrc_domain':
btl_openib_xrc.c:83: warning: 'ibv_close_xrc_domain' is deprecated (declared at /usr/include/infiniband/ofa_verbs.h:102)
btl_openib_async.c: In function 'mca_btl_openib_load_apm_xrc_rcv':
btl_openib_async.c:663: warning: 'ibv_query_xrc_rcv_qp' is deprecated (declared at /usr/include/infiniband/ofa_verbs.h:168)
btl_openib_async.c:680: warning: 'ibv_modify_xrc_rcv_qp' is deprecated (declared at /usr/include/infiniband/ofa_verbs.h:152)
make[2]: *** [btl_openib_endpoint.lo] Error 1
make[2]: *** Waiting for unfinished jobs....
connect/btl_openib_connect_udcm.c: In function 'udcm_xrc_recv_qp_connect':
connect/btl_openib_connect_udcm.c:2491: warning: 'ibv_reg_xrc_rcv_qp' is deprecated (declared at /usr/include/infiniband/ofa_verbs.h:185)
connect/btl_openib_connect_udcm.c: In function 'udcm_xrc_recv_qp_create':
connect/btl_openib_connect_udcm.c:2516: warning: 'ibv_create_xrc_rcv_qp' is deprecated (declared at /usr/include/infiniband/ofa_verbs.h:126)
connect/btl_openib_connect_udcm.c:2528: warning: 'ibv_modify_xrc_rcv_qp' is deprecated (declared at /usr/include/infiniband/ofa_verbs.h:152)
connect/btl_openib_connect_udcm.c:2569: warning: 'ibv_modify_xrc_rcv_qp' is deprecated (declared at /usr/include/infiniband/ofa_verbs.h:152)
make[2]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/opal/mca/btl/openib'
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory `/scrap/jenkins/jenkins/jobs/gh-ompi-master-pr/workspace/opal'
make: *** [install-recursive] Error 1
Build step 'Execute shell' marked build as failure
[BFA] Scanning build for known causes...

[BFA] Done. 0s

Test FAILed.

@hjelmn
Member Author

hjelmn commented Jan 6, 2015

Hmm, looks like I missed something pulling this out of my btl branch. Please ignore until I get this fixed.

hjelmn added 3 commits January 6, 2015 11:39
There was a bug in the openib btl handling this valid sequence of
calls:

desc = btl_alloc ();
btl_free (desc);

When triggered, the bug would cause either fragment loss or undefined
behavior (SEGV, etc.). The problem occurred because btl_alloc contained
the logic to modify the pending fragment (length, etc.) and these
changes were not reverted if the fragment was freed instead of sent.

To fix this issue I 1) moved some of the coalescing logic to the
btl_send function, and 2) retry the coalesced fragment on btl_free
if it was never sent. This appears to completely address the issue.
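
The fix described in this commit message can be illustrated with a toy model. This is only a sketch under stated assumptions: `toy_frag_t`, `toy_desc_t`, `toy_alloc`, `toy_send`, and `toy_free` are hypothetical names invented here, and none of this is the openib btl's actual code or data layout. It shows the idea of touching the pending fragment only at send time, and retrying the pending fragment when a coalesced descriptor is freed without being sent.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; names and fields are hypothetical, not the openib BTL's. */
typedef struct toy_frag {
    size_t length;    /* bytes already committed to this fragment */
    size_t capacity;  /* maximum payload the fragment can carry */
    int    queued;    /* 1 while the fragment waits for a send descriptor */
    int    posted;    /* 1 once the fragment has been handed to the wire */
} toy_frag_t;

typedef struct toy_desc {
    toy_frag_t *parent;  /* pending fragment this descriptor piggybacks on */
    size_t      size;    /* payload bytes reserved by toy_alloc() */
    int         sent;    /* 1 once toy_send() committed the payload */
} toy_desc_t;

/* Reserve room in the pending fragment WITHOUT modifying it. */
static int toy_alloc(toy_frag_t *parent, size_t size, toy_desc_t *desc)
{
    if (!parent->queued || size > parent->capacity - parent->length) {
        return -1;  /* nothing to coalesce with, or not enough room */
    }
    desc->parent = parent;
    desc->size = size;
    desc->sent = 0;
    return 0;
}

/* Commit the coalesced payload: the only place the parent is modified. */
static void toy_send(toy_desc_t *desc)
{
    assert(desc->parent != NULL && !desc->sent);
    desc->parent->length += desc->size;
    desc->sent = 1;
}

/* Release the descriptor; if it was never sent, retry the untouched parent. */
static void toy_free(toy_desc_t *desc)
{
    toy_frag_t *parent = desc->parent;
    if (parent != NULL && !desc->sent && parent->queued) {
        parent->queued = 0;  /* stand-in for re-posting the pending fragment */
        parent->posted = 1;
    }
    desc->parent = NULL;
}
```

In this model, calling `toy_alloc` and then `toy_free` without an intervening `toy_send` leaves the pending fragment's `length` unchanged, which is the property the alloc/free sequence above needs.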
…n place (was duplicated in the send and sendi paths)
@hjelmn
Member Author

hjelmn commented Jan 6, 2015

Should be good now.

@mellanox-github

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/164/
Test PASSed.

@hjelmn
Member Author

hjelmn commented Jan 6, 2015

@miked-mellanox Please let me know what you think of the changes.

@hjelmn hjelmn added the bug label Jan 7, 2015
@hjelmn hjelmn added this to the Open MPI 1.9 milestone Jan 7, 2015
@hjelmn
Member Author

hjelmn commented Jan 14, 2015

@jladd-mlnx Does Mellanox have an opinion on this pull request? If not, I will go ahead and merge it.

@jladd-mlnx
Member

@hjelmn I was only waiting to see what the issue was that @abouteiller was reporting regarding message coalescing. It seems your patch does indeed resolve his issue. I've reviewed your code and it looks fine. With regard to re-enabling message coalescing in 1.8.x, we really don't care one way or the other. Maybe @hppritcha or @miked-mellanox have a burning desire to see it re-enabled, but @bureddy and I do not.

@hppritcha
Member

I vote against re-enabling message coalescing in 1.8.X.

@rhc54
Contributor

rhc54 commented Jan 15, 2015

Regardless of the "vote", I'm not going to re-enable it in the 1.8 series, as it is of little (if any) value and caused trouble.

hjelmn added a commit that referenced this pull request Jan 15, 2015
@hjelmn hjelmn merged commit 006074c into open-mpi:master Jan 15, 2015
@abouteiller
Member

Nathan,

Remember that I set the coalescing MCA param to default to “false” yesterday. If you feel coalescing should be on by default, do not forget to revert that change.

Have fun,

Aurelien

      ~~~ Aurélien Bouteiller, Ph.D. ~~~
         ~ Research Scientist @ ICL ~

The University of Tennessee, Innovative Computing Laboratory
1122 Volunteer Blvd, suite 309, Knoxville, TN 37996
tel: +1 (865) 974-9375 fax: +1 (865) 974-8296
https://icl.cs.utk.edu/~bouteill/

On Jan 15, 2015, at 16:05, Nathan Hjelm [email protected] wrote:

Merged #332.


dong0321 pushed a commit to dong0321/ompi that referenced this pull request Feb 19, 2020
BrendanCunningham pushed a commit to BrendanCunningham/ompi that referenced this pull request Jul 13, 2022
From open-mpi#332:
  Here follows a bug report by **Guido Vranken** via the _Tor bug bounty program_. Please credit Guido accordingly.

  ## Bug report

  The DNS code of Libevent contains this rather obvious OOB read:

  ```c
  static char *
  search_make_new(const struct search_state *const state, int n, const char *const base_name) {
      const size_t base_len = strlen(base_name);
      const char need_to_append_dot = base_name[base_len - 1] == '.' ? 0 : 1;
  ```

  If the length of ```base_name``` is 0, then line 3125 reads 1 byte before the buffer. This will trigger a crash on ASAN-protected builds.

  To reproduce:

  Build libevent with ASAN:
  ```
  $ CFLAGS='-fomit-frame-pointer -fsanitize=address' ./configure && make -j4
  ```
  Put the attached ```resolv.conf``` and ```poc.c``` in the source directory and then do:

  ```
  $ gcc -fsanitize=address -fomit-frame-pointer poc.c .libs/libevent.a
  $ ./a.out
  =================================================================
  ==22201== ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60060000efdf at pc 0x4429da bp 0x7ffe1ed47300 sp 0x7ffe1ed472f8
  READ of size 1 at 0x60060000efdf thread T0
  ```

P.S. we can add a check earlier, but since this is very uncommon, I didn't add it.

Fixes: open-mpi#332
Signed-off-by: Brendan Cunningham <[email protected]>
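
The out-of-bounds read quoted in the commit message above comes from evaluating `base_name[base_len - 1]` when `base_len` is 0. A minimal sketch of the kind of guard that avoids it follows. `needs_trailing_dot` is a hypothetical helper written for illustration; it is not part of Libevent's API and is not necessarily the fix that was actually applied.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a Libevent function).
 * Returns 1 if a '.' must be appended to base_name,
 *         0 if base_name already ends in '.',
 *        -1 if base_name is NULL or empty.
 */
static int needs_trailing_dot(const char *base_name)
{
    size_t base_len;

    if (base_name == NULL) {
        return -1;
    }
    base_len = strlen(base_name);
    if (base_len == 0) {
        /* Without this check, base_name[base_len - 1] would index
         * (size_t)-1 and read one byte before the buffer. */
        return -1;
    }
    return base_name[base_len - 1] == '.' ? 0 : 1;
}
```

A caller would treat a negative return value as "reject this name" before building the search string, so the empty-name case never reaches the indexing expression.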