@jsquyres
Member

@jsquyres jsquyres commented Sep 5, 2022

NOTE: This PR is a proposal after the discussion from #10729. It also currently includes the docs commits from #10762 so that I could include docs for opal_mca_base_component_show_load_errors. Hence, only the non-docs commits on this PR are relevant. When #10762 is merged, those docs commits will disappear from this PR.


Convert the MCA parameter "opal_mca_base_component_show_load_errors"
to be a flexible mechanism to specify when (and when not) to emit
warnings about errors when trying to load DSO components.

  1. Convert the existing MCA parameter
    opal_mca_base_component_show_load_errors from a boolean to a
    string. It will still accept all prior valid boolean values, but
    it will also accept a comma-delimited list of "framework[/component]"
    tokens. If the MCA base encounters an error when loading a DSO,
    opal_mca_base_component_show_load_errors is checked to see if a
    warning should be emitted.

    • If the value is boolean true or the string "all", then emit a
      warning
    • If the value is boolean false or the string "none", then do not
      emit a warning
    • If the value is a comma-delimited list of tokens: emit a warning
      about any dynamic component that fails to open and matches a
      token in the list. "Match" is defined as:
      • If a token in the list is only a framework name, then any
        component in that framework will match.
      • If a token in the list specifies both a framework name and a
        component name (in the form framework/component), then only
        the specified component in the specified framework will match.
    • The value can also be a "^" character followed by a
      comma-delimited list of "framework[/component]" values: This is
      similar to the comma-delimited list of tokens, except it will
      only emit warnings about dynamic components that fail to load and
      do not match a token in the list.

    NOTE: The equivalence of "all" with boolean true values, and
    "none" with boolean false values is only intended as a
    backwards compatibility mechanism, since prior to this
    commit, opal_mca_base_component_show_load_errors was a
    boolean value. It is not intended as a general mechanism
    that should be copied to all other include/exclude-type MCA
    params.

  2. Remove the configure option --enable-show-load-errors-by-default
    and replace it with --with-show-load-errors[=value]. The value
    specified will become the default value of the
    opal_mca_base_component_show_load_errors MCA variable (it defaults
    to "all").

    The CLI option name change is intentional. The previous MCA
    parameter only accepted boolean values; the new CLI name reflects
    that it can accept more than just boolean values.
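
A minimal sketch of the matching rules from item 1, written in Python for illustration only. It is not the code in this PR (the MCA base is written in C). The names `should_warn` and `_matches`, and the two sets of accepted boolean spellings, are assumptions made for this example; the real parameter accepts all prior valid boolean values, which are not enumerated here.

```python
# Illustrative subset of boolean spellings (assumption; the real MCA
# boolean parser may accept more forms than these).
ASSUMED_TRUE = {"true", "1", "all"}
ASSUMED_FALSE = {"false", "0", "none"}


def _matches(token, framework, component):
    """Return True if one "framework[/component]" token matches."""
    fw, sep, comp = token.partition("/")
    if fw != framework:
        return False
    # A bare framework name matches every component in that framework;
    # "framework/component" matches only that one component.
    return not sep or comp == component


def should_warn(value, framework, component):
    """Decide whether a DSO load failure for framework/component
    should produce a warning, given the parameter value."""
    v = value.strip()
    if v.lower() in ASSUMED_TRUE:
        return True
    if v.lower() in ASSUMED_FALSE:
        return False

    # A leading "^" inverts the meaning of the token list.
    negate = v.startswith("^")
    if negate:
        v = v[1:]

    tokens = [t.strip() for t in v.split(",") if t.strip()]
    matched = any(_matches(t, framework, component) for t in tokens)
    return not matched if negate else matched
```

Under these assumptions, `should_warn("^accelerator", "accelerator", "cuda")` is false (the warning is suppressed), while `should_warn("^accelerator", "pml", "ucx")` is true. The component name `cuda` is only a placeholder here.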

The rationale for this commit is to allow packagers more granular
control over whether to warn about component DSO load failures or not.

The canonical example of where this is useful is accelerator
libraries: since accelerators are expensive, they may only be
available on a subset of nodes in a given HPC environment.
Consequently, the accelerator's support libraries may only be loaded
on the nodes that actually have accelerators physically present. In
such an environment, an administrator or packager may wish to
configure Open MPI:

  1. With accelerator components built as DSOs.
  2. Do not warn about accelerator DSO component load failures.

For example:

./configure --enable-mca-dso=accelerator ...
make install
mpirun --mca opal_mca_base_component_show_load_errors '^accelerator' ...

Signed-off-by: Jeff Squyres [email protected]

@jsquyres jsquyres force-pushed the pr/show-load-errors----or-not branch from e94b008 to 9cf538e Compare September 5, 2022 15:33
rhc54 added a commit to rhc54/openpmix that referenced this pull request Sep 6, 2022
to be a flexible mechanism to specify when (and when not) to emit
warnings about errors when trying to load DSO components.

Port of open-mpi/ompi#10763

Signed-off-by: Ralph Castain <[email protected]>
rhc54 added a commit to openpmix/openpmix that referenced this pull request Sep 6, 2022
to be a flexible mechanism to specify when (and when not) to emit
warnings about errors when trying to load DSO components.

Port of open-mpi/ompi#10763

Signed-off-by: Ralph Castain <[email protected]>
rhc54 added a commit to rhc54/openpmix that referenced this pull request Sep 6, 2022
to be a flexible mechanism to specify when (and when not) to emit
warnings about errors when trying to load DSO components.

Port of open-mpi/ompi#10763

Signed-off-by: Ralph Castain <[email protected]>
(cherry picked from commit 6927bc4)
@jsquyres jsquyres force-pushed the pr/show-load-errors----or-not branch from 9cf538e to ce6aec5 Compare September 6, 2022 13:53
@jsquyres
Member Author

jsquyres commented Sep 6, 2022

Updates this morning to rebase on top of minor updates to #10762.

rhc54 added a commit to openpmix/openpmix that referenced this pull request Sep 6, 2022
to be a flexible mechanism to specify when (and when not) to emit
warnings about errors when trying to load DSO components.

Port of open-mpi/ompi#10763

Signed-off-by: Ralph Castain <[email protected]>
(cherry picked from commit 6927bc4)
@jjhursey
Member

jjhursey commented Sep 6, 2022

A couple of quick questions:

  • Does "all" mean true? "all" is mentioned as a valid option to ./configure --enable-mca-dso=, but not as a valid value of the MCA parameter opal_mca_base_component_show_load_errors.
  • If all == true then can we add a none option that means false?

To my knowledge, this would be the first include/exclude option (like -mca btl ^openib/-mca btl tcp,self) that would also have a true/false option for all/none. I think it works in this case. I'm not sure if it would work in others, but it could, depending on the use case. So (without looking at the code yet) I would suggest keeping the true/false support separate from the include/exclude processing, so as not to provide this capability more widely than necessary.

@jsquyres
Member Author

jsquyres commented Sep 6, 2022

@jjhursey Good comments about all/true/none/false. And to clarify: yes, all == true, and none == false. I will update the commit message / PR description. I added these equivalencies in order to preserve backwards compatibility (i.e., when opal_mca_base_component_show_load_errors was a boolean). I agree that this behavior probably shouldn't be copied to other include/exclude variables. I'll add a note about this in the commit message/description, too.

@jsquyres jsquyres force-pushed the pr/show-load-errors----or-not branch 2 times, most recently from 5ca2349 to 419469d Compare September 6, 2022 21:23
@jsquyres
Member Author

jsquyres commented Sep 6, 2022

Had to force push the same commits (rebased to change their hashes) to un-stick the AWS Jenkins on this PR.

@rhc54
Contributor

rhc54 commented Sep 6, 2022

@jsquyres Please let me know if you make a material change so I can port it.

@jsquyres
Member Author

jsquyres commented Sep 7, 2022

@rhc54 Will do. Nothing yet; just rebasing and futzing to get CI to work properly.

@jsquyres jsquyres force-pushed the pr/show-load-errors----or-not branch from 419469d to f11f0cc Compare September 7, 2022 01:39
@jsquyres
Member Author

jsquyres commented Sep 7, 2022

#10762 merged; so I rebased to eliminate the extra docs commits on this PR. This PR is now just the "show load errors" proposal commit.

@jsquyres
Member Author

jsquyres commented Sep 7, 2022

bot:aws:recheck

@jsquyres
Member Author

jsquyres commented Sep 8, 2022

@edgargabriel @wckzhang Can you please test this PR and see if it delivers the functionality that you're looking for?

@jsquyres
Member Author

@edgargabriel @wckzhang Ping ^^

@edgargabriel
Member

I literally started compiling the branch 2 mins ago :-). Should have some feedback later this morning.

@edgargabriel
Member

edgargabriel commented Sep 12, 2022

So here is what I did for testing: I compiled Open MPI with UCX support, then removed the UCX libraries after compilation/installation.

./configure --enable-mca-dso=pml,osc,common-ucx --with-ucx=/my/ucx/dir/
make -j
make install

After removing the UCX directory, the executable still worked but, as expected, generated warnings:

egabriel@ZT-DH170-26:~/tmp/show-load-errors$ mpirun -np 2 ./mpi_hello
[ZT-DH170-26:1009588] mca_base_component_repository_open: unable to open mca_btl_uct: libucm.so.0: cannot open shared object file: No such file or directory (ignored)
[ZT-DH170-26:1009589] mca_base_component_repository_open: unable to open mca_btl_uct: libucm.so.0: cannot open shared object file: No such file or directory (ignored)
[ZT-DH170-26:1009588] mca_base_component_repository_open: unable to open mca_pml_ucx: libucp.so.0: cannot open shared object file: No such file or directory (ignored)
[ZT-DH170-26:1009589] mca_base_component_repository_open: unable to open mca_pml_ucx: libucp.so.0: cannot open shared object file: No such file or directory (ignored)
[ZT-DH170-26:1009588] mca_base_component_repository_open: unable to open mca_osc_ucx: libucp.so.0: cannot open shared object file: No such file or directory (ignored)
[ZT-DH170-26:1009589] mca_base_component_repository_open: unable to open mca_osc_ucx: libucp.so.0: cannot open shared object file: No such file or directory (ignored)
Hello World from rank 1
Hello World from rank 0

I subsequently tried two parameter settings; both worked:

egabriel@ZT-DH170-26:~/tmp/show-load-errors$ mpirun --mca opal_mca_base_component_show_load_errors false -np 2 ./mpi_hello
Hello World from rank 1
Hello World from rank 0

and

egabriel@ZT-DH170-26:~/tmp/show-load-errors$ mpirun --mca opal_mca_base_component_show_load_errors ^pml,btl,osc  -np 2 ./mpi_hello
Hello World from rank 0
Hello World from rank 1

(Note: I did get some additional warnings, such as

[LOG_CAT_ML] component ucx_p2p is not available but requested in hierarchy: ucx_p2p:p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error

but I assume they are unrelated to this issue and PR, so I ignored/removed them from the output).

So I would say that this PR looks good overall and is in line with how I think it could/should work.
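
For reference, a small sketch (Python, illustration only) of how the framework/component pairs can be read off the warning lines quoted in the test above. The regular expression assumes the DSO name has the form `mca_<framework>_<component>` and that the framework name contains no underscore; the function name `failed_components` is hypothetical.

```python
import re

# Assumed log format, taken from the warnings quoted in this comment:
#   ... unable to open mca_<framework>_<component>: <reason>
# Assumption: the framework name itself contains no underscore.
PATTERN = re.compile(r"unable to open mca_([^_\s]+)_([^:\s]+):")


def failed_components(log_text):
    """Return the sorted, de-duplicated framework/component tokens
    named in the load-failure warnings found in log_text."""
    return sorted({f"{fw}/{comp}" for fw, comp in PATTERN.findall(log_text)})


sample = """\
[ZT-DH170-26:1009588] mca_base_component_repository_open: unable to open mca_btl_uct: libucm.so.0: cannot open shared object file: No such file or directory (ignored)
[ZT-DH170-26:1009589] mca_base_component_repository_open: unable to open mca_pml_ucx: libucp.so.0: cannot open shared object file: No such file or directory (ignored)
[ZT-DH170-26:1009588] mca_base_component_repository_open: unable to open mca_osc_ucx: libucp.so.0: cannot open shared object file: No such file or directory (ignored)
"""

print(failed_components(sample))  # ['btl/uct', 'osc/ucx', 'pml/ucx']
```

All three failing components belong to the btl, pml, and osc frameworks, which is consistent with the value ^pml,btl,osc silencing every warning in the second run.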

@gpaulsen
Member

This proposal looks great from IBM's point of view. "show load errors" is a useful tool for diagnosing cluster issues at customer sites, and this enhances its ability.

Contributor

@qkoziol qkoziol left a comment

LGTM

@jsquyres jsquyres force-pushed the pr/show-load-errors----or-not branch from f11f0cc to 14cd70a Compare September 19, 2022 20:35
@jsquyres jsquyres marked this pull request as ready for review September 19, 2022 20:35
@jsquyres
Member Author

This last push simply tweaked the documentation a little; no code changes.

@jsquyres
Member Author

bot:aws:recheck

Convert the MCA parameter "opal_mca_base_component_show_load_errors"
to be a flexible mechanism to specify when (and when not) to emit
warnings about errors when trying to load DSO components.

1. Convert the existing MCA parameter
   opal_mca_base_component_show_load_errors from a boolean to a
   string.  It will still accept all prior valid boolean values, but
   it will also accept a comma-delimited list of "framework[/component]"
   tokens.  If the MCA base encounters an error when loading a DSO,
   opal_mca_base_component_show_load_errors is checked to see if a
   warning should be emitted.

   - If the value is boolean true or the string "all", then emit a
     warning
   - If the value is boolean false or the string "none", then do not
     emit a warning
   - If the value is a comma-delimited list of tokens: emit a warning
     about any dynamic component that fails to open and matches a
     token in the list.  "Match" is defined as:
     - If a token in the list is only a framework name, then any
       component in that framework will match.
     - If a token in the list specifies both a framework name and a
       component name (in the form ``framework/component``), then only
       the specified component in the specified framework will match.
   - The value can also be a "^" character followed by a
     comma-delimited list of "framework[/component]" values: This is
     similar to the comma-delimited list of tokens, except it will
     only emit warnings about dynamic components that fail to load and
     do *not* match a token in the list.

   *NOTE*: The equivalence of "all" with boolean true values, and
           "none" with boolean false values is only intended as a
           backwards compatibility mechanism, since prior to this
           commit, opal_mca_base_component_show_load_errors was a
           boolean value.  It is not intended as a general mechanism
           that should be copied to all other include/exclude-type MCA
           params.

2. Remove the configure option --enable-show-load-errors-by-default
   and replace it with --with-show-load-errors[=value].  The value
   specified will become the default value of the
   opal_mca_base_component_show_load_errors MCA variable (it defaults
   to "all").

   The CLI option name change is intentional.  The previous MCA
   parameter only accepted boolean values; the new CLI name reflects
   that it can accept more than just boolean values.

The rationale for this commit is to allow packagers more granular
control over whether to warn about component DSO load failures or not.

The canonical example of where this is useful is accelerator
libraries: since accelerators are expensive, they may only be
available on a subset of nodes in a given HPC environment.
Consequently, the accelerator's support libraries may only be loaded
on the nodes that actually have accelerators physically present.  In
such an environment, an administrator or packager may wish to
configure Open MPI:

1. With accelerator components built as DSOs.
2. Do not warn about accelerator DSO component load failures.

For example:

```
./configure --enable-mca-dso=accelerator ...
make install
mpirun --mca opal_mca_base_component_show_load_errors '^accelerator' ...
```

Signed-off-by: Jeff Squyres <[email protected]>
@jsquyres jsquyres force-pushed the pr/show-load-errors----or-not branch from 14cd70a to 20bbf27 Compare September 20, 2022 19:25
@jsquyres jsquyres merged commit 091e07a into open-mpi:main Sep 21, 2022
@jsquyres jsquyres deleted the pr/show-load-errors----or-not branch September 21, 2022 14:14

8 participants