Update to "show load errors" functionality #10763
to be a flexible mechanism to specify when (and when not) to emit warnings about errors when trying to load DSO components. Port of open-mpi/ompi#10763 Signed-off-by: Ralph Castain <[email protected]>
to be a flexible mechanism to specify when (and when not) to emit warnings about errors when trying to load DSO components. Port of open-mpi/ompi#10763 Signed-off-by: Ralph Castain <[email protected]> (cherry picked from commit 6927bc4)
Updates this morning to rebase on top of minor updates to #10762.

A couple of quick questions:
To my knowledge, this would be the first include/exclude option (like

@jjhursey Good comments about

Had to force push the same commits (rebased to change their hashes) to un-stick the AWS Jenkins on this PR.

@jsquyres Please let me know if you make a material change so I can port it.

@rhc54 Will do. Nothing yet; just rebasing and futzing to get CI to work properly.

#10762 merged, so I rebased to eliminate the extra docs commits on this PR. This PR is now just the "show load errors" proposal commit.

bot:aws:recheck

@edgargabriel @wckzhang Can you please test this PR and see if it delivers the functionality that you're looking for?

@edgargabriel @wckzhang Ping ^^

I literally started compiling the branch 2 mins ago :-). Should have some feedback later this morning.

So here is what I did for testing: I compiled Open MPI with UCX support and, after compilation and installation, removed the UCX libraries. After removing the UCX directory, the executable still worked but, as expected, generated warnings: I subsequently tried two parameter settings, both worked and (Note: I did get some additional warnings such as but I assume they are unrelated to this issue and PR, and ignored/removed them from the output). So I would say that this PR looks good overall and is in line with how I think it could/should work.

This proposal looks great from IBM's point of view. "show load errors" is a useful tool for diagnosing cluster issues at customer sites, and this enhances its ability.
qkoziol
left a comment
LGTM

This last push simply tweaked the documentation a little; no code changes.

bot:aws:recheck

Convert the MCA parameter "opal_mca_base_component_show_load_errors"
to be a flexible mechanism to specify when (and when not) to emit
warnings about errors when trying to load DSO components.
1. Convert the existing MCA parameter
   opal_mca_base_component_show_load_errors from a boolean to a
   string. It will still accept all prior valid boolean values, but
   it will also accept a comma-delimited list of "framework[/component]"
   tokens. If the MCA base encounters an error when loading a DSO,
   opal_mca_base_component_show_load_errors is checked to see if a
   warning should be emitted.
   - If the value is boolean true or the string "all", then emit a
     warning
   - If the value is boolean false or the string "none", then do not
     emit a warning
   - If the value is a comma-delimited list of tokens: emit a warning
     about any dynamic component that fails to open and matches a
     token in the list. "Match" is defined as:
     - If a token in the list is only a framework name, then any
       component in that framework will match.
     - If a token in the list specifies both a framework name and a
       component name (in the form ``framework/component``), then only
       the specified component in the specified framework will match.
   - The value can also be a "^" character followed by a
     comma-delimited list of "framework[/component]" values: This is
     similar to the comma-delimited list of tokens, except it will
     only emit warnings about dynamic components that fail to load and
     do *not* match a token in the list.
   *NOTE*: The equivalence of "all" with boolean true values, and
           "none" with boolean false values is only intended as a
           backwards compatibility mechanism, since prior to this
           commit, opal_mca_base_component_show_load_errors was a
           boolean value. It is not intended as a general mechanism
           that should be copied to all other include/exclude-type MCA
           params.
2. Remove the configure option --enable-show-load-errors-by-default,
   replace it with --with-show-load-errors[=value]. The value
   specified will become the default value of the
   opal_mca_base_component_show_load_errors MCA variable (it defaults
   to "all").
   The CLI option name change is intentional. The previous MCA
   parameter only accepted boolean values; the new CLI name reflects
   that it can accept more than just boolean values.
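
To make the matching rules in item 1 concrete, here is a minimal shell sketch of the decision. It is only an illustration of the rules as described in this commit message, not the code in this PR: the function name `should_warn`, the set of boolean spellings it accepts, and the `accelerator/cuda` and `pml/ucx` names used in the calls are all assumptions for the example. It also does not trim whitespace around tokens.

```shell
# Illustrative sketch only; this is NOT the Open MPI implementation.
# should_warn VALUE FRAMEWORK COMPONENT
# Exit status 0 means "emit a warning", 1 means "stay quiet".
should_warn() {
    local value="$1"
    local framework="$2"
    local component="$3"
    local negate=0
    local matched=0

    # Backwards-compatible spellings. Which boolean strings are accepted
    # is an assumption of this sketch.
    case "$value" in
        all|true|yes|1)  return 0 ;;
        none|false|no|0) return 1 ;;
    esac

    # A leading "^" turns the list into an exclusion list.
    case "$value" in
        '^'*) negate=1; value="${value#?}" ;;
    esac

    # A token matches if it names the framework alone or the exact
    # framework/component pair. Wrapping in commas avoids partial matches.
    case ",$value," in
        *",$framework,"*|*",$framework/$component,"*) matched=1 ;;
    esac

    # Plain list: warn on a match. "^" list: warn on a non-match.
    [ "$matched" -ne "$negate" ]
}

# Hypothetical component names, used only to exercise the function.
should_warn '^accelerator' accelerator cuda && echo warn || echo quiet   # prints "quiet"
should_warn '^accelerator' pml ucx && echo warn || echo quiet            # prints "warn"
```

With a value of `^accelerator`, a failed load in the accelerator framework stays quiet while a failed load anywhere else still warns, which is the behavior the exclusion form is meant to give.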
The rationale for this commit is to allow packagers more granular
control over whether to warn about component DSO load failures or not.
The canonical example of where this is useful is accelerator
libraries: since accelerators are expensive, they may only be
available on a subset of nodes in a given HPC environment.
Consequently, the accelerator's support libraries may only be loaded
on the nodes that actually have accelerators physically present. In
such an environment, an administrator or packager may wish to
configure Open MPI:
1. With accelerator components built as DSOs.
2. Do not warn about accelerator DSO component load failures.
For example:
```
./configure --enable-mca-dso=accelerator ...
make install
mpirun --mca opal_mca_base_component_show_load_errors '^accelerator' ...
```
Signed-off-by: Jeff Squyres <[email protected]>
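
For sites that prefer not to change every mpirun command line, the same value could presumably be supplied through the environment. This sketch assumes that Open MPI's usual `OMPI_MCA_<parameter>` environment-variable convention applies to this parameter unchanged; that is not stated in this PR.

```shell
# Assumption: the usual OMPI_MCA_<parameter> environment convention
# applies to this parameter.
export OMPI_MCA_opal_mca_base_component_show_load_errors='^accelerator'

# Show what subsequently launched processes would see in their environment.
echo "$OMPI_MCA_opal_mca_base_component_show_load_errors"

# mpirun would then be launched as usual, without an --mca option
# for this parameter.
```

Note the single quotes around the value, which keep the shell from treating the `^` or commas specially.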
NOTE: This PR is a proposal after the discussion from #10729. It also currently includes the docs commits from #10762 so that I could include docs for opal_mca_base_component_show_load_errors. Hence, only the non-docs commits on this PR are relevant. When #10762 is merged, those docs commits will disappear from this PR.

Convert the MCA parameter "opal_mca_base_component_show_load_errors"
to be a flexible mechanism to specify when (and when not) to emit
warnings about errors when trying to load DSO components.