v4.1.x: odls/default: cap the max number of child FDs to close #10360
Merged: bwbarrett merged 1 commit into open-mpi:v4.1.x from jsquyres:pr/v4.1.x/odls-non-linux-sysconf-sc-open-max-i-cant-even-dot-dot-dot on May 16, 2022
On some versions of macOS (e.g., 12.3.1), we have seen sysconf(_SC_OPEN_MAX) (and "ulimit -n") return very large numbers, and sometimes return -1 (which means "unlimited"). This can result in an unreasonably large loop over closing all FDs (especially if -1 gets interpreted as LONG_MAX).

open-mpi#10358 has some links to others who have seen this kind of behavior.

Add an MCA param (orte_odls_default_maxfd, defaulting to 1024) that caps the max number of FDs to close in non-Linux environments. Use an MCA param because we're picking this max value fairly arbitrarily; it gives users a way to change the value if needed.

Thanks to Scott Sayres for raising the issue.

This is not a cherry pick because this code (i.e., ORTE) no longer exists on main.

Signed-off-by: Jeff Squyres <[email protected]>
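The capping idea described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual ORTE code: the helper names `capped_max_fd` and `close_fds_from` are hypothetical, and the cap is passed in as a plain argument rather than read from the `orte_odls_default_maxfd` MCA param.

```c
#include <assert.h>
#include <limits.h>
#include <unistd.h>

/* Hypothetical helper (not the ORTE implementation): given the raw value
 * reported by sysconf(_SC_OPEN_MAX) and a cap, return the exclusive upper
 * bound of the FD range that is worth closing. */
static long capped_max_fd(long reported, long cap)
{
    assert(cap > 0);
    /* sysconf() returns -1 when there is no definite limit (or on error).
     * Treat that case, and any value above the cap, as the cap itself. */
    if (reported < 0 || reported > cap) {
        return cap;
    }
    return reported;
}

/* Close every FD in [first_fd, limit), where limit is the capped value. */
static void close_fds_from(int first_fd, long cap)
{
    long limit = capped_max_fd(sysconf(_SC_OPEN_MAX), cap);
    for (long fd = first_fd; fd < limit; ++fd) {
        close((int) fd); /* failing with EBADF on an unopened FD is harmless here */
    }
}
```

With a cap of 1024, a child process that calls `close_fds_from(3, 1024)` performs at most about a thousand `close()` calls, even when the OS reports a limit in the billions or -1. The trade-off is that any FD at or above the cap stays open, which is why the cap is made tunable.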
bwbarrett approved these changes on May 6, 2022
1003n40 approved these changes on May 6, 2022
jsquyres (Member, Author) commented:
I'm putting this PR back in "draft" status, per @bosilica's comment. Let's see where the conversation goes; I don't want to merge this PR accidentally before we come to consensus.
jsquyres (Member, Author) commented:
Moving this out of draft. We'd like to do a v4.1.x release and have some kind of fix in it. If we get a better fix someday, great.
jsquyres added a commit to jsquyres/pmix-master that referenced this pull request on May 18, 2022
On some OSes (e.g., macOS), the value returned by sysconf(_SC_OPEN_MAX) can be set by the user via "ulimit -n X", where X can be -1 (unlimited) or a positive integer. On macOS in particular, if the user does not set this value, it's unclear how the default value is chosen. Some users have reported seeing arbitrarily large default values (in the billions), resulting in a very large loop over close() that can take minutes or hours to complete, leading the user to think that the app has hung.

To avoid this, ensure that we cap the max FD that we'll try to close. This is not a perfect scheme, and there's uncertainty about how the macOS default value works, so we provide the pmix_maxfd MCA var to allow the user to set the max FD value if needed.

This commit is inspired by open-mpi/ompi#10360 and open-mpi/ompi#10358.

Thanks to Scott Sayres for raising the issue.

Signed-off-by: Jeff Squyres <[email protected]>
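To check whether a given machine shows the behavior described above, one option is to print what the OS reports for the open-file limit. The following is a minimal diagnostic sketch, not part of either commit; the function name `report_fd_limits` is hypothetical.

```c
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical diagnostic: print the open-file limit as seen through
 * sysconf() and through getrlimit(). Returns 0 on success, -1 on failure. */
static int report_fd_limits(FILE *out)
{
    struct rlimit rl;
    long sc = sysconf(_SC_OPEN_MAX);

    /* -1 here means "no definite limit" (or an error). */
    fprintf(out, "sysconf(_SC_OPEN_MAX) = %ld\n", sc);

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return -1;
    }
    if (rl.rlim_cur == RLIM_INFINITY) {
        fprintf(out, "RLIMIT_NOFILE (soft) = unlimited\n");
    } else {
        fprintf(out, "RLIMIT_NOFILE (soft) = %llu\n",
                (unsigned long long) rl.rlim_cur);
    }
    return 0;
}
```

Calling `report_fd_limits(stdout)` from a small `main()` before and after running `ulimit -n X` in the shell shows how the user-set limit feeds into the value that an FD-closing loop would iterate over.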
rhc54 pushed a commit to openpmix/openpmix that referenced this pull request on May 18, 2022
rhc54 pushed a commit to rhc54/openpmix that referenced this pull request on May 27, 2022
On some OSes (e.g., macOS), the value returned by sysconf(_SC_OPEN_MAX) can be set by the user via "ulimit -n X", where X can be -1 (unlimited) or a positive integer. On macOS in particular, if the user does not set this value, it's unclear how the default value is chosen. Some users have reported seeing arbitrarily large default values (in the billions), resulting in a very large loop over close() that can take minutes or hours to complete, leading the user to think that the app has hung.

To avoid this, ensure that we cap the max FD that we'll try to close. This is not a perfect scheme, and there's uncertainty about how the macOS default value works, so we provide the pmix_maxfd MCA var to allow the user to set the max FD value if needed.

This commit is inspired by open-mpi/ompi#10360 and open-mpi/ompi#10358.

Thanks to Scott Sayres for raising the issue.

Signed-off-by: Jeff Squyres <[email protected]>
(cherry picked from commit 7c72657)
rhc54 pushed a commit to openpmix/openpmix that referenced this pull request on May 27, 2022
rhc54 pushed a commit to rhc54/openpmix that referenced this pull request on Jun 1, 2022
rhc54 pushed a commit to rhc54/openpmix that referenced this pull request on Jun 1, 2022
rhc54 pushed a commit to openpmix/openpmix that referenced this pull request on Jun 2, 2022
bot:notacherrypick