@jsquyres jsquyres commented May 6, 2022

On some versions of macOS (e.g., 12.3.1), we have seen
sysconf(_SC_OPEN_MAX) -- and "ulimit -n" -- return very large numbers,
and sometimes return -1 (which means "unlimited"). This can result in
an unreasonably large loop over closing all FDs (especially if -1 gets
interpreted as LONG_MAX).
#10358 has some links to others who have seen this kind of behavior.
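
To make the size of the problem concrete, here is a minimal sketch (not the ORTE code) of how many close() calls a naive "close everything up to the limit" loop would issue for a given sysconf(_SC_OPEN_MAX) result. The helper name naive_close_count is hypothetical, and treating -1 as LONG_MAX is the worst-case interpretation mentioned above.

```c
#include <limits.h>
#include <unistd.h>

/* Hypothetical helper: number of close() calls made by a naive loop
 *     for (fd = 3; fd < open_max; ++fd) close(fd);
 * A negative result from sysconf() ("unlimited") is treated as LONG_MAX,
 * which is the worst-case interpretation. */
long naive_close_count(long open_max)
{
    long limit = (open_max < 0) ? LONG_MAX : open_max;
    return (limit > 3) ? limit - 3 : 0;
}

/* Convenience wrapper that asks the OS for the current limit. */
long naive_close_count_now(void)
{
    return naive_close_count(sysconf(_SC_OPEN_MAX));
}
```

With a conventional limit of 1024 the loop is trivial; with -1 or a value in the billions, the same loop issues billions of system calls or more.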

Add an MCA param (orte_odls_default_maxfd, defaulting to 1024) that
caps the max number of FDs to close in non-Linux environments. Use an
MCA param because we're picking this max value fairly arbitrarily;
give users a way to change it someday, if needed.
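
The capping idea can be sketched as follows. This is an illustration under stated assumptions, not the actual patch: maxfd_cap is a hypothetical stand-in for the value of the orte_odls_default_maxfd MCA param, and effective_maxfd and close_fds_from are hypothetical helper names.

```c
#include <unistd.h>

/* Hypothetical stand-in for the orte_odls_default_maxfd MCA param value;
 * 1024 is the default named in the description above. */
long maxfd_cap = 1024;

/* Choose the upper bound for the close() loop: use the sysconf() result
 * only when it is non-negative and does not exceed the cap. */
long effective_maxfd(long open_max, long cap)
{
    if (open_max < 0 || open_max > cap) {
        return cap;
    }
    return open_max;
}

/* Close every descriptor from first_fd up to (not including) the capped
 * limit. close() failing with EBADF on unopened descriptors is expected,
 * so the return value is deliberately ignored. */
void close_fds_from(int first_fd)
{
    long limit = effective_maxfd(sysconf(_SC_OPEN_MAX), maxfd_cap);
    for (long fd = first_fd; fd < limit; ++fd) {
        (void) close((int) fd);
    }
}
```

The trade-off is that descriptors numbered at or above the cap are left open, which is why the cap is exposed as a tunable rather than hard-coded.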

Thanks to Scott Sayres for raising the issue.

This is not a cherry pick because this code (i.e., ORTE) no longer
exists on main.

Signed-off-by: Jeff Squyres [email protected]

bot:notacherrypick

jsquyres commented May 7, 2022

I'm putting this PR back in "draft" status, per @bosilica's comment. Let's see where the conversation goes (and I don't want to merge this PR accidentally before we come to consensus).

@jsquyres jsquyres marked this pull request as draft May 7, 2022 13:36
@jsquyres jsquyres marked this pull request as ready for review May 16, 2022 19:51
@jsquyres

Moving this out of draft. We'd like to do a v4.1.x release and have some kind of fix in it. If we get a better fix someday, great.

@bwbarrett bwbarrett merged commit 32e2e62 into open-mpi:v4.1.x May 16, 2022
@jsquyres jsquyres deleted the pr/v4.1.x/odls-non-linux-sysconf-sc-open-max-i-cant-even-dot-dot-dot branch May 16, 2022 19:52
jsquyres added a commit to jsquyres/pmix-master that referenced this pull request May 18, 2022
On some OS's (e.g., macOS), the value returned by
sysconf(_SC_OPEN_MAX) can be set by the user via "ulimit -n X", where
X can be -1 (unlimited) or a positive integer. On macOS in particular,
if the user does not set this value, it's unclear how the default
value is chosen.  Some users have reported seeing arbitrarily large
default values (in the billions), resulting in a very large loop over
close() that can take minutes/hours to complete, leading the user to
think that the app has hung.  To avoid this, ensure that we cap the
max FD that we'll try to close.  This is not a perfect scheme, and
there's uncertainty on how the macOS default value works, so we
provide the pmix_maxfd MCA var to allow the user to set the max FD
value if needed.

This commit is inspired by open-mpi/ompi#10360
and open-mpi/ompi#10358.

Thanks to Scott Sayres for raising the issue.

Signed-off-by: Jeff Squyres <[email protected]>
rhc54 pushed a commit to openpmix/openpmix that referenced this pull request May 18, 2022
rhc54 pushed a commit to rhc54/openpmix that referenced this pull request May 27, 2022
(cherry picked from commit 7c72657)
rhc54 pushed a commit to openpmix/openpmix that referenced this pull request May 27, 2022
(cherry picked from commit 7c72657)
rhc54 pushed a commit to rhc54/openpmix that referenced this pull request Jun 1, 2022
(cherry picked from commit 7c72657)
rhc54 pushed a commit to rhc54/openpmix that referenced this pull request Jun 1, 2022
(cherry picked from commit 7c72657)
rhc54 pushed a commit to openpmix/openpmix that referenced this pull request Jun 2, 2022
(cherry picked from commit 7c72657)