[BuildOcl] Fatal error: cannot open file, Too many open files #2911
Comments
We do not see evidence of this from other sources. What version of MIOpen are you using? Has it ever worked, and if so, with what version? If you build MIOpen yourself, please attach the CMake output.
I am afraid you are using a Linux distribution that is not officially supported. In this case our support options are limited. Of course we'll try to help, but please be aware of that. |
@atamazov thanks for the pointer. Closing the issue now that I realised that some dependencies (comgr most notably) on my OS are horribly out of date. |
@atamazov I was able to reproduce exactly the same problem using one of the ROCm docker images. Docker image:
Pytorch is installed in the container via (latest stable version)
Docker command (please note
By default Docker has a fairly high limit on the number of open files (I think it was something like 256k). I'm curious to know whether this is indeed a problem, or whether I should just raise the limit on my host and live with it?
List of open files: Open files
|
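For anyone comparing the host and the container, here is a minimal Python sketch for checking which open-file limit a process actually runs under. It uses only the standard `resource` module; the helper names `nofile_limits` and `raise_soft_nofile` are made up for illustration and are not part of MIOpen, ROCm or PyTorch.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) RLIMIT_NOFILE limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_nofile(target):
    """Raise the soft open-file limit toward `target`, capped at the hard limit.

    Hypothetical helper: it never lowers the limit and never touches the
    hard limit. Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft != resource.RLIM_INFINITY and target > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("soft/hard before:", nofile_limits())
    print("soft after:", raise_soft_nofile(2048))
```

Running this inside the container and on the host shows which limit applies in each. Raising the soft limit only changes when the limit is reached; it does not stop descriptors from accumulating.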
@xelibrion Thanks for the logs. I recommend compacting comments that contain long console logs, see https://github.com/ROCm/MIOpen/wiki/How-to-insert-console-logs-into-github-pages. |
@xelibrion The issue is caused by a HIP runtime problem that forced us to use interim code object files as a workaround. The workaround was removed in #2225, and the fix was first released in MIOpen 3.1.0 (https://github.com/ROCm/MIOpen/releases/tag/rocm-6.1.0). I suspect that PyTorch from https://download.pytorch.org/whl/rocm6.0 does not contain the necessary fixes, and that you need to use the rocm6.1-based one. Possibly you can get it from https://download.pytorch.org/whl/nightly/rocm6.1. |
@atamazov unfortunately the URL https://download.pytorch.org/whl/nightly/rocm6.1 does not seem to have a version of PyTorch built against 6.1 yet. |
@xelibrion I've looked at https://rocm.docs.amd.com/projects/radeon/en/latest/docs/install/install-pytorch.html, and that led me to |
@atamazov thanks, it does indeed - I did not encounter any issues with that build. Closing the issue. |
Hey folks, I'm curious if there's a regression of #962?
I'm running ROCm 6.0.2 and this is the error that I get. I've seen this behaviour on ROCm 5.7 as well I think, except my Python process would just crash without any reasonable output/logs.
With the newer version of PyTorch at least there's a log clearly stating the reason.
I tried setting a higher ulimit
ulimit -Sn 2048
as suggested in #962, but the number of open files just kept rising (as reported by watch 'ls /proc/8757/fd | wc -l'). It took 4x as long to hit the new limit, but the program would eventually crash anyway.
Debug info:
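As an aside to the report above, the measurement taken with the watch command can also be scripted. The sketch below is Linux-only and assumes `/proc/<pid>/fd` is readable for the target process; `count_open_fds` and `sample_fd_counts` are hypothetical helper names, not part of any library mentioned in this thread.

```python
import os
import time


def count_open_fds(pid="self"):
    """Count entries in /proc/<pid>/fd, like `ls /proc/<pid>/fd | wc -l`.

    For pid="self" the count includes the descriptor that os.listdir
    briefly opens on the directory itself.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def sample_fd_counts(pid="self", samples=5, interval=1.0):
    """Take `samples` readings of the fd count, `interval` seconds apart."""
    counts = []
    for i in range(samples):
        counts.append(count_open_fds(pid))
        if i < samples - 1:
            time.sleep(interval)
    return counts


if __name__ == "__main__":
    # A steadily growing series suggests descriptors are being leaked.
    print(sample_fd_counts(samples=3, interval=0.5))
```

Passing the PID of the training process (8757 in the command above) instead of "self" reproduces the same reading from a separate monitoring script.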