Hydra-submitit launcher does not log error messages to log files. #2100
Comments
Thank you @abhiskk. We will address this as part of 1.1.2. |
Thanks @abhiskk for helping test the changes with submitit/Slurm. This is addressed on both Hydra 1.2 and 1.1 and will be released along with Hydra 1.1.2 soon. |
@jieru-hu @abhiskk Could you explain how this was addressed? As far as I can tell, error logs are still not being produced. My jobs sometimes fail immediately after starting (unpredictably: sometimes they do, sometimes they don't), and the only error message I see, in the console, is:
I suspect some sort of SLURM issue. Normally, SLURM would publish an actual error message to the error log file, but the plugin is simply swallowing this output. I don't know where it is getting redirected. Please advise. |
This issue still happens to me with Hydra 1.3.2. The error messages are not written into the log files. @abhiskk @jieru-hu Could you please elaborate on how to resolve this? A related issue is #2479, where one wants to exit the process that launches the MULTIRUN experiments. It seems one needs to add a callback, right?
|
Specifically for @Ubadub, not related to this issue: I also encountered this strange behavior. After investigating, I think the problem is that
TL;DR: A workaround would be postponing the Slurm jobs for a few seconds (60 here).
Specifying the error and output files to
#!/bin/bash
# Parameters
#SBATCH --array=0-1%2
#SBATCH --cpus-per-task=1
#SBATCH --error=xxx/multiruns/2023-08-14_18-19/.submitit/%A_%a/%A_%a_0_log.err
#SBATCH --job-name=__main__
#SBATCH --mem-per-cpu=4G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --open-mode=append
#SBATCH --output=xxx/multiruns/2023-08-14_18-19/.submitit/%A_%a/%A_%a_0_log.out
#SBATCH --partition=short
#SBATCH --signal=USR2@120
#SBATCH --time=480
#SBATCH --wckey=submitit
# command
export SUBMITIT_EXECUTOR=slurm
srun --unbuffered --output xxx/multiruns/2023-08-14_18-19/.submitit/%A_%a/%A_%a_%t_log.out --error xxx/multiruns/2023-08-14_18-19/.submitit/%A_%a/%A_%a_%t_log.err python -u -m submitit.core._submit xxx/multiruns/2023-08-14_18-19/.submitit/%j |
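To see which files a given array task should write to, the filename patterns in the script above can be expanded by hand. In Slurm filename patterns, `%A` is the array master job ID, `%a` the array task index, and `%t` the task rank. The following is a minimal sketch; the helper name `expand_slurm_pattern` and the job ID `1234567` are hypothetical, and only these three tokens are handled.

```python
def expand_slurm_pattern(pattern: str, array_job_id: int,
                         array_task_id: int, task_rank: int = 0) -> str:
    """Expand the %A, %a and %t tokens of a Slurm filename pattern.

    Hypothetical helper for illustration; other Slurm tokens
    (for example %j or %N) are left untouched.
    """
    replacements = {
        "%A": str(array_job_id),   # array master job ID
        "%a": str(array_task_id),  # array task index
        "%t": str(task_rank),      # task rank within the job step
    }
    for token, value in replacements.items():
        pattern = pattern.replace(token, value)
    return pattern


# Pattern copied from the srun line above; the IDs are made up.
err_pattern = "xxx/multiruns/2023-08-14_18-19/.submitit/%A_%a/%A_%a_%t_log.err"
err_path = expand_slurm_pattern(err_pattern, array_job_id=1234567, array_task_id=1)
print(err_path)
```

Checking whether the expanded path exists and is non-empty is a quick way to tell whether Slurm produced the error file at all.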
Hi, I see that this issue is closed, but I am still struggling to understand how to retrieve the normal stderr/stdout output from Slurm when using submitit through Hydra. I attempted the fix that @pipme suggested, but the behaviour is the same: I have a simple app that only throws an exception. I receive the error on stdout, but there is no mention of it in what Hydra logs. Any ideas? |
🚀 Feature Request
Logging error messages in hydra-submitit launches to relevant log files. Currently the error messages are ignored by the hydra-submitit setup and are only piped to the terminal.
Motivation
I have been using the hydra-submitit launcher for multi-node training and have been observing the above behavior. If the error messages were written to the relevant log files, log messages and error messages would be consolidated in one place, which would make experiments easier to track. Logging to the terminal is also painful when a hyperparameter search is launched with a single command, because the messages from all of its runs end up in the same terminal.
For example, if we launch the runs we get a message like this on the terminal:
If any of these experiments crash, the error is piped to the terminal from which we submitted the jobs, but it is not present in the log files where everything else is logged, such as:
.submitit/54459356/54459356_0_log.out, .submitit/54459356/54459356_0_log.err
For example, this error message appeared on the terminal that was used to submit the jobs, but was never written to the relevant log files.
Pitch
Discussed with @jieru-hu: this can be implemented as native functionality or using Hydra callbacks.
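Independently of either option, an application-level workaround is to catch the exception inside the task function and send it through Python's `logging` module before re-raising. The sketch below is not Hydra's or submitit's own mechanism; the decorator name `log_exceptions` is hypothetical, and it assumes a logging configuration (such as Hydra's default job logging) that attaches a file handler, so that the logged traceback lands in the job's log file.

```python
import functools
import logging

log = logging.getLogger(__name__)


def log_exceptions(func):
    """Log any exception raised by `func` with its traceback, then re-raise.

    Hypothetical helper: it relies on whatever logging handlers are
    configured at call time to decide where the traceback is written.
    """
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # log.exception records at ERROR level and appends the traceback.
            log.exception("Unhandled exception in %s", func.__name__)
            raise
    return wrapper


@log_exceptions
def train(learning_rate: float) -> float:
    """Toy task function standing in for the real job entry point."""
    if learning_rate <= 0:
        raise ValueError("learning_rate must be positive")
    return learning_rate * 2
```

In a Hydra application, the decorator would sit directly above the task function and below `@hydra.main(...)`, so that the exception is logged inside the job before the launcher sees it. Re-raising keeps the job's failure status unchanged.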