Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hydra-submitit launcher does not log error messages to log files. #2100

Closed
abhiskk opened this issue Mar 18, 2022 · 6 comments
Closed

Hydra-submitit launcher does not log error messages to log files. #2100

abhiskk opened this issue Mar 18, 2022 · 6 comments
Assignees
Labels
enhancement Enhanvement request internal
Milestone

Comments

@abhiskk
Copy link

abhiskk commented Mar 18, 2022

🚀 Feature Request

Logging error messages in hydra-submitit launches to relevant log files. Currently the error messages are ignored by the hydra-submitit setup and are only piped to the terminal.

Motivation

I have been using the hydra-submitit launcher for multi-node training and have been observing the above behavior. If the messages are logged to the relevant log files it will give us consolidation of the log messages and error message and will make it easy to track experiments. The logging of messages to terminal is also painful in the scenario when you launch hyperparameter searches with a single launch command and the messages from multiple hyperparameter search runs come to the same terminal.

For e.g. if we launch the runs we get a message like this on the terminal:

INFO: launch command: python train_d2_net.py -m         --config-path=../configs         --config-name=dummy.yaml         +mode=cluster         hydra.sweep.dir=/checkpoint/akadian/exps/20220316T012225/20220316T012225LOGS         use_wandb=true hydra.launcher.timeout_min=4320 D2.SOLVER.BASE_LR=0.02,0.002,0.0002,0.00002
[2022-03-16 01:23:26,351][HYDRA] Submitit 'slurm' sweep output dir : /checkpoint/akadian/exps-w3-reasoning/20220316T012225/20220316T012225LOGS
[2022-03-16 01:23:26,352][HYDRA]  #0 : +mode=cluster use_wandb=True wandb_project=lvis_debug D2.SOLVER.BASE_LR=0.02
[2022-03-16 01:23:26,358][HYDRA]  #1 : +mode=cluster use_wandb=True wandb_project=lvis_debug D2.SOLVER.BASE_LR=0.002
[2022-03-16 01:23:26,363][HYDRA]  #2 : +mode=cluster use_wandb=True wandb_project=lvis_debug D2.SOLVER.BASE_LR=0.0002
[2022-03-16 01:23:26,367][HYDRA]  #3 : +mode=cluster use_wandb=True wandb_project=lvis_debug D2.SOLVER.BASE_LR=2e-05

If any of these experiments crash then the log gets piped to the terminal using which we had submitted the jobs but they are not present in the relevant log files where everything else is logged, like: .submitit/54459356/54459356_0_log.out, .submitit/54459356/54459356_0_log.err.

For eg this error message appeared on the terminal which was used to submit the jobs, but was never piped to the relevant log files.

Error executing job with overrides: ['+mode=cluster', 'use_wandb=True', 'wandb_project=lvis_debug']
Traceback (most recent call last):
  File "/private/home/akadian/miniconda3/envs/d1/lib/python3.9/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/private/home/akadian/miniconda3/envs/d1/lib/python3.9/site-packages/hydra/_internal/utils.py", line 386, in <lambda>
    lambda: hydra.multirun(
  File "/private/home/akadian/miniconda3/envs/d1/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 140, in multirun
    ret = sweeper.sweep(arguments=task_overrides)
  File "/private/home/akadian/miniconda3/envs/d1/lib/python3.9/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 161, in sweep
    _ = r.return_value
  File "/private/home/akadian/miniconda3/envs/d1/lib/python3.9/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.

Pitch

Discussed with @jieru-hu, this can be implemented as a native functionality or can be implemented using hydra callbacks.

@abhiskk abhiskk added the enhancement Enhanvement request label Mar 18, 2022
@jieru-hu jieru-hu added this to the Hydra 1.1.2 milestone Mar 23, 2022
@jieru-hu jieru-hu self-assigned this Mar 23, 2022
@jieru-hu
Copy link
Contributor

Thank you @abhiskk . we will address this as part of 1.1.2.

@jieru-hu
Copy link
Contributor

thanks @abhiskk for helping with testing the changes with submitit/slurm. This is addressed on both Hydra 1.2 and 1.1 and will be released along with Hydra 1.1.2 soon.

@jieru-hu jieru-hu modified the milestones: Hydra 1.1.2, Hydra 1.2.0 Mar 30, 2022
@Ubadub
Copy link

Ubadub commented May 28, 2023

@jieru-hu @abhiskk Could you explain how this was addressed/completed? As far as I can tell, error logs are still not being produced.

I am having an issue where my jobs are failing immediately (and unpredictably- sometimes they do, sometimes they don't) after starting, and the only error message I see, in he console, is:

submitit.core.utils.UncompletedJobError: Job 12649156_1 (task: 0) with path /mmfs1/gscratch/stf/abhinavp/573-SADTech/src/multirun/05-28-02-43/.submitit/12649156_1/12649156_1_0_result.pkl
has not produced any output (state: FAILED)

I suspect some sort of SLURM issue. Normally, SLURM would publish an actual error message to the error log file, but the plugin is simply swallowing this output. I don't know where it is getting redirected. Please advise.

@pipme
Copy link

pipme commented Aug 22, 2023

This issue still happens to me with hydra 1.3.2. The error messages are not written into .submitit/54459356/54459356_0_log.out,.submitit/54459356/54459356_0_log.err or hydra's .log file. The error messages are only piped to the terminal.

@abhiskk @jieru-hu Could you please elaborate on how to resolve this? A related issue is #2479 where one wants to exit the process for launching the MULTIRUN experiments.

It seems one needs to add a callback, right?

hydra:
  callbacks:
    log_job_return:
      _target_: hydra.experimental.callbacks.LogJobReturnCallback

@pipme
Copy link

pipme commented Aug 22, 2023

Specifically for @Ubadub, not related to this issue. I also encountered this strange behavior. After investigating, I think the problem is that submitit by default specifies #SBATCH --error=xxx/.submitit/%A_%a/%A_%a_0_log.err for submission to SLURM while the folders %A_%a are not created yet when submitting the jobs. The folders are created almost instantly after submitting the SLURM job arrays, see here. SLURM can therefore fail silently since it doesn't support non-existed folders for writing the outputs before v23.02. (Also submitit assumes the .pkl files for executing are in those subfolders which might not be true at the time of submission.)

TL;DR: A workaround would be postponing the slurm jobs for a few seconds (60 here).

# @package _global_
# requires hydra-submitit-launcher (available via pip)

defaults:
  - override /hydra/launcher: submitit_slurm
 
hydra:
  launcher:
    additional_parameters:
      error: /dev/null  
      output: /dev/null
      begin: now+60

Specifying the error and output files to \dev\null may or may not be necessary but since the output&error paths for individual array jobs are correctly set internally by submitit with srun --output .submitit/%A_%a/%A_%a_%t_log.out ..., it doesn't harm to add those. Below is a sbatch script example created by submitit before the above modification.

#!/bin/bash

# Parameters
#SBATCH --array=0-1%2
#SBATCH --cpus-per-task=1
#SBATCH --error=xxx/multiruns/2023-08-14_18-19/.submitit/%A_%a/%A_%a_0_log.err
#SBATCH --job-name=__main__
#SBATCH --mem-per-cpu=4G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --open-mode=append
#SBATCH --output=xxx/multiruns/2023-08-14_18-19/.submitit/%A_%a/%A_%a_0_log.out
#SBATCH --partition=short
#SBATCH --signal=USR2@120
#SBATCH --time=480
#SBATCH --wckey=submitit

# command
export SUBMITIT_EXECUTOR=slurm
srun --unbuffered --output xxx/multiruns/2023-08-14_18-19/.submitit/%A_%a/%A_%a_%t_log.out --error xxx/multiruns/2023-08-14_18-19/.submitit/%A_%a/%A_%a_%t_log.err python -u -m submitit.core._submit xxx/multiruns/2023-08-14_18-19/.submitit/%j

@harveyaa
Copy link

Hi, I see that this issue is closed - but I am still struggling to understand how to retrieve the normal error/stdout output from SLURM when using submitit through hydra. I attempted the fix that @pipme suggested, but the behaviour is the same: I have a simple app that only throws an exception and receive the error on the stdout, but when I go to what is logged by hydra there is no mention of it. Any ideas?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhanvement request internal
Projects
None yet
Development

No branches or pull requests

5 participants