Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[minor] more verbose about a job's error #2500

Merged
merged 1 commit into from
Dec 20, 2022

Conversation

min-xu-ai
Copy link

When a job raises an exception, it seems to be swallowed. After this change, the backtrace shows up in main.log as well as the job's stdout file. See below
image

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 10, 2022
Copy link
Collaborator

@Jasha10 Jasha10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @min-xu-ai!

@Jasha10 Jasha10 merged commit 943024a into facebookresearch:main Dec 20, 2022
@@ -186,6 +187,9 @@ def run_job(
ret.return_value = task_function(task_cfg)
ret.status = JobStatus.COMPLETED
except Exception as e:
name = "HYDRA_FULL_ERROR"
if name in os.environ and os.environ[name] == "1":
log.info(traceback.format_exc())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better log.error here, no?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I agree. also, it is likely crash from here, so maybe that won't be missed in most cases.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've created followup PR #2549 changing the log level from INFO to ERROR.

@omry
Copy link
Collaborator

omry commented Jan 13, 2023

If I am not mistaken, the exception is supposed to be printed downstream (especially if HYDRA_FULL_ERROR is set).

@min-xu-ai
Copy link
Author

But somehow it didn’t in my experiments

@omry
Copy link
Collaborator

omry commented Jan 13, 2023

@Jasha10, did you confirm this before accepting the patch?
Exception handling in Hydra is not simple, this feels wrong.

@Jasha10
Copy link
Collaborator

Jasha10 commented Jan 14, 2023

Yes. To confirm, I modified the file examples/tutorials/basic/running_your_hydra_app/5_basic_sweep/my_app.py so that the my_app function raises a RuntimeError. Then I ran a basic sweep:
HYDRA_FULL_ERROR=1 python example/my_app.py -m
This sweep contains 4 jobs.

Here's what I observed:

  • Before this diff: none of the log files has a traceback (the logs are empty). On stdout, we get a traceback for the first job, which is printed after the sweep completes.
  • After this diff: all four log files contain a traceback. On stdout, we get a traceback for each of the four jobs (printed immediately while the job is running). In addition, we get one more traceback (for the first job) after the sweep completes.

@omry
Copy link
Collaborator

omry commented Jan 14, 2023

A couple of things:

  1. Hydra typically prints a sanitized stack trace unless HYDRA_FULL_ERROR is 1. This deviates from the pattern as far as I can tell.
  2. By adding print, aren't you getting a duplicate printout on non-multirun executions?

@omry
Copy link
Collaborator

omry commented Jan 18, 2023

@Jasha10, this is also missing a unit test that will ensure it's not breaking in the future.
I suggest you unland this until this and my questions are addressed.

Jasha10 added a commit to Jasha10/hydra that referenced this pull request Jan 19, 2023
@Jasha10
Copy link
Collaborator

Jasha10 commented Jan 19, 2023

Unlanding in #2556.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants