Make Airflow error messages more specific, clear and actionable #43171

Open
1 of 2 tasks
omkar-foss opened this issue Oct 18, 2024 · 9 comments
Labels
area:core kind:feature Feature Requests kind:meta High-level information important to the community

Comments

@omkar-foss
Collaborator

omkar-foss commented Oct 18, 2024

Description

As per users' feedback in the Airflow Debugging Survey 2024, around 41.7% of respondents do not consider error messages actionable. Overall feedback also suggests that users find some error messages vague and confusing.

Use case/motivation

Goals for this issue are the following:

  • Identify and revise error messages that are vague, lack context, or do not provide clear guidance on resolving issues.
  • Provide detailed information, context, and actionable steps within the error messages to help users troubleshoot.
  • Transform error messages with meaningful linking wherever possible. For example, an error like "Celery command failed on host" could be displayed together with a hint such as "Please check your DAG processor timeout variable for this", so that the user has a starting point for debugging.

Related issues

Parent Issue: #40975

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!


@omkar-foss omkar-foss added kind:feature Feature Requests needs-triage label for new issues that we didn't triage yet labels Oct 18, 2024
@dosubot dosubot bot added the area:core label Oct 18, 2024
@omkar-foss omkar-foss changed the title Improve clarity, specificity, and actionable guidance in Airflow error messages Make Airflow error messages more specific, clear and actionable Oct 18, 2024
@hterik
Contributor

hterik commented Oct 21, 2024

I can recommend this guide from Google about writing good error messages: https://developers.google.com/tech-writing/error-messages. The rest of the courses in that book are also really good btw.

an error like Celery command failed on host can be transformed or displayed with something like "Please check your DAG processor timeout variable for this".

Actionable errors are good, but they have to be done very carefully, because misleading advice will send users down the wrong rabbit hole. For example, this log message in standard_task_runner.py is most of the time not caused by running out of memory: "Job %s was killed before it finished (likely due to running out of memory)". I've seen our engineers chase memory issues in vain countless times because of that message. (yes we should have filed a PR 😄)

@omkar-foss omkar-foss added kind:meta High-level information important to the community and removed needs-triage label for new issues that we didn't triage yet labels Oct 23, 2024
@potiuk
Member

potiuk commented Oct 24, 2024

but has to be done very carefully, because if it gives misleading advice it will lead users down chasing the wrong rabbit hole. For example this log in standard_task_runner.py is most of the time not due to memory running out: "Job %s was killed before it finished (likely due to running out of memory)". I've seen our engineers chasing memory issues in vain countless times because of that message.

I am a big fan of "always tell the user what action the error implies on their side". I agree that things can be misleading. Regarding the case you mentioned (I think I discussed it in the past, but I cannot find it now): for such complicated errors with multiple possible root causes, I think we should explain what is going on and link to a FAQ page in the Airflow docs that explains the possible reasons. This way, when we find other reasons or more detailed explanations of what could be wrong and how to remediate it, we can always update the docs, and the added information will be useful even for the many past versions of Airflow that people are running.
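As a rough illustration of "explain what's going on and link to a FAQ page", here is a minimal Python sketch. The function name, the extra return-code detail, and the FAQ URL are all hypothetical and are not part of Airflow; the point is only that the message reports what was observed and defers the list of causes to documentation that can be updated later.

```python
# Hypothetical docs URL, used only for illustration (not a real Airflow page).
FAQ_URL = "https://example.invalid/airflow-faq#job-killed"


def job_killed_message(job_id: int, return_code: int) -> str:
    """State the observed facts, then link to docs instead of guessing one cause."""
    return (
        f"Job {job_id} was killed before it finished (return code {return_code}). "
        f"This can have several causes; see {FAQ_URL} for known causes and remediation."
    )


message = job_killed_message(10, -9)
print(message)
```

Because the possible causes live behind the link rather than in the log line, they can be corrected without shipping a new release.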

(yes we should have filed a PR 😄)

Absolutely :)

@omkar-foss
Collaborator Author

I have a suggestion for issues with multiple possible root causes: we can print an Airflow error code with the error message, e.g. AERR055: Job 10 was killed before it finished, and maintain an error code mapping with possible root causes like the following (just examples, not real causes):

Error Code | Possible Commonly Observed Causes
AERR055    | 1) Ran out of memory
           | 2) Job was stuck and killed after timeout
           | 3) Job being run on Spot Instance Node (K8S on EKS)

Since error codes are shareable and easily searchable, they would be useful for team collaboration as well (e.g. instead of me saying "I'm looking into the error Job 10 was killed before it finished", I can probably just say "I'm looking into AERR055"). Much like how we use JIRA ticket numbers or GitHub issue/PR numbers.
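The mapping in the table above could be kept as a small registry in code. The following is a minimal Python sketch under stated assumptions: `ErrorInfo`, `ERROR_REGISTRY`, `format_error`, and `describe` are hypothetical names, and the causes are the illustrative ones from the table, not real diagnoses.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorInfo:
    code: str
    summary: str
    possible_causes: tuple[str, ...]


# Example entry copied from the table above; the causes are illustrative only.
ERROR_REGISTRY = {
    "AERR055": ErrorInfo(
        code="AERR055",
        summary="Job was killed before it finished",
        possible_causes=(
            "Ran out of memory",
            "Job was stuck and killed after timeout",
            "Job being run on Spot Instance Node (K8S on EKS)",
        ),
    ),
}


def format_error(code: str, detail: str) -> str:
    """Prefix a message with its registered code so it is easy to search and share."""
    if code not in ERROR_REGISTRY:
        raise KeyError(f"Unknown error code: {code}")
    return f"{code}: {detail}"


def describe(code: str) -> str:
    """Render the registry entry for a code as a numbered list of possible causes."""
    info = ERROR_REGISTRY[code]
    causes = "\n".join(f"  {i}) {c}" for i, c in enumerate(info.possible_causes, 1))
    return f"{info.code}: {info.summary}\nPossible commonly observed causes:\n{causes}"


print(format_error("AERR055", "Job 10 was killed before it finished"))
print(describe("AERR055"))
```

Keeping the log line short and the causes in a separate lookup means the list of causes can grow without changing the message that users search for.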

@potiuk
Member

potiuk commented Oct 29, 2024

❤️ this. This is what many other tools are doing already. Being able to classify and list all the different types of errors that the software can generate, together with an explanation of their causes and remediations (or even just listing them), is a sign of high maturity of the software.

@potiuk
Member

potiuk commented Oct 29, 2024

I really like it.

We could finally find a use for AirflowException. So far it has mainly served as a base class for a number of exceptions, but we could add a mandatory "error id" to AirflowException, make AirflowException abstract, and add handling so that the error ID is displayed in the logs, maybe also emitted as a metric (counting the errors) and recorded as an event in the OTEL trace when the error happens. That might be a really great mechanism to have, and a way to "force" classification of all the errors that we have in Airflow.
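One way such a mandatory error ID might be enforced is sketched below in Python. This is not Airflow's actual AirflowException: `CodedError`, `JobKilledError`, and `ERROR_COUNTS` are hypothetical names, the in-memory counter only stands in for a real metrics backend, and OTEL events are left out.

```python
from collections import Counter

# Stand-in for a metrics backend: counts constructed errors per error ID.
ERROR_COUNTS: Counter = Counter()


class CodedError(Exception):
    """Hypothetical abstract base: every subclass must declare its own error_id."""

    error_id: str = ""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Reject subclasses that do not define a non-empty error_id themselves.
        if not cls.__dict__.get("error_id"):
            raise TypeError(f"{cls.__name__} must define a non-empty error_id")

    def __init__(self, message: str):
        # The base class is treated as abstract and cannot be instantiated directly.
        if type(self) is CodedError:
            raise TypeError("CodedError is abstract; raise a subclass instead")
        # Counted when the exception object is constructed, not when it is raised.
        ERROR_COUNTS[self.error_id] += 1
        # The error ID is prefixed so it shows up wherever the message is logged.
        super().__init__(f"{self.error_id}: {message}")


class JobKilledError(CodedError):
    error_id = "AERR055"  # code taken from the example earlier in the thread


err = JobKilledError("Job 10 was killed before it finished")
print(err)
```

In this sketch the check runs at class-definition time, so an unclassified exception fails on import rather than at the moment it is first raised.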

@kunaljubce
Contributor

@potiuk @omkar-foss I really like how this discussion is shaping up. Have we established any guidelines or SOPs around how to designate the error codes? Or if there's a thread where this discussion is ongoing, would be happy to contribute (both via discussions and PR).

@omkar-foss
Collaborator Author

omkar-foss commented Nov 11, 2024

Nice to hear from you @kunaljubce.

I'm working on a doc to describe a list of all Airflow-related exceptions - starting with the AirflowException (as @potiuk mentioned above) as AERR001, and subsequent error codes assigned incrementally in a breadth-first order. Will share the doc in the next few days.

We can then update that list as required based on further discussion.

@omkar-foss omkar-foss moved this from Planning to Todo in Debugging Improvements - Airflow 3 Nov 28, 2024
@omkar-foss
Collaborator Author

I'm working on a doc to describe a list of all Airflow-related exceptions - starting with the AirflowException (as @potiuk mentioned #43171 (comment)) as AERR001, and subsequent error codes assigned incrementally in a breadth-first order. Will share the doc in the next few days.

Hi, I'm still working on this, got caught up with other things. Will share the list in the next couple of days or so.

@omkar-foss omkar-foss moved this from Todo to In Progress in Debugging Improvements - Airflow 3 Nov 28, 2024
@omkar-foss
Collaborator Author

omkar-foss commented Dec 3, 2024

Hey all, apologies for the delay on this. I've created a very basic guide with the Airflow error mapping, which we all can start adding to and improving further. For further details, kindly refer to this Airflow community slack thread here.

Update: You can also refer to #44616
