Improve Airflow's debugging story #40975

Open
Tracked by #39593
kaxil opened this issue Jul 23, 2024 · 25 comments
Assignees
Labels
airflow3.0:candidate Potential candidates for Airflow 3.0 area:logging kind:feature Feature Requests

Comments

@kaxil
Member

kaxil commented Jul 23, 2024

Summary

As we prepare for the release of Airflow 3.0, one of the key areas that need significant enhancement is the debugging experience.

Current Challenges

  • Insufficient Logging: Logs are often fragmented, in some cases overly verbose and in others non-existent, and they lack sufficient detail to easily trace issues. We should do an audit of the existing logs.
  • Complex Tracebacks: Debugging stack traces can be difficult due to the complex nature of DAG (Directed Acyclic Graph) execution, and it requires a fully running Airflow. Airflow's dag.test and task.test already do a good job, but we should see if we can do even better.
  • Error Handling: Current error messages are not always informative or actionable, making it hard to understand the root cause of failures. We should do an audit of the existing errors.
  • Tooling Integration: Lack of integration with modern debugging and observability tools hinders the debugging process. Can we create a linting tool or some capabilities in the Airflow CLI that catch obvious errors? airflow dags parse does part of this job; it is worth checking whether it is sufficient.

Whoever takes on this task should conduct user research on the mailing list, Slack, Meetup or Airflow Summit to identify other common debugging problems that can be fixed.
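
As an illustration of the linting idea under "Tooling Integration" above, here is a minimal sketch of a static check that runs without importing Airflow. It is not an existing Airflow feature: the function name lint_dag_source and the two checks it performs (syntax errors, and task_id string literals repeated within one file) are assumptions chosen for the example. The duplicate check is a heuristic and would give false positives for files that define several DAGs reusing the same task_id.

```python
import ast
from collections import Counter


def lint_dag_source(source: str, filename: str = "<dag>") -> list[str]:
    """Return human-readable problems found by static inspection only."""
    try:
        tree = ast.parse(source, filename=filename)
    except SyntaxError as exc:
        # A file that does not parse cannot be loaded as a DAG at all.
        return [f"{filename}:{exc.lineno}: syntax error: {exc.msg}"]

    # Count literal task_id="..." keyword arguments across every call.
    task_ids = Counter(
        kw.value.value
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        for kw in node.keywords
        if kw.arg == "task_id"
        and isinstance(kw.value, ast.Constant)
        and isinstance(kw.value.value, str)
    )
    return [
        f"{filename}: task_id {tid!r} is used {count} times"
        for tid, count in sorted(task_ids.items())
        if count > 1
    ]
```

A check like this could run in CI or a pre-commit hook before a DAG file ever reaches a scheduler; anything that needs real operator behaviour would still require a running Airflow.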

@kaxil kaxil added the airflow3.0:candidate Potential candidates for Airflow 3.0 label Jul 23, 2024
@kaxil kaxil mentioned this issue Jul 23, 2024
10 tasks
@dosubot dosubot bot added area:logging kind:feature Feature Requests labels Jul 23, 2024
@potiuk
Member

potiuk commented Jul 24, 2024

Let me add to it what I wrote about OTEL in https://lists.apache.org/thread/b2bvn8sbxfncg9qpvry9w142944mnlj6 - this might be a great tool to help with things. I am not sure if I want to take lone ownership of that one - maybe there will be someone else who would like to take a look and explore things as well - but I am happy to be deeply involved in that one.

@Dev-iL
Contributor

Dev-iL commented Jul 24, 2024

I'd like to be involved in this effort in some capacity. At least: brainstorming, qa, and documentation.

@omkar-foss
Collaborator

omkar-foss commented Jul 24, 2024

Happy to help out with some of the logging and error handling implementation.

The debug snapshot idea sounds very useful @potiuk. It may give a canonical view of the user's environment. I suppose Jaeger provides a similar tool called Anonymizer, which generates a shareable JSON of a trace - probably the same one that you were referring to in your mail. We can build our own debug snapshot util, or we can think of using this tool with Jaeger since it supports the existing OTEL metrics and traces.

@kaxil
Member Author

kaxil commented Jul 24, 2024

@Dev-iL Could I assign this GitHub issue to you? You can then lead the "scoping" part of this epic by talking to Jarek and others on Slack, the mailing list and other venues, and come back with a concrete proposal. Would you like to do that?

@Dev-iL
Contributor

Dev-iL commented Jul 24, 2024

@kaxil Honestly? It sounds a bit scary going from contributing minor patches to being responsible for an important feature in an upcoming release. I prefer to actively observe and learn, at least once, how something like this is done and take on a similar responsibility after I know how much time/work it requires.

@kaxil
Member Author

kaxil commented Jul 25, 2024

Absolutely, that's completely fine

@kaxil Honestly? It sounds a bit scary going from contributing minor patches to being responsible for an important feature in an upcoming release. I prefer to actively observe and learn, at least once, how something like this is done and take on a similar responsibility after I know how much time/work it requires.

@omkar-foss Do you want to take a stab at leading it?

@omkar-foss
Collaborator

@omkar-foss Do you want to take a stab at leading it?

@kaxil I would love to take the lead on this, but right now I suppose I'm still a rookie in the ways of the Airflow community.

So for this one, I'd prefer to assist all of you in every way possible, while trying to get a better grasp of the processes, codebase etc. Hope that's okay, thanks for considering me though 😇

@omkar-foss
Collaborator

omkar-foss commented Jul 31, 2024

Whoever takes on this task should conduct user research on the mailing list, Slack, Meetup or Airflow Summit to identify other common debugging problems that can be fixed.

@kaxil Any idea if there's a predefined user research template that has been used for prior releases?

If not, I'd like to propose the following for conducting the survey:

  1. We can create a survey form with questions pertaining to understanding the users' debugging journey. Probably can use something like SurveyMonkey.
  2. We can have groups of questions in the form, each group for a section in "Current Challenges" above. For example, the groups could be Logging, Traceback, Error Handling, Tooling & Integrations.
  3. We can have 3 to 5 questions for each group. Let's try to keep the survey as brief and concise as possible.
  4. The survey form can be circulated on the mailing list, Slack, and other places.
  5. We can collate the feedback from the survey form and prioritize items accordingly.

Please let me know your thoughts on this, thanks.

cc: @potiuk @Dev-iL

@Dev-iL
Contributor

Dev-iL commented Aug 1, 2024

@omkar-foss The main question is who the target audience of the research is, where possible answers are: maintainers, contributors, power users, general public, etc. Based on @kaxil's instructions, I'd say mostly power-users and above. If that is the case, I'm assuming most will be willing to participate in a survey, even if it has questions on topics people might not have an opinion on. If on the other hand, we're looking to get more participants, I think a literal survey is not the way, since people might open it, see how long it is, and just give up. That, of course, would be a terrible waste, because there are likely many use-cases that will not be represented.

For the above reason, I was thinking something like a feature voting platform (example1, example2) could be suitable - that way, if someone has a pain-point related to how a particular system works, they can look for existing posts or briefly explain what they have in mind (possibly with a template like a bug report) and allow others to vote or add to these suggestions. This also takes care of much of the aggregation work of the results.

@omkar-foss
Collaborator

Hey @Dev-iL, I agree with your reasoning above. I checked out the sample Feature Upvote board that you've shared above and it surely feels simpler (and quicker) to submit compared to a regular survey form.

I suppose we'll need an initial list of features on the upvote board for the participants to vote, would be great to hear if you've any thoughts around it.

Not sure how much help I can be on this, but I'm here so feel free to tag me if you need any assistance! :)

@potiuk
Member

potiuk commented Aug 5, 2024

@omkar-foss The main question is who the target audience of the research is, where possible answers are: maintainers, contributors, power users, general public, etc. Based on @kaxil's instructions, I'd say mostly power-users and above.

I'd say mostly power-users - yes, but also the tooling and debuggability should be targeted for "new" users. I think power-users mostly know their ways - they can do remote debugging, they know how to connect their IDEs to the code, they are able to even use pdb, py-spy and other tools while remote shelling to container instances etc.

But the goal here is to shorten the path from "I wrote some DAG and it does not work" to "how do I most effectively find, inspect and understand what's going on there" - for a user who just wrote their first few DAGs.

I think an assumption should be that that person has some Python experience, they have an IDE (PyCharm/ VSCode) and they are willing to follow some instructions on setting up things first - while ideally this should be one-time setup and they should be able to re-use it easily (and teach others how to do it).

If that is the case, I'm assuming most will be willing to participate in a survey, even if it has questions on topics people might not have an opinion on. If on the other hand, we're looking to get more participants, I think a literal survey is not the way, since people might open it, see how long it is, and just give up. That, of course, would be a terrible waste, because there are likely many use-cases that will not be represented.

I think yes - a survey is a good idea if well prepared, and those power-users might indeed be willing to share their experiences - we can even leverage the upcoming Airflow Summit and do some prizes / recognition and generally make a bit more fuss about it - so if we could do it still in August and maybe run the survey during the Summit as well, we could likely make it much more efficient.

@Dev-iL
Contributor

Dev-iL commented Aug 7, 2024

@potiuk @omkar-foss In the interest of moving ahead with this, I've made a Google Doc so we can start hashing out this survey collaboratively. Currently, it's publicly open for commenting - please send me your Google account via Slack so I can add you to the editors. If there are any privacy or other concerns, I don't mind moving the document to another platform.

@kaxil
Member Author

kaxil commented Aug 7, 2024

@Dev-iL Drop a mail to [email protected] too (Public archive: https://lists.apache.org/[email protected]). I am sure a lot of developers & users might want to add things to it, as well as in Airflow's Slack channel.

@Dev-iL
Contributor

Dev-iL commented Aug 11, 2024

It's been a few days, and the document hasn't seen any activity (outside of my own placeholder ideas), nor did anyone approach me for editing rights. If this trend continues, we won't have the survey ready on time.

@kaxil I just saw your comment on the mailing list. My plan was to first iterate on the survey's structure in docs, move to form once satisfied, then circulate it for responses.

@omkar-foss
Collaborator

I've made a Google Doc so we can start hashing out this survey collaboratively.

Doc looks good to me. Just one question/suggestion - will all questions be optional, or some mandatory and some optional? My suggestion would be to keep as many questions optional as possible, especially free text type questions (Q 2.4, 3.4, 4.4, 4.5). The reason being that not all people will have feedback suiting each question.

@Dev-iL
Contributor

Dev-iL commented Aug 12, 2024

@omkar-foss don't suggest - decide. I, too, think questions should be mostly optional. As for the contents of the survey - I don't believe it's ready. It currently has questions asking about general sentiments on things, and I don't know how actionable it will be unless users answer the free text questions en masse.

I'll give you an example: suppose user satisfaction with the airflow documentation comes out as "medium" overall - what do you do about this? OTOH, suppose we had a multi-select question that mentioned airflow features introduced in the last few 2.x releases, asking if users find the examples provided for them sufficient - now that would be something actionable. See what I mean?

It needs the eyes of someone who knows airflow and its power user community better than I do, to know the right questions to ask, potentially about specific components, plugins, use-cases, etc, so that feedback is insightful and useful.

@potiuk
Member

potiuk commented Aug 13, 2024

I think none of the maintainers know "power users" well. Almost by definition, we are neither running nor maintaining Airflow, and we do not have teams of people working together on DAGs. We are pretty much blindfolded when it comes to their needs and can at most guess what is troublesome for them or what can help them.

We mostly know how to debug Airflow itself, not how to debug Airflow DAGs. There are huge and significant differences in workflows, tooling and integration with IDEs. Same as with documentation - we are very POOR documentation writers, because a) we think about internals and not externals, b) we have a lot of knowledge and assumptions that readers might not have and we might fail to explain it to them, c) we tend to focus on HOW things are done, not WHAT our users might want to learn from it.

That's why we NEED power users themselves, and ideally people who work in teams, to have an opportunity to lead and decide on those questions and the questionnaire. We can definitely advise on decision making but we should not "lead" such a process.

@omkar-foss
Collaborator

omkar-foss commented Aug 15, 2024

That's why we NEED power users themselves, and ideally people who work in teams, to have an opportunity to lead and decide on those questions and the questionnaire.

Yes, we're on the same page. We're now in the phase of collecting feedback on finalizing the survey draft on Airflow Slack, hoping for quicker response and finding users who use Airflow along with their teams. Starting with #contributors channel for now (most maintainers, less teams activity), we'll eventually check with #new-contributors, #user-troubleshooting, #documentation, etc. (least maintainers, most teams activity) in that order, as required.

Would be great if we all can continue this conversation from this issue to Airflow Slack (on #contributors channel) so we can discuss quicker and move closer to rolling out the final user survey.

@amoghrajesh
Contributor

I also got a chance to review the doc and make some suggestions to it.
@omkar-foss @Dev-iL I too would like to collaborate on this issue if that's ok by you.
I think one very good opportunity would be to create the questionnaire, get it reviewed by folks like @kaxil / @potiuk and also someone from the product side, and distribute it as a QR code / link at the Airflow Summit, because the Summit will have people from varying backgrounds who are related in some way to Airflow.

@potiuk
Member

potiuk commented Aug 21, 2024

I also looked at it - and actually I have a comment a bit contrary to those early comments of @amoghrajesh who insisted on "choice" answers.

Since we are not really sure about the debugging usage in a number of places, I find that the rating questions (Often/Rare/Satisfied etc.) tell us very little - especially since we have no baseline to compare against.

I think this survey will be answered by a small number of people (not a few 100s but a few 10s maybe), so statistical aggregation of the data for such a small sample will be very misleading and useless - we will anyhow get mostly answers from people who are frustrated by their experiences, this is almost a given, so any stats based on the ranked answers will be a) super biased b) not very telling.

I think the biggest value of this survey is to get some concrete examples, stories, and ways of debugging Airflow that are unknown to us, and the "free form" answer is absolutely the most important insight we can get from it - we can learn for example that someone uses x.y.z tool in this specific way, and that they miss this or that feature there - but we will never be able to ask the right question for it - especially one that has a "rated" answer.

So I think pretty much all the questions there should be of the type:

  • I have no problems with it - this works fine
  • I have a problem with it - here is a detailed description

Or

  • I do not use any tools
  • I use some tools -> here describe what you are using

And I think the choice should be in most cases binary.

Otherwise I'd find very little value in finding out that 15 of 20 people find that information is often misleading without any additional explanation.

So I think all the questions that have 5 choices of satisfaction should be reduced to 2 choices ("not my problem" / "my problem") and the second should be accompanied by an obligatory explanation why. Yes, it will make the survey longer to fill in, and yes, it will decrease the number of responses we get, but I feel this will be way more useful for us.
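
To put a rough number on the small-sample concern above, here is a minimal sketch (not part of the original discussion) that computes a 95% Wilson score interval for the "15 of 20" case. It assumes respondents are a simple random sample, which the comment argues they are not, so the real uncertainty would be larger still. The function name wilson_interval is an illustrative helper, not something from the survey tooling.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# The "15 of 20 people" example from the comment above.
low, high = wilson_interval(15, 20)
print(f"15 of 20 -> plausible share between {low:.0%} and {high:.0%}")
```

With 20 answers the plausible range spans roughly 35 percentage points, which supports treating the ratings as weak evidence and the free-form explanations as the main output.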

@kaxil
Member Author

kaxil commented Sep 2, 2024

fyi, following are the docs that have actionable next steps based on the questions (and options) in the survey:

@kaxil
Member Author

kaxil commented Sep 2, 2024

The survey form is ready: https://s.apache.org/airflow-debugging-survey2024 , thanks to @Dev-iL , @omkar-foss & @amoghrajesh

@kaxil
Member Author

kaxil commented Sep 5, 2024

Thanks to @Dev-iL -- we have a QR code that links to the survey

2024_survey

kaxil added a commit to astronomer/airflow that referenced this issue Oct 18, 2024
This will allow him to interact with the GitHub project for sig-debugging: apache#40975
kaxil added a commit that referenced this issue Oct 18, 2024
This will allow him to interact with the GitHub project for sig-debugging: #40975
@omkar-foss
Collaborator

Hi all! As discussed, we'll be tracking all issues related to the Airflow Debugging Story (based on debugging survey responses) on this project: https://github.com/orgs/apache/projects/421

@potiuk
Member

potiuk commented Nov 11, 2024

Also see the #40802 (comment) discussion. I believe that with OTEL and traces (and even including a limited set of logs in the traces) we are closer to addressing a big gap in debugging Airflow, where we can give our users a tool to provide us with way more diagnostic information that will allow us to analyse, diagnose, and fix many problems much more efficiently.

ellisms pushed a commit to ellisms/airflow that referenced this issue Nov 13, 2024
This will allow him to interact with the GitHub project for sig-debugging: apache#40975