Send important executor logs to task logs #40468

Status: Open (wants to merge 17 commits into main)
Conversation

@vincbeck (Contributor) commented Jun 27, 2024:

If the executor fails to start a task, the user will not see any logs in the UI because the task never started. This PR leverages the TaskContextLogger implemented in #32646: it forwards the important error messages to the task logs when an executor fails to execute a task.

cc @o-nikolas
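
For illustration, a minimal sketch of the intended usage pattern, based on the diffs in this PR. `MyExecutor` and `_report_queue_failure` are hypothetical names, and the sketch assumes `BaseExecutor` exposes the `task_context_logger` attribute added here:

```python
from airflow.executors.base_executor import BaseExecutor


class MyExecutor(BaseExecutor):  # hypothetical executor, for illustration only
    def _report_queue_failure(self, task_instance) -> None:
        # Forward the error into the task instance's own log so the user can
        # see it in the UI, even though the task itself never started.
        self.task_context_logger.error(
            "could not queue task %s", task_instance.key, ti=task_instance
        )
```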



@o-nikolas (Contributor) left a comment:

Left a few comments. Also what about the Batch executor? Do you plan on fast following with that after this PR?

airflow/executors/base_executor.py (resolved)
airflow/utils/log/task_context_logger.py (outdated, resolved)
@vincbeck (Contributor, Author) commented Jun 27, 2024:

> Left a few comments. Also what about the Batch executor? Do you plan on fast following with that after this PR?

Yep, my plan is to first do it with one executor, gather feedback, address it, etc. Then, once the direction is set, I'll create another PR for the BatchExecutor.

@ferruzzi (Contributor) commented:
Had a look and I don't have anything extra to add; I like the direction this is going. I'd need to see some examples to have any real opinion on the question of one or one-per-executor but otherwise, I think it looks good after that one change to the session creation that Niko mentioned.

```diff
@@ -386,14 +385,16 @@ def attempt_task_runs(self):
        )
        self.pending_tasks.append(ecs_task)
    else:
-       self.log.error(
```
Contributor:
@syedahsn You should have a review of this if you get a chance. The failure reason handling has been modified here.

```diff
@@ -122,6 +123,7 @@ class BaseExecutor(LoggingMixin):
    job_id: None | int | str = None
    name: None | ExecutorName = None
    callback_sink: BaseCallbackSink | None = None
+   task_context_logger: TaskContextLogger
```
Contributor:
The class TaskContextLogger is private (see the Sphinx directive in the docstring), so really we should either make it public or not use it in providers.

The intention was to keep it internal, since it's not something users should really mess with, and in that scenario it's best to retain the flexibility to change it at will.

Possible solution: is there a way this can be managed through the executor interface?

@vincbeck (Contributor, Author) commented Jul 3, 2024:

> Possible solution: is there a way this can be managed through the executor interface?

How? In the end, we'll have to use TaskContextLogger in providers. Unless you want to copy TaskContextLogger into the providers?

I am not sure it is a huge risk to make it public; the only public methods of this class are the log-emitting methods such as error, info, etc. The signatures of these methods are pretty generic, so the risk of a breaking change is pretty low?

Contributor:

> How? In the end, we'll have to use TaskContextLogger in providers. Unless you want to copy TaskContextLogger into the providers?

Well, the first thing that occurred to me was to have an interface on the executor (i.e. a method) that would be something like `send_task_log`. This could push to a queue that the scheduler processes periodically. It would not duplicate the task context logger class, because that has substantially more going on; it would just provide an interface for executors to dump messages to TI logs without worrying about the underlying mechanism.

But it's just an idea, if we want to explore keeping TaskContextLogger private.
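
For the sake of discussion, a rough sketch of that idea; every name here (`send_task_log`, `_task_log_buffer`, `drain_task_logs`) is hypothetical and not an existing Airflow API:

```python
from collections import deque


class ExecutorLogInterfaceSketch:
    """Hypothetical sketch of a 'send_task_log' interface on the executor."""

    def __init__(self) -> None:
        # Messages buffered by the executor; something outside the hot path of
        # the scheduler loop would drain this periodically and ship them to the
        # task instance logs, hiding the underlying mechanism from providers.
        self._task_log_buffer: deque = deque()

    def send_task_log(self, ti_key, message: str, level: str = "error") -> None:
        # Providers only ever call this method; they never touch TaskContextLogger.
        self._task_log_buffer.append((ti_key, level, message))

    def drain_task_logs(self, shipper) -> None:
        # ``shipper`` is whatever actually writes to the task log
        # (e.g. TaskContextLogger under the hood).
        while self._task_log_buffer:
            ti_key, level, message = self._task_log_buffer.popleft()
            shipper(ti_key, level, message)
```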

Contributor:

> I am not sure it is a huge risk to make it public; the only public methods of this class are the log-emitting methods such as error, info, etc. The signatures of these methods are pretty generic, so the risk of a breaking change is pretty low?

Yeah, the other part of what would constitute "the interface" would be, I suppose, the params, e.g. `def __init__(self, component_name: str, call_site_logger: Logger | None = None)`, and such. I am just very wary of making things public that aren't really for users, because it is such an impediment when we want to change things.

One change we need to make with this interface, which I think was missed, is that each file needs a random string in its name because, especially as more usages of this are added, there will be times when multiple such messages are added for a single TI, and currently, if they come from the same "component", they will collide / override each other.

@vincbeck (Contributor, Author):

> One change we need to make with this interface, which I think was missed, is that each file needs a random string in its name because, especially as more usages of this are added, there will be times when multiple such messages are added for a single TI, and currently, if they come from the same "component", they will collide / override each other.

Are you sure about this? TaskContextLogger is currently being used in the scheduler with a static component name; thus, if you're right, that'd mean scheduler logs override each other? Or did I misunderstand what you said? https://github.com/apache/airflow/blob/main/airflow/jobs/scheduler_job_runner.py#L240

@dstandish (Contributor) commented Jul 3, 2024:

> Are you sure about this? TaskContextLogger is currently being used in the scheduler with a static component name; thus, if you're right, that'd mean scheduler logs override each other? Or did I misunderstand what you said? https://github.com/apache/airflow/blob/main/airflow/jobs/scheduler_job_runner.py#L240

Yep, pretty sure. But there may be a misunderstanding too. The component name gets added to the end of the full log name (which includes the normal task log filename), so there would only be a collision if there were two exceptional events for the same task instance and the same try number. Pretty unlikely, I'm sure, and perhaps logically impossible with the events currently logged. But still, we should add a random string to the suffix to make collisions truly impossible (note: not suggesting that in this PR), because it becomes more important as more events are logged. I guess it might not be a change to the params or anything, just a change in behavior.
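
For example, the de-duplication could be as simple as adding a short random token to the per-component suffix when building the log file name. This is a sketch only; the actual file-naming logic inside TaskContextLogger may look different:

```python
import secrets


def log_name_with_component(base_log_filename: str, component_name: str) -> str:
    # Sketch: a short random token makes two exceptional events for the same
    # task instance, try number, and component land in distinct files instead
    # of colliding / overriding each other.
    token = secrets.token_hex(4)
    return f"{base_log_filename}.{component_name}.{token}.log"


# e.g. "<task log filename>.ecs_executor.3f9a1c2b.log"
```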

@vincbeck (Contributor, Author):

Oh, I see! Thanks for the explanation. I agree that we should do that, but as you mentioned, I think it is better to do it in a separate PR. I can do it as a follow-up PR.

```diff
@@ -149,7 +155,7 @@ def queue_command(
        self.log.info("Adding to queue: %s", command)
        self.queued_tasks[task_instance.key] = (command, priority, queue, task_instance)
    else:
-       self.log.error("could not queue task %s", task_instance.key)
+       self.task_context_logger.error("could not queue task %s", task_instance.key, ti=task_instance)
```
@dstandish (Contributor) commented Jul 3, 2024:

Is this truly an exceptional circumstance? I ask because we should be careful about adding too many usages of task_context_logger: generally speaking, each call results in uploading a file to S3. Think about what happens to performance if the scheduler starts having to upload files for many TIs in the scheduler loop.

Ideally, people use a better logging system like Elasticsearch, where it's at least possible that "shipping logs" just means printing to stdout, but I don't think that's true for the majority of users.

Perhaps this is another argument for considering putting it in a queue that is managed separately from the main scheduler loop. Just a thought.

Contributor:

Here's another thought for you: perhaps, instead of using the task context logger for every failure, we should add a reason field to go along with state, so that we can attach a small message about what happened / why the task ended up in the state it did. Then we would not need to use this unfortunately clunky and not-super-good-for-performance feature.
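
To make the idea concrete, a purely illustrative sketch of the kind of payload a reason field could carry alongside the state; none of these names exist in Airflow today:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TaskStateUpdate:
    """Hypothetical shape of a state update carrying a small 'reason' message."""

    ti_key: tuple  # hypothetical key, e.g. (dag_id, task_id, run_id, try_number)
    state: str  # e.g. "failed"
    reason: str | None = None  # short, human-readable explanation


# Usage sketch: the UI could surface 'reason' next to the state, so no log
# file has to be uploaded just to explain the failure.
update = TaskStateUpdate(
    ti_key=("my_dag", "my_task", "manual__2024-06-27", 1),
    state="failed",
    reason="executor could not start the ECS task",
)
```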

@vincbeck (Contributor, Author):

I like the idea and I think it can be valuable for Airflow users, but that'd deserve an AIP (or at least a message to the dev list to ask for feedback). Though, I still think we need to send logs to the task logs from executors (and also from other components of Airflow). If the executor is unable to do something about a given task, the user should know about it with all the possible details (stack trace), so I think we still have this problem to solve.

```diff
@@ -320,10 +320,11 @@ def __handle_failed_task(self, task_arn: str, reason: str):
            )
        )
    else:
-       self.log.error(
+       self.task_context_logger.error(
```
Contributor:

Do we really need to send this message? If the task has failed, the user will see that it has failed.

Contributor:

Is this something different from task retries? Something like executor submit tries?

@vincbeck (Contributor, Author):

> Do we really need to send this message? If the task has failed, the user will see that it has failed.

Fair

@vincbeck (Contributor, Author) commented Jul 3, 2024:

> Is this something different from task retries? Something like executor submit tries?

Yes, it is a retry mechanism for when the ECS task itself fails (and is thus unable to start the Airflow task). If the ECS task succeeds and then the Airflow task fails, the task retry comes into play.

```diff
@@ -386,14 +385,16 @@ def attempt_task_runs(self):
        )
        self.pending_tasks.append(ecs_task)
    else:
-       self.log.error(
-           "ECS task %s has failed a maximum of %s times. Marking as failed",
+       self.task_context_logger.error(
```
Contributor:

Similar comment here; just consider it everywhere. Are you introducing usages of TaskContextLogger that (1) are not really necessary, because the user will already get enough information, or (2) just occur too frequently, so that the scheduler will now have to upload logs to S3 routinely during its loop?

Because of the performance considerations, the feature should really only be used when the user would otherwise have no information about what happened.

@vincbeck (Contributor, Author):

That's a good call. I can go through the different scenarios again and narrow them down to a few. TBH, I just went through all the self.log.error( calls, thinking that whatever error happens on the executor side, the user should know about it. But I guess I can apply a slightly smarter filter.

@vincbeck (Contributor, Author):

I narrowed it down to one scenario in base_executor and two in ecs_executor. Let me know what you think.

@dstandish (Contributor) commented:
@RNHTTR you might be a good person to review this, from the perspective of "are these circumstances where the user would otherwise not have insight into what happened to the task" and "are these types of errors too frequent", re the impact on the scheduler.

Labels: area:Executors-core (LocalExecutor & SequentialExecutor), area:logging, area:providers, provider:amazon-aws (AWS/Amazon related issues)
Projects: None yet

Development: Successfully merging this pull request may close these issues: None yet

7 participants