Send important executor logs to task logs #40468

Status: Open (wants to merge 17 commits into main)
Conversation

@vincbeck (Contributor) commented Jun 27, 2024:

If the executor fails to start a task, the user will not see any logs in the UI because the task never started. This PR leverages the TaskContextLogger implemented in #32646: it forwards the important error messages to the task logs when an executor fails to execute a task.

cc @o-nikolas
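
For illustration, a minimal sketch of the intended usage pattern, based on the diffs in this PR. `MyExecutor` and `_report_queue_failure` are hypothetical names, and the sketch assumes `BaseExecutor` exposes the `task_context_logger` attribute added here:

```python
from airflow.executors.base_executor import BaseExecutor


class MyExecutor(BaseExecutor):  # hypothetical executor, for illustration only
    def _report_queue_failure(self, task_instance) -> None:
        # Forward the error into the task instance's own log so the user can
        # see it in the UI, even though the task itself never started.
        self.task_context_logger.error(
            "could not queue task %s", task_instance.key, ti=task_instance
        )
```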



@o-nikolas (Contributor) left a comment:

Left a few comments. Also what about the Batch executor? Do you plan on fast following with that after this PR?

airflow/executors/base_executor.py (resolved)
airflow/utils/log/task_context_logger.py (outdated, resolved)
@vincbeck (Contributor, Author) commented Jun 27, 2024:

> Left a few comments. Also what about the Batch executor? Do you plan on fast following with that after this PR?

Yep, my plan is to first do it with one executor, gather feedback, address it, etc. Then, once the direction is set, I'll create another PR for the BatchExecutor.

@ferruzzi (Contributor) commented:
Had a look and I don't have anything extra to add; I like the direction this is going. I'd need to see some examples to have any real opinion on the question of one or one-per-executor but otherwise, I think it looks good after that one change to the session creation that Niko mentioned.

```diff
@@ -386,14 +385,16 @@ def attempt_task_runs(self):
        )
        self.pending_tasks.append(ecs_task)
    else:
-       self.log.error(
```
Contributor:
@syedahsn You should have a review of this if you get a chance. The failure reason handling has been modified here.

```diff
@@ -122,6 +123,7 @@ class BaseExecutor(LoggingMixin):
    job_id: None | int | str = None
    name: None | ExecutorName = None
    callback_sink: BaseCallbackSink | None = None
+   task_context_logger: TaskContextLogger
```
Contributor:
The class TaskContextLogger is private (see the Sphinx directive in the docstring), so really we should either make it public or not use it in providers.

The intention was to keep it internal, since it's not something users should really mess with, and in that scenario it's best to retain the flexibility to change it at will.

Possible solution: is there a way this can be managed through the executor interface?

@vincbeck (Contributor, Author) commented Jul 3, 2024:

> Possible solution: is there a way this can be managed through the executor interface?

How? In the end, we'll have to use TaskContextLogger in providers. Unless you want to copy TaskContextLogger into the providers?

I am not sure it is a huge risk to make it public; the only public methods of this class are the log-emitting methods such as error, info, etc. The signatures of these methods are pretty generic, so the risk of a breaking change is pretty low?

Contributor:

> How? In the end, we'll have to use TaskContextLogger in providers. Unless you want to copy TaskContextLogger into the providers?

Well, the first thing that occurred to me was to have an interface on the executor (i.e. a method) that would be something like `send_task_log`. This could push to a queue that the scheduler processes periodically. It would not duplicate the task context logger class, because that has substantially more going on; it would just provide an interface for executors to dump messages to TI logs without worrying about the underlying mechanism.

But it's just an idea, if we want to explore keeping TaskContextLogger private.
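
For the sake of discussion, a rough sketch of that idea; every name here (`send_task_log`, `_task_log_buffer`, `drain_task_logs`) is hypothetical and not an existing Airflow API:

```python
from collections import deque


class ExecutorLogInterfaceSketch:
    """Hypothetical sketch of a 'send_task_log' interface on the executor."""

    def __init__(self) -> None:
        # Messages buffered by the executor; something outside the hot path of
        # the scheduler loop would drain this periodically and ship them to the
        # task instance logs, hiding the underlying mechanism from providers.
        self._task_log_buffer: deque = deque()

    def send_task_log(self, ti_key, message: str, level: str = "error") -> None:
        # Providers only ever call this method; they never touch TaskContextLogger.
        self._task_log_buffer.append((ti_key, level, message))

    def drain_task_logs(self, shipper) -> None:
        # ``shipper`` is whatever actually writes to the task log
        # (e.g. TaskContextLogger under the hood).
        while self._task_log_buffer:
            ti_key, level, message = self._task_log_buffer.popleft()
            shipper(ti_key, level, message)
```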

Contributor:

> I am not sure it is a huge risk to make it public; the only public methods of this class are the log-emitting methods such as error, info, etc. The signatures of these methods are pretty generic, so the risk of a breaking change is pretty low?

Yeah, the other part of what would constitute "the interface" would be, I suppose, the params, e.g. `def __init__(self, component_name: str, call_site_logger: Logger | None = None)`, and such. I am just very wary of making things public that aren't really for users, because it is such an impediment when we want to change things.

One change we need to make with this interface, which I think was missed, is that each file needs a random string in its name because, especially as more usages of this are added, there will be times when multiple such messages are added for a single TI, and currently, if they come from the same "component", they will collide / override each other.

@vincbeck (Contributor, Author):

> One change we need to make with this interface, which I think was missed, is that each file needs a random string in its name because, especially as more usages of this are added, there will be times when multiple such messages are added for a single TI, and currently, if they come from the same "component", they will collide / override each other.

Are you sure about this? TaskContextLogger is currently being used in the scheduler with a static component name; thus, if you're right, that'd mean scheduler logs override each other? Or did I misunderstand what you said? https://github.com/apache/airflow/blob/main/airflow/jobs/scheduler_job_runner.py#L240

@dstandish (Contributor) commented Jul 3, 2024:

> Are you sure about this? TaskContextLogger is currently being used in the scheduler with a static component name; thus, if you're right, that'd mean scheduler logs override each other? Or did I misunderstand what you said? https://github.com/apache/airflow/blob/main/airflow/jobs/scheduler_job_runner.py#L240

Yep, pretty sure. But there may be a misunderstanding too. The component name gets added to the end of the full log name (which includes the normal task log filename), so there would only be a collision if there were two exceptional events for the same task instance and the same try number. Pretty unlikely, I'm sure, and perhaps logically impossible with the events currently logged. But still, we should add a random string to the suffix to make collisions truly impossible (note: not suggesting that in this PR), because it becomes more important as more events are logged. I guess it might not be a change to the params or anything, just a change in behavior.
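
For example, the de-duplication could be as simple as adding a short random token to the per-component suffix when building the log file name. This is a sketch only; the actual file-naming logic inside TaskContextLogger may look different:

```python
import secrets


def log_name_with_component(base_log_filename: str, component_name: str) -> str:
    # Sketch: a short random token makes two exceptional events for the same
    # task instance, try number, and component land in distinct files instead
    # of colliding / overriding each other.
    token = secrets.token_hex(4)
    return f"{base_log_filename}.{component_name}.{token}.log"


# e.g. "<task log filename>.ecs_executor.3f9a1c2b.log"
```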

@vincbeck (Contributor, Author):

Oh, I see! Thanks for the explanation. I agree that we should do that, but as you mentioned, I think it is better to do it in a separate PR. I can do it as a follow-up PR.

```diff
@@ -149,7 +155,7 @@ def queue_command(
        self.log.info("Adding to queue: %s", command)
        self.queued_tasks[task_instance.key] = (command, priority, queue, task_instance)
    else:
-       self.log.error("could not queue task %s", task_instance.key)
+       self.task_context_logger.error("could not queue task %s", task_instance.key, ti=task_instance)
```
@dstandish (Contributor) commented Jul 3, 2024:

Is this truly an exceptional circumstance? I ask because we should be careful about adding too many usages of task_context_logger: generally speaking, each call results in uploading a file to S3. Think about what happens to performance if the scheduler starts having to upload files for many TIs in the scheduler loop.

Ideally, people use a better logging system like Elasticsearch, where it's at least possible that "shipping logs" just means printing to stdout, but I don't think that's true for the majority of users.

Perhaps this is another argument for considering putting it in a queue that is managed separately from the main scheduler loop. Just a thought.

Contributor:

Here's another thought for you: perhaps, instead of using the task context logger for every failure, we should add a reason field to go along with state, so that we can attach a small message about what happened / why the task ended up in the state it did. Then we would not need to use this unfortunately clunky and not-super-good-for-performance feature.
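
To make the idea concrete, a purely illustrative sketch of the kind of payload a reason field could carry alongside the state; none of these names exist in Airflow today:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TaskStateUpdate:
    """Hypothetical shape of a state update carrying a small 'reason' message."""

    ti_key: tuple  # hypothetical key, e.g. (dag_id, task_id, run_id, try_number)
    state: str  # e.g. "failed"
    reason: str | None = None  # short, human-readable explanation


# Usage sketch: the UI could surface 'reason' next to the state, so no log
# file has to be uploaded just to explain the failure.
update = TaskStateUpdate(
    ti_key=("my_dag", "my_task", "manual__2024-06-27", 1),
    state="failed",
    reason="executor could not start the ECS task",
)
```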

@vincbeck (Contributor, Author):

I like the idea and I think it can be valuable for Airflow users, but that'd deserve an AIP (or at least a message to the dev list to ask for feedback). Though, I still think we need to send logs to the task logs from executors (and also from other components of Airflow). If the executor is unable to do something about a given task, the user should know about it with all the possible details (stack trace), so I think we still have this problem to solve.

```diff
@@ -320,10 +320,11 @@ def __handle_failed_task(self, task_arn: str, reason: str):
            )
        )
    else:
-       self.log.error(
+       self.task_context_logger.error(
```
Contributor:

Do we really need to send this message? If the task has failed, the user will see that it has failed.

Contributor:

Is this something different from task retries? Something like executor submit tries?

@vincbeck (Contributor, Author):

> Do we really need to send this message? If the task has failed, the user will see that it has failed.

Fair

@vincbeck (Contributor, Author) commented Jul 3, 2024:

> Is this something different from task retries? Something like executor submit tries?

Yes, it is a retry mechanism for when the ECS task itself fails (and is thus unable to start the Airflow task). If the ECS task succeeds and then the Airflow task fails, the task retry comes into play.

```diff
@@ -386,14 +385,16 @@ def attempt_task_runs(self):
        )
        self.pending_tasks.append(ecs_task)
    else:
-       self.log.error(
-           "ECS task %s has failed a maximum of %s times. Marking as failed",
+       self.task_context_logger.error(
```
Contributor:

Similar comment here; just consider it everywhere. Are you introducing usages of TaskContextLogger that (1) are not really necessary, because the user will already get enough information, or (2) just occur too frequently, so that the scheduler will now have to upload logs to S3 routinely during its loop?

Because of the performance considerations, the feature should really only be used when the user would otherwise have no information about what happened.

@vincbeck (Contributor, Author):

That's a good call. I can go through the different scenarios again and narrow them down to a few. TBH, I just went through all the self.log.error( calls, thinking that whatever error happens on the executor side, the user should know about it. But I guess I can apply a slightly smarter filter.

@vincbeck (Contributor, Author):

I narrowed it down to one scenario in base_executor and two in ecs_executor. Let me know what you think.

@dstandish (Contributor) commented:
@RNHTTR you might be a good person to review this, from the perspective of "are these circumstances where the user would otherwise not have insight into what happened to the task" and "are these types of errors too frequent", re the impact on the scheduler.

Labels: area:Executors-core (LocalExecutor & SequentialExecutor), area:logging, area:providers, provider:amazon-aws (AWS/Amazon related issues)
Projects: None yet

Development: Successfully merging this pull request may close these issues: None yet

7 participants