Fix ExecutionMode.WATCHER deadlock in Airflow 3.0 & 3.1#2087
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR addresses a thread-safety issue in Airflow 3.0 and 3.1 where xcom_push operations can cause deadlocks when running in ExecutionMode.WATCHER mode with multi-threaded dbt operations.
Key Changes:
- Introduces a thread-safe wrapper function
safe_xcom_pushthat uses a lock to serialize XCom operations - Replaces all direct
xcom_pushcalls in the watcher operator with the thread-safe wrapper - Adds necessary imports for threading and Airflow 3.x task instance types
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2087 +/- ##
=======================================
Coverage 97.80% 97.80%
=======================================
Files 91 91
Lines 5871 5873 +2
=======================================
+ Hits 5742 5744 +2
Misses 129 129 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
pankajastro
left a comment
There was a problem hiding this comment.
Looks good! It might be nice to add a regression test to make sure this issue doesn’t come back in the future.
I fully agree, @pankajastro. I discussed this with @ashb, but since it would happen only from time to time, and I couldn't reproduce it with Airflow |
|
|
||
| def safe_xcom_push(task_instance: TaskInstance, key: str, value: Any) -> None: | ||
| """ | ||
| Safely set an XCom value in a thread-safe manner in Airflow 3.0 and 3.1. |
There was a problem hiding this comment.
Do we need to mention specific Airflow versions here? Because we are not altering the behavior here based on Airflow versions here.
There was a problem hiding this comment.
I think we should mention we're using this because of these versions bug, but it is also harmless to use for others. It simplifies our code-base to not have to handle this differently based on Airflow version. If Airflow fixes this, we can then do the version check
Airflow 3.0 and 3.1 are currently not thread-safe when running `xcom_push`, which leads to deadlocks. The problem description and troubleshooting are detailed in ticket #2057. Unfortunately, since this issue is not deterministic, I was unable to create an automated test to reproduce and prevent regressions. That said, given that the producer node hangs indefinitely until the task times out, and given that we were no longer able to reproduce the issue after the current changes, I firmly believe we should proceed with this change. Closes: #2057 Co-authored-by: Ash Berlin-Taylor <ash@astronomer.io>
Bug fixes * Fix ``ExecutionMode.WATCHER`` deadlock in Airflow 3.0 & 3.1 by @tatiana in #2087 * Fix ``ExecutionMode.AIRFLOW_ASYNC`` ``TaskGroup`` XCom issue by @tatiana in #2088 * Guard watcher callback exceptions to avoid hanging producer tasks by @pankajkoti in #2101 * Fix SQL templated field rendering for dynamically mapped tasks in Airflow 2 by @tatiana in #2119 Enhancements * Remove usage of contextmanager in plugins for accessing connections in Airflow >= 3.1.2 by @pankajkoti in #2073 Docs * Improve ``ExecutionMode.AIRFLOW_ASYNC`` docs by @tatiana in #2103 * Add note about experimenting threads count for the Watcher Execution mode by @pankajkoti in #2083 * Fix minor documentation formatting issue by @dnskrv in #2098 * Correct example YAML key from ``operator_args`` to ``operator_kwargs`` by @jx2lee in #2091 Others * Fix broken CI due to fastapi incompatibility with cadwyn for Airflow 3 by @pankajkoti in #2076 * pre-commit autoupdate in #2078, #2104 related: astronomer/oss-integrations-private#272
Bug fixes * Fix ``ExecutionMode.WATCHER`` deadlock in Airflow 3.0 & 3.1 by @tatiana in #2087 * Fix ``ExecutionMode.AIRFLOW_ASYNC`` ``TaskGroup`` XCom issue by @tatiana in #2088 * Guard watcher callback exceptions to avoid hanging producer tasks by @pankajkoti in #2101 * Fix SQL templated field rendering for dynamically mapped tasks in Airflow 2 by @tatiana in #2119 Enhancements * Remove usage of contextmanager in plugins for accessing connections in Airflow >= 3.1.2 by @pankajkoti in #2073 Docs * Improve ``ExecutionMode.AIRFLOW_ASYNC`` docs by @tatiana in #2103 * Add note about experimenting threads count for the Watcher Execution mode by @pankajkoti in #2083 * Fix minor documentation formatting issue by @dnskrv in #2098 * Correct example YAML key from ``operator_args`` to ``operator_kwargs`` by @jx2lee in #2091 Others * Fix broken CI due to fastapi incompatibility with cadwyn for Airflow 3 by @pankajkoti in #2076 * pre-commit autoupdate in #2078, #2104 Closes: astronomer/oss-integrations-private#272
Airflow 3.0 and 3.1 are currently not thread-safe when running
xcom_push, which leads to deadlocks.The problem description and troubleshooting are detailed in ticket #2057.
Unfortunately, since this issue is not deterministic, I was unable to create an automated test to reproduce and prevent regressions. That said, given that the producer node hangs indefinitely until the task times out, and given that we were no longer able to reproduce the issue after the current changes, I firmly believe we should proceed with this change.
Closes: #2057
Co-authored-by: Ash Berlin-Taylor ash@astronomer.io