-
Notifications
You must be signed in to change notification settings - Fork 294
Avoid the Watcher consumers retrying to run a command that is being run by the producer #2557
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -28,6 +28,7 @@ | |
| is_dbt_node_status_skipped, | ||
| is_dbt_node_status_success, | ||
| is_dbt_node_status_terminal, | ||
| is_producer_task_terminated, | ||
| safe_xcom_push, | ||
| xcom_set_lock, | ||
| ) | ||
|
|
@@ -583,6 +584,28 @@ def _cache_compiled_sql(self, ti: Any, context: Context) -> None: | |
| if hasattr(self, "_override_rtif"): | ||
| self._override_rtif(context) | ||
|
|
||
| def _handle_retry(self, try_number: int, producer_task_state: str | None, context: Context) -> bool | None: | ||
| """Handle sensor retry by checking whether the producer is still active. | ||
|
|
||
| Returns the fallback result if the producer has terminated, or None if | ||
| the sensor should continue polling (producer still active). | ||
| """ | ||
| if is_producer_task_terminated(producer_task_state): | ||
| # Producer finished — this is either an automatic retry after | ||
| # the producer completed or a manual task clear from the UI. | ||
| # Fall back to a non-watcher run. | ||
| return self._fallback_to_non_watcher_run(try_number, context) | ||
| # Producer is still active — the sensor likely timed out while the | ||
| # producer was still working. Keep polling instead of launching a | ||
| # duplicate dbt run. | ||
| logger.info( | ||
| "Try #%s but producer '%s' is still %s — continuing to poll instead of fallback.", | ||
| try_number, | ||
| self.producer_task_id, | ||
| producer_task_state or "unknown", | ||
| ) | ||
| return None | ||
|
|
||
| def poke(self, context: Context) -> bool: | ||
| """ | ||
| Checks the status of a dbt node (model or aggregated tests) by pulling relevant XComs from the producer task. | ||
|
|
@@ -605,10 +628,13 @@ def poke(self, context: Context) -> bool: | |
| self.model_unique_id, | ||
| ) | ||
|
|
||
| producer_task_state = self._get_producer_task_status(context) | ||
|
|
||
| if try_number > 1: | ||
| return self._fallback_to_non_watcher_run(try_number, context) | ||
| retry_result = self._handle_retry(try_number, producer_task_state, context) | ||
| if retry_result is not None: | ||
| return retry_result | ||
|
|
||
| producer_task_state = self._get_producer_task_status(context) | ||
| if not self.is_test_sensor: | ||
| self._log_startup_events(ti) | ||
| status = self._get_node_status(ti, context) | ||
|
|
@@ -624,8 +650,13 @@ def poke(self, context: Context) -> bool: | |
| ) | ||
| _log_dbt_event(dbt_events) | ||
|
|
||
| if status is None: | ||
| return self._evaluate_node_status(status, producer_task_state, try_number, context) | ||
|
|
||
| def _evaluate_node_status( | ||
|
Comment on lines
+653
to
+655
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Moved this to a new method, as this was getting complained about code complexity
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @pankajkoti Would there be an alternative way to refactor? The function name (_evaluate_node_status) does not seem accurate with what it is doing, which is actually to:
These steps feel quite critical and relevant in the
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the name broadly reflects the method’s primary responsibility: it evaluates the node status and determines the appropriate poke outcome. By "evaluate/evaluation", I’m including checking dependencies like the producer task state and deciding on actions such as fallback, retry, raise, or succeed. The producer state check and fallback are part of handling cases where no direct status is available. That said, I agree it’s doing a few important things, so I’m open to renaming if you have a more precise suggestion. On extracting retrieval/logging: those are already separated into helper methods ( Happy to refactor further if you feel a different split would improve readability.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This function combines decision-making, state retrieval, and side effects (running dbt). While it is small, it has a lot of responsibility. I'm worried about future readability and testability. Maybe we could break it down into two smaller methods with clear responsibilities, something along the lines:
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looking at the method again, it doesn’t actually do state retrieval. Both I feel splitting it would introduce indirection without reducing coupling; the action method would still need the same parameters (try_number, context) for the fallback path. Right now, the full flow is visible in one place:
If we split this into decision + handler, a reader would need to jump between methods and map decisions to actions, which I think hurts readability more than it helps at this size. I’m happy to rename the method if the current name doesn’t reflect its behaviour well, open to suggestions there. But I’d prefer not to split it unless the logic grows or we see reuse emerging. |
||
| self, status: Any, producer_task_state: str | None, try_number: int, context: Context | ||
| ) -> bool: | ||
| """Evaluate the dbt node status and return the poke result.""" | ||
| if status is None: | ||
| if producer_task_state == "failed": | ||
| if self.poke_retry_number > 0: | ||
| raise AirflowException( | ||
|
|
@@ -636,13 +667,12 @@ def poke(self, context: Context) -> bool: | |
| return self._fallback_to_non_watcher_run(try_number, context) | ||
|
|
||
| self.poke_retry_number += 1 | ||
|
|
||
| return False | ||
| elif is_dbt_node_status_skipped(status): | ||
|
|
||
| if is_dbt_node_status_skipped(status): | ||
| raise AirflowSkipException( | ||
| f"{self._resource_label} '{self.model_unique_id}' was skipped by the dbt command." | ||
| ) | ||
| elif is_dbt_node_status_success(status): | ||
| if is_dbt_node_status_success(status): | ||
| return True | ||
| else: | ||
| raise AirflowException(f"{self._resource_label} '{self.model_unique_id}' finished with status '{status}'") | ||
| raise AirflowException(f"{self._resource_label} '{self.model_unique_id}' finished with status '{status}'") | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it would be a cleaner interface if we renamed things to
is_producer_still_running.Retry is a consequence depending on other variables - including if the producer is actively running the task
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe
_handle_retrystill makes sense as the name here. It's called from thepokemethod whentry_number > 1, i.e., when we're in an Airflow retry scenario, and this method, as part of handling the Airflow retry, decides whether to actually fall back (producer terminated) or keep polling (producer still active).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functions should be named based on what they do, not how they are used. The current name (
_handle_retry) is related to where it is used, rather than to its name.Naming functions after their action creates reusable, predictable code. If a function is named after its specific use case, it becomes difficult to reuse that same logic elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit torn here.
_handle_retryactually does describe what the method does; it handles the retry scenario. It checks the producer state, decides whether to fall back or keep polling, and acts accordingly. That is the method's responsibility. It's not named after its call site; it's named after the action it performs.When the time comes for reusing (although this function is marked local to the module/class), I believe these are what the actions will be for handling retries for the case, no? Or do you see a different path there?
We already have
is_producer_task_terminatedfor the state check. The suggestedis_producer_still_runningshouldn't call an action based on its name, so the fallback logic would have to move elsewhere, either into another method or back intopoke. If it goes into poke, it's not reusable either.