Skip to content

Avoid the Watcher consumers retrying to run a command that is being run by the producer#2557

Closed
pankajkoti wants to merge 3 commits into
mainfrom
watcher-sensor-retry
Closed

Avoid the Watcher consumers retrying to run a command that is being run by the producer#2557
pankajkoti wants to merge 3 commits into
mainfrom
watcher-sensor-retry

Conversation

@pankajkoti
Copy link
Copy Markdown
Contributor

When a sensor times out while the producer is still running a long dbt build, the retry previously launched a standalone dbt run for the same model — creating a duplicate concurrent execution.

Now the sensor checks the producer task state before deciding:

  • Producer terminated → fall back to local dbt run
  • Producer still active → continue polling XCom instead of launching a duplicate run

closes: https://github.com/astronomer/oss-integrations-private/issues/359

When a sensor times out while the producer is still running a long
dbt build, the retry previously launched a standalone dbt run for the
same model — creating a duplicate concurrent execution.

Now the sensor checks the producer task state before deciding:
- Producer terminated → fall back to local dbt run (handles manual
  UI task clear and retries after producer finished)
- Producer still active → continue polling XCom instead of launching
  a duplicate run

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Kubernetes watcher sensor retry test assumed fallback runs
unconditionally on retry. Update it to set the producer to a
terminal state, and add a test for the case where the producer
is still running.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@pankajkoti
Copy link
Copy Markdown
Contributor Author

pankajkoti commented Apr 15, 2026

Try number = 2 for the sensor when producer is still running.

Screenshot 2026-04-15 at 2 01 58 PM

@pankajkoti pankajkoti marked this pull request as ready for review April 15, 2026 09:32
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates Cosmos watcher consumer sensors to avoid launching duplicate dbt runs on task retries by checking the producer task’s Airflow state and continuing to poll XCom when the producer is still active.

Changes:

  • Add is_producer_task_terminated() helper + terminal state set for producer task state evaluation.
  • Refactor watcher consumer poke() retry handling to fall back only when the producer is terminated; otherwise keep polling.
  • Extend unit tests to cover both retry paths (producer terminated vs still running) for local, Kubernetes, and test watcher sensors.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
cosmos/operators/_watcher/base.py Adds retry-handling helper and routes retry behavior based on producer state to prevent duplicate runs.
cosmos/operators/_watcher/state.py Introduces producer terminal state set + helper for determining whether producer is finished.
tests/operators/test_watcher.py Updates/extends tests for retry fallback vs continued polling (including test watcher behavior).
tests/operators/test_watcher_kubernetes_unit.py Adds Kubernetes watcher unit test ensuring retry keeps polling when producer is still running.
tests/operators/_watcher/test_state.py Adds unit tests for the new is_producer_task_terminated() helper.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cosmos/operators/_watcher/base.py Outdated
Comment thread cosmos/operators/_watcher/state.py Outdated
Comment on lines +653 to +655
return self._evaluate_node_status(status, producer_task_state, try_number, context)

def _evaluate_node_status(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this to a new method, as this was getting complained about code complexity

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pankajkoti Would there be an alternative way to refactor?

The function name (_evaluate_node_status) does not seem accurate with what it is doing, which is actually to:

  • check the status of the node
  • check the status of the producer airflow task
  • maybe execute dbt itself to run, as a fallback

These steps feel quite critical and relevant in the poke method, and it is probably hard to find a suitable name in a single method. There may be other parts of the function that could be refactored and isolated that are not so core to the logic (maybe the retrieval of status, or the logging handling?)

Copy link
Copy Markdown
Contributor Author

@pankajkoti pankajkoti Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the name broadly reflects the method’s primary responsibility: it evaluates the node status and determines the appropriate poke outcome. By "evaluate/evaluation", I’m including checking dependencies like the producer task state and deciding on actions such as fallback, retry, raise, or succeed. The producer state check and fallback are part of handling cases where no direct status is available.

That said, I agree it’s doing a few important things, so I’m open to renaming if you have a more precise suggestion.

On extracting retrieval/logging: those are already separated into helper methods (_log_startup_events, _get_node_status, _cache_compiled_sql). Wrapping them further would likely add indirection without much benefit. The evaluation logic, on the other hand, forms a cohesive decision tree, which is why it felt reasonable to group it into a single method.

Happy to refactor further if you feel a different split would improve readability.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function combines decision-making, state retrieval, and side effects (running dbt). While it is small, it has a lot of responsibility. I'm worried about future readability and testability.

Maybe we could break it down into two smaller methods with clear responsibilities, something along the lines:

  • _evaluate_node_status: implements the decision tree logic
  • _handle_node_decision(decision, ...): executes side effects - run dbt or others

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the method again, it doesn’t actually do state retrieval. Both status and producer_task_state are passed in; the retrieval happens in poke. So this method is really decision-making and taking actions based on the decisions.

I feel splitting it would introduce indirection without reducing coupling; the action method would still need the same parameters (try_number, context) for the fallback path.

Right now, the full flow is visible in one place:

  • status is None -> check producer -> fallback or keep polling
  • skipped -> skip
  • success -> done

If we split this into decision + handler, a reader would need to jump between methods and map decisions to actions, which I think hurts readability more than it helps at this size.

I’m happy to rename the method if the current name doesn’t reflect its behaviour well, open to suggestions there. But I’d prefer not to split it unless the logic grows or we see reuse emerging.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.05%. Comparing base (f4fb470) to head (e8c29d7).
⚠️ Report is 19 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2557   +/-   ##
=======================================
  Coverage   98.04%   98.05%           
=======================================
  Files         103      103           
  Lines        7586     7598   +12     
=======================================
+ Hits         7438     7450   +12     
  Misses        148      148           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@tatiana tatiana changed the title Avoid duplicate dbt runs on watcher sensor retry Avoid Watcher consumer attempting to run a command that is being run by producer Apr 15, 2026
@tatiana tatiana changed the title Avoid Watcher consumer attempting to run a command that is being run by producer Avoid the Watcher consumers retrying to run a command that is being run by the producer Apr 15, 2026
if hasattr(self, "_override_rtif"):
self._override_rtif(context)

def _handle_retry(self, try_number: int, producer_task_state: str | None, context: Context) -> bool | None:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it would be a cleaner interface if we renamed things to is_producer_still_running.
Retry is a consequence depending on other variables - including if the producer is actively running the task

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe _handle_retry still makes sense as the name here. It's called from the poke method when try_number > 1, i.e., when we're in an Airflow retry scenario, and this method, as part of handling the Airflow retry, decides whether to actually fall back (producer terminated) or keep polling (producer still active).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functions should be named based on what they do, not how they are used. The current name (_handle_retry) is related to where it is used, rather than to its name.

Naming functions after their action creates reusable, predictable code. If a function is named after its specific use case, it becomes difficult to reuse that same logic elsewhere.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit torn here. _handle_retry actually does describe what the method does; it handles the retry scenario. It checks the producer state, decides whether to fall back or keep polling, and acts accordingly. That is the method's responsibility. It's not named after its call site; it's named after the action it performs.

When the time comes for reusing (although this function is marked local to the module/class), I believe these are what the actions will be for handling retries for the case, no? Or do you see a different path there?

We already have is_producer_task_terminated for the state check. The suggested is_producer_still_running shouldn't call an action based on its name, so the fallback logic would have to move elsewhere, either into another method or back into poke. If it goes into poke, it's not reusable either.

Comment thread cosmos/operators/_watcher/base.py Outdated
Comment thread cosmos/operators/_watcher/state.py Outdated
Copy link
Copy Markdown
Collaborator

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really happy we're improving this part, @pankajkoti ! I left some questions and advice on the implementation. While they will not change the overall logic - your fix seems accurate - I believe they will help with the code maintainability.

@pankajkoti
Copy link
Copy Markdown
Contributor Author

I'm really happy we're improving this part, @pankajkoti ! I left some questions and advice on the implementation. While they will not change the overall logic - your fix seems accurate - I believe they will help with the code maintainability.

Thanks for the review @tatiana. I left my responses on the feedback and addressed some other minor comments that I have marked as resolved.

@pankajkoti
Copy link
Copy Markdown
Contributor Author

We now have conflicts here after the merge of PR #2559. It felt simpler to create a new PR than to resolve the conflicts here. I have created a new PR at #2592, so closing this PR in favour of that one.

@pankajkoti pankajkoti closed this Apr 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants