Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cylc set bug fix for new flows #6186

Merged
merged 6 commits into from
Jul 10, 2024
Merged

Conversation

hjoliver
Copy link
Member

@hjoliver hjoliver commented Jul 1, 2024

Damn it, found a bug in 8.3.0 for cylc set --flow=new on future tasks: the task_outputs table won't be updated because of "WHERE flow is n" in the DB statement.

From: https://github.com/hjoliver/ecox-interventions?tab=readme-ov-file#scenario-2-rerun-some-tasks-from-upstream-of-a-failed-task

Example:

[task parameters]
    m = 1..3
[scheduling]
    [[graph]]
        R1 = "a => b => c<m> => d"
[runtime]
    [[root]]
        pre-script = sleep 5
    [[a, b, d]]
    [[c<m>]]
    [[c<m=2>]]
        script = """
            if (( CYLC_TASK_SUBMIT_NUMBER == 1 )); then
                false
            fi
        """

The scenario:

  • 1/c_m2 fails, stalling the workflow due to a corrupted output from 1/b
  • want to rerun 1/b and 1/c_m2, but not 1/c_m1, m3, then continue to 1/d and finish

Method:

  • run a new flow (2) from 1/b after setting 1/c_m1, m3 to succeeded in the new flow
  1. cylc set test //1/c_m1 //1/c_m2 --flow=2
  2. cylc trigger test//1/b --flow=2

On master this reruns 1/c_m1 and 1/c_m3 because the set outputs don't end up in the DB.

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • CHANGES.md entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@hjoliver hjoliver added the bug Something is wrong :( label Jul 1, 2024
@hjoliver hjoliver added this to the 8.3.1 milestone Jul 1, 2024
@hjoliver hjoliver self-assigned this Jul 1, 2024
@hjoliver hjoliver force-pushed the flow-set-bug-fix branch from 4f4b551 to 121f92c Compare July 2, 2024 00:02
@wxtim
Copy link
Member

wxtim commented Jul 2, 2024

Needs a rebase to pull in flake8 fixes.

@wxtim
Copy link
Member

wxtim commented Jul 2, 2024

I have created a crude integration/fnctional test at hjoliver#55

I think I can do better/faster, but was blocked by a simulation mode bug.

@oliver-sanders oliver-sanders modified the milestones: 8.3.1, 8.3.2 Jul 3, 2024
@hjoliver hjoliver force-pushed the flow-set-bug-fix branch from 121f92c to 2e30d3f Compare July 5, 2024 07:53
@hjoliver
Copy link
Member Author

hjoliver commented Jul 5, 2024

@wxtim - rebased and added an integration test.

My integration test is presumably similar to what you are working on? Trouble is, it passes on master too! I've run out of time to investigate further, for now.

@hjoliver hjoliver force-pushed the flow-set-bug-fix branch from 2e30d3f to 1333320 Compare July 5, 2024 07:58
@wxtim
Copy link
Member

wxtim commented Jul 5, 2024

@wxtim - rebased and added an integration test.

My integration test is presumably similar to what you are working on? Trouble is, it passes on master too! I've run out of time to investigate further, for now.

Mine is a functional test using the integration test framework. Since you've written most of the test I'll leave the ball in your court I think. I have other 🐟 to 🥘 , not least a sim mode bug which will scupper skip mode too.

@hjoliver
Copy link
Member Author

hjoliver commented Jul 8, 2024

(Update: a similar scenario works fine on master if the set and trigger happens for tasks that have never run before in any flow).

@hjoliver hjoliver force-pushed the flow-set-bug-fix branch 2 times, most recently from 2f8c30b to ef72370 Compare July 8, 2024 06:43
@hjoliver hjoliver marked this pull request as ready for review July 8, 2024 06:44
@hjoliver
Copy link
Member Author

hjoliver commented Jul 8, 2024

Test tweaked and validated. It passes on master as an integration test for "the wrong reasons", which are not valid on this branch (it's to do with submit numbers being zero in integration tests - I can explain in detail if necessary). The same scenario does not pass on master as a functional test.

@hjoliver
Copy link
Member Author

hjoliver commented Jul 8, 2024

Damn it, I seem to have broken two integration tests though. So not quite done. DONE

Comment on lines 1713 to 1714
# TODO: Detecting removal after completion of some outputs probably
# TODO: requires recording removal in the DB (set :expired maybe?).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beyond the scope of this PR. I'll post a note on Element, then raise an issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you stick up the issue and add a link to it in this comment so we can find our way back to the context when needed.

Note, the cylc remove work (will hopefully arrive in 8.4.0) may change this as removed tasks will be bumped into the None flow.

Copy link
Member Author

@hjoliver hjoliver Jul 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've clarified the comments for now. This needs a bit of discussion before posting an issue - see Element chat.

Note this bit of code is to avoid respawning a previously-removed task (see #6066) - which is at odds with the proposed future of cylc remove. (If erasing flow history, you want to be able to respawn the removed tasks, not avoid it).

Comment on lines 1713 to 1714
# TODO: Detecting removal after completion of some outputs probably
# TODO: requires recording removal in the DB (set :expired maybe?).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you stick up the issue and add a link to it in this comment so we can find our way back to the context when needed.

Note, the cylc remove work (will hopefully arrive in 8.4.0) may change this as removed tasks will be bumped into the None flow.

@wxtim wxtim self-requested a review July 8, 2024 15:02
@hjoliver
Copy link
Member Author

hjoliver commented Jul 9, 2024

(All discussions resolved except for the TODO one)

'scheduling': {
'cycling mode': 'integer',
'graph': {
'R1': 'a => b => c1 & c2 => z',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO test becomes clearer if

Suggested change
'R1': 'a => b => c1 & c2 => z',
'R1': 'a => good & failonce => z',

And is the task b necessary?

Copy link
Member Author

@hjoliver hjoliver Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO test becomes clearer if

Well that makes it look as if success and failure are important for the test, which they're not. What matters is that only one of them gets respawned in the new flow, if the other one gets manually set to succeeded (in the new flow).

[UPDATE: not relevant to the updated test anyway]

And is the task b necessary?

Actually neither b nor z are needed. They were just to make the original functional case easier to manage. I'll remove them.

Copy link
Member

@wxtim wxtim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manually tested, and appears that the fix is working.

I'm not convinced that the test is demonstrating the bug on upstream/8.3.x or master.

Is there a reason why this bug is on 8.3.2 milestone but open against master?

@oliver-sanders
Copy link
Member

oliver-sanders commented Jul 9, 2024

Code changes make sense to me.

I'm not convinced that the test is demonstrating the bug on upstream/8.3.x or master.

I can confirm that test_set_future_flow passes on upstream/8.3.x so doesn't appear to be capturing this bug.

@@ -73,7 +73,7 @@ def test_basic(checker):
['output', '10000101T0000Z', 'succeeded'],
['output', '10010101T0000Z', 'succeeded'],
['good', '10000101T0000Z', 'waiting', '(flows=2)'],
]
['good', '10010101T0000Z', 'waiting', '(flows=2)'], ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, this task appears in the DB because the workflow has run on from the triggered task. This is an unintended side-effect of the await sleep(1) in the test fixture.

I've opened a PR to replace the sleep with a DB update call #6212, however, if we want to keep this behaviour to allow the workflow to run to completion again (to test this PR), then replace the sleep with await mod_complete(schd) (and close #6212).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I've not had time to think about this today)

Copy link
Member Author

@hjoliver hjoliver Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Reviewers - this is a test framework issue, so not a reason to delay this PR)

@MetRonnie MetRonnie self-requested a review July 9, 2024 15:48
@MetRonnie
Copy link
Member

This should be rebased onto 8.3.x (even if it doesn't make it into the 8.3.2 release)

@hjoliver
Copy link
Member Author

hjoliver commented Jul 9, 2024

Is there a reason why this bug is on 8.3.2 milestone but open against master?

This should be rebased onto 8.3.x (even if it doesn't make it into the 8.3.2 release)

I think I probably opened this PR right after 8.3.0 was released, before the 8.3.x branch was made, and forgot to come back and rebase it. But maybe it was just a mistake. Anyhow, rebased now.

@hjoliver hjoliver force-pushed the flow-set-bug-fix branch from 09028c8 to 9eafa2e Compare July 9, 2024 23:41
@hjoliver
Copy link
Member Author

hjoliver commented Jul 9, 2024

I'm not convinced that the test is demonstrating the bug on upstream/8.3.x or master.

I can confirm that test_set_future_flow passes on upstream/8.3.x so doesn't appear to be capturing this bug.

I commented on this above

The integration test essentially replicates the functional test that fails on master, but as an integration test it passes on master "for the wrong reasons" - to do with submit numbers being zero in integration tests.

UPDATE: I've simplified the test and it does now fail on master. Note the last commit clarifies and documents spawning logic a bit (it was too hard to follow!).

@hjoliver hjoliver changed the base branch from master to 8.3.x July 10, 2024 05:05
# set b:succeeded in flow 2 and check downstream spawning
schd.pool.set_prereqs_and_outputs(['1/b'], prereqs=[], outputs=[], flow=[2])
assert schd.pool.get_task(IntegerPoint("1"), "c1") is None, '1/c1 (flow 2) should not be spawned after 1/b:succeeded'
assert schd.pool.get_task(IntegerPoint("1"), "c2") is not None, '1/c2 (flow 2) should be spawned after 1/b:succeeded'
Copy link
Member

@MetRonnie MetRonnie Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a sign we are ready to throw off the chains of oppression and embrace line lengths > 79 chars? 😁

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a sign that writing it like this doesn't work (always True):

assert (
   var is None,
   'string'
)

However, I suppose I should have done this:

assert  var is None, (
   'string'
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a sign that we don't run the linter on the test code!

Limiting line length is one thing that linters can all agree on.

linter/formatter/guide length
pep8 79
google 80
flake8 80
black 88
styleguide (ruby) 80
rubocop (ruby) 80
prettier (js/ts) 80
rustfmt (rust) 80

Line length is arbitrary and each each tool is free to choose its own limit, yet most pick 80. This is not a coincidence:

  • Readability (reading vertically is faster than reading horizontally, code structure is clearer when better spaced)
  • Reduced diff surface (diff highlights changes more clearly, reduced merge conflicts)
  • Accessibility (works better with larger font sizes)
  • Works better with tooling (e.g. side-by-side diffs)

Why Bother with 80 characters in a World of Modern Widescreen Displays?

A lot of people these days feel that a maximum line length of 80 characters is just a remnant of the past and makes little sense today. After all - modern displays can easily fit 200+ characters on a single line. Still, there are some important benefits to be gained from sticking to shorter lines of code.

First, and foremost - numerous studies have shown that humans read much faster vertically and very long lines of text impede the reading process. As noted earlier, one of the guiding principles of this style guide is to optimize the code we write for human consumption.

Additionally, limiting the required editor window width makes it possible to have several files open side-by-side, and works well when using code review tools that present the two versions in adjacent columns.

The default wrapping in most tools disrupts the visual structure of the code, making it more difficult to understand. The limits are chosen to avoid wrapping in editors with the window width set to 80, even if the tool places a marker glyph in the final column when wrapping lines. Some web based tools may not offer dynamic line wrapping at all.

-- https://rubystyle.guide/#max-line-length

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree lines should be kept short if possible, but sometimes 79 is too short for the content and results in less readable code. I would support moving to black's 88 limit

@MetRonnie
Copy link
Member

Merging with missing changelog entry added!

@MetRonnie MetRonnie merged commit 954f65e into cylc:8.3.x Jul 10, 2024
@hjoliver hjoliver deleted the flow-set-bug-fix branch July 18, 2024 02:17
@wxtim wxtim mentioned this pull request Jul 22, 2024
Comment on lines +1691 to +1703
if (
prev_status is not None
and not itask.state.outputs.get_completed_outputs()
):
# If itask has any history in this flow but no completed outputs
# we can infer it was deliberately removed, so don't respawn it.
# TODO (follow-up work):
# - this logic fails if task removed after some outputs completed
# - this is does not conform to future "cylc remove" flow-erasure
# behaviour which would result in respawning of the removed task
# See github.com/cylc/cylc-flow/pull/6186/#discussion_r1669727292
LOG.debug(f"Not respawning {point}/{name} - task was removed")
return None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is to be done about this RE: cylc remove? Should this block just be removed? @hjoliver CC @oliver-sanders

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I need to remind myself and think again about this. The linked discussion refers to Element chat. I'll try to find that, and we can continue the conversation there until it's clear what the problem is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However there are other ways that TaskPool.remove() are called that are not as a result of cylc remove, so this might still be needed to prevent a removed task from respawning immediately after being removed in one of those ways?

Thus I am thinking of leaving this in, just updating the comment (it no longer applies to tasks removed by cylc remove).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I'm concerned about that too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(without diving into the code in depth)

The proposal does not change the behaviour of suicide-triggered tasks so that will need to be preserved.

This logic should not apply to the remove command any more though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is wrong :(
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants