
Workflows manager #324

Merged: 9 commits merged into cylc:master from the workflows-manager branch on Mar 18, 2022

Conversation

@oliver-sanders (Member) commented Feb 25, 2022

Closes #223, #310, #311, #312 and most of #308

  • Check the presence of the workflow database for workflows which have previously run.
    • (closes scan: check the uuid for stopped workflows #223).
    • When workflows stop we keep their data in the store.
    • In the UI you will still see the state of the task pool as it was when the workflow shut down.
    • If the user deletes the workflow database, the data store needs to be wiped, because the continuity of the workflow run has been lost, i.e. it is effectively a new workflow run.
  • Ensure that one scan finishes before the next begins.
    • (closes prevent parallel scans #312).
    • Scans are scheduled at an interval.
    • When changes in workflow state are detected, the workflow manager calls the relevant methods in the data store manager to adapt to the change.
    • Because these operations are awaited within the scan, they hold the scan up.
    • This means it is possible for two scans to overlap, even with a sensible interval between them; this is especially likely on startup when there are lots of workflows (see the sketch after this list).
  • Ensure workflow disconnect happens before re-connect.
    • (closes workflow state can be incorrect #311).
    • Previously the workflow manager was correctly telling the data store to connect to running workflows.
    • However, if the disconnect hadn't been performed, the new connect would just return rather than doing anything.
    • Ensure disconnect is called when necessary.
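
The sketch below (not the actual cylc-uiserver implementation; the class, method names and states are simplified assumptions) illustrates the two scheduling guarantees described above: the next scan only starts once the previous one, including any awaited data store updates, has finished, and a stale connection is dropped before re-connecting rather than letting the new connect silently return.

import asyncio

class WorkflowsManagerSketch:
    """Illustrative only: names and structure are simplified assumptions."""

    def __init__(self, interval=5):
        self.interval = interval
        self.connected = set()  # workflow IDs we are currently connected to

    async def run(self):
        # one scan at a time: the next scan is not started until the previous
        # one (including any awaited data store updates) has finished
        while True:
            await self.scan()
            await asyncio.sleep(self.interval)

    async def scan(self):
        for wid, state in await self.get_workflow_states():
            if state == 'running' and wid not in self.connected:
                await self.connect(wid)
            elif state != 'running' and wid in self.connected:
                await self.disconnect(wid)

    async def connect(self, wid):
        # drop any stale connection first, otherwise the new connect
        # would just return rather than doing anything
        if wid in self.connected:
            await self.disconnect(wid)
        self.connected.add(wid)

    async def disconnect(self, wid):
        self.connected.discard(wid)

    async def get_workflow_states(self):
        return []  # placeholder: a real scan would inspect the filesystem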

Starting and stopping a batch of workflows (previously, some would occasionally get stuck in the stopped state):

Taking a single workflow through a full cycle:

[screenshot: cycle]

Removing the workflow database now causes the workflow to be re-registered (data from the old run will disappear):

[screenshot: db-remove]

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Appropriate tests are included (unit and/or functional).
  • Appropriate change log entry included.
  • No documentation update required.
  • No dependency changes (one dependency change, but it affects tests only).

@oliver-sanders oliver-sanders added the 'bug' (Something isn't working) label Feb 25, 2022
@oliver-sanders oliver-sanders added this to the 1.0.1 milestone Feb 25, 2022
@oliver-sanders oliver-sanders self-assigned this Feb 25, 2022
@oliver-sanders (Member, Author):

@dwsutherland if you get the chance to take a look at this one that would be great.

@datamel datamel self-requested a review February 28, 2022 14:19
@oliver-sanders oliver-sanders linked an issue Feb 28, 2022 that may be closed by this pull request
@MetRonnie (Member) left a comment:

Partial review (6/9 files viewed)

[two resolved review threads on cylc/uiserver/workflows_mgr.py]
@datamel (Contributor) left a comment:

I have been running the UI Server with this branch since yesterday, deleting stale workflows, and have seen no traceback. I have started and stopped multiple workflows and workflow state appears to be working correctly.

I have also read the code, only one query.
Thanks @oliver-sanders

else:
    with suppress(IOError):
        client.stop(stop_loop=False)
    self.uiserver.data_store_mgr.disconnect_workflow(wid)
(Contributor):

Does this need awaiting?

(Member):

It's not an async method though?

(Contributor):

Sorry, what I meant was: should we mark disconnect_workflow as async and therefore await it here?
I'm not entirely sure why register_workflow is async but this one isn't, when neither looks IO-bound.
I haven't fully got my head around using asyncio alongside CPU-bound operations yet. Will have to use some training time on this.
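
A minimal sketch of the distinction being discussed (a hypothetical class, not the real cylc-uiserver code; the method bodies are invented for illustration): a plain synchronous method is simply called, only an async def coroutine needs awaiting, and declaring a method async only buys anything if it actually awaits IO at some point.

import asyncio

class DataStoreMgrSketch:

    def disconnect_workflow(self, wid):
        # synchronous: pure in-memory bookkeeping, nothing to await
        print(f'disconnected {wid}')

    async def register_workflow(self, wid):
        # async: could await IO, e.g. reading the workflow database
        await asyncio.sleep(0)
        print(f'registered {wid}')

async def main():
    mgr = DataStoreMgrSketch()
    mgr.disconnect_workflow('~user/one')      # plain call, no await needed
    await mgr.register_workflow('~user/one')  # coroutine, must be awaited

asyncio.run(main())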

(Member):

Ah, not sure why register_workflow is defined async

@oliver-sanders (Member, Author):

Keep this one open for a mo, I'll slide in an extra commit to fix the issue Ronnie recently reported.

@oliver-sanders (Member, Author):

The last commit combines the stop and disconnect methods. The stop method contained the logic for updating the contact data, so in the event that a workflow went from running to stopped unexpectedly the stop method was not called, leaving behind broken contact data and an incorrect status string.

[screenshot]
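
A minimal sketch of the shape of that change (hypothetical names and fields, not the actual cylc-uiserver code): fold the contact-data cleanup into the single disconnect path so it runs however the workflow stopped.

class DataStoreMgrSketch:

    def __init__(self):
        self.contact_data = {}  # wid -> contact info
        self.status_msg = {}    # wid -> human-readable status string

    def disconnect_workflow(self, wid):
        # one code path for any running -> stopped transition, expected or
        # not, so contact data and the status string can never be left stale
        self.contact_data.pop(wid, None)
        self.status_msg[wid] = 'Stopped'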

@MetRonnie (Member):

Hmm, I was still able to reproduce the mismatch between status and statusMsg (stopped vs Running, discussed on Teams chat) for a failed integration test workflow. Does it need a cylc-ui fix?

@oliver-sanders (Member, Author):

I don't think it's related to the UI. If you run the UIS in debug mode (cylc gui --debug) the output should be enough to work out what's going wrong; I will take another look at the logic.

@oliver-sanders (Member, Author):

(An easy way to tell whether it is UI related: refresh the browser and see if the issue persists.)

@oliver-sanders (Member, Author):

Couldn't replicate with 10 runs of the integration tests.

@datamel (Contributor) commented Mar 3, 2022

Couldn't replicate with 10 runs of the integration tests.

I'm having a go at replicating now. Will report back in a bit.

Update: I have reproduced a running status message for a stopped workflow, integration.test_scan_api/test_scan_cleans_stuck_contact_files/-crashed-. Note that a refresh of the UI still reported running.

@oliver-sanders (Member, Author):

ok, what does the debug output say?

@datamel (Contributor) commented Mar 3, 2022

ok, what does the debug output say?

Debug output looked normal as far as I could see. It is a very transient problem, as I have only seen it once. I have closely inspected the status_msg code and can't spot anything obvious. The only thing I can think of is NFS maybe causing problems?
@MetRonnie have you been able to reliably reproduce?

@oliver-sanders (Member, Author):

Can you send me the debug output? It contains a record of every method called on the data store, which should reveal the issue.

@datamel (Contributor) commented Mar 3, 2022

Can you send me the debug output? It contains a record of every method called on the data store, which should reveal the issue.

[D 2022-03-03 14:09:18.510 CylcUIServer] register_workflow(~/cit-20220303T234411+0935/integration.test_examples/test_logging/834bfec6-9afb-11ec-93eb-0050569a149e)
[D 2022-03-03 14:09:18.518 CylcUIServer] Could not connect to ~/cit-20220303T234411+0935/integration.test_examples/test_logging/834bfec6-9afb-11ec-93eb-0050569a149e: cit-20220303T234411+0935/integration.test_examples/test_logging/834bfec6-9afb-11ec-93eb-0050569a149e is not running

@oliver-sanders (Member, Author):

Brill, that focuses me on connection failure. There should be only one side-effect of connection failure, which is the workflow getting added to self.workflows; I've changed that so it only gets added if the connection succeeded.
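
Roughly the shape of that change (hypothetical method and attribute names, not the actual cylc-uiserver code): the workflow is only recorded once the connection has actually succeeded.

import logging

class WorkflowsRegistrySketch:
    """Hypothetical sketch, not the actual cylc-uiserver code."""

    def __init__(self):
        self.workflows = {}
        self.log = logging.getLogger(__name__)

    async def register_if_connected(self, wid, client):
        try:
            await client.connect()  # may raise if the workflow is not running
        except OSError as exc:
            self.log.debug('Could not connect to %s: %s', wid, exc)
            return  # connection failed: leave self.workflows untouched
        # only reached on a successful connection
        self.workflows[wid] = client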

@codecov-commenter commented Mar 3, 2022

Codecov Report

Merging #324 (92aec87) into master (8fe43ab) will decrease coverage by 2.33%.
The diff coverage is 69.36%.

@@            Coverage Diff             @@
##           master     #324      +/-   ##
==========================================
- Coverage   80.29%   77.96%   -2.34%     
==========================================
  Files          10       10              
  Lines        1005     1053      +48     
  Branches      197      205       +8     
==========================================
+ Hits          807      821      +14     
- Misses        164      196      +32     
- Partials       34       36       +2     
Impacted Files Coverage Δ
cylc/uiserver/app.py 85.34% <50.00%> (-0.75%) ⬇️
cylc/uiserver/data_store_mgr.py 61.42% <67.70%> (-6.27%) ⬇️
cylc/uiserver/workflows_mgr.py 82.65% <72.00%> (-6.71%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8fe43ab...92aec87.

@datamel (Contributor) commented Mar 7, 2022

The first traceback looks like a possible bug in cylc-flow due to the malformed path:

I had replaced my usual $HOME expansion in the traceback with a ~ (for security), so I'm not sure it is a malformed path.
Unless you think the -crashed- could be causing problems?

Sorry for not making that clear, cylc/cylc-flow#4734 can probably just be closed.

@oliver-sanders (Member, Author):

Ah, I thought we had failed to expand ~ resulting in /~ causing the problem.

In that case it's still a cylc-flow issue but presumably caused by the script trying to remove the contact file only to find that either the contact file or .service dir is no longer there?

It is perfectly possible that the contact file could have been removed since the client was created (even though that is a short time window, another client started at the same time could have beaten this one to removing the file, or a clean could have been in progress at the time), so we probably just need to quietly suppress this traceback.
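
A minimal sketch of that kind of fix (hypothetical path and function, not the actual change in cylc/cylc-flow#4735): treat a contact file that has already disappeared as a non-event rather than raising a traceback.

from contextlib import suppress
from pathlib import Path

def remove_contact_file(service_dir: Path) -> None:
    # Another client, or a clean in progress, may already have removed the
    # contact file (or the whole .service directory) since this client was
    # created, so a missing file is not worth a traceback.
    with suppress(FileNotFoundError):
        (service_dir / 'contact').unlink()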

@oliver-sanders (Member, Author):

PR up cylc/cylc-flow#4735

@oliver-sanders (Member, Author):

OK, the last commit should do it. It is really hard to test properly without running a workflow, so I came up with a kind of half-test which at least ensures that disconnect_workflow is called in the event that connect_workflow fails.
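
Roughly what such a test can look like (a self-contained sketch with an invented stand-in class, not the test actually added in this PR): make the connection step fail and assert that the disconnect path still runs.

import asyncio
from unittest import mock

class MgrSketch:
    """Stand-in for the real manager; hypothetical structure."""

    async def _subscribe(self, wid):
        raise OSError('not running')  # simulate connection failure

    def disconnect_workflow(self, wid):
        pass

    async def connect_workflow(self, wid):
        try:
            await self._subscribe(wid)
        except OSError:
            # on failure, fall back to the disconnected state
            self.disconnect_workflow(wid)

def test_disconnect_called_when_connect_fails():
    mgr = MgrSketch()
    mgr.disconnect_workflow = mock.Mock()
    asyncio.run(mgr.connect_workflow('~user/one'))
    mgr.disconnect_workflow.assert_called_once_with('~user/one')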

@datamel (Contributor) left a comment:

Looks good. I have read the code, checked out the branch and tested it; it runs smoothly.
Minor typo, not a blocker though. Thanks @oliver-sanders


Note there are some new warnings on this branch when running pytest:

Task was destroyed but it is pending!
task: <Task pending name='Task-29' coro=<WorkflowsManager.run() running at ~/cylc-uiserver/cylc/uiserver/workflows_mgr.py:412> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f43db947bb0>()]> cb=[IOLoop.add_future.<locals>.<lambda>() at ~/miniconda3/envs/cylc8/lib/python3.9/site-packages/tornado/ioloop.py:688]>
Task was destroyed but it is pending!
task: <Task pending name='Task-64' coro=<WorkflowsManager.run() running at ~/cylc-uiserver/cylc/uiserver/workflows_mgr.py:412> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f43db98d310>()]> cb=[IOLoop.add_future.<locals>.<lambda>() at ~/miniconda3/envs/cylc8/lib/python3.9/site-packages/tornado/ioloop.py:688]>

[two resolved review threads on cylc/uiserver/data_store_mgr.py]
@oliver-sanders (Member, Author):

Will take a look at those warnings...

@oliver-sanders (Member, Author):

The scan task was being sent the stop signal; however, the UIS was shutting down before the stop had completed. Strangely, even once the scan task has completed, Tornado needs an asyncio.sleep before it sees the task as complete. The next commit should address this.
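
A minimal sketch of that shutdown pattern (hypothetical names, not the actual UIS code): send the stop signal, then keep yielding to the event loop via asyncio.sleep until the task is observed as done before shutdown proceeds.

import asyncio

class ScanShutdownSketch:
    """Hypothetical shutdown handling; not the actual UIS code."""

    def __init__(self):
        self.stopping = False
        self.task = None

    async def _scan_loop(self):
        while not self.stopping:
            await asyncio.sleep(0.5)  # stand-in for one scan plus the interval

    def start(self):
        self.task = asyncio.create_task(self._scan_loop())

    async def stop(self):
        self.stopping = True  # send the "stop signal"
        while not self.task.done():
            # yield to the event loop (an asyncio.sleep) so the task can
            # actually finish and be seen as done before shutdown continues
            await asyncio.sleep(0.1)

async def main():
    runner = ScanShutdownSketch()
    runner.start()
    await asyncio.sleep(1)
    await runner.stop()

asyncio.run(main())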

@datamel (Contributor) left a comment:

Very strange. This latest commit has not solved the problem for me: I have run the tests as a set and am still getting the warnings. I have also run cylc/uiserver/tests/test_auth.py and cylc/uiserver/tests/test_graphql.py individually; these seem to be the source of the error. However, on master they run without warnings. Can you reproduce? If not, perhaps it is a race condition.

@oliver-sanders (Member, Author) commented Mar 10, 2022

I get lots of other warnings but not the specific task: <Task pending name='Task-29' coro=<WorkflowsManager.run() one.

(python 3.7)

@datamel (Contributor) commented Mar 10, 2022

I get lots of other warnings but not the specific task: <Task pending name='Task-29' coro=<WorkflowsManager.run() one.

(python 3.7)

Ahha I expect my python version is too high - Python 3.9.10

@oliver-sanders (Member, Author):

I'll see if I can replicate with 3.9

@oliver-sanders (Member, Author):

Managed to replicate; it's to do with the jupyter_server version. It looks like more recent versions of jupyter_server are not calling the stop_extension endpoint (hopefully only in tests?).

[resolved review threads on cylc/uiserver/workflows_mgr.py and cylc/uiserver/tests/test_data_store_mgr.py]

@MetRonnie (Member) left a comment:

Seems to shrug off my attempts to create horrible workflow/contact file states

@oliver-sanders (Member, Author):

Good, I think that's everything addressed?

@datamel (Contributor) left a comment:

Just a quick query about the commenting of the publish port.

[two resolved review threads on cylc/uiserver/data_store_mgr.py]
@datamel datamel merged commit 27ad0dc into cylc:master Mar 18, 2022
@oliver-sanders oliver-sanders deleted the workflows-manager branch March 18, 2022 15:56
@oliver-sanders oliver-sanders mentioned this pull request Mar 21, 2022
Labels: bug (Something isn't working)
Projects: none yet
4 participants