
[ENH] Workflow connect performance #3184

Merged · 4 commits · Mar 13, 2020

Conversation

@HippocampusGirl (Contributor) commented Mar 9, 2020

Summary

Hi :-) I'm using nipype with fmriprep to process hundreds of subjects. When doing that, I noticed that creating the nipype workflows can be quite slow. This was mostly noticeable when there were nested workflows. To find out why, I ran a quick python cProfile.

[Screenshot: cProfile results, 2020-03-06]

It turns out that a prominent performance bottleneck is the workflow.connect function, or rather the _has_attr function that it calls. This function checks whether a nested workflow has a specific input or output; if the input/output is not found, an error message is printed.
To check for inputs/outputs, _has_attr uses the workflow's inputs or outputs properties, which return a dictionary of the inputs/outputs of all nested nodes. _has_attr then checks whether the target input or output is listed in that dictionary and returns.
However, accessing these properties calls a getter function that traverses all of the workflow's child nodes to generate the dictionary. Since this entire procedure is repeated for each new connection, it can be quite slow.
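To illustrate the pattern described above, here is a minimal, self-contained sketch (the class names and attributes are illustrative, not nipype's actual implementation): every _has_attr call first materializes the complete inputs dictionary by visiting every child, even though only a single dotted path is needed.

```python
class Node:
    """Stand-in for a nipype node with a set of named inputs."""
    def __init__(self, name, inputs):
        self.name = name
        self._inputs = inputs  # e.g. {"something": None}


class Workflow:
    """Stand-in for a nested workflow containing nodes or workflows."""
    def __init__(self, name, children):
        self.name = name
        self.children = children

    @property
    def inputs(self):
        # Traverses *every* child to build the complete dictionary,
        # even when the caller only needs one entry.
        result = {}
        for child in self.children:
            if isinstance(child, Workflow):
                result[child.name] = child.inputs
            else:
                result[child.name] = dict(child._inputs)
        return result

    def _has_attr(self, parameter):
        # Old behaviour: generate the full dict, then look up one key.
        attrs = self.inputs
        for key in parameter.split("."):
            if not isinstance(attrs, dict) or key not in attrs:
                return False
            attrs = attrs[key]
        return True
```

Because the inputs property is recomputed from scratch on every connect call, the cost grows with both the number of connections and the size of the nested hierarchy.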

List of changes proposed in this PR (pull-request)

Instead of generating the full dictionary, I propose that the _has_attr function traverse the individual workflow graphs until it finds the target node (or determines that it is absent). I have created a quick implementation that yields a ~6x speedup when creating my nipype/fmriprep workflows. The code passes the integration tests. Maybe this code will be useful to the larger nipype community.
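The proposed early-exit traversal can be sketched with plain nested dicts (purely illustrative; in nipype the lookup would follow the child workflows' graphs rather than a pre-built dictionary):

```python
def has_attr_fast(tree, parameter):
    """Follow only the dotted path; stop as soon as a component is missing.

    `tree` is a nested dict standing in for the workflow hierarchy,
    e.g. {"flow1": {"inputnode": {"something": None}}}. The names are
    made up for illustration.
    """
    node = tree
    for name in parameter.split("."):
        if not isinstance(node, dict) or name not in node:
            return False  # early exit: nothing else is enumerated
        node = node[name]
    return True
```

The key difference from the old behaviour is that only the nodes on the dotted path are visited, instead of every input of every nested node.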

[Screenshot: profiling results, 2020-03-07]

Acknowledgment

  • (Mandatory) I acknowledge that this contribution will be available under the Apache 2 license.

@effigies (Member) commented Mar 9, 2020

This looks great! We'll need to check the failing test...

codecov bot commented Mar 9, 2020

Codecov Report

Merging #3184 into master will increase coverage by 0.39%.
The diff coverage is 61.90%.


@@            Coverage Diff             @@
##           master    #3184      +/-   ##
==========================================
+ Coverage   64.88%   65.27%   +0.39%     
==========================================
  Files         299      301       +2     
  Lines       39506    40092     +586     
  Branches     5219     5329     +110     
==========================================
+ Hits        25632    26170     +538     
- Misses      12824    12843      +19     
- Partials     1050     1079      +29     
Flag        Coverage Δ
#unittests  65.27% <61.90%> (+0.39%) ⬆️

Impacted Files                         Coverage Δ
nipype/pipeline/engine/workflows.py    69.12% <61.90%> (+0.30%) ⬆️
nipype/interfaces/ants/__init__.py     100.00% <0.00%> (ø)
nipype/interfaces/mixins/__init__.py   100.00% <0.00%> (ø)
nipype/utils/imagemanip.py             100.00% <0.00%> (ø)
nipype/interfaces/mixins/fixheader.py  100.00% <0.00%> (ø)
nipype/pipeline/engine/utils.py        71.72% <0.00%> (+0.57%) ⬆️
nipype/interfaces/ants/utils.py        89.16% <0.00%> (+0.87%) ⬆️
nipype/pipeline/plugins/tools.py       80.26% <0.00%> (+1.31%) ⬆️
nipype/utils/config.py                 62.22% <0.00%> (+3.88%) ⬆️
... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b75a7ce...0372982.

@effigies (Member)

@HippocampusGirl Do you have a minute to review my suggested comment?

@HippocampusGirl (Contributor, Author) commented Mar 11, 2020

> @HippocampusGirl Do you have a minute to review my suggested comment?

The comment is great, thank you for the suggestion! I added it.
Sorry about the slow reply, I'm applying for a PhD fellowship, and the deadline is this week, so I'm a bit swamped.

While I was looking at the comment, I found a case where you can connect an input twice that wasn't covered by the check and that I hadn't thought of. If you have multiply nested workflows, the previous code did not check whether an input was already connected in another workflow further down the hierarchy.
For example, if you have three nested workflows, flow1 (inner) <- flow2 (middle) <- flow3 (outer), then it was possible to first run flow2.connect(node1, "attr", flow1, "inputnode.something") and then run flow3.connect(node2, "attr", flow2, "flow1.inputnode.something") without raising an error. However, this should raise an error, since both calls ultimately refer to the same node and the same input.
This happens because running connect in a parent workflow doesn't propagate the connection information to the graph of the child workflow. The connection stays in the parent workflow's graph until the workflow is flattened for running. If we want to detect this case, it is necessary to check the inputs to the workflows at every level of the hierarchy.
As such, I pulled the check out into an inner function and call it at every level of the hierarchy. I think this change does not have a big impact, but it may be useful to detect these cases, as processing pipelines are constantly getting more complex and deeply nested. I also don't know whether the original author of the _has_attr and _get_inputs functions thought of this possibility and tests for it somewhere else.
I also wrote a test case in test_workflows.py called test_nested_workflow_doubleconnect.
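A simplified, self-contained sketch of the double-connect situation (the class and the connect signature are invented for illustration; nipype's real connect takes source and destination nodes and field names): because each workflow records connections only in its own graph, the check has to descend through every nested workflow along the dotted destination path.

```python
class Flow:
    """Stand-in for a workflow that records connections at its own level."""
    def __init__(self, name):
        self.name = name
        self.children = {}   # child name -> Flow
        self.connected = set()  # dotted destinations recorded here

    def add(self, child):
        self.children[child.name] = child
        return child

    def connect(self, dest):
        # Walk down the hierarchy: if any nested workflow along the
        # dotted path already recorded a connection to the remainder
        # of that path, the same input would be connected twice.
        flow, path = self, dest
        while True:
            if path in flow.connected:
                raise ValueError(
                    f"{dest!r} is already connected inside {flow.name!r}")
            head, _, rest = path.partition(".")
            if head in flow.children and rest:
                flow, path = flow.children[head], rest
            else:
                break
        # Like nipype, the connection is recorded in the calling
        # (parent) workflow's graph, not propagated to the child.
        self.connected.add(dest)
```

With flow1 nested in flow2 nested in flow3, a second connection to the same innermost input made from flow3 is now caught because the check consults flow2's recorded connections, not just flow3's.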

@effigies (Member)

LGTM. Thanks!

@effigies effigies merged commit 75b598a into nipy:master Mar 13, 2020
@HippocampusGirl HippocampusGirl deleted the workflow-connect-performance branch March 13, 2020 16:57
@oesteban (Contributor)

I'm going to bookmark this PR as the paradigm of how things should work out in open source/science. Outstanding work @HippocampusGirl, thanks very much!

@satra (Member) commented Mar 13, 2020

thank you @HippocampusGirl
