[ENH] Workflow connect performance #3184
Conversation
This looks great! We'll need to check the failing test...
Codecov Report
@@ Coverage Diff @@
## master #3184 +/- ##
==========================================
+ Coverage 64.88% 65.27% +0.39%
==========================================
Files 299 301 +2
Lines 39506 40092 +586
Branches 5219 5329 +110
==========================================
+ Hits 25632 26170 +538
- Misses 12824 12843 +19
- Partials 1050 1079 +29
@HippocampusGirl Do you have a minute to review my suggested comment?
The comment is great, thank you for the suggestion! I added it. While looking at the comment, I found a case I hadn't thought of, where an input could be connected twice without the check catching it. With multiply nested workflows, the previous code did not check whether an input was already connected in another workflow further down the hierarchy.
LGTM. Thanks!
I'm going to bookmark this PR as the paradigm of how things should work out in open source/science. Outstanding work @HippocampusGirl, thanks very much!
thank you @HippocampusGirl
Summary
Hi :-) I'm using nipype with fmriprep to process hundreds of subjects. While doing that, I noticed that creating the nipype workflows can be quite slow, most noticeably when there are nested workflows. To find out why, I profiled workflow creation with Python's cProfile.
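For anyone who wants to reproduce this kind of measurement, a profile can be collected roughly as follows. This is only a sketch: `build_workflow` is a hypothetical placeholder for whatever code constructs the workflow, not a nipype or fmriprep function.

```python
import cProfile
import io
import pstats


def build_workflow():
    """Hypothetical stand-in for the code that builds the workflow."""
    return sum(i * i for i in range(10_000))


profiler = cProfile.Profile()
profiler.enable()
result = build_workflow()
profiler.disable()

# Sort by cumulative time so that expensive call trees appear first.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)  # limit the report to the ten most expensive entries
report = stream.getvalue()
print(report)
```

Functions that dominate the cumulative-time column are the ones worth inspecting first.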
It turns out that a prominent performance bottleneck is the `workflow.connect` function, or rather the `_has_attr` function that it calls. This function checks whether a nested workflow has a specific input or output. If the input/output is not found, an error message is printed.
To check for inputs/outputs, `_has_attr` uses the workflow's `inputs` or `outputs` properties, which return a dictionary of the inputs/outputs of all nested nodes. `_has_attr` then checks whether the target input or output is listed in the dictionary and returns the result.
However, accessing these properties calls a getter function that traverses all of the workflow's child nodes to generate the dictionary. As this entire procedure is repeated for each new connection, it can be quite slow.
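To make the cost concrete, here is a toy model of the two strategies. It does not use nipype: the `Node` and `Workflow` classes, `all_inputs`, `has_input_slow`, `has_input_fast`, and the visit counter are all hypothetical names made up for illustration, and the "fast" variant is only a rough sketch of a targeted lookup, not the code in this PR.

```python
class Node:
    """Toy leaf node with a set of input field names (not a nipype Node)."""

    def __init__(self, name, inputs):
        self.name = name
        self.inputs = set(inputs)


class Workflow:
    """Toy workflow holding nodes and nested workflows (not a nipype Workflow)."""

    visits = 0  # number of children visited, used only to compare the strategies

    def __init__(self, name, children):
        self.name = name
        self.children = {child.name: child for child in children}

    def all_inputs(self):
        """Build a dictionary of every input of every nested node."""
        found = {}
        for child in self.children.values():
            Workflow.visits += 1
            if isinstance(child, Workflow):
                for key in child.all_inputs():
                    found[f"{child.name}.{key}"] = True
            else:
                for field in child.inputs:
                    found[f"{child.name}.{field}"] = True
        return found

    def has_input_slow(self, dotted):
        """Regenerate the full dictionary, then test membership."""
        return dotted in self.all_inputs()

    def has_input_fast(self, dotted):
        """Follow the dotted name one level at a time and stop early."""
        *path, field = dotted.split(".")
        current = self
        for part in path:
            if not isinstance(current, Workflow) or part not in current.children:
                return False
            Workflow.visits += 1
            current = current.children[part]
        return isinstance(current, Node) and field in current.inputs


inner = Workflow("inner", [Node(f"n{i}", ["in_file"]) for i in range(50)])
outer = Workflow("outer", [inner, Node("sink", ["in_file"])])

Workflow.visits = 0
slow_result = outer.has_input_slow("inner.n7.in_file")
slow_visits = Workflow.visits

Workflow.visits = 0
fast_result = outer.has_input_fast("inner.n7.in_file")
fast_visits = Workflow.visits

print(slow_result, slow_visits)
print(fast_result, fast_visits)
```

In this toy setup the dictionary-based check touches every child on every call, while the targeted lookup only touches the nodes along the dotted path. Since the check runs once per connection, the gap grows with both the number of nodes and the number of connections.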
List of changes proposed in this PR (pull-request)
Instead of generating the full dictionary, I propose that the `_has_attr` function should traverse the individual workflow graphs until it finds the target node (or not). I have created a quick implementation that leads to a ~6x speedup in creating my nipype/fmriprep workflows. The code passes integration tests. Maybe this code will be useful for the larger nipype community.
Acknowledgment