FIX: load_resultfile crashes if open resultsfile from crashed job #3182

daniel-ge · 2020-03-04T14:41:41Z

Problem

If a job crashes on a compute node, a result file is written containing a dictionary (with the traceback, etc.). The 'controller' job then reads this dictionary in order to produce the pickled crash file. While reading the result file, load_resultfile raises AttributeError: 'dict' object has no attribute 'outputs'.

Details

In #2985 loadpkl was replaced by load_resultfile. However, result_data can be a dictionary, as the following source snippet assumes:

nipype/nipype/pipeline/plugins/base.py

Lines 531 to 536 in b75a7ce

    
    else:
        results_file = glob(os.path.join(node_dir, "result_*.pklz"))[0]
        result_data = load_resultfile(results_file)
    result_out = dict(result=None, traceback=None)
    if isinstance(result_data, dict):
        result_out["result"] = result_data["result"]

But result_data can never be a dictionary at that point, because load_resultfile requires the loaded object to have an outputs attribute:

nipype/nipype/pipeline/engine/utils.py

Lines 293 to 294 in b75a7ce

    
    result = loadpkl(results_file)
    if resolve and result.outputs:

Solution

Just check if result has an attribute outputs before accessing it.
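
As a rough illustration of the proposed check (not the exact patch), the sketch below guards the attribute access with getattr and a default. The helper name has_resolvable_outputs and the two stand-in objects are assumptions made for this example; the real code operates on whatever loadpkl returns.

```python
from types import SimpleNamespace


def has_resolvable_outputs(result):
    """Return True only if `result` exposes a non-empty `outputs` attribute.

    Hypothetical helper for illustration: a dict written by a crashed job has
    no `outputs` attribute, so getattr() with a default avoids AttributeError.
    """
    return bool(getattr(result, "outputs", None))


# Stand-in for a normal result object (assumption: only `outputs` matters here).
ok_result = SimpleNamespace(outputs={"out": 1})
# Stand-in for the dictionary that a crashed job writes to its result file.
crash_result = {"result": None, "traceback": ["ValueError: boom"]}

print(has_resolvable_outputs(ok_result))     # True
print(has_resolvable_outputs(crash_result))  # False
```

With a guard of this kind, the dictionary from a crashed job passes through unchanged, so the isinstance(result_data, dict) branch in the plugin code can handle it.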

effigies

Thanks. This seems to be #3085.

It would be good to have a regression test so that we can be sure this issue doesn't recur. Do you have any thoughts here? You can make an interface that reliably crashes like so:

from nipype.interfaces.utility import Function

def crasher(x):
    raise ValueError(x)

crash_if = Function(function=crasher)

Judging by the snippet that you shared, I guess this was happening in an SGE context. I don't know if we have a test plugin for exercising that code.

nipype/pipeline/engine/utils.py

codecov · 2020-03-04T15:07:27Z

Codecov Report

Merging #3182 into master will increase coverage by 0.07%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #3182      +/-   ##
==========================================
+ Coverage   64.88%   64.95%   +0.07%     
==========================================
  Files         299      299              
  Lines       39506    39506              
  Branches     5219     5219              
==========================================
+ Hits        25632    25663      +31     
+ Misses      12824    12784      -40     
- Partials     1050     1059       +9

Flag	Coverage Δ
#unittests	`64.95% <100.00%> (+0.07%)`	⬆️

Impacted Files	Coverage Δ
nipype/pipeline/engine/utils.py	`71.60% <100.00%> (+0.45%)`	⬆️
nipype/pipeline/plugins/tools.py	`80.26% <0.00%> (+1.31%)`	⬆️
nipype/utils/config.py	`62.22% <0.00%> (+3.88%)`	⬆️
nipype/pipeline/plugins/base.py	`64.10% <0.00%> (+5.20%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b75a7ce...4a6d6b8. Read the comment docs.

effigies · 2020-03-06T17:29:37Z

Hi @daniel-ge, any chance of a regression test? Let us know if you need help.

daniel-ge · 2020-03-07T22:52:37Z

Here is a first attempt at a regression test. I am pretty sure there is much room for improvement. I tried to make it independent of existing/available installations of SLURM, SGE, etc. by running the batch script directly.
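
The idea of bypassing the scheduler can be sketched in isolation as follows. This is not the test from this PR: BatchManager, submit, and submit_directly are hypothetical names, and the "batch script" is a Python one-liner chosen so the example runs anywhere. It only shows how a submission method can be patched so the script runs in a local subprocess.

```python
import subprocess
import sys
from unittest.mock import patch


class BatchManager:
    """Hypothetical stand-in for a scheduler plugin (not nipype's class)."""

    def submit(self, script_args):
        # A real plugin would hand the script to a scheduler command here.
        raise RuntimeError("no scheduler available")

    def run(self, script_args):
        return self.submit(script_args)


def submit_directly(self, script_args):
    # Run the job script in a local subprocess instead of queueing it.
    return subprocess.call(script_args)


def run_crashing_job():
    # Assumption for illustration: the job is a one-liner exiting with status 3.
    script = [sys.executable, "-c", "import sys; sys.exit(3)"]
    with patch.object(BatchManager, "submit", new=submit_directly):
        return BatchManager().run(script)


print(run_crashing_job())  # 3
```

Because the patch is applied with a context manager, the original submit method is restored afterwards, so other tests are unaffected.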

effigies

Nice test! I tested with and without your fix to verify that it should catch a regression.

I made some suggestions to reduce boilerplate...

nipype/pipeline/plugins/tests/test_sgelike.py

daniel-ge · 2020-03-08T14:10:25Z

Thanks for your improvements. I have used black for better code style.

effigies · 2020-03-09T13:56:43Z

@daniel-ge I removed some unused imports. Apologies for pushing to your master branch.

oesteban · 2020-03-15T02:11:12Z

I suspect this PR might have broken nodes parameterized by iterables. I will open an issue if confirmed.

EDIT: For now, we should put #3174 on hold.

oesteban · 2020-03-15T02:39:16Z

Okay, I just replicated with 1.4.2 so the problem is elsewhere. Sorry for the noise.

FIX: check if result has an attribute outputs before accessing it

c100772

effigies reviewed Mar 4, 2020

Update nipype/pipeline/engine/utils.py

75ce23d

Co-Authored-By: Chris Markiewicz <[email protected]>

First attempt of a SGE test

7b1a2f4

FIX: add missing argument to is_pending mock function

9c8f322

effigies reviewed Mar 8, 2020

effigies added this to the 1.5.0 milestone Mar 8, 2020

daniel-ge and others added 8 commits March 8, 2020 14:25

Update nipype/pipeline/plugins/tests/test_sgelike.py

57ab108

Co-Authored-By: Chris Markiewicz <[email protected]>

Update nipype/pipeline/plugins/tests/test_sgelike.py

269ac56

Co-Authored-By: Chris Markiewicz <[email protected]>

Update nipype/pipeline/plugins/tests/test_sgelike.py

98aeca1

Co-Authored-By: Chris Markiewicz <[email protected]>

Update nipype/pipeline/plugins/tests/test_sgelike.py

33befb1

Co-Authored-By: Chris Markiewicz <[email protected]>

Update nipype/pipeline/plugins/tests/test_sgelike.py

ce22e36

Co-Authored-By: Chris Markiewicz <[email protected]>

Update nipype/pipeline/plugins/tests/test_sgelike.py

ef739f0

Co-Authored-By: Chris Markiewicz <[email protected]>

Update nipype/pipeline/plugins/tests/test_sgelike.py

27a2472

Co-Authored-By: Chris Markiewicz <[email protected]>

FIX: get length of generator + STY: Black

02991da

TEST: Cleanup imports

4a6d6b8

effigies merged commit 347311c into nipy:master Mar 9, 2020

effigies mentioned this pull request May 27, 2020

Error while checking node hash, forcing re-run. #3085

Closed
