
Update DOCKER_IMAGE tag in testslurm.yml #697

Merged: 13 commits merged into nipype:master on Sep 18, 2023

Conversation

@adi611 (Contributor) commented Sep 8, 2023

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to change)

Summary

Update DOCKER_IMAGE tag in testslurm.yml from latest to 21.08.6
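
Concretely, the change pins the container tag in the workflow's env block, along these lines (the image name is a placeholder; only the tag values come from this PR):

# .github/workflows/testslurm.yml (sketch)
env:
  DOCKER_IMAGE: <slurm-image>:21.08.6   # previously <slurm-image>:latest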

Checklist

  • I have added tests to cover my changes (if necessary)
  • I have updated documentation (if necessary)

@codecov (bot) commented Sep 8, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.32% ⚠️

Comparison is base (31aea01) 83.31% compared to head (e797007) 82.99%.
Report is 15 commits behind head on master.

❗ Current head e797007 differs from pull request most recent head afeb705. Consider uploading reports for the commit afeb705 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #697      +/-   ##
==========================================
- Coverage   83.31%   82.99%   -0.32%     
==========================================
  Files          22       22              
  Lines        4873     4894      +21     
  Branches     1401        0    -1401     
==========================================
+ Hits         4060     4062       +2     
- Misses        809      832      +23     
+ Partials        4        0       -4     
Flag Coverage Δ
unittests 82.99% <100.00%> (-0.32%) ⬇️

Flags with carried forward coverage won't be shown.

Files Changed Coverage Δ
pydra/utils/hash.py 95.03% <100.00%> (+2.17%) ⬆️

... and 5 files with indirect coverage changes

☔ View full report in Codecov by Sentry.

@satra (Contributor) commented Sep 8, 2023

btw, you may want to see if you can build this dockerfile: https://github.com/tazend/docker-centos7-slurm/blob/1cdc401df445ecf00e2db431a99c583eda950300/Dockerfile as it contains the latest slurm

you could drop the python versions lower than 3.8.

@adi611 (Contributor, author) commented Sep 9, 2023

> btw, you may want to see if you can build this dockerfile: https://github.com/tazend/docker-centos7-slurm/blob/1cdc401df445ecf00e2db431a99c583eda950300/Dockerfile as it contains the latest slurm
>
> you could drop the python versions lower than 3.8.

Sure, I'll start working on it.

@adi611 (Contributor, author) commented Sep 11, 2023

I built the Docker image; it can be found here.

  • Removed the python versions 3.6 and 3.7.
  • Getting the error "This cluster linux already exists. Not adding." at the Display previous jobs with sacct step (logs here), so I removed the following:
docker exec slurm bash -c "sacctmgr -i add cluster name=linux \
  && supervisorctl restart slurmdbd \
  && supervisorctl restart slurmctld"
  • Some tests are failing with the following errors:
rc, stdout, stderr = await read_and_display_async(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: coroutine ignored GeneratorExit
RuntimeError: Could not extract job ID
=========================== short test summary info ============================
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_2[slurm-list-splitter1-state_splitter1-state_rpn1-expected1-expected_ind1] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_2[slurm-array-splitter0-state_splitter0-state_rpn0-expected0-expected_ind0] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_inputspec_state_1a[slurm-result_submitter] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_2[slurm-array-splitter1-state_splitter1-state_rpn1-expected1-expected_ind1] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_2[slurm-mixed-splitter1-state_splitter1-state_rpn1-expected1-expected_ind1] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_6[slurm] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_6a[slurm] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_comb_2[slurm-splitter2-a-state_splitter2-state_rpn2-state_combiner2-state_combiner_all2-NA.b-state_rpn_final2-expected2-expected_val2] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_comb_2[slurm-splitter3-b-state_splitter3-state_rpn3-state_combiner3-state_combiner_all3-NA.a-state_rpn_final3-expected3-expected_val3] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_comb_2[slurm-splitter4-combiner4-state_splitter4-state_rpn4-state_combiner4-state_combiner_all4-None-state_rpn_final4-expected4-expected_val4] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_outputspec_8c[slurm-result_submitter] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_state_outputspec_1[slurm-result_submitter] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_6[slurm] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_7[slurm] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_workflow.py::test_wf_3nd_st_2[slurm] - RuntimeError: Could not find results of 'mult' node in a sub-directory name...
FAILED pydra/pydra/engine/tests/test_workflow.py::test_wfasnd_st_2[slurm] - ValueError: Tasks ['wfnd'] raised an error

= 16 failed, 910 passed, 88 skipped, 7 xfailed, 7 warnings, 1 rerun in 935.96s (0:15:35) =
  • The GA workflow file I used can be found here.

@satra (Contributor) commented Sep 11, 2023

nice work @adi611 - it may be good to track down those specific tests to see what's going on, perhaps by executing and debugging why the slurm return doesn't have a job ID.

at least we have an up-to-date slurm container now! for this PR, perhaps change to your newly built slurm container.

@djarecka (Collaborator) commented:

@adi611 - Have you tried to repeat the run? Do you get the errors every time you run? I've run the tests with your new image on my laptop and can't reproduce the errors...

But I've run the GA in this PR twice and it seems to work fine.

FYI, I still don't have access to the MIT slurm computers, so I can't compare.

@adi611 (Contributor, author) commented Sep 11, 2023

Should I create a PR to run the tests using all the Python versions available in the container (3.8.16, 3.9.16, 3.10.9, 3.11.1), or just the default version (3.11.1)?

@satra (Contributor) commented Sep 11, 2023

let's just get the default working. it may be overkill to try all at the moment. they are already being tested normally outside of slurm.

@adi611 (Contributor, author) commented Sep 11, 2023

> @adi611 - Have you tried to repeat the run? Do you get the errors every time you run? I've run the tests with your new image on my laptop and can't reproduce the errors...
>
> But I've run the GA in this PR twice and it seems to work fine.
>
> FYI, I still don't have access to the MIT slurm computers, so I can't compare.

Yes, I did re-run the workflow but I still got the same errors.

@adi611 (Contributor, author) commented Sep 11, 2023

> let's just get the default working. it may be overkill to try all at the moment. they are already being tested normally outside of slurm.

Ok sure.

@satra (Contributor) commented Sep 12, 2023

@adi611 - it looks like this now returns the same error as your list. perhaps you can check if you can reproduce one of those errors by limiting pytest to just that test. also, i think @djarecka tested the original slurm container, not the new one.

@djarecka (Collaborator) commented Sep 12, 2023

@satra - I also tested the new one.

@adi611 - you can also try to remove -n auto from the pytest command.

@djarecka (Collaborator) commented:

just want to confirm that with -n auto I also see errors when running in the container on my laptop (earlier I missed the fact that GA runs with -n).

I don't understand why -n leads to the error in this case, but I would say that if running all the tests serially doesn't lead to the issue, we could go with that for now.
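
For context, -n auto comes from the pytest-xdist plugin (it appears in the plugin list of the session logs later in this thread) and distributes tests across worker processes; dropping it runs everything serially. The test path below is illustrative:

# parallel run across worker processes (pytest-xdist)
pytest -n auto pydra/pydra/engine/tests/

# serial run for comparison: simply drop -n
pytest pydra/pydra/engine/tests/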

@adi611 (Contributor, author) commented Sep 12, 2023

I think there may be some confusion.

@adi611 (Contributor, author) commented Sep 12, 2023

> @adi611 - it looks like this now returns the same error as your list. perhaps you can check if you can reproduce one of those errors by limiting pytest to just that test. also, i think @djarecka tested the original slurm container, not the new one.

Yes, the errors occur even when limiting pytest to a single test from the list of failed tests, as can be seen here.
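
For anyone reproducing this, a single parameterized case can be selected by its full node ID from the failure list above, e.g.:

pytest "pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_7[slurm]"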

@djarecka (Collaborator) commented Sep 12, 2023 via email

@djarecka (Collaborator) commented:

@adi611 - I've just checked the GA reports and I see that there is -n auto in the pytest command.

@adi611 (Contributor, author) commented Sep 12, 2023

I ran it separately here. But I should update the pytest command in the current PR to check if it runs fine without the -n option.

@djarecka (Collaborator) commented Sep 12, 2023

ok, it looks like for GA, the option doesn't really make any difference...

Do you also see the same error when running on your laptop?

@adi611 (Contributor, author) commented Sep 13, 2023

Yes, I tried it on my laptop and I get the same RuntimeError: Could not extract job ID error for the failed tests.

@adi611 (Contributor, author) commented Sep 13, 2023

The issue seems to be somewhat random and involves the two stdouts at lines 292 and 341 of workers.py. Sometimes the first stdout returns nothing, in which case the exception is Could not extract job ID; when the second stdout returns nothing, the exception is Job information not found; or both may return nothing. When both return something, the tests that previously failed actually pass.
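
For context, the job-ID extraction boils down to parsing the stdout of sbatch, so an empty stdout has nothing to match and the error is raised immediately. A rough illustrative sketch of that failure path (not the exact pydra code):

import re

def extract_job_id(sbatch_stdout: str) -> int:
    """Illustrative only: pull the numeric job ID out of sbatch's stdout."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        # An empty or unexpected stdout lands here, which mirrors the
        # "Could not extract job ID" failures in the logs above.
        raise RuntimeError("Could not extract job ID")
    return int(match.group(1))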

@adi611 (Contributor, author) commented Sep 13, 2023

> was that the stdout for a task that failed? if so, you could perhaps look at where the exception is being raised and check if there is an intervening check like asking sacct or scontrol that the job has been queued.

I am unable to find such a check.

@adi611 (Contributor, author) commented Sep 13, 2023

Logs:

=========================== short test summary info ============================
FAILED pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_7[slurm] - RuntimeError: Could not extract job ID
========================= 1 failed, 1 warning in 9.04s =========================
Exception ignored in: <coroutine object SlurmWorker._submit_job at 0x7f9bc66d8ea0>
Traceback (most recent call last):
  File "/pydra/pydra/engine/workers.py", line 314, in _submit_job
  File "/pydra/pydra/engine/workers.py", line 336, in _poll_job
  File "/pydra/pydra/engine/helpers.py", line 331, in read_and_display_async
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/subprocess.py", line 218, in create_subprocess_exec
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py", line 1688, in subprocess_exec
RuntimeError: coroutine ignored GeneratorExit
Exception ignored in: <coroutine object SlurmWorker._submit_job at 0x7f9bc66d8bc0>
Traceback (most recent call last):
  File "/pydra/pydra/engine/workers.py", line 293, in _submit_job
RuntimeError: coroutine ignored GeneratorExit
Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x7f9bdc8958a0>
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_subprocess.py", line 126, in __del__
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_subprocess.py", line 104, in close
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/unix_events.py", line 558, in close
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/unix_events.py", line 582, in _close
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py", line 761, in call_soon
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py", line 519, in _check_closed
RuntimeError: Event loop is closed

(The line numbers for workers.py may not exactly match the remote due to my local changes.)

@adi611 (Contributor, author) commented Sep 13, 2023

The same test:

============================= test session starts ==============================
platform linux -- Python 3.11.1, pytest-7.4.2, pluggy-1.3.0 -- /root/.pyenv/versions/3.11.1/bin/python3.11
cachedir: .pytest_cache
rootdir: /pydra
plugins: rerunfailures-12.0, cov-4.1.0, forked-1.6.0, timeout-2.1.0, env-1.0.1, xdist-1.34.0
collecting ... collected 1 item

pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_7[slurm] PASSED

=============================== warnings summary ===============================
pydra/engine/tests/test_shelltask.py::test_shell_cmd_7[slurm]
  /pydra/pydra/engine/helpers.py:469: DeprecationWarning: There is no current event loop
    loop = asyncio.get_event_loop()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html

---------- coverage: platform linux, python 3.11.1-final-0 -----------
Coverage XML written to file /pydra/cov.xml

======================== 1 passed, 1 warning in 10.45s =========================

@satra (Contributor) commented Sep 13, 2023

instead of just running it, you can run the test with the pdb option; you can also enable debug logging around the relevant parts of the code. that may give you more insight into the state of the system. i suspect this is a resource contention issue with the slurm database, and a couple of retries could help, or looking at stderr and stdout.
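
Concretely, both suggestions can be done with standard pytest options, e.g. dropping into the debugger on failure and surfacing DEBUG-level log output for a single failing test:

pytest "pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_7[slurm]" --pdb --log-cli-level=DEBUG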

@djarecka (Collaborator) commented:

@adi611 - I've noticed that some tests do not use pytest's tmp_path fixture as cache_dir (see cache_dir=tmp_path in most of the tests); this can sometimes lead to issues. Could you fix the tests and see if that helps?
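
For reference, the suggested pattern looks roughly like this (the task name below is hypothetical; only the tmp_path-as-cache_dir idea comes from the comment above):

def test_example_uses_tmp_cache(tmp_path):
    # tmp_path is pytest's built-in per-test temporary directory fixture.
    # Passing it as cache_dir keeps each test's cache isolated, so runs
    # cannot pick up results cached by another test or worker process.
    task = my_task(x=1, cache_dir=tmp_path)  # my_task: hypothetical pydra task
    result = task()
    assert result.output.out is not None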

@adi611 (Contributor, author) commented Sep 14, 2023

> @adi611 - I've noticed that some tests do not use pytest's tmp_path fixture as cache_dir (see cache_dir=tmp_path in most of the tests); this can sometimes lead to issues. Could you fix the tests and see if that helps?

I checked, and many of the failed tests, like test_node_task.py::test_task_state_2 and test_shelltask.py::test_shell_cmd_7, already use cache_dir=tmp_path.

@adi611 (Contributor, author) commented Sep 14, 2023

Currently I am unable to reproduce the issue for single tests.

@adi611 (Contributor, author) commented Sep 14, 2023

This seems to be a Python 3.11.1-specific issue; should I try a newer 3.11 version like 3.11.5 and see if it works? I have seen discussions on CPython with similar issues, and maybe they have rolled out a fix for it.

@ghisvail (Collaborator) commented:

> This seems to be a Python 3.11.1-specific issue; should I try a newer 3.11 version like 3.11.5 and see if it works? I have seen discussions on CPython with similar issues, and maybe they have rolled out a fix for it.

If you're testing on the current Python branch, it is best to try out the latest published version first and then bisect with previous versions if you notice a regression.

Could you reference the exact issues you think may be of interest on cpython?

@adi611 (Contributor, author) commented Sep 14, 2023

> > This seems to be a Python 3.11.1-specific issue; should I try a newer 3.11 version like 3.11.5 and see if it works? I have seen discussions on CPython with similar issues, and maybe they have rolled out a fix for it.
>
> If you're testing on the current Python branch, it is best to try out the latest published version first and then bisect with previous versions if you notice a regression.
>
> Could you reference the exact issues you think may be of interest on cpython?

This is one such issue.

@djarecka (Collaborator) commented:

thanks for tracking this! yes, please check for 3.11.5!

@djarecka (Collaborator) commented:

it looks like it works for 3.11.5! :) just remove 3.11.1, and we will merge it.
great job!

@satra (Contributor) commented Sep 15, 2023

is there a way to exclude a series of python dependencies in pydra's python config? we should add that to the PR so we know why we did this.

@adi611 (Contributor, author) commented Sep 15, 2023

> it looks like it works for 3.11.5! :) just remove 3.11.1, and we will merge it. great job!

Thanks!

@djarecka (Collaborator) commented:

@adi611 - could you please exclude the python version in pyproject.toml, as @satra suggested?

@adi611 (Contributor, author) commented Sep 17, 2023

I added !=3.11.1 to requires-python in pyproject.toml, which specifies that 3.11.1 should be excluded from the list of acceptable Python versions.
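
For reference, the resulting specifier looks roughly like this (the lower bound shown is an assumption, not taken from this PR; only the !=3.11.1 exclusion is):

# pyproject.toml
requires-python = ">=3.8, !=3.11.1"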

@adi611 (Contributor, author) commented Sep 17, 2023

The Slurm workflow for 3.11.5 is failing at the Display previous jobs with sacct step:

Run echo "Allowing ports/daemons time to start" && sleep 10
Allowing ports/daemons time to start
 This cluster 'linux' doesn't exist.
        Contact your admin to add it to accounting.
Error: Process completed with exit code 1.

Is it possible this is due to the sleep time of 10 seconds not being enough?
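
If it does turn out to be a timing problem, one option would be to replace the fixed sleep with a small retry loop that waits until sacct responds, e.g. (a sketch, reusing the slurm container name from earlier in this thread):

# wait up to ~60 s for slurmdbd/sacct instead of a fixed 10 s sleep
for i in {1..12}; do
  docker exec slurm bash -c "sacct" && break
  echo "sacct not ready yet, retrying..."
  sleep 5
done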

@djarecka (Collaborator) commented:

I've just restarted it and it seems to work; if we have this issue again, we can increase the time.

@djarecka merged commit fa4d4f9 into nipype:master on Sep 18, 2023
31 of 32 checks passed
@djarecka (Collaborator) commented:

@adi611 - I've realized that your name is not in the zenodo file; please open a PR if you want your name to be included!

@adi611 (Contributor, author) commented Sep 18, 2023

> @adi611 - I've realized that your name is not in the zenodo file; please open a PR if you want your name to be included!

Thanks, I'll do that, but I need some help since I've never done it before. Is there a preferred order for including my name? Also, what should I specify as my affiliation?

@djarecka (Collaborator) commented:

we should think about the order; for now we don't use any rule except that Satra is at the end, so you can put your name before him.
It's up to you what you want to use as your affiliation; you could use your university or you can leave it empty.

@satra (Contributor) commented Nov 8, 2023

@adi611 - just a quick thing. can you post the slurm dockerfile somewhere?

@adi611 (Contributor, author) commented Nov 14, 2023

@satra - Sorry for the delay. I have added the dockerfile as a public GitHub gist here.

@djarecka (Collaborator) commented:

Thank you, perhaps you can add this to the .github directory for reference.
