
Draft: Adding new worker which uses PSI/J to run tasks #694

Merged
merged 45 commits into nipype:master
Sep 24, 2023

Conversation

@adi611 (Contributor) commented Sep 1, 2023

Types of changes

  • New feature (non-breaking change which adds functionality)

Summary

Add a new worker called PsijWorker to workers.py which uses PSI/J to run tasks.
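
For context, a minimal sketch of the idea behind such a worker, using the psij-python API to submit a command as a job; the executable, paths, and structure here are illustrative assumptions, not the PR's actual implementation:

```python
from pathlib import Path
from psij import Job, JobSpec, JobExecutor

# Pick a PSI/J backend; "local" runs jobs on the current machine,
# "slurm" would hand them to a Slurm cluster instead.
executor = JobExecutor.get_instance("local")

# Describe the command to run; a worker would wrap the task's command here.
spec = JobSpec(
    executable="/bin/echo",
    arguments=["hello from PSI/J"],
    stdout_path=Path("psij_demo.out"),
)

job = Job(spec)
executor.submit(job)
job.wait()  # block until the backend reports the job as finished
print(job.status)
```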

Checklist

  • I have added tests to cover my changes (if necessary)
  • I have updated documentation (if necessary)

codecov bot commented Sep 1, 2023

Codecov Report

Patch coverage: 21.87% and project coverage change: -1.61% ⚠️

Comparison is base (0245cdc) 83.42% compared to head (28ec2fd) 81.82%.

❗ Current head 28ec2fd differs from pull request most recent head 2c695d5. Consider uploading reports for the commit 2c695d5 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #694      +/-   ##
==========================================
- Coverage   83.42%   81.82%   -1.61%     
==========================================
  Files          22       23       +1     
  Lines        4894     4990      +96     
  Branches     1410        0    -1410     
==========================================
  Hits         4083     4083              
- Misses        807      907     +100     
+ Partials        4        0       -4     
Flag Coverage Δ
unittests 81.82% <21.87%> (-1.61%) ⬇️

Flags with carried forward coverage won't be shown.

Files Changed Coverage Δ
pydra/engine/workers.py 18.31% <11.66%> (-4.42%) ⬇️
pydra/engine/run_pickled.py 23.80% <23.80%> (ø)
pydra/conftest.py 68.62% <60.00%> (-9.16%) ⬇️

... and 3 files with indirect coverage changes

☔ View full report in Codecov by Sentry.

@djarecka (Collaborator) commented Sep 7, 2023

I'm trying to understand why the slurm test fails now (the last run I cancelled, but it took 5h...), since you didn't add any new tests here yet. Any idea?

@djarecka (Collaborator) commented Sep 7, 2023

@adi611 - perhaps you can merge the current master and try the new actions/checkout that was introduced. I've run test_shelltask.py with the docker image on my laptop and it was fine.

@adi611 (Contributor, Author) commented Sep 7, 2023

@adi611 - perhaps you can merge the current master and try the new actions/checkout that was introduced. I've run test_shelltask.py with the docker image on my laptop and it was fine.

Ok I will try this out

@adi611 (Contributor, Author) commented Sep 7, 2023

I'm trying to understand why the slurm test fails now (the last run I cancelled, but it took 5h...), since you didn't add any new tests here yet. Any idea?

I couldn't find any reason for the workflow to not work here, when it works elsewhere. I will try the new actions/checkout and check if the problem persists.

@adi611 (Contributor, Author) commented Sep 7, 2023

I have updated the worker. Currently I am using the slurm instance of psij.JobExecutor to run the Slurm tests.

Prerequisite:

  • The local pydra repository must be mounted as /pydra as done in the Slurm GA workflow
  • Install psij-python in the docker container
  • Change the plugin for the tests to psij, or make changes in the conftest file as discussed in GSoC 2023: Add PSI/J Worker #691

Features:

  • Can run the Slurm tests individually

Known issues:

  • Tests fail when many are run simultaneously; e.g., pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_4 passes on its own but fails when running all the tests in test_shelltask.py.
  • The code is not clean yet.

I am working on fixing these issues.
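
As a rough illustration of the Slurm-backed setup described in this comment, the snippet below submits a trivial job through PSI/J's slurm executor from inside such a container; the queue name, time limit, and output path are placeholders, so this is an assumption-laden sketch rather than the worker's code:

```python
from datetime import timedelta
from pathlib import Path
from psij import Job, JobSpec, JobAttributes, JobExecutor

# Requires psij-python installed in the container and a reachable Slurm controller.
executor = JobExecutor.get_instance("slurm")

spec = JobSpec(
    executable="/bin/hostname",
    stdout_path=Path("/pydra/psij_slurm_check.out"),  # /pydra as mounted in the GA workflow
    attributes=JobAttributes(
        queue_name="debug",             # placeholder partition name
        duration=timedelta(minutes=5),  # small time limit for a smoke test
    ),
)

job = Job(spec)
executor.submit(job)
job.wait()
print(job.status)
```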

@djarecka (Collaborator) commented:

@adi611 - I've had problems with my MIT account during the weekend, so I was not able to test it, but hopefully I will be able to solve my issues tomorrow.

@adi611 (Contributor, Author) commented Sep 11, 2023

@djarecka - I think the issue arises when running the tests with the psij plugin and -n auto mode of pytest. I tried running all the tests in test_shelltask.py:

  • Using the local instance of the psij job executor, with the command pytest --color=yes -v pydra/engine/tests/test_shelltask.py. Here is the result (2 tests fail, the others run and pass without any issue):
========================================== short test summary info ==========================================
FAILED pydra/engine/tests/test_shelltask.py::test_shell_cmd_outputspec_1b_exception[psij] - Failed: DID NOT RAISE <class 'Exception'>
FAILED pydra/engine/tests/test_shelltask.py::test_shell_cmd_outputspec_2a_exception[psij] - Failed: DID NOT RAISE <class 'Exception'>
====================== 2 failed, 147 passed, 1 skipped, 3 xfailed in 388.45s (0:06:28) ======================
  • Using the Slurm container and the slurm instance of psij job executor (without -n auto): the tests run and pass, but at a very slow pace
  • Using the Slurm container and the slurm instance of psij job executor (with -n auto): many tests fail

@adi611 (Contributor, Author) commented Sep 12, 2023

I have fixed the issue with pytest's -n auto mode; it can now run multiple tests at once.

@djarecka (Collaborator) commented:

@adi611 - I've tried to run the tests with the new worker on our cluster, but tests are failing; I'm not able to get results when running with Submitter. I will try to debug it in the next few days.

@adi611 (Contributor, Author) commented Sep 15, 2023

@adi611 - I've tried to run the tests with the new worker on our cluster, but tests are failing; I'm not able to get results when running with Submitter. I will try to debug it in the next few days.

I think the issue might be with the paths to files and functions (like /pydra/pydra/engine/run_pickled_function.py), and/or the python3.9 command. I used the previous Slurm container as a reference for writing this, and I should make it more generalized now. Could you please check if the issue still persists after correcting the paths and the python command? Meanwhile I will update the PR.

@djarecka (Collaborator) commented:

Yes, it should be more generalized; you can always use the task's output_dir.
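
A minimal sketch of that generalization, assuming the helper added in this PR can be imported as pydra.engine.run_pickled and that the command for the PSI/J job is built per task rather than from hard-coded paths; the function name and details are hypothetical:

```python
import pickle
import sys
from pathlib import Path

from pydra.engine import run_pickled  # helper module added in this PR (import path assumed)


def build_job_command(task):
    """Derive the interpreter, script, and pickle path from the task itself,
    instead of hard-coding /pydra/pydra/engine/run_pickled_function.py and python3.9."""
    out_dir = Path(task.output_dir)        # per-task directory, as suggested above
    out_dir.mkdir(parents=True, exist_ok=True)

    pkl_file = out_dir / "task.pkl"
    with open(pkl_file, "wb") as f:
        pickle.dump(task, f)

    script = Path(run_pickled.__file__)    # locate the helper via the installed package
    return sys.executable, [str(script), str(pkl_file)]
```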

@adi611 (Contributor, Author) commented Sep 15, 2023

I think it should work now

@adi611 (Contributor, Author) commented Sep 19, 2023

Added the option to switch between the different schedulers provided by psij; e.g., plugin='psij-local' and plugin='psij-slurm' will use the local and slurm instances of psij's job executor, respectively. For now, the list contains only the local and slurm options.
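
A minimal usage sketch of this switching, assuming these plugin names are registered with pydra's Submitter; the toy task is purely illustrative:

```python
import pydra


@pydra.mark.task
def add_two(x):
    return x + 2


task = add_two(x=1)

# Switch backends by name: "psij-local" runs via PSI/J's local executor,
# "psij-slurm" would submit through Slurm instead.
with pydra.Submitter(plugin="psij-local") as sub:
    sub(task)

print(task.result().output.out)  # -> 3
```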

@adi611 (Contributor, Author) commented Sep 20, 2023

The psij-local and psij-slurm plugins pass all the required tests, though psij-slurm takes much longer than the existing slurm plugin. I will update the conftest.py file to resolve the conflicts with other tests. Please let me know if I need to make other changes.

@adi611 (Contributor, Author) commented Sep 20, 2023

I will also try to improve the performance of psij-slurm.

@djarecka (Collaborator) commented:

With the new version, psij-local works on my OSX! I will check the new version with Slurm tomorrow morning.

For the conftest, you could just expand the list of workers and add psij-local to cf and psij-slurm to slurm, and see how long it takes on GA.

@adi611 (Contributor, Author) commented Sep 21, 2023

For the conftest, you could just expand the list of workers and add psij-local to cf and psij-slurm to slurm, and see how long it takes on GA.

I have updated conftest.py to add the psij plugin options. The --psij=local and --psij=slurm options correspond to the psij-local and psij-slurm plugins, respectively. Also, I have created dedicated GA workflows for the psij plugins: PSI/J-Local and PSI/J-SLURM.
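
A rough sketch of how such a command-line option could be wired up in conftest.py; the fixture name and parametrization shown here are assumptions and simplifications, not pydra's actual conftest:

```python
# conftest.py (sketch)
def pytest_addoption(parser):
    parser.addoption(
        "--psij",
        action="store",
        default=None,
        choices=["local", "slurm"],
        help="also run tests with the matching psij-* worker",
    )


def pytest_generate_tests(metafunc):
    # Hypothetical 'plugin' fixture that tests request to pick a worker.
    if "plugin" in metafunc.fixturenames:
        plugins = ["cf"]
        psij = metafunc.config.getoption("--psij")
        if psij:
            plugins.append(f"psij-{psij}")
        metafunc.parametrize("plugin", plugins)
```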

@adi611 (Contributor, Author) commented Sep 21, 2023

For PSI/J-SLURM it took 1h 56m 35s to complete, as can be seen here.

@djarecka (Collaborator) left a review comment:

I left some small suggestions, but I also think we should try to run psij-local with docker. It seems like all the test_docker tests are disabled in the PSI/J-Local / test workflow.

pydra/engine/workers.py: 5 review suggestions (outdated, resolved)
@adi611 (Contributor, Author) commented Sep 22, 2023

@djarecka - I checked the test_dockertask tests in PSI/J-Local / test, and all the tests that pass for cf also pass for psij-local; e.g., test_docker_1 passes while test_docker_3 is skipped, for both cf and psij-local.

@djarecka merged commit 0e3d33c into nipype:master on Sep 24, 2023
34 of 35 checks passed