Switch from subprocess to multiprocessing #259
Conversation
This PR should not be merged until #257 has been merged, as this branch contains that commit.
@pwolfram and @milenaveneziani, I have made some fairly significant changes to how tasks are handled in parallel. When I was trying to set up tasks that depended on one another (either as prerequisites or as subtasks), I ran into trouble. When tasks are launched as separate processes with subprocess, as I was doing before, the whole analysis in the subprocess forgets everything that has been done in the "master" process and starts over at the beginning. Not all tasks were getting set up in the subprocess, so a lot of data that was needed wasn't available. A solution was to use the multiprocessing module instead. You can feel free to dig into the code if you like. Or you can trust that I have a good reason for making the changes I did and just run some tests to make sure I didn't break anything. Whichever you're up for. One thing you'll notice is that I am now using a logger for each task. Please let me know if you have concerns about these changes, as I plan to build a lot of future work on this infrastructure.
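For context, here is a minimal sketch of why this matters, with hypothetical names (MyTask, setup()) rather than the actual MPAS-Analysis API: a task launched via multiprocessing.Process inherits the state built up in the parent, unlike a fresh interpreter started with subprocess.

```python
# Hypothetical sketch, not the real MPAS-Analysis classes: state set up
# in the parent process is still visible when the child runs.
from multiprocessing import Process


class MyTask(Process):
    def __init__(self, config):
        super(MyTask, self).__init__()
        self.config = config
        self.preprocessed = None

    def setup(self):
        # expensive, shared setup done once in the parent process
        self.preprocessed = {'climatology': 'done'}

    def run(self):
        # runs in the child, which still sees self.preprocessed;
        # a subprocess would have started over from scratch instead
        print(self.name, self.preprocessed)


if __name__ == '__main__':
    task = MyTask(config={})
    task.setup()   # happens in the parent
    task.start()   # the child inherits the setup state
    task.join()
```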
Okay, @milenaveneziani and @pwolfram, this branch is ready to be reviewed when you have time.
Funny that @pwolfram and I split our attention to this PR and #258 without planning it.
And about running on multiple nodes: is that still supported?
No, that is no longer supported. Instead, it is one node; all tasks run on that node. If you want to run different tasks on different nodes, you would need to launch them manually using separate config files or --generate flags.
If I understand right, you don't like
I'm glad you think it will be useful. I've found it to be kind of a mixed bag, but as soon as I figure out how to redirect print statements, warnings and exceptions to the log, it seems like it will be useful.
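As an illustration of the kind of redirection being discussed, here is one possible approach, a sketch only and not necessarily what MPAS-Analysis adopted; StreamToLogger and the logger name are made up:

```python
# Forward print statements and warnings to a per-task log file.
import logging
import sys
import warnings


class StreamToLogger(object):
    """File-like object that forwards writes to a logger."""

    def __init__(self, logger, level=logging.INFO):
        self.logger = logger
        self.level = level

    def write(self, message):
        message = message.rstrip()
        if message:
            self.logger.log(self.level, message)

    def flush(self):
        pass


logging.basicConfig(filename='myTask.log', level=logging.INFO)
logger = logging.getLogger('myTask')

sys.stdout = StreamToLogger(logger)  # print statements become INFO records
logging.captureWarnings(True)        # warnings module output goes to logging

print('this ends up in myTask.log as INFO')
warnings.warn('this is captured by the logging system')
```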
ok, so we will need to change the sample scripts for running batch jobs to reflect this change (and I need to take a note that there will have to be a similar change on the a-prime side).
It just seemed a bit too general. I am good with
Yes, indeed. I'll add these changes to this PR. I might need your help on systems other than LANL and NERSC still.
If this process happens quickly, I'm fine with holding off on this PR until we have a v1.0 tag. But this PR is the first in a long chain of PRs that are a high priority for me and the ALCC team.
I can test and change the relevant script on Titan/OLCF.
@milenaveneziani and @pwolfram, I just updated the job scripts. I'll test on Edison and Wolf. @milenaveneziani, can you verify that things work on OLCF? If 12 tasks isn't a good number, please let me know what is and I'll update.
Having trouble on Edison. I'll keep you posted. Maybe hold off on testing elsewhere for now.
I fixed the issue on Edison. Everything ran just fine for the QU240 case both in serial on the login node and in parallel on the compute nodes (with maxParallelTasks=12). I'm testing the EC60to30 case now. I squashed commits so we'll hopefully be ready to merge when the time comes.
I just rebased after the last merge.
I'm seeing the following type of error on Edison. I have seen similar errors on my laptop but hoped it was just something weird with my laptop. The error seems to always be "Bad file descriptor" and always to result from a call to
I have a fix to the problem above that doesn't seem too bad. I am creating remappers (and therefore mapping files) during the setup phase instead. Various tests worked on my laptop with this change (and not before this change). The EC60to30 test on Edison also worked fine in batch mode with this change. I'll re-test on Wolf next.
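A sketch of this workaround, with hypothetical names: build the mapping files once in the parent during setup, so child processes only ever read finished files and never hold the file descriptors that appeared to trigger the "Bad file descriptor" errors.

```python
# Illustrative only; build_mapping_file and RemapTask are stand-ins
# for the real remapper/mapping-file machinery.
from multiprocessing import Process


def build_mapping_file(path):
    # stand-in for the real mapping-file generation
    with open(path, 'w') as mappingFile:
        mappingFile.write('mapping weights\n')


class RemapTask(Process):
    def __init__(self, mappingFileName):
        super(RemapTask, self).__init__()
        self.mappingFileName = mappingFileName

    def run(self):
        # the child only reads the file the parent already created
        with open(self.mappingFileName) as mappingFile:
            weights = mappingFile.read()
        print('read {} bytes of weights'.format(len(weights)))


if __name__ == '__main__':
    build_mapping_file('map.txt')  # done in the parent, during setup
    task = RemapTask('map.txt')
    task.start()
    task.join()
```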
After explicitly setting the number of OpenMP threads to 1, I was able to successfully run the EC60to30 test case on Wolf. I see warnings from NCO that it's having to run with 1 thread instead of 2 (as I also saw on NERSC), but things work correctly. Not quite sure what the deal is there, but things go terribly wrong without the flag (and presumably therefore with 2 OMP threads by default).
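For reference, the equivalent of this fix from within Python; the job scripts themselves presumably just use `export OMP_NUM_THREADS=1`:

```python
# Pin OpenMP-threaded libraries (e.g. NCO invoked from the analysis)
# to a single thread; must be set before those libraries spin up.
import os

os.environ['OMP_NUM_THREADS'] = '1'
```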
@milenaveneziani, feel free to test on Titan. You may also need to add a line to the job script setting the number of OpenMP threads to 1. I'm not sure how Titan works.
On OLCF I recommend you target Rhea first; the NCO and other packages are kept more up to date there since it's the analysis machine.
Hi @kevans32: the problem with Rhea is that I cannot get batch jobs to end cleanly, because of the last cp of png files to the html location. For this reason, in the last a-prime commit, we decided to support batch jobs on Titan only (at least for now).
@milenaveneziani, I removed the without the
Regarding the segfault, I can't really help much until I have access to OLCF. Could you post the full log file (via gist.github.com)? It should tell you where things were run (i.e. out of @pwolfram's acme-unified vs. @czender's directory). Other than that, I guess I would try running the command that caused the segfault on its own and see if you can figure out what might be going wrong. If we can at least reproduce the error outside of MPAS-Analysis, we might be able to get some help from @czender.
Let me poke around with this error on Titan. It is strange that I get different errors every time.
Hmm, this is weird. I re-ran with a
No, I really don't know if it's not reproducible. I'd try running a few more times and make sure it doesn't happen again.
I have run a few more times (with and without purging first) and all went well.
Okay, that sounds good to me. @pwolfram, do you have time to look at this anytime soon? Since there are a lot of fairly high-level Python changes (notably to the base class AnalysisTask), it would be good to have your eyes on it.
@xylar, thanks for the ping on this. I took a quick look and nothing major stood out. I could potentially review this late next week, maybe on Tuesday or Wednesday at the earliest but likely Thursday or Friday or even the next week. I won't know for sure until early Tuesday. What is your needed time scale for a review?
@pwolfram, late next week would be fine. If you find you don't have time next week, could you let us know?
Sure @xylar
Note: I have seen issues with processes freezing up during calls to subprocess. I have not seen this kind of lock-up behavior with this particular branch, so I think it is safe to merge these changes first and work toward removing subprocess calls later.
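To make the concern concrete, here is the general shape of the risky pattern as I read this note (illustrative code, not the actual MPAS-Analysis calls): external commands launched via subprocess from inside multiprocessing children. Giving each child its own log file, rather than an inherited shared stream, is one way to reduce contention.

```python
# subprocess used from inside multiprocessing children; each child
# writes to its own file instead of a shared pipe.
import subprocess
from multiprocessing import Process


class ExternalToolTask(Process):
    def __init__(self, logFileName):
        super(ExternalToolTask, self).__init__()
        self.logFileName = logFileName

    def run(self):
        with open(self.logFileName, 'w') as logFile:
            # each child gets its own output file
            subprocess.check_call(['echo', 'running an external tool'],
                                  stdout=logFile, stderr=logFile)


if __name__ == '__main__':
    tasks = [ExternalToolTask('task{}.log'.format(index))
             for index in range(2)]
    for task in tasks:
        task.start()
    for task in tasks:
        task.join()
```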
pwolfram left a comment:
@xylar, to be fair there are a lot of moving pieces here that have changed, and it would take a considerable amount of time to do a proper review. I'm not sure I can give you a reasonable review given my time constraints, although I have done the best I can at this point. I certainly won't stand in the way of the PR.
@xylar, I should also note that at this point my code is essentially "legacy-esque", as I have contributed 6,400 lines of changes to the repo and you have committed 62,600, which is essentially an order of magnitude more. I would, at this point, benefit from some type of developer tutorial for the system, as I'm sure others would too.
@pwolfram, a developer's guide is a good idea. For now, what we have is the analysis task template, but a more thorough guide would be useful. I will put it on my list.
And thank you for taking the time to review this PR!
@xylar: do you think I should do more testing on this or are we ready to merge?
Testing
I've tested one run each (both batch mode and on the login node) on:
All run successfully to completion and create the HTML, though I haven't carefully looked at the output in all cases.
@vanroekel, would you have time to run a quick test on Anvil or Theta or wherever is convenient for you? If you're happy with it, I would like to merge later today if possible.
@xylar sure, I'll try it out on Theta and try to get it done before the end of the day.
    # non-public attributes related to multiprocessing and logging
    self.daemon = True
    self._runStatus = Value('i', AnalysisTask.UNSET)
    self._staskTrace = None
should this be stackTrace?
Sure should be. Fortunately (or unfortunately?), the code never accesses the _stackTrace attribute unless the task has failed and given it a valid value. I'll fix this right away.
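For readers following along, here is a hedged sketch of how a shared multiprocessing.Value such as _runStatus lets the parent poll a child's state; the status codes below are illustrative stand-ins for AnalysisTask's actual constants:

```python
# Parent polls a child's status via a shared multiprocessing.Value.
import time
from multiprocessing import Process, Value

UNSET, RUNNING, SUCCESS, FAIL = range(4)  # illustrative status codes


class PollableTask(Process):
    def __init__(self):
        super(PollableTask, self).__init__()
        # shared integer, visible to both parent and child
        self._runStatus = Value('i', UNSET)

    def run(self):
        self._runStatus.value = RUNNING
        try:
            time.sleep(0.1)  # stand-in for real work
            self._runStatus.value = SUCCESS
        except Exception:
            self._runStatus.value = FAIL


if __name__ == '__main__':
    task = PollableTask()
    task.start()
    while task._runStatus.value in (UNSET, RUNNING):
        time.sleep(0.01)  # the parent polls the shared status
    task.join()
    print('final status:', task._runStatus.value)
```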
@xylar sorry for the delay, I'm testing this right now.
In order to support better data availability between tasks (e.g. if tasks depend on data from the setup of other tasks), tasks are now subclasses of multiprocessing.Process. This work is needed as a precursor to adding support for subtasks (e.g. for producing climatologies, or for generating each plot from a task).

Logging is also altered so that each task has a logger, to which info, errors and warnings should be logged.

Because multiprocessing.Process already uses a method called run(), the run() method of AnalysisTask and its derived classes had to be renamed to run_task().

Job scripts have been updated to run on 1 node. The default ncclimo mode is background (12 threads). The path to @czender's NCO installation has been removed from the Edison job script. It is no longer needed for the latest NCO and MPAS-Analysis, which includes a flag to ignore @czender's path on LCF machines. The LANL job script has been updated to have 1 OpenMP thread instead of 2. This is needed (as on NERSC) to prevent NCO problems when running with the default 2 threads. (Not sure why.)
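A minimal sketch of the renaming the commit message describes, with hypothetical details filled in: run() remains the multiprocessing entry point and becomes a thin wrapper that sets up per-task logging and status handling around the task's own run_task().

```python
# Illustrative skeleton; the real AnalysisTask has more machinery.
import logging
from multiprocessing import Process, Value

UNSET, SUCCESS, FAIL = range(3)  # illustrative status codes


class AnalysisTask(Process):
    def __init__(self, taskName):
        super(AnalysisTask, self).__init__(name=taskName)
        self._runStatus = Value('i', UNSET)
        self.logger = None

    def run(self):
        # called by Process.start() in the child; wraps run_task()
        logging.basicConfig(filename='{}.log'.format(self.name),
                            level=logging.INFO)
        self.logger = logging.getLogger(self.name)
        try:
            self.run_task()
            self._runStatus.value = SUCCESS
        except Exception:
            self.logger.exception('task failed')
            self._runStatus.value = FAIL

    def run_task(self):
        # subclasses override this instead of run()
        raise NotImplementedError
```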
@xylar, I was able to reproduce plots visually identical to develop with high and low res data. Thanks for the patience with testing on this PR.
Thanks @vanroekel.