Conversation

@xylar
Collaborator

@xylar xylar commented Oct 8, 2017

In order to support better data availability between tasks (e.g.
if tasks depend on data from the setup of other tasks), tasks are
now subclasses of multiprocess.Process. This work is needed to
support subtasks (e.g. for producing climatologies, or for generating
each plot from a task).

Logging is also altered so that each task has a logger, to which
info, errors and warnings should be logged. Print statements will
automatically go to the logger as "info" while exceptions and
warnings raised through the warning module will be logged as
errors.

Because multiprocess.Process already uses a method called run(),
the run() method of AnalysisTask and its derived classes had to
be renamed to run_analysis().
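
For illustration, here is a minimal sketch of the pattern described above, assuming the standard-library multiprocessing module; only the names AnalysisTask and run_analysis() come from this PR, everything else is made up:

```python
# Minimal sketch, assuming the standard-library multiprocessing module;
# only AnalysisTask and run_analysis() are names from this PR.
import logging
import multiprocessing


class AnalysisTask(multiprocessing.Process):
    """Each analysis task runs as its own process with its own logger."""

    def __init__(self, taskName):
        super(AnalysisTask, self).__init__(name=taskName)
        self.taskName = taskName
        self.logger = logging.getLogger(taskName)

    def run(self):
        # multiprocessing.Process reserves run() as the process entry point,
        # so the analysis itself lives in a separately named method
        try:
            self.run_analysis()
        except Exception:
            self.logger.exception('analysis task %s failed during run',
                                  self.taskName)

    def run_analysis(self):
        # overridden by each concrete analysis task
        raise NotImplementedError('subclasses must implement run_analysis()')
```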

@xylar
Collaborator Author

xylar commented Oct 8, 2017

This PR should not be merged until #257 has been merged, as this branch contains that commit.

@xylar
Collaborator Author

xylar commented Oct 8, 2017

@pwolfram and @milenaveneziani, I have made some fairly significant changes to how tasks are handled in parallel. When I was trying to set up tasks that depended on one another (either as prerequisites or as subtasks), I ran into trouble. When tasks are launched as separate processes with subprocess, as I was doing before, the subprocess forgets everything that has been done in the "master" process and starts over from the beginning. Not all tasks were getting set up in the subprocess, so a lot of data that was needed wasn't available.

A solution was to use the multiprocess module, which launches new processes that are copies of the current process up to the point when the new process was launched. That means that all tasks have access to whatever information is in other tasks up to (but not including) the point where we call the run() method (now renamed run_analysis()). So one task can find out what input files another task used or which files it plans to write out.
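
As a toy illustration of that point (assuming fork semantics, the default on Linux, and entirely made-up task names), attributes set during setup in the main process are still visible once the processes are launched:

```python
# Toy illustration only: setup happens in the main process, so each launched
# process carries a copy of every task's setup results (assumes fork semantics).
import multiprocessing


class Task(multiprocessing.Process):
    def __init__(self, name, prerequisite=None):
        super(Task, self).__init__(name=name)
        self.prerequisite = prerequisite
        self.inputFiles = []

    def setup(self):
        # runs in the main process before any process is started
        self.inputFiles = ['{}_input.nc'.format(self.name)]

    def run(self):
        if self.prerequisite is not None:
            # the prerequisite's setup results were copied into this process
            print('{} sees {}'.format(self.name, self.prerequisite.inputFiles))


if __name__ == '__main__':
    climatology = Task('climatology')
    plots = Task('plots', prerequisite=climatology)
    for task in (climatology, plots):
        task.setup()              # all setup before anything is launched
    for task in (climatology, plots):
        task.start()
    for task in (climatology, plots):
        task.join()
```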

You can feel free to dig into the code if you like. Or you can trust that I have a good reason for making the changes I did and just run some tests to make sure I didn't break anything. Whichever you're up for.

One thing you'll notice is that I am now using a logger in a lot of places where we were using print statements before. I think it would be a good idea to move to the logger because it has several levels it can operate at (debug, info, warning and error), but you can feel free to use print as before (which will be logged as "info") and warnings.warn (which will, however, be logged as an error because it goes to stderr).
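
In case it helps, one common way to get that behavior (a sketch of the general approach, not the code in this PR; the task name is hypothetical) is to swap sys.stdout and sys.stderr for thin file-like wrappers around the task's logger:

```python
# Sketch of the general approach, not the code in this PR: print() output
# becomes INFO, while anything written to stderr (warnings, tracebacks)
# becomes ERROR.
import logging
import sys


class StreamToLogger(object):
    """File-like object that forwards writes to a logger at a fixed level."""

    def __init__(self, logger, level):
        self.logger = logger
        self.level = level

    def write(self, message):
        message = message.rstrip()
        if message:
            self.logger.log(self.level, message)

    def flush(self):
        pass


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('someAnalysisTask')  # hypothetical task name

sys.stdout = StreamToLogger(logger, logging.INFO)   # print() -> "info"
sys.stderr = StreamToLogger(logger, logging.ERROR)  # warnings.warn() -> "error"
```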

Please let me know if you have concerns about these changes, as I plan to build a lot of future work on this infrastructure.

@xylar
Collaborator Author

xylar commented Oct 8, 2017

Okay, @milenaveneziani and @pwolfram, this branch is ready to be reviewed when you have time.

@milenaveneziani
Collaborator

Funny that @pwolfram and I split our attention between this PR and #258 without planning it.
A couple of quick questions from me on this one:

  • I don't suppose anything changes as far as parallel tasks are executed (one task per node) with this one, right?
  • how about changing run to run_analysis_task or run_task?

And about logger: I learned about it last week at the python class. I think it's going to be quite useful for mpas-analysis.

@xylar
Collaborator Author

xylar commented Oct 10, 2017

@milenaveneziani

I don't suppose anything changes as far as parallel tasks are executed (one task per node) with this one, right?

No, that is no longer supported. Instead, it is one node, with all tasks running on that node. If you want to run different tasks on different nodes, you would need to launch them manually using separate config files or --generate flags.

how about changing run to run_analysis_task or run_task

If I understand right, you don't like run_analysis. I'm fine with either of these, with preference for run_task since it's shorter.

And about logger: I learned about it last week at the python class. I think it's going to be quite useful for mpas-analysis.

I'm glad you think it will be useful. I've found it to be kind of a mixed bag, but once I figure out how to redirect print statements, warnings and exceptions to the log, I think it will be quite useful.

@milenaveneziani
Collaborator

No, that is no longer supported. Instead, it is one node, with all tasks running on that node. If you want to run different tasks on different nodes, you would need to launch them manually using separate config files or --generate flags.

ok, so we will need to change the sample scripts for running batch jobs to reflect this change (and I need to take a note that there will have to be a similar change on the a-prime side).
Do you think this should go in before the next tag is made? I was thinking that we could be ready for a v1.0 tag, especially if we can include the license/official public release/installable package PR in it.

If I understand right, you don't like run_analysis. I'm fine with either of these, with preference for run_task since it's shorter.

It just seemed a bit too general. I am good with run_task.

@xylar
Collaborator Author

xylar commented Oct 10, 2017

ok, so we will need to change the sample scripts for running batch jobs to reflect this change

Yes, indeed. I'll add these changes to this PR. I might still need your help on systems other than LANL and NERSC.

Do you think this should go in before the next tag is made? I was thinking that we could be ready for a v1.0 tag, especially if we can include the license/official public release/installable package PR in it.

If this process happens quickly, I'm fine with holding off on this PR until we have a v1.0 tag. But this PR is the first in a long chain of PRs that are a high priority for me and the ALCC team.

@milenaveneziani
Collaborator

I might still need your help on systems other than LANL and NERSC.

I can test and change the relevant script on titan/olcf.

@xylar
Collaborator Author

xylar commented Oct 10, 2017

@milenaveneziani and @pwolfram, I just updated the job scripts. I'll test on Edison and Wolf. @milenaveneziani, can you verify that things work on OLCF? If 12 tasks isn't a good number, please let me know what is and I'll update.

@xylar xylar force-pushed the switch_to_multiprocessing branch from 60bd1d0 to bee15b1 on October 10, 2017 at 20:01
@xylar
Collaborator Author

xylar commented Oct 10, 2017

Having trouble on Edison. I'll keep you posted. Maybe hold off on testing elsewhere for now.

@xylar xylar force-pushed the switch_to_multiprocessing branch from 4365d2d to 4130c44 on October 10, 2017 at 20:45
@xylar
Collaborator Author

xylar commented Oct 10, 2017

I fixed the issue on Edison. Everything ran just fine for the QU240 case both in serial on the login node and in parallel on the compute nodes (with maxParallelTasks=12). I'm testing the EC60to30 case now.

I squashed commits so we'll hopefully be ready to merge when the time comes.

@xylar xylar force-pushed the switch_to_multiprocessing branch from 4130c44 to 7d63a89 on October 10, 2017 at 21:06
@xylar
Collaborator Author

xylar commented Oct 10, 2017

I just rebased after the last merge.

@xylar
Collaborator Author

xylar commented Oct 10, 2017

I'm seeing the following type of error on Edison. I have seen similar errors on my laptop but hoped it was just something weird with my laptop.

running: ESMF_RegridWeightGen --source /tmp/5c5kFF.nc --destination /tmp/fEJYV6.nc --weight /scratch1/scratchdirs/xylar/analysis/EC60to30.beta1/multiprocessing/mapping/map_climatologyMapSeaIceConcNH_EC60to30v3_to_0.5x0.5degree_bilinear.nc --method bilinear --netcdf4 --no_log --src_regional --ignore_unmapped
ERROR: application called MPI_Abort(comm=0x84000000, 1) - process 0
ERROR: [unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1
ERROR: :
ERROR: system msg for write_line failure : Bad file descriptor
ERROR:
ERROR: analysis task climatologyMapSeaIceConcNH failed during run
Traceback (most recent call last):
  File "/global/u2/x/xylar/mpas_work/analysis/switch_to_multiprocessing/mpas_analysis/shared/analysis_task.py", line 251, in run
    self.run_task()
  File "/global/u2/x/xylar/mpas_work/analysis/switch_to_multiprocessing/mpas_analysis/sea_ice/climatology_map.py", line 135, in run_task
    logger=self.logger)
  File "/global/u2/x/xylar/mpas_work/analysis/switch_to_multiprocessing/mpas_analysis/shared/climatology/climatology.py", line 141, in get_remapper
    remapper.build_mapping_file(method=method, logger=logger)
  File "/global/u2/x/xylar/mpas_work/analysis/switch_to_multiprocessing/mpas_analysis/shared/interpolation/remapper.py", line 188, in build_mapping_file
    ' '.join(args))
CalledProcessError: Command 'ESMF_RegridWeightGen --source /tmp/5c5kFF.nc --destination /tmp/fEJYV6.nc --weight /scratch1/scratchdirs/xylar/analysis/EC60to30.beta1/multiprocessing/mapping/map_climatologyMapSeaIceConcNH_EC60to30v3_to_0.5x0.5degree_bilinear.nc --method bilinear --netcdf4 --no_log --src_regional --ignore_unmapped' returned non-zero exit status 1

The error seems to always be "Bad file descriptor" and always to result from a call to ESMF_RegridWeightGen. It doesn't come up every time but often enough to be problematic. I'll try to come up with a workaround.

@xylar
Collaborator Author

xylar commented Oct 11, 2017

I have a fix for the problem above that doesn't seem too bad. I am creating remappers (and therefore mapping files) during the setup_and_check phase rather than the run_task phase. This means that the mapping files are generated in serial, which seems to work fine.
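
In outline, the workaround looks something like the sketch below (class and attribute names are illustrative, not the actual diff): the fragile external call happens once, in serial, while tasks are being set up in the main process, and the parallel phase only reads the resulting file.

```python
# Illustrative sketch of the workaround; class and attribute names are made up.
import os
import subprocess


class RemapTask(object):  # stands in for an AnalysisTask subclass

    def __init__(self, mappingFileName, regridCommand):
        self.mappingFileName = mappingFileName
        self.regridCommand = regridCommand  # e.g. an ESMF_RegridWeightGen call

    def setup_and_check(self):
        # called in the main process, in serial, before any worker is launched
        if not os.path.exists(self.mappingFileName):
            subprocess.check_call(self.regridCommand)

    def run_task(self):
        # runs in a forked worker process; no external MPI-based tool is
        # launched here, the mapping file is simply read and applied
        assert os.path.exists(self.mappingFileName)
```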

Various tests that failed before this change now work on my laptop. The EC60to30 test on Edison also ran fine in batch mode with this change.

I'll re-test on Wolf next.

@xylar
Collaborator Author

xylar commented Oct 11, 2017

After explicitly setting the number of OpenMP threads to 1, I was able to run the EC60to30 test case successfully on Wolf. I see warnings from NCO that it has to run with 1 thread instead of 2 (as I also saw on NERSC), but things work correctly. I'm not quite sure what the deal is there, but things go terribly wrong without this setting (and presumably, therefore, with the default of 2 OpenMP threads).
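
The actual change is an environment setting in the job scripts; the equivalent effect from Python (shown only for illustration) is to pin the OpenMP thread count before any NCO commands are launched:

```python
# Illustration only: the real fix is an environment setting in the job
# scripts, pinning OpenMP to a single thread before NCO tools are invoked.
import os

os.environ['OMP_NUM_THREADS'] = '1'
```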

@xylar
Collaborator Author

xylar commented Oct 11, 2017

@milenaveneziani, feel free to test on Titan. You may also need to add a line to the job script setting the number of OpenMP threads to 1. I'm not sure how Titan works.

@kevans32

On OLCF, I recommend you target Rhea first; the NCO and other packages are kept more up to date there since it's the analysis machine.

@milenaveneziani
Collaborator

Hi @kevans32: the problem with Rhea is that I cannot get batch jobs to end cleanly, because of the final cp of png files to the html location. For this reason, in the last a-prime commit, we decided to support batch jobs on Titan only (at least for now).

@xylar
Collaborator Author

xylar commented Oct 13, 2017

@milenaveneziani, I removed the 'ERROR: ' at the beginning of error messages in the log file. I re-ran and now see:

  Make ice concentration plots...
/home/xylar/miniconda2/envs/mpas_analysis/lib/python2.7/site-packages/numpy/ma/core.py:2344: RuntimeWarning: invalid value encountered in less_equal
  mabs(xnew - value), atol + rtol * mabs(value))

without the 'ERROR:' prefix.

@xylar
Collaborator Author

xylar commented Oct 13, 2017

Regarding the segfault, I can't really help much until I have access to OLCF. Could you post the full log file (via gist.github.com)? It should tell you where things were run (i.e. out of @pwolfram's acme-unified vs. @czender's directory). Other than that, I guess I would try running the command that caused the segfault on its own and see if you can figure out what might be going wrong. If we can at least reproduce the error outside of MPAS-Analysis, we might be able to get some help from @czender.

@milenaveneziani
Collaborator

milenaveneziani commented Oct 13, 2017

Let me poke around with this error on Titan. It is strange that I get different errors every time.
I believe I am using the NCO from @pwolfram's acme-conda environment.

@milenaveneziani
Collaborator

Hmm, this is weird. I re-ran with --purge (but I had done the same thing the first time around), and this time all the tasks completed successfully. Any ideas?

@xylar
Collaborator Author

xylar commented Oct 13, 2017

No, I really don't have any idea if it's not reproducible. I'd try running a few more times to make sure it doesn't happen again.

@milenaveneziani
Collaborator

I have run a few more times (with and without purging first) and all went well.

@xylar
Collaborator Author

xylar commented Oct 13, 2017

Okay, that sounds good to me.

@pwolfram, do you have time to look at this anytime soon? Since there are a lot of fairly high-level python changes (notably to the base class AnalysisTask), we would really benefit from your input. But if you don't have time, that would also be important to know so we can move on.

@pwolfram
Contributor

@xylar, thanks for the ping on this. I took a quick look and nothing major stood out. I could potentially review this late next week, maybe on Tuesday or Wednesday at the earliest but likely Thursday or Friday or even the next week. I won't know for sure until early Tuesday.

What is your needed time scale for a review?

@xylar
Collaborator Author

xylar commented Oct 13, 2017

@pwolfram, late next week would be fine. If you find you don't have time next week, could you let us know?

@pwolfram
Contributor

Sure @xylar

@xylar
Collaborator Author

xylar commented Oct 19, 2017

Note: I have seen issues with processes freezing up during calls to open_multifile_dataset in branches where I introduce more subtasks. @pwolfram and I agree that this is likely due to the dask package not "playing nice" with multiprocessing. Because we will be moving to using NCO within pre-processing tasks to compute climatologies and extract time series from multiple files, we do not foresee needing dask in the future, so these freeze-ups will be addressed soon by removing all calls to open_multifile_dataset and instead opening single NetCDF files that have been pre-processed with NCO.

I have not seen this kind of lock-up behavior with this particular branch so I think it is safe to merge these changes first and work toward removing dask in a future PR.
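
A rough sketch of the planned direction (file names are hypothetical and the choice of NCO operator is an assumption, not part of this PR): pre-process the many monthly files with an NCO tool, then open the single result, so the dask-backed multi-file reader is never needed inside a worker process.

```python
# Rough sketch of the planned direction; file names are hypothetical and the
# exact NCO operator (ncrcat/ncra/ncclimo) would depend on the task.
import subprocess
import xarray

inputFiles = ['timeSeries.0001-01.nc', 'timeSeries.0001-02.nc']  # hypothetical
combinedFile = 'timeSeries_combined.nc'

# concatenate the records into a single file as a pre-processing step
subprocess.check_call(['ncrcat', '-O'] + inputFiles + [combinedFile])

# a plain single-file open; no dask is involved unless chunking is requested
ds = xarray.open_dataset(combinedFile)
```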

Contributor

@pwolfram pwolfram left a comment


@xylar, to be fair, there are a lot of moving pieces here that have changed, and it would take a considerable amount of time to do a proper review. I'm not sure I can give you a thorough review given my time constraints, although I have done the best I can at this point. I certainly won't stand in the way of the PR.

@pwolfram
Contributor

@xylar, I should also note that at this point my code is essentially "legacy-esque", as I have contributed 6,400 lines/changes to the repo and you have committed 62,600, which is essentially an order of magnitude more. I would, at this point, benefit from some type of developer tutorial for the system, as I'm sure others would too.

@xylar
Collaborator Author

xylar commented Oct 20, 2017

@pwolfram, a developer's guide is a good idea. For now, what we have is the analysis task template but a more thorough guide would be useful. I will put it on my list.

@xylar
Collaborator Author

xylar commented Oct 20, 2017

And thank you for taking the time to review this PR!

@milenaveneziani
Collaborator

@xylar: do you think I should do more testing on this or are we ready to merge?

@xylar xylar force-pushed the switch_to_multiprocessing branch from cb887c1 to e4059a3 on October 30, 2017 at 17:42
@xylar
Collaborator Author

xylar commented Nov 7, 2017

Testing

I've tested one run each (both batch mode and on the login node) on:

  • Edison
  • Anvil
  • Wolf
  • Grizzly
  • my Ubuntu laptop (both serial and parallel)

The runs are all based on the config files and job scripts provided in the configs/ folder.

All ran successfully to completion and created the HTML, though I haven't carefully looked at the output in all cases.

@xylar
Collaborator Author

xylar commented Nov 7, 2017

@vanroekel, would you have time to run a quick test on Anvil or Theta or wherever is convenient for you? If you're happy with it, I would like to merge later today if possible.

@vanroekel
Collaborator

@xylar sure, I'll try it out on Theta and try to get it done before the end of the day.

@xylar xylar requested a review from vanroekel November 8, 2017 02:11
# non-public attributes related to multiprocessing and logging
self.daemon = True
self._runStatus = Value('i', AnalysisTask.UNSET)
self._staskTrace = None
Collaborator


should this be stackTrace?

Collaborator Author


Sure should be. Fortunately (or unfortunately?), the code never accesses the _stackTrace attribute unless the task has failed and given it a valid value. I'll fix this right away.
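
For anyone reading along, the reason _runStatus is a multiprocessing.Value rather than a plain attribute is that ordinary attributes set in the child process after launch are not visible to the parent, while a shared Value is. A toy illustration (not the PR code):

```python
# Toy illustration (not the PR code) of sharing a run status via Value.
import multiprocessing
from multiprocessing import Value


class DemoTask(multiprocessing.Process):
    UNSET, SUCCESS = -1, 0  # illustrative status codes

    def __init__(self):
        super(DemoTask, self).__init__()
        self._runStatus = Value('i', DemoTask.UNSET)
        self.plainStatus = DemoTask.UNSET

    def run(self):
        # runs in the child process
        self._runStatus.value = DemoTask.SUCCESS  # visible to the parent
        self.plainStatus = DemoTask.SUCCESS       # NOT visible to the parent


if __name__ == '__main__':
    task = DemoTask()
    task.start()
    task.join()
    print(task._runStatus.value)  # 0 (SUCCESS)
    print(task.plainStatus)       # still -1 (UNSET) in the parent
```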

@vanroekel
Collaborator

@xylar sorry for the delay, I'm testing this right now.

In order to support better data availability between tasks (e.g.
if tasks depend on data from the setup of other tasks), tasks are
now subclasses of multiprocess.Process.  This work is needed as
a precursor to adding support for subtasks (e.g. for producing
climatologies, or for generating each plot from a task).

Logging is also altered so that each task has a logger, to which
info, errors and warnings should be logged.

Because multiprocess.Process already uses a method called run(),
the run() method of AnalysisTask and its derived classes had to
be renamed to run_task().

Job scripts have been updated to run on 1 node. The default
ncclimo mode is background (12 threads).

The path to @czender's NCO installation has been removed from
the edison job script.  It is no longer needed for the latest
NCO and MPAS-Analysis, which includes a flag to ignore
@czender's path on LCF machines.

The LANL job script has been updated to have 1 OpenMP thread
instead of 2.  This is needed (as on NERSC) to prevent NCO
problems when running with the default 2 threads. (Not sure why.)
@xylar xylar force-pushed the switch_to_multiprocessing branch from e4059a3 to 16081b8 on November 8, 2017 at 22:37
@vanroekel
Collaborator

vanroekel commented Nov 13, 2017

@xylar, I was able to reproduce plots visually identical to those from develop with both high- and low-resolution data. Thanks for your patience with testing on this PR.

@xylar
Collaborator Author

xylar commented Nov 13, 2017

Thanks @vanroekel.

@xylar xylar merged commit ea20b2a into MPAS-Dev:develop Nov 13, 2017
@xylar xylar deleted the switch_to_multiprocessing branch November 13, 2017 19:43