Conversation

@xylar
Collaborator

@xylar xylar commented Oct 8, 2017

In order to support better data availability between tasks (e.g.
if tasks depend on data from the setup of other tasks), tasks are
now subclasses of multiprocess.Process. This work is needed to
support subtasks (e.g. for producing climatologies, or for generating
each plot from a task).

Logging is also altered so that each task has a logger, to which
info, errors and warnings should be logged. Print statements will
automatically go to the logger as "info" while exceptions and
warnings raised through the warning module will be logged as
errors.

Because multiprocess.Process already uses a method called run(),
the run() method of AnalysisTask and its derived classes had to
be renamed to run_analysis().
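
For illustration, here is a minimal sketch of the pattern described above, assuming the standard-library multiprocessing module; only the names AnalysisTask and run_analysis() come from this PR, everything else is made up:

```python
# Minimal sketch, assuming the standard-library multiprocessing module;
# only AnalysisTask and run_analysis() are names from this PR.
import logging
import multiprocessing


class AnalysisTask(multiprocessing.Process):
    """Each analysis task runs as its own process with its own logger."""

    def __init__(self, taskName):
        super(AnalysisTask, self).__init__(name=taskName)
        self.taskName = taskName
        self.logger = logging.getLogger(taskName)

    def run(self):
        # multiprocessing.Process reserves run() as the process entry point,
        # so the analysis itself lives in a separately named method
        try:
            self.run_analysis()
        except Exception:
            self.logger.exception('analysis task %s failed during run',
                                  self.taskName)

    def run_analysis(self):
        # overridden by each concrete analysis task
        raise NotImplementedError('subclasses must implement run_analysis()')
```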

@xylar
Collaborator Author

xylar commented Oct 8, 2017

This PR should not be merged until #257 has been merged, as this branch contains that commit.

@xylar
Collaborator Author

xylar commented Oct 8, 2017

@pwolfram and @milenaveneziani, I have made some fairly significant changes to how tasks are handled in parallel. When I was trying to set up tasks that depended on one another (either as prerequisites or as subtasks), I ran into trouble. When tasks are launched as separate processes with subprocess, as I was doing before, the subprocess forgets everything that has been done in the "master" process and starts over from the beginning. Not all tasks were getting set up in the subprocess, so a lot of data that was needed wasn't available.

A solution was to use the multiprocess module, which launches new processes that are copies of the current process up to the point when the new process was launched. That means that all tasks have access to whatever information is in other tasks up to (but not including) the point where we call the run() method (now renamed run_analysis()). So one task can find out what input files another task used or which files it plans to write out.
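
As a toy illustration of that point (assuming fork semantics, the default on Linux, and entirely made-up task names), attributes set during setup in the main process are still visible once the processes are launched:

```python
# Toy illustration only: setup happens in the main process, so each launched
# process carries a copy of every task's setup results (assumes fork semantics).
import multiprocessing


class Task(multiprocessing.Process):
    def __init__(self, name, prerequisite=None):
        super(Task, self).__init__(name=name)
        self.prerequisite = prerequisite
        self.inputFiles = []

    def setup(self):
        # runs in the main process before any process is started
        self.inputFiles = ['{}_input.nc'.format(self.name)]

    def run(self):
        if self.prerequisite is not None:
            # the prerequisite's setup results were copied into this process
            print('{} sees {}'.format(self.name, self.prerequisite.inputFiles))


if __name__ == '__main__':
    climatology = Task('climatology')
    plots = Task('plots', prerequisite=climatology)
    for task in (climatology, plots):
        task.setup()              # all setup before anything is launched
    for task in (climatology, plots):
        task.start()
    for task in (climatology, plots):
        task.join()
```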

You can feel free to dig into the code if you like. Or you can trust that I have a good reason for making the changes I did and just run some tests to make sure I didn't break anything. Whichever you're up for.

One thing you'll notice is that I am now using a logger in a lot of places where we were using print statements before. I think it would be a good idea to move to the logger because it has several levels it can operate at (debug, info, warning and error), but you can feel free to use print as before (which will be logged as "info") and warnings.warn (which will, however, be logged as an error because it goes to stderr).
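
In case it helps, one common way to get that behavior (a sketch of the general approach, not the code in this PR; the task name is hypothetical) is to swap sys.stdout and sys.stderr for thin file-like wrappers around the task's logger:

```python
# Sketch of the general approach, not the code in this PR: print() output
# becomes INFO, while anything written to stderr (warnings, tracebacks)
# becomes ERROR.
import logging
import sys


class StreamToLogger(object):
    """File-like object that forwards writes to a logger at a fixed level."""

    def __init__(self, logger, level):
        self.logger = logger
        self.level = level

    def write(self, message):
        message = message.rstrip()
        if message:
            self.logger.log(self.level, message)

    def flush(self):
        pass


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('someAnalysisTask')  # hypothetical task name

sys.stdout = StreamToLogger(logger, logging.INFO)   # print() -> "info"
sys.stderr = StreamToLogger(logger, logging.ERROR)  # warnings.warn() -> "error"
```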

Please let me know if you have concerns about these changes, as I plan to build a lot of future work on this infrastructure.

@xylar
Collaborator Author

xylar commented Oct 8, 2017

Okay, @milenaveneziani and @pwolfram, this branch is ready to be reviewed when you have time.

@milenaveneziani
Collaborator

Funny that @pwolfram and I split our attention between this PR and #258 without planning it.
A couple of quick questions from me on this one:

  • I don't suppose anything changes as far as parallel tasks are executed (one task per node) with this one, right?
  • how about changing run to run_analysis_task or run_task?

And about logger: I learned about it last week at the python class. I think it's going to be quite useful for mpas-analysis.

@xylar
Collaborator Author

xylar commented Oct 10, 2017

@milenaveneziani

I don't suppose anything changes as far as parallel tasks are executed (one task per node) with this one, right?

No, that is no longer supported. Instead, it is one node, with all tasks running on that node. If you want to run different tasks on different nodes, you would need to launch them manually using separate config files or --generate flags.

how about changing run to run_analysis_task or run_task

If I understand right, you don't like run_analysis. I'm fine with either of these, with preference for run_task since it's shorter.

And about logger: I learned about it last week at the python class. I think it's going to be quite useful for mpas-analysis.

I'm glad you think it will be useful. I've found it to be kind of a mixed bag, but once I figure out how to redirect print statements, warnings and exceptions to the log, I think it will be quite useful.

@milenaveneziani
Collaborator

No, that is no longer supported. Instead, it is one node, with all tasks running on that node. If you want to run different tasks on different nodes, you would need to launch them manually using separate config files or --generate flags.

ok, so we will need to change the sample scripts for running batch jobs to reflect this change (and I need to take a note that there will have to be a similar change on the a-prime side).
Do you think this should go in before the next tag is made? I was thinking that we could be ready for a v1.0 tag, especially if we can include the license/official public release/installable package PR in it.

If I understand right, you don't like run_analysis. I'm fine with either of these, with preference for run_task since it's shorter.

It just seemed a bit too general. I am good with run_task.

@xylar
Collaborator Author

xylar commented Oct 10, 2017

ok, so we will need to change the sample scripts for running batch jobs to reflect this change

Yes, indeed. I'll add these changes to this PR. I might still need your help on systems other than LANL and NERSC.

Do you think this should go in before the next tag is made? I was thinking that we could be ready for a v1.0 tag, especially if we can include the license/official public release/installable package PR in it.

If this process happens quickly, I'm fine with holding off on this PR until we have a v1.0 tag. But this PR is the first in a long chain of PRs that are a high priority for me and the ALCC team.

@milenaveneziani
Collaborator

I might still need your help on systems other than LANL and NERSC.

I can test and change the relevant script on titan/olcf.

@xylar
Collaborator Author

xylar commented Oct 10, 2017

@milenaveneziani and @pwolfram, I just updated the job scripts. I'll test on Edison and Wolf. @milenaveneziani, can you verify that things work on OLCF? If 12 tasks isn't a good number, please let me know what is and I'll update.

@xylar xylar force-pushed the switch_to_multiprocessing branch from 60bd1d0 to bee15b1 on October 10, 2017 at 20:01
@xylar
Collaborator Author

xylar commented Oct 10, 2017

Having trouble on Edison. I'll keep you posted. Maybe hold off on testing elsewhere for now.

@xylar xylar force-pushed the switch_to_multiprocessing branch from 4365d2d to 4130c44 on October 10, 2017 at 20:45
@xylar
Collaborator Author

xylar commented Oct 10, 2017

I fixed the issue on Edison. Everything ran just fine for the QU240 case both in serial on the login node and in parallel on the compute nodes (with maxParallelTasks=12). I'm testing the EC60to30 case now.

I squashed commits so we'll hopefully be ready to merge when the time comes.

@xylar xylar force-pushed the switch_to_multiprocessing branch from 4130c44 to 7d63a89 on October 10, 2017 at 21:06
@xylar
Collaborator Author

xylar commented Oct 10, 2017

I just rebased after the last merge.

@xylar
Collaborator Author

xylar commented Oct 10, 2017

I'm seeing the following type of error on Edison. I have seen similar errors on my laptop but hoped it was just something weird with my laptop.

running: ESMF_RegridWeightGen --source /tmp/5c5kFF.nc --destination /tmp/fEJYV6.nc --weight /scratch1/scratchdirs/xylar/analysis/EC60to30.beta1/multiprocessing/mapping/map_climatologyMapSeaIceConcNH_EC60to30v3_to_0.5x0.5degree_bilinear.nc --method bilinear --netcdf4 --no_log --src_regional --ignore_unmapped
ERROR: application called MPI_Abort(comm=0x84000000, 1) - process 0
ERROR: [unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1
ERROR: :
ERROR: system msg for write_line failure : Bad file descriptor
ERROR:
ERROR: analysis task climatologyMapSeaIceConcNH failed during run
Traceback (most recent call last):
  File "/global/u2/x/xylar/mpas_work/analysis/switch_to_multiprocessing/mpas_analysis/shared/analysis_task.py", line 251, in run
    self.run_task()
  File "/global/u2/x/xylar/mpas_work/analysis/switch_to_multiprocessing/mpas_analysis/sea_ice/climatology_map.py", line 135, in run_task
    logger=self.logger)
  File "/global/u2/x/xylar/mpas_work/analysis/switch_to_multiprocessing/mpas_analysis/shared/climatology/climatology.py", line 141, in get_remapper
    remapper.build_mapping_file(method=method, logger=logger)
  File "/global/u2/x/xylar/mpas_work/analysis/switch_to_multiprocessing/mpas_analysis/shared/interpolation/remapper.py", line 188, in build_mapping_file
    ' '.join(args))
CalledProcessError: Command 'ESMF_RegridWeightGen --source /tmp/5c5kFF.nc --destination /tmp/fEJYV6.nc --weight /scratch1/scratchdirs/xylar/analysis/EC60to30.beta1/multiprocessing/mapping/map_climatologyMapSeaIceConcNH_EC60to30v3_to_0.5x0.5degree_bilinear.nc --method bilinear --netcdf4 --no_log --src_regional --ignore_unmapped' returned non-zero exit status 1

The error seems to always be "Bad file descriptor" and always to result from a call to ESMF_RegridWeightGen. It doesn't come up every time but often enough to be problematic. I'll try to come up with a workaround.

@xylar
Collaborator Author

xylar commented Oct 11, 2017

I have a fix for the problem above that doesn't seem too bad. I am creating remappers (and therefore mapping files) during the setup_and_check phase rather than the run_task phase. This means that the mapping files are generated in serial, which seems to work fine.
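
In outline, the workaround looks something like the sketch below (class and attribute names are illustrative, not the actual diff): the fragile external call happens once, in serial, while tasks are being set up in the main process, and the parallel phase only reads the resulting file.

```python
# Illustrative sketch of the workaround; class and attribute names are made up.
import os
import subprocess


class RemapTask(object):  # stands in for an AnalysisTask subclass

    def __init__(self, mappingFileName, regridCommand):
        self.mappingFileName = mappingFileName
        self.regridCommand = regridCommand  # e.g. an ESMF_RegridWeightGen call

    def setup_and_check(self):
        # called in the main process, in serial, before any worker is launched
        if not os.path.exists(self.mappingFileName):
            subprocess.check_call(self.regridCommand)

    def run_task(self):
        # runs in a forked worker process; no external MPI-based tool is
        # launched here, the mapping file is simply read and applied
        assert os.path.exists(self.mappingFileName)
```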

Various tests that failed before this change now work on my laptop. The EC60to30 test on Edison also ran fine in batch mode with this change.

I'll re-test on Wolf next.

@xylar
Collaborator Author

xylar commented Oct 11, 2017

After explicitly setting the number of OpenMP threads to 1, I was able to run the EC60to30 test case successfully on Wolf. I see warnings from NCO that it has to run with 1 thread instead of 2 (as I also saw on NERSC), but things work correctly. I'm not quite sure what the deal is there, but things go terribly wrong without this setting (and presumably, therefore, with the default of 2 OpenMP threads).
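
The actual change is an environment setting in the job scripts; the equivalent effect from Python (shown only for illustration) is to pin the OpenMP thread count before any NCO commands are launched:

```python
# Illustration only: the real fix is an environment setting in the job
# scripts, pinning OpenMP to a single thread before NCO tools are invoked.
import os

os.environ['OMP_NUM_THREADS'] = '1'
```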

@xylar
Collaborator Author

xylar commented Oct 11, 2017

@milenaveneziani, feel free to test on Titan. You may also need to add a line to the job script setting the number of OpenMP threads to 1. I'm not sure how Titan works.

@kevans32

On OLCF, I recommend you target Rhea first; the NCO and other packages are kept more up to date there since it's the analysis machine.

@milenaveneziani
Collaborator

Hi @kevans32: the problem with Rhea is that I cannot get batch jobs to end cleanly, because of the final cp of png files to the html location. For this reason, in the last a-prime commit, we decided to support batch jobs on Titan only (at least for now).

@xylar
Collaborator Author

xylar commented Oct 13, 2017

@milenaveneziani, I removed the 'ERROR: ' at the beginning of error messages in the log file. I re-ran and now see:

  Make ice concentration plots...
/home/xylar/miniconda2/envs/mpas_analysis/lib/python2.7/site-packages/numpy/ma/core.py:2344: RuntimeWarning: invalid value encountered in less_equal
  mabs(xnew - value), atol + rtol * mabs(value))

without the 'ERROR:' prefix.

@xylar
Collaborator Author

xylar commented Oct 13, 2017

Regarding the segfault, I can't really help much until I have access to OLCF. Could you post the full log file (via gist.github.com)? It should tell you where things were run (i.e. out of @pwolfram's acme-unified vs. @czender's directory). Other than that, I guess I would try running the command that caused the segfault on its own and see if you can figure out what might be going wrong. If we can at least reproduce the error outside of MPAS-Analysis, we might be able to get some help from @czender.

@milenaveneziani
Collaborator

milenaveneziani commented Oct 13, 2017

Let me poke around with this error on Titan. It is strange that I get different errors every time.
I believe I am using the NCO from @pwolfram's acme-conda environment.

@milenaveneziani
Collaborator

Hmm, this is weird. I re-ran with --purge (but I had done the same thing the first time around), and this time all the tasks completed successfully. Any ideas?

@xylar
Collaborator Author

xylar commented Oct 13, 2017

No, I really don't have any idea if it's not reproducible. I'd try running a few more times to make sure it doesn't happen again.

@milenaveneziani
Collaborator

I have run a few more times (with and without purging first) and all went well.

@xylar
Collaborator Author

xylar commented Oct 13, 2017

Okay, that sounds good to me.

@pwolfram, do you have time to look at this anytime soon? Since there are a lot of fairly high-level python changes (notably to the base class AnalysisTask), we would really benefit from your input. But if you don't have time, that would also be important to know so we can move on.

@pwolfram
Contributor

@xylar, thanks for the ping on this. I took a quick look and nothing major stood out. I could potentially review this late next week, maybe on Tuesday or Wednesday at the earliest but likely Thursday or Friday or even the next week. I won't know for sure until early Tuesday.

What is your needed time scale for a review?

@xylar
Collaborator Author

xylar commented Oct 13, 2017

@pwolfram, late next week would be fine. If you find you don't have time next week, could you let us know?

@pwolfram
Contributor

Sure @xylar

@xylar
Collaborator Author

xylar commented Oct 19, 2017

Note: I have seen issues with processes freezing up during calls to open_multifile_dataset in branches where I introduce more subtasks. @pwolfram and I agree that this is likely due to the dask package not "playing nice" with multiprocessing. Because we will be moving to using NCO within pre-processing tasks to compute climatologies and extract time series from multiple files, we do not foresee needing dask in the future, so these freeze-ups will be addressed soon by removing all calls to open_multifile_dataset and instead opening single NetCDF files that have been pre-processed with NCO.

I have not seen this kind of lock-up behavior with this particular branch so I think it is safe to merge these changes first and work toward removing dask in a future PR.
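
A rough sketch of the planned direction (file names are hypothetical and the choice of NCO operator is an assumption, not part of this PR): pre-process the many monthly files with an NCO tool, then open the single result, so the dask-backed multi-file reader is never needed inside a worker process.

```python
# Rough sketch of the planned direction; file names are hypothetical and the
# exact NCO operator (ncrcat/ncra/ncclimo) would depend on the task.
import subprocess
import xarray

inputFiles = ['timeSeries.0001-01.nc', 'timeSeries.0001-02.nc']  # hypothetical
combinedFile = 'timeSeries_combined.nc'

# concatenate the records into a single file as a pre-processing step
subprocess.check_call(['ncrcat', '-O'] + inputFiles + [combinedFile])

# a plain single-file open; no dask is involved unless chunking is requested
ds = xarray.open_dataset(combinedFile)
```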

Contributor

@pwolfram pwolfram left a comment


@xylar, to be fair, there are a lot of moving pieces here that have changed, and it would take a considerable amount of time to do a proper review. I'm not sure I can give you a thorough review given my time constraints, although I have done the best I can at this point. I certainly won't stand in the way of the PR.

@pwolfram
Contributor

@xylar, I should also note that at this point my code is essentially "legacy-esque", as I have contributed 6,400 lines/changes to the repo and you have committed 62,600, which is essentially an order of magnitude more. I would, at this point, benefit from some type of developer tutorial for the system, as I'm sure others would too.

@xylar
Collaborator Author

xylar commented Oct 20, 2017

@pwolfram, a developer's guide is a good idea. For now, what we have is the analysis task template but a more thorough guide would be useful. I will put it on my list.

@xylar
Collaborator Author

xylar commented Oct 20, 2017

And thank you for taking the time to review this PR!

@milenaveneziani
Collaborator

@xylar: do you think I should do more testing on this or are we ready to merge?

@xylar xylar force-pushed the switch_to_multiprocessing branch from cb887c1 to e4059a3 on October 30, 2017 at 17:42
@xylar
Collaborator Author

xylar commented Nov 7, 2017

Testing

I've tested one run each (both batch mode and on the login node) on:

  • Edison
  • Anvil
  • Wolf
  • Grizzly
  • my Ubuntu laptop (both serial and parallel)

The runs are all based on the config files and job scripts provided in the configs/ folder.

All ran successfully to completion and created the HTML, though I haven't carefully looked at the output in all cases.

@xylar
Collaborator Author

xylar commented Nov 7, 2017

@vanroekel, would you have time to run a quick test on Anvil or Theta or wherever is convenient for you? If you're happy with it, I would like to merge later today if possible.

@vanroekel
Collaborator

@xylar sure, I'll try it out on Theta and try to get it done before the end of the day.

@xylar xylar requested a review from vanroekel November 8, 2017 02:11
# non-public attributes related to multiprocessing and logging
self.daemon = True
self._runStatus = Value('i', AnalysisTask.UNSET)
self._staskTrace = None
Collaborator


should this be stackTrace?

Collaborator Author


Sure should be. Fortunately (or unfortunately?), the code never accesses the _stackTrace attribute unless the task has failed and given it a valid value. I'll fix this right away.
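
For anyone reading along, the reason _runStatus is a multiprocessing.Value rather than a plain attribute is that ordinary attributes set in the child process after launch are not visible to the parent, while a shared Value is. A toy illustration (not the PR code):

```python
# Toy illustration (not the PR code) of sharing a run status via Value.
import multiprocessing
from multiprocessing import Value


class DemoTask(multiprocessing.Process):
    UNSET, SUCCESS = -1, 0  # illustrative status codes

    def __init__(self):
        super(DemoTask, self).__init__()
        self._runStatus = Value('i', DemoTask.UNSET)
        self.plainStatus = DemoTask.UNSET

    def run(self):
        # runs in the child process
        self._runStatus.value = DemoTask.SUCCESS  # visible to the parent
        self.plainStatus = DemoTask.SUCCESS       # NOT visible to the parent


if __name__ == '__main__':
    task = DemoTask()
    task.start()
    task.join()
    print(task._runStatus.value)  # 0 (SUCCESS)
    print(task.plainStatus)       # still -1 (UNSET) in the parent
```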

@vanroekel
Collaborator

@xylar sorry for the delay, I'm testing this right now.

In order to support better data availability between tasks (e.g.
if tasks depend on data from the setup of other tasks), tasks are
now subclasses of multiprocess.Process.  This work is needed as
a precursor to adding support for subtasks (e.g. for producing
climatologies, or for generating each plot from a task).

Logging is also altered so that each task has a logger, to which
info, errors and warnings should be logged.

Because multiprocess.Process already uses a method called run(),
the run() method of AnalysisTask and its derived classes had to
be renamed to run_task().

Job scripts have been updated to run on 1 node. The default
ncclimo mode is background (12 threads).

The path to @czender's NCO installation has been removed from
the edison job script.  It is no longer needed for the latest
NCO and MPAS-Analysis, which includes a flag to ignore
@czender's path on LCF machines.

The LANL job script has been updated to have 1 OpenMP thread
instead of 2.  This is needed (as on NERSC) to prevent NCO
problems when running with the default 2 threads. (Not sure why.)
@xylar xylar force-pushed the switch_to_multiprocessing branch from e4059a3 to 16081b8 on November 8, 2017 at 22:37
@vanroekel
Collaborator

vanroekel commented Nov 13, 2017

@xylar, I was able to reproduce plots visually identical to those from develop with both high- and low-resolution data. Thanks for your patience with testing on this PR.

@xylar
Collaborator Author

xylar commented Nov 13, 2017

Thanks @vanroekel.

@xylar xylar merged commit ea20b2a into MPAS-Dev:develop Nov 13, 2017
@xylar xylar deleted the switch_to_multiprocessing branch November 13, 2017 19:43