Show pickling issues in notebook on windows #3991

aseyboldt · 2020-07-02T14:24:28Z

There have been issues about parallel sampling on Windows and Mac, because there the default multiprocessing backend is spawn.
Error messages end up on the console, and sometimes there seems to be a deadlock that I haven't been able to find.
This should work around those two issues: Instead of relying on multiprocessing for pickling we can simply do it ourselves, and pass the pickled objects to the remote process. If unpickling fails we send a normal error message back, that we can show in notebooks.
Also, two more arguments to pm.sample:

mp_ctx: for explicit control over the multiprocessing context: fork, spawn or forkserver
pickle_backend: one of pickle or dill (an alternative to pickle that claims that it can pickle some additional objects).

CC @lucianopaz I think that is the issue you were referring to earlier? It would be great if someone could test this on windows a bit.

An example of a model that doesn't pickle:

import theano
import theano.tensor as tt
import numpy as np
import multiprocessing
import pymc3 as pm

@theano.as_op([tt.dvector], [tt.dvector])
def somefunc(a):
    return 2 * np.array(a)

with pm.Model() as model:
    x = pm.Normal('x', shape=2)
    pm.Normal('y', mu=somefunc(x), shape=2)
    #pm.Normal('y', mu=2, shape=2)

with model:
    step = pm.Metropolis()
    trace = pm.sample(step=step, mp_ctx='spawn')

pymc3/tests/test_parallel_sampling.py

AlexAndorra

Thanks a lot @aseyboldt, this looks very helpful! I'll wait for @lucianopaz's review before merging though.
I had two questions:

Contrary to the title of the PR, it seems like this solves the pickling issue on WinOS and MacOS, doesn't it?
Since the crux of the issue seems to be the spawn default, why don't we set the context to forkserver by default on MacOS? It seems cumbersome if users have to change the context manually each time.

RELEASE-NOTES.md

aseyboldt · 2020-07-06T16:10:00Z

Contrary to the title of the PR, it seems like this solves the pickling issue on WinOS and MacOS, doesn't it?

Depends on what "the pickling issue" is. It does not solve problems where we can't sample on win because the model can't be pickled, but it does solve the issue that the error message ends up at the incorrect place or that there is a deadlock.

Since the crux of the issue seems to be the spawn default, why don't we set the context to forkserver by default on MacOS? It seems cumbersome if users have to change the context manually each time.

I now set the default context to forkserver on macos, and removed the previous forkserver context in init.py. This is a different issue than the one above, where things crash because of seaborn.

codecov · 2020-07-06T16:32:43Z

Codecov Report

Merging #3991 into master will increase coverage by 0.06%.
The diff coverage is 81.69%.

@@            Coverage Diff             @@
##           master    #3991      +/-   ##
==========================================
+ Coverage   86.56%   86.62%   +0.06%     
==========================================
  Files          88       88              
  Lines       14020    14062      +42     
==========================================
+ Hits        12136    12181      +45     
+ Misses       1884     1881       -3

Impacted Files	Coverage Δ
pymc3/__init__.py	`95.23% <ø> (+1.90%)`	⬆️
pymc3/parallel_sampling.py	`88.36% <81.15%> (+2.59%)`	⬆️
pymc3/sampling.py	`86.03% <100.00%> (+0.01%)`	⬆️

lucianopaz

This looks great @aseyboldt! I had to go through it a couple of times because I'm definitely out of my league with all the multiprocessing communication that has to go on here.

I think that I understood the basic idea:

Pickle the step method in the master process
Start worker processes that each create their own Process instance that has the communication msg_pipe set up.
Send the picked step method through the pipe and wait to hear any error.

What I don't fully understand is why the remote connection of the process adapter has to be closed immediately after spawning the worker processes? The adapter still has to receive messages through the message pipe, but doesn't it also need the remote connection to send a potential keyboard interrupt? Or to send an abortion message to every process after a single process raised an error? Anyway, I still think that this PR is great and is worth merging ASAP.

lucianopaz · 2020-07-06T21:07:43Z

The code coverage error seems to be mostly because the dill backend is never tested

aseyboldt · 2020-07-06T22:19:36Z

Maybe a small example helps to illustrate why we have to close one of the local pipes.

We could use the multiprocessing pipe in a normal single threaded program (there would be faster ways...)

import multiprocessing

# Create a pipe, the two endpoints are conn1 and conn2
conn1, conn2 = multiprocessing.Pipe()

# We can send and recieve from both
conn1.send_bytes(b'hallo')
conn2.recv_bytes()
# b'hallo'

conn2.send_bytes(b'hi')
conn1.recv_bytes()
# b'hi'

# We close one of the endpoint
conn2.close()

# Sending and recieving will fail

conn1.send_bytes(b'hallo')
# BrokenPipeError, because there is nobody we could send to

conn1.recv_bytes()
# EOFError, (EOF=end of file) because nobody could be sending any more data

Now we do something like this:

import multiprocessing

# Create a pipe, the two endpoints are conn1 and conn2
conn1, conn2 = multiprocessing.Pipe()

# Start new process and give a *copy* of conn2 to the subprocess
start_new_worker(conn2)

# The new worker errors out, it will close its copy of conn2
# Our local copy of conn2 is still open though!

# We don't know yet, that the worker is dead, and wait for the next message
conn1.recv_bytes()
# No BrokenPipeError, because conn2 is still open, and the operating system doesn't
# know that it won't be sending anything.
# Instead, we just wait for the other end (conn2) to send something. But because
# our process is the only one that could send something, and because we are currently
# busy waiting for messages, there can never be a message from conn2, so we
# wait forever.

aseyboldt · 2020-07-06T22:20:19Z

I'll add a test with dill tomorrow, and fix the conflict with master.

Co-authored-by: Junpeng Lao <[email protected]>

AlexAndorra · 2020-07-07T16:22:45Z

Thanks for your explanantion @aseyboldt, quite clear now!

I now set the default context to forkserver on macos, and removed the previous forkserver context in init.py. This is a different issue than the one above, where things crash because of seaborn.

Ah yeah, mixed up both issues, but thanks for changing this.

I think the codecov error is a false negative due to the introduction of dill? I'll let you merge if you think this is ready 😉

aseyboldt · 2020-07-07T17:37:54Z

dill is tested now, the issue is that there is some error handling code when sending a message to the workers that isn't tested. I don't really know how to trigger that case though.
I think it is fine to merge.

lucianopaz · 2020-07-07T18:44:58Z

Great work @aseyboldt! The code coverage looks fine, the project's total coverage will increase a bit, but for some reason the patch itself is less covered. I'll go ahead and merge.

* Update GP NBs to use standard notebook style (pymc-devs#3978) * update gp-latent nb to use arviz * rerun, run black * rerun after fixes from comments * rerun black * rewrite radon notebook using ArviZ and xarray (pymc-devs#3963) * rewrite radon notebook using ArviZ and xarray Roughly half notebook has been updated * add comments on xarray usage * rewrite 2n half of notebook * minor fix * rerun notebook and minor changes * rerun notebook on pymc3.9.2 and ArviZ 0.9.0 * remove unused import * add change to release notes * SMC: refactor, speed-up and run multiple chains in parallel for diagnostics (pymc-devs#3981) * first attempt to vectorize smc kernel * add ess, remove multiprocessing * run multiple chains * remove unused imports * add more info to report * minor fix * test log * fix type_num error * remove unused imports update BF notebook * update notebook with diagnostics * update notebooks * update notebook * update notebook * Honor discard_tuned_samples during KeyboardInterrupt (pymc-devs#3785) * Honor discard_tuned_samples during KeyboardInterrupt * Do not compute convergence checks without samples * Add time values as sampler stats for NUTS (pymc-devs#3986) * Add time values as sampler stats for NUTS * Use float time counters for nuts stats * Add timing sampler stats to release notes * Improve doc of time related sampler stats Co-authored-by: Alexandre ANDORRA <[email protected]> Co-authored-by: Alexandre ANDORRA <[email protected]> * Drop support for py3.6 (pymc-devs#3992) * Drop support for py3.6 * Update RELEASE-NOTES.md Co-authored-by: Colin <[email protected]> Co-authored-by: Colin <[email protected]> * Fix Mixture distribution mode computation and logp dimensions Closes pymc-devs#3994. * Add more info to divergence warnings (pymc-devs#3990) * Add more info to divergence warnings * Add dataclasses as requirement for py3.6 * Fix tests for extra divergence info * Remove py3.6 requirements * follow-up of py36 drop (pymc-devs#3998) * Revert "Drop support for py3.6 (pymc-devs#3992)" This reverts commit 1bf867e. * Update README.rst * Update setup.py * Update requirements.txt * Update requirements.txt Co-authored-by: Adrian Seyboldt <[email protected]> * Show pickling issues in notebook on windows (pymc-devs#3991) * Merge close remote connection * Manually pickle step method in multiprocess sampling * Fix tests for extra divergence info * Add test for remote process crash * Better formatting in test_parallel_sampling Co-authored-by: Junpeng Lao <[email protected]> * Use mp_ctx forkserver on MacOS * Add test for pickle with dill Co-authored-by: Junpeng Lao <[email protected]> * Fix keep_size for arviz structures. (pymc-devs#4006) * Fix posterior pred. sampling keep_size w/ arviz input. Previously posterior predictive sampling functions did not properly handle the `keep_size` keyword argument when getting an xarray Dataset as parameter. Also extended these functions to accept InferenceData object as input. * Reformatting. * Check type errors. Make errors consistent across sample_posterior_predictive and fast_sample_posterior_predictive, and add 2 tests. * Add changelog entry. Co-authored-by: Robert P. Goldman <[email protected]> * SMC-ABC add distance, refactor and update notebook (pymc-devs#3996) * update notebook * move dist functions out of simulator class * fix docstring * add warning and test for automatic selection of sort sum_stat when using wassertein and energy distances * update release notes * fix typo * add sim_data test * update and add tests * update and add tests * add docs for interpretation of length scales in periodic kernel (pymc-devs#3989) * fix the expression of periodic kernel * revert change and add doc * FIXUP: add suggested doc string * FIXUP: revertchanges in .gitignore * Fix Matplotlib type error for tests (pymc-devs#4023) * Fix for issue 4022. Check for support for `warn` argument in `matplotlib.use()` call. Drop it if it causes an error. * Alternative fix. * Switch from pm.DensityDist to pm.Potential to describe the likelihood in MLDA notebooks and script examples. This is done because of the bug described in arviz-devs/arviz#1279. The commit also changes a few parameters in the MLDA .py example to match the ones in the equivalent notebook. * Remove Dirichlet distribution type restrictions (pymc-devs#4000) * Remove Dirichlet distribution type restrictions Closes pymc-devs#3999. * Add missing Dirichlet shape parameters to tests * Remove Dirichlet positive concentration parameter constructor tests This test can't be performed in the constructor if we're allowing Theano-type distribution parameters. * Add a hack to statically infer Dirichlet argument shapes Co-authored-by: Brandon T. Willard <[email protected]> Co-authored-by: Bill Engels <[email protected]> Co-authored-by: Oriol Abril-Pla <[email protected]> Co-authored-by: Osvaldo Martin <[email protected]> Co-authored-by: Adrian Seyboldt <[email protected]> Co-authored-by: Alexandre ANDORRA <[email protected]> Co-authored-by: Colin <[email protected]> Co-authored-by: Brandon T. Willard <[email protected]> Co-authored-by: Junpeng Lao <[email protected]> Co-authored-by: rpgoldman <[email protected]> Co-authored-by: Robert P. Goldman <[email protected]> Co-authored-by: Tirth Patel <[email protected]> Co-authored-by: Brandon T. Willard <[email protected]>

aseyboldt force-pushed the better-pickle branch from 4649cb4 to 8eba95e Compare July 2, 2020 16:34

aseyboldt changed the title ~~Better pickle~~ Show pickling issues in notebook on windows Jul 2, 2020

aseyboldt force-pushed the better-pickle branch from 8eba95e to ea71951 Compare July 2, 2020 17:59

aseyboldt mentioned this pull request Jul 3, 2020

Close main process copy of pipe when sampling in parallel #3988

Closed

aseyboldt force-pushed the better-pickle branch from 0a763be to aad62eb Compare July 3, 2020 08:08

junpenglao reviewed Jul 3, 2020

View reviewed changes

pymc3/tests/test_parallel_sampling.py Outdated Show resolved Hide resolved

aseyboldt marked this pull request as ready for review July 3, 2020 09:33

aseyboldt mentioned this pull request Jul 5, 2020

Remove pip from scripts and use only conda #3993

Merged

AlexAndorra requested changes Jul 6, 2020

View reviewed changes

RELEASE-NOTES.md Outdated Show resolved Hide resolved

RELEASE-NOTES.md Outdated Show resolved Hide resolved

aseyboldt force-pushed the better-pickle branch from 5d408c0 to 8f14987 Compare July 6, 2020 16:00

lucianopaz approved these changes Jul 6, 2020

View reviewed changes

aseyboldt and others added 6 commits July 7, 2020 12:56

Manually pickle step method in multiprocess sampling

611e14b

Merge close remote connection

44215ca

Fix tests for extra divergence info

f0b5d28

Add test for remote process crash

8cc6f8f

Better formatting in test_parallel_sampling

881c65a

Co-authored-by: Junpeng Lao <[email protected]>

Use mp_ctx forkserver on MacOS

9789ec6

aseyboldt force-pushed the better-pickle branch 2 times, most recently from 9e92aa8 to 8b36471 Compare July 7, 2020 14:18

Add test for pickle with dill

600ed93

aseyboldt force-pushed the better-pickle branch from 8b36471 to 600ed93 Compare July 7, 2020 14:56

AlexAndorra approved these changes Jul 7, 2020

View reviewed changes

lucianopaz merged commit 90f48ed into pymc-devs:master Jul 7, 2020

eigenfoo mentioned this pull request Jul 9, 2020

New commits to pymc3/sampling.py or pymc3/step_methods/hmc/ eigenfoo/littlemcmc#78

Closed

kyleabeauchamp added this to the 3.9.3 milestone Jul 28, 2020

AlexAndorra mentioned this pull request Aug 14, 2020

Parallel sampling hangs on macOS with Python 3.8 with pickling error #3844

Closed

gmingas mentioned this pull request Sep 7, 2020

Parallel sampling with Theano Op fails on MacOS 10.15 #4085

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Show pickling issues in notebook on windows #3991

Show pickling issues in notebook on windows #3991

aseyboldt commented Jul 2, 2020 •

edited

Loading

AlexAndorra left a comment •

edited

Loading

aseyboldt commented Jul 6, 2020 •

edited

Loading

codecov bot commented Jul 6, 2020 •

edited

Loading

lucianopaz left a comment

lucianopaz commented Jul 6, 2020

aseyboldt commented Jul 6, 2020

aseyboldt commented Jul 6, 2020

AlexAndorra commented Jul 7, 2020

aseyboldt commented Jul 7, 2020

lucianopaz commented Jul 7, 2020

Show pickling issues in notebook on windows #3991

Show pickling issues in notebook on windows #3991

Conversation

aseyboldt commented Jul 2, 2020 • edited Loading

AlexAndorra left a comment • edited Loading

Choose a reason for hiding this comment

aseyboldt commented Jul 6, 2020 • edited Loading

codecov bot commented Jul 6, 2020 • edited Loading

Codecov Report

lucianopaz left a comment

Choose a reason for hiding this comment

lucianopaz commented Jul 6, 2020

aseyboldt commented Jul 6, 2020

aseyboldt commented Jul 6, 2020

AlexAndorra commented Jul 7, 2020

aseyboldt commented Jul 7, 2020

lucianopaz commented Jul 7, 2020

aseyboldt commented Jul 2, 2020 •

edited

Loading

AlexAndorra left a comment •

edited

Loading

aseyboldt commented Jul 6, 2020 •

edited

Loading

codecov bot commented Jul 6, 2020 •

edited

Loading