Register the overloads added by CustomDist in worker processes #7241

EliasRas · 2024-04-06T16:38:32Z

Description

Currently sample_smc can fail due to a NotImplementedError if it's used with a model defined usingCustomDist. If a CustomDist is used without dist parameter, the overloads for _logprob, _logcdf and _support_point are registered only in the main process.

This PR adds an initializer which registers the overloads in the worker processes of the pool used in sample_smc.

Related Issue

Closes BUG: pymc.sample_smc fails with pymc.CustomDist #7224
Related to #

Checklist

Checked that the pre-commit linting/style checks pass
Included tests that prove the fix is effective or that the new feature works
Added necessary documentation (docstrings and/or example notebooks)
If you are a pro: each commit corresponds to a relevant logical change

Type of change

📚 Documentation preview 📚: https://pymc--7241.org.readthedocs.build/en/7241/

welcome · 2024-04-06T16:38:35Z

]
💖 Thanks for opening this pull request! 💖 The PyMC community really appreciates your time and effort to contribute to the project. Please make sure you have read our Contributing Guidelines and filled in our pull request template to the best of your ability.

pymc/smc/sampling.py

ricardoV94 · 2024-04-09T08:52:07Z

Hi @EliasRas, I'll need some time to review this one properly. Thanks for taking the initiative

EliasRas · 2024-05-26T04:26:17Z

Looks like I messed up by rebasing instead of merging and introduced plenty of unnecessary commits to this feature. Does it need to be fixed?

twiecki · 2024-05-26T11:25:20Z

Yes, that needs to be fixed, happens to everyone. One approach is to start clean and cherry-pick your commits.

…tialization

ricardoV94 · 2024-05-27T10:26:49Z

tests/smc/test_smc.py

+            return rng.normal(loc=mu, scale=1, size=size)
+
+        with pm.Model():
+            mu = pm.CustomDist("mu", 0, logp=_logp, random=_random)


There are two kinds of CustomDist, if instead of random you pass dist you get a different Op type, that I guess would still fail after this PR

Edit: I see you mentioned this in your top message

Why does it fail with random but not dist. Testing locally it seems to fail with both for me?

I just tested both the code I used in the linked issue and your example below. I still get no errors using pm.Potential or dist argument. Originally I was using pymc==5.10.0 but now I tested with pymc==5.13.0. Maybe there are differences between Windows and Linux if you're using one?

I'm not completely sure why using dist works for me but, based on some quick testing, DistributionMeta.__new__ is called when e.g. Normal is defined and the overloads for builtin distributions are registered there. I'm not well versed in multiprocessing or the way that Python does importing but my hunch is that the worker processes automatically import stuff from pymc and the overloads get registered as a side effect. For user-defined logprob etc. this is not the case since the registration isn't done during importing.

pymc/smc/sampling.py

ricardoV94 · 2024-05-27T10:30:53Z

I think it's more complicated than this. The following example has specific dispatch, but no RV that shows up in the graph:

import pymc as pm

def _logp(value, mu):
    return -((value - mu) ** 2)

def _dist(mu, size=None):
    return pm.Normal.dist(mu, 1, size=size)

with pm.Model():
    mu = pm.Normal("mu", 0)
    pm.Potential("term", pm.logp(pm.CustomDist.dist(mu, logp=_logp, dist=_dist), [1, 2]))
    pm.sample_smc(draws=6, cores=1)

It also fails even with a single core

EliasRas · 2024-05-27T12:04:26Z

It also fails even with a single core

22e8f0b did refactoring for sample_smc and I think that errors should now pop up even with single core since the sampling is always done in another process. Previously this was the case only when cores>1 since there were separate run_chains_parallel and run_chains_sequential.

Co-authored-by: Ricardo Vieira <[email protected]>

ricardoV94 · 2024-05-27T13:42:43Z

Somehow, in main, I am getting ConnectionResetError: [Errno 104] Connection reset by peer even for unrelated models without any sort of CustomDist

tests/smc/test_smc.py

ricardoV94 · 2024-05-27T13:55:49Z

Somehow, in main, I am getting ConnectionResetError: [Errno 104] Connection reset by peer even for unrelated models without any sort of CustomDist

Okay it's something about the new progressbal and pycharm interactive python console. If I use from ipython/terminal it works. But also works in main for me?

ricardoV94 · 2024-05-27T13:59:20Z

I cannot reproduce a failure with your test locally (after avoding the pycharm issue) nor in a Colab environment: https://colab.research.google.com/drive/1I1n6c9IlmXknIfhxC5s7sAQghv0vfRSY?usp=sharing

Can you share more details about your environment/setup?

EliasRas · 2024-05-28T07:36:13Z

Can you share more details about your environment/setup?

I added the output of conda list to "PyMC version information" section of #7224. I'm running the code using VSCode if that matters. Do you need anything else?

Basically I followed the install instructions and the pull request tutorial when installing. Might have also pip installed a couple of extra packages here and there.

ricardoV94 · 2024-05-28T08:39:38Z

I added the output of conda list to "PyMC version information" section of #7224. I'm running the code using VSCode if that matters. Do you need anything else?

We should have at least one person reproduce the problem because I cannot. It may be a VSCode environment issue. Ideally we wouldn't have to change the codebase

EliasRas · 2024-05-28T08:44:33Z

The test does fail without the changes when I run it from miniforge prompt though.

ricardoV94 · 2024-05-28T08:49:21Z

The test does fail without the changes when I run it from miniforge prompt though.

Not sure what miniforge prompt is, can we try to reproduce here on the CI then? Push just the test without the fixes into a new PR and well run it to see if we can reproduce

EliasRas · 2024-06-11T07:18:00Z

Is there anything that needs to be done here besides running the tests?

twiecki · 2024-06-11T11:00:03Z

Is there anything that needs to be done here besides running the tests?

Sorry for the delay, just kicked off tests.

codecov · 2024-06-11T11:30:09Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.19%. Comparing base (c8b22df) to head (d2669f2).
Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #7241      +/-   ##
==========================================
+ Coverage   92.18%   92.19%   +0.01%     
==========================================
  Files         103      103              
  Lines       17259    17282      +23     
==========================================
+ Hits        15910    15933      +23     
  Misses       1349     1349

Files	Coverage Δ
pymc/smc/sampling.py	`99.34% <100.00%> (+0.11%)`	⬆️

lucianopaz · 2024-06-12T09:23:10Z

Thanks @EliasRas, I haven't been able to reproduce this yet but that's just because I'm in the middle of switching workstations and haven't gotten everything setup yet.
Your fix looks fine to me and I understand what you identified as the cause of the issue: the dispatching mechanism isn't registering the logp and other methods to the dynamically created class. I think that this highlights a caveat in pymc's and pytensor's design: spawned processes may not have all the registered dispatch signatures as the main process. I imagine that this is mostly a problem on Windows, where multiprocessing can only spawn new processes whereas linux based systems will default to forks which in principle should copy over the memory contents of the main processes. I'm not sure what will happen under MacOS because I think that they cannot use fork multiprocessing for some reason either.
With this design caveat in hand, I'm not sure if it's better to have a package level utility function that serves as a sort of book-keeper or something that can handle communicating the extra dispatch registration needed to ensure that child processes will use the correct dispatching functions. I'm curious to know what @ricardoV94 thinks about this. I don't think that this PR should have to tackle this kind of work, but I think that we can discuss if it's necessary here, and maybe later open an issue and a separate PR (also maybe in pytensor where dispatching is used for transpilation/compilation and maybe at some point for lazy gradients?).

aseyboldt · 2024-06-12T12:29:30Z

I guess the underlying reason for the failure is that pickling of DensityDist doesn't work out of the box? Sounds like for some reason the dispatch functions don't get registered when the object is unpickled. But wouldn't it be cleaner to overwrite the pickling behavior of this class then? We could override __getstate__ and __setstate__ methods to that effect?

lucianopaz · 2024-06-12T15:02:06Z

I guess the underlying reason for the failure is that pickling of DensityDist doesn't work out of the box?

I don’t think the problem is about pickling. The DensityDist end up returning an op that can be cloudpickled. If I recall correctly it can’t be pickled because the op class is created on the fly. In the process of creating the op, the dispatchers get populated with the callables that are supplied as inputs to the distribution class. As far as I understand, those functions are detached from the rv op and that’s why they never get populated on a spawned process.

aseyboldt · 2024-06-12T15:44:25Z

I don't mean that the pickling itself throws an error (it doesn't), but that it would be the responsibility of the DensityDist object to ensure that the set-up it needs (ie registering the logp) is done when it is unpickled.

For instance the following fails with the NotImplementedError, and has nothing to do with smc, so I guess the solution shouldn't be specific to smc?

import pymc as pm
import cloudpickle
import multiprocessing


def use_logp_func(pickled_model):
    model = cloudpickle.loads(pickled_model)
    logp = model.logp()
    func = pm.pytensorf.compile_pymc(model.value_vars, logp)
    print(func(1.0))


if __name__ == "__main__":
    with pm.Model() as model:

        def logp(value):
            return -(value**2)

        pm.DensityDist("x", logp=logp)

    logp = model.logp()
    func = pm.pytensorf.compile_pymc(model.value_vars, logp)
    pickled_model = cloudpickle.dumps(model)

    ctx = multiprocessing.get_context("spawn")
    process = ctx.Process(target=use_logp_func, args=(pickled_model,))
    process.start()
    process.join()

lucianopaz · 2024-06-12T19:26:07Z

I completely agree that this problem isn’t unique to smc and is a design caveat that needs to be addressed more comprehensively.
I think that we can kind of patch some things:

Make Model objects __setstate__ and __getstate__ repopulate the dispatch registries
Get CustomDist rv ops to have these methods defined somehow (maybe clojures) that repopulates the dispatch registries.

I’m not sure if these two methods can cover all use patterns though.

ricardoV94 · 2024-06-12T21:32:37Z

Alternatively we could pass the functions needed to each process which is more like what pm.sample does.

This also avoids recompiling the same functions multiple times?

EliasRas · 2024-06-24T08:46:55Z

@lucianopaz
Point 1. is pretty straightforward but could you explain what you meant by 2.? How would it be different from overriding __getstate__ and __setstate__?

…tialization

EliasRas · 2024-07-17T07:00:18Z

I started work on __getstate__ and __setstate__ but I realized that I can't just copy the current implementation due to circular imports. Would either of these sound like a good idea?

Create utils module (or something similar) in pymc.model that handles the custom methods and registering them.
Add a method for RVs that registers the custom methods if necessary. CustomDistRV and CustomSymbolicDistRV would have the registrations and RandomVariable and SymbolicRandomVariable would just pass.

ricardoV94 · 2024-07-17T07:04:54Z

I think we should explore an alternative where we compile the functions SMC needs and fork afterwards like pm.sample does.

This approach seems more brittle?

It would also avoid re-compiling the same functions in each chain

EliasRas · 2024-07-17T07:12:37Z

I agree on the re-compiling part but shouldn't this still be fixed? It feels like an arbitrary decision to "disallow" using multiprocessing this way only on Windows even if it is a bad way.

ricardoV94 · 2024-07-17T07:17:51Z

I agree on the re-compiling part but shouldn't this still be fixed? It feels like an arbitrary decision to "disallow" using multiprocessing this way only on Windows even if it is a bad way.

I think this limitation is likely deeper than what you're addressing here. As @aseyboldt and @lucianopaz mentioned we're using dynamic dispatching as a recurring theme in our codebase and pytensor's

However, I don't agree with their solutions

ricardoV94 · 2024-07-17T07:23:44Z

Using the class that's being dispatched to register the dispatches during pickling seems at odds with the point of dispatching. The class shouldn't have to know what's being dispatched upon.

For instance we also have icdf methods, what if someone dispatched on it from the outside, does pickling work for it? Or would the setstate/getstate need to know about icdf (as well any other dispatch that may not even be part of PyMC)?

It's also not a PyMC model responsibility. CustomDist can be defined just fine outside of a model

EliasRas · 2024-07-17T07:27:35Z

Thank you for taking the time to explain. I'll start working on the compilation approach.

ricardoV94 · 2024-07-17T07:38:53Z

However since this fixes existing behavior I think we can go ahead and merge it as a temporary patch?

pymc/smc/sampling.py

Co-authored-by: Ricardo Vieira <[email protected]>

ricardoV94

Can you open a follow up issue for the compiling once and then forking?

EliasRas · 2024-08-22T12:58:25Z

I started working on compiling the functions in the main process. Should I close this or is there any chance of this getting merged before the newer PR?

EliasRas · 2024-09-04T05:33:45Z

I started working on compiling the functions in the main process. Should I close this or is there any chance of this getting merged before the newer PR?

@ricardoV94 @lucianopaz @aseyboldt @twiecki Just following up on this. Should I close the PR?

EliasRas added 2 commits April 6, 2024 19:11

Register custom overloads in all processes

9757434

Test for using CustomDist in multiprocess smc sampling

54ca772

EliasRas commented Apr 6, 2024

View reviewed changes

pymc/smc/sampling.py Outdated Show resolved Hide resolved

ricardoV94 self-assigned this Apr 9, 2024

EliasRas force-pushed the pool_initialization branch from 506e49f to 54ca772 Compare May 26, 2024 12:18

Merge commit '19be124e4d128276803e6ce108591654c9625da1' into pool_ini…

54f72e2

…tialization

ricardoV94 reviewed May 27, 2024

View reviewed changes

pymc/smc/sampling.py Outdated Show resolved Hide resolved

EliasRas and others added 2 commits May 27, 2024 15:15

Use basic_RVs instead of manually collecting variables

f96152d

Co-authored-by: Ricardo Vieira <[email protected]>

Use CustomDistRV to find variables defined using CustomDist

f539b9f

ricardoV94 reviewed May 27, 2024

View reviewed changes

tests/smc/test_smc.py Outdated Show resolved Hide resolved

EliasRas added 3 commits May 28, 2024 10:37

Handle missing overloads

d6cfb89

Also catch the variable defined using CustomDist and the dist argument

5b460e4

Check both kinds of CustomDist in multiprocessing test

997d730

EliasRas mentioned this pull request May 28, 2024

Add a test for SMC sampling from CustomDist in multiple processes #7337

Closed

11 tasks

EliasRas marked this pull request as ready for review June 3, 2024 05:01

twiecki requested a review from ricardoV94 June 11, 2024 10:59

ricardoV94 requested a review from lucianopaz June 11, 2024 11:57

EliasRas added 2 commits July 17, 2024 08:09

Merge commit 'c8b22df212efe75a8c43c813295e6e000d3a5fc4' into pool_ini…

b625511

…tialization

Fix CustomDistRV imports

c572542

ricardoV94 reviewed Jul 17, 2024

View reviewed changes

pymc/smc/sampling.py Outdated Show resolved Hide resolved

pymc/smc/sampling.py Outdated Show resolved Hide resolved

pymc/smc/sampling.py Show resolved Hide resolved

EliasRas and others added 3 commits July 17, 2024 11:18

Better function name

0e4c01c

Use type instead of __class__

bb3e16f

Co-authored-by: Ricardo Vieira <[email protected]>

Also register icdf overloads

d2669f2

ricardoV94 approved these changes Jul 17, 2024

View reviewed changes

EliasRas mentioned this pull request Jul 17, 2024

ENH: Compile the functions needed by SMC before the worker processes are started #7413

Open

EliasRas mentioned this pull request Aug 22, 2024

Compile the functions needed by SMC before the worker processes are started #7472

Draft

11 tasks

Register the overloads added by CustomDist in worker processes #7241

Are you sure you want to change the base?

Register the overloads added by CustomDist in worker processes #7241

Conversation

EliasRas commented Apr 6, 2024 • edited by github-actions bot Loading

Description

Related Issue

Checklist

Type of change

welcome bot commented Apr 6, 2024

ricardoV94 commented Apr 9, 2024

EliasRas commented May 26, 2024

twiecki commented May 26, 2024

ricardoV94 May 27, 2024 • edited Loading

Choose a reason for hiding this comment

ricardoV94 May 27, 2024

Choose a reason for hiding this comment

EliasRas May 27, 2024

Choose a reason for hiding this comment

ricardoV94 commented May 27, 2024 • edited Loading

EliasRas commented May 27, 2024

ricardoV94 commented May 27, 2024 • edited Loading

ricardoV94 commented May 27, 2024

ricardoV94 commented May 27, 2024

EliasRas commented May 28, 2024

ricardoV94 commented May 28, 2024

EliasRas commented May 28, 2024

ricardoV94 commented May 28, 2024

EliasRas commented Jun 11, 2024

twiecki commented Jun 11, 2024

codecov bot commented Jun 11, 2024 • edited Loading

Codecov Report

lucianopaz commented Jun 12, 2024

aseyboldt commented Jun 12, 2024

lucianopaz commented Jun 12, 2024

aseyboldt commented Jun 12, 2024

lucianopaz commented Jun 12, 2024

ricardoV94 commented Jun 12, 2024 • edited Loading

EliasRas commented Jun 24, 2024

EliasRas commented Jul 17, 2024

ricardoV94 commented Jul 17, 2024 • edited Loading

EliasRas commented Jul 17, 2024 • edited Loading

ricardoV94 commented Jul 17, 2024 • edited Loading

ricardoV94 commented Jul 17, 2024 • edited Loading

EliasRas commented Jul 17, 2024 • edited Loading

ricardoV94 commented Jul 17, 2024

ricardoV94 left a comment

Choose a reason for hiding this comment

EliasRas commented Aug 22, 2024

EliasRas commented Sep 4, 2024

EliasRas commented Apr 6, 2024 •

edited by github-actions bot

Loading

ricardoV94 May 27, 2024 •

edited

Loading

ricardoV94 commented May 27, 2024 •

edited

Loading

ricardoV94 commented May 27, 2024 •

edited

Loading

codecov bot commented Jun 11, 2024 •

edited

Loading

ricardoV94 commented Jun 12, 2024 •

edited

Loading

ricardoV94 commented Jul 17, 2024 •

edited

Loading

EliasRas commented Jul 17, 2024 •

edited

Loading

ricardoV94 commented Jul 17, 2024 •

edited

Loading

ricardoV94 commented Jul 17, 2024 •

edited

Loading

EliasRas commented Jul 17, 2024 •

edited

Loading