
ENH: Compile the functions needed by SMC before the worker processes are started #7413

Open
EliasRas opened this issue Jul 17, 2024 · 6 comments · May be fixed by #7472

Comments

@EliasRas


Context for the issue:

As explained in #7224, sample_smc can fail on Windows if the model is defined using CustomDist. Everything works on Linux because the worker processes are forked by default whereas they're spawned on Windows. Spawning leads to missing dispatches.

#7241 addressed the issue by manually registering the dispatches in the worker processes. A better way to solve the problem would be to compile the necessary functions before starting the worker processes (as is done in sample).
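The fork/spawn difference can be pictured with a minimal stdlib-only sketch (not PyMC code; the registry dict is a hypothetical stand-in for the dispatch tables CustomDist populates at runtime). A forked worker inherits the parent's memory and keeps runtime registrations; a spawned worker starts a fresh interpreter, re-imports everything, and loses them:

```python
import subprocess
import sys
import textwrap

# Hypothetical stand-in for a dispatch table: an entry registered at runtime
# in the parent process, the way CustomDist registers overloads after import.
REGISTRY = {}
REGISTRY["custom_logp"] = lambda x: x + 1.0

# Simulate a spawn-style worker with a fresh interpreter: module-level state
# is re-created from scratch, so the runtime registration is absent.
child_code = textwrap.dedent(
    """
    REGISTRY = {}  # what a fresh import sees
    print("custom_logp" in REGISTRY)
    """
)
result = subprocess.run(
    [sys.executable, "-c", child_code], capture_output=True, text=True
)

parent_sees = "custom_logp" in REGISTRY        # registered at runtime
child_sees = result.stdout.strip() == "True"   # lost in the "spawned" child
print(parent_sees, child_sees)  # True False
```

This is why the failure only shows up on Windows: with fork (the Linux default), the child never re-imports, so the registrations survive for free.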

@EliasRas
Author

This writeup focuses on the issues that come from using CustomDist, since that's where I encountered the problem. I don't know whether similar issues could arise elsewhere or whether they should even be addressed here.

The functions needed by SMC are compiled in SMC_KERNEL._initialize_kernel, which is currently called in each worker process.

smc._initialize_kernel()

The first function (inside Model.initial_point) is used to generate initial_point, which is used to determine the shapes/sizes of variables and later to create the likelihood and prior functions. It uses support_point by default but can also sample from the prior. If support_point is not defined, the implementation defaults to using the prior. If I understood right, sampling from the prior shouldn't cause any issues, since random should be defined using numpy, which is always available.

initial_point = self.model.initial_point(random_seed=self.rng.integers(2**30))

The second function (in SMC_KERNEL.initialize_population) is very similar to the previous one, but it samples from the prior by default and shouldn't cause issues.

init_rnd = self.initialize_population()

Finally, the functions used to calculate the prior probabilities and likelihoods are defined using the respective model properties. Here the issues come from using logp.

pymc/pymc/smc/kernels.py

Lines 237 to 242 in ab467da

self.prior_logp_func = _logp_forw(
initial_point, [self.model.varlogp], self.variables, shared
)
self.likelihood_logp_func = _logp_forw(
initial_point, [self.model.datalogp], self.variables, shared
)

Is it possible that SMC_KERNEL.setup_kernel would some day contain compilation? Currently it is empty but is called nonetheless.

I think that the necessary fixes are

  1. Compile the function that generates initial_point in the main process. It would be simple to copy the relevant code from Model.initial_point and call it before the worker processes are started but I don't know if that's a very good solution. Feels like unnecessary duplication.
  2. Compile the prior and likelihood functions beforehand. Could be done by storing SMC_KERNEL.model.varlogp and SMC_KERNEL.model.datalogp in SMC_KERNEL.__init__.
  3. Somehow make the pitfall of dispatch registration more obvious. I don't know if this is necessary, since the test added in #7241 (Register the overloads added by CustomDist in worker processes) catches at least some of the issues, so breaking changes wouldn't get through.
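Fix 2 can be sketched in plain Python (names loosely mirror SMC_KERNEL; `expensive_compile` is a hypothetical stand-in for `_logp_forw` / PyTensor compilation, not the real API):

```python
# Hypothetical sketch: do the expensive "compilation" once in __init__, in
# the main process, so worker processes only call already-built functions
# and never need the runtime dispatch registrations.
def expensive_compile(expr):
    # stand-in for pytensor.function(...); here it just closes over expr
    return lambda point: expr * point

class KernelSketch:
    def __init__(self, varlogp=2.0, datalogp=3.0):
        # compiled eagerly, before any worker process is started
        self.prior_logp_func = expensive_compile(varlogp)
        self.likelihood_logp_func = expensive_compile(datalogp)

    def _initialize_kernel(self, initial_point):
        # workers reuse the precompiled callables
        return (
            self.prior_logp_func(initial_point),
            self.likelihood_logp_func(initial_point),
        )

kernel = KernelSketch()
print(kernel._initialize_kernel(1.0))  # (2.0, 3.0)
```

The point of the pattern is only the timing: everything that triggers compilation moves from `_initialize_kernel` (run per worker) into `__init__` (run once, in the main process).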

@EliasRas
Author

As a side note, isn't this

shared = make_shared_replacements(initial_point, self.variables, self.model)

unnecessary? It makes shared variables from set(self.model.value_vars) - set(self.variables) but the set is always empty if I didn't miss any modifications to self.variables.

@ricardoV94
Member

shared = make_shared_replacements(initial_point, self.variables, self.model)

Right, I think this is used in step samplers, where each sampler can be responsible for a subset of the variables. I guess all the SMC kernels must always sample all the variables, so this shouldn't be needed?

@ricardoV94
Member

For the initial point, we could just pass the result for each chain instead of the function that computes the initial point?
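The suggestion amounts to evaluating the initial points before the workers start and shipping plain data. A stdlib-only sketch (illustrative names, not the actual sample_smc signature):

```python
import random

def compute_initial_point(rng):
    # stand-in for Model.initial_point; the real one needs compiled code
    return {"x": rng.gauss(0.0, 1.0)}

n_chains = 4
seeds = [10, 11, 12, 13]

# Main process: evaluate once per chain, before spawning workers...
initial_points = [compute_initial_point(random.Random(s)) for s in seeds]

# ...then hand each worker a picklable dict instead of a function that
# would need the (possibly missing) dispatches to run inside the worker.
print(len(initial_points), all("x" in p for p in initial_points))  # 4 True
```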

@EliasRas
Author

I think this is used in step samplers, where each sampler can be responsible for a subset of the variables. I guess all the SMC kernels must always sample all the variables so this shouldn't be needed?

Oh, seems like a good choice to leave it as it is then if there's even a remote possibility to break something in the future. It doesn't cost much to include the check anyway.

For the initial point, we could just pass the result for each chain instead of the function that computes the initial point?

Sounds good to me.

@ricardoV94
Member

ricardoV94 commented Jul 22, 2024

Oh, seems like a good choice to leave it as it is then if there's even a remote possibility to break something in the future. It doesn't cost much to include the check anyway.

No, I think this used to be how it worked back when SMC was just another step sampler. But since it hasn't been for a while, it doesn't make sense to keep the code complexity around. We can always reintroduce it.
