Conversation
|
What's this for? Generating sample data?
On 26 Feb 2018 23:00, Colin wrote:
This allows drawing samples from a model, ignoring all observed variables. See the
screenshot below for an example in a simple model.
Right now it relies on unofficial python3.6 behavior, and official
python3.7 behavior
<https://mail.python.org/pipermail/python-dev/2017-December/151283.html>.
Namely, dictionaries keeping insertion order. I would love a suggestion to
avoid that requirement, but I can also take a swing at having tree_dict
subclass from OrderedDict instead.
[image: image]
<https://user-images.githubusercontent.com/2295568/36700677-87c7b582-1b1e-11e8-8dc0-cfd0efb5db09.png>
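Since the description above leans on dictionaries keeping insertion order, here is a minimal sketch of the `OrderedDict` fallback it mentions. `TreeDict` is a hypothetical stand-in for the PR's `tree_dict`, and the keys are illustrative; the point is only that subclassing `OrderedDict` turns the CPython 3.6 implementation detail into an explicit guarantee.

```python
from collections import OrderedDict

# Hedged sketch: subclassing OrderedDict makes insertion order an explicit
# guarantee instead of an unofficial CPython 3.6 behavior.
# "TreeDict" is a hypothetical stand-in for the PR's tree_dict.
class TreeDict(OrderedDict):
    pass

d = TreeDict()
d['mu'] = 0.0      # parent variables inserted first...
d['sigma'] = 1.0
d['obs'] = None    # ...children later, so iteration follows the graph order
print(list(d))     # ['mu', 'sigma', 'obs']
```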
You can view, comment on, or merge this pull request online at:
#2876
Commit Summary
- Add sample_prior function
|
|
Yep! That would be one use case. Or faster prototyping (for example, seeing if the generated data looks reasonable). We wanted to use something like this last week for generating a toy data set for a gerrymandering project. |
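The workflow described here, generating a toy data set from the prior before fitting anything, can be sketched in plain numpy without pymc3. The variable names, priors, and sample size below are purely illustrative.

```python
import numpy as np

# Hedged sketch of prior-based toy data generation: draw parameters from
# (illustrative) priors, then simulate a dataset, ignoring observed values.
rng = np.random.default_rng(42)
n = 100
mu = rng.normal(0.0, 10.0)          # prior draw for the mean
sigma = abs(rng.normal(0.0, 1.0))   # half-normal-style prior draw for the scale
y = rng.normal(mu, sigma, size=n)   # simulated "observed" data
print(y.shape)
```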
|
This paper shows a good example of where you might usefully use prior sampling. |
|
Ha, this is great!!! I was thinking exactly the same in another issue the other day: #2856 (comment) |
pymc3/sampling.py
Outdated
|
It is better to use the part from smc where it samples from the prior:
https://github.com/pymc-devs/pymc3/blob/801accb5f236ab9daa89a8fcd9d09a3ba4ed0a39/pymc3/step_methods/smc.py#L186-L193
Otherwise you will get an error with bounded RVs:
AttributeError: 'TransformedDistribution' object has no attribute 'random'
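The error above comes from transformed RVs not exposing a `random` method directly. A toy sketch of the fallback pattern being suggested, falling through to the wrapped untransformed distribution, looks like this. The classes here are stand-ins for illustration, not pymc3 internals.

```python
import numpy as np

class Normal:
    """Toy untransformed distribution with a working random()."""
    def random(self, size=None):
        return np.random.standard_normal(size)

class TransformedDistribution:
    """Toy transformed wrapper with no random() of its own."""
    def __init__(self, dist):
        self.dist = dist  # the underlying untransformed distribution

def draw_from(rv_dist, size=None):
    if hasattr(rv_dist, 'random'):
        return rv_dist.random(size=size)
    # fall back to the wrapped distribution for bounded/transformed RVs
    return rv_dist.dist.random(size=size)

samples = draw_from(TransformedDistribution(Normal()), size=5)
print(samples.shape)
```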
|
This is great then. I often wonder how to do the Stan "data generation
process" part.
On 27 Feb 2018 05:28, Junpeng Lao commented on this pull request, in pymc3/sampling.py:

```diff
+
+    if vars is None:
+        vars = set(model.named_vars.keys())
+
+    if random_seed is not None:
+        np.random.seed(random_seed)
+
+    if progressbar:
+        indices = tqdm(range(samples))
+
+    try:
+        prior = {var: [] for var in vars}
+        for _ in indices:
+            point = {}
+            for var_name, var in model.named_vars.items():
+                val = var.distribution.random(point=point, size=size)
```
|
|
@junpenglao That's a good sign that we even named the functions the same! You seemed to sketch out a pretty complete method in the comment on the other issue (along with some good edge cases for the test) - I'll hopefully update later today. |
|
I was hoping someone would pick it up ;-) |
|
Also need to add it to the release notes. |
Force-pushed fc62dcc to efb9b28
|
@junpenglao I updated to sample correctly from transformed variables. I decided against (for now) using the trick from smc. I am a little confused, because sampling from a transformed distribution is super slow: changing … |
|
I have tried a few things to fix the speed problem, without much luck. I might give it a try tomorrow, doing something similar to what … |
Agree - that is more for the initialization. After this PR we can replace the jitter function currently used with sample_prior (with jitter etc. to handle corner cases).
[Edit]: using forward_val doesn't speed things up currently, but it potentially could if we rewrite it into numpy functions. |
pymc3/sampling.py
Outdated
|
named_vars also contains Deterministic and Potential variables, which don't have distribution or random attributes.
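A minimal sketch of the filtering this comment asks for: keep only variables whose distribution actually has a `random` method, skipping Deterministic and Potential. The classes below are toy stand-ins, not pymc3's.

```python
# Hedged sketch, not pymc3 code: filter a named_vars-style mapping down to
# the variables that can be forward-sampled.
class Dist:
    def random(self, point=None, size=None):
        return 0.0

class FreeRV:
    distribution = Dist()

class Deterministic:
    pass  # no .distribution attribute, like pm.Deterministic/pm.Potential

named_vars = {'mu': FreeRV(), 'X': Deterministic()}
sampleable = {name: var for name, var in named_vars.items()
              if hasattr(getattr(var, 'distribution', None), 'random')}
print(sorted(sampleable))
```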
RELEASE-NOTES.md
Outdated
|
I would move this further up as it's a major feature. Also, I think we should add the names of the authors who contributed to each feature / bugfix.
pymc3/sampling.py
Outdated
|
I would add a bit more description of why this is useful and when you would use it. It's really the prior predictive we're sampling from here.
|
If it's helpful -- I think one use case for this function is to generate a unique starting point for each chain when multiple are required, as in this case: #2856
|
Would be great to add a NB with some motivation and example usage we can add to the docs. |
|
Some of this seems trickier than I thought. I've tried a few methods that are almost clever and don't work. My current favorite approach tries to clone the whole model, but I am not able to clone an … Here is the current implementation of … |
|
Why doesn't forward-pass random (like what you did before) work, besides the slowness? I would like to contribute a bit more to this issue, as efficient forward random sampling is quite important for the likelihood-free methods that I would like to address. Could you share your experiments? |
Force-pushed cdad527 to a0d638a
|
Gosh, it is easy to forget how useful outside input can be sometimes. I am going to focus on that instead of the many hours I spent trying to get something else to work :D It looks like forward pass continues to work, and I actually fixed the speed problem in a ninja edit last week. @twiecki would you rather have an example NB along with this PR, or merge this to master to start working more bugs out? |
|
@ColCarroll rather with this one :). The API shouldn't change all that much. |
|
@ColCarroll did you push the new changes? |
|
Yes - the major change is this line for deterministic variables (it is complicated because passing unused variables throws an error). |
|
Nice!!! LGTM |
|
Failure looks like something related to … Working on a tiny case study notebook to use as well. |
|
LGTM. I think it's fine to mark that with xfail, since we often have errors like that. Maybe add sample prior to one of the other notebooks; that might be easier than writing your own notes. |
|
Awesome update. The fact that now we can check … |
|
Maybe the test failure could be fixed by specifying the dtype? Similar to #2891 (comment)? |
pymc3/tests/test_sampling.py
Outdated
Force-pushed 51451e3 to 9a80d0d
|
Updated this to use #2902. Huge thanks to @lucianopaz, as that code cleans this up a lot, and it looks quite tricky! You might take a look to make sure I did not mess anything up. The first one I think is good; the second one might not be wanted elsewhere. Note that now we just sample all the points we want from each node as we scan through, so it is quite fast, and it no longer uses a progressbar since it is not iterative. I have confirmed that I can sample from the Efron-Morris baseball generative model, and am going to work on turning that into an actual example notebook. |
|
|
```python
names = get_default_varnames(model.named_vars, include_transformed=False)
# draw_values fails with auto-transformed variables. transform them later!
values = draw_values([model[name] for name in names], size=samples)
```
|
Wow, this is really efficient! However, are we sure that the values drawn for the children of a graph depend on the samples drawn for their parents? In the previous implementation, we always sampled by evaluating a point which contains samples from higher in the hierarchy. For example, if b ~ p(a), we sample a_tilde first, then sample b ~ p(a_tilde). Is that the case here also?
|
A simple example:

```python
X = theano.shared(np.arange(3))
with pm.Model() as m:
    ind = pm.Categorical('i', np.ones(3)/3)
    x = pm.Deterministic('X', X[ind])
    prior = pm.sample_generative(10)

prior
# {'X': array([0, 0, 2, 1, 2, 2, 1, 0, 0, 1]),
#  'i': array([1, 0, 0, 2, 2, 0, 2, 1, 0, 0])}
```

i and X should be identical.
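The consistency requirement in this example, that each child draw must use the same parent draw, can be illustrated with a plain-numpy ancestral-sampling toy. This is an illustration of the expected behavior, not the PR's code.

```python
import numpy as np

# Hedged toy: ancestral sampling draws the parent first and feeds its value
# into the child, so 'i' and 'X' stay consistent within each draw.
rng = np.random.default_rng(0)
X_data = np.arange(3)

draws = []
for _ in range(10):
    i = rng.integers(0, 3)   # parent: categorical over 3 states
    x = X_data[i]            # child: deterministic lookup of the parent draw
    draws.append((i, x))

# every draw is internally consistent
print(all(x == X_data[i] for i, x in draws))
```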
|
This is a super helpful example! Let me take a look at it -- there's some work already to avoid some edge cases, and I would have thought this got caught.
|
Caught the bug (will add tests for all this, too). Your example runs as desired now!
I make sure I evaluate the params by making a dictionary of index integers to nodes (this avoids the non-hashability of ndarray). After evaluating the nodes, I was accidentally using the index integer to check whether it was a child of another node. This was never true, so I never supplied that value to the rest of the graph.
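A toy illustration of the fix just described: key the evaluation cache by index integers to dodge ndarray unhashability, but do the child-membership check on the node object, not on the index. Names and data here are hypothetical.

```python
import numpy as np

# Hedged sketch of the workaround: ndarrays can't be dict keys, so use an
# integer index as the key and keep the node object as the value.
nodes = [np.array([1.0, 2.0]), np.array([3.0])]          # toy "graph nodes"
params = {idx: node for idx, node in enumerate(nodes)}   # int -> node

evaluated = {idx: node.sum() for idx, node in params.items()}

# the bug was checking `idx in children`; the fix checks the node itself
children = [nodes[0]]
is_child = any(params[0] is c for c in children)
print(is_child)
```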
```python
if size is None:
    return func(*values)
else:
    return np.array([func(*value) for value in zip(*values)])
```
|
The size seems to only be imposed on params with a random method, and we hope the contents of values will be the right size in the end. Shouldn't there be some enforcement of the size for the numbers.Number, np.ndarray, tt.TensorConstant, tt.sharedvar.SharedVariable and tt.TensorVariable-in-point cases, so we can be sure that values will in fact have the desired output size?
|
I am relying here on theano catching those sorts of errors, and giving more informative errors than I could. I am running this on a few different models to make sure it gives reasonable results, but so far those sorts of inputs get broadcast in a sensible manner.
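The branch under discussion can be illustrated with a toy stand-in for the compiled function: when size is given, each entry of values holds one array per draw, and the draws get zipped together and applied one at a time.

```python
import numpy as np

def func(mu, sd):
    # toy stand-in for the compiled deterministic function
    return mu + sd

# per-parameter arrays of 3 draws each, mimicking draw_values output
values = [np.array([0.0, 1.0, 2.0]),   # 3 draws of mu
          np.array([1.0, 1.0, 1.0])]   # 3 draws of sd

# the size-given branch: apply func draw by draw
out = np.array([func(*value) for value in zip(*values)])
print(out)  # [1. 2. 3.]
```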
pymc3/distributions/distribution.py
Outdated
```diff
 evaluated[param] = _draw_value(params[param], point=point, givens=givens.values(), size=size)
-if any(param in j for j in named_nodes_children.values()):
+if any(params[param] in j for j in named_nodes_children.values()):
 givens[param.name] = (params[param], evaluated[param])
```
|
Oops, I actually commented on an older commit so it shows as outdated, sorry for the mess. First off, it looks very nice; however, I think this line is confusing. You're trying to see if the node params[param] is a child of some other named node. If params[param] is a named node, that information should be available in the dictionary named_nodes_parents. If params[param] is not a named node, then it is registered in neither the named_nodes_parents nor the named_nodes_children dictionaries.
If params[param] is a named node, you should be able to replace this line with:
if named_nodes_parents[params[param]]:
If params[param] is not a named node, then I think it shouldn't be added to givens, but I may be overlooking something.
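The simplification suggested here rests on an equivalence: a named node appears in some other node's children-set exactly when its own parent list is non-empty, so the per-children-set scan can be replaced with one dictionary lookup. A toy sketch, with illustrative dict contents:

```python
# Hedged toy: 'b' depends on 'a', so named_nodes_parents['b'] is non-empty,
# which is the same information the children-set scan was recovering.
named_nodes_parents = {'b': ['a'], 'a': []}

def needs_given(name):
    # a truthy parent list means the node depends on another named node
    return bool(named_nodes_parents.get(name))

print(needs_given('b'), needs_given('a'))  # True False
```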
|
This is much nicer, thank you!
|
(also just checked out travis, and your suggestion will also fix failing tests)
junpenglao left a comment: LGTM. Are you adding more tests, or is this ready?
|
Not yet - I am now looking at … In particular, I would guess there is still something funny going on with passing nodes appropriately. |
|
This is a difficult model to generate from. But yeah, there seems to be some problem with the last RV. |
|
Since this is currently blocked by #2909, I suggest we roll back to the original implementation with (slower) forward passing. I have a version that works fairly OK and could serve as a baseline implementation: https://github.com/junpenglao/Planet_Sakaar_Data_Science/blob/master/Miscellaneous/Test_sample_prior.ipynb |
|
I'm confused about whether this is working or not :) |
|
Closing this based on the newer PR |