Refactoring to support Structured VI #2416

Merged
merged 97 commits into from Sep 8, 2017
Conversation

class GroupApproximation(object):
    """
    Grouped Approximation that is used for modelling mutual dependencies
    fro a specified group of variables.
Member

typo: fro->for

@twiecki twiecki added this to the 3.2 milestone Jul 14, 2017
@ferrine
Member Author

ferrine commented Jul 15, 2017

API change proposal №1:

Local variables are held in a separate group, so instead of passing a dict[var, (mu, rho)] you specify them as a group of variables and provide the needed params.

At the low level (the high-level API can be less verbose) it will look roughly like:

approx = Approximation([
    MeanField([local_rv],  # group of local variables
        shapes=dict([(local_rv, (s, 12))]),  # where `s` is symbolic
        params=dict(mu=mu, rho=rho)),  # mu, rho have valid shape (flexible, latent_ndim);
        # here the flexible shape is `s` and latent ndim is 12
    MeanField(-1)  # delayed init with `-1` instead of a group; captures all the remaining variables
])

As a bonus, the following construction will be possible:

approx = Approximation([
    FullRank([local_rv],  # group of local variables
        shapes=dict([(local_rv, (s, 12))]),  # where `s` is symbolic
        params=dict(mu=mu, L_tril=L_tril)),  # mu, L_tril have valid shape (flexible, param_standard_shape);
        # here the flexible shape is `s` and latent ndim is 12, so
        # mu.shape = (s, 12)
        # L_tril.shape = (s, 12 * (12 + 1) / 2)
    MeanField(-1)  # delayed init with `-1` instead of a group; captures all the remaining variables
])

Maintaining backward compatibility for the first iterations would make future work very hard. The API change is still under discussion; feel free to suggest ideas and criticize solutions.

The main idea behind this is handling local_rv as a separate group with its own flexible shape that depends on a dynamic shape. Currently I try to do it with 3d tensors where the 2nd dim holds the flexible shape. I hope to implement broadcastable Approximations so that one can parametrize any distribution for local variables. Even in this setup it's all hard enough.
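For concreteness, here is the shape arithmetic implied by the examples above (plain Python; `s` stands for the symbolic batch dimension):

latent_ndim = 12
tril_size = latent_ndim * (latent_ndim + 1) // 2  # 12 * 13 / 2 = 78
# MeanField group: mu.shape == (s, 12), rho.shape    == (s, 12)
# FullRank group:  mu.shape == (s, 12), L_tril.shape == (s, 78)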

@ferrine
Member Author

ferrine commented Jul 15, 2017

Personally I feel free to change the API for 3.2 now that we have released 3.1, but I'm still a bit worried about regular users; they might not like such rapid API changes. For top-level APIs like ADVI and NFVI I do not expect significant changes, as they are just special cases in the Structured VI framework.

@ferrine
Member Author

ferrine commented Jul 15, 2017

I also invite @fonnesbeck for discussion

@junpenglao
Member

How does it compare to the current API? Are the global and local vars passed together to the Approximation?
I think now is still a good time to make API changes, as long as the high-level API stays the same (e.g., pm.fit still has the same API). Personally, I don't interact with the low-level API this way.

@ferrine
Member Author

ferrine commented Jul 16, 2017

I think a reasonable way to go is to restrict so-called groups for local variables to have only one variable per group.

OK
local_1 : {locv_1}
local_2: {locv_2}
global_1: {var1, var2}
global_2: {var3, var4}

not OK
local_1 : {locv_1, locv_2}
global_1: {var1, var2}
global_2: {var3, var4}

not OK
local_1 : {locv_1}
global_1: {var1, var2, locv_2}
global_2: {var3, var4}

So we'll have no way to capture interactions between local and global variables, nor between local and local. This is exactly what we have now, but much more flexible: here one can parametrize local variables with flows or FullRank, or separate groups with more complex distributions rather than just MeanField.

BTW, I consider it too hard to implement a convenient API that allows more than one variable in a local group. See the sketch below.
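As a hypothetical sketch (names illustrative, not the final API), the "OK" layout above would be constructed as:

approx = Approximation([
    MeanField([locv_1], shapes=..., params=...),  # local group: exactly one variable
    MeanField([locv_2], shapes=..., params=...),  # each local variable gets its own group
    FullRank([var1, var2]),                       # global groups may hold several variables
    MeanField([var3, var4]),
])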

self._mid_size = size
self.vmap = dict()
self.ndim = 0
self._global_ = True
Member

What’s the trailing underscore for?

Member Author

I use this to indicate that it should not be changed, even internally, after init.

Member

I see. Another option would be to use all caps (e.g. self.GLOBAL). This is common for fixed variables, such as constants in Python. Not sure what PEP8 has to say about that, but the Google style guide uses this convention.

Member Author

Maybe

Member Author

I think all caps is more for class attributes, not instance-specific ones.
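For illustration, a minimal sketch of the two conventions being discussed (attribute names are just examples):

class Group(object):
    GLOBAL = True  # ALL_CAPS: the usual convention for class-level constants

    def __init__(self):
        # trailing underscore: ad-hoc marker that this instance attribute
        # is set once at init and never mutated afterwards
        self._global_ = True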

@fonnesbeck
Member

fonnesbeck commented Jul 16, 2017

I still need to properly read these papers. Could there not be default groupings based on, say, hierarchical structure in the model?

Whether a drastic API change is worth it depends on what structured VI gives us beyond what can be achieved with normalizing flows. The copula VI paper shows a 4x slowdown compared to mean field; is this comparable to what we are getting with NFVI?

I haven’t seen these implemented anywhere, which seems odd. Would have thought Edward would have them. Any idea why not?

@ferrine
Member Author

ferrine commented Jul 16, 2017

NFVI is not scalable in terms of model size. Of course there is a slowdown, but there is a tradeoff: we can vary flow length and thus computational cost. I want to make this tradeoff more flexible and require flows or FullRank only when needed.

@ferrine
Member Author

ferrine commented Jul 16, 2017

Remark: NFVI is the most scalable approach we currently have for complex models, but we can make it better.

@junpenglao
Member

Would have thought Edward would have them. Any idea why not?

We can ask @dustinvtran for some opinions.

@ferrine
Member Author

ferrine commented Jul 16, 2017

I find the implementation not easy; I need a careful architecture choice first.

@ferrine
Member Author

ferrine commented Jul 16, 2017

Interested to hear from @dustinvtran too

@taku-y
Contributor

taku-y commented Jul 17, 2017

I think a reasonable way to go is to restrict so-called groups for local variables to have only one variable per group.

I agree. If two local variables should interact with each other, they can be merged when input to the network for posterior parameter estimation. In my opinion, you can first implement the "OK" case; then, if required, the other two cases can be considered.

@dustinvtran
Contributor

dustinvtran commented Jul 17, 2017

In practice, the tradeoff for any algorithm is its perceived benefit vs its complexity of implementation. I often use mean-field and structured approaches (i.e., low rank Gaussians) as simple baselines. They're fast and work well on many problems. Copula VI is difficult to implement if the library doesn't already have extensive support for copula distributions like R does, but it's also robust.

I haven't seen normalizing flows, operator VI, and many others used for problems besides training deep generative models. This is because they're quite difficult to train in practice, which means they're generally restricted to experts knowledgeable enough to get them to work.

Would have thought Edward would have them. Any idea why not?

Edward supports structured approximations. E.g., you just specify a Cholesky-parameterized normal distribution as your approximating family. If TensorFlow had support for copula distributions, that would also just work.™
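For reference, a generic NumPy sketch (not Edward's actual API) of what a Cholesky-parameterized normal approximation means:

import numpy as np

latent_ndim = 12
mu = np.zeros(latent_ndim)                              # variational mean
L = np.tril(np.random.randn(latent_ndim, latent_ndim))  # lower-triangular Cholesky factor
eps = np.random.randn(latent_ndim)                      # standard normal noise
z = mu + L.dot(eps)                                     # reparameterized draw, z ~ N(mu, L L^T)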

@ferrine
Member Author

ferrine commented Jul 17, 2017

What is the connection between low-rank (structured) approximations and Householder flows?

@ferrine
Member Author

ferrine commented Jul 18, 2017

I'm still a bit stuck on the architecture choice. I want to make more assumptions about local vars. They are the following:

  1. There is exactly one flexible dim.
  2. The first dimension is the one that holds the flexibility.

How do you feel about it?

@junpenglao
Member

I am still going through the paper...

@taku-y
Contributor

taku-y commented Jul 18, 2017

@ferrine Does the second choice mean that the first dimension is the minibatch size? I don't understand the first choice.

@ferrine
Member Author

ferrine commented Jul 18, 2017

That's right. The first one means the user should have only one subsampling dimension there.

@taku-y
Contributor

taku-y commented Jul 19, 2017

I can't come up with any case where "one flexible dim" is required. I think the 1st dim should always be the mini-batch size for latent variables.

@ferrine
Member Author

ferrine commented Jul 19, 2017

I feel like we're talking about the same thing: the 1st dim is for minibatches and no other.

@ferrine
Member Author

ferrine commented Jul 19, 2017

It is flexible because it depends on the batch size.
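To summarize the convention being converged on (values illustrative):

# A local latent z has shape (batch_size, latent_ndim); only the first
# dimension is flexible, i.e. it changes with the minibatch.
batch_size = 128   # varies between iterations -> the flexible dim
latent_ndim = 12   # fixed by the model
# z.shape == (batch_size, latent_ndim)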

@junpenglao
Member

@ferrine I don't think it is a good idea to rename it to Group, Global and Local. The names are too generic.

@ferrine
Member Author

ferrine commented Jul 19, 2017

@aseyboldt Could you please review my last commit? I've decided to use dynamic class dispatching to use a single implementation for Local and Global groups. Approximation families are supposed to inherit from the Group class. I wonder if you approve of this architecture choice, so I can focus on it and refactor all the rest of the related codebase.

@ferrine
Member Author

ferrine commented Jul 19, 2017

@junpenglao I see your point and had the same concerns. I'm thinking about a better name for Group. Maybe {Group|Local|Global}Approx?

@ferrine
Member Author

ferrine commented Jul 19, 2017

I feel like it is a good direction, but I still have a bad design: after calling __new__ I'll have a Local or Global group but not an approximation instance, so calling __init__ will give invalid results.

@ferrine
Member Author

ferrine commented Jul 19, 2017

I have 2 solutions:

  1. choose the implementation based on an ._is_local flag
  2. create a metaclass and substitute the instance before init (https://stackoverflow.com/a/5961102)
# metaclass' __call__: dispatch on the `local` kwarg
def __call__(cls, *args, **kwargs):
    is_local = kwargs.get('local', False)
    if is_local:
        _cls = Local
    else:
        _cls = Global
    # create an instance of the dispatched class, then run *its* init
    # (not cls.__init__, which would skip the subclass initializer)
    instance = object.__new__(_cls)
    _cls.__init__(instance, *args, **kwargs)
    return instance

Which is better?

@ferrine
Member Author

ferrine commented Jul 20, 2017

I think the first one will be better.
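A minimal sketch of that first option (the method names here are hypothetical; only the `_is_local` flag dispatch is from the discussion above):

class Group(object):
    def __init__(self, group, local=False, **kwargs):
        self._is_local = local  # fixed at init, drives the dispatch below

    def logq(self, z):
        # choose the implementation from the flag instead of substituting
        # the class in a metaclass
        if self._is_local:
            return self._logq_local(z)
        return self._logq_global(z)

    def _logq_local(self, z):
        raise NotImplementedError  # hypothetical local implementation

    def _logq_global(self, z):
        raise NotImplementedError  # hypothetical global implementation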

pymc3/math.py Outdated

def make_node(self, *matrices):
    if not matrices:
        raise ValueError('Got no matrices to allocate')
Member

Just “No matrices to allocate” is more appropriate.

def get_transformed(z):
    if hasattr(z, 'transformed'):
        z = z.transformed
    return z
Member

Need newline

Member Author

Done

@fonnesbeck
Member

Do we need an example or a notebook on how approximation groups are used? It won’t be entirely clear to users.

@ferrine
Member Author

ferrine commented Sep 4, 2017

I do not yet have a good example for that. It can be done after special features like Gumbel-Softmax.

@junpenglao
Member

@taku-y do you have any further comment? Otherwise, this is ready to merge!

@taku-y
Contributor

taku-y commented Sep 8, 2017

@ferrine @junpenglao I'm sorry for not responding. The code and documentation look good to me. Though I didn't run the notebooks, I believe it's fine to merge.

@twiecki twiecki merged commit 569a82e into pymc-devs:master Sep 8, 2017
@twiecki
Member

twiecki commented Sep 8, 2017

Thanks @ferrine, this is a massive piece of work and adds amazing new functionality. Thanks also to all the reviewers!

@ferrine ferrine deleted the approximations_refactoring branch September 8, 2017 15:39
try:
    approx = Approximation([cls([three_var_model.one], batched=True, **kw),
                            Group(None, vfam='mf')])
    inference = pm.KLqp(approx)
    approx = inference.fit(3, obj_n_mc=2)
Member Author

I see a typo here: batched->rowwise

Member

PR?
