WIP: Replace infix ~ for formulas with a model macro #9

Merged
ararslan merged 9 commits into master from aa/model-macro
Dec 31, 2016

ararslan
Member

Fixes #3

I've gone with => here since we need something that parses as an infix operator. Once ~ stops being parsed as an infix macro, it's not clear to me whether it will still be parsed as an infix operator at all. Thus I've opted for something we can guarantee is infix.

@ararslan
Member Author

Hah, so here's the thing. The curse of space-delimited macro arguments:

@model y => x1 + x2 - 1 # becomes Formula(:y, :(x1 + x2 - 1))
@model y => x1 + x2 -1  # errors, since -1 is parsed as a macro argument

I'm not sure what the best course of action for this is. For this approach to continue to be viable, we may have to require using parentheses, e.g.

@model(y => x1 + x2 -1) # parsed identically regardless of spacing

That's not ideal, but it's not so bad (IMO). Thoughts?
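
The spacing problem can be checked by parsing each spelling and counting the arguments the macro would receive. This is a minimal sketch for recent Julia versions (`Meta.parse` postdates this thread). `@m` is a placeholder macro name that is never defined or expanded, and `macro_args` is a hypothetical helper introduced only for the illustration.

```julia
# Return the arguments a macro call would receive, skipping the macro name
# and the LineNumberNode that recent Julia versions store in the call.
function macro_args(src::AbstractString)
    ex = Meta.parse(src)
    (ex isa Expr && ex.head === :macrocall) || error("not a macro call: $src")
    return [a for a in ex.args[2:end] if !(a isa LineNumberNode)]
end

spaced   = macro_args("@m y => x1 + x2 - 1")   # the whole formula is one argument
unspaced = macro_args("@m y => x1 + x2 -1")    # `-1` is split off as its own argument
parens   = macro_args("@m(y => x1 + x2 -1)")   # parentheses remove the ambiguity

println(length(spaced), " ", length(unspaced), " ", length(parens))
```

With parentheses the argument list is delimited explicitly, so whitespace around the minus sign no longer changes how many arguments the macro sees.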

@nalimilan
Member

Thanks for doing this. Indeed it looks like we're going to have to recommend using the parenthesized call to avoid spacing issues.

Regarding the name of the macro, why not @formula instead? It returns a Formula object, not a Model. A model is more than a formula. We could introduce convenience macros later like @fit or @model which would take a formula plus a model family, a distribution, etc.

Finally, I would still use ~ rather than => though, unless we're certain that it won't parse as an infix operator in the future. The tilde is a relatively established convention, while the arrow goes in the wrong direction. Or we could simply use =. Since that's pure bikeshedding, let's make a small poll: add thumbs up for ~, thumbs down for =>, and laugh (!) for = (multiple choices allowed).

@tkelman

tkelman commented Dec 15, 2016

Someone will have to implement the parsing change before we can be sure, but I suspect ~ will parse as a normal infix operator just without a definition in base, until the deprecation can make its way around. What I don't know about the formula macro here is whether that can be made to work on existing Julia versions with ~ parsing as an infix macro while also inside a conventional macro.

For all but the simplest one-block annotation macros, I think parenthesizing is better form anyway.

@kleinschmidt
Member

kleinschmidt commented Dec 15, 2016

You could also do something like creating a new expression for the RHS if there are more than 2 arguments. Something like Expr(:call, :+, args[2:end]...).

FWIW I'm in favor of using parentheses, but I imagine some people might still try to use the non-parenthesized syntax because it's used all over the place. A reasonable compromise might be to do the transformation (combining all the RHS arguments with +) but warn people that that's ambiguous when it happens.
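
The compromise described here (join the stray arguments with + and warn) could look roughly like the following. This is only a sketch of the idea, not code from the PR: `fold_extra_terms` is a hypothetical helper, and it assumes the first macro argument is a three-argument call such as `y => rhs`, which is how recent Julia versions represent it.

```julia
# Fold any extra space-separated macro arguments back into the right-hand
# side with `+`, warning that the unparenthesized spelling is ambiguous.
function fold_extra_terms(formula::Expr, extras...)
    isempty(extras) && return formula
    (formula.head === :call && length(formula.args) == 3) ||
        error("malformed expression in formula")
    @warn "Formula was split into several macro arguments; joining them with +. Use parentheses to avoid this."
    op, lhs, rhs = formula.args
    return Expr(:call, op, lhs, Expr(:call, :+, rhs, extras...))
end

# What a vararg macro would receive for `@model y => x1 + x2 -1`:
folded = fold_extra_terms(:(y => x1 + x2), -1)
println(folded)
```

A vararg macro could pass its first argument as `formula` and splat the rest as `extras`. The warning matters because joining with + is a guess about what the user meant.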

@ararslan
Member Author

ararslan commented Dec 15, 2016

Thanks for the feedback and help here, guys!

Regarding the name of the macro, why not @formula instead?

I went with @model because it was what Stefan had suggested and there was no opposition (but also no feedback except from David Kleinschmidt). @formula seems fine.

The tilde is a relatively established convention, while the arrow goes in the wrong direction

I'm not sure what you mean about the direction. Can you elaborate? I went with => per an initial suggestion from @dmbates. Personally I prefer it to ~; I've always found that convention rather unfortunate—a syntax inherited from S that seeped into other languages trying to emulate R's functionality. (I guess there's a paper somewhere that uses the notation, but IIRC it postdates its use in S? Could be wrong on that.) Though you're right, it has fairly widespread precedent and there isn't really a solid reason to deviate.

I had played around with using = but it makes parsing a formula out of the resulting expression more complicated when the formula itself is more complex. It's not not doable, just more complicated. 🙂

What I don't know about the formula macro here is whether that can be made to work on existing Julia versions with ~ parsing as an infix macro while also inside a conventional macro.

Should be fine, actually. Once ~ is no longer a macro, the macro here will have to change its logic (the head will no longer be macrocall with @~ as its first argument, but will have ~ as the head) but that's not hard to accommodate.

A reasonable compromise might be to do the transformation (combining all the RHS arguments with +) but warn people that that's ambiguous when it happens.

That seems reasonable to me. Then it's just a vararg macro, right?

@kleinschmidt
Member

Outside of R and friends, I've seen the tilde used descriptively in statistics/ML papers, as an informal way to state how a random variable depends on others. E.g., y ~ Normal(mu, sigma^2), instead of writing out the PDF for the normal distribution. But maybe that's people punning on how they work in R... Regardless, in my experience it's a pretty common way of expressing the relationship between random variables (and often in the kind of high-level way that a formula is meant to capture).

@ararslan
Member Author

Right, ~ is used for "distributed as" when describing the probability distribution of a random variable. But does that really translate to models? I suppose in the sense of explaining the variance of Y based on some X1, X2, ..., Xp it sort of makes sense, as the Xi have their own probability distributions. I guess I just haven't seen it for specifying models though. In my experience, that's nearly always =.

@kleinschmidt
Member

Yes, "distributed as" is a much more concise way of saying what I was thinking. I've only ever seen = when it's written out very explicitly, e.g. y_i = alpha + x_1,i beta_1 + x_2,i beta_2 + ....

@simonster
Member

simonster commented Dec 15, 2016

We definitely need to encourage parens, since fit(Model, @model y ~ 1 + x1 + x2, ...) is parsed as fit(Model, @model((y ~ 1 + x1 + x2, ...))) and not fit(Model, @model(y ~ 1 + x1 + x2), ...) (JuliaLang/julia#12021).
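
The effect can be observed by comparing how many arguments the outer call ends up with. This is a sketch for recent Julia versions; `f`, `M`, `d` and `@m` are placeholders, nothing is expanded or evaluated, and `outer_args` is a hypothetical helper. The exact shape of what the macro receives may differ from the 2016 parser, so only the outer argument count is inspected.

```julia
bare   = Meta.parse("f(M, @m y ~ 1 + x1, d)")
parens = Meta.parse("f(M, @m(y ~ 1 + x1), d)")

# Arguments of the outer call, excluding the function name itself.
outer_args(ex::Expr) = ex.args[2:end]

println(length(outer_args(bare)))    # the unparenthesized macro call has absorbed `d`
println(length(outer_args(parens)))  # `d` stays an argument of `f`
```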

@ararslan
Member Author

ararslan commented Dec 15, 2016

Simon makes a good point, and I think that makes a compelling case for requiring parentheses here, which would also make the vararg thing unnecessary.

Edit: Well, the requirement can't be enforced, but we can at least tell people that they may get unexpected behavior without parentheses.

length(ex.args) == 3 || error("malformed expression in formula")
lhs = Base.Meta.quot(ex.args[2])
rhs = Base.Meta.quot(ex.args[3])
elseif ex.head === :(~)

wouldn't it be :call for most infix operators?

Member Author

Yeah I think you're right. I was reusing the logic for =>, which isn't a call.
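
A helper that accepts both shapes discussed in this thread (an ordinary call to `~` and a macro call to `@~`) might be sketched as follows. `split_formula` is a hypothetical name, not the function in the PR. The macro-call shape is built by hand because recent Julia parsers no longer produce it.

```julia
# Split a formula expression into (lhs, rhs), accepting an ordinary call
# `~(lhs, rhs)` as well as a macro call to `@~`.
function split_formula(ex::Expr)
    if ex.head === :call && ex.args[1] === :~
        length(ex.args) == 3 || error("malformed expression in formula")
        return ex.args[2], ex.args[3]
    elseif ex.head === :macrocall && ex.args[1] === Symbol("@~")
        # Drop the macro name and any LineNumberNode the parser may have inserted.
        operands = [a for a in ex.args[2:end] if !(a isa LineNumberNode)]
        length(operands) == 2 || error("malformed expression in formula")
        return operands[1], operands[2]
    else
        error("expected a formula of the form lhs ~ rhs")
    end
end

# Recent Julia versions parse `~` as an ordinary infix call.
lhs, rhs = split_formula(:(y ~ 1 + x1))

# Hand-built stand-in for the older infix-macro representation.
legacy = Expr(:macrocall, Symbol("@~"), :y, :(1 + x1))
legacy_lhs, legacy_rhs = split_formula(legacy)
```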

@ararslan
Member Author

Okay, now that I've fixed my dumb Vim find/replace mistakes and the tests are passing, here's what it looks like:

@formula(y ~ 1 + x1 + x2 & x3)                       # bare object
fit(SomeModel, @formula(y ~ 1 + x1 + x2 & x3), ...)  # in context

Does that seem reasonable?
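
For readers who want to see the mechanism, here is a toy version of such a macro for recent Julia versions. `ToyFormula` and `@toyformula` are hypothetical stand-ins, not the package's `Formula` type or `@formula` macro. The sketch assumes `~` parses as an ordinary call, and it reuses the `Meta.quot` approach visible in the diff excerpt above.

```julia
# Hypothetical stand-in for the package's Formula type.
struct ToyFormula
    lhs
    rhs
end

macro toyformula(ex)
    (ex isa Expr && ex.head === :call && length(ex.args) == 3 && ex.args[1] === :~) ||
        error("expected a formula of the form lhs ~ rhs")
    # Quote both sides so they are stored as symbols/expressions, not evaluated.
    lhs = Meta.quot(ex.args[2])
    rhs = Meta.quot(ex.args[3])
    return :(ToyFormula($lhs, $rhs))
end

f = @toyformula(y ~ 1 + x1 + x2 & x3)
println(f.lhs)
println(f.rhs)
```

Neither `y` nor the `x` variables need to exist when the macro runs, since both sides are captured as syntax.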

@ararslan
Member Author

Hm, once ~ is no longer a macro, we'll need to be careful about its precedence as an infix operator. If it has the same precedence as +, for example, things could get fairly messy with a multivariate response.

That's something else I like about =>; it always infix constructs a Pair, and you don't need to worry about its precedence with other mathematical operators. (In fact, if we were to use that, we may just be able to replace Formula with Pair...) I suppose = also has that advantage.
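
To make the Pair idea concrete: without a macro, both sides have to be quoted by hand so that the terms stay unevaluated. This is a small sketch with made-up variable names, not a proposal from the thread.

```julia
# A Pair built with `=>` from quoted sides; quoting keeps the terms unevaluated.
formula_pair = :y => :(1 + x1 + x2)

response   = formula_pair.first    # the left-hand side, a Symbol
predictors = formula_pair.second   # the right-hand side, an Expr
println(typeof(formula_pair))
```

The manual quoting is the cost of avoiding a macro here, which is part of why `=>` loses its advantage once a macro is used anyway.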

@nalimilan
Member

I'm not sure what you mean about the direction. Can you elaborate?

It's just that the dependent variable is on the LHS, so it feels weird that the => arrow would go from it to the independent variables. A model consists in predicting the LHS from the RHS, not the other way around.

=> would have been interesting to avoid using a macro at all, but once we use our macro it doesn't have a clear advantage.

I kind of like = too, but outside of linear regression this notation is kind of abusive: with e.g. logistic regression or survival models, the relationship between the LHS and the RHS is more complex than that. This broader/weaker "equivalence relation" is one of the mathematical meanings described by Wikipedia for ~.

@dmbates
Contributor

dmbates commented Dec 16, 2016

Going back to the "is distributed as" interpretation of ~ in the exchange between @kleinschmidt and @ararslan, that actually fits in extremely well with linear and generalized linear models. A linear model is

𝐲 ~ Normal(𝐗β, σ²I)

in mathematical notation and very close to that as a Julia expression. It may be a little too wordy to write out the expression for the linear predictor in place of 𝐗 in that expression but it definitely relates the model to the expression.

For generalized linear models it is even more meaningful. The probability model for logistic regression is

𝐲 ~ Bernoulli(logit.(𝐗β))

using logit. to indicate the vectorized logit function.

I have seen that model written incorrectly as

yᵢ ~ Binomial(logit(xᵢβ) + ϵᵢ)

or something like that so often that I have lost track.

The point is that people want to write the model in a

signal + noise

form and have a simplified expression for the distribution of noise. Subtracting the mean from a multivariate Normal distribution leaves you with a simpler (i.e. mean 0) multivariate Normal distribution. Subtracting the mean from a multivariate Bernoulli doesn't simplify the distribution.
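
The contrast can be simulated with the standard library alone. Everything numeric below (seed, sample size, coefficients, noise scale) is made up for the illustration, and the mean function for the Bernoulli case is taken to be the inverse logit, which maps the linear predictor to a probability. This is a sketch of the argument, not a fitting procedure.

```julia
using Random

rng = MersenneTwister(1)          # arbitrary seed for the illustration
n = 200
X = hcat(ones(n), randn(rng, n))  # intercept plus one made-up predictor
β = [0.5, -1.0]                   # made-up coefficients
σ = 0.3                           # made-up noise scale
η = X * β                         # linear predictor

# Linear model: y ~ Normal(Xβ, σ²I), equivalently signal + mean-zero Normal noise.
y_normal = η .+ σ .* randn(rng, n)
noise = y_normal .- η             # the same Normal(0, σ²) draw for every value of η

# Logistic regression: each yᵢ is Bernoulli with success probability pᵢ.
p = 1 ./ (1 .+ exp.(-η))
y_bernoulli = rand(rng, n) .< p
residual = y_bernoulli .- p       # each entry is either -pᵢ or 1 - pᵢ
```

In the Normal case the leftover `noise` has one distribution regardless of the signal. In the Bernoulli case the possible values of `residual` depend on each pᵢ, so subtracting the mean does not yield a simpler, signal-free noise term.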

I think I still would be in favour of an alternative notation for the model formula but I just wanted to note that the ~ in the sense of "is distributed as" does have a connection to the model being described.

@ararslan
Member Author

Great explanation, Doug. Thanks! Makes total sense.

It's just that the dependent variable is on the LHS, so it feels weird that the => arrow would go from it to the independent variables. A model consists in predicting the LHS from the RHS, not the other way around.

Very good point. I guess in my mind I was thinking of the use of the arrow as pointing to what we're modeling the response as a function of, so "LHS as a function of (=>) RHS."

I'm on board with ~, but I still have concerns about its precedence as an operator once it's no longer parsed as an infix macro. Assuming it gets parsed with the same precedence as something like + rather than something like &&, I guess it could be okay to stipulate that the LHS be surrounded in parentheses in the case of a multivariate response?

@nalimilan
Member

I'm on board with ~, but I still have concerns about its precedence as an operator once it's no longer parsed as an infix macro. Assuming it gets parsed with the same precedence as something like + rather than something like &&, I guess it could be okay to stipulate that the LHS be surrounded in parentheses in the case of a multivariate response?

AFAICT, as Tony said, ~ is going to have the same precedence as infix operators with no assigned meanings in Base. So this should work, just like a + b ≍ x + y is currently parsed.

@tkelman

tkelman commented Dec 17, 2016

Precedence is set on an operator-by-operator basis, whether or not there's a meaning in Base; I don't think all currently unassigned operators are at the same precedence. Given that these packages are the main users so far (even after it gets changed to parse as a normal operator), I suspect it can be given whichever precedence would be most convenient for usage here. But it may change if someone comes up with a new meaning for the operator that should have a different precedence.
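
On recent Julia releases, which postdate this thread, the multivariate-response concern can be checked directly by parsing an expression with + on both sides. This is a quick sketch that only inspects the parsed expression. The variable names are placeholders and nothing is evaluated.

```julia
ex = Meta.parse("a + b ~ x + y")

println(ex.head, " ", ex.args[1])  # how the top-level expression is represented
println(ex.args[2])                # everything left of the tilde
println(ex.args[3])                # everything right of the tilde
```

If `~` binds more loosely than `+`, the whole sum `a + b` appears as the left operand and no extra parentheses are needed around a multivariate response.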

@ararslan
Member Author

Okay, sounds good to me.

@ararslan
Member Author

Any further comments or are we ready to commit to doing formulas this way?

@kleinschmidt
Member

I think this is good to go. Can you update the documentation to reflect these changes?

@ararslan
Member Author

Ah right, documentation! I had forgotten that some people like to know how to use software. 😜 Will do. Thanks for the reminder, @kleinschmidt!

@kleinschmidt
Member

To be perfectly honest, I only thought of it because I just wrote this documentation...

@kleinschmidt
Member

kleinschmidt commented Dec 19, 2016

Does anyone think it's worth saying something about why we're using an explicit macro, or would that be too much information for the documentation? I don't have a good sense of whether explaining these kinds of design decisions is helpful to users or just noise...

One reason why it might be good is that it might preempt griping of the "why can't we have naked formulas like in R" sort.

@ararslan
Member Author

ararslan commented Dec 19, 2016

I think documenting the design decision could be useful so long as it's separate from the usage documentation--otherwise it's a little noisy IMO. Though following Base Julia's lead, we could leave questions of design decisions to "search the GitHub issues/PRs." 😉

@kleinschmidt
Member

I agree with separating from usage. I'd rather be explicit about it (since a lot of what happens on github assumes a lot of context that a curious user might not have), but it's also more work to summarize things in a concise but useful way...

@ararslan
Member Author

ararslan commented Dec 19, 2016

You may be wondering why formulas in Julia require a macro, while in R they appear "bare." R supports nonstandard evaluation, allowing the formula to remain an unevaluated object while its terms are parsed out. Julia uses a much more standard evaluation mechanism, making this impossible using normal expressions. However, Julia provides macros, which allow code to be programmatically manipulated prior to evaluation. By constructing a formula using a macro, we're able to provide convenient, R-like syntax and semantics.

?

Edit: Edited to incorporate comments below
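
The difference between standard evaluation and keeping a formula as data can be shown in a few lines. This is a sketch for recent Julia versions. `undefined_response` and `undefined_predictor` are deliberately undefined placeholder names.

```julia
# Quoting keeps the formula as data; no variable needs to exist.
ex = :(undefined_response ~ 1 + undefined_predictor)

# Evaluating the same expression as ordinary code looks the variables up
# first, which fails because neither of them is defined.
evaluation_failed = try
    Core.eval(@__MODULE__, ex)
    false
catch err
    err isa UndefVarError
end

println(evaluation_failed)
```

A macro receives its argument in the quoted form, which is how a formula macro can work with terms that only name columns of a data set.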

@kleinschmidt
Member

I might say "while in R..." (instead of "but")

@nalimilan
Member

"R uses nonstandard evaluation" -> maybe "R supports/allows nonstandard evaluation"

@ararslan
Member Author

Good suggestions, thanks! I've edited my comment. How does that look now? If it looks alright I'll stick it in a commit and send it up here.

@kleinschmidt
Member

I might even say something like "Julia, unlike R, uses macros to explicitly indicate when code itself will be manipulated before it's evaluated", just to emphasize why this is a good idea (or at least reasonable).

@codecov-io

Current coverage is 93.56% (diff: 100%)

Merging #9 into master will decrease coverage by 0.89%

@@             master         #9   diff @@
==========================================
  Files             5          5          
  Lines           307        311     +4   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits            290        291     +1   
- Misses           17         20     +3   
  Partials          0          0          

Powered by Codecov. Last update b4f435b...e18f120

@ararslan
Member Author

Sorry for the delay. I've incorporated the comments in the docs regarding the reasoning behind macros for formulas. Unless there are further comments, I'll go ahead and merge this later today. I'm trying to get this good to go ASAP in light of the impending feature freeze for Base Julia v0.6, wherein we hope to stop parsing ~ as a macro (ref JuliaLang/julia#19598).

@kleinschmidt After this is merged, is there anything else that needs to be done here before we can tag a release?

Thanks so much for your help and input, everyone!

@ararslan
Member Author

Not sure what the nightly failure is about but it appears unrelated

@tkelman

tkelman commented Dec 31, 2016

Can this be merged? What deprecation strategy for the parsing of ~ in base would make the migration doable soon?

@ararslan
Member Author

ararslan commented Dec 31, 2016

I tried to write the macro in such a way that it should work regardless of whether ~ is a macro call or a regular call. The only thing that needs to be preserved is the parsing precedence of ~. (Does that answer your question?)

ararslan merged commit 1e86a5b into master Dec 31, 2016
ararslan deleted the aa/model-macro branch December 31, 2016 19:20

@tkelman

tkelman commented Jan 1, 2017

Kinda. Are other packages using the old syntax in tests, or does this only ever appear in user code?

@ararslan
Member Author

ararslan commented Jan 1, 2017

I think many packages that currently depend on DataFrames (which still defines this syntax) do use formulas in tests. Examples off the top of my head include MixedModels and FixedEffectModels.

@tkelman

tkelman commented Jan 1, 2017

And they're getting the definition from here? So I guess this package should give a dep warn for tilde when it gets called as a macro rather than inside of one?

@ararslan
Member Author

ararslan commented Jan 1, 2017

This package isn't registered yet, so the definition will come from here, but doesn't yet. The next release of DataFrames will not contain any of the formula code; that will all come from here. So for 0.6 compatibility with the current DataFrames we'd have to do a patch release with a deprecation warning. I guess one way to do it would be to have DataFrames define @~ with the dep warn and leave things as-is here.

@tkelman

tkelman commented Jan 1, 2017

Oh! Thanks for the explanation, didn't realize that. Is the old release DataFrames branch the only place that defined a ~ macro implementation? Would this be ready for other packages to use as a replacement soon?

Or instead of DataFrames, maybe Base could keep ~ parsing as a macro but always throwing a depwarn? Looks like this may not change in Base for 0.6, but if packages can be made ready then maybe we hard change the parsing during 1.0-dev and packages might not notice if they've transitioned to this?

@ararslan
Member Author

ararslan commented Jan 1, 2017

Is the old release DataFrames branch the only place that defined a ~ macro implementation?

Should still be on DataFrames master as well until this is registered, but otherwise yes.

Would this be ready for other packages to use as a replacement soon?

I think so, but @kleinschmidt may know better than I would.

Or instead of DataFrames, maybe Base could keep ~ parsing as a macro but always throwing a depwarn? Looks like this may not change in Base for 0.6, but if packages can be made ready then maybe we hard change the parsing during 1.0-dev and packages might not notice if they've transitioned to this?

👍 Sounds like the best course of action.
