WIP: Replace infix ~ for formulas with a model macro #9

ararslan · 2016-12-15T05:15:48Z

Fixes #3

I've gone with => here since we need something that parses as an infix operator. Once ~ stops being parsed as an infix macro, it's not clear to me whether it will still be parsed as an infix operator at all. Thus I've opted for something we can guarantee is infix.

ararslan · 2016-12-15T05:29:23Z

Hah, so here's the thing. The curse of space-delimited macro arguments:

@model y => x1 + x2 - 1 # becomes Formula(:y, :(x1 + x2 - 1))
@model y => x1 + x2 -1  # errors, since -1 is parsed as a macro argument

I'm not sure what the best course of action for this is. For this approach to continue to be viable, we may have to require using parentheses, e.g.

@model(y => x1 + x2 -1) # parsed identically regardless of spacing

That's not ideal, but it's not so bad (IMO). Thoughts?
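
The spacing issue can be reproduced by parsing the two spellings without expanding them. This is a minimal sketch, assuming a current Julia 1.x release (where `Meta.parse` exists and macro calls carry a `LineNumberNode`); `macro_args` is a hypothetical helper, and `@model` does not need to be defined because nothing is expanded.

```julia
# Hypothetical helper: return the arguments a macro call would receive,
# dropping the macro name and the LineNumberNode that Julia 1.x inserts.
function macro_args(src::AbstractString)
    ex = Meta.parse(src)
    @assert ex isa Expr && ex.head === :macrocall
    return [a for a in ex.args[2:end] if !(a isa LineNumberNode)]
end

spaced   = macro_args("@model y => x1 + x2 - 1")   # binary minus: one argument
unspaced = macro_args("@model y => x1 + x2 -1")    # `-1` becomes its own argument
parens   = macro_args("@model(y => x1 + x2 -1)")   # parentheses disable the space rule

println(length(spaced), " ", length(unspaced), " ", length(parens))
```

Comparing the argument counts shows why the parenthesized form is insensitive to spacing.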

nalimilan · 2016-12-15T09:38:26Z

Thanks for doing this. Indeed it looks like we're going to have to recommend using the parenthesized call to avoid spacing issues.

Regarding the name of the macro, why not @formula instead? It returns a Formula object, not a Model. A model is more than a formula. We could introduce convenience macros later like @fit or @model which would take a formula plus a model family, a distribution, etc.

Finally, I would still use ~ rather than => though, unless we're certain that it won't parse as an infix operator in the future. The tilde is a relatively established convention, while the arrow goes in the wrong direction. Or we could simply use =. Since that's pure bikeshedding, let's make a small poll: add thumbs up for ~, thumbs down for =>, and laugh (!) for = (multiple choices allowed).

tkelman · 2016-12-15T14:17:22Z

Someone will have to implement the parsing change before we can be sure, but I suspect ~ will parse as a normal infix operator just without a definition in base, until the deprecation can make its way around. What I don't know about the formula macro here is whether that can be made to work on existing Julia versions with ~ parsing as an infix macro while also inside a conventional macro.

For all but the simplest one-block annotation macros, I think parenthesizing is better form anyway.

kleinschmidt · 2016-12-15T16:05:42Z

You could also do something like creating a new expression for the RHS if there are more than 2 arguments. Something like Expr(:call, :+, args[2:end]...).

FWIW I'm in favor of using parentheses, but I imagine some people might still try to use the non-parenthesized syntax because it's used all over the place. A reasonable compromise might be to do the transformation (combining all the RHS arguments with +) but warn people that that's ambiguous when it happens.

ararslan · 2016-12-15T17:29:48Z

Thanks for the feedback and help here, guys!

> Regarding the name of the macro, why not @formula instead?

I went with @model because it was what Stefan had suggested and there was no opposition (but also no feedback except from David Kleinschmidt). @formula seems fine.

> The tilde is a relatively established convention, while the arrow goes in the wrong direction

I'm not sure what you mean about the direction. Can you elaborate? I went with => per an initial suggestion from @dmbates. Personally I prefer it to ~; I've always found that convention rather unfortunate—a syntax inherited from S that seeped into other languages trying to emulate R's functionality. (I guess there's a paper somewhere that uses the notation, but IIRC it postdates its use in S? Could be wrong on that.) Though you're right, it has fairly widespread precedent and there isn't really a solid reason to deviate.

I had played around with using = but it makes parsing a formula out of the resulting expression more complicated when the formula itself is more complex. It's not not doable, just more complicated. 🙂

> What I don't know about the formula macro here is whether that can be made to work on existing Julia versions with ~ parsing as an infix macro while also inside a conventional macro.

Should be fine, actually. Once ~ is no longer a macro, the macro here will have to change its logic (the head will no longer be macrocall with @~ as its first argument, but will have ~ as the head) but that's not hard to accommodate.

> A reasonable compromise might be to do the transformation (combining all the RHS arguments with +) but warn people that that's ambiguous when it happens.

That seems reasonable to me. Then it's just a vararg macro, right?
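
A rough sketch of the vararg compromise discussed here, assuming Julia 1.x, where `=>` parses as an ordinary call. `FormulaSketch` and `@model_sketch` are hypothetical names used only for illustration; this is not the code in the PR.

```julia
# Illustrative container; not the package's Formula type.
struct FormulaSketch
    lhs
    rhs
end

# Hypothetical vararg macro: extra space-delimited arguments are folded into
# the right-hand side with +, and a warning is emitted at expansion time.
macro model_sketch(ex, rest...)
    (ex isa Expr && ex.head === :call && length(ex.args) == 3 &&
     ex.args[1] === Symbol("=>")) ||
        error("expected an expression of the form lhs => rhs")
    lhs, rhs = ex.args[2], ex.args[3]
    if !isempty(rest)
        @warn "ambiguous formula: extra macro arguments were combined with +; consider using parentheses"
        rhs = Expr(:call, :+, rhs, rest...)
    end
    return :(FormulaSketch($(Meta.quot(lhs)), $(Meta.quot(rhs))))
end

f = @model_sketch y => x1 + x2 -1    # two macro arguments, recombined with +
g = @model_sketch(y => x1 + x2 -1)   # one macro argument, no warning
println(f.rhs)
println(g.rhs)
```

The two results are different expression trees for the same arithmetic, which is part of why the thread leans toward recommending parentheses instead.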

kleinschmidt · 2016-12-15T18:25:12Z

Outside of R and friends, I've seen the tilde used in a kind of descriptive way in statistics/ML papers, to describe in a more informal way the form of how a random variable depends on others. E.g., y ~ Normal(mu, sigma^2), instead of writing out the PDF for the normal distribution. But maybe that's people punning on how they work in R... Regardless, in my experience it's a pretty common way of expressing the relationship between random variables (and often in the kind of high-level way that a formula is meant to capture)

ararslan · 2016-12-15T18:31:24Z

Right, ~ is used for "distributed as" when describing the probability distribution of a random variable. But does that really translate to models? I suppose in the sense of explaining the variance of Y based on some X₁, X₂, ..., X_p it sort of makes sense, as the X_i have their own probability distributions. I guess I just haven't seen it for specifying models though. In my experience, that's nearly always =.

kleinschmidt · 2016-12-15T18:36:22Z

Yes, "distributed as" is a much more concise way of saying what I was thinking. I've only ever seen = when it's written out very explicitly, e.g. y_i = alpha + x_1,i beta_1 + x_2,i beta_2 + ....

simonster · 2016-12-15T18:53:37Z

We definitely need to encourage parens, since fit(Model, @model y ~ 1 + x1 + x2, ...) is parsed as fit(Model, @model((y ~ 1 + x1 + x2, ...))) and not fit(Model, @model(y ~ 1 + x1 + x2), ...) (JuliaLang/julia#12021).
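
The effect can be checked by counting how many arguments the outer call ends up with after parsing. A small sketch, assuming Julia 1.x; `fit`, `Model`, `@model`, and `data` are placeholders and nothing is evaluated, and `outer_args` is a hypothetical helper.

```julia
# Number of arguments the outer call receives (function name excluded).
outer_args(src) = length(Meta.parse(src).args) - 1

bare   = outer_args("fit(Model, @model y ~ 1 + x1 + x2, data)")
parens = outer_args("fit(Model, @model(y ~ 1 + x1 + x2), data)")

# In the bare form the macro call absorbs the trailing argument,
# so the outer call sees fewer arguments than were written.
println("bare: ", bare, ", parenthesized: ", parens)
```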

ararslan · 2016-12-15T19:25:53Z

Simon makes a good point, and I think that makes a compelling case for requiring parentheses here, which would also make the vararg thing unnecessary.

Edit: Well, the requirement can't be enforced, but we can at least tell people that they may get unexpected behavior without parentheses.

tkelman · 2016-12-15T19:42:44Z

src/formula.jl

+        length(ex.args) == 3 || error("malformed expression in formula")
+        lhs = Base.Meta.quot(ex.args[2])
+        rhs = Base.Meta.quot(ex.args[3])
+    elseif ex.head === :(~)


wouldn't it be :call for most infix operators?

ararslan · 2016-12-15

Yeah I think you're right. I was reusing the logic for =>, which isn't a call.
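
For reference, the layout of an infix `~` expression can be inspected directly. This sketch assumes a Julia 1.x release in which `~` is parsed as an ordinary operator; on the 0.5-era versions discussed in this thread the same text produced a macro call instead.

```julia
tilde = Meta.parse("y ~ x1 + x2")

println(tilde.head)       # head of the parsed expression
println(tilde.args[1])    # the operator is stored as the first argument
println(tilde.args[2:3])  # left-hand and right-hand sides
```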

ararslan · 2016-12-15T20:21:32Z

Okay, now that I've fixed my dumb Vim find/replace mistakes and the tests are passing, here's what it looks like:

@formula(y ~ 1 + x1 + x2 & x3)                       # bare object
fit(SomeModel, @formula(y ~ 1 + x1 + x2 & x3), ...)  # in context

Does that seem reasonable?
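
A minimal sketch of a macro with this shape, modeled loosely on the diff excerpt above and assuming Julia 1.x (where `y ~ rhs` is a `:call` expression). `FormulaSketch` and `@formula_sketch` are hypothetical names; the package's actual `Formula` type and macro may differ.

```julia
# Illustrative container; not the package's Formula type.
struct FormulaSketch
    lhs
    rhs
end

# Hypothetical macro: split `lhs ~ rhs` and store both sides unevaluated.
macro formula_sketch(ex)
    if ex isa Expr && ex.head === :call && ex.args[1] === :~
        length(ex.args) == 3 || error("malformed expression in formula")
        lhs = Base.Meta.quot(ex.args[2])
        rhs = Base.Meta.quot(ex.args[3])
    else
        error("expected formula separator ~, got $(repr(ex))")
    end
    return :(FormulaSketch($lhs, $rhs))
end

f = @formula_sketch(y ~ 1 + x1 + x2 & x3)
println(f.lhs)
println(f.rhs)
```

Because both sides are quoted, none of `y`, `x1`, `x2`, or `x3` needs to exist when the formula is built.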

ararslan · 2016-12-15T20:55:32Z

Hm, once ~ is no longer a macro, we'll need to be careful about its precedence as an infix operator. If it has the same precedence as +, for example, things could get fairly messy with a multivariate response.

That's something else I like about =>; it always infix constructs a Pair, and you don't need to worry about its precedence with other mathematical operators. (In fact, if we were to use that, we may just be able to replace Formula with Pair...) I suppose = also has that advantage.

nalimilan · 2016-12-16T10:01:49Z

> I'm not sure what you mean about the direction. Can you elaborate?

It's just that the dependent variable is on the LHS, so it feels weird that the => arrow would go from it to the independent variables. A model consists in predicting the LHS from the RHS, not the other way around.

=> would have been interesting to avoid using a macro at all, but once we use our macro it doesn't have a clear advantage.

I kind of like = too, but outside of linear regression this notation is kind of abusive: with e.g. logistic regression or survival models, the relationship between the LHS and the RHS is more complex than that. This broader/weaker "equivalence relation" is one of the mathematical meanings described by Wikipedia for ~.

dmbates · 2016-12-16T17:05:54Z

Going back to the "is distributed as" interpretation of ~ in the exchange between @kleinschmidt and @ararslan, that actually fits in extremely well with linear and generalized linear models. A linear model is

𝐲 ~ Normal(𝐗β, σ²I)

in mathematical notation and very close to that as a Julia expression. It may be a little too wordy to write out the expression for the linear predictor in place of 𝐗 in that expression but it definitely relates the model to the expression.

For generalized linear models it is even more meaningful. The probability model for logistic regression is

𝐲 ~ Bernoulli(logit.(𝐗β))

using logit. to indicate the vectorized logit function.

I have seen that model written incorrectly as

yᵢ ~ Binomial(logit(xᵢβ) + ϵᵢ)

or something like that so often that I have lost track.

The point is that people want to write the model in a

signal + noise

form and have a simplified expression for the distribution of noise. Subtracting the mean from a multivariate Normal distribution leaves you with a simpler (i.e. mean 0) multivariate Normal distribution. Subtracting the mean from a multivariate Bernoulli doesn't simplify the distribution.

I think I still would be in favour of an alternative notation for the model formula but I just wanted to note that the ~ in the sense of "is distributed as" does have a connection to the model being described.
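
The signal + noise point can be written out as follows. This is a sketch in standard notation rather than anything stated in the thread: $p_i$ is assumed to be the inverse logit of the linear predictor, and $\epsilon$ denotes the noise term.

```latex
\begin{align*}
\mathbf{y} \sim \mathcal{N}(\mathbf{X}\beta, \sigma^2 I)
  &\iff \mathbf{y} = \mathbf{X}\beta + \epsilon,
        \quad \epsilon \sim \mathcal{N}(\mathbf{0}, \sigma^2 I) \\
y_i \sim \mathrm{Bernoulli}(p_i),\; p_i = \mathrm{logit}^{-1}(\mathbf{x}_i^\top \beta)
  &\implies y_i - p_i \in \{-p_i,\; 1 - p_i\}
        \text{ with probabilities } 1 - p_i \text{ and } p_i
\end{align*}
```

In the Normal case the residual distribution no longer involves the mean; in the Bernoulli case it still depends on $p_i$, so there is no simpler "noise" distribution to add to the signal.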

ararslan · 2016-12-16T18:12:08Z

Great explanation, Doug. Thanks! Makes total sense.

> It's just that the dependent variable is on the LHS, so it feels weird that the => arrow would go from it to the independent variables. A model consists in predicting the LHS from the RHS, not the other way around.

Very good point. I guess in my mind I was thinking of the use of the arrow as pointing to what we're modeling the response as a function of, so "LHS as a function of (=>) RHS."

I'm on board with ~, but I still have concerns about its precedence as an operator once it's no longer parsed as an infix macro. Assuming it gets parsed with the same precedence as something like + rather than something like &&, I guess it could be okay to stipulate that the LHS be surrounded in parentheses in the case of a multivariate response?

nalimilan · 2016-12-17T10:56:07Z

> I'm on board with ~, but I still have concerns about its precedence as an operator once it's no longer parsed as an infix macro. Assuming it gets parsed with the same precedence as something like + rather than something like &&, I guess it could be okay to stipulate that the LHS be surrounded in parentheses in the case of a multivariate response?

AFAICT, as Tony said, ~ is going to have the same precedence as infix operators with no assigned meanings in Base. So this should work, just like a + b ≍ x + y is currently parsed.

tkelman · 2016-12-17T17:06:51Z

Precedence is set on an operator-by-operator basis whether or not there's a meaning in Base; I don't think all currently unassigned operators are at the same precedence. Given that these packages are the main users so far (even after it gets changed to parse as a normal operator), I suspect it can be given whichever precedence would be most convenient for usage here. But it may change if someone comes up with a new meaning for the operator that should have different precedence.
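
How a multivariate-style left-hand side groups can be checked by parsing it. This sketch assumes a Julia 1.x release; the precedence that was eventually assigned to `~` was not yet decided at the time of this discussion.

```julia
multi = Meta.parse("a + b ~ x + y")
println(multi.args[1])   # outermost operator
println(multi.args[2])   # left-hand side as grouped by the parser

pair = Meta.parse("a + b => x + y")
println(pair.args[1])    # outermost operator for the => spelling
```

If the outermost operator is the separator in both cases, no extra parentheses are needed around the left-hand side.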

ararslan · 2016-12-17T18:52:11Z

Okay, sounds good to me.

ararslan · 2016-12-19T19:33:56Z

Any further comments or are we ready to commit to doing formulas this way?

kleinschmidt · 2016-12-19T19:35:33Z

I think this is good to go. Can you update the documentation to reflect these changes?

ararslan · 2016-12-19T19:39:55Z

Ah right, documentation! I had forgotten that some people like to know how to use software. 😜 Will do. Thanks for the reminder, @kleinschmidt!

kleinschmidt · 2016-12-19T19:46:30Z

To be perfectly honest, I only thought of it because I just wrote this documentation...

kleinschmidt · 2016-12-19T20:28:48Z

Does anyone think it's worth saying something about why we're using an explicit macro, or would that be too much information for the documentation? I don't have a good sense of whether explaining these kinds of design decisions is helpful to users or just noise...

One reason why it might be good is that it might preempt griping of the "why can't we have naked formulas like in R" sort.

ararslan · 2016-12-19T20:31:46Z

I think documenting the design decision could be useful so long as it's separate from the usage documentation--otherwise it's a little noisy IMO. Though following Base Julia's lead, we could leave questions of design decisions to "search the GitHub issues/PRs." 😉

kleinschmidt · 2016-12-19T20:42:03Z

I agree with separating from usage. I'd rather be explicit about it (since a lot of what happens on github assumes a lot of context that a curious user might not have), but it's also more work to summarize things in a concise but useful way...

ararslan · 2016-12-19T21:47:19Z

You may be wondering why formulas in Julia require a macro, while in R they appear "bare." R supports nonstandard evaluation, allowing the formula to remain an unevaluated object while its terms are parsed out. Julia uses a much more standard evaluation mechanism, making this impossible using normal expressions. However, Julia provides macros, which allow code to be programmatically manipulated prior to evaluation. By constructing a formula using a macro, we're able to provide convenient, R-like syntax and semantics.

?

Edit: Edited to incorporate comments below

kleinschmidt · 2016-12-19T22:30:11Z

I might say "while in R..." (instead of "but")

nalimilan · 2016-12-19T23:23:49Z

"R uses nonstandard evaluation" -> maybe "R supports/allows nonstandard evaluation"

ararslan · 2016-12-20T00:31:38Z

Good suggestions, thanks! I've edited my comment. How does that look now? If it looks alright I'll stick it in a commit and send it up here.

kleinschmidt · 2016-12-22T17:31:17Z

I might even say something like "Julia, unlike R, uses macros to explicitly indicate when code itself will be manipulated before it's evaluated", just to emphasize why this is a good idea (or at least reasonable).
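
The difference between ordinary evaluation and macro-based capture can be shown in a few lines. A sketch assuming Julia 1.x; `@capture_sketch` and the `undefined_*` names are hypothetical and exist only for this illustration.

```julia
# Hypothetical macro: hand the code back as data instead of running it.
macro capture_sketch(ex)
    return QuoteNode(ex)
end

captured = @capture_sketch(y ~ 1 + x1)   # y and x1 are never looked up
println(captured)

# Without a macro the same kind of code is evaluated immediately,
# so the undefined names raise an error before any formula exists.
eager_error = try
    undefined_y ~ 1 + undefined_x1
    nothing
catch err
    err
end
println(typeof(eager_error))
```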

codecov-io · 2016-12-28T19:14:51Z

Current coverage is 93.56% (diff: 100%)

Merging #9 into master will decrease coverage by 0.89%

@@             master         #9   diff @@
==========================================
  Files             5          5          
  Lines           307        311     +4   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits            290        291     +1   
- Misses           17         20     +3   
  Partials          0          0

Powered by Codecov. Last update b4f435b...e18f120

ararslan · 2016-12-28T19:20:55Z

Sorry for the delay. I've incorporated the comments in the docs regarding the reasoning behind macros for formulas. Unless there are further comments, I'll go ahead and merge this later today. I'm trying to get this good to go ASAP in light of the impending feature freeze for Base Julia v0.6, wherein we hope to stop parsing ~ as a macro (ref JuliaLang/julia#19598).

@kleinschmidt After this is merged, is there anything else that needs to be done here before we can tag a release?

Thanks so much for your help and input, everyone!

ararslan · 2016-12-28T19:27:46Z

Not sure what the nightly failure is about but it appears unrelated

tkelman · 2016-12-31T03:34:20Z

Can this be merged? What deprecation strategy for the parsing of ~ in base would make the migration doable soon?

ararslan · 2016-12-31T19:19:49Z

I tried to write the macro in such a way that it should work regardless of whether ~ is a macro call or a regular call. The only thing that needs to be preserved is the parsing precedence of ~. (Does that answer your question?)

tkelman · 2017-01-01T01:49:39Z

Kinda. Are other packages using the old syntax in tests, or does this only ever appear in user code?

ararslan · 2017-01-01T01:50:57Z

I think many packages that currently depend on DataFrames (which still defines this syntax) do use formulas in tests. Examples off the top of my head include MixedModels and FixedEffectModels.

tkelman · 2017-01-01T01:52:54Z

And they're getting the definition from here? So I guess this package should give a dep warn for tilde when it gets called as a macro rather than inside of one?

ararslan · 2017-01-01T02:00:43Z

This package isn't registered yet, so the definition will come from here, but doesn't yet. The next release of DataFrames will not contain any of the formula code; that will all come from here. So for 0.6 compatibility with the current DataFrames we'd have to do a patch release with a deprecation warning. I guess one way to do it would be to have DataFrames define @~ with the dep warn and leave things as-is here.

tkelman · 2017-01-01T02:29:04Z

Oh! Thanks for the explanation, didn't realize that. Is the old release DataFrames branch the only place that defined a ~ macro implementation? Would this be ready for other packages to use as a replacement soon?

Or instead of DataFrames, maybe Base could keep ~ parsing as a macro but always throwing a depwarn? Looks like this may not change in Base for 0.6, but if packages can be made ready then maybe we hard change the parsing during 1.0-dev and packages might not notice if they've transitioned to this?

ararslan · 2017-01-01T20:04:53Z

> Is the old release DataFrames branch the only place that defined a ~ macro implementation?

Should still be on DataFrames master as well until this is registered, but otherwise yes.

> Would this be ready for other packages to use as a replacement soon?

I think so, but @kleinschmidt may know better than I would.

> Or instead of DataFrames, maybe Base could keep ~ parsing as a macro but always throwing a depwarn? Looks like this may not change in Base for 0.6, but if packages can be made ready then maybe we hard change the parsing during 1.0-dev and packages might not notice if they've transitioned to this?

👍 Sounds like the best course of action.

Replace infix ~ for formulas with a model macro

8b93202

Use s/model/formula/, s/=>/~/, use parens

b7386b9

tkelman reviewed Dec 15, 2016

ararslan added 3 commits December 15, 2016 11:53

Oops

3a8a364

Update macro logic to use :call

0a16392

Oops 2: Oops Harder

f01679a

Test error thrown by formula macro

81c5c4e

ararslan force-pushed the aa/model-macro branch from a2ef44d to 81c5c4e Compare December 17, 2016 18:56

ararslan added 2 commits December 19, 2016 12:20

Document the at-formula interface for Formulas

9520299

More at-~ -> at-formula

e18f120

Explain macro reasoning in the docs

5962702

ararslan merged commit 1e86a5b into master Dec 31, 2016

ararslan deleted the aa/model-macro branch December 31, 2016 19:20

StefanKarpinski mentioned this pull request Jan 26, 2017

things we should deprecate, 0.6 edition JuliaLang/julia#19598

Closed

22 tasks

StefanKarpinski mentioned this pull request Feb 2, 2017

change parsing of ~ from macro to operator JuliaLang/julia#20406

Closed

3 tasks

kleinschmidt mentioned this pull request Aug 28, 2017

Re-define ~: macro => function TuringLang/Turing.jl#173

Closed
