Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stan code canonicalizer #390

Merged
merged 8 commits into from
Dec 3, 2019
Merged

Stan code canonicalizer #390

merged 8 commits into from
Dec 3, 2019

Conversation

nhuurre
Copy link
Collaborator

@nhuurre nhuurre commented Nov 19, 2019

Based on issue #262.

A new command line option that removes deprecated syntax from Stan code and pretty prints the result. No attempt is made to infer priors.

I still need to figure out what tests are required. The pretty printer seems to have nothing but a pretty.expected file in every subdirectory of test/integration/good. How are those generated?

Copyright holder: Niko Huurre

By submitting this pull request, the copyright holder is agreeing to the license the submitted work under the following licenses:

@seantalts
Copy link
Member

dune runtest runs all of those tests. They're all expect tests based on the dune files in each directory. We could add some tests for the canonicalizer in a similar way to the pretty printer, though probably not as many.

Copy link
Member

@seantalts seantalts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good - was this your first ocaml?? We just needs some tests. You could make a new canonicalizer test folder and add a dune file that is really similar to any of the ones in good with stanzas for pretty printing (but with the argument changed from --pretty-print to whatever.

let deprecated_distributions = String.Table.create ()

let () =
Hashtbl.add_multi deprecated_functions ~key:"multiply_log" ~data:"lmultiply" ;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional nit: we don't actually need Hashtbl with its mutability; we could just create an immutable Map here. I know we use this Hashtbl pattern elsewhere for the immutable stan math signatures - not sure if that's a pattern we want to keep using or not even for really large amounts of data.

Map would look something like:

let deprecated_distributions = String.Map.of_alist_multi
[("multiply_log", "lmultiply")
; ("binomial_coefficient_log", "lchoose")
...
]
@@ List.concat_map Middle.Stan_math_signatures.distributions 
  ~f:(fun (fnkinds, name, _) ->
      List.map fnkinds ~f:(function| Lpdf -> (name ^ "_log", name ^ "_lpdf")
...

let rename_distribution name =
match Hashtbl.find deprecated_distributions name with
| None | Some [] -> name
| Some (rename :: _) -> rename
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we only use the first element of the value, why are we using a multi map?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it's my first Ocaml and I have no idea what I'm doing :P. The obvious alternative Hashtbl.add had a weird return type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hah, this is great for a first run! Would you mind changing it? You can use add_exn instead.

| GetLP -> GetTarget
| FunApp (f, {name; id_loc}, e) ->
if is_distribution name then
CondDistApp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the structure of the tree (replacing a FunApp with a CondDisApp). Is that because sometimes we have incorrectly labeled distributions as FunApps prior to this stage?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deprecated density functions with _log suffix are real functions and are called as such. They're replaced by _lpdf suffix functions and those require the vertical bar syntax.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woah. Stan3 says this code is incorrect

target += std_normal_lpdf(x);

and expects instead

target += std_normal_lpdf(x|);

but Stan2 can't parse that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, looks like the former should be okay. Let me make an issue for this.

@@ -83,6 +84,9 @@ let options =
; ( "--auto-format"
, Arg.Set pretty_print_program
, " Pretty prints the program to the console" )
; ( "--canonize"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hahaha. What about --print-canonical or something like that?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, sounds good.

@nhuurre
Copy link
Collaborator Author

nhuurre commented Nov 21, 2019

Thanks for the review.

A question I'd like input on is whether to use typed or untyped AST. For now it's untyped because that's what the pretty printer expects by default but should be easy switch. Either way, there's a problem:

  1. Stan2 incorrectly allows _lpdf suffix in tilde statements. This is checked for and throws an error in Semantic_check.ml. If this deprecated construct is to be fixed it must be done in untyped AST (or sematic check must be changed to allow it after all).
  2. the floating point absolute value function abs is deprecated and should be replaced by fabs when the argument is real; abs is still used as the integer absolute value function. Fixing this requires knowing the type of the argument so can only be done with typed AST.

@seantalts
Copy link
Member

An interesting conundrum! I think a good answer to this in the future might be that we split up type inference from type checking. That would make post-inference a really logical place for this phase. @enetsee has talked about this before but is pretty busy atm.

At the moment, could it make sense to have two phases in canonicalize, one pre and one post checking?

@nhuurre
Copy link
Collaborator Author

nhuurre commented Nov 23, 2019

File "/root/.opam/4.07.0/lib/core_kernel/core_kernel.dune", line 1, characters 0-0:
Warning: .dune files are ignored since 2.0. Reinstall the library with dune
>= 2.0 to get rid of this warning and enable support for the subsystem this
library provides.
File "src/analysis_and_optimization/dune", line 4, characters 2-16:
4 |   (inline_tests)
      ^^^^^^^^^^^^^^
Error: No inline tests backend found.

Hmm. I don't think that's my fault. All tests pass on my computer.

Anyway, I think this is basically done. A few more thoughts:

User defined functions with _log suffix are also deprecated and should be renamed. Semantic_check.mli won't let me look at the global symbol table so I'd have to read them directly from the AST. I think that would work fine (because shadowing is not allowed). I'd rather wait for a more cooperative semantic checker, though. There's also the issues of what to do if both _log and the corresponding _lpdf are defined.

The last remaining deprecated feature is lkj_cov distribution. Stan2 says

Info: the lkj_cov_log() sampling distribution is deprecated.  It will be removed in Stan 3.
Code LKJ covariance in terms of an lkj_corr() distribution on a correlation matrix and independent lognormals on the scales.

Strangely, Stan3 does not emit a warning when compiling test/integration/good/lkj_cov_deprecation1.stan.
Anyway, rewriting lkj_cov seems basically impossible.

Another transform I considered is combining declare and define.

real x;
real y;
x = f(z);

could be changed to

real y;
real x = f(z);

Makes the code a bit more compact. Now that declarations are allowed anywhere it's easy to simply defer the declaration until the first assignment, if it is within the same basic block. This is safe assuming the code does not use uninitialized variables. (Moving stuff between basic blocks would require complex analysis that's not available in AST.)

@seantalts
Copy link
Member

Hm, let me take a look at that failure - agreed it's probably not related to the code here.

@seantalts
Copy link
Member

What do you need the global symbol table for exactly? It seems off the cuff like it could be better to make an isolated pass that works on an AST than to couple this to the semantic check pass, even if that saves some duplicated work.

Strangely, Stan3 does not emit a warning when compiling test/integration/good/lkj_cov_deprecation1.stan.

Seems like it should! Would you mind filing an issue?

I don't have a strong opinion that declare-and-define is always superior, but seems fine to me. If there's a revolt we can always remove it :) But right now the code can rely on uninitialized variables being initialized to NaN (which is a feature I hate and would love to change). We might be able to change that already, I think there's only a couple of MPI tests in CmdStan that rely on that.

We do want to move stuff between basic blocks eventually :) Definitely for optimization and possibly also for canonicalization, not sure.

@nhuurre
Copy link
Collaborator Author

nhuurre commented Nov 25, 2019

Yeah, I don't know what I was thinking. The only problem with renaming user defined functions is that if there's a function or variable with the same name anywhere in the model it won't compile. But having something called foo_lpdf when you also have foo_log function is really really weird and the pretty printer will complain anyway if its output doesn't compile. So I guess there's no need to worry about it.

I haven't looked at the optimizer; I assume it handles many complex code transformations already. But it does all its work with the MIR so the results cannot be printed as Stan code.

There's an open issue for missing deprecation warnings. #165
(Maybe also useful for checking the completeness of the canonicalizer.)

I think I withdraw the declare-and-define thing. Cleaning up deprecated code is enough for one PR.

@seantalts
Copy link
Member

Thanks! Will get to this today or Wednesday I hope - uncharacteristically busy the past week and the next week or so.

String.drop_suffix name 4 ^ "_lpdf"
else String.drop_suffix name 4 ^ "_lpmf"
in
String.Table.add deprecated_userdefined ~key:name ~data:newname
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually want to ignore duplicates or do we want to do something with them? I'll merge for now but just food for thought going forwards.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicates come from forward declarations (which is needed if the function is recursive). The signatures given in function declaration and definition must be identical and no other function may have the same name.

@seantalts
Copy link
Member

This looks great, congrats on your first OCaml code and PR to the repo :) I'll add you to the contributors list.

@seantalts seantalts merged commit f9b98d7 into stan-dev:master Dec 3, 2019
@nhuurre nhuurre deleted the canonicalizer branch December 4, 2019 07:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants