Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/issue 2799 robust no u turn #2800

Merged
merged 12 commits into from
Aug 30, 2019

Conversation

betanalpha
Copy link
Contributor

Submission Checklist

  • Run unit tests: ./runTests.py src/test/unit
  • Run cpplint: make cpplint
  • Declare copyright holder and open-source license: see below

Summary

Resolves #2799.

Intended Effect

Adds additional no-u-turn checks across subtrees to avoid missing u-turns for approximately iid normal models.

How to Verify

See https://discourse.mc-stan.org/t/nuts-misses-u-turns-runs-in-circles-until-max-treedepth/9727/36?u=betanalpha.

Side Effects

Decrease in antithetic behavior for component means in correlated models.

Documentation

Inline.

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company):

Michael Betancourt

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

Copy link
Member

@bbbales2 bbbales2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good. Do we have any hypothesis test-tests in place for this like we do for the RNGs? That's how the last sampler problem got found right?

EXPECT_EQ(4 * init_momentum, sampler.rho_values.at(5));
EXPECT_EQ(8 * init_momentum, sampler.rho_values.at(6));
EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(5));
EXPECT_EQ(4 * init_momentum, sampler.rho_values.at(6));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be 14 more of these checks since there are now 3x the number of rho_values

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's only one aggregate rho exposed in the build_tree function. The others needed for the new tests are constructed from this aggregate rho on the fly internally and hence not accessible for unit testing. That internal construction could be repeated externally using the boundary momenta but then there's nothing to compare it to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this. These values are being returned so why shouldn't they be tested?

Copy link
Member

@bbbales2 bbbales2 Aug 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to work:

  EXPECT_EQ(7 * 3, sampler.rho_values.size());
  EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(0));
  EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(1));
  EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(2));
  EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(3));
  EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(4));
  EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(5));
  EXPECT_EQ(4 * init_momentum, sampler.rho_values.at(6));
  EXPECT_EQ(3 * init_momentum, sampler.rho_values.at(7));
  EXPECT_EQ(3 * init_momentum, sampler.rho_values.at(8));
  EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(9));
  EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(10));
  EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(11));
  EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(12));
  EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(13));
  EXPECT_EQ(2 * init_momentum, sampler.rho_values.at(14));
  EXPECT_EQ(4 * init_momentum, sampler.rho_values.at(15));
  EXPECT_EQ(3 * init_momentum, sampler.rho_values.at(16));
  EXPECT_EQ(3 * init_momentum, sampler.rho_values.at(17));
  EXPECT_EQ(8 * init_momentum, sampler.rho_values.at(18));
  EXPECT_EQ(5 * init_momentum, sampler.rho_values.at(19));
  EXPECT_EQ(5 * init_momentum, sampler.rho_values.at(20));

If it makes sense, just add em' in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry, I finally get what you were saying. I was thinking that there aren't any new rho values to check because the tree depth is the same but forgot how this code was instrumented to add a new rho values for every check, not every tree depth. I added the new checks split up into the values corresponding to the main check and the extra checks. Waiting to push in case any changes need to be made to the algorithm itself.

@betanalpha
Copy link
Contributor Author

There's currently no validation of the samplers as explicit integration tests because there's not enough functionality to run everything from pure C++ (the main obstruction is tearing out the var_context code and replacing it with a clear data access layer and then writing C++ data access layer callbacks so that everything can be run in memory; I have designs that were long ago discussed and agreed upon but no one has had the time to implement it). In any case I did check this code against some IID models of varying dimension and a correlated Gaussian in which the last bug manifested (using 100000 iterations to be sufficient sensitive to small effects) and everything looked good.

@bob-carpenter
Copy link
Contributor

bob-carpenter commented Aug 18, 2019 via email

@bbbales2
Copy link
Member

@betanalpha oh okay, got it. Can you e-mail me the models for the test? I guess I should just manually double check that it's all working okay since it's not automated.

@seantalts yo just at-ing you so you're aware of the state of tests. We should probably do something about this eventually.

@betanalpha
Copy link
Contributor Author

betanalpha commented Aug 24, 2019 via email

@bbbales2
Copy link
Member

Got it thanks. I'll run these checks Tuesday or Wednesday (still travelin'). And the extra return values from build_tree?

@junpenglao
Copy link

Thanks a lot for the discussion and the validation!!
Just a quick drive by comment: you could skip doing the additional U turn check at depth == 1 within each tree (as it is already been checked), and skip the U turn check altogether at depth == 0 at the tree doubling level (i.e., the outer loop, as it is already been checked during the tree building).

@betanalpha
Copy link
Contributor Author

The test you were looking at checks just the rho aggregation. The edge momenta and sharp momenta are checked in other tests, including the diag_e and softabs tests.

Copy link
Member

@bbbales2 bbbales2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried out adding tests for the rho_values. See if you think they make sense.

@bbbales2
Copy link
Member

I'm having trouble with this. I ran tests with the model here: https://github.com/stan-dev/stat_comp_benchmarks/blob/master/benchmarks/low_dim_corr_gauss/low_dim_corr_gauss.stan

I ran a bunch of these models with different numbers of post warmup samples with the new and old code. I made plots for each parameter (and generated quantity) of mean +- 2 * the MCSE.

The left column are the results with the new code and the right is with the old code. Different rows are different parameters. I plotted the true parameter value in red. The x axis is the number of post warmup draws. The things are all jittered -- this plot shows results run for 64000, 256000, 1024000, 4096000, and 16384000 iters. For large numbers of samples, the delta_var1 and delta_var2 are not zero.

I don't know if I'm testing this correctly or if I'm doing the check wrong, but can you try to reproduce this?

uturn

For reproducing, my test code looked something like this (I used the dense adaptation for the new and old codes):

for i in 1000 4000 8000 16000 64000
do
    for j in 8 9 10 11 12 13 14 15
    do
        ./corr sample num_samples=$i algorithm=hmc metric=dense_e >& /dev/null
        bin/stansummary --sig_figs=5 output.csv | grep "^z\|^delta_var" | tr -s " " "," | sed -e "s/^/$i,$j,/"
    done
done

Run this script for each code and dump the output in analysis.csv. Make the plots with something like:

library(tidyverse)
library(ggplot2)

results_new = read_csv("/home/bbales2/cmdstan-uturn/analysis.csv", col_names = FALSE) %>%
  mutate(which = "new")
results_orig = read_csv("/home/bbales2/cmdstan-current/analysis.csv", col_names = FALSE) %>%
  mutate(which = "original")

bind_rows(results_new,
          results_orig) %>%
  mutate(iters = X1,
         chain = X2,
         name = X3,
         mean = X4,
         ub = mean + 2 * X5,
         lb = mean - 2 * X5) %>%
  filter(iters > 50000) %>%
  ggplot(aes(iters, mean)) +
  geom_hline(data = tibble(name = c("delta_var1", "delta_var2", "z[1]", "z[2]"),
                           y = c(0.0, 0.0, 0.0, 3.0)), aes(yintercept = y), color = "red") +
  geom_point(aes(group = chain), position = position_dodge(width=0.5)) +
  geom_errorbar(aes(group = chain, ymin = lb, ymax = ub), position=position_dodge(width=0.5)) +
  scale_x_log10() +
  facet_grid(name ~ which, scales = "free_y") +
  theme(text = element_text(size=20))

@betanalpha
Copy link
Contributor Author

betanalpha commented Aug 28, 2019 via email

@betanalpha
Copy link
Contributor Author

What is the j loop for in the code? Just for repeated runs?

Just to confirm -- the iteration lengths shown in your test code are different from those used to make the figure, right?

I'm still not seeing any bias for higher numbers of iterations using the default settings (sample num_samples=4096000 random seed=838389).

1 chains: each with iter=(4096000); warmup=(0); thin=(1); 4096000 iterations saved.

                    Mean     MCSE   StdDev     5%       50%       95%    N_Eff  N_Eff/s    R_hat
z[1]            -9.2e-05  7.4e-04  1.0e+00   -1.6   1.5e-04   1.6e+00  1.8e+06  1.1e+04  1.0e+00
z[2]             3.0e+00  1.5e-03  2.0e+00  -0.29   3.0e+00   6.3e+00  1.8e+06  1.1e+04  1.0e+00
delta_var1      -4.2e-04  9.7e-04  1.4e+00  -1.00  -5.4e-01   2.8e+00  2.1e+06  1.3e+04  1.0e+00
delta_var2       2.5e-03  4.0e-03  5.7e+00   -4.0  -2.2e+00   1.1e+01  2.0e+06  1.2e+04  1.0e+00
delta_corr       2.4e-03  8.2e-04  1.1e+00   -1.1  -3.3e-01   2.2e+00  1.9e+06  1.1e+04  1.0e+00

@betanalpha
Copy link
Contributor Author

Same for a dense metric,

1 chains: each with iter=(4096000); warmup=(0); thin=(1); 4096000 iterations saved.

                    Mean     MCSE   StdDev     5%       50%       95%    N_Eff  N_Eff/s    R_hat
z[1]            -3.9e-04  4.8e-04  1.0e+00   -1.6   4.5e-04   1.6e+00  4.3e+06  2.9e+04  1.0e+00
z[2]             3.0e+00  1.0e-03  2.0e+00  -0.29   3.0e+00   6.3e+00  3.8e+06  2.6e+04  1.0e+00
delta_var1      -3.7e-04  9.9e-04  1.4e+00  -1.00  -5.5e-01   2.8e+00  2.1e+06  1.4e+04  1.0e+00
delta_var2      -5.4e-03  4.0e-03  5.6e+00   -4.0  -2.2e+00   1.1e+01  2.0e+06  1.4e+04  1.0e+00
delta_corr       2.8e-05  7.6e-04  1.1e+00   -1.1  -3.4e-01   2.2e+00  2.2e+06  1.5e+04  1.0e+00

@bbbales2
Copy link
Member

Yeah j is just repeated runs.

And yeah I go up to 16384000 in the figures. Try some repeated runs if you don't mind. It looks like it's messing up about half the time at 4096000 iters. I was a little concerned with the dense as well, but the plain Stan works with the dense metric.

I've checked git diff and I don't have extra changes. I will repeat this calculation overnight, but I just pulled down a clean repo and the md5sum of the model built with this clean repo and the model I used for these calculations in the figure is the same.

@bbbales2
Copy link
Member

bbbales2 commented Aug 28, 2019

This is the script I'll run:

for i in 1000 4000 16000 64000 256000 1024000 4096000 16384000
do
    for j in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    do
	for metric in diag_e dense_e
	do
	    ./corr sample num_samples=$i algorithm=hmc metric=$metric random seed=$j >& /dev/null
	    bin/stansummary --sig_figs=5 output.csv | grep "^z\|^delta" | tr -s " " "," | sed -e "s/^/$i,$j,$metric,/"
	done
    done
done

(edited to use random seed)

@betanalpha
Copy link
Contributor Author

I recommend fixing a seed and then using id to vary the seed for each run.

@betanalpha
Copy link
Contributor Author

This is weeeeeeeeeird. For some seeds I do see the effect that you're seeing.

1 chains: each with iter=(16384000); warmup=(0); thin=(1); 16384000 iterations saved.

                    Mean     MCSE   StdDev     5%       50%       95%    N_Eff  N_Eff/s    R_hat
z[1]             4.9e-04  3.4e-04  1.0e+00   -1.6   4.4e-04   1.6e+00  8.4e+06  1.2e+04  1.0e+00
z[2]             3.0e+00  7.3e-04  2.0e+00  -0.28   3.0e+00   6.3e+00  7.4e+06  1.0e+04  1.0e+00
delta_var1      -6.2e-03  4.7e-04  1.4e+00  -1.00  -5.5e-01   2.8e+00  8.9e+06  1.3e+04  1.0e+00
delta_var2      -2.4e-02  2.0e-03  5.6e+00   -4.0  -2.2e+00   1.1e+01  8.2e+06  1.2e+04  1.0e+00
delta_corr      -2.3e-03  4.0e-04  1.1e+00   -1.1  -3.4e-01   2.2e+00  7.8e+06  1.1e+04  1.0e+00

What's odd, however, it doesn't look there's some bias that manifests only once the standard error shrinks enough. Rather it looks like the standard error shrinks and then at some point a bias larger than the standard error suddenly appears. Then there's the fact that it's an inconsistent bias, and maybe correlated with slightly higher step sizes?

I'm going to go through the code carefully to see if I can see anything. On the other hand I'm not sure any algorithm has been tested to one million effective samples so if the current version didn't behave as expected then I'd suspect a numerical issue with the estimators themselves.

@bob-carpenter
Copy link
Contributor

bob-carpenter commented Aug 29, 2019 via email

@betanalpha
Copy link
Contributor Author

betanalpha commented Aug 29, 2019 via email

@betanalpha
Copy link
Contributor Author

I can't find anything that looks suspect in the code -- anyone want to take a second look? I've been poking around at edge cases and places where redundant calculations in the added checks might be vulnerable to insufficient precision introducing errors but I can't find anything inconsistent.

@bbbales2
Copy link
Member

Here's the results of running overnight.

uturn_metric

anyone want to take a second look?

I'll make a post in the discourse thread. I will eventually, but this looks like it could be a difficult problem to figure out.

@betanalpha
Copy link
Contributor Author

betanalpha commented Aug 29, 2019 via email

@betanalpha
Copy link
Contributor Author

Yup, that did the trick.

1 chains: each with iter=(16384000); warmup=(0); thin=(1); 16384000 iterations saved.

                    Mean     MCSE   StdDev     5%       50%       95%    N_Eff  N_Eff/s    R_hat
z[1]             4.9e-04  3.4e-04  1.0e+00   -1.6   4.4e-04   1.6e+00  8.4e+06  1.2e+04  1.0e+00
z[2]             3.0e+00  7.3e-04  2.0e+00  -0.28   3.0e+00   6.3e+00  7.4e+06  1.0e+04  1.0e+00
delta_var1      -6.2e-03  4.7e-04  1.4e+00  -1.00  -5.5e-01   2.8e+00  8.9e+06  1.3e+04  1.0e+00
delta_var2      -2.4e-02  2.0e-03  5.6e+00   -4.0  -2.2e+00   1.1e+01  8.2e+06  1.2e+04  1.0e+00
delta_corr      -2.3e-03  4.0e-04  1.1e+00   -1.1  -3.4e-01   2.2e+00  7.8e+06  1.1e+04  1.0e+00

goes to

1 chains: each with iter=(16384000); warmup=(0); thin=(1); 16384000 iterations saved.

                    Mean     MCSE   StdDev     5%       50%       95%    N_Eff  N_Eff/s    R_hat
z[1]             9.0e-05  3.5e-04  1.0e+00   -1.6   1.8e-04   1.6e+00  8.0e+06  1.2e+04  1.0e+00
z[2]             3.0e+00  7.3e-04  2.0e+00  -0.29   3.0e+00   6.3e+00  7.5e+06  1.2e+04  1.0e+00
delta_var1       2.6e-04  4.8e-04  1.4e+00  -1.00  -5.5e-01   2.8e+00  8.6e+06  1.3e+04  1.0e+00
delta_var2       2.4e-03  2.0e-03  5.7e+00   -4.0  -2.2e+00   1.1e+01  8.2e+06  1.3e+04  1.0e+00
delta_corr       2.3e-04  4.0e-04  1.1e+00   -1.1  -3.4e-01   2.2e+00  7.6e+06  1.2e+04  1.0e+00

I'll push once the tests are updated to the new behavior.

@bbbales2
Copy link
Member

Great thanks!

@betanalpha
Copy link
Contributor Author

Changes pushed -- I updated base_nuts_test with a new instrumented mock sampler that tests the edge momenta that weren't being updated correctly before. I'm going to rerun the performance tests to verify that hasn't been affected, but given that this only shows up at <Dr. Evil voice>one million iterations</Dr. Evil voice> I don't think there will be much of an affect.

I don't think I've ever seen a sampler tested to this level of precision! Pretty awesome.

@betanalpha
Copy link
Contributor Author

Performance tests look good -- small differences but the general story is pretty much the same. I'll post to Discourse.

@bob-carpenter
Copy link
Contributor

bob-carpenter commented Aug 29, 2019 via email

@bbbales2
Copy link
Member

Don't jinx us. This looks good. @bob-carpenter can you merge? I'm not authorized.

@bob-carpenter bob-carpenter merged commit 6a98b12 into develop Aug 30, 2019
twiecki pushed a commit to pymc-devs/pymc that referenced this pull request Oct 17, 2019
* [WIP] Robust U-turn check

Following the recent discussion on the Stan side: stan-dev/stan#2800

For experiment, do not merge.

* typo fix

* bug fix

*  Additional U turn check only when depth > 1 (to avoid redundant work).

* further logic to reduce redundant U Turn check.

* bug fix

fix error in recording the end point of the reversed subtree

* [WIP] Robust U-turn check

Following the recent discussion on the Stan side: stan-dev/stan#2800

For experiment, do not merge.

* typo fix

* bug fix

*  Additional U turn check only when depth > 1 (to avoid redundant work).

* further logic to reduce redundant U Turn check.

* bug fix

fix error in recording the end point of the reversed subtree

* Add release note.
@serban-nicusor-toptal serban-nicusor-toptal added this to the 2.20.0++ milestone Oct 18, 2019
@mcol mcol deleted the feature/issue-2799-robust_no_u_turn branch February 22, 2020 12:27
matt-graham added a commit to matt-graham/mici that referenced this pull request May 7, 2020
Adds keyword argument to dynamic integration transitions and samplers to 
enable extra subtree termination criterion checks as described in 
stan-dev/stan#2800. Extra subtree checks are set 
to be enabled by default for the `DynamicMultinomialHMC` sampler.
facebook-github-bot pushed a commit to facebookresearch/beanmachine that referenced this pull request Jun 4, 2021
Summary:
Pull Request resolved: #864

An issue with the U-turn condition was discovered and discussed in [this post in Stan forum](https://discourse.mc-stan.org/t/nuts-misses-u-turns-runs-in-circles-until-max-treedepth/9727)

TL;DR: we can make the U-turn condition more robust by introducing two additional checks across subtrees. This can help us avoid missing U-turns for approximately iid normal models.

{F619223264}

Since the tree combining code are almost identical in `_build_tree` and `propose`, I also take the chance to refactor them into a common function called `_combine_tree`. If you look closely you will notice that most part of `_combine_tree` are moved from existing code as-is. The only addition is the two additional call to `_is_u_turning`

Related PR that implements this change:
- Stan: stan-dev/stan#2800
- PyMC3: pymc-devs/pymc#3605
- Turing.jl: TuringLang/AdvancedHMC.jl#207
- DynamicHMC.jl: tpapp/DynamicHMC.jl#145

Reviewed By: neerajprad

Differential Revision: D28735950

fbshipit-source-id: ada4ebcad26a87ef5e697f422b5c5b17007afe42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Binary recursive no-u-turn criterion misses u-turn
5 participants