# Custom distributions in Stan {#ch-custom}
\index{Custom distribution}
Stan includes a large number of distributions, but what happens if we need a distribution that is not provided? In many cases, we can simply build a custom distribution by combining the ever-growing number of functions available in the Stan language.
## A change of variables with the reciprocal normal distribution {#sec-change}
In previous chapters, when faced with response times, we assumed a log-normal distribution. The log-normal distribution moves the inferences relating to the location parameter into a multiplicative frame (see online section \@ref(app-lognormal)). Another alternative, however, is to assume a \index{Reciprocal-normal distribution} "reciprocal"-normal distribution of response times. This is referred to hereafter as rec-normal in the text, and as the $\mathit{RecNormal}$ distribution in equations. That is, we may want to assume that the reciprocals of the response times are normally distributed. This assumption may be theoretically motivated [@HarrisEtAl2014; @Harris2012], or it may arise from the application of the Box-Cox variance-stabilizing transform procedure [@box1964analysis]. An example from psycholinguistics of a data analysis with a reciprocal transform of reading time data appears in @WuKaiserVasishth2017.
\begin{equation}
\begin{aligned}
1/y &\sim \mathit{Normal}(\mu, \sigma) \\
y &\sim \mathit{RecNormal}(\mu, \sigma)
\end{aligned}
(\#eq:recip)
\end{equation}
An interesting aspect of the \index{Reciprocal-normal distribution} rec-normal is that it affords an interpretation of the location parameter in terms of \index{Rate} rate or speed rather than time. Analogously to the case of the log-normal, neither the location $\mu$ nor the scale $\sigma$ is on the same scale as the dependent variable $y$. These parameters are on the same scale as the transformed dependent variable (here, $1/y$) that is assumed to be normally distributed.
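To make the rate interpretation concrete, here is a small illustration (ours, not part of the original analyses): with $\mu$ on the reciprocal-millisecond scale, $1/\mu$ gives a rough typical response time, and $1000 \cdot \mu$ the corresponding speed in responses per second.
```{r}
mu <- 0.002 # location on the 1 / ms scale
1 / mu      # a rough typical response time in ms (500)
mu * 1000   # speed: about 2 responses per second
```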
By setting $\mu = 0.002$ and $\sigma = 0.0004$, data can be generated from the rec-normal that looks right-skewed and not unlike a distribution of response times. The code below shows some summary statistics of the generated data.
```{r}
N <- 100
mu <- 0.002
sigma <- 0.0004
rt <- 1 / rnorm(N, mu, sigma)
c(mean = mean(rt), sd = sd(rt), min = min(rt), max = max(rt))
```
Figure \@ref(fig:recnormalhist) shows the distribution generated with the code shown above.
(ref:recnormalhist) Distribution of synthetic data with a rec-normal distribution with parameters $\mu = 0.002$ and $\sigma =0.0004$.
```{r recnormalhist, echo = FALSE, fig.cap = "(ref:recnormalhist)", message = FALSE, fig.height = 2.2}
ggplot(tibble(rt = rt), aes(rt)) +
geom_histogram()
```
We can fit the rec-normal distribution to the response times in a simple model with a normal likelihood, by storing the reciprocal of the \index{Response time} response times (`1/RT`) in the vector variable `recRT` (rather than storing the "raw" response times):
```
model {
  // priors go here
  target += normal_lpdf(recRT | mu, sigma);
}
```
One issue here is that the parameters of the likelihood, `mu` and `sigma`, are going to be very far away from the \index{Unit scale} unit scale ($0.002$ and $0.0004$, respectively). Due to the way Stan's sampler is built, parameters that are too small (much smaller than 1) or too large (much larger than 1) can cause convergence problems. A straightforward solution is to define the parameters in reciprocal seconds rather than reciprocal milliseconds; this makes the parameter values $1000$ times larger ($2$ and $0.4$, respectively). We can do this using the \index{\texttt{transformed parameters}} `transformed parameters` block.^[In this specific case, we could also transform the dependent variable first, but as will become clear later, we want to avoid changing our dependent variable.] Although we use `mu` and `sigma` to fit the data, the priors are defined on the parameters `mu_s` and `sigma_s` ($\mu_s$ and $\sigma_s$).
Unless one can rely on previous estimates, finding good priors for $\mu_s$ and $\sigma_s$ is not trivial. To define appropriate priors, we would need to start with relatively arbitrary priors, inspect the prior predictive distributions, adjust the priors, and repeat the inspection of prior predictive distributions until these distributions start to look realistic. In the interest of conserving space, we skip this iterative process here, and assign the following priors:
\begin{equation}
\begin{aligned}
\mu_s & \sim \mathit{Normal}(2, 1)\\
\sigma_s & \sim \mathit{Normal}_+(0.4, 0.2)
\end{aligned}
\end{equation}
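Although we skip the iterative process, a single step of such a prior predictive simulation can be sketched as follows (our own hedged sketch; it assumes that `rtnorm()` from the `extraDistr` package is available, as used later in this chapter):
```{r}
## One prior predictive draw (hedged sketch):
## sample parameters from the priors, rescaled to the 1 / ms scale,
mu_pp <- rnorm(1, 2, 1) / 1000
sigma_pp <- rtnorm(1, 0.4, 0.2, a = 0) / 1000
## and generate rec-normal response times with those parameters
rt_pp <- 1 / rnorm(100, mu_pp, sigma_pp)
summary(rt_pp)
```
Implausible summaries (e.g., negative or extreme response times) would indicate that the priors need to be adjusted.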
```{r normalrectstore, echo = FALSE}
normal_recrt <- system.file("stan_models",
"normal_recrt.stan",
package = "bcogsci"
)
```
This model is implemented in the file `normal_recrt.stan`, available in the `bcogsci` package:
```{stan output.var = "normal_recrt", code = readLines(normal_recrt), tidy = TRUE, comment="", eval = FALSE, cache = FALSE, cache.lazy = FALSE}
```
Fit and display the summary of the previous model:
```{r normalrectrt, results = "hide", message = FALSE}
normal_recrt <- system.file("stan_models",
"normal_recrt.stan",
package = "bcogsci")
fit_rec <- stan(normal_recrt, data = list(N = N, recRT = 1 / rt))
```
```{r}
print(fit_rec, pars = c("mu", "sigma"), digits = 4)
```
Is a rec-normal likelihood more appropriate than a log-normal likelihood? As things stand, we cannot compare the models with these two likelihoods. This is because the dependent variables are different: we cannot compare reciprocal response times with untransformed response times on the millisecond scale. \index{Model comparison} Model comparison with the Bayes factor or cross-validation can only compare models with the same dependent variables; see chapters \@ref(ch-comparison)-\@ref(ch-cv).
If we do want to compare the reciprocal-normal likelihood and the log-normal likelihood, we have to set up the models with the two likelihoods in such a way that the dependent measure is on the raw millisecond scale in each model. This means that for the reciprocal normal likelihood, the model will receive as data raw reading times in milliseconds, and these will be treated as a transformed random variable from reciprocal reading times. This approach is discussed next, but requires knowledge of the Jacobian adjustment (see @RossProb).
### Scaling a probability density with the Jacobian adjustment
To work with the original dependent variable (`RT` rather than `1/RT`) we need to perform a \index{Change of variables} *change of variables*: if the random variable $X$ represents the reciprocal reading times ($1/RT$), we can transform it into a new random variable $Y = 1/X = 1/(1/RT) = RT$, which represents reading times.
This change of variables requires an adjustment to the (unnormalized log) posterior probability to account for the distortion caused by the transform.^[Not every transformation is valid: univariate changes of variables must be monotonic and differentiable, and multivariate changes of variables must be surjective and differentiable.] The probability must be scaled by a \index{Jacobian adjustment} *Jacobian* adjustment, which, in a univariate case such as this one, is the absolute value of the derivative of the transform.^[In the multivariate case, it is the absolute value of the determinant of the Jacobian of the transform, the matrix of all its first-order partial derivatives; see section 6.7 of @RossProb, and p. 707 of @adams2018calculus.] The absolute value of a number is represented by enclosing it in two vertical bars: e.g., $|-2| = 2$.
\begin{equation}
\begin{aligned}
p(\mathit{RT}_n | \mu, \sigma )& = \mathit{Normal}(1/\mathit{RT}_n | \mu, \sigma) \left| \frac{\mathrm{d}}{\mathrm{d}\mathit{RT}_n} 1/\mathit{RT}_n \right| = \\
& \mathit{Normal}(1/\mathit{RT}_n | \mu, \sigma) \cdot \left| - 1/\mathit{RT}_n^2 \right| = \\
& \mathit{Normal}(1/\mathit{RT}_n | \mu, \sigma) \cdot 1/\mathit{RT}_n^2
\end{aligned}
\end{equation}
If we omit the Jacobian adjustment, we are essentially fitting a different, unnormalized probability density function (this will be shown later in this chapter). The discrepancy between the correct and the incorrect posterior depends on how large the Jacobian adjustment is for our model.
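This can be verified numerically in R (a hedged check of the mathematics, not part of the model code; the parameters are on the reciprocal-seconds scale introduced earlier):
```{r}
## Density of RT with the Jacobian adjustment 1 / RT^2 ...
drecnorm <- function(rt, mu, sigma) dnorm(1 / rt, mu, sigma) / rt^2
## ... and without it
drecnorm_noadj <- function(rt, mu, sigma) dnorm(1 / rt, mu, sigma)
## With the adjustment, the density integrates to (essentially) one:
integrate(drecnorm, lower = 0, upper = Inf, mu = 2, sigma = 0.4)$value
## Without it, it does not:
integrate(drecnorm_noadj, lower = 0, upper = Inf, mu = 2, sigma = 0.4)$value
```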
Because Stan works in log space, rather than multiplying by the Jacobian adjustment, we add its logarithm to the log-probability density ($\log(1/\mathit{RT}_n^2) = - 2 \cdot \log(\mathit{RT}_n)$). The contribution to the log likelihood (of $\mu$ and $\sigma$) from an individual observation $\mathit{RT}_{n}$, together with its adjustment, is as follows.
\begin{equation}
\log(\mathit{Normal}(1/\mathit{RT}_n | \mu, \sigma)) - 2 \cdot \log(\mathit{RT}_n)
\end{equation}
We obtain the log likelihood based on all the $N$ observations by summing the log likelihood of individual observations.
\begin{equation}
\log \mathcal{L} = \sum_{n=1}^N \log(\mathit{Normal}(1/\mathit{RT}_n | \mu, \sigma)) - \sum_{n=1}^N 2 \cdot \log(\mathit{RT}_n)
\end{equation}
In Stan, this summing up is done as follows. The function `normal_lpdf` applied to a vector already returns the sum of the individual log-probability densities. The Jacobian adjustment `2 * log(RT)`, which yields a vector of values, has to be summed and added in manually (the term therefore becomes `-sum(2 * log(RT))`).
```
target += normal_lpdf(1 ./ RT | mu, sigma)
- sum(2 * log(RT));
```
Before fitting the model with the change of variables, we are going to truncate the distribution.
We didn't encounter negative values in our synthetic data, but this was because the distribution was not too spread out; that is, the scale was much smaller than the location ($\sigma \ll \mu$). However, in principle we could end up generating negative values. For this reason, we truncate the underlying normal distribution (for more details, see online section \@ref(app-truncation)). The \index{Reciprocal truncated normal distribution} reciprocal truncated normal distribution has been argued to be an appropriate model of \index{Response time} response times, neural \index{Inter-spike interval} inter-spike intervals, and \index{Latency distributions of saccades} latency distributions of saccades in a simple optimality model in which reward is maximized to yield an optimal response rate [@HarrisEtAl2014; @Harris2012].
Because we have $N$ observations, the truncation consists of adding ` - N * normal_lccdf(0 | mu, sigma)` to `target`; also see section \@ref(sec-pupilstan). The complete model, `recnormal_rt.stan`, including the truncation is as follows:
```{r, echo = FALSE}
recnormal_rt <- system.file("stan_models", "recnormal_rt.stan", package = "bcogsci")
```
```{stan output.var = "recnormal_rt", code = readLines(recnormal_rt), tidy = TRUE, comment="", eval = FALSE, cache = FALSE, cache.lazy = FALSE}
```
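For reference, the contribution to `target` in this model can be mirrored directly in R; the following is our own hedged sketch using base R's `dnorm()` and `pnorm()`, which can be handy for checking the Stan code against an independent implementation:
```{r}
## Hedged R analogue of the Stan target: normal log density of 1 / RT,
## plus the log Jacobian adjustment, minus one truncation term
## (log P(X > 0)) per observation
recnormal_loglik <- function(rt, mu, sigma) {
  sum(dnorm(1 / rt, mu, sigma, log = TRUE)) -
    sum(2 * log(rt)) -
    length(rt) * pnorm(0, mu, sigma, lower.tail = FALSE, log.p = TRUE)
}
```
For example, `recnormal_loglik(rt, 0.002, 0.0004)` evaluates the log likelihood of the generated data at the true parameter values.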
Next, generate data from a reciprocal truncated normal distribution:
```{r, echo = FALSE}
set.seed(123)
```
```{r}
N <- 500
mu <- .002
sigma <- .0004
rt <- 1 / rtnorm(N, mu, sigma, a = 0)
```
Fit the model to the data:
```{r recnormalrt, results = "hide", message = FALSE, eval = !file.exists("dataR/fit_rec.RDS")}
recnormal_rt <- system.file("stan_models",
"recnormal_rt.stan",
package = "bcogsci")
fit_rec <- stan(recnormal_rt, data = list(N = N, RT = rt))
```
```{r, echo= FALSE, eval = TRUE}
if(!file.exists("dataR/fit_rec.RDS")){
saveRDS(fit_rec, "dataR/fit_rec.RDS")
} else {
fit_rec <- readRDS("dataR/fit_rec.RDS")
}
```
Print the posterior summary:
```{r}
print(fit_rec, pars = c("mu", "sigma"), digits = 4)
```
We get the same results as before, but now we could potentially compare this model to one that assumes a log-normal likelihood, for example.
An important question is the following: does every transformation of a random variable require a Jacobian adjustment? Essentially, if we assign a distribution to a random variable and then transform that variable, no Jacobian adjustment is required.
Alternatively, if we apply a transformation to a random variable first, and then assign a distribution to the transformed random variable afterwards, we have a \index{Change of variables} change of variables. This requires a Jacobian adjustment. This is the reason why `1 ./ RT` requires a Jacobian adjustment in the previous model, but `mu` and `sigma` do not: the priors are assigned to the parameters `mu_s` and `sigma_s`, which are only then transformed into `mu` and `sigma`.
As a last step, we can encapsulate our new distribution in a function, and also create a \index{Random number generator} random number generator function. This is done in a \index{Block} block called \index{\texttt{functions}} `functions`. To create a function, we need to specify the type of every argument and the type that the function returns.
As a simple example, if we would like a function to center a vector, we would do it as follows:
\index{\texttt{num\_elements}}
```
functions {
vector center(vector x) {
vector[num_elements(x)] centered;
centered = x - mean(x);
return centered;
}
}
data {
...
}
parameters {
...
}
model {
...
}
```
We want to create a log(PDF) function, similar to the native Stan functions that end in `_lpdf`. Our function will take as arguments a vector `RT`, and real numbers `mu` and `sigma`. In `_lpdf` functions, (some of) the arguments (e.g., `RT`, `mu`, and `sigma`) can be vectorized, but the output of the function is always a real number: the sum of the log(PDF) evaluated at every value of the random variable on the left of the pipe. As we show below, to do this we just move the right-hand side of `target` in our original model inside our new function. See the section entitled Further Reading at the end of this chapter for more about Stan functions.
For our custom random number generator function (whose name should end in `_rng`), we have the added complication that the distribution is truncated. We simply generate a random value from a normal distribution and take its reciprocal; if the number is above zero, it is returned; otherwise, a new number is drawn.
This implementation may be inefficient when rejections occur frequently. Another, more complex implementation using the inverse CDF approach can be found in @Stan2023 (section 21.10 of the User's guide).
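In R, the rejection scheme might look as follows (our own hedged sketch; the actual `_rng` in `recnormal_rt_f.stan` follows the same logic in the Stan language):
```{r}
## Hedged R analogue of the rejection-based random number generator:
## draw from a normal, keep only positive draws, return the reciprocal
rrecnorm <- function(n, mu, sigma) {
  rt <- numeric(n)
  for (i in seq_len(n)) {
    repeat {
      z <- rnorm(1, mu, sigma)
      if (z > 0) break # reject non-positive draws (truncation at zero)
    }
    rt[i] <- 1 / z
  }
  rt
}
```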
The complete code can be found in `recnormal_rt_f.stan`, and is also shown below:
```{r, echo = FALSE}
recnormal_rt_f <- system.file("stan_models",
"recnormal_rt_f.stan",
package = "bcogsci")
```
```{stan output.var = "recnormal_rt_f", code = readLines(recnormal_rt_f), tidy = TRUE, comment="", eval = FALSE, cache = FALSE, cache.lazy = FALSE}
```
Fit the model to the simulated data:
```{r recnormal_rt_f, results = "hide", message = FALSE,eval = !file.exists("dataR/fit_rec_f.RDS") }
recnormal_rt_f <- system.file("stan_models",
"recnormal_rt_f.stan",
package = "bcogsci")
fit_rec_f <- stan(recnormal_rt_f, data = list(N = N, RT = rt))
```
```{r, echo = FALSE}
if (!file.exists("dataR/fit_rec_f.RDS")) {
saveRDS(fit_rec_f, "dataR/fit_rec_f.RDS")
} else {
fit_rec_f <- readRDS("dataR/fit_rec_f.RDS")
}
```
Print the summary:
```{r}
print(fit_rec_f, pars = c("mu", "sigma"), digits = 4)
```
## \index{Validation of a computed posterior distribution} Validation of a computed posterior distribution {#sec-validSBC}
The model converged, but did it work? At the very least, we expect the simulated data that we used to test the model to have a distribution very similar to the posterior predictive distributions, since we are fitting the data with the same function that we used to generate them. Figure \@ref(fig:ppcheckrec) shows that the simulated data and the posterior predictive distributions are indeed similar.
(ref:ppcheckrec) A posterior predictive check of `fit_rec` in comparison with the simulated data.
```{r ppcheckrec, message = FALSE, fig.cap = "(ref:ppcheckrec)", fold = TRUE, fig.height = 2.2}
ppc_dens_overlay(rt, yrep = extract(fit_rec_f)$rt_pred[1:500, ]) +
coord_cartesian(xlim = c(0, 2000))
```
Our synthetic data set was generated by first defining a \index{Ground truth} *ground truth* vector of parameters $\langle \mu; \sigma \rangle$. We would expect that the true values of the parameters $\mu$ and $\sigma$ should be well inside the bulk of the posterior distribution of our model. We investigate this by plotting the true values of the parameters together with their posterior distributions using the function \index{\texttt{mcmc\_recover\_hist}} `mcmc_recover_hist()` of `bayesplot` as shown below in Figure \@ref(fig:recoveryrec).
(ref:recoveryrec) Posterior distributions of the parameters of `fit_rec` together with their true values.
```{r recoveryrec, message = FALSE, fig.cap = "(ref:recoveryrec)" , fig.height = 2.2}
post_rec <- as.data.frame(fit_rec_f) %>%
select(mu, sigma)
mcmc_recover_hist(post_rec, true = c(mu, sigma))
```
Even though this approach can capture serious misspecifications in our models, it has three important shortcomings.
First, it's unclear what we should conclude if the true value of a parameter is not in the bulk of its posterior distribution. Even in a well-specified model, if we inspect enough parameters we will find that for some parameters, the bulks of their respective marginal posterior distributions can be relatively far away from their true values.
The second shortcoming is that if the posteriors that we obtain are very wide, it might not be clear what constitutes the bulk of the posterior distribution that needs to be inspected.
The third shortcoming is that regardless of how a successful recovery of parameters is defined, we might be able to recover a posterior based on data generated from some parts of the parameter space, but not based on data generated from other parts of parameter space [@talts2018validating]. However, inspecting the entire parameter space is infeasible.
An alternative to our approach of plotting the true values of the parameters together with their posterior distributions is called simulation-based calibration [@talts2018validating; @ModrakEtAl2023], which extends ideas of @Cooketal2006. This is discussed next.
### The \index{Simulation-based calibration} simulation-based calibration procedure
@talts2018validating suggest simulation-based calibration as a procedure for validating a model. We check that the Bayesian computation is \index{Faithful} *faithful* in the sense that it is not \index{Bias} biased: our goal is to ensure that, given the data and priors, our model neither overestimates nor underestimates the parameters, and that it does not have incorrect precision. We employ two different implementations of the same statistical model: a generator capable of directly simulating draws from the prior predictive distribution (steps 1-2 below), and a probabilistic program that, combined with a posterior approximation algorithm (such as a Stan sampler), samples from the posterior distribution (step 3). Finally, we verify that the results from both implementations have the same distribution conditional on the data (steps 4-5).
1. Generate true values (i.e., a \index{Ground truth} ground truth) for the parameters of the model from the priors:
\begin{equation}
\tilde{\Theta}_n \sim p(\boldsymbol{\Theta})
\end{equation}
Here, $p(\boldsymbol{\Theta})$ represents the joint prior distribution of the vector of parameters. In our previous example, the vector of parameters is $\langle \mu; \sigma \rangle$, and the joint prior distribution consists of two independent distributions, a normal and a truncated normal. Crucially, the prior distributions should be meaningful; that is, one should use priors that are at least regularizing or weakly informative. The \index{Prior space} prior space plays a crucial role because it determines the parameter space in which we will verify the correctness, or faithfulness, of the model.
Assuming 150 samples (from the priors), this step for our model would be as follows:
```{r}
N_sim <- 150
mu_tilde <- rnorm(N_sim, 2, 1) / 1000
sigma_tilde <- rtnorm(N_sim, 0.4, 0.2, a = 0) / 1000
```
2. Generate multiple (`N_sim`) data sets based on the probability density function used as the likelihood:
\begin{equation}
\tilde{D}_n \sim p(\boldsymbol{D} | \tilde{\Theta}_n)
\end{equation}
The previous equation indicates that each data set $\tilde{D}_n$ is sampled using the corresponding generated parameters $\tilde{\Theta}_n$. Following our previous example, use the reciprocal truncated normal distribution to generate data sets of response times:
```{r}
## Create a placeholder for all the simulated data sets
rt_tilde <- vector(mode = "list", length = N_sim)
# Number of observations
N <- 500
for (n in 1:N_sim) {
rt_tilde[[n]] <- 1 / rtnorm(N, mu_tilde[n], sigma_tilde[n], a = 0)
}
```
Now, `rt_tilde` is a list of `r N_sim` vectors with `r N` observations each. Each element of the list represents a simulated data set, corresponding to $\tilde{D}_{n}$.
3. Fit the same Bayesian model to each of the generated data sets.
After fitting the same model to each data set, we should compare the recovered \index{Posterior distribution} posterior distribution for each parameter of each simulated data set; that is, $p(\Theta_n | \tilde{D}_{n})$, with the parameters that were used to generate the data, $\tilde{\Theta}_{n}$. This comparison is represented by `...` in the code below (that is, it is not yet implemented in the code shown below). This raises the question of how to compare the posterior distribution with the ground truth values of the parameters.
```{r, eval = FALSE}
for (n in 1:N_sim) {
  fit <- stan(recnormal_rt,
              data = list(N = N, RT = rt_tilde[[n]])
)
...
}
```
The procedure shown before (1-3) defines a natural condition for assessing whether the computed posterior distributions match the exact posterior distributions [@talts2018validating; @Cooketal2006]. This condition is met if the generator, the probabilistic program, and the posterior approximation algorithm match: the generator and the probabilistic program must accurately represent the same data-generating process, while the posterior approximation algorithm should yield samples that closely match the correct posterior for the probabilistic program, based on data simulated from the prior. Any discrepancy suggests a mismatch among the components [@ModrakEtAl2023].^[We are working under the assumption that Stan, which yields the posterior approximation, works correctly. In principle, if we assume that our model is correct, we can also use simulation-based calibration to examine whether our approximation to the posterior distribution is correct (that is, whether Stan's sampler, or any posterior approximation method, works correctly); for an example, see @yao2018yes.]
The next two steps describe a well-defined comparison of the exact posterior and prior distribution [different calibration checking methods differ in how exactly they test for mismatches between the distributions; @ModrakEtAl2023].
4. Generate an \index{Ensemble of rank statistics} ensemble of rank statistics of prior samples relative to corresponding posterior samples.
@talts2018validating demonstrate that the match between the exact posterior and prior can be evaluated by examining whether, for each element of $\boldsymbol{\Theta}$ (e.g., $\langle \mu; \sigma \rangle$), the \index{Rank statistics} *rank statistics* of the prior sample relative to $S$ posterior samples are \index{Uniformly distributed} uniformly distributed across $[0, S]$. In other words, for each sample of the prior (e.g., $\tilde{\mu}_{n}$), we calculate the number of samples of the posterior that are smaller than the corresponding value of the sampled parameter, and we examine the distribution of these counts. A mismatch between the exact posterior and our posterior would show up as a \index{Non-uniform distribution} non-uniform distribution.
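The logic behind this uniformity can be illustrated with a toy simulation in R (our own hedged illustration, unrelated to the response time model): if a draw and a set of "posterior" samples come from the same distribution, the rank of the draw among the samples is uniformly distributed.
```{r}
## Toy illustration of the rank-uniformity property (hedged):
set.seed(42)
rank_toy <- replicate(1000, {
  theta_tilde <- rnorm(1)  # a "prior" draw
  post <- rnorm(1023)      # stand-in for S = 1023 posterior samples
  sum(post < theta_tilde)  # rank statistic in 0, ..., 1023
})
## The ranks are (approximately) uniform on {0, ..., 1023}:
summary(rank_toy)
```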
As a next step, we examine this new distribution visually using a histogram with $S + 1$ possible ranks (from $0$ to $S$). There are two issues that we need to take into account [@talts2018validating]:
(1) Regardless of our model, histograms will deviate from uniformity if the posterior samples are dependent. This can be solved by \index{Thinning} *thinning* the samples of the posterior, that is, removing a number of intermediate samples [@talts2018validating recommend thinning by between 6 and 10].
(2) To reduce the noise in the histogram, we should have bins of equal size. If $S + 1$ is divisible by a large power of $2$, e.g., $1024$, we will be able to re-bin the histogram easily with bins of equal size. We complete the code shown in step 3 by also generating an ensemble of rank statistics.
Here we encapsulate steps 1 to 4 into a function.
```{r, message = FALSE, results = "hide", eval = !file.exists("dataR/df_rank.RDS")}
gen_ranks <- function(...,
N_sim = 150,
N_s = 1024 * 6, # Total samples from the posterior
N = 500 # Number of observations
){
mu_tilde <- rnorm(N_sim, 2, 1) / 1000
sigma_tilde <- rtnorm(N_sim, 0.4, 0.2, a = 0) / 1000
# Thin by 6, and remove one extra sample so that S + 1 = 1024
thinner <- seq(from = 1, to = N_s, by = 6)[-1]
# Placeholders
rank_mu <- rep(NA, N_sim)
rank_sigma <- rep(NA, N_sim)
rt_tilde <- vector(mode = "list", length = N_sim)
# Loop over the simulations
for (n in 1:N_sim) {
message("Fit number ", n)
# repeat the sampling until the model converges
repeat{
rt_tilde[[n]] <- 1 / rtnorm(N,
mu_tilde[n],
sigma_tilde[n],
a = 0
)
# Fit is only stored temporarily
fit <- stan(...,
data = list(N = N,
RT = rt_tilde[[n]]),
warmup = 2500,
iter = 2500 + N_s / 4,
refresh = -1,
control = list(adapt_delta = .999,
max_treedepth = 14))
# continue if there are no divergent transitions
# and all Rhat are smaller than 1.05
if (!get_num_divergent(fit) &
all(monitor(fit)[,"Rhat"] <= 1.05)) break
message("Repeating fit ",n)
}
post <- extract(fit)
# Number of samples of the posterior that is smaller than
# the corresponding value of the parameter sampled
rank_mu[n] <- sum(post$mu[thinner] < mu_tilde[n])
rank_sigma[n] <- sum(post$sigma[thinner] < sigma_tilde[n])
}
#data frame with ranks:
tibble(sim = rep(1:N_sim, 2),
variable = rep(c("mu", "sigma"),
each = N_sim),
rank = c(rank_mu, rank_sigma))
}
set.seed(123)
# generate ranks with the default values:
df_rank <- gen_ranks(recnormal_rt)
```
```{r echo = FALSE}
if (file.exists("dataR/df_rank.RDS")) {
df_rank <- readRDS("dataR/df_rank.RDS")
} else {
saveRDS(df_rank, file = "dataR/df_rank.RDS")
}
```
5. Graphically assess the \index{Correctness of the model} correctness of the model using \index{Rank histogram} rank histograms.
As a last step, we use a histogram for each parameter to identify deviations from uniformity.
If the posterior estimates are correct, each simulation (i.e., each individual rank) falls into any of the $B$ bins with probability $1/B$, so the count in each bin follows a $\mathit{Binomial}(N_{sim}, 1/B)$ distribution. This allows us to complement the histograms with confidence intervals indicating the variation expected from a uniform histogram.
Use the code below to build the plot shown in Figure \@ref(fig:SBChand). We can conclude that the implementation of our model (in the parameter space determined by the priors) is correct.
(ref:SBChand) Rank histograms of $\mu$ and $\sigma$ from the reciprocal truncated normal distribution. The dashed lines represent the 99% confidence interval.
```{r SBChand, message = FALSE, fig.cap = "(ref:SBChand)" , fig.height = 2.2}
B <- 16
bin_size <- 1024 / 16
ci_l <- qbinom(0.005, size = N_sim, prob = 1 / B)
ci_u <- qbinom(0.995, size = N_sim, prob = 1 / B)
ggplot(df_rank, aes(x = rank)) +
geom_histogram(breaks = seq(0, to = 1023, length.out = B + 1),
closed = "left", colour = "black") +
scale_y_continuous("count") +
geom_hline(yintercept = c(ci_l, ci_u), linetype = "dashed") +
facet_wrap(~variable, scales = "free_y")
```
Next, we consider an example where simulation-based calibration can reveal a problem.
### An example where simulation-based calibration reveals a problem
Let's assume that we made an error in the model implementation. We'll fit an \index{Incorrect model} **incorrect** model: rather than the complete likelihood including the normal PDF and the truncation and Jacobian adjustments, we will fit only the normal PDF evaluated at the reciprocal of the response times, that is `target += normal_lpdf(1 ./ RT | mu, sigma);`.
```{r, echo = FALSE}
recnormal_code_bad <- "
data {
int<lower = 1> N;
vector[N] RT;
}
parameters {
real mu_s;
real<lower = 0> sigma_s;
}
transformed parameters {
//aux pars so that it samples from pars that are not that small
real mu = mu_s / 1000;
real sigma = sigma_s / 1000;
}
model {
target += normal_lpdf(mu_s | 2, 1);
target += normal_lpdf(sigma_s | .4, .2)
- normal_lccdf(0 | .4, .2);
target += normal_lpdf(1 ./ RT | mu, sigma)
// - N * normal_lccdf(0 | mu, sigma)
// - sum(2 * log(RT))
;
}
"
```
```{r, echo = FALSE, message = FALSE, results = "hide"}
fit_bad <- stan(model_code = recnormal_code_bad,
data = list(N = N, RT = rt),
control = list(adapt_delta = .99, max_treedepth = 14))
```
Assessing the correctness of the model by looking at the recovery of the parameters is misleading in this case; this is evident from Figure \@ref(fig:recoverybad). One could conclude that the model is fine based on this plot, since the true values of the parameters are well inside the bulk of their posterior distributions.
(ref:recoverybad) Posterior distributions of the parameters of an incorrect model (truncation and Jacobian adjustments missing) together with their true values.
```{r recoverybad, message = FALSE, fig.cap = "(ref:recoverybad)", echo = FALSE , fig.height = 2.2}
post_rec <- as.data.frame(fit_bad) %>%
select("mu", "sigma")
mcmc_recover_hist(post_rec, true = c(mu, sigma))
```
```{r, message = FALSE, results = "hide", eval = !file.exists("dataR/df_rank_bad.RDS"), echo = FALSE}
set.seed(123)
df_rank_bad <- gen_ranks(model_code = recnormal_code_bad)
saveRDS(df_rank_bad, file = "dataR/df_rank_bad.RDS")
```
```{r, message = FALSE, results = "hide", eval = file.exists("dataR/df_rank_bad.RDS"), echo = FALSE}
df_rank_bad <- readRDS("dataR/df_rank_bad.RDS")
```
However, the rank histograms produced by the simulation-based calibration procedure in Figure \@ref(fig:SBCbad) show very tall bins at the left and at the right of each histogram, exceeding the 99% CI. This is a very clear indication that there is a problem with the \index{Model specification} model specification: our incorrect model is overestimating $\mu$ and underestimating $\sigma$. This example illustrates the importance of simulation-based calibration.
(ref:SBCbad) Rank histograms of $\mu$ and $\sigma$ from the **incorrect** implementation of the reciprocal truncated normal distribution. The dashed lines represent the 99% confidence interval.
```{r SBCbad, message = FALSE, fig.cap = "(ref:SBCbad)" , echo = FALSE, fig.height = 2.2}
ggplot(df_rank_bad, aes(x = rank)) +
geom_histogram(breaks =
seq(0, to = 1023, length.out = B + 1),
closed = "left", colour = "black") +
scale_y_continuous("count") +
geom_hline(yintercept = c(ci_l, ci_u), linetype = "dashed") +
facet_wrap(~variable, scales = "free_y")
```
### Issues with and limitations of simulation-based calibration
In the previous sections, we have used only rank histograms. Even though histograms are a very intuitive means of visualization for assessing model correctness, they might not be sensitive enough to small deviations [@talts2018validating]. Other visualizations might be better suited. Another limitation of histograms is that they are sensitive to the number of bins [@sailynoja2022graphical]. One could bin the histograms multiple times, but this approach is difficult to interpret when there are many parameters, and can be vulnerable to multiple testing biases [@talts2018validating]. One alternative to rank histograms is visualizations based on the \index{Empirical cumulative distribution function of the ranks} empirical cumulative distribution function of the ranks. This is briefly presented in online section \@ref(app-sbc).
Even though simulation-based calibration is a comprehensive approach to test model correctness, it has several drawbacks, regardless of the means of visualization:
(i) This approach requires fitting a model many times, which can be too time intensive for complex models. It's worth bearing in mind that code with errors can sometimes show clear \index{Catastrophic failure} "catastrophic" failures in the recovery of the parameters. For this reason, one could start the verification of model correctness with a simple \index{Recovery of parameters} recovery of parameters [see @talts2018validating; @gelman2020bayesian], and then proceed with the simulation-based calibration procedure for at least some suspect aspects of the model (e.g., when working with a custom likelihood).
(ii) Another major limitation of simulation-based calibration is that it is concerned exclusively with the computational aspects of the analysis and offers no guarantee for any single observation. A complete Bayesian workflow should include prior and posterior predictive checks, see the online chapter \@ref(ch-workflow) and @schad2020toward.
(iii) Finally, it's important to apply simulation-based calibration only after appropriate priors are set. This is important because only the priors determine the parameter space that will be inspected. Weak priors that may be good enough for estimation may nevertheless lead to a parameter space that is unrealistically large since, during the calibration stage, there are no independent data to constrain the space. This means that it is important to conduct prior predictive checks before assessing the correctness of the model with simulation-based calibration.
## Another \index{Custom distribution} custom distribution: The exponential distribution \index{Exponential distribution} implemented manually
There are cases when one needs a random variable with a distribution that is not included in Stan. Often one can find the PDF (or a CDF) derived in a paper, and one needs to implement it manually by writing the log PDF in the Stan language.
Even though the exponential distribution is included in Stan, we demonstrate how we would include it step-by-step as if it weren't available. This example extends the demonstration in section 22.1 of the User's guide [@Stan2023].
The exponential distribution is used to model \index{Waiting time} waiting times until a certain event occurs. Its PDF is the following:
\begin{equation}
f(x | \lambda) =
\begin{cases}
\lambda e^{-\lambda x} & x \geq 0,\\
0 & x < 0.
\end{cases}
\end{equation}
The parameter $\lambda$ is often called the rate parameter and must be positive. A higher value of the rate parameter leads to shorter waiting times on average. The mean of the distribution is $1/\lambda$. The exponential distribution has the key property of being \index{Memoryless} *memoryless*. What this means is explained in @RossProb as follows: suppose that $X$ is an exponentially distributed random variable that represents the lifetime of an item (e.g., $X$ could represent the time it takes for a radioactive particle to completely decay). If the item is $t$ time-units old, then its remaining life has the same probability distribution as the life of a new item. Mathematically, this amounts to stating that, for any additional time $s$,
\begin{equation}
P(X>s + t | X> t)= P(X>s)
\end{equation}
The implication of the memoryless property is that we do not need to know the age of an item to know what the distribution of its remaining life is.
To give another example of memorylessness: the conditional probability that a certain event will happen in the next 100 ms is the same regardless of whether we have already been waiting 1000 ms, 10 ms, or 0 ms. Although the exponential distribution is not commonly used for modeling response times in cognitive science, it has been used in the past [@ashby1980decomposing; @ashby1982testing]. We focus on this distribution because of its simple analytical form.
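The memoryless property can be checked numerically with base R's `pexp()` (a small illustrative check of the math, not part of the original code):
```{r}
## Numerical check of memorylessness: P(X > s + t | X > t) = P(X > s)
lambda_ex <- 1 / 200
s <- 100
t <- 1000
## Conditional probability of surviving s more time units after t:
pexp(s + t, rate = lambda_ex, lower.tail = FALSE) /
  pexp(t, rate = lambda_ex, lower.tail = FALSE)
## Unconditional probability of surviving s time units;
## both equal exp(-lambda_ex * s):
pexp(s, rate = lambda_ex, lower.tail = FALSE)
```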
As a first step, we make our own version of the \index{PDF} PDF in R, and then check that it integrates to one to verify that this is a proper distribution (and that we haven't introduced a typo in the formula).
To avoid \index{Underflow} underflow, that is, getting a zero instead of a very small number, we'll work on the log scale. This will also be useful when we implement this function in Stan. The log PDF is
\begin{equation}
\log(f(x| \lambda ))= \log(\lambda) -\lambda x
\end{equation}
where $x >0$.
Implement this function in `R` with the same arguments as the `d*` family of functions: if `log = TRUE` the output is a log density; call this new function `dexp2`.
```{r}
dexp2 <- function(x, lambda = 1, log = FALSE) {
  ## The density is zero (log density -Inf) for x < 0
  log_density <- ifelse(x >= 0, log(lambda) - lambda * x, -Inf)
if (log == FALSE) {
exp(log_density)
} else {
log_density
}
}
```
Verify that this function integrates to one for some point values of its parameter $\lambda$ (here, $\lambda = 1$ and $\lambda = 20$):
```{r}
dexp2_l1 <- function(x) dexp2(x, 1)
integrate(dexp2_l1, lower = 0, upper = Inf)$value
dexp2_l20 <- function(x) dexp2(x, 20)
integrate(dexp2_l20, lower = 0, upper = Inf)$value
```
To test our function, we'll also need to generate random values from the exponential distribution. If the \index{Quantile function} quantile function of the distribution exists, \index{Inverse transform sampling} inverse transform sampling is a relatively straightforward way to get pseudo-random numbers sampled from a target distribution [for an accessible introduction to inverse sampling, see @lynch2007introduction]. Given a target distribution with a PDF $f$ and a quantile function $F^{-1}$ (the inverse of the CDF), the inverse transform sampling method consists of the following:
1. Sample one number $u$ from $\mathit{Uniform}(0,1)$.
2. Then $z=F^{-1}(u)$ is a draw from $f(x)$.
In this case, the quantile function (the \index{Inverse of the CDF} inverse of the CDF) is the following:
\begin{equation}
F^{-1}(p) = - \log(1-p)/\lambda
\end{equation}
Here is how one can derive this inverse of the CDF. First, consider the fact that the CDF of the exponential distribution is as follows. The term $q$ is some quantile of the distribution:
\begin{equation}
F(q) = \int_0^q \lambda \exp(-\lambda x)\, \mathrm{d}x
\end{equation}
We can solve this \index{Integral} integral by using the \index{U-substitution method} $u$-substitution method [@salas2003calculus]. First, define
\begin{equation}
u = g(x) = -\lambda x
\end{equation}
Then, the derivative $du/dx$ is:
\begin{equation}
\frac{\mathrm{d}u}{\mathrm{d}x} = -\lambda
\end{equation}
This implies that $\mathrm{d}u = - \lambda \mathrm{d}x$, or that $- \mathrm{d}u = \lambda \mathrm{d}x$.
In the CDF, replace the term $\lambda \mathrm{d}x$ with $-\mathrm{d}u$, rewrite $-\lambda x$ as $u$, and change the limits of integration accordingly (when $x = 0$, $u = 0$; when $x = q$, $u = -\lambda q$):
\begin{equation}
F(q) = \int_0^q \lambda \exp(-\lambda x)\, \mathrm{d}x = \int_0^{-\lambda q} - \exp(u)\, \mathrm{d}u
\end{equation}
We know from calculus that the antiderivative of $-\exp(u)$ is $-\exp(u)$. So, the integral becomes:
\begin{equation}
F(q) = \left[ - \exp(u) \right]_0^{-\lambda q}
\end{equation}
Evaluating at the limits of integration, we get:
\begin{equation}
F(q) = - \exp(-\lambda q) - (-\exp(0)) = 1 - \exp(-\lambda q)
\end{equation}
Thus, the CDF is $F(q) = 1 - \exp(-\lambda q) = p$, where $p$ is the probability of observing the quantile $q$ or some value smaller than $q$. To derive the inverse of the CDF, solve the equation below for $q$:
\begin{equation}
\begin{aligned}
p &= 1- e^{-\lambda q}\\
p-1 &= - e^{-\lambda q}\\
-p+1 &= e^{-\lambda q}\\
\log(1-p) &= -\lambda q\\
- \log(1-p)/\lambda &= q\\
\end{aligned}
\end{equation}
Write the quantile function (the inverse of the CDF) and the random number generator for the exponential distribution in `R`. To differentiate it from the built-in function in `R`, we will call it `qexp2`:
```{r}
qexp2 <- function(p, lambda = 1) {
-log(1 - p) / lambda
}
rexp2 <- function(n, lambda = 1) {
u <- runif(n, 0, 1)
qexp2(u, lambda)
}
```
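As a quick sanity check (ours; `qexp()` is base R's built-in quantile function for the exponential distribution), the hand-written quantile function should agree with the built-in one:
```{r}
## qexp2 should agree with base R's qexp for any probability and rate:
all.equal(qexp2(c(0.1, 0.5, 0.9), lambda = 1 / 200),
          qexp(c(0.1, 0.5, 0.9), rate = 1 / 200))
```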
The functions that we would use in a Stan model are relatively faithful to the `R` code, but follow Stan conventions: the expression `log1m(p)` is more arithmetically stable than `log(1 - p)` for values of `p` close to one; the function `exp_lpdf` returns the sum of the log(PDF) evaluated at each value of `x`, analogous to `sum(dexp2(x, lambda, log = TRUE))`; and the function `exp_rng` implements a non-vectorized version of `rexp2` using the auxiliary function `exp_icdf` (`icdf` stands for inverse CDF), which is similar to `qexp2`.
```{r, echo = FALSE}
exponential <- system.file("stan_models",
"exponential.stan",
package = "bcogsci")
```
```{stan output.var = "exponential", code = readLines(exponential)[1:13], tidy = TRUE, comment="", eval = FALSE, cache = FALSE, cache.lazy = FALSE}
```
We are now ready to generate synthetic data and fit the distribution in Stan. Generate 1000 observations.
```{r, echo=FALSE}
set.seed(123)
```
```{r}
N <- 1000
lambda <- 1 / 200
rt <- rexp2(N, lambda)
```
Use `exponential.stan`, which includes the `functions` block shown earlier, and the following obligatory blocks:
```{stan output.var = "exponential", code = readLines(exponential)[14:30], tidy = TRUE, comment="", eval = FALSE, cache = FALSE, cache.lazy = FALSE}
```
Fit the data with Stan.
```{r exp, message = FALSE, results = "hide"}
exponential <- system.file("stan_models",
"exponential.stan",
package = "bcogsci")
fit_exp <- stan(exponential, data = list(N = N, RT = rt))
```
Print the summary:
```{r}
print(fit_exp, pars = c("lambda"))
```
Carry out a quick check first, verifying that the true value of the parameter $\lambda$ is reasonably inside the bulk of the posterior distribution. This is shown in Figure \@ref(fig:recoveryexp)(a).
Figure \@ref(fig:recoveryexp)(b) shows the results of the simulation-based calibration procedure, which indicate that our implementation was correct.
```{r ranklambda, message = FALSE, results = "hide", eval = !file.exists("dataR/df_rank_lambda.RDS"), echo = FALSE, fold = TRUE}
gen_ranks_lambda <- function(...,
N_sim = 200,
N_s = 1024 * 6, # Total samples from the posterior
N = 500 # Number of observations
){
lambda_tilde <- rtnorm(N_sim, 0, .1, a = 0)
# Thin by 6, and remove one extra sample so that S + 1 = 1024
thinner <- seq(from = 1, to = N_s, by = 6)[-1]
# Placeholders
rank_lambda <- rep(NA, N_sim)
rt_tilde <- vector(mode = "list", length = N_sim)
# Loop over the simulations
for (n in 1:N_sim) {
message("Fit number ", n)
# repeat the sampling until the model converges
repeat{
rt_tilde[[n]] <- rexp2(N, lambda_tilde[[n]])
# Fit is only stored temporarily
fit <- stan(...,
data = list(N = N,
RT = rt_tilde[[n]]),
warmup = 2500,
iter = 2500 + N_s / 4,
refresh = -1,
control = list(adapt_delta = .999,
max_treedepth = 14)
)
# continue if there are no divergent transitions
# and all Rhat are smaller than 1.05
if (!get_num_divergent(fit) &
all(monitor(fit)[,"Rhat"] <= 1.05)) break
message("Repeating fit ",n)
}
post <- extract(fit)
rank_lambda[n] <- sum(post$lambda[thinner] < lambda_tilde[n])
}
#data frame with ranks:
tibble(sim = 1:N_sim,
variable = rep("lambda", each = N_sim),
rank = rank_lambda)
}
set.seed(123)
# remove the rng block to speed up
exponential_s <- paste(readLines(exponential)[1:25],collapse = "\n")
df_rank_lambda <- gen_ranks_lambda(model_code = exponential_s)
```
```{r echo = FALSE}
if (file.exists("dataR/df_rank_lambda.RDS")) {
df_rank_lambda <- readRDS("dataR/df_rank_lambda.RDS")
} else {
saveRDS(df_rank_lambda, file = "dataR/df_rank_lambda.RDS")
}
```
(ref:recoveryexp) (a) Posterior distribution of the parameter $\lambda$ of `fit_exp` together with its true value as a black line. (b) Rank histogram of $\lambda$ from our hand-made implementation of the exponential distribution. The dashed lines represent the 99% confidence interval.
```{r recoveryexp, message = FALSE, fig.cap = "(ref:recoveryexp)", fig.show="hold", out.width="45%", fig.width = 3, fig.height = 3 , fold = TRUE}
# Plot a
post_exp <- as.data.frame(fit_exp) %>%
select("lambda")
mcmc_recover_hist(post_exp, true = lambda) +
# make it similar to plot b:
ggtitle("(a) Posterior distribution") +
xlab("lamda") +
theme(plot.title = element_text(hjust = 0),
legend.position="none",
strip.text.x = element_blank())
# Plot b
B <- 16
bin_size <- 1024 / 16
N_sim <- 200
ci_l <- qbinom(0.005, size = N_sim, prob = 1 / B)
ci_u <- qbinom(0.995, size = N_sim, prob = 1 / B)
ggplot(df_rank_lambda, aes(x = rank)) +
geom_histogram(breaks =
seq(0, to = 1023, length.out = B + 1),
closed = "left", colour = "black") +
scale_y_continuous("count") +
geom_hline(yintercept = c(ci_l, ci_u), linetype = "dashed") +
ggtitle("(b) Rank histogram") +
theme(plot.title = element_text(hjust = 0))
```
## Summary
In this chapter, we learned how to create a PDF that is not provided by Stan. We learned how to do this by doing a change of variables (using the Jacobian adjustment) and by building the distribution from scratch in a function ending in `_lpdf`. We also learned how to verify the correctness of the new functions by recovering the true values of the parameters and by using simulation-based calibration.
## Further reading
\index{Jacobian adjustment} Jacobian adjustments in Bayesian models are a common source of confusion and there are several posts in blogs and study cases that try to shed light on them:
- https://rstudio-pubs-static.s3.amazonaws.com/486816_440106f76c944734a7d4c84761e37388.html
- https://betanalpha.github.io/assets/case_studies/probability_theory.html#42_probability_density_functions
- https://jsocolar.github.io/jacobians/
- https://mc-stan.org/documentation/case-studies/mle-params.html
Jacobian adjustments are also relevant for adjusting priors for order constraints, which is especially important for Bayes factors [@Heck2016].
A complete tutorial of \index{Simulation-based calibration} simulation-based calibration using the \index{SBC} `SBC` package was presented in the online event \index{Stanconnect 2021} Stanconnect 2021, and it is available in https://www.martinmodrak.cz/post/2021-sbc_tutorial/. The ideas that predate this technique can be found in @Cooketal2006 and were extended in @talts2018validating. The use of ECDF-based visualizations is discussed in @sailynoja2022graphical. The role of simulation-based calibration in the Bayesian workflow is discussed in @schad2020toward. Examples of the use of simulation-based calibration to validate novel models used in cognitive science are @hartmann2020rtmpt and @burkner2020modelling. The extension of this procedure to validate Bayes factors is discussed in @SchadEtAlBF.
Custom functions in general, and custom probability functions in particular, are treated in chapter 21 of the User's guide [@Stan2023].