---
pagetitle: Regression Models
---
# Regression Models
Stan supports regression models from simple linear regressions to
multilevel generalized linear models.
## Linear regression
The simplest linear regression model is the following, with a single
predictor and a slope and intercept coefficient, and normally
distributed noise. This model can be written using standard
regression notation as
$$
y_n = \alpha + \beta x_n + \epsilon_n
\quad\text{where}\quad
\epsilon_n \sim \operatorname{normal}(0,\sigma).
$$
This is equivalent to the following distribution statement involving the
residual,
$$
y_n - (\alpha + \beta x_n) \sim \operatorname{normal}(0,\sigma),
$$
and reducing still further, to
$$
y_n \sim \operatorname{normal}(\alpha + \beta x_n, \, \sigma).
$$
This latter form of the model is coded in Stan as follows.
```stan
data {
int<lower=0> N;
vector[N] x;
vector[N] y;
}
parameters {
real alpha;
real beta;
real<lower=0> sigma;
}
model {
y ~ normal(alpha + beta * x, sigma);
}
```
There are `N` observations and for each observation, $n \in \{1, \dotsc, N\}$, we have predictor
`x[n]` and outcome `y[n]`. The intercept and slope parameters are
`alpha` and `beta`. The model assumes a normally
distributed noise term with scale `sigma`. This model has
improper priors for the two regression coefficients.
### Matrix notation and vectorization {- #vectorization.section}
The distribution statement in the previous model is vectorized, with
```stan
y ~ normal(alpha + beta * x, sigma);
```
providing the same model as the unvectorized version,
```stan
for (n in 1:N) {
y[n] ~ normal(alpha + beta * x[n], sigma);
}
```
In addition to being more concise, the vectorized form is much faster.^[Unlike in Python and R, which are interpreted, Stan is translated to C++ and compiled, so loops and assignment statements are fast. Vectorized code is faster in Stan because (a) the expression tree used to compute derivatives can be simplified, leading to fewer virtual function calls, and (b) computations that would be repeated in the looping version, such as `log(sigma)` in the above model, will be computed once and reused.]
In general, Stan allows the arguments to distributions such as
`normal` to be vectors. If any of the other arguments are vectors or
arrays, they have to be the same size. If any of the other arguments
is a scalar, it is reused for each vector entry.
The other reason this works is that Stan's arithmetic operators are
overloaded to perform matrix arithmetic on matrices. In this case,
because `x` is of type `vector` and `beta` of type
`real`, the expression `beta * x` is of type `vector`.
Because Stan supports vectorization, a regression model with more than
one predictor can be written directly using matrix notation.
```stan
data {
int<lower=0> N; // number of data items
int<lower=0> K; // number of predictors
matrix[N, K] x; // predictor matrix
vector[N] y; // outcome vector
}
parameters {
real alpha; // intercept
vector[K] beta; // coefficients for predictors
real<lower=0> sigma; // error scale
}
model {
y ~ normal(x * beta + alpha, sigma); // data model
}
```
The constraint `lower=0` in the declaration of `sigma`
constrains the value to be greater than or equal to 0. With no prior
in the model block, the effect is an improper prior on non-negative
real numbers. Although a more informative prior may be added, improper
priors are acceptable as long as they lead to proper posteriors.
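For example, weakly informative priors could be added in the model block; the particular scales below are illustrative assumptions rather than recommendations.
```stan
model {
  alpha ~ normal(0, 10);       // weakly informative prior on the intercept
  beta ~ normal(0, 5);         // weakly informative priors on the coefficients
  sigma ~ exponential(1);      // proper prior respecting the lower=0 constraint
  y ~ normal(x * beta + alpha, sigma);   // data model as before
}
```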
In the model above, `x` is an $N \times K$ matrix of predictors
and `beta` a $K$-vector of coefficients, so `x * beta` is an
$N$-vector of predictions, one for each of the $N$ data items. These
predictions line up with the outcomes in the $N$-vector `y`, so
the entire model may be written using matrix arithmetic as shown. It
would be possible to include a column of ones in the data matrix `x` to
remove the `alpha` parameter.
The distribution statement in the model above is just a more efficient,
vector-based approach to coding the model with a loop, as in the
following statistically equivalent model.
```stan
model {
for (n in 1:N) {
    y[n] ~ normal(x[n] * beta + alpha, sigma);
}
}
```
With Stan's matrix indexing scheme, `x[n]` picks out row `n`
of the matrix `x`; because `beta` is a column vector,
the product `x[n] * beta` is a scalar of type `real`.
#### Intercepts as inputs {-}
In the model formulation
```stan
y ~ normal(x * beta, sigma);
```
there is no longer an intercept coefficient `alpha`. Instead, we
have assumed that the first column of the input matrix `x` is a
column of 1 values. This way, `beta[1]` plays the role of the
intercept. If the intercept gets a different prior than the slope
terms, then it would be clearer to break it out. It is also slightly
more efficient in its explicit form with the intercept variable
singled out because there is one fewer multiplication; it should not
make that much of a difference to speed, though, so the choice should
be based on clarity.
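As a sketch of this alternative, the column of ones can be constructed in the transformed data block rather than supplied by the user; the name `x1` is an assumption introduced here for illustration, and this is only one way to set it up.
```stan
transformed data {
  // prepend a column of ones so that beta[1] acts as the intercept
  matrix[N, K + 1] x1 = append_col(rep_vector(1, N), x);
}
parameters {
  vector[K + 1] beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(x1 * beta, sigma);
}
```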
## The QR reparameterization {#QR-reparameterization.section}
In the previous example, the linear predictor can be written as $\eta
= x \beta$, where $\eta$ is an $N$-vector of predictions, $x$ is an $N
\times K$ matrix, and $\beta$ is a $K$-vector of coefficients.
Presuming $N \geq K$, we can exploit the fact that any design matrix
$x$ can be decomposed using the thin QR decomposition into an
orthogonal matrix $Q$ and an upper-triangular matrix $R$, i.e. $x = Q
R$.
The functions `qr_thin_Q` and `qr_thin_R` implement the thin QR decomposition,
which is to be preferred to the fat QR decomposition that would be obtained
by using `qr_Q` and `qr_R`, as the latter would more easily run out of memory
(see the Stan Functions Reference for more information on the `qr_thin_Q`
and `qr_thin_R` functions). In practice, it is best to write $x = Q^\ast
R^\ast$ where $Q^\ast = Q \sqrt{N - 1}$ and $R^\ast =
\frac{1}{\sqrt{N - 1}} R$. Thus, we can equivalently write $\eta = x
\beta = Q R \beta = Q^\ast R^\ast \beta$. If we let $\theta = R^\ast
\beta$, then we have $\eta = Q^\ast \theta$ and $\beta = R^{\ast^{-1}}
\theta$. In that case, the previous Stan program becomes
```stan
data {
int<lower=0> N; // number of data items
int<lower=0> K; // number of predictors
matrix[N, K] x; // predictor matrix
vector[N] y; // outcome vector
}
transformed data {
matrix[N, K] Q_ast;
matrix[K, K] R_ast;
matrix[K, K] R_ast_inverse;
// thin and scale the QR decomposition
Q_ast = qr_thin_Q(x) * sqrt(N - 1);
R_ast = qr_thin_R(x) / sqrt(N - 1);
R_ast_inverse = inverse(R_ast);
}
parameters {
real alpha; // intercept
vector[K] theta; // coefficients on Q_ast
real<lower=0> sigma; // error scale
}
model {
y ~ normal(Q_ast * theta + alpha, sigma); // data model
}
generated quantities {
vector[K] beta;
beta = R_ast_inverse * theta; // coefficients on x
}
```
Since this Stan program generates equivalent predictions for $y$ and
the same posterior distribution for $\alpha$, $\beta$, and $\sigma$ as
the previous Stan program, many wonder why the version with this QR
reparameterization performs so much better in practice, often both in
terms of wall time and in terms of effective sample size. The
reasoning is threefold:
1. The columns of $Q^\ast$ are orthogonal whereas the columns of
$x$ generally are not. Thus, it is easier for a Markov Chain to move
around in $\theta$-space than in $\beta$-space.
1. The columns of $Q^\ast$ have the same scale whereas the columns
of $x$ generally do not. Thus, a Hamiltonian Monte Carlo algorithm
can move around the parameter space with a smaller number of larger
steps.
1. Since the covariance matrix for the columns of $Q^\ast$ is an
identity matrix, $\theta$ typically has a reasonable scale if the
units of $y$ are also reasonable. This also helps HMC move
efficiently without compromising numerical accuracy.
Consequently, this QR reparameterization is recommended for linear and
generalized linear models in Stan whenever $K > 1$ and you do not have
an informative prior on the location of $\beta$. It can also be
worthwhile to subtract the mean from each column of $x$ before
obtaining the QR decomposition, which does not affect the posterior
distribution of $\theta$ or $\beta$ but does affect $\alpha$ and
allows you to interpret $\alpha$ as the expectation of $y$ in a linear
model.
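A sketch of that centering step, replacing the transformed data block of the program above; using `col` and `mean` to compute the column means is just one way to do it.
```stan
transformed data {
  row_vector[K] x_means;
  matrix[N, K] x_centered;
  matrix[N, K] Q_ast;
  matrix[K, K] R_ast;
  matrix[K, K] R_ast_inverse;
  for (k in 1:K) {
    x_means[k] = mean(col(x, k));
  }
  x_centered = x - rep_matrix(x_means, N);      // subtract each column's mean
  Q_ast = qr_thin_Q(x_centered) * sqrt(N - 1);
  R_ast = qr_thin_R(x_centered) / sqrt(N - 1);
  R_ast_inverse = inverse(R_ast);
}
```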
## Priors for coefficients and scales {#regression-priors.section}
See our [general discussion of priors](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations)
for tips on priors for parameters in regression models.
Later sections discuss [univariate
hierarchical priors](#hierarchical-priors.section) and [multivariate
hierarchical priors](#multivariate-hierarchical-priors.section), as well
as [priors used to identify models](#priors-for-identification.section).
However, as described in [QR-reparameterization
section](#QR-reparameterization.section), if you do not have an
informative prior on the *location* of the regression coefficients,
then you are better off reparameterizing your model so that the
regression coefficients are a generated quantity. In that case, it
usually does not matter much what prior is used on the
reparameterized regression coefficients and almost any weakly
informative prior that scales with the outcome will do.
## Robust noise models
The standard approach to linear regression is to model the noise
term $\epsilon$ as having a normal distribution. From Stan's
perspective, there is nothing special about normally distributed
noise. For instance, robust regression can be accommodated by giving
the noise term a Student-$t$ distribution. To code this in Stan, the
distribution statement is changed to the following.
```stan
data {
// ...
real<lower=0> nu;
}
// ...
model {
y ~ student_t(nu, alpha + beta * x, sigma);
}
```
The degrees of freedom constant `nu` is specified as data.
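Alternatively, `nu` could be declared as a parameter and estimated from the data; the lower bound and the gamma prior below are common but purely illustrative choices, not ones prescribed here.
```stan
parameters {
  // ... alpha, beta, and sigma as before ...
  real<lower=1> nu;            // degrees of freedom, now estimated
}
model {
  nu ~ gamma(2, 0.1);          // illustrative weakly informative prior
  y ~ student_t(nu, alpha + beta * x, sigma);
}
```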
## Logistic and probit regression {#logistic-probit-regression.section}
For binary outcomes, either of the closely related logistic or probit
regression models may be used. These generalized linear models vary
only in the link function they use to map linear predictions in
$(-\infty,\infty)$ to probability values in $(0,1)$. Their respective
link functions, the logistic function and the standard normal cumulative distribution
function, are both sigmoid functions (i.e., they are both *S*-shaped).
A logistic regression model with one predictor and an intercept is coded as
follows.
```stan
data {
int<lower=0> N;
vector[N] x;
array[N] int<lower=0, upper=1> y;
}
parameters {
real alpha;
real beta;
}
model {
y ~ bernoulli_logit(alpha + beta * x);
}
```
The noise parameter is built into the Bernoulli formulation here
rather than specified directly.
Logistic regression is a kind of generalized linear model with binary
outcomes and the log odds (logit) link function, defined by
$$
\operatorname{logit}(v) = \log \left( \frac{v}{1-v} \right).
$$
The inverse of the link function appears in the model:
$$
\operatorname{logit}^{-1}(u) = \texttt{inv}\mathtt{\_}\texttt{logit}(u) = \frac{1}{1 + \exp(-u)}.
$$
The model formulation above uses the logit-parameterized version of
the Bernoulli distribution, which is defined by
$$
\texttt{bernoulli}\mathtt{\_}\texttt{logit}\left(y \mid \alpha \right)
=
\texttt{bernoulli}\left(y \mid \operatorname{logit}^{-1}(\alpha)\right).
$$
The formulation is also vectorized in the sense that `alpha` and
`beta` are scalars and `x` is a vector, so that `alpha
+ beta * x` is a vector. The vectorized formulation is equivalent
to the less efficient version
```stan
for (n in 1:N) {
y[n] ~ bernoulli_logit(alpha + beta * x[n]);
}
```
Expanding out the Bernoulli logit, the model is equivalent to the more
explicit, but less efficient and less arithmetically stable
```stan
for (n in 1:N) {
y[n] ~ bernoulli(inv_logit(alpha + beta * x[n]));
}
```
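When there are multiple predictors collected in a matrix, Stan also provides the specialized `bernoulli_logit_glm` distribution, which fuses the linear predictor with the likelihood. The following sketch assumes a predictor matrix `x` and coefficient vector `beta`, which are not part of the single-predictor program above.
```stan
data {
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] x;
  array[N] int<lower=0, upper=1> y;
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  // equivalent to y ~ bernoulli_logit(alpha + x * beta), but more efficient
  y ~ bernoulli_logit_glm(x, alpha, beta);
}
```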
Other link functions may be used in the same way. For example, probit
regression uses the cumulative normal distribution function, which is
typically written as
$$
\Phi(x) = \int_{-\infty}^x \textsf{normal}\left(y \mid 0,1 \right) \,\textrm{d}y.
$$
The cumulative standard normal distribution function $\Phi$ is implemented
in Stan as the function `Phi`. The probit regression model
may be coded in Stan by replacing the logistic model's distribution
statement with the following.
```stan
y[n] ~ bernoulli(Phi(alpha + beta * x[n]));
```
A fast approximation to the cumulative standard normal distribution
function $\Phi$ is implemented in Stan as the function
`Phi_approx`.^[The `Phi_approx` function is a rescaled version of the inverse logit function, so while the scale is roughly the same as that of $\Phi$, the tails do not match.]
The approximate probit regression model may
be coded with the following.
```stan
y[n] ~ bernoulli(Phi_approx(alpha + beta * x[n]));
```
## Multi-logit regression {#multi-logit.section}
Multiple outcome forms of logistic regression can be coded directly in
Stan. For instance, suppose there are $K$ possible outcomes for each
output variable $y_n$. Also suppose that there is a $D$-dimensional
vector $x_n$ of predictors for $y_n$. The multi-logit model with
$\textsf{normal}(0,5)$ priors on the coefficients is coded as follows.
```stan
data {
int K;
int N;
int D;
array[N] int y;
matrix[N, D] x;
}
parameters {
matrix[D, K] beta;
}
model {
matrix[N, K] x_beta = x * beta;
to_vector(beta) ~ normal(0, 5);
for (n in 1:N) {
y[n] ~ categorical_logit(x_beta[n]');
}
}
```
where `x_beta[n]'` is the transpose of `x_beta[n]`. The prior on `beta` is coded in vectorized form.
As of Stan 2.18, the categorical-logit distribution is not vectorized
for parameter arguments, so the loop is required. The matrix multiplication
is pulled out to define a local variable for all of the predictors for
efficiency. Like the Bernoulli-logit, the categorical-logit
distribution applies softmax internally to convert an arbitrary vector
to a simplex,
$$
\texttt{categorical}\mathtt{\_}\texttt{logit}\left(y \mid \alpha\right)
=
\texttt{categorical}\left(y \mid \texttt{softmax}(\alpha)\right),
$$
where
$$
\texttt{softmax}(u) = \exp(u) / \operatorname{sum}\left(\exp(u)\right).
$$
The categorical distribution with log-odds (logit) scaled parameters
used above is equivalent to writing
```stan
y[n] ~ categorical(softmax(x[n] * beta));
```
#### Constraints on data declarations {-}
The data block in the above model is defined without constraints on
sizes `K`, `N`, and `D` or on the outcome array
`y`. Constraints on data declarations provide error checking at
the point data are read (or transformed data are defined), which is
before sampling begins. Constraints on data declarations also make
the model author's intentions more explicit, which can help with
readability. The above model's declarations could be tightened to
```stan
int<lower=2> K;
int<lower=0> N;
int<lower=1> D;
array[N] int<lower=1, upper=K> y;
```
These constraints arise because the number of categories, `K`,
must be at least two in order for a categorical model to be useful.
The number of data items, `N`, can be zero, but not negative;
unlike R, Stan's for-loops always move forward, so that a loop extent
of `1:N` when `N` is equal to zero ensures the loop's body
will not be executed. The number of predictors, `D`, must be at
least one in order for `x[n] * beta` to produce an
appropriate argument for `softmax()`. The categorical outcomes
`y[n]` must be between `1` and `K` in order for the
discrete sampling to be well defined.
Constraints on data declarations are optional. Constraints on
parameters declared in the `parameters` block, on the other hand,
are *not* optional---they are required to ensure support for all
parameter values satisfying their constraints. Constraints on
transformed data, transformed parameters, and generated quantities are
also optional.
### Identifiability {-}
Because softmax is invariant under adding a constant to each component
of its input, the model is typically only identified if there is a
suitable prior on the coefficients.
An alternative is to identify the model by fixing the coefficient vector
for one of the $K$ outcomes to zero, so that only $K - 1$ coefficient
vectors are free parameters. The [partially known parameters
section](missing-data.qmd#partially-known-parameters.section) discusses how to mix
constants and parameters in a vector. In the multi-logit case, the
parameter block would be redefined to use $(K - 1)$-vectors
```stan
parameters {
matrix[D, K - 1] beta_raw;
}
```
and then these raw parameters are transformed into the full coefficient matrix used in the model.
First, a transformed data block is added before the parameters block
to define a vector of zero values,
```stan
transformed data {
vector[D] zeros = rep_vector(0, D);
}
```
which can then be appended to `beta_raw` to produce the
coefficient matrix `beta`,
```stan
transformed parameters {
matrix[D, K] beta = append_col(beta_raw, zeros);
}
```
The `rep_vector(0, D)` call creates a column vector of size `D` with
all entries set to zero. The derived matrix `beta` is then defined to
be the result of appending the vector `zeros` as a new column at the
end of `beta_raw`; the vector `zeros` is defined as transformed
data so that it doesn't need to be constructed from scratch each time
it is used.
This is not the same model as using $K$-vectors as parameters,
because now the prior only applies to $(K-1)$-vectors. In practice,
this will cause the maximum likelihood solutions to be different and
also the posteriors to be slightly different when taking priors
centered around zero, as is typical for regression coefficients.
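Putting these pieces together with the earlier data and model blocks gives the following sketch of the identified program; the prior scale on `beta_raw` mirrors the earlier example.
```stan
data {
  int<lower=2> K;
  int<lower=0> N;
  int<lower=1> D;
  array[N] int<lower=1, upper=K> y;
  matrix[N, D] x;
}
transformed data {
  vector[D] zeros = rep_vector(0, D);
}
parameters {
  matrix[D, K - 1] beta_raw;
}
transformed parameters {
  matrix[D, K] beta = append_col(beta_raw, zeros);
}
model {
  matrix[N, K] x_beta = x * beta;
  to_vector(beta_raw) ~ normal(0, 5);
  for (n in 1:N) {
    y[n] ~ categorical_logit(x_beta[n]');
  }
}
```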
## Parameterizing centered vectors
When there are varying effects in a regression, the resulting
likelihood is not identified unless further steps are taken. For
example, we might have a global intercept $\alpha$ and then a varying
effect $\beta_k$ for age group $k$ to make a linear predictor $\alpha +
\beta_k$. With this predictor, we can add a constant to $\alpha$ and
subtract from each $\beta_k$ and get exactly the same likelihood.
The traditional approach to identifying such a model is to pin the
first varying effect to zero, i.e., $\beta_1 = 0$. With one of the
varying effects fixed, you can no longer add a constant to all of them
and the model's likelihood is identified. In addition to the
difficulty in specifying such a model in Stan, it is awkward to
formulate priors because the other coefficients are all interpreted
relative to $\beta_1$.
In a Bayesian setting, a proper prior on each of the $\beta$ is enough
to identify the model. Unfortunately, this can lead to inefficiency
during sampling as the model is still only weakly identified through
the prior---there is a very simple example of the difference in
the discussion of collinearity in the chapter on problematic posteriors.
An alternative identification strategy that allows a symmetric prior
is to enforce a sum-to-zero constraint on the varying effects, i.e.,
$\sum_{k=1}^K \beta_k = 0.$
A parameter vector constrained to sum to zero may also be used to
identify a multi-logit regression parameter vector (see the
[multi-logit section](#multi-logit.section) for details), or may be
used for ability or difficulty parameters (but not both) in an IRT
model (see the [item-response model
section](#item-response-models.section) for details).
### Built-in sum-to-zero vector {-}
As of Stan 2.36, there is a built-in `sum_to_zero_vector` type, which
can be used as follows.
```stan
parameters {
sum_to_zero_vector[K] beta;
// ...
}
```
This produces a vector of size `K` such that `sum(beta) = 0`. The
unconstrained representation requires only `K - 1` values because the
last element is determined by the first `K - 1`.
Placing a prior on `beta` in this parameterization, for example,
```stan
beta ~ normal(0, 1);
```
leads to a subtly different posterior than what you would get with the
same prior on an unconstrained size-`K` vector. As explained below,
the variance is reduced.
The sum-to-zero constraint can be implemented naively by setting the
last element to the negative sum of the first elements, i.e., $\beta_K
= -\sum_{k=1}^{K-1} \beta_k.$ But that leads to high correlation among
the $\beta_k$.
The transform used in Stan eliminates these correlations by
constructing an orthogonal basis and applying it to the
zero-sum constraint; @seyboldt:2024 provides an explanation. The
*Stan Reference Manual* provides the details in the chapter on
transforms. Although any orthogonal basis can be used, Stan uses the
inverse isometric log-ratio transform because it is convenient to describe
and the transform simplifies to efficient scalar operations rather
than more expensive matrix operations.
#### Marginal distribution of sum-to-zero components {-}
On the Stan forums, Aaron Goodman provided the following code to
produce a prior with standard normal marginals on the components of
`beta`,
```stan
model {
beta ~ normal(0, inv(sqrt(1 - inv(K))));
// ...
}
```
The scale component can be multiplied by `sigma` to produce a
`normal(0, sigma)` prior marginally.
To generate distributions with marginals other than standard normal,
the resulting `beta` may be scaled by some factor `sigma` and
translated to some new location `mu`.
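As a concrete sketch tying this together with the age-group example above, the data names and the priors on `alpha` and `sigma` below are illustrative assumptions.
```stan
data {
  int<lower=0> N;
  int<lower=2> K;                        // number of age groups (assumed)
  array[N] int<lower=1, upper=K> group;  // age group of observation n
  vector[N] y;
}
parameters {
  real alpha;                            // global intercept
  sum_to_zero_vector[K] beta;            // varying effects, constrained to sum to zero
  real<lower=0> sigma;
}
model {
  alpha ~ normal(0, 5);                         // illustrative prior
  beta ~ normal(0, inv(sqrt(1 - inv(K))));      // standard normal marginals
  sigma ~ exponential(1);                       // illustrative prior
  y ~ normal(alpha + beta[group], sigma);
}
```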
### Soft centering {-}
Adding a prior such as $\beta \sim \textsf{normal}(0,\epsilon)$ for a
small $\epsilon$ will provide a kind of soft centering of a parameter
vector $\beta$ by preferring, all else being equal, that $\sum_{k=1}^K
\beta_k = 0$. This approach is only guaranteed to roughly center if
$\beta$ and the elementwise addition $\beta + c$ for a scalar constant
$c$ produce the same likelihood (perhaps by another vector $\alpha$
being transformed to $\alpha - c$, as in the IRT models). This is
another way of achieving a symmetric prior, though it requires
choosing an $\epsilon$. If $\epsilon$ is too large, there won't be a
strong enough centering effect and if it is too small, it will add
high curvature to the target density and impede sampling.
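A minimal sketch of such a soft-centering prior follows; the value used for $\epsilon$ is purely illustrative.
```stan
parameters {
  vector[K] beta;          // unconstrained varying effects
  // ...
}
model {
  beta ~ normal(0, 0.1);   // small epsilon softly centers beta near zero
  // ... likelihood in which adding a constant to beta (with a
  // compensating shift elsewhere) leaves the likelihood unchanged ...
}
```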
## Ordered logistic and probit regression {#ordered-logistic.section}
Ordered regression for an outcome $y_n \in \{ 1, \dotsc, K \}$ with
predictors $x_n \in \mathbb{R}^D$ is determined by a single coefficient
vector $\beta \in \mathbb{R}^D$ along with a sequence of cutpoints $c \in
\mathbb{R}^{K-1}$ sorted so that $c_k < c_{k+1}$. The discrete output is
$k$ if the linear predictor $x_n \beta$ falls between $c_{k-1}$ and
$c_k$, assuming $c_0 = -\infty$ and $c_K = \infty$. The noise term is
fixed by the form of regression, with examples below for ordered logistic
and ordered probit models.
### Ordered logistic regression {-}
The ordered logistic model can be coded in Stan using the
`ordered` data type for the cutpoints and the built-in
`ordered_logistic` distribution.
```stan
data {
int<lower=2> K;
int<lower=0> N;
int<lower=1> D;
array[N] int<lower=1, upper=K> y;
array[N] row_vector[D] x;
}
parameters {
vector[D] beta;
ordered[K - 1] c;
}
model {
for (n in 1:N) {
y[n] ~ ordered_logistic(x[n] * beta, c);
}
}
```
The vector of cutpoints `c` is declared as `ordered[K - 1]`,
which guarantees that `c[k]` is less than `c[k + 1]`.
If the cutpoints were assigned independent priors, the constraint
effectively truncates the joint prior to support over points that
satisfy the ordering constraint. Luckily, Stan does not need to
compute the effect of the constraint on the normalizing term because
the probability is needed only up to a constant of proportionality.
#### Ordered probit {-}
An ordered probit model could be coded in exactly the same way by
swapping the cumulative logistic (`inv_logit`) for the cumulative
normal (`Phi`).
```stan
data {
int<lower=2> K;
int<lower=0> N;
int<lower=1> D;
array[N] int<lower=1, upper=K> y;
array[N] row_vector[D] x;
}
parameters {
vector[D] beta;
ordered[K - 1] c;
}
model {
vector[K] theta;
for (n in 1:N) {
real eta;
eta = x[n] * beta;
theta[1] = 1 - Phi(eta - c[1]);
for (k in 2:(K - 1)) {
theta[k] = Phi(eta - c[k - 1]) - Phi(eta - c[k]);
}
theta[K] = Phi(eta - c[K - 1]);
y[n] ~ categorical(theta);
}
}
```
The logistic model could also be coded this way by replacing
`Phi` with `inv_logit`, though the built-in `ordered_logistic`
distribution is more efficient and more numerically
stable. A small efficiency gain could be achieved by computing the
values `Phi(eta - c[k])` once and storing them for re-use.
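A sketch of that optimization, caching each `Phi(eta - c[k])` so it is computed only once per observation:
```stan
model {
  vector[K] theta;
  vector[K - 1] Phi_c;      // cached values of Phi(eta - c[k]) for observation n
  for (n in 1:N) {
    real eta = x[n] * beta;
    for (k in 1:(K - 1)) {
      Phi_c[k] = Phi(eta - c[k]);
    }
    theta[1] = 1 - Phi_c[1];
    for (k in 2:(K - 1)) {
      theta[k] = Phi_c[k - 1] - Phi_c[k];
    }
    theta[K] = Phi_c[K - 1];
    y[n] ~ categorical(theta);
  }
}
```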
## Hierarchical regression
The simplest multilevel model is a hierarchical model in which the
data are grouped into $L$ distinct categories (or levels). An extreme
approach would be to completely pool all the data and estimate a
common vector of regression coefficients $\beta$. At the other
extreme, an approach with no pooling assigns each level $l$ its own
coefficient vector $\beta_l$ that is estimated separately from the
other levels. A hierarchical model is an intermediate solution where
the degree of pooling is determined by the data and a prior on the
amount of pooling.
Suppose each binary outcome $y_n \in \{ 0, 1 \}$ has an associated
level, $ll_n \in \{ 1, \dotsc, L \}$. Each outcome will also have
an associated predictor vector $x_n \in \mathbb{R}^D$. Each level $l$
gets its own coefficient vector $\beta_l \in \mathbb{R}^D$. The
hierarchical structure involves drawing the coefficients $\beta_{l,d}
\in \mathbb{R}$ from a prior that is also estimated with the data. This
hierarchically estimated prior determines the amount of pooling. If
the data in each level are similar, strong pooling will be
reflected in low hierarchical variance. If the data in the levels are
dissimilar, weaker pooling will be reflected in higher hierarchical variance.
The following model encodes a hierarchical logistic regression model
with a hierarchical prior on the regression coefficients.
```stan
data {
int<lower=1> D;
int<lower=0> N;
int<lower=1> L;
array[N] int<lower=0, upper=1> y;
array[N] int<lower=1, upper=L> ll;
array[N] row_vector[D] x;
}
parameters {
array[D] real mu;
array[D] real<lower=0> sigma;
array[L] vector[D] beta;
}
model {
for (d in 1:D) {
mu[d] ~ normal(0, 100);
for (l in 1:L) {
beta[l, d] ~ normal(mu[d], sigma[d]);
}
}
for (n in 1:N) {
y[n] ~ bernoulli(inv_logit(x[n] * beta[ll[n]]));
}
}
```
The standard deviation parameter `sigma` gets an implicit uniform
prior on $(0,\infty)$ because of its declaration with a lower-bound
constraint of zero. Stan allows improper priors as long as the
posterior is proper. Nevertheless, it is usually helpful to have
informative or at least weakly informative priors for all parameters;
see the [regression priors section](#regression-priors.section) for
recommendations on priors for regression coefficients and scales.
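For example, weakly informative priors might be added for the hierarchical location and scale parameters; the particular scales below are illustrative assumptions, not recommendations.
```stan
model {
  mu ~ normal(0, 5);          // tighter than the original normal(0, 100)
  sigma ~ normal(0, 2);       // half-normal because sigma is constrained to be positive
  // ... rest of the model block as before ...
}
```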
#### Optimizing the model {-}
Where possible, vectorizing distribution statements leads to faster log
probability and derivative evaluations. The speed boost is not
because loops are eliminated, but because vectorization allows sharing
subcomputations in the log probability and gradient calculations and
because it reduces the size of the expression tree required for
gradient calculations.
The first optimization vectorizes the for-loop over `D` as
```stan
mu ~ normal(0, 100);
for (l in 1:L) {
beta[l] ~ normal(mu, sigma);
}
```
The declaration of `beta` as an array of vectors means that the
expression `beta[l]` denotes a vector. Although `beta` could have
been declared as a matrix, an array of vectors (or a two-dimensional
array) is more efficient for accessing rows; see the [indexing
efficiency section](matrices-arrays.qmd#indexing-efficiency.section) for more information
on the efficiency tradeoffs among arrays, vectors, and matrices.
This model can be further sped up and at the same time made more
arithmetically stable by replacing the application of inverse-logit
inside the Bernoulli distribution with the logit-parameterized
Bernoulli,^[The Bernoulli-logit distribution builds in the logit link function, taking $$\texttt{bernoulli}\mathtt{\_}\texttt{logit}\left(y \mid \alpha\right) = \texttt{bernoulli}\left(y \mid \operatorname{logit}^{-1}(\alpha)\right).$$]
```stan
for (n in 1:N) {
y[n] ~ bernoulli_logit(x[n] * beta[ll[n]]);
}
```
Unlike in R or BUGS, loops, array access, and assignments are fast in
Stan because they are translated directly to C++. In most cases, the
cost of allocating and assigning to a container is more than made up
for by the increased efficiency due to vectorizing the log probability
and gradient calculations. Thus the following version is faster than
the original formulation as a loop over a distribution statement.
```stan
{
vector[N] x_beta_ll;
for (n in 1:N) {
x_beta_ll[n] = x[n] * beta[ll[n]];
}
y ~ bernoulli_logit(x_beta_ll);
}
```
The brackets introduce a new scope for the local variable
`x_beta_ll`; alternatively, the variable may be declared at the
top of the model block.
In some cases, such as the above, the local variable assignment leads
to models that are less readable. The recommended practice in such
cases is to first develop and debug the more transparent version of
the model and only work on optimizations when the simpler formulation
has been debugged.
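For reference, combining the vectorized priors and the precomputed linear predictor above yields a model block along the following lines.
```stan
model {
  vector[N] x_beta_ll;
  mu ~ normal(0, 100);
  for (l in 1:L) {
    beta[l] ~ normal(mu, sigma);
  }
  for (n in 1:N) {
    x_beta_ll[n] = x[n] * beta[ll[n]];
  }
  y ~ bernoulli_logit(x_beta_ll);
}
```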
## Hierarchical priors {#hierarchical-priors.section}
Priors on priors, also known as "hyperpriors," should be treated the
same way as priors on lower-level parameters in that as much prior
information as is available should be brought to bear. Because
hyperpriors often apply to only a handful of lower-level parameters,
care must be taken to ensure the posterior is both proper and not
overly sensitive either statistically or computationally to wide tails
in the priors.
### Boundary-avoiding priors for MLE in hierarchical models {-}
The fundamental problem with maximum likelihood estimation (MLE) in
the hierarchical model setting is that as the hierarchical variance
drops and the values cluster around the hierarchical mean, the overall
density grows without bound. As an illustration, consider a simple
hierarchical linear regression (with fixed prior mean) of $y_n \in
\mathbb{R}$ on $x_n \in \mathbb{R}^K$, formulated as
\begin{align*}
y_n & \sim \textsf{normal}(x_n \beta, \sigma) \\
\beta_k & \sim \textsf{normal}(0,\tau) \\
\tau & \sim \textsf{Cauchy}(0,2.5)
\end{align*}
In this case, as $\tau \rightarrow 0$ and $\beta_k \rightarrow 0$, the
posterior density
$$
p(\beta,\tau,\sigma \mid y,x) \propto p(y \mid x,\beta,\sigma) \, p(\beta \mid \tau) \, p(\tau)
$$
grows without bound, driven by the hierarchical prior term $p(\beta \mid \tau) = \prod_{k=1}^K \textsf{normal}\left(\beta_k \mid 0, \tau\right)$. See the [plot of Neal's funnel density](#funnel.figure), which has similar behavior.
There is obviously no MLE estimate for $\beta,\tau,\sigma$ in such a
case, and therefore the model must be modified if posterior modes are
to be used for inference. The approach recommended by
@ChungEtAl:2013 is to use a gamma distribution as a prior on the
hierarchical scale, such as
$$
\tau \sim \textsf{Gamma}(2, 1/A),
$$
for a reasonably large value of $A$, such as $A = 10$.
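In Stan, with $A = 10$ hard coded and assuming the data block declares `N`, `K`, the predictor matrix `x`, and the outcome vector `y`, this boundary-avoiding prior might be coded as follows; note that Stan's `gamma` is parameterized by shape and rate, so the rate is $1/A = 0.1$.
```stan
parameters {
  vector[K] beta;
  real<lower=0> tau;        // hierarchical scale
  real<lower=0> sigma;
}
model {
  tau ~ gamma(2, 0.1);      // Gamma(2, 1/A) with A = 10 keeps tau off the boundary
  beta ~ normal(0, tau);
  y ~ normal(x * beta, sigma);
}
```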
## Item-response theory models {#item-response-models.section}
Item-response theory (IRT) models the situation in which a number of
students each answer one or more of a group of test questions. The
model is based on parameters for the ability of the students, the
difficulty of the questions, and in more articulated models, the
discriminativeness of the questions and the probability of guessing
correctly; see @GelmanHill:2007 [pp. 314--320] for a textbook
introduction to hierarchical IRT models and @Curtis:2010 for
encodings of a range of IRT models in BUGS.
### Data declaration with missingness {-}
The data provided for an IRT model may be declared as follows
to account for the fact that not every student is required to answer
every question.
```stan
data {
int<lower=1> J; // number of students
int<lower=1> K; // number of questions
int<lower=1> N; // number of observations
array[N] int<lower=1, upper=J> jj; // student for observation n
array[N] int<lower=1, upper=K> kk; // question for observation n
array[N] int<lower=0, upper=1> y; // correctness for observation n
}
```
This declares a total of `N` student-question pairs in the data
set, where each `n` in `1:N` indexes a binary observation
`y[n]` of the correctness of the answer of student `jj[n]`
on question `kk[n]`.
The prior hyperparameters will be hard coded in the rest of this
section for simplicity, though they could be coded as data in
Stan for more flexibility.
### 1PL (Rasch) model {-}
The 1PL item-response model, also known as the Rasch model, has one
parameter (1P) for questions and uses the logistic link function (L).
The model parameters are declared as follows.
```stan
parameters {
real delta; // mean student ability
array[J] real alpha; // ability of student j - mean ability
array[K] real beta; // difficulty of question k
}
```
The parameter `alpha[j]` is the ability coefficient for student
`j` and `beta[k]` is the difficulty coefficient for question
`k`. The non-standard parameterization used here also includes
an intercept term `delta`, which represents the average student's
response to the average question.^[@GelmanHill:2007 treat the $\delta$ term equivalently as the location parameter in the distribution of student abilities.]
The model itself is as follows.
```stan
model {
alpha ~ std_normal(); // informative true prior
beta ~ std_normal(); // informative true prior
delta ~ normal(0.75, 1); // informative true prior
for (n in 1:N) {
y[n] ~ bernoulli_logit(alpha[jj[n]] - beta[kk[n]] + delta);
}
}
```
This model uses the logit-parameterized Bernoulli distribution, where
$$
\texttt{bernoulli}\mathtt{\_}\texttt{logit}\left(y \mid \alpha\right)
=
\texttt{bernoulli}\left(y \mid \operatorname{logit}^{-1}(\alpha)\right).
$$
The key to understanding it is the term inside the
`bernoulli_logit` distribution, from which it follows that
$$
\Pr[y_n = 1] = \operatorname{logit}^{-1}\left(\alpha_{jj[n]} - \beta_{kk[n]}
+ \delta\right).
$$
The model suffers from additive identifiability issues without the
priors. For example, adding a term $\xi$ to each $\alpha_j$ and
$\beta_k$ results in the same predictions. The use of priors for
$\alpha$ and $\beta$ located at 0 identifies the parameters; see
@GelmanHill:2007 for a discussion of identifiability issues and
alternative approaches to identification.
For testing purposes, the IRT 1PL model distributed with Stan uses
informative priors that match the actual data generation process used
to simulate the data in R (the simulation code is supplied in the same
directory as the models). This is unrealistic for most practical
applications, but allows Stan's inferences to be validated. A simple
sensitivity analysis with fatter priors shows that the posterior is
fairly sensitive to the prior even with 400 students and 100 questions
and only 25% missingness at random. For real applications, the
priors should be fit hierarchically along with the other parameters,
as described in the next section.