# Bayesian hierarchical models {#ch-hierarchical}
Experimental data in cognitive science usually contain "clusters": natural groups whose observations are more similar within clusters than between them.
The most common examples of clusters in experimental designs are subjects and experimental items (e.g., words, pictures, or objects that are presented to the subjects). These clusters arise because we have multiple (repeated) observations for each subject and for each item. Such data are sometimes also called dependent data, because the observations within a cluster are not independent. If we want to incorporate this \index{Grouping structure} grouping structure in our analysis, we generally use a \index{Hierarchical model} hierarchical model [also called a \index{Multi-level} multi-level or \index{Mixed model} mixed model; @pinheirobates]. This kind of clustering and hierarchical modeling arises as a consequence of the idea of \index{Exchangeability} *exchangeability*.
## Exchangeability and hierarchical models
Exchangeability is the Bayesian analog of the phrase \index{Independent and identically distributed} "independent and identically distributed" that appears regularly in classical (i.e., frequentist) statistics. Some connections and differences between exchangeability and the frequentist concept of independent and identically distributed (iid) are detailed in online section \@ref(app-exch).
Informally, the idea of exchangeability is as follows. Suppose we assign a numerical index to each of the levels of a group (e.g., to each subject). When the levels are exchangeable, we can reassign the indices arbitrarily and lose no information; that is, the joint distribution remains the same, because we don't have any different prior information for each cluster (here, for each subject). In hierarchical models, we treat the levels of the \index{Group} group as exchangeable, and the observations within each level as also exchangeable. We generally include predictors at the level of the observations; these are the predictors that correspond to the experimental manipulations (e.g., attentional load, trial number, cloze probability). We may also include predictors at the group level; these indicate characteristics of the levels in the group (e.g., the working memory capacity score of each subject). The conditional distributions given these explanatory variables are then exchangeable; that is, our predictors incorporate all the information that is not exchangeable, and once we factor the predictors out, the observations or units in the group are exchangeable. This is the reason why item number is an appropriate cluster, but trial number is not: in the first case, if we permute the numbering of the items, there is no loss of information, because the indices are exchangeable; all the information about the items is incorporated as predictors in the model. In the second case, the numbering of the trials includes information that would be lost if we treated the trials as exchangeable. For example, consider the case where, as trial numbers increase, subjects become more experienced or fatigued. Even if we are not interested in the specific cluster-level estimates, hierarchical models allow us to generalize to the underlying population (of subjects, items) from which the clusters in the sample were drawn.
Of course, this generalization is only possible if the samples are representative (e.g., if the subjects and items are randomly drawn); strictly speaking, this condition often does not hold in experimental fields like psychology and linguistics. For more on exchangeability, consult the further reading at the end of the chapter.
Exchangeability is important in Bayesian statistics because of a theorem called the \index{Representation Theorem} Representation Theorem [@deFinetti]. This theorem states that if a sequence of random variables is exchangeable, then the prior distributions on the parameters in a model are a necessary consequence; priors are not merely an arbitrary addition to the frequentist modeling approach that we are familiar with. For a detailed discussion on the Representation Theorem, see chapter 4 of @kendall2004.
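The content of the theorem can be sketched as follows (informally, and omitting the precise technical conditions; this is our gloss, not the theorem's full statement): if $y_1, y_2, \ldots$ is an infinitely exchangeable sequence, then the joint distribution of any $N$ observations can be written as if the observations had been generated independently given some parameter $\theta$ that itself has a distribution $p(\theta)$:

\begin{equation}
p(y_1, \ldots, y_N) = \int \prod_{n=1}^{N} p(y_n \mid \theta) \, p(\theta) \, d\theta
\end{equation}

The distribution $p(\theta)$ plays the role of the prior: it emerges as a consequence of the exchangeability assumption rather than being an arbitrary add-on.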
Furthermore, exchangeability has been shown [@bernardosmith] to be mathematically equivalent to assuming a hierarchical structure in the model. The argument goes as follows. Suppose that the parameters for each level in a group are $\mu_i$, where the levels are labeled $i=1,\dots,I$. An example of groups is subjects. Suppose also that the data $y_n$, where $n=1,\dots,N$ are observations from these subjects (e.g., pupil size measurements). The data are assumed to be generated as
\begin{equation}
y_n \sim \mathit{Normal}(\mu_{subj[n]},\sigma)
\end{equation}
The notation $subj[n]$, which roughly follows @GelmanHill2007, identifies the subject index. Suppose that each of 20 subjects provides 50 observations. If the data are ordered by subject id, then $subj[1]$ to $subj[50]$ correspond to the subject with id $i=1$, $subj[51]$ to $subj[100]$ to the subject with id $i=2$, and so on.
We can code this representation in a straightforward way in R:
```{r}
N_subj <- 20
N_obs_subj <- 50
N <- N_subj * N_obs_subj
df <- tibble(row = 1:N,
             subj = rep(1:N_subj, each = N_obs_subj))
df
# Example:
c(df$subj[1], df$subj[2], df$subj[51])
```
If the data $y_n$ are exchangeable, the parameters $\mu_i$ are also exchangeable. The fact that the $\mu_i$ are exchangeable can be shown [@bernardosmith] to be mathematically equivalent to assuming that they come from a common distribution, for example:
\begin{equation}
\mu_i \sim \mathit{Normal}(\mu,\tau)
\end{equation}
To make this more concrete, assume some completely arbitrary true values for the parameters, and generate observations based on a hierarchical process in R.
```{r, echo = FALSE}
set.seed(123)
```
```{r}
mu <- 100
tau <- 15
sigma <- 4
mu_i <- rnorm(N_subj, mu, tau)
df_h <- mutate(df, y = rnorm(N, mu_i[subj], sigma))
df_h
```
The parameters $\mu$ and $\tau$, called \index{Hyperparameter} hyperparameters, are unknown and have prior distributions \index{Hyperprior} (hyperpriors) defined for them. This fact leads to a hierarchical relationship between the parameters: there is a common parameter $\mu$ for each of the levels of a group (each of the subjects), and the parameters $\mu_i$ are assumed to be generated as a function of this common parameter $\mu$. Here, $\tau$ represents \index{Between-group variability} between-group (here, between-subject) variability, and $\sigma$ represents \index{Within-group variability} within-group (within-subject) variability. The three parameters have priors defined for them. The first two priors below are called hyperpriors.
- $p(\mu)$
- $p(\tau)$
- $p(\sigma)$
In such a model, information about $\mu_i$ comes from two sources:
(a) from each of the observed $y_n$ corresponding to the respective $\mu_{subj[n]}$ parameter, and
(b) from the parameters $\mu$ and $\tau$ that led to all the other $y_k$ (where $k\neq n$) being generated.
This is illustrated in Figure \@ref(fig:dags1).
Fit this model in `brms` in the following way. For now, the prior distributions are arbitrary.
```{r, message = FALSE, results = "hide"}
fit_h <- brm(y ~ 1 + (1 | subj), df_h,
             prior =
               c(prior(normal(50, 200), class = Intercept),
                 prior(normal(2, 5), class = sigma),
                 prior(normal(10, 20), class = sd)),
             # increase iterations to avoid convergence issues
             iter = 5000)
```
```{r, eval = FALSE}
fit_h
```
```{r, echo = FALSE}
short_summary(fit_h)
```
In this output, `Intercept` corresponds to the posterior of $\mu$, `sigma` to $\sigma$, and `sd(Intercept)` to $\tau$. There is more information in the `brms` object: we can also get the posteriors for each level of our group (i.e., for each subject). However, rather than estimating $\mu_i$, `brms` estimates the adjustments to $\mu$, $u_i$, named `r_subj[i,Intercept]`, so that $\mu_i = \mu + u_i$. See the code below.
```{r, warning = FALSE}
# Extract the posterior estimates of u_i
u_i_post <- as_draws_df(fit_h) %>%
  select(starts_with("r_subj"))
# Extract the posterior estimate of mu
mu_post <- as_draws_df(fit_h)$b_Intercept
# Build the posterior estimate of mu_i
mu_i_post <- mu_post + u_i_post
colMeans(mu_i_post) %>% unname()
# Compare with true values
mu_i
```
(ref:dags1) A \index{Directed acyclic graph} directed acyclic graph illustrating a hierarchical model \index{Partial pooling} (partial pooling).
```{tikz dags1, fig.cap = "(ref:dags1)", out.width = "100%", echo = FALSE, fig.ext = 'png', cache = FALSE, dpi = 1000}
%% hierarchical model:
\begin{tikzpicture}
\matrix[matrix of math nodes, column sep=20pt, row sep=20pt] (mat)
{
& & && \mu, \tau & & & & & \\
& \mu_1 & & & \mu_{2} & & \ldots & \mu_{20} & &\\
y_{1} & y_2 & \ldots & y_{50} & y_{51} & \ldots& \ldots & y_{951} & \ldots & y_{1000}\\
& & & &\sigma & & & & & &\\
};
% top arrows:
\draw[->,>=latex] (mat-1-5) -- (mat-2-2);
\draw[->,>=latex] (mat-1-5) -- (mat-2-5);
\draw[->,>=latex] (mat-1-5) -- (mat-2-7);
\draw[->,>=latex] (mat-1-5) -- (mat-2-8);
% from mu1
\draw[->,>=latex] (mat-2-2) -- (mat-3-1);
\draw[->,>=latex] (mat-2-2) -- (mat-3-2);
\draw[->,>=latex] (mat-2-2) -- (mat-3-3);
\draw[->,>=latex] (mat-2-2) -- (mat-3-4);
% from mu2
\draw[->,>=latex] (mat-2-5) -- (mat-3-5);
\draw[->,>=latex] (mat-2-5) -- (mat-3-6);
% from ...
\draw[->,>=latex] (mat-2-7) -- (mat-3-7);
\draw[->,>=latex] (mat-2-8) -- (mat-3-8);
\draw[->,>=latex] (mat-2-8) -- (mat-3-9);
\draw[->,>=latex] (mat-2-8) -- (mat-3-10);
\foreach \column in {1, 2, 3,4,5,6, 7,8,9,10}
{
\draw[->,>=latex] (mat-4-5)-- (mat-3-\column);
}
\end{tikzpicture}
```
Two other configurations are possible that do not involve this hierarchical structure; they represent two alternative, extreme scenarios.
One of these configurations is called the \index{Complete pooling model} *complete pooling model*. Here, the data $y_n$ are assumed to be generated from a single distribution:
\begin{equation}
y_n \sim \mathit{Normal}(\mu,\sigma).
\end{equation}
This model is an intercept-only regression, similar to the one we saw in chapter \@ref(ch-compbda).
Generate simulated observations in a vector `y` based on arbitrary true values in R in the following way.
```{r}
sigma <- 4
mu <- 100
df_cp <- mutate(df, y = rnorm(N, mu, sigma))
df_cp
```
Fit it in `brms`.
```{r, message = FALSE, results = "hide"}
fit_cp <- brm(y ~ 1, df_cp,
              prior =
                c(prior(normal(50, 200), class = Intercept),
                  prior(normal(2, 5), class = sigma)))
```
```{r, eval = FALSE}
fit_cp
```
```{r, echo = FALSE}
short_summary(fit_cp)
```
The configuration of the complete pooling model is illustrated in Figure \@ref(fig:dags3).
(ref:dags3) A directed acyclic graph illustrating a complete pooling model.
```{tikz dags3, fig.cap = "(ref:dags3)", out.width = "100%", echo = FALSE, fig.ext = 'png', cache = FALSE, dpi = 1000, out.extra = ''}
%% complete pooling model:
\begin{tikzpicture}
\matrix[matrix of math nodes, column sep=20pt, row sep=20pt] (mat)
{
& & \mu & \\
y_{1} & y_{2} & \ldots & y_{1000} \\
& & \sigma & \\
};
\foreach \column in {1, 2,3,4}
{
\draw[->,>=latex] (mat-1-3) -- (mat-2-\column);
\draw[->,>=latex] (mat-3-3) -- (mat-2-\column);
}
\end{tikzpicture}
```
The other configuration is called the \index{No pooling model} *no pooling model*; here, each $y_n$ is assumed to be generated from an independent distribution:
\begin{equation}
y_n \sim \mathit{Normal}(\mu_{subj[n]},\sigma)
\end{equation}
```{r, echo = FALSE, eval = FALSE}
dput(round(runif(20, 50, 200)))
```
Generate simulated observations from the no pooling model in R with arbitrary true values.
```{r}
sigma <- 4
mu_i <- c(156, 178, 95, 183, 147, 191, 67, 153, 129, 119, 195,
          150, 172, 97, 110, 115, 78, 126, 175, 80)
df_np <- mutate(df, y = rnorm(N, mu_i[subj], sigma))
df_np
```
Fit the no pooling model in `brms`. By using the formula `0 + factor(subj)`, the common intercept can be removed and the model is forced to estimate one intercept for each `subj` separately. The column `subj` is converted to a factor so that `brms` does not interpret it as a number.
```{r, message = FALSE, results = "hide"}
fit_np <- brm(y ~ 0 + factor(subj), df_np,
              prior =
                c(prior(normal(0, 200), class = b),
                  prior(normal(2, 5), class = sigma)))
```
The summary now shows the 20 estimates of $\mu_i$, labeled `b_factorsubj`, along with $\sigma$. (We ignore `lp__` and `lprior`.)
```{r}
fit_np %>% posterior_summary()
```
Unlike the hierarchical model, now there is no common distribution that generates the $\mu_i$ parameters. This is illustrated in Figure \@ref(fig:dags2).
(ref:dags2) A directed acyclic graph illustrating a no pooling model.
```{tikz dags2, fig.cap = "(ref:dags2)", out.width = "100%", echo = FALSE, fig.ext = 'png', cache = FALSE, dpi = 1000}
%% independent (no pooling) model:
\begin{tikzpicture}
\matrix[matrix of math nodes, column sep=20pt, row sep=20pt] (mat)
{
& & && & & & & & \\
& \mu_1 & & & \mu_{2} & & \ldots & \mu_{20} & &\\
y_{1} & y_2 & \ldots & y_{50} & y_{51} & \ldots& \ldots & y_{951} & \ldots & y_{1000}\\
& & & &\sigma & & & & & &\\
};
% from mu1
\draw[->,>=latex] (mat-2-2) -- (mat-3-1);
\draw[->,>=latex] (mat-2-2) -- (mat-3-2);
\draw[->,>=latex] (mat-2-2) -- (mat-3-3);
\draw[->,>=latex] (mat-2-2) -- (mat-3-4);
% from mu2
\draw[->,>=latex] (mat-2-5) -- (mat-3-5);
\draw[->,>=latex] (mat-2-5) -- (mat-3-6);
% from ...
\draw[->,>=latex] (mat-2-7) -- (mat-3-7);
\draw[->,>=latex] (mat-2-8) -- (mat-3-8);
\draw[->,>=latex] (mat-2-8) -- (mat-3-9);
\draw[->,>=latex] (mat-2-8) -- (mat-3-10);
\foreach \column in {1, 2, 3,4,5,6, 7,8,9,10}
{
\draw[->,>=latex] (mat-4-5)-- (mat-3-\column);
}
\end{tikzpicture}
```
The hierarchical model lies between these two extremes and for this reason is sometimes called a \index{Partial pooling} *partial pooling model*. One way that the hierarchical model is often described is that the estimates $\mu_i$ "borrow strength" from the parameter $\mu$ (which represents the grand mean in the above example).
An important practical consequence of partial pooling is this idea of "borrowing strength from the mean": when the data from a particular member of a group are very sparse (e.g., missing data from a particular subject $i$), the posterior estimate of that member's $\mu_i$ is determined largely by the posterior on $\mu$. In this sense, even the frequentist hierarchical modeling function in R, `lmer()` from the package `lme4`, is essentially Bayesian in formulation (except, of course, that there is no prior on $\mu$).
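The pull toward $\mu$ can be sketched with the standard conjugate normal result: if we assume, for illustration only, that $\mu$, $\tau$, and $\sigma$ are known (in the full model they are estimated jointly), the conditional posterior mean of $\mu_i$ is a precision-weighted average of subject $i$'s sample mean and the grand mean. The function name `shrunken_mean` below is ours, not part of any package.

```r
# Sketch: conditional posterior mean of mu_i in a normal-normal model,
# assuming mu, tau, and sigma are known.
shrunken_mean <- function(ybar_i, n_i, mu, tau, sigma) {
  w_data  <- n_i / sigma^2  # precision contributed by subject i's data
  w_prior <- 1 / tau^2      # precision of the group-level distribution
  (w_data * ybar_i + w_prior * mu) / (w_data + w_prior)
}
# With 50 observations, the estimate stays close to the subject's mean;
# with a single observation, it is pulled toward mu:
shrunken_mean(ybar_i = 120, n_i = 50, mu = 100, tau = 15, sigma = 4)
shrunken_mean(ybar_i = 120, n_i = 1, mu = 100, tau = 15, sigma = 4)
```

As $n_i$ grows, the data precision dominates and the shrinkage vanishes; as $n_i$ shrinks toward zero, the estimate collapses onto $\mu$.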
So far we have focused on the structure of $\mu$, the location parameter of the likelihood. But we could also have partial pooling, complete pooling, or no pooling with respect to $\sigma$, the scale parameter of the likelihood. More generally, any parameter of a likelihood can have any of these kinds of pooling.
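As an aside, partial pooling on the scale parameter can be expressed in `brms` through its distributional regression syntax, in which `sigma` gets its own formula (modeled on the log scale by default). This is a sketch of the syntax only, not a model we fit in this chapter:

```r
library(brms)
# Sketch: by-subject partial pooling on both the location and the scale.
# sigma is given its own linear predictor (log link by default).
formula_sigma <- bf(y ~ 1 + (1 | subj),
                    sigma ~ 1 + (1 | subj))
# fit_sigma <- brm(formula_sigma, data = df_h, ...)  # priors etc. omitted
```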
In the coming sections, we look at each of these models in more detail, using realistic examples.
## A hierarchical model with a normal likelihood: The N400 effect {#sec-N400hierarchical}
\index{Event-related potentials} Event-related potentials \index{ERPs} (ERPs) allow scientists to observe electrophysiological responses in the brain, measured by means of \index{Electroencephalography} electroencephalography \index{EEG} (EEG), that are time-locked to a specific event (i.e., the presentation of the stimuli). A very robust ERP effect in the study of language is the \index{N400} N400. Words with low predictability are accompanied by an *N400 effect* in comparison with highly predictable words: a relative negativity that peaks around 300-500 ms after word onset over central parietal scalp sites [first reported by @kutasReadingSenselessSentences1980 for semantic anomalies, and by @kutasBrainPotentialsReading1984 for low-predictability words; for a review, see @kutasThirtyYearsCounting2011]. The N400 is illustrated in Figure \@ref(fig:N400noun).
```{r, echo = FALSE}
N400_c <- c("Cz", "CP1", "CP2", "P3", "Pz", "P4", "POz")
```
(ref:N400noun) Typical ERPs for the grand average across the N400 spatial window (central parietal electrodes: `r paste0(N400_c, collapse =", ")`) for high and low \index{Predictability} predictability nouns [specifically from the constraining context of the experiment reported in @NicenboimPreactivation2019]. The x-axis indicates time in seconds and the y-axis indicates voltage in microvolts (unlike many EEG/ERP plots, negative polarity is plotted downwards).
```{r N400noun, results = "hold",fig.height= 3.5, fig.cap = "(ref:N400noun)", echo = FALSE, warning = FALSE}
noun_s <- readRDS("dataR/external/noun_s.RDS") %>%
mutate(
constraint = ifelse(cond %in% c(0, 1),
"Constraining", "Non-constraining"
),
completion = ifelse(cond %in% c(0, 2), "a", "b"),
predictability = case_when(
cond == 0 ~ "high",
cond == 1 ~ "low",
TRUE ~ NA_character_
)
) %>%
filter(
channel == "neg", .time >= -1.6, .time <= 2.2,
constraint == "Constraining"
)
noun_s %>%
ggplot(aes(x = .time, y = mean_s, linetype = predictability)) +
labs(linetype = "Predictability", fill = "", color = "", x = "Time (s)", y = "Amplitude (\u03BCV)") +
scale_x_continuous(limits = c(-.1, .8)) +
coord_cartesian(xlim = c(-.1, .8), clip = "off") +
geom_line(size = .5, na.rm = TRUE) +
geom_vline(xintercept = .3, linetype = "dashed", color = "gray") +
geom_vline(xintercept = .5, linetype = "dashed", color = "gray")
```
For example, in (1) below, the continuation '*dog*' has lower predictability than the continuation '*paint*,' and thus we would expect a more negative signal, that is, an N400 effect, in '*dog*' in (b) in comparison with '*paint*' in (a). Predictability is often measured with a cloze task (see section \@ref(sec-binomialcloze)).
1. Example from @kutasBrainPotentialsReading1984
a. Don't touch the wet paint.
b. Don't touch the wet dog.
The EEG data are typically recorded from tens of electrodes every couple of milliseconds, but for our purposes (i.e., for learning about Bayesian hierarchical models), we can safely ignore the complexity of the data. A common way to simplify high-dimensional EEG data when dealing with the N400 is to focus on the average amplitude of the EEG signal in the typical spatio-temporal window of the N400 [see, for example, @frankERPResponseAmount2015].
For this example, we are going to focus on the N400 effect for critical nouns from a subset of the data of @nieuwlandLargescaleReplicationStudy2018. @nieuwlandLargescaleReplicationStudy2018 presented a replication attempt of an original experiment of @delongProbabilisticWordPreactivation2005 with sentences like (2).
2. Example from @delongProbabilisticWordPreactivation2005
a. The day was breezy so the boy went outside to fly a kite.
b. The day was breezy so the boy went outside to fly an airplane.
We'll ignore the goal of the original experiment [@delongProbabilisticWordPreactivation2005] and of its replication [@nieuwlandLargescaleReplicationStudy2018], and focus instead on the N400 at the final nouns in the experimental stimuli. In example (2), the final noun '*kite*' has higher predictability than '*airplane*,' and thus we would expect a more negative signal for '*airplane*' in (b) than for '*kite*' in (a).
To speed up computation, we restrict the data set of @nieuwlandLargescaleReplicationStudy2018 to the subjects from the Edinburgh lab. This subset of the data can be found in `df_eeg` in the `bcogsci` package. Center the cloze probability before using it as a predictor.
```{r, echo = FALSE}
# hack:
select <- dplyr::select
```
```{r, message = FALSE}
data("df_eeg")
(df_eeg <- df_eeg %>%
   mutate(c_cloze = cloze - mean(cloze)))
# Number of subjects
df_eeg %>%
  distinct(subj) %>%
  count()
```
One convenient aspect of using averages of EEG data is that they are roughly normally distributed. This allows us to use the normal likelihood. Figure \@ref(fig:histn400) shows the distribution of the data.
```{r histn400, fig.cap = "Histogram of the N400 averages for every trial, overlaid with a density plot of a normal distribution.", message = FALSE, fold = TRUE, fig.height = 2.2}
df_eeg %>% ggplot(aes(n400)) +
  geom_histogram(binwidth = 4,
                 colour = "gray",
                 alpha = .5,
                 aes(y = after_stat(density))) +
  stat_function(fun = dnorm,
                args = list(mean = mean(df_eeg$n400),
                            sd = sd(df_eeg$n400))) +
  xlab("Average voltage in microvolts for
       the N400 spatiotemporal window")
```
### \index{Complete pooling model} Complete pooling model ($M_{cp}$) {#sec-Mcp}
We'll start with the simplest model, which is basically the linear regression we encountered in the preceding chapter.
#### Model assumptions
This model, call it $M_{cp}$, makes the following assumptions.
1. The EEG averages for the N400 spatiotemporal window are normally distributed.
2. Observations are independent.
3. There is a linear relationship between cloze and the EEG signal for the trial.
**This model is incorrect for these data due to assumption (2) being violated.**
With the last assumption, we are saying that the difference in the average signal when we compare nouns with cloze probability of $0$ and $0.1$ is the same as the difference in the signal when we compare nouns with cloze values of $0.1$ and $0.2$ (or $0.9$ and $1$). This is just an assumption, and it may not necessarily be the case in the actual data. This means that we are going to get a posterior for $\beta$ conditional on the assumption that the linear relationship holds. Even if the assumption approximately holds, we still don't know how much we deviate from this assumption.
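One rough, purely descriptive way to probe the linearity assumption is to compare the mean signal across bins of cloze probability; the bin width of 0.2 below is an arbitrary choice of ours, and `df_eeg` is the data set loaded earlier from the `bcogsci` package:

```r
library(dplyr)
# Descriptive check: mean N400 average per cloze bin. If the linearity
# assumption is roughly right, the bin means should change steadily
# (rather than, say, jumping only at the extremes).
df_eeg %>%
  mutate(cloze_bin = cut(cloze, breaks = seq(0, 1, by = 0.2),
                         include.lowest = TRUE)) %>%
  group_by(cloze_bin) %>%
  summarize(mean_n400 = mean(n400), n = n())
```

Such a table cannot confirm linearity, but a clearly non-monotonic pattern would be a warning sign.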
We can now decide on a likelihood and priors.
#### Likelihood and priors
A normal likelihood seems reasonable for these data:
\begin{equation}
signal_n \sim \mathit{Normal}( \alpha + c\_cloze_n \cdot \beta,\sigma)
(\#eq:Mcp)
\end{equation}
where $n =1, \ldots, N$, and $signal$ is the dependent variable (average signal in the N400 spatiotemporal window in microvolts). The variable $N$ represents the total number of data points.
As always, we need to rely on our previous knowledge and domain expertise to decide on priors. We know that ERPs (signals time-locked to a stimulus) have mean amplitudes of a couple of microvolts: this is easy to see in any plot in the EEG literature. This means that we don't expect the effect of our manipulation to exceed, say, $10 \mu V$. As before, a priori we'll assume that effects can be negative or positive. We can quantify our prior knowledge regarding plausible values of $\beta$ as normally distributed, centered at zero with a standard deviation of $10 \mu V$. (A smaller value such as $5 \mu V$ would have been equally reasonable, since that standard deviation would entail that 95% of the prior probability mass lies between $-10$ and $10 \mu V$.)
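This kind of interval claim is easy to verify with `qnorm()`:

```r
# The central 95% interval of a Normal(0, 5) prior: roughly -9.8 to 9.8,
# i.e., almost all of the prior mass lies between -10 and 10 microvolts.
qnorm(c(0.025, 0.975), mean = 0, sd = 5)
```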
If the signal for each ERP is *baselined*, that is, the mean signal of a time window before the time window of interest is subtracted from the time window of interest, then the mean signal would be relatively close to $0$. Since we know that the ERPs were baselined in this study, we expect that the grand mean of our signal should be relatively close to zero. Our prior for $\alpha$ is then also normally distributed, and centered at zero with a standard deviation of $10 \mu V$.
The prior for the standard deviation $\sigma$ of our signal distribution requires some thought. We know that EEG signals are quite noisy, and that the standard deviation must be higher than zero. Our prior for $\sigma$ is a truncated normal distribution with location zero and scale 50. Recall that since we truncate the distribution, the parameters location and scale do not correspond to the mean and standard deviation of the new distribution; see online section \@ref(app-truncation).
We can draw random samples from this truncated distribution and calculate their mean and standard deviation:
```{r}
samples <- rtnorm(20000, mean = 0, sd = 50, a = 0)
c(mean = mean(samples), sd = sd(samples))
```
So we are essentially saying that we assume a priori that we will find the true standard deviation of the signal in the following interval with 95% probability:
```{r}
quantile(samples, probs = c(0.025, 0.975))
# Analytically:
# c(qtnorm(0.025, 0, 50, a = 0), qtnorm(0.975, 0, 50, a = 0))
```
To sum up, we are going to use the following priors:
\begin{equation}
\begin{aligned}
\alpha &\sim \mathit{Normal}(0,10)\\
\beta &\sim \mathit{Normal}(0,10)\\
\sigma &\sim \mathit{Normal}_{+}(0,50)
\end{aligned}
\end{equation}
A model such as $M_{cp}$ is sometimes called a \index{Fixed-effects model} *fixed-effects* model: all the parameters are fixed in the sense that they do not vary from subject to subject or from item to item. A similar frequentist model would correspond to fitting a simple linear model using the `lm()` function: `lm(n400 ~ 1 + c_cloze, data = df_eeg)`.
We fit this model in \index{brms} `brms` as follows (the default family is `gaussian()` so we can omit it). As with the `lm()` function in `R`, by default an intercept is fitted and thus `n400 ~ c_cloze` is equivalent to `n400 ~ 1 + c_cloze`:
```{r, message = FALSE, results = "hide"}
fit_N400_cp <-
  brm(n400 ~ c_cloze,
      prior = c(prior(normal(0, 10), class = Intercept),
                prior(normal(0, 10), class = b, coef = c_cloze),
                prior(normal(0, 50), class = sigma)),
      data = df_eeg)
```
For now, check the summary, and plot the posteriors of the model (Figure \@ref(fig:figurefitN400cp)).
```{r, eval = FALSE}
fit_N400_cp
```
```{r, echo = FALSE}
short_summary(fit_N400_cp)
```
(ref:figurefitN400cp) Posterior distributions of the complete pooling model, `fit_N400_cp`.
```{r figurefitN400cp, fig.cap="(ref:figurefitN400cp)",fig.height = 3.5}
plot(fit_N400_cp)
```
### \index{No pooling model} No pooling model ($M_{np}$)
One of the assumptions of the previous model is clearly wrong: observations are not independent; they are clustered by subject (and also by the specific item, but we'll ignore this until section \@ref(sec-mcvivs)). It is reasonable to assume that EEG signals are more similar within subjects than between them. The following model assumes that the subjects are completely independent of each other.^[For simplicity, we assume that they share the same standard deviation.]
#### Model assumptions
1. EEG averages for the N400 spatio-temporal window are normally distributed.
2. Every subject's model is fit independently of the other subjects; the subjects have no parameters in common (an exception is the standard deviation, $\sigma$; this is the same for all subjects in Equation \@ref(eq:nplik)).
3. There is a linear relationship between cloze and the EEG signal for the trial.
What likelihood and priors can we choose here?
#### Likelihood and priors
The likelihood is a normal distribution as before:
\begin{equation}
signal_n \sim \mathit{Normal}( \alpha_{subj[n]} + c\_cloze_n \cdot \beta_{subj[n]},\sigma)
(\#eq:nplik)
\end{equation}
As before, $n$ represents each observation, that is, the $n$th row in the data frame, which has $N$ rows, and now the index $i$ identifies the subject. The notation $subj[n]$, which roughly follows @GelmanHill2007, identifies the subject index; for example, if $subj[10]=3$, then the $10$th row of the data frame is from subject $3$.
We define the priors as follows:
\begin{equation}
\begin{aligned}
\alpha_{i} &\sim \mathit{Normal}(0,10)\\
\beta_{i} &\sim \mathit{Normal}(0,10)\\
\sigma &\sim \mathit{Normal}_+(0,50)
\end{aligned}
\end{equation}
In `brms`, such a model can be fit by removing the common intercept with the formula `n400 ~ 0 + factor(subj) + c_cloze:factor(subj)`.
This formula forces the model to estimate one intercept and one slope for *each* level of `subj`.^[If we don't remove the intercept, that is, if we use the formula `n400 ~ 1 + factor(subj) + c_cloze:factor(subj)`, with `factor(subj)` we are going to estimate the deviation between the first subject and each of the other subjects. This is because the default contrast coding for the subjects, treated as a fixed effect, is treatment contrasts. See chapters \@ref(ch-contr) and \@ref(ch-coding2x2).]
The by-subject intercepts are indicated with `factor(subj)` and the by-subject slopes with `c_cloze:factor(subj)`.
It's very important to specify that `subj` should be treated as a factor and not as a numerical variable; we don't assume that subject number 3 will show an average signal three times more positive (or negative) than subject number 1! The model fits 37 independent intercepts and 37 independent slopes. By setting a prior with `class = b` and omitting `coef`, we are setting identical priors on all the intercepts and slopes of the model. The parameters are independent of each other; it is only our previous knowledge (or prior beliefs) about their possible values (encoded in the priors) that is identical. We could set a different prior for each intercept and slope, but that would mean setting `r 37*2` priors!
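To see concretely what removing the intercept does, we can inspect the design matrix that R builds for each formula on a toy data frame (a base-R sketch; `df_toy` is a hypothetical stand-in for the real data):

```{r}
# Toy data: three subjects, two observations each
df_toy <- data.frame(subj = factor(rep(1:3, each = 2)),
                     c_cloze = c(-.5, 0, .5, -.25, .25, 0))
# Without the common intercept: one indicator column per subject,
# i.e., one intercept (and one slope) is estimated for each subject.
colnames(model.matrix(~ 0 + subj + c_cloze:subj, df_toy))
# With the common intercept: treatment contrasts, so subj2 and
# subj3 are estimated as deviations from subject 1.
colnames(model.matrix(~ 1 + subj + c_cloze:subj, df_toy))
```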
```{r, message = FALSE, results = "hide"}
fit_N400_np <- brm(n400 ~ 0 + factor(subj) + c_cloze:factor(subj),
prior = c(prior(normal(0, 10), class = b),
prior(normal(0, 50), class = sigma)),
data = df_eeg)
```
For this model, printing a summary means printing the `r 37*2 +1` parameters ($\alpha_{1,...,37}$, $\beta_{1,...,37}$, and $\sigma$). We could do this as always by printing out the model results: just type `fit_N400_np`.
It may be easier to understand the output of the model by plotting $\beta_{1,...,37}$ using `bayesplot`. (`brms` also provides a wrapper for `bayesplot` functions called `mcmc_plot()`.) We can take a look at the internal names that `brms` gives to the parameters with `variables(fit_N400_np)`: they consist of `b_factorsubj`, followed by the subject index, and then `:c_cloze`. The code below changes the subject labels back to their original numerical indices and plots them in Figure \@ref(fig:nopooling). The subjects are ordered by the magnitude of their mean effects.
The model $M_{np}$ does not estimate a unique population-level effect; instead, there is a different effect estimated for each subject. However, given the posterior means from each subject, it is still possible to calculate the average of these estimates $\hat\beta_{1,...,I}$, where $I$ is the total number of subjects:
```{r}
# parameter name of beta by subject:
ind_effects_np <- paste0("b_factorsubj",
unique(df_eeg$subj), ":c_cloze")
beta_across_subj <- as.data.frame(fit_N400_np) %>%
#removes the meta data from the object
select(all_of(ind_effects_np)) %>%
rowMeans()
# Calculate the average of these estimates
(grand_av_beta <- tibble(mean = mean(beta_across_subj),
lq = quantile(beta_across_subj, c(0.025)),
hq = quantile(beta_across_subj, c(0.975))))
```
In Figure \@ref(fig:nopooling), the 95% credible interval of this overall mean effect is plotted as two vertical lines together with the effect of cloze probability for each subject (ordered by effect size). Here, rather than using a plotting function from `brms`, we can extract the summary of by-subject effects, reorder them by magnitude, and then plot the summary with a custom plot using `ggplot2`.
```{r message = FALSE}
# make a table of beta's by subject
beta_by_subj <- posterior_summary(fit_N400_np,
variable = ind_effects_np) %>%
as.data.frame() %>%
mutate(subject = 1:n()) %>%
## reorder plot by magnitude of mean:
arrange(Estimate) %>%
mutate(subject = factor(subject, levels = subject))
```
The code below generates Figure \@ref(fig:nopooling).
(ref:nopooling) 95% credible intervals of the effect of cloze probability for each subject according to the no pooling model, `fit_N400_np`. The solid vertical line represents the mean over all the subjects; and the broken vertical lines mark the 95% credible interval for this mean.
```{r nopooling,fig.height=9.5, message=FALSE, fig.cap="(ref:nopooling)"}
ggplot(beta_by_subj,
aes(x = Estimate, xmin = Q2.5, xmax = Q97.5, y = subject)) +
geom_point() +
geom_errorbarh() +
geom_vline(xintercept = grand_av_beta$mean) +
geom_vline(xintercept = grand_av_beta$lq, linetype = "dashed") +
geom_vline(xintercept = grand_av_beta$hq, linetype = "dashed") +
xlab("By-subject effect of cloze probability in microvolts")
```
### \index{Varying intercepts} Varying intercepts and \index{Varying slopes} varying slopes model ($M_{v}$) {#sec-uncorrelated}
One major problem with the no-pooling model is that we fit each subject's data ignoring the information available in the other subjects' data.
Depending on the characteristics of the data set, the Bayesian no-pooling model might either underfit or overfit the individual subjects' data. It will ignore what can be learned from all the data taken together, and might end up in one of these two situations: if there is not enough data for some subjects, it will underfit—relying too much on the priors and learning too little from the data. Alternatively, if there is enough data, it might overfit, that is, it might learn "too much" from the individual subjects (the estimates from some subjects might be quite extreme and unrepresentative), leading us to overinterpret noisy estimates from the subjects' data [Type M error; see @gelmancarlin]. The model can be modified to explicitly assume that the subjects have an overall effect common to all the subjects, with the individual subjects deviating from this common effect.
In the model that we fit next, we will assume that there is an overall effect that is common to the subjects and, importantly, that all subjects' parameters originate from one common (normal) distribution. Under this specification, the posterior estimated for each subject is also informed by what we know about all the subjects together. We begin with a hierarchical model with uncorrelated varying intercepts and slopes. The analogous frequentist model can be fit using `lmer()` from the package `lme4`, using `(1 + c_cloze || subj)` or, equivalently, `(c_cloze || subj)` for the by-subject random effects.
#### Model assumptions
1. EEG averages for the N400 spatio-temporal window are normally distributed.
2. Each subject deviates to some extent (this is made precise below) from the grand mean and from the mean effect of predictability. This implies that there is some between-subject variability in the individual-level intercept and slope adjustments by subject.
3. There is a linear relationship between cloze and the EEG signal.
#### Likelihood and priors
The likelihood now incorporates the assumption that both the intercept and slope are adjusted by subject.
\begin{equation}
signal_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + c\_cloze_n \cdot (\beta+ u_{subj[n],2}),\sigma)
\end{equation}
\begin{equation}
\begin{aligned}
\alpha &\sim \mathit{Normal}(0,10)\\
\beta &\sim \mathit{Normal}(0,10)\\
u_1 &\sim \mathit{Normal}(0,\tau_{u_1})\\
u_2 &\sim \mathit{Normal}(0,\tau_{u_2})\\
\tau_{u_1} &\sim \mathit{Normal}_+(0,20) \\
\tau_{u_2} &\sim \mathit{Normal}_+(0,20) \\
\sigma &\sim \mathit{Normal}_+(0,50)
\end{aligned}
\end{equation}
In this model each subject has their own intercept adjustment, $u_{subj,1}$, and slope adjustment, $u_{subj,2}$.^[The intercept adjustment is often called $u_0$ in statistics books, where the intercept might be called $\alpha$ (or sometimes also $\beta_0$), and thus $u_1$ refers to the adjustment to the slope. However, in this book, we start the indexing with 1 to be consistent with the Stan language.] If $u_{subj,1}$ is positive, the subject will have a more positive EEG signal than the grand mean average. If $u_{subj,2}$ is positive, the subject will have a more positive EEG response to a change of one unit in `c_cloze` than the overall mean effect (i.e., there will be a more positive effect of cloze probability on the N400). The parameters $u$ are sometimes called \index{Random effects} random effects and thus a model with \index{Fixed effects} fixed effects ($\alpha$ and $\beta$) and random effects is called a \index{Mixed model} mixed model. However, random effects have different meanings in different contexts. To avoid ambiguity, `brms` calls these random-effects parameters \index{Group-level effects} *group-level* effects. Since we are estimating $\alpha$ and $u$ at the same time and we assume that the average of the $u$'s is $0$ (since it is assumed to be normally distributed with mean $0$), what is common between the subjects, the grand mean, is estimated as the intercept $\alpha$, and the deviations of individual subjects' means from this grand mean are the adjustments $u_1$. Similarly, the mean effect of cloze is estimated as $\beta$, and the deviations of individual subjects' mean effects of cloze from $\beta$ are the adjustment $u_2$. The standard deviations of these two adjustment terms, $\tau_{u_1}$ and $\tau_{u_2}$, respectively, represent between subject variability.
Thus, the model $M_{v}$ has three *standard deviations*: $\sigma$, $\tau_{u_1}$ and $\tau_{u_2}$. In statistics, it is conventional to talk about variances (the square of these standard deviations); for this reason, these standard deviations are also (confusingly) called \index{Variance component} *variance components*. The variance components $\tau_{u_1}$ and $\tau_{u_2}$ characterize \index{Between-subject variability} between-subject variability, and the variance component $\sigma$ characterizes \index{Within-subject variability} within-subject variability.
The by-subject adjustments $u_1$ and $u_2$ are parameters in the model, and therefore have priors defined on them. By contrast, in the frequentist `lmer` model, the adjustments $u_1$ and $u_2$ are not parameters; they are called conditional modes; see @R-lme4.
Parameters that appear in the prior specifications for parameters, such as $\tau_u$, are often called \index{Hyperparameter} *hyperparameters*,^[Another potential source of confusion here is that *hyperparameters* is also used in the machine learning literature with a different meaning.] and the priors on such hyperparameters are called \index{Hyperprior} *hyperpriors*. Thus, the parameter $u_1$ has $\mathit{Normal}(0,\tau_{u_1})$ as a prior; $\tau_{u_1}$ is a hyperparameter, and the hyperprior on $\tau_{u_1}$ is $\mathit{Normal}(0,20)$.^[One could in theory keep going deeper and deeper, defining hyper-hyperpriors etc., but the model would quickly become impossible to fit.]
We know that in general, in EEG experiments, the standard deviations for the by-subject adjustments are smaller than the standard deviation of the observations (which is the within-subjects standard deviation). That is, usually the between-subject variability in the intercepts and slopes is smaller than the within-subjects variability in the data. For this reason, reducing the scale of the truncated normal distribution to $20$ (in comparison to $50$) seems reasonable for the priors of the $\tau$ parameters. As always, we can do a sensitivity analysis to verify that our priors are reasonably uninformative (if we intended them to be uninformative).
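To make the generative assumptions of this model concrete, we can simulate data from it; all the parameter values below are invented for illustration and are not estimates from `df_eeg`:

```{r}
set.seed(123)
N_subj <- 37
N_trials <- 100
alpha <- 3; beta <- -2   # population-level intercept and slope
tau_u1 <- 2; tau_u2 <- 1 # between-subject standard deviations
sigma <- 12              # within-subject (residual) standard deviation
# By-subject adjustments, drawn from Normal(0, tau):
u_1 <- rnorm(N_subj, 0, tau_u1)
u_2 <- rnorm(N_subj, 0, tau_u2)
sim <- expand.grid(subj = 1:N_subj, trial = 1:N_trials)
sim$c_cloze <- runif(nrow(sim)) - .5 # centered predictor
# Each observation combines the population-level parameters with
# the adjustments of the subject it belongs to:
sim$signal <- rnorm(nrow(sim),
                    mean = alpha + u_1[sim$subj] +
                      sim$c_cloze * (beta + u_2[sim$subj]),
                    sd = sigma)
```

Fitting the model to data simulated in this way should recover parameter values in the neighborhood of the ones used to generate it; this kind of check is one way to validate priors and model code.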
For now, we are assuming that there is no relationship (no correlation) between the by-subject intercept and slope adjustments $u_1$ and $u_2$; this lack of correlation is indicated in `brms` using the double pipe `||`. The double pipe is also used in the same way in `lmer()` from the package `lme4` (in fact `brms` bases its syntax on that of the `lme4` package).
In \index{brms} `brms`, we need to specify hyperpriors for $\tau_{u_1}$ and $\tau_{u_2}$; these are of class `sd` in `brms`, to distinguish these standard deviations from the standard deviation of the residuals, $\sigma$. As with the population-level effects, the by-subject intercept adjustments are implicitly included in the group-level effects, and thus `(c_cloze || subj)` is equivalent to `(1 + c_cloze || subj)`. If we don't want varying intercepts, we need to indicate this explicitly with `(0 + c_cloze || subj)` or `(-1 + c_cloze || subj)`. Such a removal of the intercept is not normally done.
```{r, message = FALSE, results = "hide"}
prior_v <-
c(prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b, coef = c_cloze),
prior(normal(0, 50), class = sigma),
prior(normal(0, 20), class = sd, coef = Intercept, group = subj),
prior(normal(0, 20), class = sd, coef = c_cloze, group = subj))
fit_N400_v <- brm(n400 ~ c_cloze + (c_cloze || subj),
prior = prior_v,
data = df_eeg)
```
When we print a `brms` fit, we first see the summaries of the posteriors of the standard deviations of the by-subject intercept and slope adjustments, $\tau_{u_1}$ and $\tau_{u_2}$, as `sd(Intercept)` and `sd(c_cloze)`; then, as with previous models, the population-level effects, $\alpha$ and $\beta$, as `Intercept` and `c_cloze`; and the scale of the likelihood, $\sigma$, as `sigma`. The full summary can be printed out by typing:
```{r eval=FALSE}
fit_N400_v
```
Because the above command will result in several pages of output, it is easier to understand the summary graphically (Figure \@ref(fig:plotfitN400v)). Rather than the wrapper `plot()`, we use the original function from the package \index{\texttt{bayesplot}} `bayesplot`, \index{\texttt{mcmc\_dens}} `mcmc_dens()`, to show only density plots. We extract the first five parameters of the model with `variables(fit_N400_v)[1:5]`.
(ref:plotfitN400v) Posterior distributions of the parameters in the model `fit_N400_v`.
```{r plotfitN400v,fig.cap="(ref:plotfitN400v)", fig.height = 3.5}
mcmc_dens(fit_N400_v, pars = variables(fit_N400_v)[1:5])
```
Because we estimated how the population-level effect of cloze is adjusted for each subject, we can examine how each subject is affected by the manipulation. Bear in mind that these are the adjustments, $u_{1,2}, u_{2,2}, \ldots, u_{37,2}$, and not the effects of the manipulation by subject, $\beta + [u_{1,2}, u_{2,2}, \ldots, u_{37,2}]$. The code below produces Figure \@ref(fig:partialpooling).
(ref:partialpooling) 95% credible intervals of adjustments to the effect of cloze probability for each subject ($u_{1..37,2}$) according to the varying intercept and varying slopes model, `fit_N400_v`. To obtain the effect of cloze probability for each subject, we would need to add the estimate of $\beta$ to each adjustment.
```{r partialpooling, fig.cap = "(ref:partialpooling)", fig.height=9.5,message=FALSE,warning=FALSE}
# make a table of u_2s
ind_effects_v <- paste0("r_subj[", unique(df_eeg$subj),
",c_cloze]")
u_2_v <- posterior_summary(fit_N400_v, variable = ind_effects_v) %>%
as_tibble() %>%
mutate(subj = 1:n()) %>%
## reorder plot by magnitude of mean:
arrange(Estimate) %>%
mutate(subj = factor(subj, levels = subj))
# We plot:
ggplot(u_2_v,
aes(x = Estimate, xmin = Q2.5, xmax = Q97.5, y = subj)) +
geom_point() +
geom_errorbarh() +
xlab("By-subject adjustment to the slope in microvolts")
```
There is an important difference between the no-pooling model and the varying intercepts and slopes model we just fit. The no-pooling model fits each subject's intercept and slope independently of the other subjects' data. By contrast, the varying intercepts and slopes model takes *all* the subjects' data into account in order to compute the fixed effects $\alpha$ and $\beta$; and the model "shrinks" [@pinheirobates] the by-subject intercept and slope adjustments towards the fixed effects estimates. In Figure \@ref(fig:comparison), we can see the \index{Shrinkage} shrinkage of the estimates in the varying intercepts model by comparing them with the estimates of the no pooling model ($M_{np}$).
```{r, fold = TRUE}
# Extract parameter estimates from the no pooling model:
par_np <- posterior_summary(fit_N400_np, variable = ind_effects_np) %>%
as_tibble() %>%
mutate(model = "No pooling",
subj = unique(df_eeg$subj))
# For the hierarchical model, the code is more complicated
# because we want the effect (beta) + adjustment.
# Extract the overall group level effect:
beta <- c(as_draws_df(fit_N400_v)$b_c_cloze)
# Extract the individual adjustments:
ind_effects_v <- paste0("r_subj[", unique(df_eeg$subj), ",c_cloze]")
adjustment <- as_draws_matrix(fit_N400_v, variable = ind_effects_v)
# Get the by subject effects in a data frame where each adjustment
# is in each column.
# Remove all the draws meta data by using as.data.frame
by_subj_effect <- as.data.frame(beta + adjustment)
# Summarize them by getting a table with the mean and the
# quantiles for each column and then binding them.
par_h <- lapply(by_subj_effect, function(x) {
tibble(Estimate = mean(x),
Q2.5 = quantile(x, .025),
Q97.5 = quantile(x, .975))}) %>%
bind_rows() %>%
# Add a column to identify that the model,
# and one with the subject labels:
mutate(model = "Hierarchical",
subj = unique(df_eeg$subj))
# The mean and 95% CI of both models in one data frame:
by_subj_df <- bind_rows(par_h, par_np) %>%
arrange(Estimate) %>%
mutate(subj = factor(subj, levels = unique(.data$subj)))
```
(ref:comparison) This plot compares the estimates of the effect of cloze probability for each subject between (i) the no pooling, `fit_N400_np` and (ii) the varying intercepts and varying slopes, hierarchical, model, `fit_N400_v`.
```{r comparison, message=FALSE, fig.height=9.5, fig.cap= "(ref:comparison)"}
b_c_cloze <- posterior_summary(fit_N400_v, variable = "b_c_cloze")
ggplot(by_subj_df,
aes(ymin = Q2.5, ymax = Q97.5, x = subj,
y = Estimate, color = model, shape = model)) +
geom_errorbar(position = position_dodge(1)) +
geom_point(position = position_dodge(1)) +
# We'll also add the mean and 95% CrI of the overall difference
# to the plot:
geom_hline(yintercept = b_c_cloze[, "Estimate"]) +
geom_hline(yintercept = b_c_cloze[, "Q2.5"],
linetype = "dotted", linewidth = .5) +
geom_hline(yintercept = b_c_cloze[, "Q97.5"],
linetype = "dotted", linewidth = .5) +
xlab("N400 effect of predictability") +
coord_flip()
```
### Correlated varying intercept varying slopes model ($M_{h}$) {#sec-mcvivs}
The model $M_{v}$ allowed for differences in intercepts (mean voltage) and slopes (effects of cloze) across subjects, but it has the implicit assumption that these varying intercepts and varying slopes are independent. It is in principle possible that subjects showing more negative voltage may also show stronger effects (or weaker effects). Next, we fit a model that allows a correlation between the intercepts and slopes. We model the \index{Correlation between varying intercepts and slopes} correlation between varying intercepts and slopes by defining a variance-covariance matrix $\boldsymbol{\Sigma}$ between the by-subject varying intercepts and slopes, and by assuming that both adjustments (intercept and slope) come from a \index{Multivariate} multivariate (in this case, a bivariate) \index{Normal distribution} normal distribution.
* In $M_h$, we model the EEG data with the following assumptions:
1. EEG averages for the N400 spatio-temporal window are normally distributed.
2. Some aspects of the mean signal voltage and of the effect of predictability depend on the subject, and these two might be correlated, i.e., we assume group-level intercepts and slopes, and allow a correlation between them by-subject.
3. There is a linear relationship between cloze and the EEG signal for the trial.
The likelihood remains identical to the model $M_v$, which assumes no correlation between group-level intercepts and slopes (section \@ref(sec-uncorrelated)):
\begin{equation}
signal_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + c\_cloze_n \cdot (\beta + u_{subj[n],2}),\sigma)
\end{equation}
The correlation is indicated in the priors on the adjustments for intercept $u_{1}$ and slopes $u_{2}$.
* Priors:
\begin{equation}
\begin{aligned}
\alpha & \sim \mathit{Normal}(0,10) \\
\beta & \sim \mathit{Normal}(0,10) \\
\sigma &\sim \mathit{Normal}_+(0,50)\\
{\begin{pmatrix}
u_{i,1} \\
u_{i,2}
\end{pmatrix}}
&\sim {\mathcal {N}}
\left(
{\begin{pmatrix}
0\\
0
\end{pmatrix}}
,\boldsymbol{\Sigma_u} \right)
\end{aligned}
\end{equation}
In this model, a \index{Bivariate normal distribution} bivariate normal distribution generates the varying intercepts and varying slopes $\mathbf{u}$; this is an $n\times 2$ matrix. The \index{Variance-covariance matrix} variance-covariance matrix $\boldsymbol{\Sigma_u}$ defines the standard deviations of the varying intercepts and varying slopes, and the correlation between them. Recall from section \@ref(sec-contbivar) that the diagonals of the variance-covariance matrix contain the variances of the correlated random variables, and the off-diagonals contain the \index{Covariance} covariances. In this example, the covariance $Cov(u_1,u_2)$ between two variables $u_1$ and $u_2$ is defined as the product of their \index{Correlation} correlation $\rho$ and their standard deviations $\tau_{u_1}$ and $\tau_{u_2}$. In other words, $Cov(u_1,u_2) = \rho_u \tau_{u_1} \tau_{u_2}$.
\begin{equation}
\boldsymbol{\Sigma_u} =
{\begin{pmatrix}
\tau_{u_1}^2 & \rho_u \tau_{u_1} \tau_{u_2} \\
\rho_u \tau_{u_1} \tau_{u_2} & \tau_{u_2}^2
\end{pmatrix}}
\end{equation}
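The relationship between the standard deviations, the correlation, and $\boldsymbol{\Sigma_u}$ can be sketched in R, using `mvrnorm()` from the `MASS` package to draw correlated adjustments (the values of $\tau$ and $\rho$ below are made up for illustration):

```{r}
library(MASS)
tau_u1 <- 2; tau_u2 <- 1; rho_u <- .4 # illustrative values
# Build the variance-covariance matrix from its definition:
Sigma_u <- matrix(c(tau_u1^2, rho_u * tau_u1 * tau_u2,
                    rho_u * tau_u1 * tau_u2, tau_u2^2),
                  ncol = 2)
# Draw intercept and slope adjustments for 37 subjects:
u <- mvrnorm(37, mu = c(0, 0), Sigma = Sigma_u)
# Column 1 holds u_{i,1} (intercepts), column 2 holds u_{i,2}
# (slopes); with many draws, cor(u[, 1], u[, 2]) approaches rho_u.
```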
In order to specify a prior for $\Sigma_u$, we need priors for the standard deviations, $\tau_{u_1}$, and $\tau_{u_2}$, and also for their correlation, $\rho_u$. We can use the same priors for $\tau$ as before. For the correlation matrix that contains $\rho_u$, we use the \index{LKJ prior} LKJ prior. The basic idea of the LKJ prior on the correlation matrix is that as its parameter, $\eta$ (*eta*), increases, it will favor correlations closer to zero.^[This is because an LKJ correlation distribution with a large $\eta$ corresponds to a correlation matrix with values close to zero in the lower and upper triangles.] At $\eta = 1$, the LKJ correlation distribution is uninformative (if there is a single correlation parameter, this is similar to $\mathit{Beta}(1,1)$, scaled to the interval $[-1,1]$), at $\eta < 1$, it favors extreme correlations (similar to $\mathit{Beta}(a<1,b<1)$). We set $\eta = 2$ so that we don't favor extreme correlations, and we still represent our lack of knowledge through the wide spread of the prior between $-1$ and $1$. Thus, $\eta = 2$ gives us a regularizing, relatively uninformative or mildly informative prior.
Figure \@ref(fig:lkjviz) shows a visualization of different parameterizations of the LKJ prior.
(ref:lkjviz) Visualization of the LKJ correlation distribution prior with four different values of the $\eta$ parameter.
```{r lkjviz,echo=FALSE, fig.cap ="(ref:lkjviz)", message= FALSE,warning=FALSE,results="asis",cache=TRUE, fig.width =3, fig.height =3,fig.show='hold', out.width='48%'}
## https://github.com/rmcelreath/rethinking/blob/1def057174071beb212532d545bc2d8c559760a2/R/distributions.r
# onion method correlation matrix
dlkjcorr <- function(x, eta = 1, log = FALSE) {
  # unnormalized LKJ density: proportional to det(R)^(eta - 1)
  ll <- (eta - 1) * log(det(x))
  if (log == FALSE) ll <- exp(ll)
  return(ll)
}
# Simplified for a 2 x 2 matrix
dlkjcorr2 <- function(rho, eta = 1) {
map_dbl(rho, ~ matrix(c(1, .x, .x, 1), ncol = 2) %>%
dlkjcorr(., eta))
}
ggplot(tibble(rho = c(-0.99, 0.99)), aes(rho)) +
stat_function(fun = dlkjcorr2, geom = "line", args = list(eta = 1)) +
ylab("density") +
ggtitle("eta = 1")
ggplot(tibble(rho = c(-0.99, 0.99)), aes(rho)) +
stat_function(fun = dlkjcorr2, geom = "line", args = list(eta = 2)) +
ylab("density") +
ggtitle("eta = 2")
ggplot(tibble(rho = c(-0.99, 0.99)), aes(rho)) +
stat_function(fun = dlkjcorr2, geom = "line", args = list(eta = 4)) +
ylab("density") +
ggtitle("eta = 4")
ggplot(tibble(rho = c(-0.99, 0.99)), aes(rho)) +
stat_function(fun = dlkjcorr2, geom = "line", args = list(eta = .9)) +
ylab("density") +
ggtitle("eta = .9")
```
\begin{equation}
\begin{aligned}
\tau_{u_1} &\sim \mathit{Normal}_+(0,20)\\
\tau_{u_2} &\sim \mathit{Normal}_+(0,20)\\
\begin{bmatrix}
1 & \rho_u \\
\rho_u & 1
\end{bmatrix} &\sim \mathit{LKJcorr}(2)
\end{aligned}
\end{equation}
In our \index{brms} `brms` model, we allow a correlation between the by-subject intercepts and slopes by using a single pipe `|` instead of the double pipe `||` that we used previously. This convention follows the syntax used in the frequentist `lmer()` function. As before, the varying intercepts are implicitly fit.
Because we have a new parameter, the correlation $\rho_{u}$, we need to add a new prior for this correlation. In `brms`, this is achieved by adding a prior for the parameter type `cor`. This is not obvious from the `brms` syntax, but the LKJ prior is being defined on the correlation matrix, not the correlation $\rho_{u}$.
```{r, message = FALSE, results = "hide"}
prior_h <- c(prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b, coef = c_cloze),
prior(normal(0, 50), class = sigma),
prior(normal(0, 20),
class = sd, coef = Intercept,
group = subj),
prior(normal(0, 20),
class = sd, coef = c_cloze,
group = subj),
prior(lkj(2), class = cor, group = subj))
fit_N400_h <- brm(n400 ~ c_cloze + (c_cloze | subj),
prior = prior_h,
data = df_eeg)
```
The estimates do not change much in comparison with the varying intercept and varying slope model, probably because the estimation of the correlation is quite poor (i.e., there is a lot of uncertainty in the posterior). While the inclusion of the correlation parameter is justified, in our experience it results in slower convergence of the models (see also the discussion at the end of section \@ref(sec-distrmodel)).
As before, the estimates are shown graphically, in Figure \@ref(fig:plotfitN400h). As always, one can access the complete summary by typing `fit_N400_h`.
(ref:plotfitN400h) The posteriors of the parameters in the model `fit_N400_h`.
```{r plotfitN400h,fig.cap="(ref:plotfitN400h)", fig.height = 6.5}
plot(fit_N400_h, nvariables = 6)
```
We are now half-way to what is sometimes called the \index{Maximal hierarchical model} "maximal" hierarchical model [@barr2013]. This usually refers to a model with all the by-subject and by-item group-level variance components *allowed by the experimental design* and *a full variance-covariance matrix* for all the group-level parameters. Not all variance components are allowed by the experimental design; in particular, between-subject manipulations cannot have by-subject variance components. For example, even if we assume that the working memory capacity of the subjects might affect the N400, we cannot estimate a by-subject adjustment for the effect of working memory, because working memory capacity does not vary within a subject.
When we refer to a full variance-covariance matrix, we mean a variance-covariance matrix where all the elements (variances and covariances) are non-zero. In our previous model, for example, the variance-covariance matrix $\boldsymbol{\Sigma_u}$ was full because no element was zero. If we assume no correlation between group-level intercepts and slopes, the off-diagonal elements of the matrix would be zero, and this would render the model identical to $M_{v}$ defined in section \@ref(sec-uncorrelated). If we further assume that the bottom right element ($\tau_{u_2}^2$) is zero, the model would turn into a \index{Varying intercept model} varying intercept model (in `brms`, the formula `n400 ~ c_cloze + (1 | subj)`); and if we assume that the variance-covariance matrix has zeros in every cell, the model would turn into the complete pooling model, $M_{cp}$, defined in section \@ref(sec-Mcp).
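This nesting of models can be summarized by their `brms` formulas (written out here only for comparison, not fit):

```{r}
f_h  <- n400 ~ c_cloze + (c_cloze | subj)  # full covariance matrix
f_v  <- n400 ~ c_cloze + (c_cloze || subj) # covariances fixed to zero
f_vi <- n400 ~ c_cloze + (1 | subj)        # varying intercepts only
f_cp <- n400 ~ c_cloze                     # complete pooling
```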
As we will see in section \@ref(sec-distrmodel) and in the online chapter \@ref(ch-workflow), "maximal" is probably a misnomer for Bayesian models, since this mostly refers to limitations of the popular frequentist package for fitting models, `lme4`.
The next section spells out a model with full variance-covariance matrix for both subjects and items-level effects.
### By-subjects and by-items correlated varying intercept varying slopes model ($M_{sih}$) {#sec-sih}
Our new model, $M_{sih}$, will allow for differences in intercepts (mean voltage) and slopes (effects of predictability) across subjects and across items. In typical Latin square designs, subjects and items are said to be \index{Crossed random effects} *crossed random effects*---each subject sees exactly one instance of each item. Here we assume a possible \index{Correlation} correlation between varying intercepts and slopes by subjects, as well as a correlation between varying intercepts and slopes by items.
* In $M_{sih}$, we model the EEG data with the following assumptions:
1. EEG averages for the N400 spatio-temporal window are normally distributed.
2. Some aspects of the mean signal voltage and of the effect of predictability depend on the subject, i.e., we assume group-level intercepts and slopes, and a correlation between them by subject.
3. Some aspects of the mean signal voltage and of the effect of predictability depend on the item, i.e., we assume group-level intercepts and slopes, and a correlation between them by item.
4. There is a linear relationship between cloze and the EEG signal for the trial.
* Likelihood:
\begin{multline}
signal_n \sim \mathit{Normal}(\alpha + u_{subj[n],1} + w_{item[n],1} + \\ c\_cloze_n \cdot (\beta + u_{subj[n],2}+ w_{item[n],2}), \sigma)
\end{multline}
* Priors:
\begin{equation}
\begin{aligned}
\alpha & \sim \mathit{Normal}(0,10) \\
\beta & \sim \mathit{Normal}(0,10) \\
\sigma &\sim \mathit{Normal}_+(0,50)\\
{\begin{pmatrix}
u_{i,1} \\
u_{i,2}
\end{pmatrix}}
&\sim {\mathcal {N}}
\left(
{\begin{pmatrix}
0\\
0
\end{pmatrix}}
,\boldsymbol{\Sigma_u} \right) \\
{\begin{pmatrix}
w_{j,1} \\
w_{j,2}
\end{pmatrix}}
&\sim {\mathcal {N}}
\left(
{\begin{pmatrix}
0\\
0
\end{pmatrix}}
,\boldsymbol{\Sigma_w} \right)
\end{aligned}
\end{equation}
The index $j$ represents the item id; this is analogous to the index $i$ for subjects. Similarly, $item[n]$ indicates the item that corresponds to the observation in the $n$-th row of the data frame.
We have hyperparameters and hyperpriors as before:
\begin{equation}
\begin{aligned}
\boldsymbol{\Sigma_u} & =
{\begin{pmatrix}
\tau_{u_1}^2 & \rho_u \tau_{u_1} \tau_{u_2} \\
\rho_u \tau_{u_1} \tau_{u_2} & \tau_{u_2}^2
\end{pmatrix}}\\
\boldsymbol{\Sigma_w} & =
{\begin{pmatrix}
\tau_{w_1}^2 & \rho_w \tau_{w_1} \tau_{w_2} \\
\rho_w \tau_{w_1} \tau_{w_2} & \tau_{w_2}^2
\end{pmatrix}}
\end{aligned}
\end{equation}
\begin{equation}
\begin{aligned}
\tau_{u_1} &\sim \mathit{Normal}_+(0,20)\\
\tau_{u_2} &\sim \mathit{Normal}_+(0,20)\\
\begin{bmatrix}
1 & \rho_u \\
\rho_u & 1
\end{bmatrix} &\sim \mathit{LKJcorr}(2)\\
\tau_{w_1} &\sim \mathit{Normal}_+(0,20)\\
\tau_{w_2} &\sim \mathit{Normal}_+(0,20)\\
\begin{bmatrix}
1 & \rho_w \\
\rho_w & 1
\end{bmatrix} &\sim \mathit{LKJcorr}(2)\\
\end{aligned}
\end{equation}
We set identical priors for by-items group-level effects as for by-subject group-level effects, but this is only because we don't have any differentiated prior information about subject-level vs. item-level variation. However, bear in mind that the estimation for items is completely independent from the estimation for subjects. Although we wrote many more equations than before, the `brms` model is quite straightforward to extend.
We assign the priors to the by-subject and by-item parameters first:
```{r, eval = FALSE}
prior_sih_full <-
c(prior(normal(0, 10), class = Intercept),
prior(normal(0, 10), class = b, coef = c_cloze),
prior(normal(0, 50), class = sigma),
prior(normal(0, 20),
class = sd, coef = Intercept,
group = subj),
prior(normal(0, 20),
class = sd, coef = c_cloze,
group = subj),
prior(lkj(2), class = cor, group = subj),
prior(normal(0, 20),
class = sd, coef = Intercept,
group = item),
prior(normal(0, 20),
class = sd, coef = c_cloze,