# Mixture models {#ch-mixture}
\index{Mixture model} Mixture models integrate \index{Multiple data generating process} multiple data generating processes into a single model. This is especially useful in cases where the data alone don't allow us to fully identify which observations belong to which process.
Mixture models are important in cognitive science because many theories of cognition assume that the behavior of subjects in certain tasks is determined by an interplay of different cognitive processes [e.g., response times in schizophrenia in @levy1993eye; retrieval from memory in sentence processing in @Mcelree2000; @nicenboimModelsRetrievalSentence2018; fast choices in @Ollman1966; @DutilhEtAl2011; generalized processing tree models in @HeckEtAl2018GeneralizedProcessingTree]. It is important to stress that a mixture distribution of observations is an *assumption*, based on a given theory, about the latent process unfolding trial by trial---it doesn't necessarily represent the true generative process. The role of Bayesian modeling is to help us understand the extent to which this assumption is well-founded, by using posterior predictive checks and by comparing different models.
We focus here on the case where we have only two components; each component represents a distinct cognitive process based on the domain knowledge of the researcher. The vector $\mathbf{z}$ serves as a latent \index{Indicator variable} *indicator variable* that indicates which of the mixture components an observation $y_n$ belongs to ($n=1,\dots,N$, where $N$ is the number of data points). We assume two components, and thus each $z_n$ can be either $0$ or $1$ (this allows us to generate $z_n$ with a \index{Bernoulli distribution} Bernoulli distribution). We also assume two different generative processes, $p_1$ and $p_2$, which generate different distributions of the observations based on the vectors of parameters $\Theta_{1}$ and $\Theta_{2}$, respectively. These two processes occur with probability $\theta$ and $1-\theta$, and each observation is generated as follows:
\begin{equation}
\begin{aligned}
z_n \sim \mathit{Bernoulli}(\theta)\\
y_n \sim
\begin{cases}
p_1(\Theta_1), & \text{ if } z_n =1 \\
p_2(\Theta_2), & \text{ if } z_n=0
\end{cases}
\end{aligned}
(\#eq:mixz)
\end{equation}
We focus on only two components because this type of model is already hard to fit and, as we show in this chapter, it requires plenty of prior information to be able to sample from the posterior in most applied situations. However, the approach presented here can in principle be extended to a larger number of mixture components by replacing the Bernoulli distribution with a categorical one. This can be done as long as the number of components in the mixture is finite; the number of components is determined by the researcher.
In order to fit this model, we need to estimate the posterior of each of the parameters contained in the vectors $\Theta_{1}$ and $\Theta_{2}$ (intercepts, slopes, group-level effects, etc.), the probability $\theta$, and the indicator variable $z_n$ that corresponds to each observation. One issue that presents itself here is that $z_n$ must be a \index{Discrete parameter} discrete parameter, and Stan only allows \index{Continuous parameter} continuous parameters. This is because Stan's algorithm requires the \index{Derivative} derivatives of the (log) posterior distribution with respect to all parameters, and discrete parameters are not \index{Differentiable} differentiable (since they have "breaks"). Probabilistic programming languages like WinBUGS [@lunn2012bugs], JAGS [@plummer2016jags], PyMC3 [@Salvatier2016], and Turing [@turing] allow discrete parameters, but Stan does not. In Stan, we can circumvent this issue by marginalizing out the indicator variable $z$.^[See section \@ref(sec-marginalizing) in chapter \@ref(ch-intro) for a review of the concept of marginalization.] If $p_1$ appears in the mixture with probability $\theta$, and $p_2$ with probability $1-\theta$, then the joint likelihood is defined as a function of $\Theta$ (which concatenates the mixing probability, $\theta$, and the parameters of $p_{1}$ and $p_{2}$, namely $\Theta_1$ and $\Theta_2$), and importantly $z_n$ "disappears":
\begin{equation}
p(y_n | \Theta) = \theta \cdot p_1(y_n| \Theta_1) + (1-\theta) \cdot p_2(y_n | \Theta_2)
\end{equation}
The intuition behind this formula is that each \index{Likelihood function} likelihood function, $p_1$ and $p_2$, is weighted by the probability of its being the relevant generative process. For our purposes, it suffices to say that marginalization works; the reader interested in the mathematics behind marginalization is directed to the further reading section at the end of the chapter.^[As mentioned above, other probabilistic languages that do not rely (exclusively) on Hamiltonian dynamics are able to deal with this. However, even when sampling discrete parameters is possible, marginalization is more efficient [@YackulicEtAl2020]: when the $z_n$ are omitted, we are fitting a model with $N$ fewer parameters.]
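As a concrete (toy) illustration of this marginalization, the following `R` sketch computes the likelihood of a single observation as the weighted sum of two component densities; the normal components and all parameter values here are arbitrary and only serve to illustrate the formula:
```{r}
theta <- 0.7   # mixing probability of the first component
y_1 <- 250     # a single (hypothetical) observation
# Marginalized likelihood: weighted sum of the component densities
theta * dnorm(y_1, mean = 300, sd = 50) +
  (1 - theta) * dnorm(y_1, mean = 200, sd = 30)
```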
Even though Stan cannot fit a model with the discrete indicator of the latent class $\mathbf{z}$ that we used in Equation \@ref(eq:mixz), this equation will prove very useful when we want to generate \index{Synthetic data} synthetic data.
In the following sections, we model a well-known phenomenon (i.e., the speed-accuracy trade-off) assuming an underlying \index{Finite mixture process} finite mixture process. We start from the verbal description of the model, and then implement the model step by step in Stan.
## A mixture model of the speed-accuracy trade-off: The fast-guess model account
When we are faced with multiple choices that require an immediate decision, we can speed up the decision at the expense of accuracy and become more accurate at the expense of speed; this is called the \index{Speed-accuracy trade-off} speed-accuracy trade-off [@wickelgren1977speed]. The most popular class of models that can incorporate both response times and accuracy, and give an account for the speed-accuracy trade-off is the class of \index{Sequential sampling model} sequential sampling models, which include the drift diffusion model [@Ratcliff1978], the linear ballistic accumulator [@brownSimplestCompleteModel2008], and the log-normal race model [@HeathcoteLove2012; @RouderEtAl2015], which we discuss in chapter \@ref(ch-lognormalrace); for a review see @Ratcliff2016.
However, an alternative model that has been proposed in the past is Ollman's simple \index{Fast-guess model} fast-guess model [@Ollman1966; @Yellott1967; @Yellott1971].^[Ollman's original model was meant to be relevant only for means; Yellott [-@Yellott1967; -@Yellott1971] generalized it to a distributional form.] Although it has mostly fallen out of favor [but see @DutilhEtAl2011; and @HeckErdfelder2020Benefitsresponsetime for more modern variants of this model], it presents a very simple framework using finite mixture modeling that can also account for the speed-accuracy trade-off. In the next sections, we'll use this model to exemplify the use of finite mixtures to represent different cognitive processes.
### The global motion detection task
One way to examine the behavior of human and primate subjects when faced with two-alternative forced choices is the detection of the global motion of a \index{Random dot kinematogram} random dot kinematogram [@britten_shadlen_newsome_movshon_1993]. In this task, a subject sees a number of \index{Random dots} random dots on the screen. A proportion of dots move in a single direction (e.g., right) and the rest move in random directions. The subject's goal is to estimate the overall direction of the movement. One of the reasons for the popularity of this task is that it permits the fine-tuning of the difficulty of trials [@Dutilh2019quality]: The task is harder when the proportion of dots that move coherently (the level of \index{Coherence} *coherence*) is lower; see Figure \@ref(fig:globalmotionpng).
(ref:glob) Three levels of difficulty of the global motion detection task. The figures show a consistent movement to the right with three levels of coherence (10%, 50%, and 100%). The subjects see the dots moving in the direction indicated by the arrows. The subjects do not see the arrows and all the dots look identical in the actual task. Adapted from @DingEtAl; licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).
```{r globalmotionpng, fig.cap = "(ref:glob)", out.width = "100%", echo = FALSE, fig.align = "center"}
knitr::include_graphics("cc_figure/globalmotion_lr.PNG", dpi =1000)
```
Ollman's [-@Ollman1966] fast-guess model assumes that the behavior in this task (and in any other choice task) is governed by two distinct cognitive processes: (i) a \index{Guessing mode} guessing mode, and (ii) a \index{Task-engaged mode} task-engaged mode. In the guessing mode, responses are fast and accuracy is at chance level. In the task-engaged mode, responses are slower and accuracy approaches 100%. This means that intermediate values of response times and accuracy can only be achieved by \index{Mixing responses} mixing responses from the two modes. Further assumptions of this model are that response times depend on the difficulty of the choice, and that the probability of being in one of the two modes depends on the speed incentives given in the instructions.
To simplify matters, we ignore the possibility that the accuracy of the choice is also affected by the difficulty of the choice. Also, we ignore the possibility that subjects might be biased to one specific response in the guessing mode, but see exercise \@ref(exr:mixbias) in the online materials.
#### The data set
We implement the assumptions behind Ollman's fast-guess model and examine its fit to data of a global motion detection task from @Dutilh2019quality.
The data set from @Dutilh2019quality contains approximately 2800 trials from each of the 20 subjects participating in a global motion detection task and can be found in `df_dots` in the `bcogsci` package. There were two levels of coherence, yielding hard and easy trials (`diff`), and the trials were done under instructions that emphasized either accuracy or speed (`emphasis`). More information about the data set can be found by accessing the documentation for the data set (by typing `?df_dots` on the `R` command line, assuming that the `bcogsci` package is installed).
```{r, message = FALSE}
data("df_dots")
df_dots
```
We might think that if the fast-guess model were true, we would see a \index{Bimodal distribution} bimodal distribution when we plot a histogram of the data. Unfortunately, when two similar distributions are mixed, we won't necessarily see any apparent bimodality; see Figure \@ref(fig:dfdots).
(ref:dfdots) Distribution of response times in the data of the global motion detection task in @Dutilh2019quality.
```{r dfdots, message = FALSE, fig.cap = "(ref:dfdots)", fold = TRUE, fig.height = 2.2}
ggplot(df_dots, aes(rt)) +
geom_histogram()
```
However, Figure \@ref(fig:dfaccrt) reveals that incorrect responses were generally faster, and this was especially true when the instructions emphasized accuracy.
(ref:dfaccrt) The distribution of response times by \index{Accuracy} accuracy in the data of the global motion detection task in @Dutilh2019quality.
```{r dfaccrt, fig.cap = "(ref:dfaccrt)", fold = TRUE}
ggplot(df_dots, aes(x = factor(acc), y = rt)) +
geom_point(position = position_jitter(width = .4, height = 0),
alpha = 0.5) +
facet_wrap(diff ~ emphasis) +
xlab("Accuracy") +
ylab("Response time")
```
### A very simple implementation of the fast-guess model {#sec-simplefastguess}
The description of the model makes it clear that an ideal subject who never guesses has a response time that depends only on the difficulty of the trial. As we did in previous chapters, we assume that response times are log-normally distributed, and for simplicity we start by modeling the behavior of a single subject:
\begin{equation}
rt_n \sim \mathit{LogNormal}(\alpha + \beta \cdot x_n, \sigma)
\end{equation}
In the previous equation, $x$ is larger for difficult trials. If we center $x$, $\alpha$ represents the average log-transformed response time for a subject engaged in the task, and $\beta$ is the effect of \index{Trial difficulty} trial difficulty on log-response time. We assume a non-deterministic process, with a noise parameter $\sigma$. Also see the online section \@ref(app-lognormal) for more information about log-normally distributed response times.
Alternatively, a subject that guesses in every trial would show a response time distribution that is independent of the difficulty of the trial:
\begin{equation}
rt_n \sim \mathit{LogNormal}(\gamma, \sigma_2)
\end{equation}
Here $\gamma$ represents the average log-transformed response time when a subject only guesses. We assume that responses from the guessing mode might have a different noise component than those from the task-engaged mode.
The fast-guess model makes the assumption that during a task, a single subject would behave in these two ways: They would be engaged in the task on a proportion of the trials and would guess on the rest of the trials. This means that for a single subject, there is an underlying probability of being engaged in the task, $p_{task}$, that determines whether they are actually choosing ($z=1$) or guessing ($z=0$):
\begin{equation}
z_n \sim \mathit{Bernoulli}(p_{task})
\end{equation}
The value of the parameter $z$ in every trial determines the behavior of the subject. This means that the distribution that we observe is a mixture of the two distributions presented before:
\begin{equation}
rt_n \sim
\begin{cases}
\mathit{LogNormal}(\alpha + \beta \cdot x_n, \sigma), & \text{ if } z_n =1 \\
\mathit{LogNormal}(\gamma, \sigma_2), & \text{ if } z_n=0
\end{cases}
(\#eq:dismix)
\end{equation}
In order to have a Bayesian implementation, we also need to define some priors. We use priors that encode what we know about response time experiments. These priors are slightly more informative than the ones that we used in section \@ref(sec-trial), but they still can be considered \index{Regularizing prior} regularizing priors. One can verify this by performing prior predictive checks. As we increase the complexity of our models, it's worth spending some time designing more realistic priors. These will speed up computation and in some cases they will be crucial for solving convergence problems.
\begin{equation}
\begin{aligned}
\alpha &\sim \mathit{Normal}(6, 1)\\
\beta &\sim \mathit{Normal}(0, 0.1)\\
\sigma &\sim \mathit{Normal}_+(0.5, 0.2)
\end{aligned}
\end{equation}
\begin{equation}
\begin{aligned}
\gamma &\sim \mathit{Normal}(6, 1)\\
\sigma_2 &\sim \mathit{Normal}_+(0.5, 0.2)
\end{aligned}
\end{equation}
For now, we give all possible values of the probability of a task-engaged response equal likelihood; we achieve this by setting the following prior on $p_{task}$:
\begin{equation}
p_{task} \sim \mathit{Beta}(1, 1)
\end{equation}
This represents a flat, \index{Uninformative prior} uninformative prior over the probability parameter $p_{task}$.
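As a quick illustration of the prior predictive checks mentioned above, the following sketch draws parameter values from these priors in `R` and generates one response time per draw (for a hard trial, $x = 0.5$). This is only a rough check done outside of Stan; the $\mathit{Normal}_+$ priors are approximated here by taking absolute values of normal draws, which is harmless because these priors place negligible mass below zero.
```{r}
n_sim <- 1000
# Draws from the priors:
alpha_prior <- rnorm(n_sim, 6, 1)
beta_prior <- rnorm(n_sim, 0, 0.1)
sigma_prior <- abs(rnorm(n_sim, 0.5, 0.2))
gamma_prior <- rnorm(n_sim, 6, 1)
sigma2_prior <- abs(rnorm(n_sim, 0.5, 0.2))
p_task_prior <- rbeta(n_sim, 1, 1)
# One prior predictive response time per draw (hard trial, x = 0.5):
z_prior <- rbern(n_sim, prob = p_task_prior)
rt_prior <- ifelse(z_prior == 1,
                   rlnorm(n_sim,
                          meanlog = alpha_prior + beta_prior * 0.5,
                          sdlog = sigma_prior),
                   rlnorm(n_sim,
                          meanlog = gamma_prior,
                          sdlog = sigma2_prior))
quantile(rt_prior, c(0.025, 0.5, 0.975))
```
Most of the prior predictive response times should fall roughly between tens of milliseconds and a few seconds, which is wide but not unreasonable for regularizing priors.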
Before we fit our model to the real data, we generate \index{Synthetic data} synthetic data to make sure that our model is working as expected.
We first define the number of observations, predictors, and fixed point values for each of the parameters. We assume $1000$ observations and two levels of difficulty, $x$, coded $-0.5$ (easy) and $0.5$ (hard). The point values chosen for the parameters are relatively realistic (based on our previous experience with response time experiments). Although the priors try to encode the range of possible values for the parameters, in this simulation we assume only one instance from this possible range:
```{r, echo = FALSE}
set.seed(123)
```
```{r}
N <- 1000
# level of difficulty:
x <- c(rep(-0.5, N/2), rep(0.5, N/2))
# Parameters true point values:
alpha <- 5.8
beta <- 0.05
sigma <- 0.4
sigma2 <- 0.5
gamma <- 5.2
p_task <- 0.8
# Median time
c("engaged" = exp(alpha), "guessing" = exp(gamma))
```
To generate a mixture of response times, we use the latent class indicator `z`.
```{r}
z <- rbern(n = N, prob = p_task)
rt <- if_else(z == 1,
rlnorm(N,
meanlog = alpha + beta * x,
sdlog = sigma),
rlnorm(N,
meanlog = gamma,
sdlog = sigma2))
df_dots_simdata1 <- tibble(trial = 1:N, x = x, rt = rt)
```
We verify that our simulated data is realistic, that is, it's in the same range as the original data; see Figure \@ref(fig:simmix).
(ref:simmix) Response times in the simulated data (`df_dots_simdata1`) that follows the fast-guess model.
```{r simmix, fig.cap = "(ref:simmix)", message = FALSE, fig.height = 2.2}
ggplot(df_dots_simdata1, aes(rt)) +
geom_histogram()
```
To implement the mixture model defined in Equation \@ref(eq:dismix) in Stan, the discrete parameter $z$ needs to be marginalized out:
\begin{equation}
\begin{aligned}
p(rt_n | \Theta) &= p_{task} \cdot \mathit{LogNormal}(rt_n | \alpha + \beta \cdot x_n, \sigma) +\\
& (1 - p_{task}) \cdot \mathit{LogNormal}(rt_n | \gamma, \sigma_2)
\end{aligned}
\end{equation}
In addition, Stan requires the likelihood to be defined in log-space:
\begin{equation}
\begin{aligned}
\log(p(rt_n | \Theta)) &= \log(p_{task} \cdot \mathit{LogNormal}(rt_n | \alpha + \beta \cdot x_n, \sigma) +\\
& (1 - p_{task}) \cdot \mathit{LogNormal}(rt_n | \gamma, \sigma_2))
\end{aligned}
\end{equation}
```{r mixtures-stan, echo = FALSE}
mixture_rt <- system.file("stan_models", "mixture_rt.stan", package = "bcogsci")
mixture_rt2 <- system.file("stan_models", "mixture_rt2.stan", package = "bcogsci")
mixture_rtacc <- system.file("stan_models", "mixture_rtacc.stan", package = "bcogsci")
mixture_rtacc2 <- system.file("stan_models", "mixture_rtacc2.stan", package = "bcogsci")
mixture_h <- system.file("stan_models", "mixture_h.stan", package = "bcogsci")
mixture_h2_gen <- system.file("stan_models", "mixture_h2_gen.stan", package = "bcogsci")
```
A "naive" implementation in Stan would look like the following (recall that `_lpdf` functions provide log-transformed densities):
```
target += log(
p_task * exp(lognormal_lpdf(rt[n] | alpha + x[n] * beta, sigma)) +
(1-p_task) * exp(lognormal_lpdf(rt[n] | gamma, sigma2)));
```
However, we need to take into account that $\log(A \pm B)$ can be \index{Numerical instability} numerically unstable [i.e., prone to \index{Underflow} underflow/\index{Overflow} overflow, see @BlanchardEtAl2020]. Stan provides several functions to deal with different special cases of logarithms of sums and differences. Here we need \index{\texttt{log\_sum\_exp}} `log_sum_exp(x, y)`, which corresponds to `log(exp(x) + exp(y))`, and \index{\texttt{log1m}} `log1m(x)`, which corresponds to `log(1-x)`.
First, we need to take into account that the first summand of the logarithm, `p_task * exp(lognormal_lpdf(rt[n] | alpha + x[n] * beta, sigma))` corresponds to `exp(x)`, and the second one, `(1-p_task) * exp(lognormal_lpdf(rt[n] | gamma, sigma2))` to `exp(y)` in `log_sum_exp(x, y)`. This means that we need to first apply the logarithm to each of them to use them as arguments of `log_sum_exp(x, y)`:
```
target += log_sum_exp(
log(p_task) + lognormal_lpdf(rt[n] | alpha +
x[n] * beta, sigma),
log(1-p_task) + lognormal_lpdf(rt[n] | gamma, sigma2));
```
Now we can just replace `log(1-p_task)` by the more stable `log1m(p_task)`:
```
target += log_sum_exp(
log(p_task) + lognormal_lpdf(rt[n] | alpha +
x[n] * beta, sigma),
log1m(p_task) + lognormal_lpdf(rt[n] | gamma, sigma2));
```
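The following `R` sketch (not part of the model) illustrates why this matters: with two log-densities that are far out in the tails, the naive computation underflows, whereas factoring out the maximum before exponentiating, which is essentially what `log_sum_exp()` computes, keeps the result finite.
```{r}
log_x <- -800
log_y <- -801
# Naive computation: exp() underflows to 0 and the log returns -Inf
log(exp(log_x) + exp(log_y))
# log-sum-exp trick: factor out the maximum before exponentiating
m <- max(log_x, log_y)
m + log(exp(log_x - m) + exp(log_y - m))
```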
The complete model (`mixture_rt.stan`) is shown below:
```{stan output.var = "mix1_stan", code = readLines(mixture_rt), tidy = TRUE, comment="", eval = FALSE, cache = FALSE, cache.lazy = FALSE}
```
Call the Stan model `mixture_rt.stan`, and fit it to the simulated data. First, we set up the simulated data as a list structure:
```{r}
ls_dots_simdata <- list(N = N, rt = rt, x = x)
```
Then fit the model:
```{r fitmixture_rt, results = "hide", message = "FALSE", eval = !file.exists("dataR/fit_mix_rt.RDS"), warning = FALSE}
mixture_rt <- system.file("stan_models",
"mixture_rt.stan",
package = "bcogsci")
fit_mix_rt <- stan(mixture_rt, data = ls_dots_simdata)
```
```{r, echo= FALSE}
if(!file.exists("dataR/fit_mix_rt.RDS")){
# hardcoded warnings, because last.warning doesn't work inside knitr:
# saveRDS(names(last.warning),"dataR/fit_mix_rt_w.RDS")
saveRDS(fit_mix_rt, "dataR/fit_mix_rt.RDS")
} else {
fit_mix_rt <- readRDS("dataR/fit_mix_rt.RDS")
# wrn <- readRDS("dataR/fit_mix_rt_w.RDS")
wrn <- c("The largest R-hat is 1.74, indicating chains have not mixed.\nRunning the chains for more iterations may help. See\nhttps://mc-stan.org/misc/warnings.html#r-hat",
"Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.\nRunning the chains for more iterations may help. See\nhttps://mc-stan.org/misc/warnings.html#bulk-ess",
"Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.\nRunning the chains for more iterations may help. See\nhttps://mc-stan.org/misc/warnings.html#tail-ess"
)
print_warning(wrn)
}
```
There are a lot of warnings, the Rhats are too large, and the number of effective samples is too low:
```{r}
print(fit_mix_rt)
```
The traceplots show clearly that the chains aren't mixing; see Figure \@ref(fig:traceplotsmix1).
(ref:traceplotsmix1) Traceplots from the model `mixture_rt.stan` fit to simulated data.
```{r traceplotsmix1, fig.cap = "(ref:traceplotsmix1)", fig.height = 3.5}
traceplot(fit_mix_rt) +
scale_x_continuous(breaks = NULL)
```
The problem with this model is that the \index{Mixture component} mixture components (i.e., the fast-guessing mode and the task-engaged mode) are underlyingly \index{Exchangeable} exchangeable, and thus the posterior is \index{Multimodal} multimodal and the model does not converge. Each chain (and each iteration) doesn't know how the components were identified by the rest of the chains (e.g., in some chains and iterations, $z=0$ corresponds to fast guesses, whereas in other cases, it corresponds to deliberate responses). However, we do have information that can identify the components: According to the theoretical model, we know that the average response in the engaged mode, represented by $\alpha$, should be slower than the average response in the guessing mode, $\gamma$.
Even though the theoretical model assumes that guesses are faster than engaged responses, this is not explicit in our computational model. That is, our model lacks some of the theoretical information that we have, namely that the distribution of engaged \index{Response time} response times should be slower than the distribution of guessing times. This can be encoded with \index{Order constraints} \index{inequality constraints} *order constraints* or *inequality constraints*: a \index{Strong prior} strong prior for $\gamma$ whose distribution is truncated from above at the value of $\alpha$:
\begin{equation}
\gamma \sim \mathit{Normal}(6, 1), \text{for } \gamma < \alpha
\end{equation}
This would be enough to make the model converge.
Another, softer constraint that we could add to our implementation is the assumption that subjects are generally more likely to be trying to do the task than just guessing. If this assumption is correct, we also improve the accuracy of our estimation of the posterior of the model. (The opposite is also true: If subjects are not trying to do the task, this assumption will be unwarranted and our prior information will lead us further from the "true" values of the parameters.) The following prior concentrates most of the probability density on values close to $1$.
\begin{equation}
p_{task} \sim \mathit{Beta}(8, 2)
\end{equation}
Plotting this prior confirms where most of the probability mass lies; see Figure \@ref(fig:dbeta).
(ref:dbeta) A density plot for the $\mathit{Beta}(8,2)$ prior on $p_{task}$.
```{r dbeta, fold = TRUE, fig.cap = "(ref:dbeta)", fig.height = 3.2}
plot(function(x) dbeta(x, 8, 2), ylab = "density" , xlab= "probability")
```
The Stan code for this model is shown below as `mixture_rt2.stan`.
```{stan output.var = "mix2_stan", code = readLines(mixture_rt2), tidy = TRUE, comment="", eval = FALSE, cache = FALSE, cache.lazy = FALSE}
```
Once we change the upper bound of `gamma` in the `parameters` block, we also need to truncate the distribution in the `model` block by correcting the PDF with its CDF. This correction uses the CDF because we are truncating the distribution on the right-hand side (recall that earlier we used the complement of the CDF when we truncated a distribution on the left-hand side); see the online section \@ref(app-truncation).
```
target += normal_lpdf(gamma | 6, 1) -
normal_lcdf(alpha | 6, 1);
```
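As a quick sanity check on this correction (a sketch with an arbitrary truncation point standing in for `alpha`), we can verify in `R` that dividing the normal density by its CDF at the truncation point yields a proper density on $(-\infty, \alpha)$; the subtraction in the Stan code above is the same operation on the log scale.
```{r}
alpha_trunc <- 5.9  # an arbitrary truncation point for illustration
dens_trunc <- function(g) dnorm(g, 6, 1) / pnorm(alpha_trunc, 6, 1)
# The truncated density integrates to (approximately) 1:
integrate(dens_trunc, lower = -Inf, upper = alpha_trunc)$value
```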
Fit this model (call it `mixture_rt2.stan`) to the same simulated data set that we used before:
```{r fitmixture_rt2, results = "hide", message = "FALSE", eval = !file.exists("dataR/fit_mix_rt2.RDS")}
mixture_rt2 <- system.file("stan_models",
"mixture_rt2.stan",
package = "bcogsci")
fit_mix_rt2 <- stan(mixture_rt2, data = ls_dots_simdata)
```
```{r, echo= FALSE}
if(!file.exists("dataR/fit_mix_rt2.RDS")){
saveRDS(fit_mix_rt2,"dataR/fit_mix_rt2.RDS")
} else {
fit_mix_rt2 <- readRDS("dataR/fit_mix_rt2.RDS")
}
```
Now the summaries and traceplots look fine; see Figure \@ref(fig:traceplotsmix2).
(ref:traceplotsmix2) Traceplots from the model `mixture_rt2.stan` fit to simulated data.
```{r traceplotsmix2, fig.cap = "(ref:traceplotsmix2)", fig.height = 3.5}
print(fit_mix_rt2)
traceplot(fit_mix_rt2) +
scale_x_continuous(breaks = NULL)
```
### A \index{Multivariate implementation} multivariate implementation of the fast-guess model {#sec-multmix}
A problem with the previous implementation of the fast-guess model is that we ignore the accuracy information in the data. We can implement a version that is closer to the verbal description of the model:
In particular, we also want to model the fact that accuracy is at chance level in the fast-guessing mode and that accuracy approaches 100% during the task-engaged mode.
This means that the same latent indicator now governs two distributions, one for response times and one for accuracy:
\begin{equation}
z_n \sim \mathit{Bernoulli}(p_{task})
\end{equation}
The response time distribution
\begin{equation}
rt_n \sim
\begin{cases}
\mathit{LogNormal}(\alpha + \beta \cdot x_n, \sigma), & \text{ if } z_n =1 \\
\mathit{LogNormal}(\gamma, \sigma_2), & \text{ if } z_n=0
\end{cases}
(\#eq:dismix2)
\end{equation}
and an accuracy distribution
\begin{equation}
acc_n \sim
\begin{cases}
\mathit{Bernoulli}(p_{correct}), & \text{ if } z_n =1 \\
\mathit{Bernoulli}(0.5), & \text{ if } z_n=0
\end{cases}
(\#eq:dismix3)
\end{equation}
We have a new parameter $p_{correct}$, which represents the probability of giving a correct answer in the engaged mode. The verbal description says that it is close to 100%, and here we have the freedom to choose whatever prior we believe represents values that are close to 100% accuracy. We translate this belief into a prior as follows; our prior choice is relatively informative but does not impose a hard constraint: if a subject consistently shows relatively low (or high) accuracy, $p_{correct}$ will change accordingly:
\begin{equation}
p_{correct} \sim \mathit{Beta}(995, 5)
\end{equation}
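A quick check in `R` shows where this prior places its mass; the central 95% interval of $\mathit{Beta}(995, 5)$ lies just below 100% accuracy:
```{r}
# Mean and central 95% interval of the Beta(995, 5) prior:
995 / (995 + 5)
qbeta(c(0.025, 0.975), shape1 = 995, shape2 = 5)
```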
In our simulated data, we assume that the global motion detection task is done by a very accurate subject, with an accuracy of 99.9%.
First, simulate response times, as done earlier:
```{r}
N <- 1000
x <- c(rep(-0.5, N/2), rep(0.5, N/2)) # difficulty
alpha <- 5.8
beta <- 0.05
sigma <- 0.4
sigma2 <- 0.5
gamma <- 5.2 # fast guess location
p_task <- 0.8 # prob of being on task
z <- rbern(n = N, prob = p_task)
rt <- if_else(z == 1,
rlnorm(N,
meanlog = alpha + beta * x,
sdlog = sigma),
rlnorm(N,
meanlog = gamma,
sdlog = sigma2))
```
Simulate \index{Accuracy} accuracy and include both response times and accuracy in the simulated data set:
```{r}
p_correct <- 0.999
acc <- ifelse(z, rbern(n = N, p_correct),
rbern(n = N, 0.5))
df_dots_simdata3 <- tibble(trial = 1:N,
x = x,
rt = rt,
acc = acc) %>%
mutate(diff = if_else(x == 0.5, "hard", "easy"))
```
Plot the simulated data in Figure \@ref(fig:simdata3). This time we can see the effect of task difficulty on the simulated response times and accuracy:
(ref:simdata3) Response times by accuracy, accounting for task difficulty in the simulated data (`df_dots_simdata3`) that follows the fast-guess model.
```{r simdata3, message = FALSE, fig.cap="(ref:simdata3)", fig.height = 2.2}
ggplot(df_dots_simdata3, aes(x = factor(acc), y = rt)) +
geom_point(position = position_jitter(width = 0.4, height = 0),
alpha = 0.5) +
facet_wrap(diff ~ .) +
xlab("Accuracy") +
ylab("Response time")
```
Next, we need to marginalize out the discrete parameter from the joint distribution of accuracy and response times. To do this, we assume that, conditional on the latent indicator parameter $z$, response times and accuracy are independent. For this reason, we can multiply the likelihoods of `rt` and `acc` within each component.
\begin{equation}
\begin{aligned}
p(rt_n, acc_n | \Theta) = & p_{task} \cdot \\
& \mathit{LogNormal}(rt_n | \alpha + \beta \cdot x_n, \sigma) \cdot \\
& \mathit{Bernoulli}(acc_n | p_{correct}) \\
& +\\
& (1 - p_{task}) \cdot \\
& \mathit{LogNormal}(rt_n | \gamma, \sigma_2) \cdot\\
& \mathit{Bernoulli}(acc_n | 0.5)
\end{aligned}
\end{equation}
In log-space:
\begin{equation}
\begin{aligned}
\log(p(rt_n, acc_n | \Theta)) = \log(\exp(&\\
& \log(p_{task}) +\\
&\log(\mathit{LogNormal}(rt_n | \alpha + \beta \cdot x_n, \sigma)) + \\
&\log(\mathit{Bernoulli}(acc_n | p_{correct})))\\
+&\\
\exp(&\\
& \log(1 - p_{task}) + \\
& \log(\mathit{LogNormal}(rt_n |\gamma, \sigma_2)) + \\
& \log(\mathit{Bernoulli}(acc_n | 0.5)))\\
)& \\
\end{aligned}
\end{equation}
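Before turning to the Stan code, here is a minimal `R` sketch of this computation for a single (hypothetical) observation, using the point values defined in the simulation above. Conditional on the component, the log-densities of the response time and the accuracy simply add, and the two components are then combined with the log-sum-exp trick introduced earlier:
```{r}
rt_1 <- 400; acc_1 <- 1; x_1 <- 0.5  # one hypothetical observation
log_engaged <- log(p_task) +
  dlnorm(rt_1, meanlog = alpha + beta * x_1, sdlog = sigma, log = TRUE) +
  dbinom(acc_1, size = 1, prob = p_correct, log = TRUE)
log_guess <- log(1 - p_task) +
  dlnorm(rt_1, meanlog = gamma, sdlog = sigma2, log = TRUE) +
  dbinom(acc_1, size = 1, prob = 0.5, log = TRUE)
# Marginalized log-likelihood via log-sum-exp:
m <- max(log_engaged, log_guess)
m + log(exp(log_engaged - m) + exp(log_guess - m))
```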
Our model translates to the following Stan code (`mixture_rtacc.stan`):
```{stan output.var = "mix3_stan", code = readLines(mixture_rtacc), tidy = TRUE, comment="", eval = FALSE, cache = FALSE, cache.lazy = FALSE}
```
Next, set up the data in list format:
```{r}
ls_dots_simdata <- list(N = N,
rt = rt,
x = x,
acc = acc)
```
Then fit the model:
```{r fitmixture_rtacc, results = "hide", message = "FALSE", eval = !file.exists("dataR/fit_mix_rtacc.RDS")}
mixture_rtacc <- system.file("stan_models",
"mixture_rtacc.stan",
package = "bcogsci")
fit_mix_rtacc <- stan(mixture_rtacc, data = ls_dots_simdata)
```
```{r, echo= FALSE}
if(!file.exists("dataR/fit_mix_rtacc.RDS")){
saveRDS(fit_mix_rtacc,"dataR/fit_mix_rtacc.RDS")
} else {
fit_mix_rtacc <- readRDS("dataR/fit_mix_rtacc.RDS")
}
```
We see that our model can be fit to both response times and accuracy, and its parameter estimates have sensible values (given the fixed parameters we used to generate our simulated data).
```{r}
print(fit_mix_rtacc)
```
We will evaluate the recovery of the parameters more carefully when we deal with the hierarchical version of the fast-guess model in section \@ref(sec-fastguessh). Before we extend this model hierarchically, let us also take into account the instructions given to the subjects.
### An implementation of the fast-guess model that takes instructions into account
The actual global motion detection experiment that we started with has another manipulation that can help us evaluate the fast-guess model better. In some trials, the instructions emphasized accuracy (e.g., "Be as accurate as possible.") and in others speed (e.g., "Be as fast as possible."). The fast-guess model also assumes that the probability of being in one of the two modes depends on the speed incentives given in the instructions. This entails that now $p_{task}$ depends on the instructions $x_2$, where we encode a speed incentive with $-0.5$ and an accuracy incentive with $0.5$. Essentially, we need to fit the following regression:
\begin{equation}
\alpha_{task} + x_2 \cdot \beta_{task}
\end{equation}
As we did with MPT models in the previous chapter (in section \@ref(sec-MPT-reg)), we need to bound the previous regression between 0 and 1; we achieve this using the \index{Logit} logistic or \index{Inverse logit} inverse logit function:
\begin{equation}
p_{task} = logit^{-1}(\alpha_{task} + x_2 \cdot \beta_{task})
\end{equation}
This means that we need to interpret $\alpha_{task} + x_2 \cdot \beta_{task}$ in \index{Log-odds space} log-odds space, which has the range $(-\infty, \infty)$ rather than the probability space $[0,1]$; also see section \@ref(sec-MPT-reg).
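In `R`, the inverse logit is `plogis()` and the logit is `qlogis()`. For example, with illustrative (hypothetical) values $\alpha_{task} = 1.7$ and $\beta_{task} = 0.5$, the implied probabilities of being task-engaged under each emphasis condition would be:
```{r}
# Illustrative values (not estimates): alpha_task = 1.7, beta_task = 0.5
plogis(1.7 + c(speed = -0.5, accuracy = 0.5) * 0.5)
```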
The likelihood (defined before in section \@ref(sec-multmix)) now depends on the value of $x_{2}$ for the specific row:
\begin{equation}
z_n \sim \mathit{Bernoulli}(p_{{task}_n})
\end{equation}
A response time distribution is defined:
\begin{equation}
rt_n \sim
\begin{cases}
\mathit{LogNormal}(\alpha + \beta \cdot x_n, \sigma), & \text{ if } z_n =1 \\
\mathit{LogNormal}(\gamma, \sigma_2), & \text{ if } z_n=0
\end{cases}
\end{equation}
and an accuracy distribution is defined as well:
\begin{equation}
acc_n \sim
\begin{cases}
\mathit{Bernoulli}(p_{correct}), & \text{ if } z_n =1 \\
\mathit{Bernoulli}(0.5), & \text{ if } z_n=0
\end{cases}
\end{equation}
The only further change in our model is that rather than a prior on $p_{task}$, we now need priors for $\alpha_{task}$ and $\beta_{task}$, which are on the log-odds scale.
For $\beta_{task}$, we assume an effect that can be rather large and we won't assume a direction a priori (for now):
\begin{equation}
\beta_{task} \sim \mathit{Normal}(0, 1)
\end{equation}
This means that the subject could be affected by the instructions in the expected way, with an increased probability to be task-engaged, leading to better accuracy when the instructions emphasize accuracy ($\beta_{task} >0$). Alternatively, the subject might behave in an unexpected way, with a decreased probability to be task-engaged, leading to worse accuracy when the instructions emphasize accuracy ($\beta_{task} <0$). The latter situation, $\beta_{task} <0$, could represent the instructions being misunderstood. It's certainly possible to include priors that encode the expected direction of the effect instead: $\mathit{Normal}_{+}(0,1)$. Unless there is a compelling reason to constrain the prior in this way, following Cromwell's rule (Box \@ref(thm:cromwell) in the online chapter \@ref(ch-priors)), we leave open the possibility of the $\beta$ parameter having negative values.
How can we choose a prior for $\alpha_{task}$ that encodes the same information that we had in the previous model in $p_{task}$? One possibility is to create an auxiliary parameter $p_{btask}$, which represents the baseline probability of being engaged in the task, with the same prior that we used in the previous section, and then transform it to an unconstrained space for our regression with the logit function:
\begin{equation}
\begin{aligned}
&p_{btask} \sim \mathit{Beta}(8, 2)\\
&\alpha_{task} = logit(p_{btask})
\end{aligned}
\end{equation}
To verify that our priors make sense, in Figure \@ref(fig:emphasis) we plot the difference in prior predicted probability of being engaged in the task under the two emphasis conditions:
```{r}
Ns <- 1000 # number of samples for the plot
# Priors
p_btask <- rbeta(n = Ns, shape1 = 8, shape2 = 2)
beta_task <- rnorm(n = Ns, mean = 0, sd = 1)
# Predicted probability of being engaged under the
# accuracy emphasis (x2 = 0.5) and the speed emphasis (x2 = -0.5):
p_task_acc <- plogis(qlogis(p_btask) + 0.5 * beta_task)
p_task_speed <- plogis(qlogis(p_btask) + -0.5 * beta_task)
# Predicted difference
diff_p_pred <- tibble(diff = p_task_acc - p_task_speed)
```
(ref:emphasis) The difference in prior predicted probability of being engaged in the task under the two emphasis conditions for the simulated data (`diff_p_pred`) that follows the fast-guess model.
```{r emphasis, fig.cap ="(ref:emphasis)", message = FALSE, fig.height = 2.2}
diff_p_pred %>%
ggplot(aes(diff)) +
geom_histogram()
```
Figure \@ref(fig:emphasis) shows that we are predicting a priori that the difference in $p_{task}$ will tend to be smaller than $\pm 0.3$, which seems to make sense intuitively. If we had more information about the likely range of variation, we could of course have adapted the prior to reflect that belief.
We are ready to generate a new data set, by deciding on some fixed values for $\beta_{task}$ and $p_{btask}$:
```{r, echo = FALSE}
set.seed(123)
```
```{r}
N <- 1000
x <- c(rep(-0.5, N/2), rep(0.5, N/2)) # difficulty
x2 <- rep(c(-0.5, 0.5), N/2) # instructions
# Verify that the predictors are crossed in a 2x2 design :
predictors <- tibble(x, x2)
xtabs(~ x + x2, predictors)
alpha <- 5.8
beta <- 0.05
sigma <- 0.4
sigma2 <- 0.5
gamma <- 5.2
p_correct <- 0.999
# New true point values:
p_btask <- 0.85
beta_task <- 0.5
# Generate data:
alpha_task <- qlogis(p_btask)
p_task <- plogis(alpha_task + x2 * beta_task)
z <- rbern(N, prob = p_task)
rt <- ifelse(z,
rlnorm(N, meanlog = alpha + beta * x, sdlog = sigma),
rlnorm(N, meanlog = gamma, sdlog = sigma2))
acc <- ifelse(z, rbern(N, p_correct),
rbern(N, 0.5))
df_dots_simdata4 <- tibble(trial = 1:N,
x = x,
rt = rt,
acc = acc,
x2 = x2) %>%
mutate(diff = if_else(x == 0.5, "hard", "easy"),
emphasis = ifelse(x2 == 0.5, "accuracy", "speed"))
```
We can generate a plot now where both the difficulty of the task and the instructions are manipulated; see Figure \@ref(fig:taskinstr).
(ref:taskinstr) Response times and accuracy by the difficulty of the task and the instructions type for the simulated data (`df_dots_simdata4`) that follows the fast-guess model.
```{r taskinstr, fig.cap ="(ref:taskinstr)", message = FALSE, fig.height = 3.5}
ggplot(df_dots_simdata4, aes(x = factor(acc), y = rt)) +
geom_point(position = position_jitter(width = .4, height = 0),
alpha = 0.5) +
facet_wrap(diff ~ emphasis) +
xlab("Accuracy") +
ylab("Response time")
```
In the Stan implementation, \index{\texttt{log\_inv\_logit}} `log_inv_logit(x)` applies the logistic (or inverse logit) function to `x` to transform it into a probability and then applies the logarithm; \index{\texttt{log1m\_inv\_logit}} `log1m_inv_logit(x)` applies the logistic function to `x`, and then applies the logarithm to its complement $(1 - p)$. We do this because rather than having `p_task` in probability space, we have `lodds_task` in log-odds space:
```
real lodds_task = logit(p_btask) + x2[n] * beta_task;
```
The local variable `lodds_task` encodes the mixing probability in log-odds space:
```
target += log_sum_exp(log_inv_logit(lodds_task) + ...,
log1m_inv_logit(lodds_task) + ...)
```
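These two Stan helpers have direct `R` analogues, which can be used to check one's understanding (a sketch with an arbitrary log-odds value):
```{r}
lodds <- 1.5  # an arbitrary log-odds value
# log_inv_logit(lodds) corresponds to log(plogis(lodds)):
plogis(lodds, log.p = TRUE)
# log1m_inv_logit(lodds) corresponds to log(1 - plogis(lodds)):
plogis(lodds, lower.tail = FALSE, log.p = TRUE)
```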
We also add a \index{\texttt{generated quantities}} `generated quantities` block that can be used for further (prior or posterior) predictive checks. In this block, we do use `z` as an indicator of the latent class (task-engaged mode or fast-guessing mode), since we do not estimate `z`, but rather generate it based on the posteriors of the parameters.
We use the dummy variable \index{\texttt{onlyprior}} `onlyprior` to indicate whether we use the data or only sample from the priors. One can always do the predictive checks in `R`, by transforming the code that we wrote for the simulation into a function and writing the priors in `R`. However, it can be simpler to take advantage of Stan's output format and rewrite the code in Stan. One downside of this is that the `stanfit` object that stores the model output can become too large for the memory of the computer. Another downside is reduced robustness: it is more likely that we overlook an error if we only work in Stan rather than re-implementing the code in different programming languages (e.g., `R` and Stan) [@cooper2014implementations].
The code shown below is available in the `bcogsci` package and is called `mixture_rtacc2.stan`.
```{stan output.var = "mix4_stan", code = readLines(mixture_rtacc2), tidy = TRUE, comment="", eval = FALSE, cache = FALSE, cache.lazy = FALSE}
```
Before fitting the model to the simulated data, we perform prior predictive checks.
#### Prior predictive checks
Generate prior predictive distributions by setting `onlyprior` to `1`.
```{r, results = "hide", message = FALSE}
ls_dots_simdata <- list(N = N,
rt = rt,
x = x,
x2 = x2,
acc = acc,
onlyprior = 1)
mixture_rtacc2 <- system.file("stan_models",
"mixture_rtacc2.stan",
package = "bcogsci")
```
```{r fitmixture_rtacc2pp, results = "hide", message = "FALSE", eval = !file.exists("dataR/fit_mix_rtacc2_priors.RDS")}
fit_mix_rtacc2_priors <- stan(mixture_rtacc2,
data = ls_dots_simdata)
```
```{r, echo= FALSE}
if(!file.exists("dataR/fit_mix_rtacc2_priors.RDS")){
saveRDS(fit_mix_rtacc2_priors,"dataR/fit_mix_rtacc2_priors.RDS")
} else {
fit_mix_rtacc2_priors <- readRDS("dataR/fit_mix_rtacc2_priors.RDS")
}
```
We plot the prior predictive distributions of response times, shown in Figure \@ref(fig:ppdiff-ppcmix4)(a), by passing the predicted response times `rt_pred` to `ppd_dens_overlay()`.
```{r ppcmix4, message = FALSE, fig.keep = "none"}
rt_pred <- extract(fit_mix_rtacc2_priors)$rt_pred
ppd_dens_overlay(rt_pred[1:100,]) +
coord_cartesian(xlim = c(0, 2000))
```
Some of the predictive data sets contain responses that are too large, and some of them have too much probability mass close to zero, but there is nothing clearly wrong in the prior predictive distributions (considering that the model hasn't "seen" the data yet).
If we want to plot the prior predicted distribution of differences in response time conditional on task difficulty, we need to define a new function. Then we use the `bayesplot` function `ppd_stat()`, which takes any summary function as its `stat` argument; see Figure \@ref(fig:ppdiff-ppcmix4)(b).
```{r ppdiff, message = FALSE, fig.keep ="none"}
meanrt_diff <- function(rt){
mean(rt[x == 0.5]) -
mean(rt[x == -0.5])
}
ppd_stat(rt_pred, stat = meanrt_diff)
```
(ref:ppdiff-ppcmix4) (a) Prior predictive distributions of response times from the fast-guess model (`mixture_rtacc2.stan`). (b) Prior predictive distribution of response time differences, using the same model and prior settings, conditioned on task difficulty.
```{r ppdiff-ppcmix4, message = FALSE, fig.cap = "(ref:ppdiff-ppcmix4)", echo=FALSE, fig.show = "hold", out.width="49%", fig.width = 3.9, fig.height =3.2}
# a
rt_pred <- extract(fit_mix_rtacc2_priors)$rt_pred
ppd_dens_overlay(rt_pred[1:100,]) +
coord_cartesian(xlim = c(0, 2000)) +
ggtitle("(a) Prior predictive distribution\n")
# b
ppd_stat(rt_pred, stat = meanrt_diff) +
ggtitle("(b) Prior predictive distribution\nof differences in RTs") +
theme(legend.position="none")
```
```{r, echo = FALSE, eval = FALSE}
acc_pred <- extract(fit_mix_rtacc2_priors)$acc_pred
```
We find that the range of response times looks reasonable. There are, however, always more checks that can be done; examples are plotting other summary statistics, or predictions conditioned on other aspects of the data.
#### Fit to simulated data
Fit the model to the data by setting `onlyprior = 0`:
```{r}
ls_dots_simdata <- list(N = N,
rt = rt,
x = x,
x2 = x2,
acc = acc,
onlyprior = 0)
```
```{r fitmixture_rtacc2, results = "hide", message = "FALSE", eval = !file.exists("dataR/fit_mix_rtacc2.RDS")}
fit_mix_rtacc2 <- stan(mixture_rtacc2, data = ls_dots_simdata)
```
```{r, echo= FALSE}
if(!file.exists("dataR/fit_mix_rtacc2.RDS")){
saveRDS(fit_mix_rtacc2,"dataR/fit_mix_rtacc2.RDS")
} else {
fit_mix_rtacc2 <- readRDS("dataR/fit_mix_rtacc2.RDS")
}
```
```{r}
print(fit_mix_rtacc2,
pars = c("alpha", "beta", "sigma", "gamma", "sigma2",
"p_correct", "p_btask", "beta_task"))
```
We see that we can fit the model without problems. Before we evaluate the recovery of the parameters more carefully, we implement a hierarchical version of the fast-guess model.
### A \index{Hierarchical implementation} hierarchical implementation of the fast-guess model {#sec-fastguessh}
So far we have evaluated the behavior of one simulated subject.
We discussed before (in the context of distributional regression models in section \@ref(sec-distrmodel), and in the MPT modeling chapter \@ref(ch-MPT)) that, in principle, every parameter in a model can be made hierarchical. However, this doesn't guarantee that we'll learn anything from the data for those parameters, or that our model will converge. A safe approach here is to start simple, using simulated data. If a model converges on simulated data, this does not guarantee convergence on real data. However, if a model fails to converge on simulated data, it is very likely to fail with real data as well.
For our hierarchical version, we assume that both response times and the effect of task difficulty vary by subject, and that different subjects have different guessing times. This entails the following change to the response time distribution:
\begin{equation}
rt_n \sim
\begin{cases}
\mathit{LogNormal}(\alpha + u_{subj[n],1} + x_n \cdot (\beta + u_{subj[n], 2}), \sigma), & \text{ if } z_n =1 \\
\mathit{LogNormal}(\gamma + u_{subj[n], 3}, \sigma_2), & \text{ if } z_n=0
\end{cases}
\end{equation}
We assume that the three vectors of adjustments $u$ (the adjustments to the intercept and slope of the task-engaged distribution, and the adjustment to the guessing time distribution) follow a \index{Multivariate normal distribution} multivariate normal distribution centered on zero. For simplicity, and for lack of any prior knowledge about this experimental design and method, we assume the same (weakly informative) prior distribution for the three variance components and the same regularizing \index{LKJ prior} LKJ prior for the correlation matrix $\mathbf{R_u}$ that contains the three correlations between the adjustments ($\rho_{u_{1,2}}, \rho_{u_{1,3}}, \rho_{u_{2,3}}$):
\begin{equation}
\begin{aligned}
\boldsymbol{u} &\sim\mathcal{N}(0, \Sigma_u)\\
\tau_{u_{1}}, \tau_{u_{2}}, \tau_{u_{3}} & \sim \mathit{Normal}_+(0, 0.5)\\
\mathbf{R_u} &\sim \mathit{LKJcorr}(2)
\end{aligned}
\end{equation}
Before we fit the model to the real data set, we simulate data again; this time we simulate $20$ subjects, each of whom delivers a total of $100$ trials (each subject sees $25$ trials for each of the four conditions).
```{r fake-data-pars-mix, message = FALSE, tidy = FALSE}
# Build the simulated data set
N_subj <- 20
N_trials <- 100
# Parameters' true point values
alpha <- 5.8
beta <- 0.05
sigma <- 0.4
sigma2 <- 0.5
gamma <- 5.2
beta_task <- 0.1
p_btask <- 0.85
alpha_task <- qlogis(p_btask)
p_correct <- 0.999
tau_u <- c(0.2, 0.005, 0.3)
rho <- 0.3
```
```{r fake-data-mix, message = FALSE, tidy = FALSE}
## Build the data set here:
N <- N_subj * N_trials
stimuli <- tibble(x = rep(c(rep(-0.5, N_trials / 2),
rep(0.5, N_trials / 2)), N_subj),
x2 = rep(rep(c(-0.5, 0.5), N_trials / 2), N_subj),
subj = rep(1:N_subj, each = N_trials),
trial = rep(1:N_trials, N_subj))
stimuli
# Build the correlation matrix for the adjustments u:
Cor_u <- matrix(rep(rho, 9), nrow = 3)
diag(Cor_u) <- 1
Cor_u
# Variance covariance matrix for subjects:
Sigma_u <- diag(tau_u, 3, 3) %*% Cor_u %*% diag(tau_u, 3, 3)
# Create the correlated adjustments:
u <- mvrnorm(n = N_subj, c(0, 0, 0), Sigma_u)
# Check whether they are correctly correlated
# (There will be some random variation,
# but if you increase the number of observations,
# you'll be able to obtain more exact values below):
cor(u)
# Check the SDs:
sd(u[, 1]); sd(u[, 2]); sd(u[, 3])
# Create the simulated data:
df_dots_simdata <- stimuli %>%
mutate(z = rbern(N, prob = plogis(alpha_task + x2 * beta_task)),
rt = ifelse(z,
rlnorm(N, meanlog = alpha + u[subj, 1] +
(beta + u[subj, 2]) * x,
sdlog = sigma),
rlnorm(N, meanlog = gamma + u[subj, 3],
sdlog = sigma2)),
acc = ifelse(z, rbern(n = N, p_correct),
rbern(n = N, 0.5)),
diff = if_else(x == 0.5, "hard", "easy"),
emphasis = ifelse(x2 == 0.5, "accuracy", "speed"))
```
```{r, echo = FALSE}
simdata_dots <- function(N_subj = 20,
N_trials = 100,
# Parameters true point values:
alpha = 5.8,
beta = 0.05,
sigma = 0.4,
sigma2 = 0.5,
gamma = 5.2,
beta_task = 0.1,
p_btask = 0.8,
alpha_task = qlogis(p_btask),
p_correct = 0.995,
tau_u = c(0.2,0.005, 0.3),
rho = 0.3){
N <- N_subj * N_trials
stimuli <- tibble(x = rep(c(rep(-0.5,N_trials/2), rep(0.5, N_trials/2)), N_subj),
x2 = rep(rep(c(-0.5,0.5), N_trials/2), N_subj),
subj = rep(1:N_subj, each = N_trials),
trial = rep(1:N_trials, N_subj)
)
<<fake-data-mix>>
## # This matrix is symmetric, so it won't matter, but
## # be careful with how R arranges the matrix values
## Cor_u <- matrix(rep(rho, 9), nrow = 3)
## diag(Cor_u) <- 1
## Cor_u
## # Variance covariance matrix for 'subj':
## Sigma_u <- diag(tau_u,3,3) %*% Cor_u %*% diag(tau_u,3,3)
## # Create the correlated adjustments
## u <- mvrnorm(n = N_subj, c(0, 0, 0), Sigma_u)
## # Create the data
## stimuli %>%
## mutate(z = rbern(n = N, prob = plogis(alpha_task + x2 * beta_task)),
## rt = ifelse(z,
## rlnorm(n = N, meanlog = alpha + u[subj, 1] +
## (beta + u[subj, 2]) * x, sdlog = sigma),
## rlnorm(n = N, meanlog = gamma + u[subj, 3], sdlog = sigma2)),
## acc = ifelse(z, rbern(n = N, p_correct),
## rbern(n = N, 0.5)),
## diff = if_else(x == .5, "hard", "easy"),
## emphasis = ifelse(x2 == 0.5, "accuracy", "speed"))
df_dots_simdata
}
```
Verify that the distribution of the simulated response times conditional on the simulated accuracy and the experimental manipulations makes sense; see Figure \@ref(fig:simdatartacc).
(ref:simdatartacc) The distribution of response times conditional on the simulated accuracy and the experimental manipulations for the simulated hierarchical data (`df_dots_simdata`) that follows the fast-guess model.
```{r simdatartacc, fig.cap = "(ref:simdatartacc)",message = FALSE}
ggplot(df_dots_simdata, aes(x = factor(acc), y = rt)) +
geom_point(position = position_jitter(width = 0.4,
height = 0), alpha = 0.5) +
facet_wrap(diff ~ emphasis) +
xlab("Accuracy") +
ylab("Response time")
```
We implement the model in Stan as follows in `mixture_h.stan`. The hierarchical extension uses the Cholesky factorization for the group-level effects (as in section \@ref(sec-corrstan)).
```{stan output.var = "mixh_stan", code = readLines(mixture_h), tidy = TRUE, comment="", eval = FALSE, cache = FALSE, cache.lazy = FALSE}
```
Save the model code and fit it to the simulated data:
```{r fitmixture_hsim, results = "hide", message = "FALSE", eval = !file.exists("dataR/fit_mix_h.RDS")}
ls_dots_simdata <- list(N = N,
rt = df_dots_simdata$rt,
x = df_dots_simdata$x,
x2 = df_dots_simdata$x2,
acc = df_dots_simdata$acc,
subj = df_dots_simdata$subj,
N_subj = N_subj)
mixture_h <- system.file("stan_models",