<!-- 18-lognormalrace.Rmd -->
# A simple accumulator model to account for choice response time {#ch-lognormalrace}
\chaptermark{A simple accumulator model}
As mentioned in chapter \@ref(ch-mixture), the most popular class of cognitive-process models that can incorporate both response times and accuracy are \index{Sequential sampling model} sequential sampling models [for a review, see @Ratcliff2016]. This class of model includes, among others, the \index{Drift diffusion model} drift diffusion model [@Ratcliff1978], the \index{Linear ballistic accumulator} linear ballistic accumulator [@brownSimplestCompleteModel2008], and the \index{Log-normal race model} log-normal race model [@HeathcoteLove2012; @RouderEtAl2015]. We discuss the log-normal race model in the current chapter. Sequential sampling or \index{Evidence-accumulation model} evidence-accumulation models are based on the idea that decisions are made by gathering evidence from the environment (e.g., the computer screen in many experiments) until sufficient evidence is gathered and a threshold of evidence is reached. The log-normal race model seems to be the simplest sequential sampling model that can account for the joint distribution of \index{Response time} response times and response choice or \index{Accuracy} accuracy [@HeathcoteLove2012; @RouderEtAl2015].
This model belongs to the subclass of \index{Race model} *race models*, where the evidence for each response
grows gradually in time in separate \index{Racing accumulator} racing accumulators, until a threshold is reached. A
response is made when one of these \index{Accumulator} accumulators first reaches the threshold, winning the race against the other accumulators. This model is sometimes referred to as \index{Deterministic model} deterministic (or \index{Non-stochastic model} non-stochastic, or \index{Ballistic model} ballistic), since noise only affects the rate of accumulation of evidence before each race starts: once an accumulator starts accumulating evidence, its rate is fixed. This means that a given accumulator can be faster or slower in different trials (or between choices), but its rate of accumulation is fixed during a trial (or within choices). @brown2005ballistic claim that even though it is clear that a range of factors might cause within-choice noise, the behavioral effects might sometimes be small enough to ignore (in contrast to models such as the drift diffusion model, where both types of noise are present).
The three main advantages of the log-normal race model in comparison with other sequential sampling models are that (i) it is very simple, making it easy to extend hierarchically; (ii) it is relatively easy to avoid convergence issues; and (iii) it is straightforward to model more than two choices. This specific model is presented next for pedagogical purposes, because it is relatively easy to derive its likelihood given some reasonable assumptions. However, even though the log-normal race is a "legitimate" cognitive model (see Further Reading for examples), the majority of the literature fits choice response times with the linear ballistic accumulator and/or the drift diffusion model, which provide more flexibility to the modeler.
The next section explains how the log-normal race model is implemented, using data from a lexical decision task.
## Modeling a lexical decision task
In a \index{Lexical decision task} lexical decision task, a subject is presented with a string of letters on the screen and they need to decide whether the string is a word or a non-word; see Figure \@ref(fig:LDT-tikz). In the example developed below, a subset of 600 words and 600 non-words from 20 subjects ($600\times2 \times 20$ data points) are used from the data of the \index{British Lexicon project} British Lexicon project [@keuleers2012british]. The data are stored as the object `df_blp` in the package `bcogsci`. In this data set, the lexicality of the string (word or non-word) is indicated in the column `lex`. The goal is to investigate how \index{Word frequency} word frequency, shown in the column `freq` (frequency is counted per million words using the \index{British National Corpus} British National Corpus), affects the lexical decision task as quantified by accuracy and response time. For more details about the data set, type `?df_blp` on the `R` command line after loading the library `bcogsci`.
(ref:LDT-tikz) Two trials in a lexical decision task. For the first trial, `rurble`, the correct answer would be to press the key on a keyboard or a response console that is mapped to the "non-word" response; for the second trial, `monkey`, the correct answer would be to press the key that is mapped to the "word" response.
```{r LDT-tikz,engine='tikz',fig.ext= if (knitr::is_html_output()) "svg" else "pdf", echo = FALSE, fig.cap = "(ref:LDT-tikz)", fig.height = 3.5}
\usetikzlibrary{positioning}
\begin{tikzpicture}
\tikzset{
basefont/.style = {font = \Large\sffamily},
timing/.style = {basefont, sloped,above,},
label/.style = {basefont, align = left},
screen/.style = {basefont, white, align = center,
minimum size = 6cm, fill = black!60, draw = white}};
% macro for defining screens
\newcommand*{\screen}[4]{%
\begin{scope}[xshift =#3, yshift = #4,
every node/.append style = {yslant = 0},
yslant = 0.33,
local bounding box = #1]
\node[screen] at (3cm,3cm) {#2};
\end{scope}
}
% define several screens
\screen{frame1}{\textbf+} {0} {0}
\screen{frame2}{rurble}{150} {-60}
\screen{frame3}{\textbf+} {300}{-120}
\screen{frame4}{monkey}{450}{-180}
\end{tikzpicture}
```
```{r}
data("df_blp")
df_blp
```
The following code chunk adds $0.01$ to the frequency column. A frequency of $0.01$ corresponds to a word that appears only once in the corpus; $0.01$ is added in order to avoid word frequencies of zero. The frequencies are then log-transformed to compress their range of values [see @BrysbaertEtAl2018 for a more in-depth treatment of word frequencies] and centered. The code also creates a new variable that sum-codes the lexicality of each string (word, $0.5$; non-word, $-0.5$).
```{r}
df_blp <- df_blp %>%
mutate(lfreq = log(freq + 0.01),
c_lfreq = lfreq - mean(lfreq),
c_lex = ifelse(lex == "word", 0.5, -0.5))
```
If one wants to study the effect of frequency on words, the "traditional" way to analyze these data would be to fit two separate models to the word trials, ignoring non-words: one model on the response times of correct responses, and a second model on the \index{Accuracy} accuracy. These two models are fit below.
To fit the response times model, subset the correct responses given to strings that are words:
```{r}
df_blp_word_c <- df_blp %>%
filter(acc == 1, lex == "word")
```
Fit a hierarchical model with a log-normal likelihood and log-transformed frequency as a predictor (using `brms` here) and regularizing priors.
```{r, eval = !file.exists("dataR/fit_rt_word_c.RDS")}
fit_rt_word_c <- brm(rt ~ c_lfreq + (c_lfreq | subj),
data = df_blp_word_c,
family = lognormal,
prior = c(prior(normal(6, 1.5), class = Intercept),
prior(normal(0, 1), class = b),
prior(normal(0, 1), class = sigma),
prior(normal(0, 1), class = sd),
prior(lkj(2), class = cor)),
iter = 3000)
```
```{r, echo= FALSE}
if(!file.exists("dataR/fit_rt_word_c.RDS")){
saveRDS(fit_rt_word_c, file = "dataR/fit_rt_word_c.RDS")
} else {
fit_rt_word_c <- readRDS("dataR/fit_rt_word_c.RDS")
}
```
Show the estimate of the effect of log-frequency on the log-ms scale.
```{r}
posterior_summary(fit_rt_word_c, variable = "b_c_lfreq")
```
To fit the accuracy model, subset the responses given to strings that are words:
```{r }
df_blp_word <- df_blp %>%
filter(lex == "word")
```
Fit a hierarchical model with a \index{Bernoulli likelihood} Bernoulli likelihood (and \index{Logit link} logit link) using log-transformed frequency as a predictor (using `brms`) and relatively weak priors:
```{r, eval = !file.exists("dataR/fit_acc_word.RDS")}
fit_acc_word <- brm(acc ~ c_lfreq + (c_lfreq | subj),
data = df_blp_word,
family = bernoulli(link = logit),
prior = c(prior(normal(0, 1.5), class = Intercept),
prior(normal(0, 1), class = b),
prior(normal(0, 1), class = sd),
prior(lkj(2), class = cor)),
iter = 3000)
```
```{r, echo= FALSE}
if(!file.exists("dataR/fit_acc_word.RDS")){
saveRDS(fit_acc_word, file = "dataR/fit_acc_word.RDS")
} else {
fit_acc_word <- readRDS("dataR/fit_acc_word.RDS")
}
```
Show the estimate of the effect of log-frequency on the log-odds scale:
```{r}
posterior_summary(fit_acc_word, variable = "b_c_lfreq")
```
For this specific data set, it does not matter whether response times or accuracy are chosen as the dependent variable, since both yield results with a similar interpretation: More frequent words are identified more easily, that is, with shorter reading times (this is evident from the negative sign on the estimate of the effect), and with higher accuracy (positive sign on the estimate). However, it might be the case that some data set shows divergent directions in response times and accuracy. For example, more frequent words might take longer to identify, leading to a slowdown in response time as frequency increases, but might still be identified more accurately.
Furthermore, two models are fit above, treating response times and accuracy as independent. In reality, there is plenty of evidence that they are related (e.g., the \index{Speed-accuracy trade-off} speed-accuracy trade-off). Even in these data, as frequency increases, correct answers are given faster, and most errors are for low-frequency words (see Figure \@ref(fig:rtlexical)).
```{r, eval = FALSE, echo = FALSE}
alpha_rt <- fixef(fit_rt_word_c, summary = FALSE)[, "Intercept"]
beta_rt <- fixef(fit_rt_word_c, summary = FALSE)[, "c_lfreq"]
u_1_rt <- ranef(fit_rt_word_c, summary = FALSE)$subj[, , "Intercept"]
u_2_rt <- ranef(fit_rt_word_c, summary = FALSE)$subj[, , "c_lfreq"]
by_subj_rt <- exp(alpha_rt + u_1_rt) -
  exp(alpha_rt + u_1_rt + (beta_rt + u_2_rt))
by_subj_rt_s <- t(apply(by_subj_rt, 2,
function(x) c(Estimate = mean(x), quantile(x, c(.025,.975))))) %>%
bind_cols(subj = 1:nrow(.))
alpha_acc <- fixef(fit_acc_word, summary = FALSE)[, "Intercept"]
beta_acc <- fixef(fit_acc_word, summary = FALSE)[, "c_lfreq"]
u_1_acc <- ranef(fit_acc_word, summary = FALSE)$subj[, , "Intercept"]
u_2_acc <- ranef(fit_acc_word, summary = FALSE)$subj[, , "c_lfreq"]
by_subj_acc <- exp(alpha_acc + u_1_acc) -
  exp(alpha_acc + u_1_acc + (beta_acc + u_2_acc))
by_subj_acc_s <- t(apply(by_subj_acc,2, function(x) c(Estimate = mean(x), quantile(x, c(.025,.975))))) %>%
bind_cols(subj = 1:nrow(.))
by_subj <- left_join(by_subj_rt_s, by_subj_acc_s, by = "subj")
ggplot(by_subj, aes(x = Estimate.x, y = Estimate.y)) +
geom_point()
by_subj$rank_ms <- rank (by_subj$Estimate.x)
by_subj$`rank_%` <- rank(by_subj$Estimate.y)
ggplot(by_subj, aes(x=rank_ms, y=`rank_%`,label=subj))+geom_text()
```
(ref:rtlexical) The distribution of response times for words and non-words, and correct and incorrect answers.
```{r rtlexical, fig.cap = "(ref:rtlexical)", message = FALSE, warning= FALSE, fold = TRUE, fig.height = 3.5}
acc_lbl <- as_labeller(c(`0` = "Incorrect", `1` = "Correct"))
ggplot(df_blp, aes(y = rt, x = freq + .01, shape = lex, color = lex)) +
geom_point(alpha = .5) +
facet_grid(. ~ acc, labeller = labeller(acc = acc_lbl)) +
scale_x_continuous("Frequency per million (log-scaled axis)",
limits = c(.0001, 2000),
breaks = c(.01, 1, seq(5, 2000, 5)),
labels = ~ ifelse(.x %in% c(.01, 1, 5, 100, 2000), .x, "")
) +
scale_y_continuous("Response times in ms (log-scaled axis)",
limits = c(150, 8000),
breaks = seq(500,7500,500),
labels = ~ ifelse(.x %in% c(500,1000,2000, 7500), .x, "")
) +
scale_color_discrete("lexicality") +
scale_shape_discrete("lexicality") +
theme(legend.position = "bottom") +
coord_trans(x = "log", y = "log")
```
A powerful way to convey the relationship between response times and accuracy is using \index{Quantile probability plot} *quantile probability plots* [@ratcliff2002estimating; these are closely related to the latency probability plots of @audley1965some].^[Other useful tools to investigate errors and response time patterns are plots of *conditional accuracy functions* [@MaanenEtAl2018Fastslowerrors] and plots of *defective cumulative distributions of response times* [@brownSimplestCompleteModel2008].]
A quantile probability plot shows quantiles of the response time distribution (typically $0.1$, $0.3$, $0.5$, $0.7$, and $0.9$) for correct and \index{Incorrect responses} incorrect responses on the y-axis against probabilities of correct and incorrect responses for experimental conditions on the x-axis. The plot is built by first aggregating the data.
To display a quantile probability plot, create a custom function `qpf()` that takes as arguments a data set grouped by an experimental condition (e.g., words vs non-words, here by `lex`), and the quantiles that need to be displayed (by default, $0.1$, $0.3$, $0.5$, $0.7$, $0.9$). The function works as follows:
First, calculate the desired quantiles of the response times for incorrect and correct responses by condition (these are stored in `rt_q`). Second, calculate the proportion of incorrect and correct responses by condition (these are stored in `p`); because this information is needed for each quantile, repeat it for the number of quantiles chosen (here, five times). Last, record the quantile that each response time and response probability corresponds to (this is recorded in `q`), and whether it corresponds to an incorrect or a correct response (this information is stored in `response`).
```{r}
qpf <- function(df_grouped,
quantiles = c(0.1, 0.3, 0.5, 0.7, 0.9)) {
df_grouped %>% summarize(
rt_q = list(c(quantile(rt[acc == 0], quantiles),
quantile(rt[acc == 1], quantiles))),
p = list(c(rep(mean(acc == 0), length(quantiles)),
rep(mean(acc == 1), length(quantiles)))),
q = list(rep(quantiles, 2)),
response = list(c(rep("incorrect", length(quantiles)),
rep("correct", length(quantiles))))) %>%
# Since the summary contains a list in each column,
# we unnest it to have the following number of rows:
# number of quantiles x groups x 2 (incorrect, correct)
unnest(cols = c(rt_q, p, q, response))
}
df_blp_lex_q <- df_blp %>%
group_by(lex) %>%
qpf()
```
The aggregated data look like this:
```{r}
df_blp_lex_q %>% print(n = 10)
```
Plot the data by joining the points that belong to the same quantiles with lines. Given that incorrect responses in most tasks occur in less than 50% of the trials, and correct responses occur in the complementary proportion (i.e., in more than 50% of the trials), incorrect responses usually appear in the left half of the plot, and correct ones in the right half. The code that appears below produces Figure \@ref(fig:qpplex).
(ref:qpplex) Quantile probability plots showing $0.1$, $0.3$, $0.5$, $0.7$, and $0.9$-th response time quantiles plotted against proportion of incorrect responses (left) and proportion of correct responses (right) for strings that are words and non-words.
```{r qpplex, fig.cap = "(ref:qpplex)", fig.height = 3.5}
ggplot(df_blp_lex_q, aes(x = p, y = rt_q)) +
geom_vline(xintercept = 0.5, linetype = "dashed") +
geom_point(aes(shape = lex)) +
geom_line(aes(group = interaction(q, response))) +
ylab("RT quantiles (ms)") +
scale_x_continuous("Response proportion", breaks = seq(0, 1, .2)) +
scale_shape_discrete("Lexicality") +
annotate("text", x = 0.40, y = 500, label = "incorrect") +
annotate("text", x = 0.60, y = 500, label = "correct")
```
The vertical spread among the lines shows the shape of the response time distribution. The lower \index{Quantile line} quantile lines correspond to the left part of the \index{Response time distribution} response time distribution, and the higher quantiles to the right part of the distribution. Since the response time distribution is \index{Long tailed} long tailed and \index{Right skewed} right skewed, the higher quantiles are more spread apart than the lower quantiles.
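This spread pattern follows directly from the shape of a right-skewed distribution such as the log-normal. The following quick check uses arbitrary log-normal parameters (illustrative only, not fit to the data) to compare the gap between the two highest quantiles with the gap between the two lowest:

```{r}
# Quantiles typically shown in a quantile probability plot:
qs <- c(0.1, 0.3, 0.5, 0.7, 0.9)
# Arbitrary log-normal parameters, for illustration only:
q_rt <- qlnorm(qs, meanlog = 6, sdlog = 0.5)
# The gap between the 0.7 and 0.9 quantiles is larger than
# the gap between the 0.1 and 0.3 quantiles:
q_rt[5] - q_rt[4]
q_rt[2] - q_rt[1]
```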
A quantile probability plot can also be used to corroborate the observation that high-frequency words are easier to recognize. To do that, subset the data to only words, and group the strings according to their "frequency group" (that is, according to the quantile of frequency that the strings belong to).
Whereas we previously aggregated over all the observations, ignoring subjects, we can also aggregate by subjects first and then average the results. This prevents idiosyncratic responses from individual subjects from dominating the plot. (We could also plot individual quantile probability plots by subject.) Apart from the fact that the aggregation is by subjects, the code below follows the same steps as before, and the result is shown in Figure \@ref(fig:qppfreq). The plot shows that for more frequent words, accuracy improves and responses are faster.
(ref:qppfreq) Quantile probability plot showing $0.1$, $0.3$, $0.5$, $0.7$, and $0.9$-th response times quantiles plotted against proportion of incorrect responses (left) and proportion of correct responses (right) for words of different frequency. Word frequency is grouped according to quantiles: The first group is words with frequencies smaller than the $0.2$-th quantile, the second group is words with frequencies smaller than the $0.4$-th quantile and larger than the $0.2$-th quantile, and so forth.
```{r qppfreq, fig.cap = "(ref:qppfreq)", warning = FALSE, fig.height = 3.5}
df_blp_freq_q <- df_blp %>%
# Subset only words:
filter(lex == "word") %>%
  # Create 5 word frequency groups
  mutate(freq_group =
           cut(lfreq,
               quantile(lfreq, c(0, 0.2, 0.4, 0.6, 0.8, 1)),
               include.lowest = TRUE,
               labels =
                 c("0-0.2", "0.2-0.4",
                   "0.4-0.6", "0.6-0.8", "0.8-1"))
) %>%
# Group by condition and subject:
group_by(freq_group, subj) %>%
# Apply the quantile probability function:
qpf() %>%
# Group again removing subject:
group_by(freq_group, q, response) %>%
# Get averages of all the quantities:
summarize(rt_q = mean(rt_q),
p = mean(p))
# Plot
ggplot(df_blp_freq_q, aes(x = p, y = rt_q)) +
geom_point(shape = 4) +
geom_text(
data = df_blp_freq_q %>%
filter(q == 0.1),
aes(label = freq_group), nudge_y = 12) +
geom_line(aes(group = interaction(q, response))) +
ylab("RT quantiles (ms)") +
scale_x_continuous("Response proportion", breaks = seq(0, 1, 0.2)) +
annotate("text", x = 0.40, y = 900, label = "incorrect") +
annotate("text", x = 0.60, y = 900, label = "correct")
```
So far, we have shown several ways to describe the data graphically. Next, we turn to modeling the data.
### Modeling the lexical decision task with the log-normal race model {#sec-acccoding}
The log-normal race model is used here to examine the effect of word frequency on both response times and choice (word vs. non-word) in the lexical decision task presented earlier. In this example, the log-normal race model is limited to two choices, although, as mentioned earlier, it can in principle fit more than two. When modeling a task with two choices, there are two ways to account for the data: either fit the response times and the accuracy (i.e., \index{Accuracy coding} accuracy coding: correct vs. incorrect), or fit the response times and the actual responses (i.e., \index{Stimulus coding} stimulus coding: in this case, word vs. non-word). In this example, we will use the stimulus-coding approach.
The following code chunk adds a new column that incorporates the actual choice made (as `word` vs. `non-word` in `choice` and as `1` vs. `2` in `nchoice`):
```{r}
df_blp <- df_blp %>%
mutate(choice = ifelse((lex == "word" &
acc == 1) |
(lex == "non-word" &
acc == 0), "word", "non-word"),
nchoice = ifelse(choice == "word", 1, 2))
```
To start modeling the data, think about the behavior of one synthetic subject. This subject simultaneously accumulates evidence for the response, "word" in one \index{Accumulator} accumulator, and for "non-word" in another independent accumulator. Unlike other sequential sampling models, an increase in evidence for one choice doesn't necessarily reduce the evidence for the other choices.
@RouderEtAl2015 point out that it might seem odd to assume that we accumulate evidence for a non-word in the same manner as we accumulate evidence for a word, since non-words may be conceptualized as the absence of a word. However, they stress that this approach is closely related to \index{Novelty detection} novelty detection, where the salience of never-before experienced stimuli seems to indicate that novelty is psychologically represented as more than the absence of familiarity. Nevertheless, notions of word and non-word evidence accumulation are indeed controversial [see @dufau2012say]. The alternative approach of fitting accuracy rather than stimuli, discussed before, doesn't really circumvent the problem: when the correct answer is `word`, we assume that the "correct" accumulator accumulates evidence for `word` and the incorrect one for `non-word`, and the other way around when the correct answer is `non-word`.
### A generative model for a race between accumulators {#sec-genaccum}
To build a generative model of the task based on the log-normal race model, start by spelling out the assumptions. In a race of accumulators model, the assumption is that the time $T$ taken for each \index{Accumulator} accumulator of evidence to reach a decision \index{Threshold} threshold at distance $D$ is simply defined by
\begin{equation}
T = D/V
\end{equation}
where the denominator $V$ is the rate (velocity, sometimes also called \index{Drift rate} drift rate) of \index{Evidence accumulation} evidence accumulation.
The log-normal race model assumes that the \index{Rate} rate $V$ in each trial is sampled from a log-normal distribution:
\begin{equation}
V \sim \mathit{LogNormal}(\mu_v, \sigma_v)
\end{equation}
```{r lognormalrace, echo = FALSE, fig.cap = "A schematic illustration of the log-normal race model for the lexical-decision task with a word stimulus. A larger rate of accumulation (V) leads to a larger slope. Here, the choice of word is selected.", fig.height = 4}
set.seed(100)
D_w <- 10000
D_nw <- 10000
mu_v_w <- log(20)
mu_v_nw <- log(10)
sigma <- .3
N <- 100
V_w <- rlnorm(N, mu_v_w, sigma)
V_nw <- rlnorm(N, mu_v_nw, sigma)
T_w <- D_w / V_w
T_nw <- D_nw/V_nw
data_acc <- tibble(accumulator = c(rep("word", N), rep("non-word",N)) %>%
factor(levels=c("word","non-word")),
V = c(V_w,V_nw),
T = c(T_w, T_nw),
D = c(rep(D_w, N), rep(D_nw,N)),
trial = c(1:N, 1:N))
onetrial <- data_acc %>%
group_by(trial) %>%
filter(T[accumulator=="word"] < T[accumulator=="non-word"]) %>%
ungroup(trial) %>%
filter(trial == min(trial))
ggplot(data_acc, aes(xend = T)) +
geom_segment(aes(x =0, y=0, yend = D), alpha = .5, color = "gray") +
geom_segment(data= onetrial, aes(x =0, y=0, yend = D), color = "black") +
facet_grid(rows = vars(accumulator))+
## geom_hline(aes(yintercept = D), linetype = "dashed")+
geom_vline(data = onetrial, aes(xintercept = T), linetype = "dashed")+
scale_y_continuous("Evidence", breaks = NULL, expand = c(0, 0)) +
scale_x_continuous("rt", breaks = NULL, expand = c(0,0) )+
coord_cartesian(clip = "off")+
## coord_cartesian(xlim=c(-200, max(T_nw)), ylim=c(450,D_nw+100))+
annotate(geom="segment", x = -350, xend = 0, y= 300,yend=300,arrow = arrow(length = unit(0.3,"cm")))+
annotate(geom="text", x =-150, y=1500,label = "non-decision\ntime")+
annotate(geom="segment", x = 1400, xend = 1400, y= 0,yend=D_w,arrow = arrow(length = unit(0.3,"cm")))+
annotate(geom="text", x =1600 , y=5800,label = "Decision\nthreshold (D)")+
geom_segment(data = onetrial, aes(x = 0, xend = T, y= 300,yend=300),arrow = arrow(length = unit(0.3,"cm")))+
annotate(geom="text", x =320, y=750,label = "Decision time (T)")
```
A log-normal distribution is partly justified by the work by @ulrichInformationProcessingModels1993 (also see online section \@ref(app-lognormal)), and as discussed later, it is very convenient mathematically.
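One piece of that mathematical convenience is that the ratio $D/V$ is itself log-normal: if $V \sim \mathit{LogNormal}(\mu_v, \sigma_v)$, then $T = D/V \sim \mathit{LogNormal}(\log(D) - \mu_v, \sigma_v)$, since $\log(T) = \log(D) - \log(V)$. A quick simulation (with arbitrary parameter values, not estimates from the data) illustrates this:

```{r}
set.seed(42)
D <- 1800                 # arbitrary distance to threshold
mu_v <- 1; sigma_v <- 0.5 # arbitrary location and scale of the rate
V <- rlnorm(1e5, mu_v, sigma_v)
T_sim <- D / V
# Compare the sample median of D / V with the analytical median
# of LogNormal(log(D) - mu_v, sigma_v), which is exp(log(D) - mu_v):
median(T_sim)
exp(log(D) - mu_v)
```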
For simplicity, assume that the distance $D$ to the threshold is kept constant. This might not be a good assumption if the experiment is designed so that subjects change their threshold depending on speed or accuracy incentives (that was not the case in this experiment), or if the subject gets fatigued as the experiment progresses, or if there is reason to believe that there might be \index{Random fluctuation} random fluctuations in this \index{Threshold} threshold. Later in this chapter, we will discuss what happens if this assumption is relaxed.
In each trial $n$, a string is presented, and both the word and non-word accumulators of evidence need to be modeled. For the model to capture the behavior of the subjects, evidence needs to be accumulated faster for the word accumulator than for the non-word accumulator when the string presented is a word, and vice versa when the string is a non-word.
Assume that, for trial $n$, the rate of accumulation of evidence for words, $V_{w,n}$, and the rate for non-words, $V_{nw,n}$, are generated as follows:
\begin{equation}
\begin{aligned}
V_{w,n} &\sim \mathit{LogNormal}(\mu_{v_{w,n}}, \sigma)\\
V_{nw,n} &\sim \mathit{LogNormal}(\mu_{v_{nw,n}}, \sigma)
\end{aligned}
\end{equation}
The location $\mu_{v_w}$ of the distribution of rates of accumulation of evidence for a word $w$ is a function of the \index{Lexicality} lexicality of the string presented (only a word will increase this rate of accumulation and not a non-word), \index{Frequency} frequency (i.e., high-frequency words might be easier to identify, leading to a faster rate of accumulation than with low-frequency words), and \index{Bias} bias (i.e., a subject might have a tendency to answer that a string is a word rather than non-word or vice-versa, regardless of the stimuli).
This assumption can be modeled with a linear regression over $\mu_{v_w}$, with parameters that represent the bias to categorize a string as a word, $\alpha_{w}$, the effect of lexicality, $\beta_{lex_{w}}$, and the effect of log-frequency $\beta_{\mathit{lfreq}_{w}}$.
\begin{equation}
\mu_{v_{w,n}} = \alpha_w + \mathit{lex}_n \cdot \beta_{lex_{w}} + \mathit{lfreq}_n \cdot \beta_{\mathit{lfreq}_{w}}
\end{equation}
For the non-word accumulator, the location for the rate of accumulation of evidence is defined similarly:
\begin{equation}
\mu_{v_{nw,n}} = \alpha_{nw} + \mathit{lex}_n \cdot \beta_{lex_{nw}} + \mathit{lfreq}_n \cdot \beta_{\mathit{lfreq}_{nw}}
\end{equation}
The accumulators reach the threshold at times $T_{w,n}$ (for words) and $T_{nw,n}$ (for non-words):
\begin{equation}
\begin{aligned}
T_{w,n} &= D_{w}/V_{w,n}\\
T_{nw,n} &= D_{nw}/ V_{nw,n}
\end{aligned}
\end{equation}
The choice for trial $n$ corresponds to the accumulator with the shortest time for that trial,
\begin{equation}
\mathit{choice}_n =
\begin{cases}
\mathit{word}, & \text{ if } T_{w,n} < T_{nw,n} \\
\mathit{non}\text{-}{word}, & \text{ otherwise }
\end{cases}
\end{equation}
and $T_n$, the time taken to make a decision in trial $n$, is the minimum of $T_{w,n}$ and $T_{nw,n}$:
\begin{equation}
T_n = \min(T_{w,n}, T_{nw,n})
\end{equation}
We also need to take into account that not all the time spent in the task involves making the decision: Time is spent fixating the gaze on the screen, pressing a button, etc. We'll add a shift to the distribution, representing the minimum amount of time that a subject needs for all the peripheral processes that happen before and after the decision [also see @Rouder2005]. We represent this \index{Non-decision time} non-decision time with $T_0$. Although some variation in the non-decision time is highly likely, we use a constant as an approximation. This simplification is reasonable if the variation in non-decision time is small relative to the variation in decision time [@HeathcoteLove2012].
\begin{equation}
rt_n = T_0 + T_n
\end{equation}
The following chunk of code generates \index{Synthetic data} synthetic data for one subject, by setting true point values to the parameters and translating the previous equations to `R`. The true values are relatively arbitrary and were decided by trial and error until a relatively realistic distribution of \index{Response time} response times was obtained. Considering that this is only one subject (unlike what was shown in previous figures), Figure \@ref(fig:scatterracesim) looks relatively fine. (One can also inspect the quantile probability plots of individual subjects in the real data set and compare them to the synthetic data).
First, set a \index{Seed} seed to always generate the same \index{Pseudo-random value} pseudo-random values, take a subset of the data set to keep the same structure of the data frame for our simulated subject, and set true point values. The distance to the decision threshold, $D$, is set to the arbitrary constant value of 1800 so that, with the other parameters having magnitudes similar to those typically found in response time studies, the simulated reaction times have a realistic magnitude.
```{r lnsimdata1}
set.seed(123)
df_blp_1subj <- df_blp %>%
filter(subj == 1)
# Set the same threshold to both accumulators
D <- 1800
alpha_w <- 0.8
beta_wlex <- 0.5
beta_wlfreq <- 0.2
alpha_nw <- 1
beta_nwlex <- -0.5
beta_nwlfreq <- -0.05
sigma <- 0.8
T_0 <- 150
```
Second, generate the locations of both accumulators, `mu_v_w` and `mu_v_nw`, for every trial. This means that both variables are vectors of length `N`, the number of trials:
```{r lnsimdata2}
mu_v_w <- alpha_w + df_blp_1subj$c_lfreq * beta_wlfreq +
df_blp_1subj$c_lex * beta_wlex
mu_v_nw <- alpha_nw + df_blp_1subj$c_lfreq * beta_nwlfreq +
df_blp_1subj$c_lex * beta_nwlex
N <- nrow(df_blp_1subj)
```
Third, generate values for the rates of accumulation, `V_w` and `V_nw`, for every trial. Use those rates to calculate `T_w` and `T_nw`, how long it will take for each accumulator to reach its threshold (assumed to be the same for both accumulators) for every trial:
```{r lnsimdata3}
V_w <- rlnorm(N, mu_v_w, sigma)
V_nw <- rlnorm(N, mu_v_nw, sigma)
T_w <- D / V_w
T_nw <- D / V_nw
```
Fourth, calculate the time it takes to reach a decision in every trial, `T_n`, as the by-trial minimum between `T_w` and `T_nw`. Similarly, store the winner accumulator for each trial in `accumulator_winner`:
```{r lnsimdata4}
T_n <- pmin(T_w, T_nw)
accumulator_winner <- ifelse(T_w == pmin(T_w, T_nw),
"word",
"non-word")
```
Finally, add this information to the data frame that now indicates choice, time, and accuracy for each trial:
```{r lnsimdata5}
df_blp1_sim <- df_blp_1subj %>%
mutate(rt = T_0 + T_n,
choice = accumulator_winner,
nchoice = ifelse(choice == "word", 1, 2)) %>%
mutate(acc = ifelse(lex == choice, 1, 0))
```
(ref:scatterracesim) The distribution of response times for words and non-words, and correct and incorrect answers for the synthetic data of one subject.
```{r scatterracesim, fig.cap = "(ref:scatterracesim)", message = FALSE, warning= FALSE, fold = TRUE, fig.height =3.5}
acc_lbl <- as_labeller(c(`0` = "Incorrect", `1` = "Correct"))
ggplot(df_blp1_sim, aes(y = rt, x = freq + .01, shape = lex, color = lex)) +
geom_point(alpha = .5) +
facet_grid(. ~ acc, labeller = labeller(acc = acc_lbl)) +
scale_x_continuous("Frequency per million (log-scaled axis)",
limits = c(.0001, 2000),
breaks = c(.01, 1, seq(5, 2000, 5)),
labels = ~ ifelse(.x %in% c(.01, 1, 5, 100, 2000), .x, "")
)+
scale_y_continuous("Response times in ms (log-scaled axis)",
limits = c(150, 8000),
breaks = seq(500,7500,500),
labels = ~ ifelse(.x %in% c(500,1000,2000, 7500), .x, "")
) +
scale_color_discrete("lexicality") +
scale_shape_discrete("lexicality") +
theme(legend.position = "bottom") +
coord_trans(x = "log", y = "log")
```
### Fitting the log-normal race model
A first issue that arises when attempting to fit the \index{Log-normal race model} log-normal race model is that we need to fit its likelihood to the decision time $T$, which is the ratio of $D$ and $V$; there are two situations that are compatible with our assumptions and are mathematically simple:
(1) If we assume that $D$ is a constant $k$ and only $V$ is a random variable, then $T = k/V$, and
\begin{equation}
\log(T) = \log(k/V) = \log(k) - \log(V)
\end{equation}
Since $V$ is log-normally distributed, $\log(V) \sim \mathit{Normal}(\mu_v, \sigma_v)$, and $\log(k)$ is a constant:
\begin{equation}
\begin{aligned}
\log(T) &\sim \mathit{Normal}(\log(k) - \mu_v, \sigma_v)\\
T &\sim \mathit{LogNormal}(\log(k) - \mu_v, \sigma_v)
\end{aligned}
(\#eq:lognormalkV)
\end{equation}
(2) If both $D$ and $V$ are random variables, we need a \index{Ratio distribution function} ratio or \index{Quotient distribution function} quotient distribution function. While for arbitrary distributions this requires solving (sometimes extremely complex) integrals [see, for example, @Nelson1981], the log-normal case remains simple: a log-normally distributed time is not uniquely predicted by assuming that the distance is a constant; it also follows if the distance is itself a log-normally distributed variable, independent of the rate. If we assume that $D \sim \mathit{LogNormal}(\mu_d, \sigma_d)$, then $T$ is the ratio of two random variables, $D/V$, and
\begin{equation}
\log(T) = \log(D/V) = \log(D) - \log(V)
\end{equation}
We have a difference of independent, normally distributed random variables. It follows from random variable theory [see, e.g., @RossProb] that:
\begin{equation}
\begin{aligned}
\log(T) &\sim \mathit{Normal}(\mu_d - \mu_v, \sqrt{\sigma_d^2+\sigma_v^2})\\
T &\sim \mathit{LogNormal}(\mu_d - \mu_v, \sqrt{\sigma_d^2+\sigma_v^2})
\end{aligned}
(\#eq:lognormalDV)
\end{equation}
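The result in Equation \@ref(eq:lognormalDV) can be checked numerically. The following sketch (not part of the model code; all parameter values are arbitrary) draws independent log-normal distances and rates and compares the sample moments of $\log(T)$ with the analytical ones:

```r
# Monte Carlo check: if D and V are independent log-normal variables,
# then log(D/V) is normal with mean mu_d - mu_v
# and standard deviation sqrt(sigma_d^2 + sigma_v^2).
set.seed(1)
mu_d <- 7.5; sigma_d <- 0.3
mu_v <- 1; sigma_v <- 0.4
d_sim <- rlnorm(1e6, mu_d, sigma_d)
v_sim <- rlnorm(1e6, mu_v, sigma_v)
log_T_sim <- log(d_sim / v_sim)
# Sample moments vs. analytical moments:
c(mean(log_T_sim), sd(log_T_sim))
c(mu_d - mu_v, sqrt(sigma_d^2 + sigma_v^2))
```

The two pairs of numbers should agree up to Monte Carlo error.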
From Equations \@ref(eq:lognormalkV) and \@ref(eq:lognormalDV), it should be clear that the threshold and accumulation rate cannot be disentangled: a manipulation that affects the rate or the decision threshold will affect the location of the distribution in the same way [also see @RouderEtAl2015]. Another important observation is that $T$ won't have a \index{Log-normal distribution} log-normal distribution when $D$ has any other distributional form.
Following @RouderEtAl2015, we assume that the \index{Noise parameter} noise parameter is the same for each accumulator, since this means that contrasts between \index{Finishing time distribution} finishing time distributions are captured completely by contrasts of the locations of the log-normal distributions. We discuss at the end of the chapter why one would need to relax this assumption (also see online exercise \@ref(exr:lnracescale)).
In each trial $n$, with an accumulator for words, indicated with the subscript $w$, and one for
non-words, indicated with $nw$, we can model the time it takes for each accumulator to get to
the threshold $D$ in the following way. For the word accumulator,
\begin{equation}
\begin{aligned}
\mu_{w,n} &= \mu_{d_w} - \mu_{v_{w,n}}\\
\mu_{w,n} &= \mu_{d_{w}} - (\alpha_w + \mathit{lex}_n \cdot \beta_{lex_{w}} + \mathit{lfreq}_n \cdot \beta_{\mathit{lfreq}_{w}})\\
\mu_{w,n} &= (\mu_{d_w} - \alpha_w) - \mathit{lex}_n \cdot \beta_{lex_{w}} - \mathit{lfreq}_n \cdot \beta_{\mathit{lfreq}_{w}}\\
\mu_{w,n} &= \alpha'_w - \mathit{lex}_n \cdot \beta_{lex_{w}} - \mathit{lfreq}_n \cdot \beta_{\mathit{lfreq}_{w}}\\
T_{w,n} &\sim \mathit{LogNormal}(\mu_{w,n}, \sigma) \\
\end{aligned}
(\#eq:alphaprime)
\end{equation}
The parameter $\alpha'_w$ absorbs the location of the threshold distribution ($\mu_{d_w}$, which we assumed is independent of the accumulator; that is, it is equal to $\mu_d$) minus the intercept of the rate distribution ($\alpha_w$), and represents a bias. As $\alpha'_w$ gets smaller, the accumulator will be more likely to reach the threshold first, all things being equal, biasing the responses to `word`.
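This bias interpretation can be illustrated with a small simulation, independent of the model code (the helper `race_prop_word()` and all parameter values are hypothetical): lowering the location of the word accumulator's finishing time distribution increases the proportion of trials that it wins.

```r
# Proportion of trials in which the "word" accumulator wins a
# two-accumulator log-normal race (illustrative values only).
set.seed(1)
race_prop_word <- function(alpha_p_w, alpha_p_nw,
                           sigma = 0.8, n = 1e5) {
  T_w <- rlnorm(n, alpha_p_w, sigma)   # word finishing times
  T_nw <- rlnorm(n, alpha_p_nw, sigma) # non-word finishing times
  mean(T_w < T_nw)
}
race_prop_word(6, 6)    # equal locations: no bias, ~ 0.5
race_prop_word(5.5, 6)  # smaller alpha'_w: biased toward "word"
```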
Similarly, for the non-word accumulator,
\begin{equation}
\begin{aligned}
\mu_{nw,n} &= \alpha'_{nw} - \mathit{lex}_n \cdot \beta_{lex_{nw}} - \mathit{lfreq}_n \cdot \beta_{\mathit{lfreq}_{nw}}\\
T_{nw,n} &\sim \mathit{LogNormal}( \mu_{nw,n}, \sigma)
\end{aligned}
\end{equation}
The only observed time is the one associated with the winner accumulator, the response selected $s$, which corresponds to the faster accumulator:
\begin{equation}
T_{\mathit{accum}=s,n} \sim \mathit{LogNormal}(\mu_{\mathit{accum}=s,n}, \sigma)
\end{equation}
If we only fit the observed finishing times of the accumulators, we ignore that, in a given trial, the accumulator that lost was slower than the accumulator for which we have the latency; this means that we underestimate the time it takes to reach the threshold and overestimate the rate of accumulation of both accumulators. This can be treated as a problem of \index{Censored data} *censored data*, where for each trial the slower observations are not known.
Since the potential decision time of the accumulator that wasn't selected is necessarily longer than that of the winner accumulator, we obtain the likelihood for each unobserved time by integrating out all the possible decision times that this accumulator could have had; that is, from the time it took the winner accumulator to reach the threshold to infinitely large decision times. In other words, we are calculating how likely the latent, unknown value of the random variable $T_{\mathit{accum} \neq s,n}$ is, knowing that it is always greater than or equal to the observed (single) value of $T_{\mathit{accum}=s,n}$.
\begin{equation}
P(T_{\mathit{accum} \neq s,n} \geq T_{\mathit{accum} = s,n}) = \int_{T_{\mathit{accum}=s,n}}^{\infty} \mathit{LogNormal}(T|\mu_{a \neq s,n}, \sigma) \, dT
\end{equation}
This integral is the complement of the CDF of the log-normal distribution evaluated at
$T_{\mathit{accum} = s,n}$.
\begin{equation}
P(T_{\mathit{accum} \neq s,n}) = 1 - \mathit{LogNormal}\_CDF(T_{\mathit{accum}=s,n}| \mu_{\mathit{accum} \neq s,n}, \sigma)
\end{equation}
where $\mathit{LogNormal}\_CDF(T_{\mathit{accum}=s,n}| \mu_{\mathit{accum} \neq s,n}, \sigma)$ is a convenient shorthand for the CDF of the log-normal distribution with parameters $\mu_{\mathit{accum} \neq s,n}, \sigma$ evaluated at $T_{\mathit{accum}=s,n}$.
So far we have been fitting the \index{Decision time} decision time $T$, but our dependent variable is \index{Response time} response times, $rt$, the sum of the decision time $T$ and the non-decision time $T_0$. This requires a change of variables in our model, $T_{n} = rt_{n} - T_0$, since $rt$ but not $T$ is available as data. Here, the Jacobian is 1, since $\left|\frac{d(\mathit{rt}_n - T_0)}{d\mathit{rt}_n}\right| = 1$. In a change of variables, we must multiply the likelihood by the Jacobian (for details, see section \@ref(sec-change)). Multiplying the likelihood by one, or equivalently, adding $\log(1) = 0$ to the log-likelihood, does not alter the likelihood (or the log-likelihood). Therefore, we do not need to code the Jacobian in the Stan code.^[Adding the Jacobian adjustment would be necessary if one assumes that the non-decision time has its own distribution rather than being constant.]
To sum up, our model can be stated as follows:
\begin{equation}
\begin{aligned}
T_n &= rt_n - T_0\\
\mu_{w,n} &= \alpha'_w - \mathit{lex}_n \cdot \beta_{lex_{w}} - \mathit{lfreq}_n \cdot \beta_{\mathit{lfreq}_{w}}\\
\mu_{nw,n} &= \alpha'_{nw} - \mathit{lex}_n \cdot \beta_{lex_{nw}} - \mathit{lfreq}_n \cdot \beta_{\mathit{lfreq}_{nw}}\\
T_n &\sim
\begin{cases}
\mathit{LogNormal}(\mu_{w,n}, \sigma) \text{ if } \mathit{choice}= \text{word}\\
\mathit{LogNormal}(\mu_{nw,n}, \sigma) \text{ otherwise }
\end{cases}
\end{aligned}
\end{equation}
Rather than trying to estimate all the censored observations, we integrate them out:^[One could estimate the censored times by fitting log-normal distributions that are truncated at $T_{n}$, since this is the minimum possible time for each censored observation:
\begin{equation}
T_{censored,n} \sim
\begin{cases}
\mathit{LogNormal}(\mu_{nw,n}, \sigma) \text{ with } T_{censored,n} > T_n \text{, if } \mathit{choice}= \text{word}\\
\mathit{LogNormal}(\mu_{w,n}, \sigma) \text{ with } T_{censored,n} > T_n \text{, otherwise }
\end{cases}
\end{equation}]
\begin{equation}
P(T_{censored}) =
\begin{cases}
1 - \mathit{LogNormal}\_CDF(T_{n}| \mu_{nw,n}, \sigma) \text{, if } \mathit{choice}= \text{word}\\
1 - \mathit{LogNormal}\_CDF(T_{n}| \mu_{w,n}, \sigma) \text{, otherwise }\\
\end{cases}
\end{equation}
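As a sketch of how these pieces fit together (this is not the Stan implementation; `loglik_race2()` is a hypothetical `R` helper), the joint log-likelihood of the decision time and choice in one trial combines the log-normal density of the winner accumulator with the complement of the CDF of the loser:

```r
# Joint log-likelihood of one trial of a two-accumulator
# log-normal race (a sketch, not the Stan code).
# nchoice = 1: accumulator 1 ("word") won; nchoice = 2: "non-word" won.
loglik_race2 <- function(T_dec, nchoice, mu, sigma) {
  winner <- nchoice
  loser <- if (nchoice == 1) 2 else 1
  # density of the winner's time + complement CDF of the loser's time
  dlnorm(T_dec, mu[winner], sigma, log = TRUE) +
    plnorm(T_dec, mu[loser], sigma, lower.tail = FALSE, log.p = TRUE)
}
# Example: a decision time of 500 ms with "word" chosen
# (the locations and scale here are arbitrary):
loglik_race2(500, 1, mu = c(6, 6.5), sigma = 0.8)
```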
We need priors for all the parameters. An added complication here is the prior for the non-decision time, $T_0$: we need to make sure that it's strictly positive and also that it's smaller than the shortest observed response time. This is because the decision time for each observation, $T_n$, should also be strictly positive:
\begin{equation}
\begin{aligned}
T_n = rt_n - T_0 &>0 \\
rt_n &> T_0\\
\min(\mathbf{rt}) &> T_0
\end{aligned}
\end{equation}
We thus truncate the prior of $T_0$ so that the values lie between zero and $\min(\mathbf{rt})$, the minimum value of the vector of response times. Although the prior of $T_0$ is informed by the data, the generative model underlying this approach is not. When generating data--before the simulated response times are available--one first draws a value of $T_0$ from a normal distribution truncated at zero. Then, decision times for each trial $n$, $T_n$, are generated, and finally, the simulated (or predicted) response time for each trial, $rt_n$, is the sum of $T_0$ and $T_n$. In other words, there is no circularity in this process. Given the time it takes to fixate the gaze on the screen and a minimal motor response time, centering the prior of $T_0$ at 150 ms seems reasonable. The rest of the priors are on the log scale. One should use prior predictive checks to verify that the order of magnitude of all the priors is appropriate. We skip this step here and present the priors below:
\begin{equation}
\begin{aligned}
T_0 &\sim \mathit{Normal}(150, 100) \text{ with } 0 < T_0 < \min(\mathbf{rt})\\
\boldsymbol{\alpha} &\sim \mathit{Normal}(6, 1) \\
\boldsymbol{\beta} &\sim \mathit{Normal}(0, 0.5) \\
\sigma &\sim \mathit{Normal}_+(0.5, 0.2)\\
\end{aligned}
\end{equation}
where $\boldsymbol{\alpha}$ is a vector $\langle \alpha'_{w}, \alpha'_{nw} \rangle$, and $\boldsymbol{\beta}$ is a vector of all the $\beta$ used in the likelihoods.
To translate the model into Stan, we need a normal distribution truncated so that the values lie between zero and $\min(\mathbf{rt})$ for the prior of `T_0`. This means dividing the original distribution by the \index{Difference of the CDFs} difference of the CDFs evaluated at these two points; see online section \@ref(app-truncation). In log-space, this is a difference between the log-transformed original distribution and the logarithm of the difference of the CDFs. The function `log_diff_exp` is a more stable version of this last operation [i.e., less prone to underflow/overflow, \index{Underflow} \index{Overflow} see @BlanchardEtAl2020]. What \index{\texttt{log\_diff\_exp}} `log_diff_exp` does is take the log of the difference of the exponentials of its two arguments; in this case, the two arguments are log-CDFs.
```{stan, output.var = "none", eval = FALSE, cache = FALSE, cache.lazy = FALSE}
target += normal_lpdf(T_0 | 150, 100)
- log_diff_exp(normal_lcdf(min(rt) | 150, 100),
normal_lcdf(0 | 150, 100));
```
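To see what `log_diff_exp` computes, note the identity $\log(\exp(a) - \exp(b)) = a + \log(1 - \exp(b - a))$ for $a > b$. A minimal `R` counterpart (hypothetical; base `R` has no such function, the name simply mirrors Stan's) shows why the stable form matters:

```r
# Stable log(exp(a) - exp(b)) for a > b, mirroring Stan's log_diff_exp
# (a hypothetical R counterpart, for illustration only).
log_diff_exp <- function(a, b) a + log1p(-exp(b - a))
log_diff_exp(log(3), log(1))  # log(3 - 1) = log(2)
# The naive version underflows for very negative arguments:
log(exp(-1000) - exp(-1001))  # -Inf
log_diff_exp(-1000, -1001)    # finite
```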
We implement the likelihood of each joint observation of response time and choice with an if-else clause that calls the likelihood of the accumulator that corresponds to the choice selected in the trial `n`, and the complement of the CDF for the accumulator that was not selected:
```{stan, output.var = "none", eval = FALSE, cache = FALSE, cache.lazy = FALSE}
if(nchoice[n] == 1)
target += lognormal_lpdf(T[n] | alpha[1] -
c_lex[n] * beta[1] -
c_lfreq[n] * beta[2] , sigma) +
lognormal_lccdf(T[n] | alpha[2] -
c_lex[n] * beta[3] -
c_lfreq[n] * beta[4], sigma);
else
target += lognormal_lpdf(T[n] | alpha[2] -
c_lex[n] * beta[3] -
c_lfreq[n] * beta[4], sigma) +
lognormal_lccdf(T[n] | alpha[1] -
c_lex[n] * beta[1] -
c_lfreq[n] * beta[2], sigma);
```
The complete Stan code for this model is shown below as `lnrace.stan`:
```{r, echo = FALSE}
lnrace <- system.file("stan_models",
"lnrace.stan",
package = "bcogsci")
```
```{stan output.var = "lnrace_stan", code = readLines(lnrace), tidy = TRUE, comment="", eval = FALSE, cache = FALSE, cache.lazy = FALSE}
```
Store the data in a list and fit the model. Some warnings might appear during the warm-up, but these warnings can be ignored since they no longer appear afterwards, and all the convergence checks look fine (omitted here):
```{r, eval = !file.exists("dataR/fit_blp1_sim.RDS")}
lnrace <- system.file("stan_models",
"lnrace.stan",
package = "bcogsci")
ls_blp1_sim <- list(N = nrow(df_blp1_sim),
rt = df_blp1_sim$rt,
nchoice = df_blp1_sim$nchoice,
c_lex = df_blp1_sim$c_lex,
c_lfreq = df_blp1_sim$c_lfreq)
fit_blp1_sim <- stan(lnrace, data = ls_blp1_sim)
```
```{r, echo = FALSE}
if(!file.exists("dataR/fit_blp1_sim.RDS")){
saveRDS(fit_blp1_sim, file = "dataR/fit_blp1_sim.RDS")
} else {
fit_blp1_sim <- readRDS("dataR/fit_blp1_sim.RDS")
}
```
Print the parameter values:
```{r}
print(fit_blp1_sim, pars = c("alpha", "beta", "T_0", "sigma"))
```
As in previous chapters, `mcmc_recover_hist()` can be used to compare the posterior distributions of the relevant parameters of the model with their true point values (Figure \@ref(fig:recoverlnrace)). Indeed, Figure \@ref(fig:recoverlnrace) shows that the model recovers the parameters reasonably well. First, however, we need to reparameterize the true values, since $D$ cannot be known, and we don't fit $V$, but rather $D/V$, with $V$ log-normally distributed. Then, obtain an estimate of $\alpha'$, rather than $\alpha$, such that $\alpha' = \log(D) - \alpha$.
(ref:recoverlnrace) Posterior distributions of the main parameters of the log-normal race model `fit_blp1_sim` together with their true point values.
```{r recoverlnrace, message = FALSE, fig.cap = "(ref:recoverlnrace)"}
true_values <- c(alphapw = log(D) - alpha_w,
alphapnw = log(D) - alpha_nw,
beta_wlex = beta_wlex,
beta_wlfreq = beta_wlfreq,
beta_nwlex = beta_nwlex,
beta_nwlfreq = beta_nwlfreq,
sigma = sigma,
T_0 = T_0)
estimates <- as.data.frame(fit_blp1_sim) %>%
select(- lp__)
mcmc_recover_hist(estimates, true_values) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
```
Before moving on to a more complex version of this model, it's worth spending some time making the code more modular. Encapsulate the likelihood of the log-normal race model by writing it as a function. The function has four arguments: the decision time `T`, the choice `nchoice` (this will only work with two choices, `1` and `2`), an array of locations `mu` (which again we implicitly assume to have two elements), and a common scale `sigma`.
\Begin{samepage}
```{stan, output.var = "none", eval = FALSE, cache = FALSE, cache.lazy = FALSE}
functions {
real lognormal_race2_lpdf(real T, int nchoice,
array[] real mu, real sigma){
real lpdf;
if(nchoice == 1)
lpdf = lognormal_lpdf(T | mu[1] , sigma) +
lognormal_lccdf(T | mu[2], sigma);
else
lpdf = lognormal_lpdf(T | mu[2], sigma) +
lognormal_lccdf(T | mu[1], sigma);
return lpdf;
}
}
```
\End{samepage}
Next, for each iteration `n` of the original for loop, generate an auxiliary variable `T`, which contains the decision time for the current trial, and `mu`, an array of size two that contains all the parameters that affect the location at each trial. This will allow us to use our new function as follows in the `model` block:^[This for-loop can also be implemented in the `transformed parameters` block; the advantage of doing this is that the log-likelihood of each observation can be used, for example, for cross-validation; the disadvantage is that the R object might be very large, because it will store the log-likelihood during the warm-up period as well.]
```{stan, output.var = "none", eval = FALSE, cache = FALSE, cache.lazy = FALSE}
array[N] real log_lik;
for(n in 1:N){
real T = rt[n] - T_0;
array[2] real mu = {alpha[1] -
c_lex[n] * beta[1] -
c_lfreq[n] * beta[2],
alpha[2] -
c_lex[n] * beta[3] -
c_lfreq[n] * beta[4]};
log_lik[n] = lognormal_race2_lpdf(T | nchoice[n], mu, sigma);
}
```
The variable `log_lik` contains the log-likelihood for each trial. We must not forget to add the total log-likelihood to the `target` variable. This is done simply by `target += sum(log_lik)`.^[There are two advantages of iterating first and then adding the total log-likelihood to `target`: (i) we can use the variable `log_lik` for model comparison with cross-validation (see chapter \@ref(ch-cv)) without needing to repeat code in the generated quantities block, and (ii) using `sum` and adding to `target` once is slightly faster than adding to `target` at each iteration.]
The complete Stan code for this model can be found in the `bcogsci` package as `lnrace_mod.stan`; it is left to the reader to verify that the results are the same as those from the non-modular model `lnrace.stan` fit earlier.
```{r, echo = FALSE}
lnrace_mod <- system.file("stan_models",
"lnrace_mod.stan",
package = "bcogsci")
```
### A hierarchical implementation of the log-normal race model {#sec-lognormalh}
A simple hierarchical version of the previous model assumes that all the parameters $\alpha$ and $\beta$ have by-subject adjustments:
\begin{equation}
\begin{aligned}
\mu_{w,n} &= \alpha'_w + u_{subj[n], 1} - \mathit{lex}_n \cdot (\beta_{lex_{w}} + u_{subj[n], 2}) \\
& - \mathit{lfreq}_n \cdot (\beta_{\mathit{lfreq}_{w}} + u_{subj[n], 3}) \\
\mu_{nw,n} &= \alpha'_{nw} + u_{subj[n], 4} - \mathit{lex}_n \cdot (\beta_{lex_{nw}} + u_{subj[n], 5}) \\
& - \mathit{lfreq}_n \cdot (\beta_{\mathit{lfreq}_{nw}}+ u_{subj[n], 6}) \\
\end{aligned}
\end{equation}
Similarly to the \index{Hierarchical implementation} hierarchical implementation of the fast-guess model in section \@ref(sec-fastguessh), assume that $\boldsymbol{u}$ is a matrix with as many rows as subjects and six columns. Also assume that $u$ follows a \index{Multivariate normal distribution} multivariate normal distribution centered at zero. For lack of more information, we assume the same (weakly informative) prior distribution for the six variance components $\tau_{u_{1}}, \tau_{u_{2}}, \ldots, \tau_{u_{6}}$ with a somewhat smaller effect than we assumed for the prior of $\sigma$. As with previous hierarchical models, we assign a regularizing \index{LKJ prior} LKJ prior for the correlations between the adjustments:^[There are 15 correlations since there are 15 ways to choose 2 variables out of 6 for specifying the pairwise correlations, where order doesn't matter. This is calculated with ${6 \choose 2}$ which is `choose(6, 2)` in `R`.]
\begin{equation}
\begin{aligned}
\boldsymbol{u} &\sim \mathcal{N}(0, \Sigma_u)\\
\tau_{u_{1}}, \tau_{u_{2}}, \ldots, \tau_{u_{6}} & \sim \mathit{Normal}_+(0.1, 0.1)\\
\mathbf{R}_u &\sim \mathit{LKJcorr}(2)
\end{aligned}
\end{equation}
Before fitting the model to the real data, verify that it works with simulated data. To create synthetic data of several subjects, repeat the same generative process used before, and add the by-subject adjustments `u` in the same way as in section \@ref(sec-fastguessh). This version of the log-normal race model assumes that all the parameters $\alpha$ and $\beta$ have by-subject adjustments; that is, six adjustments. To simplify the model, ignore the possibility of an adjustment for the non-decision time $T_0$, but see @nicenboimModelsRetrievalSentence2018 for an implementation of the log-normal race model with a hierarchical non-decision time.
For simplicity, all the adjustments `u` are normally distributed with the same standard deviation of $0.2$, and they have a $0.3$ correlation between pairs of `u`'s; see `tau_u` and `rho` below.
First, set a seed, take a subset of the data set to keep the same structure, set true point values, and auxiliary variables that indicate the number of observations, subjects, etc.
```{r lnsimdatah1}
set.seed(42)
df_blp_sim <- df_blp %>%
group_by(subj) %>%
slice_sample(n = 100) %>%
ungroup()
D <- 1800
alpha_w <- 0.8
beta_wlex <- 0.5
beta_wlfreq <- 0.2
alpha_nw <- 1
beta_nwlex <- -0.5
beta_nwlfreq <- -0.05
sigma <- 0.8
T_0 <- 150
N <- nrow(df_blp_sim)
N_subj <- max(df_blp_sim$subj)
N_adj <- 6
tau_u <- rep(0.2, N_adj)
rho <- 0.3
R_u <- matrix(rep(rho, N_adj * N_adj), nrow = N_adj)
diag(R_u) <- 1
Sigma_u <- diag(tau_u, N_adj, N_adj) %*%
R_u %*% diag(tau_u, N_adj, N_adj)
u <- mvrnorm(n = N_subj, rep(0, N_adj), Sigma_u)
subj <- df_blp_sim$subj
```
Second, generate the locations of both accumulators, `mu_v_w` and `mu_v_nw`, for every trial:
```{r lnsimdatah2}
mu_v_w <- alpha_w + u[subj, 1] +
df_blp_sim$c_lfreq * (beta_wlfreq + u[subj, 2]) +
df_blp_sim$c_lex * (beta_wlex + u[subj, 3])
mu_v_nw <- alpha_nw + u[subj, 4] +
df_blp_sim$c_lfreq * (beta_nwlfreq + u[subj, 5]) +
df_blp_sim$c_lex * (beta_nwlex + u[subj, 6])
```
Third, generate values for the rates of accumulation and use those rates to calculate `T_w` and `T_nw`.
```{r lnsimdatah3}
V_w <- rlnorm(N, mu_v_w, sigma)
V_nw <- rlnorm(N, mu_v_nw, sigma)
T_w <- D / V_w
T_nw <- D / V_nw
```
Fourth, calculate the time it takes to reach a decision and the winner accumulator for each trial.
```{r lnsimdatah4}
T_n <- pmin(T_w, T_nw)
accumulator_winner <- ifelse(T_w == pmin(T_w, T_nw),
"word",
"non-word")
```
Finally, add this information to the data frame.
```{r lnsimdatah5}
df_blp_sim <- df_blp_sim %>%
mutate(rt = T_0 + T_n,
choice = accumulator_winner,
nchoice = ifelse(choice == "word", 1, 2),
acc = ifelse(lex == choice, 1, 0))
```
```{r, echo = FALSE}
lnrace_h <- system.file("stan_models",
"lnrace_h.stan",
package = "bcogsci")
```
The Stan code for this model implements the \index{Non-centered parameterization} non-centered parameterization for \index{Correlated adjustments} correlated adjustments (see section \@ref(sec-corrstan) for more details). The model is shown below as `lnrace_h.stan`:
```{stan output.var = "lnrace_h_stan", code = readLines(lnrace_h), tidy = TRUE, comment="", eval = FALSE, cache = FALSE, cache.lazy = FALSE}
```
Store the simulated data in a list and fit it.
```{r, eval = !file.exists("dataR/fit_blp_h_sim.RDS")}
lnrace_h <- system.file("stan_models",
"lnrace_h.stan",
package = "bcogsci")
ls_blp_h_sim <- list(N = nrow(df_blp_sim),
N_subj = max(df_blp_sim$subj),
subj = df_blp_sim$subj,
rt = df_blp_sim$rt,
nchoice = df_blp_sim$nchoice,
c_lex = df_blp_sim$c_lex,
c_lfreq = df_blp_sim$c_lfreq)
fit_blp_h_sim <- stan(lnrace_h, data = ls_blp_h_sim,
control = list(adapt_delta = 0.99,
max_treedepth = 14))
```
```{r lnracehsim, echo = FALSE, eval = TRUE}
if(!file.exists("dataR/fit_blp_h_sim.RDS")){
saveRDS(ls_blp_h_sim, file = "dataR/ls_blp_h_sim.RDS")
saveRDS(fit_blp_h_sim, file = "dataR/fit_blp_h_sim.RDS")
} else {
fit_blp_h_sim <- readRDS("dataR/fit_blp_h_sim.RDS")
ls_blp_h_sim <- readRDS("dataR/ls_blp_h_sim.RDS")
}
```
The code below compares the posterior distributions of the relevant parameters of the model with their true values, and plots them in Figure \@ref(fig:recoverlnraceh). The true value for all the correlations was $0.3$, but we need to correct the sign depending on whether or not there was a minus sign in front of the adjustment when we built `mu_v_w` and `mu_v_nw`: For example, there is no minus before `u[subj, 1]`, but there is one before `u[subj, 2]`; thus, the true correlation between `u[subj, 1]` and `u[subj, 2]` that we generated should be negative (plus times minus is minus). There is also a minus before `u[subj, 3]`, and thus the correlation between `u[subj, 2]` and `u[subj, 3]` should be positive (minus times minus is positive).
(ref:recoverlnraceh) Posterior distributions of the main parameters of the log-normal race model `fit_blp_h_sim` together with their true values.
```{r, echo = FALSE, results = "hide"}
# Which correlations should change the sign
apply(combn(c(1,-2,-3,4,-5,-6),2), 2, prod) %>%
{ifelse(. >0,1,-1)} %>% dput()
```
```{r recoverlnraceh, message = FALSE, fig.cap = "(ref:recoverlnraceh)", eval = TRUE, fig.height = 6}
R_us <- c(paste0("R_u[1,", 2:6 , "]"),
paste0("R_u[2,", 3:6 , "]"),
paste0("R_u[3,", 4:6 , "]"),
paste0("R_u[4,", 5:6 , "]"),
"R_u[5,6]")
corrs <- rho * c(-1, -1, 1, -1, -1, 1, -1, 1,
1, -1, 1, 1, -1, -1, 1)
true_values <- c(log(D) - alpha_w,
log(D) - alpha_nw,
beta_wlex,
beta_wlfreq,
beta_nwlex,
beta_nwlfreq,
T_0,
sigma,
tau_u,
corrs)
par_names = c("alpha",
"beta",
"T_0",