---
title: 'Model Fitting: Likelihood Maximization'
output:
  html_notebook: default
  html_document:
    df_print: paged
  pdf_document: default
author: Marius 't Hart
---
```{r setup, cache=FALSE, include=FALSE}
library(knitr)
#opts_chunk$set(comment='', eval=FALSE)
```
Some model evaluation criteria can't be computed from a minimized loss function (like MSE), but require the likelihood of the model instead. If you'd like to calculate a [BIC](https://en.wikipedia.org/wiki/Bayesian_information_criterion), [AICc](https://en.wikipedia.org/wiki/Akaike_information_criterion#Modification_for_small_sample_size) or [Hannan-Quinn](https://en.wikipedia.org/wiki/Hannan%E2%80%93Quinn_information_criterion) criterion for your model (and probably others), you need the likelihood of the model. There are other benefits too. The cost is that we need an estimate of the noise in the data.
_Note 1:_ This does not mean that your previous model fits are wrong. The likelihood of the model [can be shown](https://stats.stackexchange.com/questions/16508/calculating-likelihood-from-rmse) to be at its maximum at the same parameters where the error is minimized. It's just that with an MSE-based fit you can only use the AIC for model evaluation.
_Note 2:_ I suppose it might be possible to fit a model using MSE minimization and then calculate the likelihood of the fitted model afterwards.
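Here is a minimal sketch of how that could work, assuming normally distributed residuals (the residual values below are just placeholders):
```{r}
# hypothetical residuals from some MSE-minimizing fit:
residuals <- c(0.2, -0.1, 0.05, -0.3)
# for normal residuals, the maximum-likelihood estimate of the noise is the RMSE:
sigma.hat <- sqrt(mean(residuals^2))
# the log-likelihood of the fitted model, given that noise estimate:
sum(dnorm(residuals, mean=0, sd=sigma.hat, log=TRUE))
```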
# Simple example of likelihood
This is going to be an extremely simple model. We're taking the sepal lengths of the setosa irises from the `iris` dataset:
```{r}
SSL <- iris$Sepal.Length[which(iris$Species == 'setosa')]
str(SSL)
```
Now, these 50 numbers follow a fairly normal distribution, as seen in the histogram and QQ-plot:
```{r}
par(mfrow=c(1,2))
hist(SSL)
qqnorm(SSL, bty='n')
```
That means that the mean and standard deviation give a good description of the distribution:
```{r}
print(c('mean'=mean(SSL), 'sd'=sd(SSL)))
```
Our simple model has no other information than these two numbers. So if we ask it to predict what any new setosa's sepal length will be, the best answer is the mean: 5.006. But now, we can also ask how likely it is to observe this value, given what we know. For this we use the probability density function:
```{r}
plot(seq(3,7,0.001),dnorm(seq(3,7,0.001), mean=mean(SSL), sd=sd(SSL)), type='l',xlab='setosa sepal length', ylab='probability density', bty='n')
```
Notice that the curve exceeds 1. How can there be a probability larger than 1? Because this is not a probability function, but a probability _density_ function. (It's impossible to have an actual probability function here, since the chance of any exact value is infinitely small, best described as 0.) We could convert this to probabilities by slicing the curve into very tiny bins and taking the area of each bin, as the total area under the curve is 1. However, for our purposes we can just use the probability density, as it provides a _relative_ likelihood (within some range) of a particular value occurring as compared to other values.
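We can check that intuition with a small sketch: the density times a small bin width closely approximates the actual probability mass in that bin, which we can compute exactly from the cumulative distribution with `pnorm()`:
```{r}
binwidth <- 0.01
# exact probability of a value landing in a small bin around 5.0:
pnorm(5.0 + binwidth/2, mean=mean(SSL), sd=sd(SSL)) - pnorm(5.0 - binwidth/2, mean=mean(SSL), sd=sd(SSL))
# density at 5.0 times the bin width gives nearly the same number:
dnorm(5.0, mean=mean(SSL), sd=sd(SSL)) * binwidth
```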
Let's look at the first 10 of the 50 observations, so we can check the likelihood of those observations relative to the whole dataset.
```{r}
print(data.frame('SSL'=SSL[1:10], 'likelihood'=dnorm(SSL[1:10], mean=mean(SSL), sd=sd(SSL))))
```
As you can see, the likelihood is highest for values closest to 5.006.
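As an aside: the likelihood of the whole dataset is the product of the individual likelihoods, and since such a product quickly becomes a very small number, the sum of log-likelihoods is often used instead. A minimal illustration:
```{r}
# the product of all 50 individual likelihoods: a very small number
prod(dnorm(SSL, mean=mean(SSL), sd=sd(SSL)))
# the equivalent sum of log-likelihoods stays on a manageable scale:
sum(dnorm(SSL, mean=mean(SSL), sd=sd(SSL), log=TRUE))
```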
## Two simple models
Now, we could make two simple models, one based on MSE minimization and one based on likelihood maximization, to find the best parameter. Both _should_ find the same solution. Let's see.
Here's a function that returns a mean square error:
```{r}
SSL_MSE <- function(par, data) {
return( mean( (data-par)^2, na.rm=TRUE ) )
}
```
And a function that returns the likelihood:
```{r}
SSL_Likelihood <- function(par, data) {
return( dnorm(par, mean=mean(data), sd=sd(data)) )
}
```
We can use `optim()` to find the best parameter estimate (`par`) for both, as `optim()` allows us to set `control$fnscale` to a negative value, which turns minimization into maximization. A similar option is available for `optimx()`.
```{r}
control <- list()
control$fnscale <- -1
```
Here we fit it with the MSE:
```{r}
optim(par=4, fn=SSL_MSE, data=SSL, method='Brent', lower=0, upper=10)
```
And for the Likelihood:
```{r}
print(Lopt <- optim(par=4, fn=SSL_Likelihood, data=SSL, method='Brent', lower=0, upper=10, control=control))
```
## Evaluating the likelihood maximization model
So both methods find the exact same parameter value. But with the likelihood we can now also calculate the AIC, BIC and Hannan-Quinn criteria:
```{r}
L <- Lopt$value
k <- 1
N <- 50
#-- AIC --#
AIC <- (2 * k) - (2 * log(L))
#-- BIC --#
BIC <- log(N)*k - (2 * log(L)) # L based
#-- Hannan-Quinn --#
HQC <- (-2 * log(L)) + (2 * k * log(log(N))) # L based (uses the log-likelihood, like AIC and BIC)
print(c('AIC'=AIC, 'BIC'=BIC,'Hannan-Quinn'=HQC))
```
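If you had a second model fitted to the same data, AICs could be compared through their relative likelihood, exp((AIC1 - AIC2)/2). A hypothetical sketch, where `AIC2` is just a placeholder value for a competing model:
```{r}
AIC2 <- AIC + 3  # placeholder AIC for a hypothetical competing model
# relative likelihood of the competing model, given the current one:
exp((AIC - AIC2)/2)
```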
Of course, you could debate whether those 50 observations are really independent.
## Take away
You should now have some sense of what a likelihood is and how you can maximize the likelihood of a model given the data.
# Two-Rate Model
And we will now implement this for a somewhat harder problem: the two-rate model. Here, there is a distribution for every trial, so we can optimize the likelihood if, in addition to the mean reach deviation on each trial, we also have the standard deviation of the reach deviations on each trial.
Let's set up that part of the solution:
```{r}
load('data/tworatedata.rda')
baseline <- function(reachvector,blidx) reachvector - mean(reachvector[blidx], na.rm=TRUE)
tworatedata[,4:ncol(tworatedata)] <- apply(tworatedata[,4:ncol(tworatedata)], FUN=baseline, MARGIN=c(2), blidx=c(17:32))
mu <- apply(tworatedata[4:ncol(tworatedata)],
FUN=mean,
MARGIN=c(1),
na.rm=TRUE)
# This is for individual likelihoods in each trial:
# sigma <- apply(tworatedata[4:ncol(tworatedata)]-mu,
# FUN=sd,
# MARGIN=c(1),
# na.rm=TRUE )
# This holds when the distribution is equally wide on every trial:
# (and should be more comparable to MSE)
sigma <- sd(as.matrix(tworatedata[4:ncol(tworatedata)] - mu), na.rm=T )
schedule <- tworatedata$schedule
reaches <- list( 'mean' = mu,
'sd' = sigma )
```
Let's see what the list `reaches` looks like:
```{r}
str(reaches)
```
Here I keep the sd as one single value, for two reasons. First, if you fit a whole group of participants, you do have an estimate of the standard deviation on every trial, and the functions below can handle that. However, the fit would then differ (probably not much) from the MSE fit, as it would effectively weight some errors more than others. For this tutorial, where we want to compare the MSE-based fit and the likelihood-based fit, we should stick to a single noise estimate. Second, if you fit the model to an individual participant's data, you don't get a separate noise estimate per trial (unless you measure them many times), so in that case we should determine the overall noise of that participant and use that to describe the data.
This determines a 2-dimensional likelihood "landscape", and our model fit will have to stay on the top ridge of that landscape. Here's a visualization of that landscape:
```{r}
trials <- 164
reachdevs <- seq(-45,45,0.5)
likelihoods <- matrix(data=NA, nrow=trials, ncol=length(reachdevs))
for (trial in c(1:length(reaches$mean))) {
likelihoods[trial,] <- dnorm(reachdevs, mean=reaches$mean[trial], sd=reaches$sd)
}
# this bit copied from stack overflow:
# https://stackoverflow.com/questions/29786380/standardizing-color-scale-persp-plots-in-panel-r
nrz <- trials
ncz <- length(reachdevs)
# Create a function interpolating colors in the range of specified colors
jet.colors <- colorRampPalette( c("white", "purple") )
# Generate the desired number of colors from this palette
nbcol <- 100
color <- jet.colors(nbcol)
# Compute the z-value at the facet centres
zfacet <- likelihoods[-1, -1] + likelihoods[-1, -ncz] + likelihoods[-nrz, -1] + likelihoods[-nrz, -ncz]
# Recode facet z-values into color indices
facetcol <- cut(zfacet, nbcol)
persp(c(1:trials), reachdevs, likelihoods*300, phi = 50, theta = 0,
xlab = "trial", ylab = "reach deviation", zlab='likelihood',
main = "likelihood 'landscape' for two-rate model", col=color[facetcol], shade=0.2, border=NA,
scale = FALSE
)
```
Our model function will still be the same:
```{r}
twoRateModel <- function(par, schedule) {
# these values should be zero at the start of the loop:
Et <- 0 # previous error: none
St <- 0 # state of the slow process: aligned
Ft <- 0 # state of the fast process: aligned
# we'll store what happens on each trial in these vectors:
slow <- c()
fast <- c()
total <- c()
# now we loop through the perturbations in the schedule:
for (t in c(1:length(schedule))) {
# first we calculate what the model does
# this happens before we get visual feedback about potential errors
St <- (par['Rs'] * St) - (par['Ls'] * Et)
Ft <- (par['Rf'] * Ft) - (par['Lf'] * Et)
Xt <- St + Ft
# now we calculate what the previous error will be for the next trial:
if (is.na(schedule[t])) {
Et <- 0
} else {
Et <- Xt + schedule[t]
}
# at this point we save the states in our vectors:
slow <- c(slow, St)
fast <- c(fast, Ft)
total <- c(total, Xt)
}
# after we loop through all trials, we return the model output:
return(data.frame(slow,fast,total))
}
```
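As a quick sanity check, we can run the model function once with some plausible, hypothetical parameter values and look at the structure of its output:
```{r}
# hypothetical parameter values, just to see the model run:
example_par <- c('Ls'=0.05, 'Lf'=0.2, 'Rs'=0.99, 'Rf'=0.7)
str(twoRateModel(par=example_par, schedule=schedule))
```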
But the criterion function will be different. It should now return a likelihood, which is done in the last line of code, and that is a somewhat complicated line. It can be done in a single line because `dnorm()` takes vectorized input for any of its arguments (if two or more arguments are vectors, make sure they are of equal length, otherwise R will recycle the shorter ones, which is probably not what you want).
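A tiny demonstration of that vectorized behavior:
```{r}
# each value is evaluated with its own mean and sd:
dnorm(c(0, 1, 2), mean=c(0, 0, 0), sd=c(1, 2, 3))
```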
Instead of a bigError in the case where the parameters violate constraints, we now return a lowLikelihood, and this is actually easier to choose: the lowest possible likelihood is zero.
Also, we now add some other criteria that return the lowLikelihood. These ensure that the returned model is "stable".
Since most (all?) likelihoods we will get are below 1, taking the product of many of them results in a very small number. The differences between the total likelihoods of the whole series for different parameter values will soon become so small that the algorithm considers them insignificant and stops, while a least-squares algorithm would continue optimizing. That's why, in the last line, I also raise each likelihood to the power of one over the length of the sequence of likelihoods. This gets the total likelihood (the geometric mean of the per-trial likelihoods) back onto a scale of approximately 0 to 1, instead of between two very small numbers.
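Here's a toy demonstration of that underflow problem and of the geometric-mean rescaling (the per-trial likelihoods are made up):
```{r}
L_toy <- runif(164, min=0.1, max=0.9)  # hypothetical per-trial likelihoods
prod(L_toy)                    # vanishingly small
prod(L_toy^(1/length(L_toy)))  # geometric mean: back between 0 and 1
```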
```{r}
twoRateLikelihood <- function(par, schedule, reaches, checkStability=FALSE) {
lowLikelihood <- 0
Rf = par["Rf"];
Rs = par["Rs"];
Lf = par["Lf"];
Ls = par["Ls"];
# learning and retention rates of the fast and slow process are constrained:
if (Ls > Lf) {
return(lowLikelihood)
}
if (Rs < Rf) {
return(lowLikelihood)
}
# stability constraints:
if (checkStability) {
if ( ( ( (Rf - Lf) * (Rs - Ls) ) - (Lf * Ls)) <= 0 ) {
return(lowLikelihood)
}
p <- Rf - Lf - Rs + Ls
q <- ( p^2 + (4 * Lf * Ls) )
if ( ((Rf - Lf + Rs - Ls) + (q^0.5)) >= 2 ) {
return(lowLikelihood)
}
}
# the following line works both if sd is a single value or a vector with the same length as reaches$mean:
return( prod( dnorm((twoRateModel(par, schedule)$total - reaches$mean), mean=0, sd=reaches$sd)^(1/length(reaches$mean)) ) )
}
```
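An alternative that is not used in this tutorial would be to sum log-likelihoods, which avoids underflow without any rescaling. A hypothetical sketch, with the parameter constraint checks from above omitted for brevity:
```{r}
twoRateLogLikelihood <- function(par, schedule, reaches) {
  residuals <- twoRateModel(par, schedule)$total - reaches$mean
  # summing log-likelihoods is numerically safer than multiplying likelihoods:
  return( sum( dnorm(residuals, mean=0, sd=reaches$sd, log=TRUE) ) )
}
```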
And now we can fit the model by maximizing its likelihood, here with `optimx()`:
```{r}
nvals <- 5
parvals <- seq(1/nvals/2,1-(1/nvals/2),1/nvals)
searchgrid <- expand.grid('Ls'=parvals,
'Lf'=parvals,
'Rs'=parvals,
'Rf'=parvals)
# evaluate starting positions:
Likelihoods <- apply(searchgrid, FUN=twoRateLikelihood, MARGIN=c(1), schedule=schedule, reaches=reaches, checkStability=TRUE)
# we need to tell optimx to maximize the likelihood:
library(optimx)
control <- list()
control$maximize <- TRUE
#control$fnscale <- -1 # do not use together with maximize
# run optimx on the best starting positions:
allfits <- do.call("rbind",
apply( searchgrid[order(Likelihoods, decreasing = T)[1:10],],
MARGIN=c(1),
FUN=optimx,
fn=twoRateLikelihood,
method='L-BFGS-B',
lower=c(0,0,0,0),
upper=c(1,1,1,1),
schedule=schedule,
reaches=reaches,
checkStability=TRUE,
control=control ) )
# pick the best fit:
win <- allfits[order(allfits$value, decreasing = T)[1],]
print(par <- unlist(win[,1:4]))
```
Let's see the quality of the fit, compared to that of the MSE solution from before. They should be equivalent.
```{r}
# these parameters are from the fit in the tutorial on independent observations
# which has the "best" two-rate model function so far... but based on MSE
MSEpar <- c('Ls'=0.07297697,
'Lf'=0.42161167,
'Rs'=0.99856856,
'Rf'=0.68608967)
MSEmodel <- twoRateModel(par=MSEpar, schedule=schedule)
# this is the model just fitted above:
model <- twoRateModel(par=par, schedule=schedule)
plot(reaches$mean,type='l',col='#333333',xlab='trial',ylab='reach deviation [deg]',xlim=c(0,165),ylim=c(-35,35),bty='n',ax=F)
lines(c(1,33,33,133,133,145,145),c(0,0,30,30,-30,-30,0),col='#AAAAAA')
lines(c(145,164),c(0,0),col='#AAAAAA',lty=2)
lines(model$slow,col='blue')
lines(model$fast,col='red')
lines(model$total,col='purple')
lines(MSEmodel$total,col='orange',lty=2)
axis(1,c(1,32,132,144,164),las=2)
axis(2,c(-30,-15,0,15,30))
```
There is a purple line going through the data, based on the likelihood solution (with the red and blue lines indicating the fast and slow process respectively), and there is a dashed orange line showing the model output for the MSE-based solution from before. I can't spot any difference between the purple and orange lines, so they're probably the same. Can we quantify this?
```{r}
MSE_likelihood <- twoRateLikelihood(par=MSEpar, schedule=schedule, reaches=reaches, checkStability = F)
LIK_likelihood <- twoRateLikelihood(par=par, schedule=schedule, reaches=reaches, checkStability = F)
print(likelihoods <- c('MSE'=MSE_likelihood, 'L'=LIK_likelihood))
```
The likelihoods differ in the 8th decimal place, which I'm sure we can ignore. You can further verify that the solutions are the same by checking whether the MSE differs between the likelihood-based and MSE-based solutions.
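Here's a quick way to do that check, using the two model outputs computed above:
```{r}
# MSE for the likelihood-based and the MSE-based solutions:
print(c('L-based'   = mean( (model$total    - reaches$mean)^2, na.rm=TRUE ),
        'MSE-based' = mean( (MSEmodel$total - reaches$mean)^2, na.rm=TRUE )))
```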
The parameters are highly similar, the likelihoods are highly similar, and perhaps you checked that the MSE is also highly similar. This means that fitting models with likelihood maximization gives you the same solutions as fitting models with MSE. The advantage of MSE is that it is in the unit the data was recorded in, and the advantage of the likelihood is that it can be used for other computations down the line. It will help us get closer to expectation maximization, and it is probably useful if we want our models to include noise terms.
## Noise as part of the model
If we want to fit individual participants, we don't have an estimate of the standard deviation for every trial. But if we assume that the noise differs between separate parts of the paradigm, we could make the noise terms parameters of the model, and ask how likely the participant's data is given the model, instead of how likely the model is given the data. Maybe this is a topic for a future tutorial.
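A hypothetical sketch of what that could look like: add the noise as a free parameter `sigma` and evaluate the likelihood of the data under the model (the constraint checks from `twoRateLikelihood()` are omitted for brevity):
```{r}
twoRateNoiseLikelihood <- function(par, schedule, reaches) {
  if (par['sigma'] <= 0) return(0)  # the noise sd must be positive
  residuals <- twoRateModel(par, schedule)$total - reaches$mean
  # geometric mean of per-trial likelihoods, with the sd as a fitted parameter:
  return( prod( dnorm(residuals, mean=0, sd=par['sigma'])^(1/length(residuals)) ) )
}
```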
# Take away
We can define likelihood functions and fit models by maximizing their likelihood. This is perhaps slightly harder than fitting models by minimizing errors. However, it should result in the same solution (if the approach is otherwise the same, as for the two-rate model), and with the likelihoods of models we can then extend the model evaluation metrics from only the AIC to also the BIC, Hannan-Quinn, and perhaps others.