---
title: 'Model Fitting: Likelihood Maximization'
output:
  html_notebook: default
  html_document:
    df_print: paged
  pdf_document: default
author: Marius 't Hart
---
```{r setup, cache=FALSE, include=FALSE}
library(knitr)
#opts_chunk$set(comment='', eval=FALSE)
```
Some model evaluation criteria can't be computed from a minimized loss function (like MSE), but require the likelihood of the model instead. If you'd like to calculate a [BIC](https://en.wikipedia.org/wiki/Bayesian_information_criterion), [AICc](https://en.wikipedia.org/wiki/Akaike_information_criterion#Modification_for_small_sample_size) or [Hannan-Quinn](https://en.wikipedia.org/wiki/Hannan%E2%80%93Quinn_information_criterion) criterion for your model (and probably others), you need the likelihood of the model. There are other benefits too. The cost is that we need an estimate of the noise in the data.
_Note 1:_ This does not mean that your previous model fits are wrong. The likelihood of the model [can be shown](https://stats.stackexchange.com/questions/16508/calculating-likelihood-from-rmse) to be at its maximum at the same parameters where the error is minimized. It's just that with an MSE-based fit you can only use the AIC for model evaluation.
_Note 2:_ I suppose it might be possible to fit a model using MSE minimization and then calculate the likelihood of the fitted model afterwards.
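Here is a minimal sketch of how that could work, assuming normally distributed residuals (the residual values below are just placeholders):
```{r}
# hypothetical residuals from some MSE-minimizing fit:
residuals <- c(0.2, -0.1, 0.05, -0.3)
# for normal residuals, the maximum-likelihood estimate of the noise is the RMSE:
sigma.hat <- sqrt(mean(residuals^2))
# the log-likelihood of the fitted model, given that noise estimate:
sum(dnorm(residuals, mean=0, sd=sigma.hat, log=TRUE))
```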
# Simple example of likelihood
This is going to be an extremely simple model. We're taking the sepal lengths of the setosa irises from the `iris` dataset:
```{r}
SSL <- iris$Sepal.Length[which(iris$Species == 'setosa')]
str(SSL)
```
Now, these 50 numbers follow a fairly normal distribution, as seen in the histogram and QQ-plot:
```{r}
par(mfrow=c(1,2))
hist(SSL)
qqnorm(SSL, bty='n')
```
That means that the mean and standard deviation give a good description of the distribution:
```{r}
print(c('mean'=mean(SSL), 'sd'=sd(SSL)))
```
Our simple model has no other information than these two numbers. So if we ask it to predict what any new setosa's sepal length will be, the best answer is the mean: 5.006. But now, we can also ask how likely it is to observe this value, given what we know. For this we use the probability density function:
```{r}
plot(seq(3,7,0.001),dnorm(seq(3,7,0.001), mean=mean(SSL), sd=sd(SSL)), type='l',xlab='setosa sepal length', ylab='probability density', bty='n')
```
Notice that the curve exceeds 1. How can there be a probability larger than 1? Because this is not a probability function, but a probability _density_ function. (It's impossible to have an actual probability function here, since the chance of any exact value is infinitely small, best described as 0.) We could convert this to probabilities by slicing the curve into very tiny bins and taking the area of each bin, as the total area under the curve is 1. However, for our purposes we can just use the probability density, as it provides a _relative_ likelihood (within some range) of a particular value occurring as compared to other values.
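We can check that intuition with a small sketch: the density times a small bin width closely approximates the actual probability mass in that bin, which we can compute exactly from the cumulative distribution with `pnorm()`:
```{r}
binwidth <- 0.01
# exact probability of a value landing in a small bin around 5.0:
pnorm(5.0 + binwidth/2, mean=mean(SSL), sd=sd(SSL)) - pnorm(5.0 - binwidth/2, mean=mean(SSL), sd=sd(SSL))
# density at 5.0 times the bin width gives nearly the same number:
dnorm(5.0, mean=mean(SSL), sd=sd(SSL)) * binwidth
```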
Let's look at the first 10 of the 50 observations, so we can check the likelihood of those observations relative to the whole dataset.
```{r}
print(data.frame('SSL'=SSL[1:10], 'likelihood'=dnorm(SSL[1:10], mean=mean(SSL), sd=sd(SSL))))
```
As you can see, the likelihood is highest for values closest to 5.006.
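As an aside: the likelihood of the whole dataset is the product of the individual likelihoods, and since such a product quickly becomes a very small number, the sum of log-likelihoods is often used instead. A minimal illustration:
```{r}
# the product of all 50 individual likelihoods: a very small number
prod(dnorm(SSL, mean=mean(SSL), sd=sd(SSL)))
# the equivalent sum of log-likelihoods stays on a manageable scale:
sum(dnorm(SSL, mean=mean(SSL), sd=sd(SSL), log=TRUE))
```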
## Two simple models
Now, we could make two simple models, one based on MSE minimization and one based on likelihood maximization, to find the best parameter. Both _should_ find the same solution. Let's see.
Here's a function that returns a mean square error:
```{r}
SSL_MSE <- function(par, data) {
return( mean( (data-par)^2, na.rm=TRUE ) )
}
```
And a function that returns the likelihood:
```{r}
SSL_Likelihood <- function(par, data) {
return( dnorm(par, mean=mean(data), sd=sd(data)) )
}
```
We can use `optim()` to find the best parameter estimate (`par`) for both, as `optim()` allows us to set `control$fnscale` to a negative value, which turns minimization into maximization. A similar option is available for `optimx()`.
```{r}
control <- list()
control$fnscale <- -1
```
Here we fit it with the MSE:
```{r}
optim(par=4, fn=SSL_MSE, data=SSL, method='Brent', lower=0, upper=10)
```
And for the Likelihood:
```{r}
print(Lopt <- optim(par=4, fn=SSL_Likelihood, data=SSL, method='Brent', lower=0, upper=10, control=control))
```
## Evaluating the likelihood maximization model
So both methods find the exact same parameter value. But with the likelihood we can now also calculate the AIC, BIC and Hannan-Quinn criteria:
```{r}
L <- Lopt$value
k <- 1
N <- 50
#-- AIC --#
AIC <- (2 * k) - (2 * log(L))
#-- BIC --#
BIC <- log(N)*k - (2 * log(L)) # L based
#-- Hannan-Quinn --#
HQC <- (-2 * log(L)) + (2 * k * log(log(N))) # L based (uses the log-likelihood, like AIC and BIC)
print(c('AIC'=AIC, 'BIC'=BIC,'Hannan-Quinn'=HQC))
```
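If you had a second model fitted to the same data, AICs could be compared through their relative likelihood, exp((AIC1 - AIC2)/2). A hypothetical sketch, where `AIC2` is just a placeholder value for a competing model:
```{r}
AIC2 <- AIC + 3  # placeholder AIC for a hypothetical competing model
# relative likelihood of the competing model, given the current one:
exp((AIC - AIC2)/2)
```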
Of course, you could debate whether those 50 observations are really independent.
## Take away
You should now have some sense of what a likelihood is and how you can maximize the likelihood of a model given the data.
# Two-Rate Model
And we will now implement this for a somewhat harder problem: the two-rate model. Here, there is a distribution for every trial, so we can optimize the likelihood if, in addition to the mean reach deviation on each trial, we also have the standard deviation of the reach deviations on each trial.
Let's set up that part of the solution:
```{r}
load('data/tworatedata.rda')
baseline <- function(reachvector,blidx) reachvector - mean(reachvector[blidx], na.rm=TRUE)
tworatedata[,4:ncol(tworatedata)] <- apply(tworatedata[,4:ncol(tworatedata)], FUN=baseline, MARGIN=c(2), blidx=c(17:32))
mu <- apply(tworatedata[4:ncol(tworatedata)],
FUN=mean,
MARGIN=c(1),
na.rm=TRUE)
# This is for individual likelihoods in each trial:
# sigma <- apply(tworatedata[4:ncol(tworatedata)]-mu,
# FUN=sd,
# MARGIN=c(1),
# na.rm=TRUE )
# This holds when the distribution is equally wide on every trial:
# (and should be more comparable to MSE)
sigma <- sd(as.matrix(tworatedata[4:ncol(tworatedata)] - mu), na.rm=T )
schedule <- tworatedata$schedule
reaches <- list( 'mean' = mu,
'sd' = sigma )
```
Let's see what the list `reaches` looks like:
```{r}
str(reaches)
```
Here I keep the sd as one single value, for two reasons. First, if you fit a whole group of participants, you do have an estimate of the standard deviation on every trial, and the functions below can handle that. However, the fit would then differ (probably not much) from the MSE fit, as it would effectively weight some errors more than others. For this tutorial, where we want to compare the MSE-based fit and the likelihood-based fit, we should stick to a single noise estimate. Second, if you fit the model to an individual participant's data, you don't get a separate noise estimate per trial (unless you measure them many times), so in that case we should determine the overall noise of that participant and use that to describe the data.
This determines a 2-dimensional likelihood "landscape", and our model fit will have to stay on the top ridge of that landscape. Here's a visualization of that landscape:
```{r}
trials <- 164
reachdevs <- seq(-45,45,0.5)
likelihoods <- matrix(data=NA, nrow=trials, ncol=length(reachdevs))
for (trial in c(1:length(reaches$mean))) {
likelihoods[trial,] <- dnorm(reachdevs, mean=reaches$mean[trial], sd=reaches$sd)
}
# this bit copied from stack overflow:
# https://stackoverflow.com/questions/29786380/standardizing-color-scale-persp-plots-in-panel-r
nrz <- trials
ncz <- length(reachdevs)
# Create a function interpolating colors in the range of specified colors
jet.colors <- colorRampPalette( c("white", "purple") )
# Generate the desired number of colors from this palette
nbcol <- 100
color <- jet.colors(nbcol)
# Compute the z-value at the facet centres
zfacet <- likelihoods[-1, -1] + likelihoods[-1, -ncz] + likelihoods[-nrz, -1] + likelihoods[-nrz, -ncz]
# Recode facet z-values into color indices
facetcol <- cut(zfacet, nbcol)
persp(c(1:trials), reachdevs, likelihoods*300, phi = 50, theta = 0,
xlab = "trial", ylab = "reach deviation", zlab='likelihood',
main = "likelihood 'landscape' for two-rate model", col=color[facetcol], shade=0.2, border=NA,
scale = FALSE
)
```
Our model function will still be the same:
```{r}
twoRateModel <- function(par, schedule) {
# these values should be zero at the start of the loop:
Et <- 0 # previous error: none
St <- 0 # state of the slow process: aligned
Ft <- 0 # state of the fast process: aligned
# we'll store what happens on each trial in these vectors:
slow <- c()
fast <- c()
total <- c()
# now we loop through the perturbations in the schedule:
for (t in c(1:length(schedule))) {
# first we calculate what the model does
# this happens before we get visual feedback about potential errors
St <- (par['Rs'] * St) - (par['Ls'] * Et)
Ft <- (par['Rf'] * Ft) - (par['Lf'] * Et)
Xt <- St + Ft
# now we calculate what the previous error will be for the next trial:
if (is.na(schedule[t])) {
Et <- 0
} else {
Et <- Xt + schedule[t]
}
# at this point we save the states in our vectors:
slow <- c(slow, St)
fast <- c(fast, Ft)
total <- c(total, Xt)
}
# after we loop through all trials, we return the model output:
return(data.frame(slow,fast,total))
}
```
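As a quick sanity check, we can run the model function once with some plausible, hypothetical parameter values and look at the structure of its output:
```{r}
# hypothetical parameter values, just to see the model run:
example_par <- c('Ls'=0.05, 'Lf'=0.2, 'Rs'=0.99, 'Rf'=0.7)
str(twoRateModel(par=example_par, schedule=schedule))
```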
But the criterion function will be different. It should now return a likelihood, which is done in the last line of code, and that is a somewhat complicated line. It can be done in a single line because `dnorm()` takes vectorized input for any of its arguments (if two or more arguments are vectors, make sure they are of equal length, otherwise R will recycle the shorter ones, which is probably not what you want).
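A tiny demonstration of that vectorized behavior:
```{r}
# each value is evaluated with its own mean and sd:
dnorm(c(0, 1, 2), mean=c(0, 0, 0), sd=c(1, 2, 3))
```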
Instead of a bigError in the case where the parameters violate constraints, we now return a lowLikelihood, and this is actually easier to choose: the lowest possible likelihood is zero.
Also, we now add some other criteria that return the lowLikelihood. These ensure that the returned model is "stable".
Since most (all?) likelihoods we will get are below 1, taking the product of many of them results in a very small number. The differences between the total likelihoods of the whole series for different parameter values will soon become so small that the algorithm considers them insignificant and stops, while a least-squares algorithm would continue optimizing. That's why, in the last line, I also raise each likelihood to the power of one over the length of the sequence of likelihoods. This gets the total likelihood (the geometric mean of the per-trial likelihoods) back onto a scale of approximately 0 to 1, instead of between two very small numbers.
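Here's a toy demonstration of that underflow problem and of the geometric-mean rescaling (the per-trial likelihoods are made up):
```{r}
L_toy <- runif(164, min=0.1, max=0.9)  # hypothetical per-trial likelihoods
prod(L_toy)                    # vanishingly small
prod(L_toy^(1/length(L_toy)))  # geometric mean: back between 0 and 1
```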
```{r}
twoRateLikelihood <- function(par, schedule, reaches, checkStability=FALSE) {
lowLikelihood <- 0
Rf = par["Rf"];
Rs = par["Rs"];
Lf = par["Lf"];
Ls = par["Ls"];
# learning and retention rates of the fast and slow process are constrained:
if (Ls > Lf) {
return(lowLikelihood)
}
if (Rs < Rf) {
return(lowLikelihood)
}
# stability constraints:
if (checkStability) {
if ( ( ( (Rf - Lf) * (Rs - Ls) ) - (Lf * Ls)) <= 0 ) {
return(lowLikelihood)
}
p <- Rf - Lf - Rs + Ls
q <- ( p^2 + (4 * Lf * Ls) )
if ( ((Rf - Lf + Rs - Ls) + (q^0.5)) >= 2 ) {
return(lowLikelihood)
}
}
# the following line works both if sd is a single value or a vector with the same length as reaches$mean:
return( prod( dnorm((twoRateModel(par, schedule)$total - reaches$mean), mean=0, sd=reaches$sd)^(1/length(reaches$mean)) ) )
}
```
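An alternative that is not used in this tutorial would be to sum log-likelihoods, which avoids underflow without any rescaling. A hypothetical sketch, with the parameter constraint checks from above omitted for brevity:
```{r}
twoRateLogLikelihood <- function(par, schedule, reaches) {
  residuals <- twoRateModel(par, schedule)$total - reaches$mean
  # summing log-likelihoods is numerically safer than multiplying likelihoods:
  return( sum( dnorm(residuals, mean=0, sd=reaches$sd, log=TRUE) ) )
}
```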
And now we can fit the model by maximizing its likelihood, here with `optimx()`:
```{r}
nvals <- 5
parvals <- seq(1/nvals/2,1-(1/nvals/2),1/nvals)
searchgrid <- expand.grid('Ls'=parvals,
'Lf'=parvals,
'Rs'=parvals,
'Rf'=parvals)
# evaluate starting positions:
Likelihoods <- apply(searchgrid, FUN=twoRateLikelihood, MARGIN=c(1), schedule=schedule, reaches=reaches, checkStability=TRUE)
# we need to tell optimx to maximize the likelihood:
library(optimx)
control <- list()
control$maximize <- TRUE
#control$fnscale <- -1 # do not use together with maximize
# run optimx on the best starting positions:
allfits <- do.call("rbind",
apply( searchgrid[order(Likelihoods, decreasing = T)[1:10],],
MARGIN=c(1),
FUN=optimx,
fn=twoRateLikelihood,
method='L-BFGS-B',
lower=c(0,0,0,0),
upper=c(1,1,1,1),
schedule=schedule,
reaches=reaches,
checkStability=TRUE,
control=control ) )
# pick the best fit:
win <- allfits[order(allfits$value, decreasing = T)[1],]
print(par <- unlist(win[,1:4]))
```
Let's see the quality of the fit, compared to that of the MSE solution from before. They should be equivalent.
```{r}
# these parameters are from the fit in the tutorial on independent observations
# which has the "best" two-rate model function so far... but based on MSE
MSEpar <- c('Ls'=0.07297697,
'Lf'=0.42161167,
'Rs'=0.99856856,
'Rf'=0.68608967)
MSEmodel <- twoRateModel(par=MSEpar, schedule=schedule)
# this is the model just fitted above:
model <- twoRateModel(par=par, schedule=schedule)
plot(reaches$mean,type='l',col='#333333',xlab='trial',ylab='reach deviation [deg]',xlim=c(0,165),ylim=c(-35,35),bty='n',ax=F)
lines(c(1,33,33,133,133,145,145),c(0,0,30,30,-30,-30,0),col='#AAAAAA')
lines(c(145,164),c(0,0),col='#AAAAAA',lty=2)
lines(model$slow,col='blue')
lines(model$fast,col='red')
lines(model$total,col='purple')
lines(MSEmodel$total,col='orange',lty=2)
axis(1,c(1,32,132,144,164),las=2)
axis(2,c(-30,-15,0,15,30))
```
There is a purple line going through the data, based on the likelihood solution (with the red and blue lines indicating the fast and slow process respectively), and there is a dashed orange line showing the model output for the MSE-based solution from before. I can't spot any difference between the purple and orange lines, so they're probably the same. Can we quantify this?
```{r}
MSE_likelihood <- twoRateLikelihood(par=MSEpar, schedule=schedule, reaches=reaches, checkStability = F)
LIK_likelihood <- twoRateLikelihood(par=par, schedule=schedule, reaches=reaches, checkStability = F)
print(likelihoods <- c('MSE'=MSE_likelihood, 'L'=LIK_likelihood))
```
The likelihoods differ in the 8th decimal place, which I'm sure we can ignore. You can further verify that the solutions are the same by checking whether the MSE differs between the likelihood-based and MSE-based solutions.
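Here's a quick way to do that check, using the two model outputs computed above:
```{r}
# MSE for the likelihood-based and the MSE-based solutions:
print(c('L-based'   = mean( (model$total    - reaches$mean)^2, na.rm=TRUE ),
        'MSE-based' = mean( (MSEmodel$total - reaches$mean)^2, na.rm=TRUE )))
```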
The parameters are highly similar, the likelihoods are highly similar, and perhaps you checked that the MSE is also highly similar. This means that fitting models with likelihood maximization gives you the same solutions as fitting models with MSE. The advantage of MSE is that it is in the unit the data was recorded in, and the advantage of the likelihood is that it can be used for other computations down the line. It will help us get closer to expectation maximization, and it is probably useful if we want our models to include noise terms.
## Noise as part of the model
If we want to fit individual participants, we don't have an estimate of the standard deviation for every trial. But if we assume that the noise differs between separate parts of the paradigm, we could make the noise terms parameters of the model, and ask how likely the participant's data is given the model, instead of how likely the model is given the data. Maybe this is a topic for a future tutorial.
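A hypothetical sketch of what that could look like: add the noise as a free parameter `sigma` and evaluate the likelihood of the data under the model (the constraint checks from `twoRateLikelihood()` are omitted for brevity):
```{r}
twoRateNoiseLikelihood <- function(par, schedule, reaches) {
  if (par['sigma'] <= 0) return(0)  # the noise sd must be positive
  residuals <- twoRateModel(par, schedule)$total - reaches$mean
  # geometric mean of per-trial likelihoods, with the sd as a fitted parameter:
  return( prod( dnorm(residuals, mean=0, sd=par['sigma'])^(1/length(residuals)) ) )
}
```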
# Take away
We can define likelihood functions and fit models by maximizing their likelihood. This is perhaps slightly harder than fitting models by minimizing errors. However, it should result in the same solution (if the approach is otherwise the same, as for the two-rate model), and with the likelihoods of models we can then extend the model evaluation metrics from only the AIC to also the BIC, Hannan-Quinn, and perhaps others.