transfer-learning.Rmd

---
title: "Transfer learning"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

In this notebook we use pre-trained neural networks as a starting point for our own models. The rationale behind this tutorial is as follows: if we construct a neural network that can recognise images of cars, would it be possible for the network (with some extra training) to recognise trucks?

Other people have trained enormous general-purpose CNNs for image classification. We can exploit the features that these CNNs have found for our own image classification problem. (i.e. transfer the knowledge/features which have been acquired by training networks on large datasets, and applying the model along with it's weights to a new dataset).

Training networks (even more so for very deep networks) on large datasets (hundreds of thousands of examples) can take very long (days/weeks). So it doesn't always make sense to re-train from sratch.

Typically, transfer learning works by loading a pre-trained network and removing the final layer (which predicts class membership), since this will be particular to the problem used to train the original network. For example, the original network might have 10 classes, and our new problem might have 15 - so this is why we would need to remove the last layer. We then add on our own final layer, which contains our own output nodes (based on the number of classes, or in certain cases just a single output). We then retrain the network (either just the final layer, or the whole network).

The following image summarises the principle behind transfer learning in neural networks.

![](img/transfer_learning_setup.png)


```{r}
library(keras)
```

Say where the data is...


```{r}
train_directory <- "data/invasives/sample/train/"
validation_directory <- "data/invasives/sample/validation/"
test_directory <- "data/invasives/sample/test/"

# once you are satisfied the code is working, run full dataset
# train_directory <- "data/invasives/train/"
# validation_directory <- "data/invasives/validation/"
# test_directory <- "data/invasives/test/"
```

And work out how many images we have.


```{r}
train_samples <- length(list.files(paste(train_directory,"invasive",sep=""))) +
    length(list.files(paste(train_directory,"non_invasive",sep="")))

validation_samples <- length(list.files(paste(validation_directory,"invasive",sep=""))) +
    length(list.files(paste(validation_directory,"non_invasive",sep="")))

test_samples <- length(list.files(paste(test_directory,"invasive",sep=""))) +
    length(list.files(paste(test_directory,"non_invasive",sep="")))
```


```{r}
train_samples
validation_samples
test_samples
```

In this case we will use the VGG16 pre-trained network. 

VGG16's architecture is made up of 5 convolutional blocks, each block is made up of several convolutional layers. Max pooling is applied between these blocks. The architecutre has 16 layers and looks as follows:

![](vgg16.jpg)

This network needs input images to have a dimension of 224x224x3, so we set desired image height and width accordingly.


```{r}
img_height <- 224
img_width <- 224
batch_size <- 16
```

## Data generators

Since the data is neatly organised in folders, we can make use of flow_images_from_directory to easily read in the data. We do this for our training, validation and testing data.


```{r}
train_generator <- flow_images_from_directory(
  train_directory, 
  generator = image_data_generator(),
  target_size = c(img_height, img_width),
  color_mode = "rgb",
  class_mode = "binary", 
  batch_size = batch_size, 
  shuffle = TRUE,
  seed = 123)

validation_generator <- flow_images_from_directory(
  validation_directory, 
  generator = image_data_generator(), 
  target_size = c(img_height, img_width), 
  color_mode = "rgb", 
  classes = NULL,
  class_mode = "binary", 
  batch_size = batch_size, 
  shuffle = TRUE,
  seed = 123)

test_generator <- flow_images_from_directory(
  test_directory, 
  generator = image_data_generator(),
  target_size = c(img_height, img_width), 
  color_mode = "rgb", 
  class_mode = "binary", 
  batch_size = 1,
  shuffle = FALSE)
```

## Loading pre-trained model and adding custom layers

Here, include_top=FALSE means that we are not including the last 3 fully connected layers that are present in the orginal VGG16 architecture. weights='imagenet' means that the model will use the weights which were obtained when originally training on the ImageNet dataset (millions of) Additional references are available here: https://tensorflow.rstudio.com/keras/reference/application_vgg.html


```{r}
base_model <- application_vgg16(weights = "imagenet", 
                                       include_top = FALSE)
```

### Choices of weights are "imagenet" or "None". None means that the weights will be randomly initialised.

Imagenet has roughly 14 million images categorised into roughly 17 thousand classes. So it makes sense to use models that have good performance on this dataset.

## Add our custom layers

Here we add a global average 2d pooling layer followed by two fully connected layers. The the last layer there is a single output node. Why is there a single output node? Why are we using the sigmoid function instead of, say, ReLU?


```{r}
predictions <- base_model$output %>% 
  layer_global_average_pooling_2d(trainable=T) %>% 
  layer_dense(units = 512, activation = "relu", trainable=T) %>% 
  layer_dense(units = 1, activation = "sigmoid", trainable=T)

model <- keras_model(inputs = base_model$input, 
                     outputs = predictions)
```

## Print out a summary of the model.
### Take note of the number of trainable parameters


```{r}
summary(model)
```

## Here we "Freeze" some layers, i.e. we tell the model not to learn those weights in those layers.


```{r}
for (layer in base_model$layers)
  layer$trainable <- FALSE
```

## Now print out the summary and have a look at the number of parameters


```{r}
summary(model)
```

## Compile the model

This is a typical implementation of stochastic gradient descent with a learning rate of 0.0001.


```{r}
model %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_sgd(lr = 0.0001, 
                            momentum = 0.9, 
                            decay = 1e-5),
  metrics = "accuracy"
)
```

## Fit the model

Train the model on the training data, validate on the validation data. Run for 5 pochs. This is a typical implementation. Here we use fit_generator() because we read in our data using a generator above.


```{r}
model %>% fit_generator(
  train_generator,
  steps_per_epoch = as.integer(train_samples / batch_size), 
  epochs = 3, 
  validation_data = validation_generator,
  validation_steps = as.integer(validation_samples / batch_size),
  verbose = 1)
```

## Evaluate the model


```{r}
model %>% evaluate_generator(
    test_generator,
    steps = test_samples)
```

Various other pre-trained models for Keras are avaiable here: https://keras.rstudio.com/articles/applications.html The website also shows a simple example on how to load the model. If you do not have access to a GPU then this approach is a good place to start.

# Example 2

Now let's implement transfer learning for CIFAR-10. More details here: https://www.cs.toronto.edu/~kriz/cifar.html What is the dataset about? What can you say about the dimensions of the data? How many output nodes do you think we need in the last layer?


```{r}
# Load the CIFAR-10 dataset
cifar10 <- dataset_cifar10()
```

The CIFAR10 dataset has 50,000 training images and 10,000 test images. Here to speed things up we just take the first 1000 images from each of the training and test datasets. You might want to increase this number depending on the memory and time available to you.

```{r}
# Feature scale RGB values in test and train inputs  
x_train <- cifar10$train$x[1:1000,,,]/255
x_test <- cifar10$test$x[1:1000,,,]/255
y_train <- to_categorical(cifar10$train$y[1:1000], num_classes = 10)
y_test <- to_categorical(cifar10$test$y[1:1000], num_classes = 10)
```

```{r}
rm(cifar10)
```
Now load the pretrained model


```{r}
base_model <- application_vgg16(weights = "imagenet", 
                                       include_top = FALSE)
```

## Here once again we tell the model not to re-train every weight


```{r}
for (layer in base_model$layers)
  layer$trainable <- FALSE
```

Okay here we need to implement our new last layers slightly differently to the invasive species dataset. CIFAR-10. Firstly, we know that this is a classification problem, and that there are more than just 2 classes. So, from this, we know that we should have 10 units in the last layer. We need to use a softmax activation function. Softmax outputs the probability for each class, so this is the perfect activation function to use. Here we are adding only two fully connected layers, feel free to experiment with other layers, units or even add dropout.


```{r}
predictions <- base_model$output %>% 
  layer_global_average_pooling_2d() %>% 
  layer_dense(units = 1024, activation = "relu") %>% 
  layer_dense(units = 10, activation = "softmax")

model <- keras_model(inputs = base_model$input, 
                     outputs = predictions)
```

Display the network architecture


```{r}
summary(model)
```

This model has 15 million parameters. Yikes! We don't have to spend hours and re-train a lot of those parameters - thanks transfer learning!

Now that we have defined a model, we need to define the loss function and tell the model which optimiser it will use. In the previous example we used stochastic gradient descent. Let's use a different one this time.


```{r}
opt<-optimizer_adam(lr= 0.001)

model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = opt,
  metrics = "accuracy"
)
```

In this example, we aren't used a data generator since all of our data are already in the x and y variables. So we can't call the fit function like above. Of course, we should have a separate validation and test set, but for simplicity we will just use the test set here as validation data. Instead we call it this way:


```{r}
model %>% fit( x_train,y_train ,batch_size=32,
               epochs=1,validation_data = list(x_test, y_test),
               shuffle=TRUE)
```

And to test


```{r}
model %>% evaluate(x_test, y_test, batch_size=32, verbose = 1)
```

More models are available here: https://keras.io/applications/#vgg16