forked from iandurbach/ml-for-ecology
---
title: "Transfer learning"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
In this notebook we use pre-trained neural networks as a starting point for our own models. The rationale behind this tutorial is as follows: if we construct a neural network that can recognise images of cars, would it be possible for the network (with some extra training) to recognise trucks?
Other people have trained enormous general-purpose CNNs for image classification. We can exploit the features that these CNNs have learned for our own image classification problem, i.e. transfer the knowledge/features acquired by training networks on large datasets by applying the model, along with its weights, to a new dataset.
Training networks (even more so very deep networks) on large datasets (hundreds of thousands of examples) can take a very long time (days/weeks). So it doesn't always make sense to re-train from scratch.
Typically, transfer learning works by loading a pre-trained network and removing the final layer (which predicts class membership), since this will be particular to the problem used to train the original network. For example, the original network might have 10 classes, and our new problem might have 15 - so this is why we would need to remove the last layer. We then add on our own final layer, which contains our own output nodes (based on the number of classes, or in certain cases just a single output). We then retrain the network (either just the final layer, or the whole network).
The following image summarises the principle behind transfer learning in neural networks.
![](img/transfer_learning_setup.png)
```{r}
library(keras)
```
Say where the data is...
```{r}
train_directory <- "data/invasives/sample/train/"
validation_directory <- "data/invasives/sample/validation/"
test_directory <- "data/invasives/sample/test/"
# once you are satisfied the code is working, run full dataset
# train_directory <- "data/invasives/train/"
# validation_directory <- "data/invasives/validation/"
# test_directory <- "data/invasives/test/"
```
And work out how many images we have.
```{r}
train_samples <- length(list.files(paste0(train_directory, "invasive"))) +
  length(list.files(paste0(train_directory, "non_invasive")))
validation_samples <- length(list.files(paste0(validation_directory, "invasive"))) +
  length(list.files(paste0(validation_directory, "non_invasive")))
test_samples <- length(list.files(paste0(test_directory, "invasive"))) +
  length(list.files(paste0(test_directory, "non_invasive")))
```
```{r}
train_samples
validation_samples
test_samples
```
In this case we will use the VGG16 pre-trained network.
VGG16's architecture is made up of 5 convolutional blocks, each made up of several convolutional layers. Max pooling is applied between these blocks. The architecture has 16 weight layers (13 convolutional and 3 fully connected) and looks as follows:
![](vgg16.jpg)
This network needs input images to have a dimension of 224x224x3, so we set desired image height and width accordingly.
```{r}
img_height <- 224
img_width <- 224
batch_size <- 16
```
## Data generators
Since the data is neatly organised in folders, we can make use of flow_images_from_directory to easily read in the data. We do this for our training, validation and testing data.
```{r}
train_generator <- flow_images_from_directory(
  train_directory,
  generator = image_data_generator(),
  target_size = c(img_height, img_width),
  color_mode = "rgb",
  class_mode = "binary",
  batch_size = batch_size,
  shuffle = TRUE,
  seed = 123)
validation_generator <- flow_images_from_directory(
  validation_directory,
  generator = image_data_generator(),
  target_size = c(img_height, img_width),
  color_mode = "rgb",
  classes = NULL,
  class_mode = "binary",
  batch_size = batch_size,
  shuffle = TRUE,
  seed = 123)
test_generator <- flow_images_from_directory(
  test_directory,
  generator = image_data_generator(),
  target_size = c(img_height, img_width),
  color_mode = "rgb",
  class_mode = "binary",
  batch_size = 1,
  shuffle = FALSE)
```
## Loading pre-trained model and adding custom layers
Here, `include_top = FALSE` means that we leave out the last 3 fully connected layers that are present in the original VGG16 architecture. `weights = "imagenet"` means that the model will use the weights obtained when it was originally trained on the ImageNet dataset (millions of images). Additional references are available here: https://tensorflow.rstudio.com/keras/reference/application_vgg.html
```{r}
base_model <- application_vgg16(weights = "imagenet",
                                include_top = FALSE)
```
### Choices of weights are "imagenet" or NULL. NULL means that the weights will be randomly initialised.
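ImageNet has roughly 14 million images in more than 20 thousand categories; the pre-trained Keras weights come from its widely used 1,000-class subset. Networks that perform well on such a large and varied dataset have learned general-purpose image features, so it makes sense to start from them.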
## Add our custom layers
Here we add a global average 2D pooling layer followed by two fully connected layers. In the last layer there is a single output node. Why is there a single output node? Why are we using the sigmoid function instead of, say, ReLU?
```{r}
predictions <- base_model$output %>%
  layer_global_average_pooling_2d(trainable = TRUE) %>%
  layer_dense(units = 512, activation = "relu", trainable = TRUE) %>%
  layer_dense(units = 1, activation = "sigmoid", trainable = TRUE)
model <- keras_model(inputs = base_model$input,
                     outputs = predictions)
```
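As a hint for the question about the output activation, the following is a minimal base-R sketch comparing the two functions. The names `sigmoid_sketch`, `relu_sketch` and `z_values` are made up for this illustration and are not part of keras.

```{r}
# Illustrative sketch only: base-R versions of the two activations.
sigmoid_sketch <- function(z) 1 / (1 + exp(-z))
relu_sketch <- function(z) pmax(z, 0)

z_values <- c(-5, -1, 0, 1, 5)  # hypothetical pre-activation values

sigmoid_sketch(z_values)  # always strictly between 0 and 1
relu_sketch(z_values)     # zero for negative inputs, unbounded above
```

Because the sigmoid output always lies between 0 and 1, a single node can be read as the probability of one of the two classes, which is what the binary cross-entropy loss expects.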
## Print out a summary of the model.
### Take note of the number of trainable parameters
```{r}
summary(model)
```
## Here we "freeze" the layers of the pre-trained base model, i.e. we tell the model not to update the weights in those layers.
```{r}
for (layer in base_model$layers) {
  layer$trainable <- FALSE
}
```
## Now print out the summary and have a look at the number of parameters
```{r}
summary(model)
```
## Compile the model
This is a typical setup: stochastic gradient descent with a learning rate of 0.0001, a momentum of 0.9 and a small learning-rate decay, with binary cross-entropy as the loss.
```{r}
model %>% compile(
  loss = "binary_crossentropy",
  # newer keras releases call the `lr` argument `learning_rate`
  optimizer = optimizer_sgd(lr = 0.0001,
                            momentum = 0.9,
                            decay = 1e-5),
  metrics = "accuracy"
)
```
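The optimiser settings above combine a learning rate, momentum and decay. As a rough illustration of what those three numbers do, here is a minimal base-R sketch of a momentum update with time-based learning-rate decay, following the rule used by older Keras releases (an assumption about the version in use). The toy objective and the names `grad_sketch` and `sgd_momentum_sketch` are made up for this example, and the larger learning rate is chosen only so that the toy problem converges quickly.

```{r}
# Hypothetical one-parameter problem: minimise f(w) = (w - 3)^2.
grad_sketch <- function(w) 2 * (w - 3)

sgd_momentum_sketch <- function(w, n_steps, lr, momentum, decay) {
  velocity <- 0
  for (step in seq_len(n_steps)) {
    # time-based decay: the learning rate shrinks slowly as steps accumulate
    lr_t <- lr / (1 + decay * (step - 1))
    # momentum keeps a fraction of the previous update
    velocity <- momentum * velocity - lr_t * grad_sketch(w)
    w <- w + velocity
  }
  w
}

w_end <- sgd_momentum_sketch(w = 0, n_steps = 200, lr = 0.01,
                             momentum = 0.9, decay = 1e-5)
w_end  # should end up close to the minimiser at 3
```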
## Fit the model
Train the model on the training data and validate on the validation data. Run for 3 epochs. This is a typical implementation. Here we use fit_generator() because we read in our data using a generator above.
```{r}
model %>% fit_generator(
  train_generator,
  steps_per_epoch = as.integer(train_samples / batch_size),
  epochs = 3,
  validation_data = validation_generator,
  validation_steps = as.integer(validation_samples / batch_size),
  verbose = 1)
```
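Note that `as.integer()` truncates, so when the number of images is not a multiple of the batch size the last partial batch is left out of that epoch. A small worked example with hypothetical numbers (not the real dataset sizes):

```{r}
n_images_example <- 100  # hypothetical number of training images
batch_example <- 16      # same batch size as used in this notebook

steps_floor <- as.integer(n_images_example / batch_example)  # truncates towards zero
steps_ceiling <- ceiling(n_images_example / batch_example)   # rounds up instead

steps_floor    # 6 full batches, i.e. 96 images per epoch
steps_ceiling  # 7 batches would cover all 100 images
```

Since the training generator shuffles and loops over the data, the images skipped in one epoch should still turn up in later ones, so truncating is usually harmless for training.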
## Evaluate the model
```{r}
model %>% evaluate_generator(
  test_generator,
  steps = test_samples)
```
Various other pre-trained models for Keras are available here: https://keras.rstudio.com/articles/applications.html The website also shows a simple example of how to load a model. If you do not have access to a GPU then this approach is a good place to start.
# Example 2
Now let's implement transfer learning for CIFAR-10. More details here: https://www.cs.toronto.edu/~kriz/cifar.html What is the dataset about? What can you say about the dimensions of the data? How many output nodes do you think we need in the last layer?
```{r}
# Load the CIFAR-10 dataset
cifar10 <- dataset_cifar10()
```
The CIFAR10 dataset has 50,000 training images and 10,000 test images. Here to speed things up we just take the first 1000 images from each of the training and test datasets. You might want to increase this number depending on the memory and time available to you.
```{r}
# Feature scale RGB values in test and train inputs
x_train <- cifar10$train$x[1:1000,,,]/255
x_test <- cifar10$test$x[1:1000,,,]/255
y_train <- to_categorical(cifar10$train$y[1:1000], num_classes = 10)
y_test <- to_categorical(cifar10$test$y[1:1000], num_classes = 10)
```
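The `to_categorical()` calls above turn each integer label (0 to 9) into a row of ten zeros and ones. A minimal base-R sketch of the same idea, using made-up labels and hypothetical variable names:

```{r}
labels_example <- c(0, 3, 9)  # hypothetical CIFAR-10 style labels (0 to 9)

# Row i of the 10 x 10 identity matrix is the one-hot code for class i - 1.
one_hot_example <- diag(10)[labels_example + 1, ]

one_hot_example
```

Each row contains a single 1, in the column belonging to that image's class.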
```{r}
rm(cifar10)
```
Now load the pretrained model
```{r}
base_model <- application_vgg16(weights = "imagenet",
                                include_top = FALSE)
```
## Here once again we tell the model not to re-train every weight
```{r}
for (layer in base_model$layers) {
  layer$trainable <- FALSE
}
```
Here we need to implement our new last layers slightly differently from the invasive species example. We know that CIFAR-10 is a classification problem with more than 2 classes (10, in fact), so we should have 10 units in the last layer. We use a softmax activation function: softmax outputs a probability for each class, with the probabilities summing to one, which makes it a natural choice here. We add only two fully connected layers; feel free to experiment with other layers, numbers of units, or even dropout.
```{r}
predictions <- base_model$output %>%
  layer_global_average_pooling_2d() %>%
  layer_dense(units = 1024, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax")
model <- keras_model(inputs = base_model$input,
                     outputs = predictions)
```
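To see what the softmax in the last layer does, here is a minimal base-R sketch. The function name `softmax_sketch` and the example scores are made up for this illustration.

```{r}
softmax_sketch <- function(z) {
  exp_z <- exp(z - max(z))  # subtracting the max avoids overflow, result is unchanged
  exp_z / sum(exp_z)
}

scores_example <- c(2.0, 1.0, 0.1)  # hypothetical raw scores for three classes
probs_example <- softmax_sketch(scores_example)

probs_example       # one probability per class, largest score gets the largest share
sum(probs_example)  # the probabilities add up to one
```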
Display the network architecture
```{r}
summary(model)
```
This model has 15 million parameters. Yikes! We don't have to spend hours and re-train a lot of those parameters - thanks transfer learning!
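A quick check of where the roughly 15 million parameters come from. The figure for the VGG16 convolutional base is quoted from memory and should be treated as an assumption; compare it with the output of `summary(model)`. The variable names are hypothetical.

```{r}
vgg16_conv_params <- 14714688  # VGG16 without its top layers (assumed, check the summary)
pooled_features <- 512         # channels leaving the last VGG16 block

dense_1_params <- pooled_features * 1024 + 1024  # weights + biases of the 1024-unit layer
dense_2_params <- 1024 * 10 + 10                 # weights + biases of the 10-unit layer

trainable_params <- dense_1_params + dense_2_params
total_params <- vgg16_conv_params + trainable_params

c(trainable = trainable_params, total = total_params)
```

With the base model frozen, only the two new dense layers are trained, which is a small fraction of the total.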
Now that we have defined a model, we need to define the loss function and tell the model which optimiser it will use. In the previous example we used stochastic gradient descent. Let's use a different one this time.
```{r}
opt <- optimizer_adam(lr = 0.001)
model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = opt,
  metrics = "accuracy"
)
```
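For intuition about the loss chosen above, categorical cross-entropy for a single image is the negative log of the probability the model assigns to the true class. A minimal base-R sketch with made-up numbers and hypothetical variable names:

```{r}
y_true_example <- c(0, 1, 0)        # hypothetical one-hot label (true class is the second)
y_pred_example <- c(0.1, 0.7, 0.2)  # hypothetical softmax output

# Only the term for the true class survives the multiplication by the one-hot label.
cce_example <- -sum(y_true_example * log(y_pred_example))
cce_example
```

The loss shrinks towards zero as the predicted probability of the true class approaches one.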
In this example we aren't using a data generator, since all of our data are already in the x and y variables, so we can't use fit_generator() as above. Of course, we should have a separate validation and test set, but for simplicity we will just use the test set here as validation data. We call fit() this way:
```{r}
model %>% fit(x_train, y_train,
              batch_size = 32,
              epochs = 1,
              validation_data = list(x_test, y_test),
              shuffle = TRUE)
```
And to test
```{r}
model %>% evaluate(x_test, y_test, batch_size=32, verbose = 1)
```
More models are available here: https://keras.io/applications/#vgg16