---
title: "Predicting the music genre of spotify tracks using deep learning"
author: "Daniel Loos"
date: today
editor: visual
format:
  html:
    toc: true
execute:
  cache: true
---
```{r setup, echo=FALSE}
knitr::opts_knit$set(root.dir = here::here())
knitr::opts_chunk$set(warning = FALSE)
options(warn=-1)
```
## Abstract
Music genres are often composed of particular pitch patterns that can be used for prediction.
The Spotify API provides features for entire tracks, e.g. loudness and acousticness scores, as well as the sequence of individual pitches (notes).
A total of 3,600 tracks across the techno, rock, jazz, and classical genres were analyzed and modeled with both classical machine learning and deep learning methods.
The validation accuracy of both approaches was similar, suggesting that more sophisticated network architectures are needed to increase model performance.
## Tech stack
- [keras](https://keras.io/) deep learning framework
- [TensorFlow](https://www.tensorflow.org/) deep learning framework
- [tidymodels](https://www.tidymodels.org/) machine learning framework
- [tidyverse](https://www.tidyverse.org/) data wrangling
- [R targets](https://books.ropensci.org/targets/) pipeline system
- [spotifyr](https://www.rcharlie.com/spotifyr/) REST API calls
- [quarto](https://quarto.org/) notebook documentation
Keywords:
- Spatial data analysis
- deep learning
- CNN
- LSTM
- REST APIs
[This project on GitHub](https://github.com/danlooo/spotify-datasci)
## ETL pipeline
The [Spotify audio analysis API](https://developer.spotify.com/community/showcase/spotify-audio-analysis/) was queried to get summary features, e.g. danceability and acousticness, as well as the individual pitch sequences of 3600 music tracks.
An R targets pipeline DAG was created to retrieve and transform the data, allowing parallelization and caching.
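As a minimal sketch (assuming [spotifyr](https://www.rcharlie.com/spotifyr/) for the API calls; `get_tracks()` and `extract_pitches()` are hypothetical helpers, not functions from this repository), a `_targets.R` along these lines wires the steps into a DAG:

```r
# _targets.R (sketch, not the project's actual pipeline definition)
library(targets)
tar_option_set(packages = c("spotifyr", "tidyverse"))

list(
  tar_target(terms, c("techno", "rock", "jazz", "classical")),
  tar_target(track_searches, get_tracks(terms)),  # hypothetical helper wrapping search_spotify()
  tar_target(
    track_audio_features,
    get_track_audio_features(track_searches$id)   # real spotifyr call
  ),
  tar_target(track_pitches, extract_pitches(track_searches$id))  # hypothetical helper
)
```

The actual pipeline is loaded and its DAG visualized below: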
```{r}
source("_targets.R")
tar_load(c("terms", "track_audio_features", "selected_audio_features", "audio_analyses", "track_train_test_split", "track_searches", "track_pitches", "valid_tracks", "track_audio_analyses"))
tar_visnetwork()
```
## Data overview
Spotify was queried using the following search terms, retrieving up to 50 tracks per term.
```{r}
terms
```
Total number of tracks:
```{r}
nrow(track_searches)
```
Tracks per term:
```{r}
track_searches |> count(term)
```
Features per track:
```{r}
tracks <-
track_audio_features |>
left_join(track_searches, by = "id") |>
filter(id %in% valid_tracks) |>
mutate(term = factor(term))
colnames(tracks)
```
Number of tracks after sanity checks:
```{r}
nrow(tracks)
```
## Summary features for prediction
```{r}
features <- c("danceability", "acousticness")
tar_load(track_train_test_split)
tracks_train <- tracks |> inner_join(track_train_test_split) |> filter(is_train)
tracks_train |>
select(term, all_of(features)) |>
mutate(across(all_of(features), scale)) |>
pivot_longer(all_of(features)) |>
ggplot(aes(term, value)) +
geom_quasirandom() +
geom_boxplot(outlier.shape = NA, width = 0.5) + # hide outliers; points are already drawn
facet_wrap(~ name, scales = "free") +
coord_flip()
```
- Techno songs are high in danceability and low in acousticness.

A (linear, Euclidean) ordination biplot shows all numeric features at once:
```{r}
pca <-
track_audio_features |>
semi_join(tracks_train) |>
column_to_rownames("id") |>
select(all_of(selected_audio_features)) |>
mutate(across(everything(), scale)) |>
filter(if_any(everything(), ~ ! is.na(.x))) |>
prcomp()
tracks_pca <-
pca$x |>
as_tibble(rownames = "id") |>
left_join(track_audio_features, by = "id") |>
left_join(track_searches, by = "id")
# genre cluster centers: median PC1/PC2 per term
track_clusters <-
tracks_pca |>
group_by(term) |>
summarise(across(c(PC1, PC2), median))
tibble() |>
ggplot(aes(x = PC1, y = PC2, color = group)) +
geom_text(
data = track_clusters |> mutate(group = "term"),
mapping = aes(label = term)
) +
geom_text(
data = pca$rotation |> as_tibble(rownames = "feature") |> mutate(group = "feature"),
mapping = aes(label = feature)
)
```
More detailed biplot:
```{r}
tibble() |>
ggplot(aes(x = PC1, y = PC2)) +
geom_point(
data = tracks_pca,
mapping = aes(color = term),
alpha = 0.3
) +
ggrepel::geom_label_repel(
data = track_clusters,
mapping = aes(label = term, color = term)
) +
guides(color = "none") +
ggnewscale::new_scale_color() +
geom_segment(
data = pca$rotation |> as_tibble(rownames = "feature"),
mapping = aes(x = 0, y = 0, xend = max(abs(pca$x[,1])) * PC1, yend = max(abs(pca$x[,2])) * PC2),
arrow = arrow()
) +
ggrepel::geom_label_repel(
data = pca$rotation |> as_tibble(rownames = "feature"),
mapping = aes(label = feature, x = max(abs(pca$x[,1])) * PC1, y = max(abs(pca$x[,2])) * PC2)
)
```
Sanity checks:

- classical tracks are associated with acousticness
- rock and techno tracks are associated with loudness

There is no clear separation between the genre clusters, suggesting a challenging classification task.
```{r}
summary(pca)$importance["Cumulative Proportion","PC2"]
```
Almost half of the variance is explained by the first two principal components, motivating the prediction of the terms from these features. These features also differed significantly across the terms:
```{r}
features |>
paste0(collapse = "+") |>
paste0("~ term") |>
lm(data = tracks) |>
anova()
```
```{r}
features |>
paste0(collapse = "+") |>
paste0("~ term") |>
lm(data = tracks) |>
summary()
```
We use the same set of test samples throughout the entire analysis:
```{r}
library(tidymodels)
tar_load(model_data)
train <-
track_audio_features |>
filter(id %in% rownames(model_data$train_y)) |>
left_join(track_searches, by = "id") |>
mutate(term = term |> factor()) |>
select(term, all_of(selected_audio_features))
test <-
track_audio_features |>
filter(id %in% rownames(model_data$test_y)) |>
left_join(track_searches, by = "id") |>
mutate(term = term |> factor()) |>
select(term, all_of(selected_audio_features))
```
Let's start with a (linear) Support Vector Machine (SVM):
```{r}
svm_linear(mode = "classification") |>
fit(term ~ ., data = train) |>
predict(test) |>
bind_cols(test) |>
mutate(across(c("term", ".pred_class"), ~ factor(.x, levels = test$term |> unique()))) |>
accuracy(truth = term, estimate = .pred_class)
```
A (non-linear) random forest showed similar performance:
```{r}
rand_forest(mode = "classification") |>
fit(term ~ ., data = train) |>
predict(test) |>
bind_cols(test) |>
mutate(across(c("term", ".pred_class"), ~ factor(.x, levels = test$term |> unique()))) |>
accuracy(truth = term, estimate = .pred_class)
```
The test accuracy was very high in general.
Can it be improved even further by using the individual pitch sequences instead of relying on just a few summary features describing the entire track?
## Pitch sequences for prediction
Some features are highly correlated, suggesting redundancy, e.g.:
```{r}
tracks |>
ggplot(aes(danceability, loudness)) +
geom_point() +
stat_smooth(method = "lm") +
stat_cor()
```
Indeed, lots of features were significantly correlated after FDR adjustment:
```{r}
tracks |>
select(all_of(selected_audio_features)) |>
as.matrix() |>
Hmisc::rcorr() |>
broom::tidy() |>
ungroup() |>
mutate(q.value = p.value |> p.adjust(method = "fdr")) |>
filter(q.value < 0.05 & abs(estimate) > 0.2) |>
arrange(-abs(estimate)) |>
unite(col = comparison, column1, column2, sep = " vs. ") |>
head(10) |>
ggplot(aes(comparison, estimate)) +
geom_col() +
coord_flip() +
labs(y = "Pearson correlation")
```
Music is composed of shorter and longer patterns.
We can make use of the temporal property by doing convolutions on the time axis while using loudness of pitch frequencies as features.
[Spotify audio analysis](https://developer.spotify.com/community/showcase/spotify-audio-analysis/) separates the track into many segments and calculates the loudness for each of the 12 pitches (half steps) of the scale.
```{r}
track_audio_analyses$audio_analysis[[1]]$segments$pitches[1][[1]]
```
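Before these sequences can be fed to a network, they need a common shape. A minimal crop-and-pad sketch (assuming `track_pitches$pitches` holds one segments × 12 pitch matrix per track; the 800-segment cutoff is an assumption mirroring the spectrograms below):

```r
# Stack variable-length pitch matrices into one (tracks, segments, pitches)
# tensor by truncating long tracks and zero-padding short ones.
to_tensor <- function(pitch_matrices, n_segments = 800) {
  padded <- lapply(pitch_matrices, function(m) {
    m <- as.matrix(m)
    out <- matrix(0, nrow = n_segments, ncol = 12)  # zero padding
    k <- min(nrow(m), n_segments)                   # truncation cutoff
    out[seq_len(k), ] <- m[seq_len(k), ]
    out
  })
  array(unlist(padded), dim = c(n_segments, 12, length(padded))) |>
    aperm(c(3, 1, 2))  # reorder to (tracks, segments, pitches)
}
```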
These are spectrograms of a subset of tracks representing the feature space for deep learning:
```{r}
#| fig-height: 10
track_pitches |>
left_join(track_searches) |>
sample_frac(0.01) |>
select(id, term, pitches) |>
unnest(pitches) |>
group_by(id) |>
mutate(segment = row_number()) |>
pivot_longer(starts_with("V"), names_to = "pitch_name", values_to = "pitch") |>
mutate(pitch_name = pitch_name |> str_extract("[0-9]+") |> as.numeric() |> factor()) |>
group_by(term, segment, pitch_name) |>
summarise(pitch = median(pitch)) |>
ggplot(aes(segment, pitch_name)) +
geom_tile(aes(fill = pitch)) +
facet_wrap(~term, ncol = 1) +
scale_fill_viridis_c() +
scale_x_continuous(limits = c(0, 800), expand = c(0, 0)) +
labs(x = "Time (segment)", y = "Pitch (semitone)", fill = "Median loudness")
```
On average, techno tracks use a variety of different notes across the duration of the track, whereas classical tracks mostly use a few particular notes.
Let's define some model architectures for deep learning:
```{r}
tar_load(model_archs)
model_archs$base() |> plot()
```
This base model does not utilize the spatial structure of the data and is used for comparison.
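As a rough sketch of such a baseline (layer sizes and the 800 × 12 input shape are assumptions, not the trained architecture stored in `model_archs`):

```r
library(keras)

# Flattening discards the temporal order, so the network sees the
# spectrogram as one long feature vector (baseline sketch).
base_model <- keras_model_sequential() |>
  layer_flatten(input_shape = c(800, 12)) |>
  layer_dense(units = 64, activation = "relu") |>
  layer_dense(units = 4, activation = "softmax")  # one unit per genre

base_model |> compile(
  optimizer = "adam",
  loss = "categorical_crossentropy",
  metrics = "accuracy"
)
```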
```{r}
model_archs$cnn1() |> plot()
```
This is a *sequential* Convolutional Neural Network (CNN).
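A hedged sketch of such a sequential 1D CNN (filter counts and kernel sizes are illustrative assumptions):

```r
# Convolve over the segment (time) axis, with the 12 pitches as channels.
cnn1_model <- keras_model_sequential() |>
  layer_conv_1d(filters = 32, kernel_size = 8, activation = "relu",
                input_shape = c(800, 12)) |>
  layer_max_pooling_1d(pool_size = 4) |>
  layer_conv_1d(filters = 64, kernel_size = 8, activation = "relu") |>
  layer_global_max_pooling_1d() |>  # collapse the time axis before classifying
  layer_dense(units = 4, activation = "softmax")
```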
```{r}
model_archs$cnn2() |> plot()
```
This is a *non-sequential* Convolutional Neural Network (CNN).
The idea behind this model is that both short and long pitch patterns can be directly used for final prediction.
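One way to express this idea with the keras functional API (a sketch under assumed kernel sizes; parallel branches detect short and long patterns and feed the classifier directly):

```r
input <- layer_input(shape = c(800, 12))

# Branch with a short kernel for brief pitch motifs.
short_patterns <- input |>
  layer_conv_1d(filters = 32, kernel_size = 4, activation = "relu") |>
  layer_global_max_pooling_1d()

# Branch with a long kernel for extended patterns.
long_patterns <- input |>
  layer_conv_1d(filters = 32, kernel_size = 32, activation = "relu") |>
  layer_global_max_pooling_1d()

# Concatenate both branches so short and long motifs reach the output layer.
output <- layer_concatenate(list(short_patterns, long_patterns)) |>
  layer_dense(units = 4, activation = "softmax")

cnn2_model <- keras_model(inputs = input, outputs = output)
```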
```{r}
model_archs$lstm() |> plot()
```
Long Short-Term Memory (LSTM) networks view time as a one-directional spatial feature, whereas one can go in both directions with CNNs.
This makes sense for time series data like the pitch sequences.
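A minimal sketch of such a model (the unit count is an assumption; wrapping the layer in `bidirectional()` would lift the one-directional restriction):

```r
# The LSTM reads the 800 segments in order, keeping a running state.
lstm_model <- keras_model_sequential() |>
  layer_lstm(units = 32, input_shape = c(800, 12)) |>
  layer_dense(units = 4, activation = "softmax")
```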
## Evaluate deep learning models
```{r}
tar_load(evaluations)
evaluations
```
All models outperformed random guessing, which has an expected accuracy of 25% for four classes.
The simple CNN1 had the highest accuracy in the validation set.
```{r}
evaluations |>
select(name, model) |>
mutate(history = model |> map(~ str_glue("tmp/train_history/{.x}.csv") |> read_csv())) |>
unnest(history) |>
pivot_longer(c("accuracy", "val_accuracy"), names_to = "subset") |>
mutate(subset = subset |> recode(accuracy = "train", "val_accuracy" = "validation")) |>
ggplot(aes(epoch, value, color = subset)) +
geom_line() +
facet_wrap(~name, ncol = 1) +
labs(y = "Accuracy")
```
The base and CNN1 models generalize well to the external validation samples, whereas CNN2 is affected by over-fitting.
This may be due to its high number of trainable parameters.
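One quick way to check this hypothesis is to compare the parameter counts of the architectures (a sketch; `count_params()` is the keras helper):

```r
# Instantiate each architecture and count its trainable weights.
model_archs |> purrr::map_dbl(~ .x() |> count_params())
```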
## Conclusion
CNN1 outperformed a linear SVM, but its test accuracy was lower than that of a random forest.
Considering model complexity and the computational effort of training, the analysis suggests that deep learning models might not be worth the effort for predicting music genre when meaningful summary features, e.g. danceability and acousticness, are available.
However, more sophisticated deep learning architectures might be developed in the future to improve the validation accuracy even further.