This is the repository for the project of the Team Laboratory Computational Linguistics course at the University of Stuttgart in the summer term of 2022.
In this project, we analyze BERT's performance for increasing numbers of classes. We use artist classification from lyrics as an example task for that.
We compare the performance deterioration of two baselines, random choice and kNN with a set-based BOW representation, against fine-tuning BERT, evaluated both via its classification head and by using its embeddings for kNN. Additionally, we experiment with using embeddings from pre-trained (not fine-tuned) BERT for kNN.
We evaluate these settings using accuracy.
Find the baseline kNN implementation here, examples for the random baseline and the BERT experiments in the bert notebook, and the full experiments for different numbers of classes in the bert-diff-classes notebook.
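As an illustration of the embedding-based kNN variants, here is a minimal sketch using the Hugging Face transformers and scikit-learn packages; the checkpoint name, the [CLS] pooling, and the train_*/test_* variables are placeholders and may differ from what the notebooks actually do.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.neighbors import KNeighborsClassifier

# Placeholder checkpoint: the actual experiments may use a different (fine-tuned) model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(lyrics):
    """Encode a list of lyrics into fixed-size vectors via the [CLS] token."""
    batch = tokenizer(lyrics, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return out.last_hidden_state[:, 0, :].numpy()  # one [CLS] vector per song

# train_lyrics/train_artists and test_lyrics/test_artists stand in for your own splits.
knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
knn.fit(embed(train_lyrics), train_artists)
print("accuracy:", knn.score(embed(test_lyrics), test_artists))
```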
To run the code, install the dependencies from the requirements file:
$ pip install -r requirements.txt
Run the kNN baseline experiments by specifying the number of classes n and the number of processes to use for multiprocessing p:

$ python baseline.py n p
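For example, to run the baseline on 50 classes with 4 worker processes (both values chosen purely for illustration):

$ python baseline.py 50 4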
Note that the results will be logged in a log file. To change the name or directory, change the file name in the second to last line of the script.
Unfortunately, we do not have the rights to distribute the dataset we used in this project. To use the code provided here with your own dataset, add your own training, validation and test datasets structured like the example file.
Each instance in the dataset is a triple consisting of the artist name, the song title and the lyrics of one song.
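Since we cannot distribute the data, here is a minimal loading sketch; it assumes a tab-separated file with one song per line in the order artist, title, lyrics, but the exact format should follow the example file in the repository.

```python
import csv

def load_split(path):
    """Load one dataset split as a list of (artist, title, lyrics) triples.

    Assumes a tab-separated file with one song per line; adapt this to the
    format of the example file if your data is stored differently.
    """
    with open(path, encoding="utf-8") as f:
        return [(artist, title, lyrics) for artist, title, lyrics in csv.reader(f, delimiter="\t")]

train = load_split("data/train.tsv")  # hypothetical path
```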
Overview of the number of artists/songs in the dataset:

&nbsp; | Train | Val | Test |
---|---|---|---|
Artists | 642 | 612 | 618 |
Songs | 46,120 | 5,765 | 5,765 |
We provide a short descriptive data analysis here.
The artist classification task is to identify which artist performed/wrote a given song from its lyrics alone.
We find that for lower numbers of classes, fully fine-tuned BERT outperforms the baselines and the BERT kNN variants; this trend, however, does not hold for greater numbers of classes, where using fine-tuned BERT embeddings with kNN outperforms the other methods.
# of classes | random | knn-bow | knn-pre-bert | bert | knn-bert |
---|---|---|---|---|---|
10 | 0.098 | 0.481 | 0.481 | **0.87** | 0.614 |
20 | 0.035 | 0.335 | 0.455 | **0.63** | 0.44 |
50 | 0.022 | 0.182 | 0.328 | **0.381** | 0.28 |
100 | 0.009 | 0.118 | 0.224 | 0.219 | **0.25** |
200 | 0.004 | 0.08 | 0.157 | **0.181** | 0.174 |
300 | 0.003 | 0.065 | 0.127 | 0.155 | **0.165** |
400 | 0.002 | 0.06 | **0.116** | 0.001 | 0.055 |
500 | 0.002 | 0.056 | 0.095 | 0.065 | **0.103** |
643 | 0.001 | 0.056 | 0.088 | 0.073 | **0.104** |
Accuracy of different multi-class classification methods for different numbers of classes. The first column gives the number of classes each method is trained/evaluated on. knn-bow refers to the BOW representation with a kNN classifier, knn-pre-bert to pre-trained BERT embeddings with a kNN classifier, bert to the fully fine-tuned BERT model, and knn-bert to the fine-tuned embeddings with a kNN classifier. The best result for each number of classes is highlighted in bold.
In addition to the functionality and experiments mentioned above, we conducted the following baseline experiments.
These experiments are conducted using the k-nearest neighbors (kNN) classifier (implementation).
We experiment with different combinations of data representations and distance/similarity measures, as well as with additional structural features (number of lines in the lyrics, length of the lyrics in words).
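As a sketch of the structural-feature variant (the exact scaling and concatenation in our code may differ), the two features can be computed per song and appended to the TF-IDF vector:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def structural_features(lyrics):
    """Two structural features: number of lines and length in words."""
    return np.array([len(lyrics.splitlines()), len(lyrics.split())], dtype=float)

# Hypothetical usage; train_lyrics stands in for your list of lyric strings.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(train_lyrics).toarray()
extra = np.stack([structural_features(song) for song in train_lyrics])
features = np.hstack([tfidf, extra])  # TF-IDF plus structural features per song
```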
We evaluate our experiments using Accuracy as well as micro-averaged Precision, Recall and F1-Score. Find these, as well as implementations of macro-averaged metrics, here.
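For reference, the same metrics can be computed with scikit-learn (a sketch, not our implementation; y_true and y_pred stand in for gold and predicted artist labels):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

acc = accuracy_score(y_true, y_pred)
# Micro-averaged precision, recall and F1 over all classes.
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="micro", zero_division=0)
```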
We use two kinds of representations for the lyrics (the set-based similarity measures are sketched in the code below):
- Bags of Words (BOW) implemented as sets (implementation)
  - Distance/similarity measures: Jaccard, Sørensen-Dice, Tversky
- TF-IDF vectors (implementation)
  - Distance/similarity measures: Cosine, Euclidean
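A minimal sketch of the set-based BOW representation and the three set similarity measures (the Tversky alpha/beta defaults are illustrative, not necessarily the values used in our experiments):

```python
def bow(lyrics):
    """Set-based bag of words: the set of lowercased tokens in the lyrics."""
    return set(lyrics.lower().split())

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|"""
    return len(a & b) / len(a | b) if a | b else 0.0

def sorensen_dice(a, b):
    """2 |A ∩ B| / (|A| + |B|)"""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

def tversky(a, b, alpha=0.5, beta=0.5):
    """Tversky index; alpha = beta = 0.5 reduces to Sørensen-Dice."""
    inter = len(a & b)
    denom = inter + alpha * len(a - b) + beta * len(b - a)
    return inter / denom if denom else 0.0

print(jaccard(bow("hello hello world"), bow("hello there")))
```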
Find the experiments and results, as well as a tutorial on how to use our code, in these two notebooks:
Representation | Distance/similarity metric | Acc (10k train) | Micro-F1 (10k train) | Acc (20k train) | Micro-F1 (20k train) |
---|---|---|---|---|---|
Random choice | – | 0.004 | 0.007 | 0.001 | 0.002 |
BOW | Jaccard | 0.07 | 0.115 | 0.09 | 0.149 |
BOW | Sørensen-Dice | 0.07 | 0.115 | 0.09 | 0.149 |
BOW | Tversky | 0.06 | 0.094 | 0.05 | 0.083 |
TF-IDF | Cosine | 0.07 | 0.108 | 0.08 | 0.125 |
TF-IDF | Euclidean | 0.01 | 0.018 | 0.04 | 0.071 |
TF-IDF + stylometric features | Cosine | 0.02 | 0.032 | 0.01 | 0.017 |
TF-IDF + stylometric features | Euclidean | 0.01 | 0.017 | 0.01 | 0.018 |