Added DVC with remote storage (#5)
* Implemented DVC pipeline and added remote storage

* Re-attained perfect pylint score
JvanderSaag authored May 31, 2023
1 parent 346c4e4 commit 2d767b7
Showing 19 changed files with 319 additions and 45 deletions.
3 changes: 3 additions & 0 deletions .dvc/.gitignore
@@ -0,0 +1,3 @@
/config.local
/tmp
/cache
2 changes: 2 additions & 0 deletions .dvc/config
@@ -0,0 +1,2 @@
['remote "gdrive_remote"']
    url = gdrive://1NY8yEl6N1ZhE-q9jnEt6G6cqIHyEiCKc
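With the remote configured, any DVC-tracked artifact can also be fetched outside a full `dvc pull`. A minimal sketch using DVC's Python API, assuming dvc==2.58.1 and dvc_gdrive==2.19.2 from requirements.txt are installed and the Google Drive folder is readable by the caller; only this repository's paths are filled in:

# Sketch: read a DVC-tracked file through the "gdrive_remote" remote.
# Assumes dvc==2.58.1 and dvc_gdrive==2.19.2 are installed and the
# Drive folder is accessible to the caller.
import dvc.api

data = dvc.api.read(
    "data/external/a1_RestaurantReviews_HistoricDump.tsv",
    repo="https://github.com/remla23-team08/model-training",
    remote="gdrive_remote",
)
print(data[:100])  # first characters of the raw TSV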
3 changes: 3 additions & 0 deletions .dvcignore
@@ -0,0 +1,3 @@
# Add patterns of files dvc should ignore, which could improve
# the performance. Learn more at
# https://dvc.org/doc/user-guide/dvcignore
2 changes: 1 addition & 1 deletion .pylintrc
@@ -193,7 +193,7 @@ evaluation=10.0 - ((float(5 * error + warning + refactor + convention) / statement) * 10)
 # Set the output format. Available formats are text, parseable, colorized, json
 # and msvs (visual studio). You can also give a reporter class, e.g.
 # mypackage.mymodule.MyReporterClass.
-output-format=text:data/reports/report.txt,colorized
+output-format=text:reports/pylint_report.txt,colorized
 
 # Tells whether to display a full report or only the messages.
 reports=y
1 change: 1 addition & 0 deletions data/external/.gitignore
@@ -0,0 +1 @@
*.tsv
2 changes: 2 additions & 0 deletions data/models/.gitignore
@@ -0,0 +1,2 @@
c1_BoW_Sentiment_Model.pkl
c2_Classifier_Sentiment_Model
Binary file removed data/models/c1_BoW_Sentiment_Model.pkl
Binary file removed data/models/c2_Classifier_Sentiment_Model
1 change: 1 addition & 0 deletions data/processed/.gitignore
@@ -0,0 +1 @@
*.joblib
50 changes: 50 additions & 0 deletions dvc.lock
@@ -0,0 +1,50 @@
schema: '2.0'
stages:
  preprocessing:
    cmd: python src/preprocessing.py
    deps:
    - path: data/external/a1_RestaurantReviews_HistoricDump.tsv
      md5: 102f1f4193e0bdebdd6cce7f13e0a839
      size: 54686
    - path: src/preprocessing.py
      md5: b45d76ab50b20ccabfb50d591ee7ef02
      size: 2034
    outs:
    - path: data/processed/corpus.joblib
      md5: 243212bb05cce5e3fdc72bfd2826d329
      size: 31612
  load_data:
    cmd: python src/load_data.py
    deps:
    - path: src/load_data.py
      md5: e579b1f5296f89c5f22d8ac4af92e1c0
      size: 913
    outs:
    - path: data/external/a1_RestaurantReviews_HistoricDump.tsv
      md5: 102f1f4193e0bdebdd6cce7f13e0a839
      size: 54686
  training:
    cmd: python src/training.py
    deps:
    - path: data/external/a1_RestaurantReviews_HistoricDump.tsv
      md5: 102f1f4193e0bdebdd6cce7f13e0a839
      size: 54686
    - path: data/processed/corpus.joblib
      md5: 243212bb05cce5e3fdc72bfd2826d329
      size: 31612
    - path: src/evaluation.py
      md5: 96c08113733680243cbc537a93cc128d
      size: 396
    - path: src/training.py
      md5: 81ddde09ae93959e83afb4bae0ddd90a
      size: 2073
    outs:
    - path: data/models/c1_BoW_Sentiment_Model.pkl
      md5: 47e4584e52d616cbb5af92f988648e27
      size: 39823
    - path: data/models/c2_Classifier_Sentiment_Model
      md5: e6e6744062a1d370a585d15df7f45934
      size: 46127
    - path: reports/model_evaluation.txt
      md5: 35b131f5c189995225c586a8ae7025d9
      size: 67
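Each entry in dvc.lock pins a dependency or output by MD5 hash and byte size; `dvc repro` reruns a stage only when one of those fingerprints changes. A sketch of the same fingerprinting idea in plain Python (this mirrors the concept, not DVC's internal implementation):

# Sketch: fingerprint a file the way dvc.lock records it (md5 + size).
import hashlib
import os

def fingerprint(path):
    """Return (md5 hexdigest, size in bytes) for the file at path."""
    md5 = hashlib.md5()
    with open(path, "rb") as file:
        for chunk in iter(lambda: file.read(8192), b""):
            md5.update(chunk)
    return md5.hexdigest(), os.path.getsize(path)

# A stage is stale when a dependency's current fingerprint no longer
# matches the md5/size pair stored in dvc.lock.
print(fingerprint("data/external/a1_RestaurantReviews_HistoricDump.tsv"))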
25 changes: 25 additions & 0 deletions dvc.yaml
@@ -0,0 +1,25 @@
stages:
  load_data:
    cmd: python src/load_data.py
    deps:
    - src/load_data.py
    outs:
    - data/external/a1_RestaurantReviews_HistoricDump.tsv
  preprocessing:
    cmd: python src/preprocessing.py
    deps:
    - src/preprocessing.py
    - data/external/a1_RestaurantReviews_HistoricDump.tsv
    outs:
    - data/processed/corpus.joblib
  training:
    cmd: python src/training.py
    deps:
    - src/training.py
    - src/evaluation.py
    - data/external/a1_RestaurantReviews_HistoricDump.tsv
    - data/processed/corpus.joblib
    outs:
    - data/models/c1_BoW_Sentiment_Model.pkl
    - data/models/c2_Classifier_Sentiment_Model
    - reports/model_evaluation.txt
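The three stages form a small DAG: load_data produces the raw TSV, preprocessing turns it into the corpus, and training consumes both to emit the two model files and the evaluation report. A quick way to inspect that wiring, sketched with PyYAML (a dependency DVC already pulls in):

# Sketch: print each stage's command, dependencies and outputs.
import yaml  # PyYAML, installed as a DVC dependency

with open("dvc.yaml", encoding="utf-8") as file:
    pipeline = yaml.safe_load(file)

for name, stage in pipeline["stages"].items():
    print(f"{name}: {stage['cmd']}")
    for dep in stage.get("deps", []):
        print(f"  <- {dep}")
    for out in stage.get("outs", []):
        print(f"  -> {out}")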
1 change: 1 addition & 0 deletions reports/.gitignore
@@ -0,0 +1 @@
/model_evaluation.txt
102 changes: 102 additions & 0 deletions reports/pylint_report.txt
@@ -0,0 +1,102 @@


Report
======
109 statements analysed.

Statistics by type
------------------

+---------+-------+-----------+-----------+------------+---------+
|type |number |old number |difference |%documented |%badname |
+=========+=======+===========+===========+============+=========+
|module |7 |7 |= |100.00 |0.00 |
+---------+-------+-----------+-----------+------------+---------+
|class |1 |1 |= |100.00 |0.00 |
+---------+-------+-----------+-----------+------------+---------+
|method |3 |3 |= |100.00 |0.00 |
+---------+-------+-----------+-----------+------------+---------+
|function |3 |3 |= |100.00 |0.00 |
+---------+-------+-----------+-----------+------------+---------+



External dependencies
---------------------
::

    joblib (src.main,src.preprocessing,src.training)
    nltk (src.preprocessing)
      \-corpus (src.preprocessing)
      \-stem
      | \-porter (src.preprocessing)
    pandas (src.preprocessing,src.training)
    sklearn
      \-feature_extraction
      | \-text (src.training)
      \-metrics (src.evaluation)
      \-model_selection (src.classification,src.training)
      \-naive_bayes (src.classification,src.training)



Raw metrics
-----------

+----------+-------+------+---------+-----------+
|type |number |% |previous |difference |
+==========+=======+======+=========+===========+
|code |122 |49.19 |122 |= |
+----------+-------+------+---------+-----------+
|docstring |32 |12.90 |32 |= |
+----------+-------+------+---------+-----------+
|comment |37 |14.92 |37 |= |
+----------+-------+------+---------+-----------+
|empty |57 |22.98 |57 |= |
+----------+-------+------+---------+-----------+



Duplication
-----------

+-------------------------+------+---------+-----------+
| |now |previous |difference |
+=========================+======+=========+===========+
|nb duplicated lines |0 |0 |0 |
+-------------------------+------+---------+-----------+
|percent duplicated lines |0.000 |0.000 |= |
+-------------------------+------+---------+-----------+



Messages by category
--------------------

+-----------+-------+---------+-----------+
|type |number |previous |difference |
+===========+=======+=========+===========+
|convention |0 |1 |1 |
+-----------+-------+---------+-----------+
|refactor |0 |0 |0 |
+-----------+-------+---------+-----------+
|warning |0 |0 |0 |
+-----------+-------+---------+-----------+
|error |0 |0 |0 |
+-----------+-------+---------+-----------+



Messages
--------

+-----------+------------+
|message id |occurrences |
+===========+============+




-------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 9.91/10, +0.09)
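This file is what the output-format change in .pylintrc now produces on every run. For reference, a hedged sketch of triggering the same lint run from Python instead of the CLI (pylint 2.x API; the exit keyword has shifted names across versions):

# Sketch: run pylint on src/ programmatically; with the .pylintrc above
# this also writes reports/pylint_report.txt via the text reporter.
from pylint.lint import Run

Run(["src"], exit=False)  # exit=False avoids sys.exit() after linting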

6 changes: 5 additions & 1 deletion requirements.txt
@@ -2,4 +2,8 @@ joblib==1.1.1
 nltk==3.7
 scikit_learn==1.2.2
 setuptools==45.2.0
-
+dvc==2.58.1
+dvc_gdrive==2.19.2
+pylint==2.12.2
+mllint==0.12.2
+dslinter==2.0.9
8 changes: 1 addition & 7 deletions setup.py
@@ -4,10 +4,6 @@
 
 from setuptools import setup, find_packages
 
-requirements = [ ]
-
-test_requirements = [ ]
-
 setup(
     author="Team 08",
     python_requires='>=3.6',
@@ -21,15 +17,13 @@
         'Programming Language :: Python :: 3.7',
         'Programming Language :: Python :: 3.8',
     ],
-    description="The model-training repository of Team 08 for the Release Engineering for Machine Learning (CS4295) course at the TU Delft.",
-    install_requires=requirements,
+    description="The model-training repository of Team 08 for the CS4295 course at the TU Delft.",
     license="MIT license",
     include_package_data=True,
     keywords='model_training',
     name='model_training',
     packages=find_packages(include=['model_training', 'model_training.*']),
     test_suite='tests',
-    tests_require=test_requirements,
     url='https://github.com/remla23-team08/model-training',
     version='0.2.0',
     zip_safe=False,
7 changes: 4 additions & 3 deletions src/evaluation.py
@@ -1,16 +1,17 @@
 #! /usr/bin/env
 
 """
-Functions related to model evaluation
+Evaluate the model and return results
 """
 
 from sklearn.metrics import confusion_matrix, accuracy_score
 
 
 def model_eval(classifier, X_test, y_test):
     """
-    Prints model evaluation metrics
+    Returns model evaluation metrics
     """
     y_pred = classifier.predict(X_test)
     conf_matrix = confusion_matrix(y_test, y_pred)
-    print(conf_matrix, accuracy_score(y_test, y_pred))
+    acc_score = accuracy_score(y_test, y_pred)
+    return conf_matrix, acc_score
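Returning the metrics instead of printing them lets the training stage persist them to reports/model_evaluation.txt. A usage sketch on synthetic data (the dataset and the import path here are illustrative assumptions; only the call shape matches the repo):

# Sketch: exercise model_eval on toy data, not the review dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

from src.evaluation import model_eval  # import path assumed from this repo layout

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifier = GaussianNB().fit(X_train, y_train)
conf_matrix, acc_score = model_eval(classifier, X_test, y_test)
print(conf_matrix)
print(f"accuracy: {acc_score:.3f}")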
29 changes: 23 additions & 6 deletions src/load_data.py
@@ -1,13 +1,30 @@
-#! /usr/bin/env
+# #! /usr/bin/env
 
 """
 This script loads data from the dataset_path into a pandas dataset.
 """
 
-import pandas as pd
+import os
+import urllib.request
+import zipfile
 
-def load_data(dataset_path):
-    """Function loading data from dataset_path into pandas dataset"""
-    dataset = pd.read_csv(dataset_path, delimiter = '\t', quoting = 3, dtype={'Review': object, 'Liked': int})[:]
-
-    return dataset
+if __name__ == "__main__":
+    # Specify the relative path to data tsv
+    root_path = os.path.dirname(os.path.abspath(__file__))
+    dataset_path = os.path.join(root_path, '..', 'data', 'external', 'a1_RestaurantReviews_HistoricDump.tsv')
+
+    # Import the data from external source
+    print("Importing external dataset..")
+    URL = r'https://drive.google.com/uc?export=download&id=1G7rLkSloPUzkK4zCzb9lLR0zSYygu8mK'
+    zip_path, _ = urllib.request.urlretrieve(URL)
+
+    # Define export path for dataset
+    export_path = os.path.dirname(os.path.abspath(dataset_path))
+
+    # Unzip at export path
+    with zipfile.ZipFile(zip_path, "r") as f:
+        f.extractall(export_path)
+
+    # Print success to console
+    print("External dataset successfully imported!")
67 changes: 40 additions & 27 deletions src/preprocessing.py
@@ -5,43 +5,56 @@
 """
 
 import os
-import pickle
 import re
 import nltk
 from nltk.corpus import stopwords
 from nltk.stem.porter import PorterStemmer
-from sklearn.feature_extraction.text import CountVectorizer
+import pandas as pd
+import joblib
 
-def data_preprocessing(dataset):
-    """
-    Main preprocessing steps for ML data
-    """
-    nltk.download('stopwords')
-    porter_stem = PorterStemmer()
-
-    all_stopwords = stopwords.words('english')
-    all_stopwords.remove('not')
+class Preprocessing:
+    """Class to easily preprocess datasets"""
 
-    corpus=[]
-    for i in range(0, len(dataset)):
-        review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
+    def __init__(self):
+        """Initialize preprocess class"""
+        nltk.download('stopwords')
+        self.porter_stem = PorterStemmer()
+        self.all_stopwords = stopwords.words('english')
+        self.all_stopwords.remove('not')
+
+        self.dataset = None
+        self.count_vectorizer = None
+
+    def preprocess_dataset(self, dataset):
+        """Loop over entire dataset to preprocess"""
+        corpus = []
+        for i in range(0, len(dataset)):
+            corpus.append(self.preprocess_review(dataset['Review'][i]))
+        return corpus
+
+    def preprocess_review(self, review):
+        """Processing a single review"""
+        review = re.sub('[^a-zA-Z]', ' ', review)
         review = review.lower()
         review = review.split()
-        review = [porter_stem.stem(word) for word in review if not word in set(all_stopwords)]
+        review = [self.porter_stem.stem(word) for word in review if not word in set(self.all_stopwords)]
        review = ' '.join(review)
-        corpus.append(review)
+        return review
 
-    # Use count vectoriser to transform dataset
-    count_vectoriser = CountVectorizer(max_features = 1420)
-    X = count_vectoriser.fit_transform(corpus).toarray()
-    y = dataset.iloc[:, -1].values
-
-    # Get the root path of the current script, and bow path to save dictionary later
+if __name__ == "__main__":
+    # Specify the relative path to data tsv
     root_path = os.path.dirname(os.path.abspath(__file__))
-    bow_path = os.path.join(root_path, '..', 'data', 'models', 'c1_BoW_Sentiment_Model.pkl')
-
-    # Saving BoW dictionary to later use in prediction
-    with open(bow_path, "wb") as file:
-        pickle.dump(count_vectoriser, file)
-
-    return X, y
+    dataset_path = os.path.join(root_path, '..', 'data', 'external', 'a1_RestaurantReviews_HistoricDump.tsv')
+
+    # Load data from file
+    load_dataset = pd.read_csv(dataset_path, delimiter = '\t', quoting = 3, dtype={'Review': object, 'Liked': int})[:]
+
+    # Preprocess and store processed corpus in joblib
+    print("Preprocessing the dataset...")
+    preprocess_class = Preprocessing()
+    save_corpus = preprocess_class.preprocess_dataset(load_dataset)
+    corpus_path = os.path.join(root_path, '..', 'data/processed/corpus.joblib')
+    joblib.dump(save_corpus, corpus_path)
+    print(f"Processed dataset (corpus) is saved to: {corpus_path}")