Added DVC with remote storage #5

Merged
merged 3 commits into from
May 31, 2023
3 changes: 3 additions & 0 deletions .dvc/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
/config.local
/tmp
/cache
2 changes: 2 additions & 0 deletions .dvc/config
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
['remote "gdrive_remote"']
    url = gdrive://1NY8yEl6N1ZhE-q9jnEt6G6cqIHyEiCKc
3 changes: 3 additions & 0 deletions .dvcignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
# Add patterns of files dvc should ignore, which could improve
# the performance. Learn more at
# https://dvc.org/doc/user-guide/dvcignore
2 changes: 1 addition & 1 deletion .pylintrc
Original file line number Diff line number Diff line change
@@ -193,7 +193,7 @@ evaluation=10.0 - ((float(5 * error + warning + refactor + convention) / stateme
 # Set the output format. Available formats are text, parseable, colorized, json
 # and msvs (visual studio). You can also give a reporter class, e.g.
 # mypackage.mymodule.MyReporterClass.
-output-format=text:data/reports/report.txt,colorized
+output-format=text:reports/pylint_report.txt,colorized

 # Tells whether to display a full report or only the messages.
 reports=y
1 change: 1 addition & 0 deletions data/external/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
*.tsv
2 changes: 2 additions & 0 deletions data/models/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
c1_BoW_Sentiment_Model.pkl
c2_Classifier_Sentiment_Model
Binary file removed data/models/c1_BoW_Sentiment_Model.pkl
Binary file not shown.
Binary file removed data/models/c2_Classifier_Sentiment_Model
Binary file not shown.
1 change: 1 addition & 0 deletions data/processed/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
*.joblib
50 changes: 50 additions & 0 deletions dvc.lock
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
schema: '2.0'
stages:
  preprocessing:
    cmd: python src/preprocessing.py
    deps:
    - path: data/external/a1_RestaurantReviews_HistoricDump.tsv
      md5: 102f1f4193e0bdebdd6cce7f13e0a839
      size: 54686
    - path: src/preprocessing.py
      md5: b45d76ab50b20ccabfb50d591ee7ef02
      size: 2034
    outs:
    - path: data/processed/corpus.joblib
      md5: 243212bb05cce5e3fdc72bfd2826d329
      size: 31612
  load_data:
    cmd: python src/load_data.py
    deps:
    - path: src/load_data.py
      md5: e579b1f5296f89c5f22d8ac4af92e1c0
      size: 913
    outs:
    - path: data/external/a1_RestaurantReviews_HistoricDump.tsv
      md5: 102f1f4193e0bdebdd6cce7f13e0a839
      size: 54686
  training:
    cmd: python src/training.py
    deps:
    - path: data/external/a1_RestaurantReviews_HistoricDump.tsv
      md5: 102f1f4193e0bdebdd6cce7f13e0a839
      size: 54686
    - path: data/processed/corpus.joblib
      md5: 243212bb05cce5e3fdc72bfd2826d329
      size: 31612
    - path: src/evaluation.py
      md5: 96c08113733680243cbc537a93cc128d
      size: 396
    - path: src/training.py
      md5: 81ddde09ae93959e83afb4bae0ddd90a
      size: 2073
    outs:
    - path: data/models/c1_BoW_Sentiment_Model.pkl
      md5: 47e4584e52d616cbb5af92f988648e27
      size: 39823
    - path: data/models/c2_Classifier_Sentiment_Model
      md5: e6e6744062a1d370a585d15df7f45934
      size: 46127
    - path: reports/model_evaluation.txt
      md5: 35b131f5c189995225c586a8ae7025d9
      size: 67
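Review note: every `deps`/`outs` entry in `dvc.lock` pairs a path with an `md5` and a `size`. The sketch below shows one way such an entry could be checked against a file on disk. `file_fingerprint` and `matches_lock_entry` are hypothetical helpers written for this note, not DVC APIs, and DVC's own hashing may differ in details, so treat this as an illustration of the idea rather than a reimplementation.

```python
import hashlib
import os


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (md5 hex digest, size in bytes) for the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large artifacts do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)


def matches_lock_entry(path, entry):
    """Compare a file with a lock-style entry holding 'md5' and 'size' keys."""
    md5, size = file_fingerprint(path)
    return md5 == entry["md5"] and size == entry["size"]
```

A stage whose recorded entries all still match would not need to be rerun; a mismatch on any dependency is what signals that the stage is stale.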
25 changes: 25 additions & 0 deletions dvc.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
stages:
  load_data:
    cmd: python src/load_data.py
    deps:
    - src/load_data.py
    outs:
    - data/external/a1_RestaurantReviews_HistoricDump.tsv
  preprocessing:
    cmd: python src/preprocessing.py
    deps:
    - src/preprocessing.py
    - data/external/a1_RestaurantReviews_HistoricDump.tsv
    outs:
    - data/processed/corpus.joblib
  training:
    cmd: python src/training.py
    deps:
    - src/training.py
    - src/evaluation.py
    - data/external/a1_RestaurantReviews_HistoricDump.tsv
    - data/processed/corpus.joblib
    outs:
    - data/models/c1_BoW_Sentiment_Model.pkl
    - data/models/c2_Classifier_Sentiment_Model
    - reports/model_evaluation.txt
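Review note: the three stages in `dvc.yaml` are linked only through their `deps` and `outs`, so the run order follows from matching each dependency to the stage that produces it. The sketch below derives that order with the standard-library `graphlib` module (Python 3.9+). The `STAGES` dict is a hand-copied mirror of the file and `stage_order` is a hypothetical helper; this is not how DVC itself is implemented.

```python
from graphlib import TopologicalSorter

# Hand-copied from dvc.yaml for illustration.
STAGES = {
    "load_data": {
        "deps": ["src/load_data.py"],
        "outs": ["data/external/a1_RestaurantReviews_HistoricDump.tsv"],
    },
    "preprocessing": {
        "deps": [
            "src/preprocessing.py",
            "data/external/a1_RestaurantReviews_HistoricDump.tsv",
        ],
        "outs": ["data/processed/corpus.joblib"],
    },
    "training": {
        "deps": [
            "src/training.py",
            "src/evaluation.py",
            "data/external/a1_RestaurantReviews_HistoricDump.tsv",
            "data/processed/corpus.joblib",
        ],
        "outs": [
            "data/models/c1_BoW_Sentiment_Model.pkl",
            "data/models/c2_Classifier_Sentiment_Model",
            "reports/model_evaluation.txt",
        ],
    },
}


def stage_order(stages):
    """Return stage names ordered so that producers run before consumers."""
    # Map every output path to the stage that writes it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # A stage depends on whichever stages produce its deps; plain source
    # files have no producer and are ignored here.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(stage_order(STAGES))  # ['load_data', 'preprocessing', 'training']
```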
1 change: 1 addition & 0 deletions reports/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
/model_evaluation.txt
102 changes: 102 additions & 0 deletions reports/pylint_report.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,102 @@


Report
======
109 statements analysed.

Statistics by type
------------------

+---------+-------+-----------+-----------+------------+---------+
|type |number |old number |difference |%documented |%badname |
+=========+=======+===========+===========+============+=========+
|module |7 |7 |= |100.00 |0.00 |
+---------+-------+-----------+-----------+------------+---------+
|class |1 |1 |= |100.00 |0.00 |
+---------+-------+-----------+-----------+------------+---------+
|method |3 |3 |= |100.00 |0.00 |
+---------+-------+-----------+-----------+------------+---------+
|function |3 |3 |= |100.00 |0.00 |
+---------+-------+-----------+-----------+------------+---------+



External dependencies
---------------------
::

    joblib (src.main,src.preprocessing,src.training)
    nltk (src.preprocessing)
      \-corpus (src.preprocessing)
      \-stem
        \-porter (src.preprocessing)
    pandas (src.preprocessing,src.training)
    sklearn
      \-feature_extraction
      | \-text (src.training)
      \-metrics (src.evaluation)
      \-model_selection (src.classification,src.training)
      \-naive_bayes (src.classification,src.training)



Raw metrics
-----------

+----------+-------+------+---------+-----------+
|type |number |% |previous |difference |
+==========+=======+======+=========+===========+
|code |122 |49.19 |122 |= |
+----------+-------+------+---------+-----------+
|docstring |32 |12.90 |32 |= |
+----------+-------+------+---------+-----------+
|comment |37 |14.92 |37 |= |
+----------+-------+------+---------+-----------+
|empty |57 |22.98 |57 |= |
+----------+-------+------+---------+-----------+



Duplication
-----------

+-------------------------+------+---------+-----------+
| |now |previous |difference |
+=========================+======+=========+===========+
|nb duplicated lines |0 |0 |0 |
+-------------------------+------+---------+-----------+
|percent duplicated lines |0.000 |0.000 |= |
+-------------------------+------+---------+-----------+



Messages by category
--------------------

+-----------+-------+---------+-----------+
|type |number |previous |difference |
+===========+=======+=========+===========+
|convention |0 |1 |1 |
+-----------+-------+---------+-----------+
|refactor |0 |0 |0 |
+-----------+-------+---------+-----------+
|warning |0 |0 |0 |
+-----------+-------+---------+-----------+
|error |0 |0 |0 |
+-----------+-------+---------+-----------+



Messages
--------

+-----------+------------+
|message id |occurrences |
+===========+============+




-------------------------------------------------------------------
Your code has been rated at 10.00/10 (previous run: 9.91/10, +0.09)

6 changes: 5 additions & 1 deletion requirements.txt
Original file line number Diff line number Diff line change
@@ -2,4 +2,8 @@ joblib==1.1.1
 nltk==3.7
 scikit_learn==1.2.2
 setuptools==45.2.0
-
+dvc==2.58.1
+dvc_gdrive==2.19.2
+pylint==2.12.2
+mllint==0.12.2
+dslinter==2.0.9
8 changes: 1 addition & 7 deletions setup.py
Original file line number Diff line number Diff line change
@@ -4,10 +4,6 @@

 from setuptools import setup, find_packages

-requirements = [ ]
-
-test_requirements = [ ]
-
 setup(
     author="Team 08",
     python_requires='>=3.6',
@@ -21,15 +17,13 @@
         'Programming Language :: Python :: 3.7',
         'Programming Language :: Python :: 3.8',
     ],
-    description="The model-training repository of Team 08 for the Release Engineering for Machine Learning (CS4295) course at the TU Delft.",
-    install_requires=requirements,
+    description="The model-training repository of Team 08 for the CS4295 course at the TU Delft.",
     license="MIT license",
     include_package_data=True,
     keywords='model_training',
     name='model_training',
     packages=find_packages(include=['model_training', 'model_training.*']),
     test_suite='tests',
-    tests_require=test_requirements,
     url='https://github.com/remla23-team08/model-training',
     version='0.2.0',
     zip_safe=False,
7 changes: 4 additions & 3 deletions src/evaluation.py
Original file line number Diff line number Diff line change
@@ -1,16 +1,17 @@
 #! /usr/bin/env

 """
-Functions related to model evaluation
+Evaluate the model and return results
 """

 from sklearn.metrics import confusion_matrix, accuracy_score


 def model_eval(classifier, X_test, y_test):
     """
-    Prints model evaluation metrics
+    Returns model evaluation metrics
     """
     y_pred = classifier.predict(X_test)
     conf_matrix = confusion_matrix(y_test, y_pred)
-    print(conf_matrix, accuracy_score(y_test, y_pred))
+    acc_score = accuracy_score(y_test, y_pred)
+    return conf_matrix, acc_score
29 changes: 23 additions & 6 deletions src/load_data.py
Original file line number Diff line number Diff line change
@@ -1,13 +1,30 @@
-#! /usr/bin/env
+# #! /usr/bin/env

 """
 This script loads data from the dataset_path into a pandas dataset.
 """

-import pandas as pd
+import os
+import urllib.request
+import zipfile

-def load_data(dataset_path):
-    """Function loading data from dataset_path into pandas dataset"""
-    dataset = pd.read_csv(dataset_path, delimiter = '\t', quoting = 3, dtype={'Review': object, 'Liked': int})[:]

-    return dataset
+if __name__ == "__main__":
+    # Specify the relative path to data tsv
+    root_path = os.path.dirname(os.path.abspath(__file__))
+    dataset_path = os.path.join(root_path, '..', 'data', 'external', 'a1_RestaurantReviews_HistoricDump.tsv')
+
+    # Import the data from external source
+    print("Importing external dataset..")
+    URL = r'https://drive.google.com/uc?export=download&id=1G7rLkSloPUzkK4zCzb9lLR0zSYygu8mK'
+    zip_path, _ = urllib.request.urlretrieve(URL)
+
+    # Define export path for dataset
+    export_path = os.path.dirname(os.path.abspath(dataset_path))
+
+    # Unzip at export path
+    with zipfile.ZipFile(zip_path, "r") as f:
+        f.extractall(export_path)
+
+    # Print success to console
+    print("External dataset successfully imported!")
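Review note: the script above extracts whatever the downloaded archive contains and reports success without checking that the expected TSV is present. The sketch below shows one way the extraction step could verify that first. `extract_dataset` is a hypothetical helper, and it assumes the TSV sits at the top level of the archive, which the PR does not confirm.

```python
import os
import zipfile


def extract_dataset(zip_path, export_dir, expected_name):
    """Extract zip_path into export_dir and return the path of expected_name.

    Raises FileNotFoundError if the archive does not contain expected_name,
    so a wrong or corrupted download is caught before the next stage runs.
    """
    os.makedirs(export_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as archive:
        if expected_name not in archive.namelist():
            raise FileNotFoundError(f"{expected_name} is not in {zip_path}")
        archive.extractall(export_dir)
    return os.path.join(export_dir, expected_name)
```

Called with the `zip_path` and `export_path` from the script and the dataset's file name, this would fail loudly instead of leaving the `preprocessing` stage to discover the missing file.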
67 changes: 40 additions & 27 deletions src/preprocessing.py
Original file line number Diff line number Diff line change
@@ -5,43 +5,56 @@
 """

 import os
-import pickle
 import re
 import nltk
 from nltk.corpus import stopwords
 from nltk.stem.porter import PorterStemmer
-from sklearn.feature_extraction.text import CountVectorizer
+import pandas as pd
+import joblib

-def data_preprocessing(dataset):
-    """
-    Main preprocessing steps for ML data
-    """
-    nltk.download('stopwords')
-    porter_stem = PorterStemmer()

-    all_stopwords = stopwords.words('english')
-    all_stopwords.remove('not')
+class Preprocessing:
+    """Class to easily preprocess datasets"""

-    corpus=[]
-    for i in range(0, len(dataset)):
-        review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
+    def __init__(self):
+        """Initialize preprocess class"""
+        nltk.download('stopwords')
+        self.porter_stem = PorterStemmer()
+        self.all_stopwords = stopwords.words('english')
+        self.all_stopwords.remove('not')
+
+        self.dataset = None
+        self.count_vectorizer = None
+
+    def preprocess_dataset(self, dataset):
+        """Loop over entire dataset to preprocess"""
+        corpus = []
+        for i in range(0, len(dataset)):
+            corpus.append(self.preprocess_review(dataset['Review'][i]))
+        return corpus
+
+    def preprocess_review(self, review):
+        """Processing a single review"""
+        review = re.sub('[^a-zA-Z]', ' ', review)
         review = review.lower()
         review = review.split()
-        review = [porter_stem.stem(word) for word in review if not word in set(all_stopwords)]
+        review = [self.porter_stem.stem(word) for word in review if not word in set(self.all_stopwords)]
         review = ' '.join(review)
-        corpus.append(review)
+        return review

-    # Use count vectoriser to transform dataset
-    count_vectoriser = CountVectorizer(max_features = 1420)
-    X = count_vectoriser.fit_transform(corpus).toarray()
-    y = dataset.iloc[:, -1].values

-    # Get the root path of the current script, and bow path to save dictionary later
+if __name__ == "__main__":
+    # Specify the relative path to data tsv
     root_path = os.path.dirname(os.path.abspath(__file__))
-    bow_path = os.path.join(root_path, '..', 'data', 'models', 'c1_BoW_Sentiment_Model.pkl')
-
-    # Saving BoW dictionary to later use in prediction
-    with open(bow_path, "wb") as file:
-        pickle.dump(count_vectoriser, file)
-
-    return X, y
+    dataset_path = os.path.join(root_path, '..', 'data', 'external', 'a1_RestaurantReviews_HistoricDump.tsv')
+
+    # Load data from file
+    load_dataset = pd.read_csv(dataset_path, delimiter = '\t', quoting = 3, dtype={'Review': object, 'Liked': int})[:]
+
+    # Preprocess and store processed corpus in joblib
+    print("Preprocessing the dataset...")
+    preprocess_class = Preprocessing()
+    save_corpus = preprocess_class.preprocess_dataset(load_dataset)
+    corpus_path = os.path.join(root_path, '..', 'data/processed/corpus.joblib')
+    joblib.dump(save_corpus, corpus_path)
+    print(f"Processed dataset (corpus) is saved to: {corpus_path}")
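Review note: `preprocess_review` keeps letters only, lowercases, splits on whitespace, drops stopwords, stems, and rejoins the tokens. The sketch below reproduces the same steps without `nltk` so the effect is easy to see in isolation. `clean_review` and the tiny `STOPWORDS` set are hypothetical stand-ins (the PR uses nltk's English stopwords with 'not' removed), and the Porter stemming step is deliberately omitted.

```python
import re

# Hypothetical, tiny stopword set for illustration only; 'not' is kept out of
# it on purpose, mirroring the PR's removal of 'not' from the stopword list.
STOPWORDS = frozenset({"the", "was", "a", "is", "and"})


def clean_review(review, stopwords=STOPWORDS):
    """Keep letters only, lowercase, split, and drop stopwords (no stemming)."""
    review = re.sub("[^a-zA-Z]", " ", review)
    tokens = review.lower().split()
    return " ".join(word for word in tokens if word not in stopwords)


print(clean_review("The food was NOT good!!"))  # food not good
```

Keeping 'not' in the text matters for sentiment: dropping it would turn "not good" into "good".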