CAP 5610 - Brian Strickland (1368280)
- Collect the data
- Clean the data
- Feature extract to find known false incidents
- Use a frequency matrix to create similarities between incidents based on the short / long description (and experiment with other columns)
The data was created by generating a custom Service Portal page within ServiceNow and placing a Data Table by Instance widget onto the page with defined columns. Over 200K incidents could then be exported.
import pandas as pd
import numpy as np
#pd.options.display.float_format = '{:.20f}'.format
#incidents = pd.read_csv("incident_full.csv")
incidents = pd.read_csv("incident_and_comments.csv")
#incidents
- Remove unused columns from the data
- Only use rows that pertain to the Custom Application Development group
- Remove bogus data and handle NaN values
import re
nanColumns = ['description', 'short_description']
assignment_group = 'Custom Application Development'
# Filter down to development team items
incidents = incidents[incidents.assignment_group == assignment_group]
# Remove unnecessary features
#incidents = incidents.drop(['assignment_group', 'assigned_to', 'number', 'opened_by'], axis=1)
incidents = incidents.drop(['number', 'opened_by'], axis=1)
# Remove all NaN values from dataset
incidents = incidents.dropna(subset=nanColumns)
# Remove bogus data
incidents = incidents[incidents.description != 'asdf']
# Remove incidents with non string date
incidents = incidents[incidents['sys_created_on'].apply(lambda x: isinstance(x, str))]
incidents = incidents[incidents['closed_at'].apply(lambda x: isinstance(x, str))]
# Function to clean special characters out of the data
def clean_description(description):
    try:
        # Strip everything except letters, digits, and spaces
        return re.sub("[^a-zA-Z0-9 ]", "", description)
    except TypeError:
        # Non-string values are returned unchanged
        return description
# Clean the short description and descriptions
incidents.short_description = incidents.short_description.fillna(0)
incidents.description = incidents.description.fillna(0)
incidents["short_description"] = incidents["short_description"].apply(clean_description)
incidents["description"] = incidents["description"].apply(clean_description)
incidents["FalseIncident"] = "False"
# set random seed for reproducibility
#np.random.seed(143)
# generate a random number between 0 and the length of the dataframe
#num_true = np.random.randint(0, len(incidents))
# set that many incidents to True for the "FalseIncident" column
#incidents.loc[np.random.choice(incidents.index, size=num_true), "FalseIncident"] = "true"
incidents
category | short_description | description | assignment_group | u_inc_dept | sys_created_on | closed_at | comments_and_work_notes | FalseIncident | |||
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Software | The SAFE Form website is not properly generati... | When individuals complete a SAFE Form, there i... | Custom Application Development | NaN | 09/25/2017 04:07:49 PM | 10/13/2017 08:48:07 AM | 10/13/2017 08:48:07 AM - System (Additional co... | False | ||
1 | Software | Undergraduate Admissions Web App (OLA) file wa... | The web application load in PS failed because ... | Custom Application Development | NaN | 10/25/2017 08:26:34 AM | 10/30/2017 02:48:11 PM | 10/30/2017 02:48:11 PM - System (Additional co... | False | ||
2 | Hardware | Unable to access http://directory.sdes.ucf.edu... | Unable to access site | Custom Application Development | NaN | 10/27/2017 08:55:53 AM | 11/01/2017 11:48:06 AM | 11/01/2017 11:48:06 AM - System (Additional co... | False | ||
3 | Software | We are unable to enter redeemed vouchers into ... | This is the first time this year that we are p... | Custom Application Development | NaN | 10/27/2017 02:45:07 PM | 11/01/2017 03:48:07 PM | 11/01/2017 03:48:07 PM - System (Additional co... | False | ||
4 | Software | UA forms such as Residency, Reacts, Counselor ... | UA forms such as Residency, Reacts, Counselor ... | Custom Application Development | NaN | 10/31/2017 08:39:06 AM | 11/03/2017 01:48:08 PM | 11/03/2017 01:48:08 PM - System (Additional co... | False | ||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
351 | Software | Trying to submit changes to UCF Phonebook and ... | Details: Althea Robinson called to report she... | Custom Application Development | CCIE ADMINISTRATION | 05/18/2022 11:55:32 AM | 05/23/2022 04:48:07 PM | 05/23/2022 04:48:07 PM - System (Additional co... | False | ||
352 | Software | I need access to following link below. When I ... | User's relationship to UCF: Employee\n\nUser's... | Custom Application Development | COLLEGE OF BUSINESS DEAN | 05/25/2022 08:57:12 AM | 05/31/2022 02:48:00 PM | 05/31/2022 02:48:00 PM - System (Additional co... | False | ||
353 | Software | Cannot log into COBA Test Management | I have tried several times to log in but keep ... | Custom Application Development | NaN | 06/03/2022 04:46:22 AM | 06/23/2022 10:48:00 AM | 06/23/2022 10:48:00 AM - System (Additional co... | False | ||
354 | Software | Knights Email Acount Login and Password Reset/... | Incoming student Sydney Schumacher called in f... | Custom Application Development | NaN | 06/14/2022 01:31:45 PM | 06/20/2022 08:48:05 AM | 06/20/2022 08:48:05 AM - System (Additional co... | False | ||
355 | Software | custom app e911.it.ucf.edu not pulling data | custom app is displaying datatables error when... | Custom Application Development | UCF IT | 06/30/2022 03:48:37 PM | 07/18/2022 01:48:00 PM | 07/18/2022 01:48:00 PM - System (Additional co... | False |
355 rows Ă— 11 columns
Here we will look at various methods to identify known false incidents and record them in the FalseIncident feature on those rows. The method below makes it quick to mark identified row(s) as false incidents.
# Helper function to set the FalseIncident column to true for all rows in the dataframe based on a feature and value.
def add_false_incidents(df, feature, value):
df.loc[(df[feature] == value), 'FalseIncident'] = "true"
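For example, `add_false_incidents(incidents, 'category', 'Hardware')` (used further below) flags every Hardware-categorized incident as a false incident in a single call.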
Here we will explore incidents that were closed quickly (within an hour) and, if any are found, determine whether they are truly false incidents. From this we can learn what makes a quickly closed incident a false incident and fold that into our scoring for identifying false incidents.
# Get incidents that were closed in less than 60 minutes
from datetime import datetime
def is_quickly_closed(row):
date1 = datetime.strptime(row["sys_created_on"], '%m/%d/%Y %I:%M:%S %p')
date2 = datetime.strptime(row["closed_at"], '%m/%d/%Y %I:%M:%S %p')
diff_minutes = int((date2 - date1).total_seconds() / 60)
if diff_minutes <= 60:
return "true"
return "false"
incidents["quickly_closed"] = incidents.apply(is_quickly_closed, axis=1)
incidents[incidents['quickly_closed'] == "true"]
category | short_description | description | assignment_group | u_inc_dept | sys_created_on | closed_at | comments_and_work_notes | FalseIncident | quickly_closed | |
---|---|---|---|---|---|---|---|---|---|---|
326 | Software | Unable to reset NID password | User called in to report that he has been unab... | Custom Application Development | CCIE DEAN | 12/09/2021 09:33:23 AM | 12/09/2021 09:49:04 AM | 12/09/2021 09:49:04 AM - Yacine Tazi (Addition... | False | true |
From here we can tell that only a very small number of incidents get closed in less than an hour. This specific incident is a password reset, so let's look at the comments to see whether it was user error or whether the support center helped the user reset the password.
from IPython.display import display, HTML
# Display the comments/work notes for each row of the given dataframe
def print_comments(df):
    for i, row in df.iterrows():
        display(HTML(pd.DataFrame({'comments_and_work_notes': [row['comments_and_work_notes']]}).to_html().replace("\\n", "<br>")))
Based on the following, we can determine that this wasn't an incident that needed to be resolved by the development team, but rather an expected outage that eventually allowed the customer to log in:
- There is an intermittent outage with Self Service Reset Tool when users are trying to reset their password using email. This is not consistent behavior and will resolve itself shortly.
- Was able to log in again
With this information, we can go ahead and flag this particular incident as a FalseIncident.
with pd.option_context('display.max_colwidth', None):
print(incidents[incidents['quickly_closed'] == "true"].comments_and_work_notes)
326 12/09/2021 09:49:04 AM - Yacine Tazi (Additional comments)\nWas able to log in again\n\n12/09/2021 09:47:40 AM - System (Additional comments)\nWe are continuing to investigate the underlying issue:\n\n\nWHAT IS HAPPENING?\nThere is an intermittent outage with Self Service Reset Tool when users are trying to reset their password using email. This is not consistent behavior and will resolve itself shortly. \n\nWHO IS IMPACTED?\nAnyone that need to reset their NID password using the email functionality. \n\nWHAT ARE WE DOING ABOUT IT?\nWe are currently investigating this issue.\n\nWHAT HAPPENS NEXT?\nWe are currently investigating and will keep everyone posted once the issue is resolved. \n\nWHAT DO I NEED TO DO?\nShould users encounter this issue during password reset, please wait 15-20 minutes and try again. \n\n\n\n12/09/2021 09:34:35 AM - Diego Cruces (Work notes)\nRouting to the Custom Application Development team for further investigation.\n\n
Name: comments_and_work_notes, dtype: object
add_false_incidents(incidents, 'quickly_closed', 'true')
incidents[incidents['FalseIncident'] == "true"]
category | short_description | description | assignment_group | u_inc_dept | sys_created_on | closed_at | comments_and_work_notes | FalseIncident | quickly_closed | |
---|---|---|---|---|---|---|---|---|---|---|
326 | Software | Unable to reset NID password | User called in to report that he has been unab... | Custom Application Development | CCIE DEAN | 12/09/2021 09:33:23 AM | 12/09/2021 09:49:04 AM | 12/09/2021 09:49:04 AM - Yacine Tazi (Addition... | true | true |
Incidents categorized as Hardware are assumed to be false incidents, since the software development team doesn't handle hardware issues.
add_false_incidents(incidents, 'category', 'Hardware')
#incidents[incidents['FalseIncident'] == "true"]
# Set the selected indices to True
#incidents.loc[incidents['FalseIncident'] == False, 'FalseIncident'] = np.random.choice([True, False], size=incidents['FalseIncident'].shape[0], p=[0.5, 0.5])
# store to training data
train_data = incidents[incidents['FalseIncident'] == "true"].copy()
train_data
category | short_description | description | assignment_group | u_inc_dept | sys_created_on | closed_at | comments_and_work_notes | FalseIncident | quickly_closed | |
---|---|---|---|---|---|---|---|---|---|---|
2 | Hardware | Unable to access http://directory.sdes.ucf.edu... | Unable to access site | Custom Application Development | NaN | 10/27/2017 08:55:53 AM | 11/01/2017 11:48:06 AM | 11/01/2017 11:48:06 AM - System (Additional co... | true | false |
11 | Hardware | Students are unable to upload forms to online ... | Students are required to upload their involvem... | Custom Application Development | NaN | 12/04/2017 01:13:26 PM | 12/12/2017 11:48:05 AM | 12/12/2017 11:48:06 AM - System (Additional co... | true | false |
121 | Hardware | The Exchange Unified Messaging voicemail assig... | The Exchange Unified Messaging voicemail assig... | Custom Application Development | UCF IT | 03/13/2019 06:38:29 AM | 03/21/2019 03:48:13 PM | 03/21/2019 03:48:13 PM - System (Additional co... | true | false |
210 | Hardware | Can't access DHCP reservations or do anythong ... | I got the new URL my.it.ucf.edeu and I can get... | Custom Application Development | FINANCIAL AFFAIRS | 02/19/2020 09:41:48 AM | 02/24/2020 12:48:09 PM | 02/24/2020 12:48:09 PM - System (Additional co... | true | false |
292 | Hardware | We are not able to access lead.sdes.ucf.edu/ad... | All computers in the office are getting the sa... | Custom Application Development | SDES STU LEADERSHIP DEVELOP | 08/19/2021 08:47:17 AM | 08/26/2021 11:48:14 AM | 08/26/2021 11:48:14 AM - System (Additional co... | true | false |
326 | Software | Unable to reset NID password | User called in to report that he has been unab... | Custom Application Development | CCIE DEAN | 12/09/2021 09:33:23 AM | 12/09/2021 09:49:04 AM | 12/09/2021 09:49:04 AM - Yacine Tazi (Addition... | true | true |
Here we'll create a feature matrix based on the short description values of the known false incidents (i.e. FalseIncident == "true"). From there we can compute a similarity score against all other incidents that have been submitted to see if we can identify additional false incidents.
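As a quick illustration of the idea (a minimal sketch using made-up toy strings, not rows from the dataset), TF-IDF turns each description into a weighted term vector and cosine similarity scores their overlap:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Toy descriptions (hypothetical): the first two share terms, the third does not
toy = ["Unable to reset NID password",
       "NID password reset not working",
       "Printer is out of toner"]
vec = TfidfVectorizer()
toy_tfidf = vec.fit_transform(toy)
# Similarity of the first description against all three: itself scores 1.0,
# the second scores well above zero, and the third (no shared terms) scores 0
print(cosine_similarity(toy_tfidf[0], toy_tfidf).flatten())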
from sklearn.feature_extraction.text import TfidfVectorizer
def createTfid(train_data):
    # Choose min and max word sequences (1- to 5-grams)
    vectorizer = TfidfVectorizer(ngram_range=(1, 5))
    # Only vectorize the known false incidents passed in as train_data
    #false_incidents = incidents[incidents['FalseIncident'] == "true"]
    false_incidents = train_data.copy()
    tfid_known_false_incidents = vectorizer.fit_transform(false_incidents['short_description'])
    return tfid_known_false_incidents, false_incidents, vectorizer
# Choose min and max word sequesnces
#vectorizer = TfidfVectorizer(ngram_range=(1,5))
#change this to only vectorize on known incidents
#false_incidents = train_data.copy()
#tfid_known_false_incidents = vectorizer.fit_transform(false_incidents['short_description'] )
tfid_known_false_incidents, false_incidents, vectorizer = createTfid(train_data)
tfid_known_false_incidents
<6x316 sparse matrix of type '<class 'numpy.float64'>'
with 339 stored elements in Compressed Sparse Row format>
from sklearn.metrics.pairwise import cosine_similarity
# Then compute cosine similarity between each incident and the known false incidents
def search(incident, tfid_known_false_incidents, false_incidents, vectorizer):
    if incident.name in false_incidents.index:
        return
    desc = clean_description(incident["short_description"])
    query_vec = vectorizer.transform([desc])
    # Compare the description to the known false incidents
    similarity = cosine_similarity(query_vec, tfid_known_false_incidents).flatten()
    # Keep any items with a > 0.5 similarity
    indices = np.where(similarity > 0.5)[0]
    # Remove the current item from the list so that it doesn't get a 1.0 similarity with itself
    current_index = incident.name - 1
    indices = indices[indices != current_index]
    same_incident_indices = np.where(indices >= len(false_incidents))[0]
    indices = indices[indices < len(false_incidents)]
    results = false_incidents.iloc[indices].iloc[::-1]
    if not results.empty:
        # Add a similarity score feature
        results["similarity_score"] = similarity[indices]
        return results
# add a new column to the incidents dataframe that contains non-empty dataframes with similar incidents
#incidents["similar_incidents"] = incidents.apply(search, axis=1)
#incidents["similar_incidents"] = incidents.apply(search, args=(tfid_known_false_incidents), axis=1)
incidents["similar_incidents"] = incidents.apply(search, args=(tfid_known_false_incidents,false_incidents, vectorizer ,), axis=1)
non_empty_similar_incidents = incidents.dropna(subset=["similar_incidents"])
similarity_scores = non_empty_similar_incidents["similar_incidents"].apply(lambda x: x["similarity_score"])
non_empty_similar_incidents
category | short_description | description | assignment_group | u_inc_dept | sys_created_on | closed_at | comments_and_work_notes | FalseIncident | quickly_closed | similar_incidents | |
---|---|---|---|---|---|---|---|---|---|---|---|
212 | Software | User called in to report that he is unable to ... | User called in to report that he is unable to ... | Custom Application Development | NaN | 02/24/2020 12:45:58 PM | 04/10/2020 08:48:08 AM | 04/10/2020 08:48:08 AM - System (Additional co... | False | false | category short_description ... |
268 | Software | Unable to reset NID Password due to webpage ou... | User's password has expired and they attempted... | Custom Application Development | NaN | 03/08/2021 04:48:03 PM | 03/15/2021 01:48:02 PM | 03/15/2021 01:48:02 PM - System (Additional co... | False | false | category short_description ... |
272 | Software | NID Password reset for account (NID): da909465 | User is having issues being able to reset his ... | Custom Application Development | HOSPITALITY MANAGEMENT DEAN | 03/09/2021 01:30:01 PM | 03/15/2021 01:48:08 PM | 03/15/2021 01:48:08 PM - System (Additional co... | False | false | category short_description ... |
318 | Software | User is unable to reset knights mail due to it... | User states when trying to reset his password ... | Custom Application Development | NaN | 12/07/2021 07:31:51 PM | 12/14/2021 02:48:04 PM | 12/14/2021 02:48:04 PM - System (Additional co... | False | false | category short_description ... |
325 | Software | Unable to reset NID password | User called in to report that she has been una... | Custom Application Development | NaN | 12/09/2021 09:27:13 AM | 01/21/2022 08:48:06 AM | 01/21/2022 08:48:06 AM - System (Additional co... | False | false | category short_description ... |
335 | Software | Students are not able to upload documents to t... | Students are trying to go to their profile and... | Custom Application Development | SDES STU LEADERSHIP DEVELOP | 01/18/2022 03:53:46 PM | 02/09/2022 08:48:15 AM | 02/09/2022 08:48:15 AM - System (Additional co... | False | false | category ... |
The table below displays each incident that has a similarity score above 0.5 (the threshold used in search) with at least one known false incident. A NaN in a column simply means the current row has no recorded similarity with that particular known incident.
similarity_scores
326 | 292 | |
---|---|---|
212 | 0.536588 | NaN |
268 | 0.988003 | NaN |
272 | 0.517992 | NaN |
318 | 0.506493 | NaN |
325 | 1.000000 | NaN |
335 | NaN | 0.513134 |
#non_empty_similar_incidents['similar_incidents']
# Iteratively expand the set of suspected false incidents: each pass uses everything
# found so far as the "known" set and searches for newly similar incidents.
def findFalseOnes(df):
    new = incidents.copy()
    tfid_known_false_incidents, false_incidents, vectorizer = createTfid(df)
    new["similar_incidents"] = new.apply(search, args=(tfid_known_false_incidents, false_incidents, vectorizer,), axis=1)
    not_empty = new.dropna(subset=["similar_incidents"])
    sim_scores = not_empty["similar_incidents"].apply(lambda x: x["similarity_score"])
    return not_empty, sim_scores
p = non_empty_similar_incidents.drop(['similar_incidents'], axis=1)
#e,b = findFalseOnes(p)
count = 0
df_found = pd.DataFrame()
df_sim = pd.DataFrame()
while not p.empty:
    p, b = findFalseOnes(p)
    if p.empty:
        break
    count = count + p.shape[0]
    # Accumulate results with pd.concat (DataFrame.append is deprecated)
    df_found = pd.concat([df_found, p])
    df_sim = pd.concat([df_sim, b])
    p = df_found
print(count)
df_found
7
category | short_description | description | assignment_group | u_inc_dept | sys_created_on | closed_at | comments_and_work_notes | FalseIncident | quickly_closed | similar_incidents | |
---|---|---|---|---|---|---|---|---|---|---|---|
292 | Hardware | We are not able to access lead.sdes.ucf.edu/ad... | All computers in the office are getting the sa... | Custom Application Development | SDES STU LEADERSHIP DEVELOP | 08/19/2021 08:47:17 AM | 08/26/2021 11:48:14 AM | 08/26/2021 11:48:14 AM - System (Additional co... | true | false | category ... |
326 | Software | Unable to reset NID password | User called in to report that he has been unab... | Custom Application Development | CCIE DEAN | 12/09/2021 09:33:23 AM | 12/09/2021 09:49:04 AM | 12/09/2021 09:49:04 AM - Yacine Tazi (Addition... | true | true | category ... |
212 | Software | User called in to report that he is unable to ... | User called in to report that he is unable to ... | Custom Application Development | NaN | 02/24/2020 12:45:58 PM | 04/10/2020 08:48:08 AM | 04/10/2020 08:48:08 AM - System (Additional co... | False | false | category short_description ... |
268 | Software | Unable to reset NID Password due to webpage ou... | User's password has expired and they attempted... | Custom Application Development | NaN | 03/08/2021 04:48:03 PM | 03/15/2021 01:48:02 PM | 03/15/2021 01:48:02 PM - System (Additional co... | False | false | category short_description ... |
318 | Software | User is unable to reset knights mail due to it... | User states when trying to reset his password ... | Custom Application Development | NaN | 12/07/2021 07:31:51 PM | 12/14/2021 02:48:04 PM | 12/14/2021 02:48:04 PM - System (Additional co... | False | false | category short_description ... |
325 | Software | Unable to reset NID password | User called in to report that she has been una... | Custom Application Development | NaN | 12/09/2021 09:27:13 AM | 01/21/2022 08:48:06 AM | 01/21/2022 08:48:06 AM - System (Additional co... | False | false | category short_description ... |
335 | Software | Students are not able to upload documents to t... | Students are trying to go to their profile and... | Custom Application Development | SDES STU LEADERSHIP DEVELOP | 01/18/2022 03:53:46 PM | 02/09/2022 08:48:15 AM | 02/09/2022 08:48:15 AM - System (Additional co... | False | false | category ... |
df_sim
335 | 325 | 268 | 326 | 292 | |
---|---|---|---|---|---|
292 | 0.553778 | NaN | NaN | NaN | NaN |
326 | NaN | 0.543881 | 1.0 | NaN | NaN |
212 | NaN | NaN | NaN | 0.649939 | NaN |
268 | NaN | NaN | NaN | 0.984638 | NaN |
318 | NaN | NaN | NaN | 0.595599 | NaN |
325 | NaN | NaN | NaN | 1.000000 | NaN |
335 | NaN | NaN | NaN | NaN | 0.575907 |
Here we use a Naive Bayes classifier to attempt to make predictions. However, the training data only contains the handful of known false incidents (a single class), so the model always predicts "true": every incident is flagged as a false incident, producing nothing but false positives.
import nltk
from nltk.stem import WordNetLemmatizer
# Download the stopwords corpus before trying to load it
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
# The WordNet corpus also had to be installed; followed this guide: https://stackoverflow.com/questions/13965823/resource-corpora-wordnet-not-found-on-heroku
lemmatizer = WordNetLemmatizer()
#train_data
## CREDIT: https://www.analyticsvidhya.com/blog/2021/09/creating-a-movie-reviews-classifier-using-tf-idf-in-python/
train_X_non = train_data['short_description']# + " " + train_data['description'] # short description text of the known false incidents
train_y = train_data['FalseIncident'] # labels (all "true" for the known false incidents)
test_X_non = incidents['short_description']# + " " + incidents['description']
test_y = incidents['FalseIncident']
train_X=[]
test_X=[]
# Strip non-letters, lowercase, remove stopwords, and lemmatize a single text string
def processText(text):
    processed_text = re.sub('[^a-zA-Z]', ' ', text)
    processed_text = processed_text.lower()
    processed_text = processed_text.split()
    processed_text = [lemmatizer.lemmatize(word) for word in processed_text if word not in set(stopwords)]
    return [' '.join(processed_text)]
#text pre processing
for i in range(0, len(train_X_non)):
review = re.sub('[^a-zA-Z]', ' ', train_X_non.iloc[i])
review = review.lower()
review = review.split()
review = [lemmatizer.lemmatize(word) for word in review if not word in set(stopwords)]
review = ' '.join(review)
train_X.append(review)
#text pre processing
for i in range(0, len(test_X_non)):
review = re.sub('[^a-zA-Z]', ' ', test_X_non.iloc[i])
review = review.lower()
review = review.split()
review = [lemmatizer.lemmatize(word) for word in review if not word in set(stopwords)]
review = ' '.join(review)
test_X.append(review)
print(train_X[3])
[nltk_data] Downloading package stopwords to /home/strick/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
access dhcp reservation anythong dhcp
# tf-idf
tf_idf = TfidfVectorizer()
# Fit the vectorizer on the training data and transform it into a tf-idf matrix
X_train_tf = tf_idf.fit_transform(train_X)
print("n_samples: %d, n_features: %d" % X_train_tf.shape)
#transforming test data into tf-idf matrix
X_test_tf = tf_idf.transform(test_X)
print("n_samples: %d, n_features: %d" % X_test_tf.shape)
n_samples: 6, n_features: 40
n_samples: 355, n_features: 40
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
# REF: https://builtin.com/data-science/precision-and-recall
#naive bayes classifier
naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(X_train_tf, train_y)
#predicted y
y_pred = naive_bayes_classifier.predict(X_test_tf)
#Prediction is complete. Now, we print the classification report.
# Class labels are ordered alphabetically ('False', 'true'), so the display names must follow that order
print(metrics.classification_report(test_y, y_pred, target_names=['NotFalse', 'FalseIncident']))
precision recall f1-score support
NotFalse 0.00 0.00 0.00 349
FalseIncident 0.02 1.00 0.03 6
accuracy 0.02 355
macro avg 0.01 0.50 0.02 355
weighted avg 0.00 0.02 0.00 355
/home/strick/.local/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/home/strick/.local/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/home/strick/.local/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
print("Confusion matrix:")
print(metrics.confusion_matrix(test_y, y_pred))
Confusion matrix:
[[ 0 349]
[ 0 6]]
Based on this confusion matrix, my model is not predicting well at all. It labels every incident as a false incident: the 6 actual false incidents are caught (true positives), but all 349 remaining incidents are also flagged as false incidents when they are not (false positives), and no incident is ever predicted to be a normal one.
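To make the numbers concrete, here is a minimal sketch (using only the counts from the confusion matrix above) of how the reported ~0.02 accuracy comes about:
# Counts taken from the confusion matrix above
# rows = actual ('False', 'true'), columns = predicted ('False', 'true')
tn, fp = 0, 349   # actual normal incidents: none predicted normal, all flagged as false incidents
fn, tp = 0, 6     # actual false incidents: all six correctly flagged
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 6 / 355
print(round(accuracy, 3))                    # ~0.017, i.e. the ~0.02 accuracy reported above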
# Let's do a prediction on an arbitrary, unrelated sentence
test = ["avengers are here to stay in the world of us"]
review = re.sub('[^a-zA-Z]', ' ', test[0])
review = review.lower()
review = review.split()
review = [lemmatizer.lemmatize(word) for word in review if not word in set(stopwords)]
test_processed =[ ' '.join(review)]
#test_processed = processText("This is unlike any kind of adventure movie my eyes")
test_processed
['avenger stay world u']
test_input = tf_idf.transform(test_processed)
test_input.shape
(1, 40)
res=naive_bayes_classifier.predict(test_input)[0]
res
'true'
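Even this unrelated sentence is classified as a false incident, which again reflects that the classifier was only ever shown one class. A minimal sketch (not run above; the setup below is an assumption, not the method used in this notebook) of training on both classes of the labeled incidents would look roughly like this:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
# Use all labeled incidents so both classes ("true" and "False") are represented
X = incidents['short_description']
y = incidents['FalseIncident']
# stratify keeps the handful of known false incidents split across train and test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))
With only six positive examples the model would still be dominated by the majority class, so collecting more labeled false incidents remains the real fix.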
from sklearn.cluster import KMeans
# Convert the similarity scores to a numpy array
similarity_scores_array = similarity_scores.values.reshape(-1, 1)
#similarity_scores_array = similarity_scores.dropna().values.reshape(-1, 1)
similarity_scores_array
def simplify_category(df):
df['category']=pd.get_dummies(df.category).drop('Software',axis=1)
return df
def simplify(df, col_name, col, value):
df[col_name]=pd.get_dummies(col).drop(value,axis=1)
return df
tmp = simplify_category(incidents)
def drop_features(df):
return df.drop(['short_description', 'description', 'sys_created_by', 'u_inc_dept', 'sys_created_on', 'closed_at', 'comments_and_work_notes', 'similar_incidents'], axis=1)
#tmp = drop_features(incidents)
tmp = simplify(tmp, 'quickly_closed', tmp.quickly_closed, 'false')
tmp = simplify(tmp, 'FalseIncident', tmp.FalseIncident, 'False')
# Initialize a k-means object with the desired number of clusters
k = 2
# n_init is set explicitly to avoid sklearn's FutureWarning about its changing default
kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10)
# Fit the k-means model to the tf-idf matrix and get the cluster assignment for each incident
#cluster_labels = kmeans.labels_
cluster_labels = kmeans.fit_predict(X_test_tf)
print(cluster_labels)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0
0 0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 1 0 0 0 0 1 1
0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0
0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
import matplotlib.pyplot as plt
# Filter the tf-idf rows belonging to cluster 0
filtered_label0 = X_test_tf[cluster_labels == 0]
filtered_label0 = filtered_label0.toarray()
# Plot the first two tf-idf features of the cluster-0 incidents
plt.scatter(filtered_label0[:, 0], filtered_label0[:, 1])
#plt.show()
<matplotlib.collections.PathCollection at 0x7fc1dd3b9090>
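Two raw tf-idf columns don't give a very informative picture. A hedged sketch (an assumption, not part of the analysis above) of a more readable visualization would project the tf-idf matrix down to two dimensions first, e.g. with TruncatedSVD, and color the points by cluster:
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt
# Project the sparse tf-idf matrix to 2-D for plotting
svd = TruncatedSVD(n_components=2, random_state=42)
points_2d = svd.fit_transform(X_test_tf)
# Color each incident by its k-means cluster assignment
plt.scatter(points_2d[:, 0], points_2d[:, 1], c=cluster_labels)
plt.show()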