CAP 5610 - Brian Strickland (1368280)
- Collect the data
- Clean the data
- Feature extract to find known false incidents
- Use a frequency matrix to create similarities between incidents based on the short / long description (and experiment with other columns)
The data was created by generating a custom Service Portal page within ServiceNow and placing a Data Table by Instance widget onto the page with defined columns. Over 200K incidents could then be exported.
import pandas as pd
import numpy as np
#pd.options.display.float_format = '{:.20f}'.format
#incidents = pd.read_csv("incident_full.csv")
incidents = pd.read_csv("incident_and_comments.csv")
#incidents
- Remove unused columns from the data
- Only use rows that pertain to the Custom Application Development group
- Remove bogus data and handle NaN values
import re
nanColumns = ['description', 'short_description']
assignment_group = 'Custom Application Development'
# Filter down to development team items
incidents = incidents[incidents.assignment_group == assignment_group]
# Remove unnecessary features
#incidents = incidents.drop(['assignment_group', 'assigned_to', 'number', 'opened_by'], axis=1)
incidents = incidents.drop(['number', 'opened_by'], axis=1)
# Remove all NaN values from dataset
incidents = incidents.dropna(subset=nanColumns)
# Remove bogus data
incidents = incidents[incidents.description != 'asdf']
# Remove incidents with non string date
incidents = incidents[incidents['sys_created_on'].apply(lambda x: isinstance(x, str))]
incidents = incidents[incidents['closed_at'].apply(lambda x: isinstance(x, str))]
# Function to clean special characters out of the data
def clean_description(description):
    try:
        # Strip everything except letters, digits, and spaces
        return re.sub("[^a-zA-Z0-9 ]", "", description)
    except TypeError:
        # Non-string values are returned unchanged
        return description
# Clean the short description and descriptions
incidents.short_description = incidents.short_description.fillna(0)
incidents.description = incidents.description.fillna(0)
incidents["short_description"] = incidents["short_description"].apply(clean_description)
incidents["description"] = incidents["description"].apply(clean_description)
incidents["FalseIncident"] = "False"
# set random seed for reproducibility
#np.random.seed(143)
# generate a random number between 0 and the length of the dataframe
#num_true = np.random.randint(0, len(incidents))
# set that many incidents to True for the "FalseIncident" column
#incidents.loc[np.random.choice(incidents.index, size=num_true), "FalseIncident"] = "true"
incidents
category | short_description | description | assignment_group | u_inc_dept | sys_created_on | closed_at | comments_and_work_notes | FalseIncident | |||
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Software | The SAFE Form website is not properly generati... | When individuals complete a SAFE Form, there i... | Custom Application Development | NaN | 09/25/2017 04:07:49 PM | 10/13/2017 08:48:07 AM | 10/13/2017 08:48:07 AM - System (Additional co... | False | ||
1 | Software | Undergraduate Admissions Web App (OLA) file wa... | The web application load in PS failed because ... | Custom Application Development | NaN | 10/25/2017 08:26:34 AM | 10/30/2017 02:48:11 PM | 10/30/2017 02:48:11 PM - System (Additional co... | False | ||
2 | Hardware | Unable to access http://directory.sdes.ucf.edu... | Unable to access site | Custom Application Development | NaN | 10/27/2017 08:55:53 AM | 11/01/2017 11:48:06 AM | 11/01/2017 11:48:06 AM - System (Additional co... | False | ||
3 | Software | We are unable to enter redeemed vouchers into ... | This is the first time this year that we are p... | Custom Application Development | NaN | 10/27/2017 02:45:07 PM | 11/01/2017 03:48:07 PM | 11/01/2017 03:48:07 PM - System (Additional co... | False | ||
4 | Software | UA forms such as Residency, Reacts, Counselor ... | UA forms such as Residency, Reacts, Counselor ... | Custom Application Development | NaN | 10/31/2017 08:39:06 AM | 11/03/2017 01:48:08 PM | 11/03/2017 01:48:08 PM - System (Additional co... | False | ||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
351 | Software | Trying to submit changes to UCF Phonebook and ... | Details: Althea Robinson called to report she... | Custom Application Development | CCIE ADMINISTRATION | 05/18/2022 11:55:32 AM | 05/23/2022 04:48:07 PM | 05/23/2022 04:48:07 PM - System (Additional co... | False | ||
352 | Software | I need access to following link below. When I ... | User's relationship to UCF: Employee\n\nUser's... | Custom Application Development | COLLEGE OF BUSINESS DEAN | 05/25/2022 08:57:12 AM | 05/31/2022 02:48:00 PM | 05/31/2022 02:48:00 PM - System (Additional co... | False | ||
353 | Software | Cannot log into COBA Test Management | I have tried several times to log in but keep ... | Custom Application Development | NaN | 06/03/2022 04:46:22 AM | 06/23/2022 10:48:00 AM | 06/23/2022 10:48:00 AM - System (Additional co... | False | ||
354 | Software | Knights Email Acount Login and Password Reset/... | Incoming student Sydney Schumacher called in f... | Custom Application Development | NaN | 06/14/2022 01:31:45 PM | 06/20/2022 08:48:05 AM | 06/20/2022 08:48:05 AM - System (Additional co... | False | ||
355 | Software | custom app e911.it.ucf.edu not pulling data | custom app is displaying datatables error when... | Custom Application Development | UCF IT | 06/30/2022 03:48:37 PM | 07/18/2022 01:48:00 PM | 07/18/2022 01:48:00 PM - System (Additional co... | False |
355 rows Ă— 11 columns
Here we will look at various methods to identify known false incidents and record them in the FalseIncident feature on those rows. The method below makes it quick to mark identified row(s) as false incidents.
# Helper function to set the FalseIncident column to true for all rows in the dataframe based on a feature and value.
def add_false_incidents(df, feature, value):
df.loc[(df[feature] == value), 'FalseIncident'] = "true"
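For example, `add_false_incidents(incidents, 'category', 'Hardware')` (used further below) flags every Hardware-categorized incident as a false incident in a single call.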
Here we will explore incidents that were closed quickly (within an hour) and, if any are found, determine whether they are truly false incidents. From this we can learn what makes a quickly closed incident a false incident and fold that into our scoring for identifying false incidents.
# Get incidents that were closed in less than 60 minutes
from datetime import datetime
def is_quickly_closed(row):
date1 = datetime.strptime(row["sys_created_on"], '%m/%d/%Y %I:%M:%S %p')
date2 = datetime.strptime(row["closed_at"], '%m/%d/%Y %I:%M:%S %p')
diff_minutes = int((date2 - date1).total_seconds() / 60)
if diff_minutes <= 60:
return "true"
return "false"
incidents["quickly_closed"] = incidents.apply(is_quickly_closed, axis=1)
incidents[incidents['quickly_closed'] == "true"]
category | short_description | description | assignment_group | u_inc_dept | sys_created_on | closed_at | comments_and_work_notes | FalseIncident | quickly_closed | |
---|---|---|---|---|---|---|---|---|---|---|
326 | Software | Unable to reset NID password | User called in to report that he has been unab... | Custom Application Development | CCIE DEAN | 12/09/2021 09:33:23 AM | 12/09/2021 09:49:04 AM | 12/09/2021 09:49:04 AM - Yacine Tazi (Addition... | False | true |
From here we can tell that only a very small number of incidents get closed in less than an hour. This specific incident is a password reset, so let's look at the comments to see whether it was user error or whether the support center helped the user reset the password.
from IPython.display import display, HTML
# Display the comments/work notes for each row of the given dataframe
def print_comments(df):
    for i, row in df.iterrows():
        display(HTML(pd.DataFrame({'comments_and_work_notes': [row['comments_and_work_notes']]}).to_html().replace("\\n", "<br>")))
Based on the following, we can determine that this wasn't an incident that needed to be resolved by the development team, but rather an expected outage that eventually allowed the customer to log in:
- There is an intermittent outage with Self Service Reset Tool when users are trying to reset their password using email. This is not consistent behavior and will resolve itself shortly.
- Was able to log in again
With this information, we can go ahead and flag this particular incident as a FalseIncident.
with pd.option_context('display.max_colwidth', None):
print(incidents[incidents['quickly_closed'] == "true"].comments_and_work_notes)
326 12/09/2021 09:49:04 AM - Yacine Tazi (Additional comments)\nWas able to log in again\n\n12/09/2021 09:47:40 AM - System (Additional comments)\nWe are continuing to investigate the underlying issue:\n\n\nWHAT IS HAPPENING?\nThere is an intermittent outage with Self Service Reset Tool when users are trying to reset their password using email. This is not consistent behavior and will resolve itself shortly. \n\nWHO IS IMPACTED?\nAnyone that need to reset their NID password using the email functionality. \n\nWHAT ARE WE DOING ABOUT IT?\nWe are currently investigating this issue.\n\nWHAT HAPPENS NEXT?\nWe are currently investigating and will keep everyone posted once the issue is resolved. \n\nWHAT DO I NEED TO DO?\nShould users encounter this issue during password reset, please wait 15-20 minutes and try again. \n\n\n\n12/09/2021 09:34:35 AM - Diego Cruces (Work notes)\nRouting to the Custom Application Development team for further investigation.\n\n
Name: comments_and_work_notes, dtype: object
add_false_incidents(incidents, 'quickly_closed', 'true')
incidents[incidents['FalseIncident'] == "true"]
category | short_description | description | assignment_group | u_inc_dept | sys_created_on | closed_at | comments_and_work_notes | FalseIncident | quickly_closed | |
---|---|---|---|---|---|---|---|---|---|---|
326 | Software | Unable to reset NID password | User called in to report that he has been unab... | Custom Application Development | CCIE DEAN | 12/09/2021 09:33:23 AM | 12/09/2021 09:49:04 AM | 12/09/2021 09:49:04 AM - Yacine Tazi (Addition... | true | true |
Incidents categorized as Hardware are assumed to be false incidents, since the software development team doesn't handle hardware issues.
add_false_incidents(incidents, 'category', 'Hardware')
#incidents[incidents['FalseIncident'] == "true"]
# Set the selected indices to True
#incidents.loc[incidents['FalseIncident'] == False, 'FalseIncident'] = np.random.choice([True, False], size=incidents['FalseIncident'].shape[0], p=[0.5, 0.5])
# store to training data
train_data = incidents[incidents['FalseIncident'] == "true"].copy()
train_data
category | short_description | description | assignment_group | u_inc_dept | sys_created_on | closed_at | comments_and_work_notes | FalseIncident | quickly_closed | |
---|---|---|---|---|---|---|---|---|---|---|
2 | Hardware | Unable to access http://directory.sdes.ucf.edu... | Unable to access site | Custom Application Development | NaN | 10/27/2017 08:55:53 AM | 11/01/2017 11:48:06 AM | 11/01/2017 11:48:06 AM - System (Additional co... | true | false |
11 | Hardware | Students are unable to upload forms to online ... | Students are required to upload their involvem... | Custom Application Development | NaN | 12/04/2017 01:13:26 PM | 12/12/2017 11:48:05 AM | 12/12/2017 11:48:06 AM - System (Additional co... | true | false |
121 | Hardware | The Exchange Unified Messaging voicemail assig... | The Exchange Unified Messaging voicemail assig... | Custom Application Development | UCF IT | 03/13/2019 06:38:29 AM | 03/21/2019 03:48:13 PM | 03/21/2019 03:48:13 PM - System (Additional co... | true | false |
210 | Hardware | Can't access DHCP reservations or do anythong ... | I got the new URL my.it.ucf.edeu and I can get... | Custom Application Development | FINANCIAL AFFAIRS | 02/19/2020 09:41:48 AM | 02/24/2020 12:48:09 PM | 02/24/2020 12:48:09 PM - System (Additional co... | true | false |
292 | Hardware | We are not able to access lead.sdes.ucf.edu/ad... | All computers in the office are getting the sa... | Custom Application Development | SDES STU LEADERSHIP DEVELOP | 08/19/2021 08:47:17 AM | 08/26/2021 11:48:14 AM | 08/26/2021 11:48:14 AM - System (Additional co... | true | false |
326 | Software | Unable to reset NID password | User called in to report that he has been unab... | Custom Application Development | CCIE DEAN | 12/09/2021 09:33:23 AM | 12/09/2021 09:49:04 AM | 12/09/2021 09:49:04 AM - Yacine Tazi (Addition... | true | true |
Here we'll create a feature matrix based on the short description values of the known false incidents (i.e. FalseIncident == "true"). From there we can compute a similarity score against all other incidents that have been submitted to see if we can identify additional false incidents.
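As a quick illustration of the idea (a minimal sketch using made-up toy strings, not rows from the dataset), TF-IDF turns each description into a weighted term vector and cosine similarity scores their overlap:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Toy descriptions (hypothetical): the first two share terms, the third does not
toy = ["Unable to reset NID password",
       "NID password reset not working",
       "Printer is out of toner"]
vec = TfidfVectorizer()
toy_tfidf = vec.fit_transform(toy)
# Similarity of the first description against all three: itself scores 1.0,
# the second scores well above zero, and the third (no shared terms) scores 0
print(cosine_similarity(toy_tfidf[0], toy_tfidf).flatten())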
from sklearn.feature_extraction.text import TfidfVectorizer
def createTfid(train_data):
    # Choose min and max word sequences (1- to 5-grams)
    vectorizer = TfidfVectorizer(ngram_range=(1, 5))
    # Only vectorize the known false incidents passed in as train_data
    #false_incidents = incidents[incidents['FalseIncident'] == "true"]
    false_incidents = train_data.copy()
    tfid_known_false_incidents = vectorizer.fit_transform(false_incidents['short_description'])
    return tfid_known_false_incidents, false_incidents, vectorizer
# Choose min and max word sequesnces
#vectorizer = TfidfVectorizer(ngram_range=(1,5))
#change this to only vectorize on known incidents
#false_incidents = train_data.copy()
#tfid_known_false_incidents = vectorizer.fit_transform(false_incidents['short_description'] )
tfid_known_false_incidents, false_incidents, vectorizer = createTfid(train_data)
tfid_known_false_incidents
<6x316 sparse matrix of type '<class 'numpy.float64'>'
with 339 stored elements in Compressed Sparse Row format>
from sklearn.metrics.pairwise import cosine_similarity
# Then compute cosine similarity between each incident and the known false incidents
def search(incident, tfid_known_false_incidents, false_incidents, vectorizer):
    if incident.name in false_incidents.index:
        return
    desc = clean_description(incident["short_description"])
    query_vec = vectorizer.transform([desc])
    # Compare the description to the known false incidents
    similarity = cosine_similarity(query_vec, tfid_known_false_incidents).flatten()
    # Keep any items with a > 0.5 similarity
    indices = np.where(similarity > 0.5)[0]
    # Remove the current item from the list so that it doesn't get a 1.0 similarity with itself
    current_index = incident.name - 1
    indices = indices[indices != current_index]
    same_incident_indices = np.where(indices >= len(false_incidents))[0]
    indices = indices[indices < len(false_incidents)]
    results = false_incidents.iloc[indices].iloc[::-1]
    if not results.empty:
        # Add a similarity score feature
        results["similarity_score"] = similarity[indices]
        return results
# add a new column to the incidents dataframe that contains non-empty dataframes with similar incidents
#incidents["similar_incidents"] = incidents.apply(search, axis=1)
#incidents["similar_incidents"] = incidents.apply(search, args=(tfid_known_false_incidents), axis=1)
incidents["similar_incidents"] = incidents.apply(search, args=(tfid_known_false_incidents,false_incidents, vectorizer ,), axis=1)
non_empty_similar_incidents = incidents.dropna(subset=["similar_incidents"])
similarity_scores = non_empty_similar_incidents["similar_incidents"].apply(lambda x: x["similarity_score"])
non_empty_similar_incidents
category | short_description | description | assignment_group | u_inc_dept | sys_created_on | closed_at | comments_and_work_notes | FalseIncident | quickly_closed | similar_incidents | |
---|---|---|---|---|---|---|---|---|---|---|---|
212 | Software | User called in to report that he is unable to ... | User called in to report that he is unable to ... | Custom Application Development | NaN | 02/24/2020 12:45:58 PM | 04/10/2020 08:48:08 AM | 04/10/2020 08:48:08 AM - System (Additional co... | False | false | category short_description ... |
268 | Software | Unable to reset NID Password due to webpage ou... | User's password has expired and they attempted... | Custom Application Development | NaN | 03/08/2021 04:48:03 PM | 03/15/2021 01:48:02 PM | 03/15/2021 01:48:02 PM - System (Additional co... | False | false | category short_description ... |
272 | Software | NID Password reset for account (NID): da909465 | User is having issues being able to reset his ... | Custom Application Development | HOSPITALITY MANAGEMENT DEAN | 03/09/2021 01:30:01 PM | 03/15/2021 01:48:08 PM | 03/15/2021 01:48:08 PM - System (Additional co... | False | false | category short_description ... |
318 | Software | User is unable to reset knights mail due to it... | User states when trying to reset his password ... | Custom Application Development | NaN | 12/07/2021 07:31:51 PM | 12/14/2021 02:48:04 PM | 12/14/2021 02:48:04 PM - System (Additional co... | False | false | category short_description ... |
325 | Software | Unable to reset NID password | User called in to report that she has been una... | Custom Application Development | NaN | 12/09/2021 09:27:13 AM | 01/21/2022 08:48:06 AM | 01/21/2022 08:48:06 AM - System (Additional co... | False | false | category short_description ... |
335 | Software | Students are not able to upload documents to t... | Students are trying to go to their profile and... | Custom Application Development | SDES STU LEADERSHIP DEVELOP | 01/18/2022 03:53:46 PM | 02/09/2022 08:48:15 AM | 02/09/2022 08:48:15 AM - System (Additional co... | False | false | category ... |
The table below displays each incident that has a similarity score above 0.5 (the threshold used in search) with at least one known false incident. A NaN in a column simply means the current row has no recorded similarity with that particular known incident.
similarity_scores
326 | 292 | |
---|---|---|
212 | 0.536588 | NaN |
268 | 0.988003 | NaN |
272 | 0.517992 | NaN |
318 | 0.506493 | NaN |
325 | 1.000000 | NaN |
335 | NaN | 0.513134 |
#non_empty_similar_incidents['similar_incidents']
# Iteratively expand the set of suspected false incidents: each pass uses everything
# found so far as the "known" set and searches for newly similar incidents.
def findFalseOnes(df):
    new = incidents.copy()
    tfid_known_false_incidents, false_incidents, vectorizer = createTfid(df)
    new["similar_incidents"] = new.apply(search, args=(tfid_known_false_incidents, false_incidents, vectorizer,), axis=1)
    not_empty = new.dropna(subset=["similar_incidents"])
    sim_scores = not_empty["similar_incidents"].apply(lambda x: x["similarity_score"])
    return not_empty, sim_scores
p = non_empty_similar_incidents.drop(['similar_incidents'], axis=1)
#e,b = findFalseOnes(p)
count = 0
df_found = pd.DataFrame()
df_sim = pd.DataFrame()
while not p.empty:
    p, b = findFalseOnes(p)
    if p.empty:
        break
    count = count + p.shape[0]
    # Accumulate results with pd.concat (DataFrame.append is deprecated)
    df_found = pd.concat([df_found, p])
    df_sim = pd.concat([df_sim, b])
    p = df_found
print(count)
df_found
7
category | short_description | description | assignment_group | u_inc_dept | sys_created_on | closed_at | comments_and_work_notes | FalseIncident | quickly_closed | similar_incidents | |
---|---|---|---|---|---|---|---|---|---|---|---|
292 | Hardware | We are not able to access lead.sdes.ucf.edu/ad... | All computers in the office are getting the sa... | Custom Application Development | SDES STU LEADERSHIP DEVELOP | 08/19/2021 08:47:17 AM | 08/26/2021 11:48:14 AM | 08/26/2021 11:48:14 AM - System (Additional co... | true | false | category ... |
326 | Software | Unable to reset NID password | User called in to report that he has been unab... | Custom Application Development | CCIE DEAN | 12/09/2021 09:33:23 AM | 12/09/2021 09:49:04 AM | 12/09/2021 09:49:04 AM - Yacine Tazi (Addition... | true | true | category ... |
212 | Software | User called in to report that he is unable to ... | User called in to report that he is unable to ... | Custom Application Development | NaN | 02/24/2020 12:45:58 PM | 04/10/2020 08:48:08 AM | 04/10/2020 08:48:08 AM - System (Additional co... | False | false | category short_description ... |
268 | Software | Unable to reset NID Password due to webpage ou... | User's password has expired and they attempted... | Custom Application Development | NaN | 03/08/2021 04:48:03 PM | 03/15/2021 01:48:02 PM | 03/15/2021 01:48:02 PM - System (Additional co... | False | false | category short_description ... |
318 | Software | User is unable to reset knights mail due to it... | User states when trying to reset his password ... | Custom Application Development | NaN | 12/07/2021 07:31:51 PM | 12/14/2021 02:48:04 PM | 12/14/2021 02:48:04 PM - System (Additional co... | False | false | category short_description ... |
325 | Software | Unable to reset NID password | User called in to report that she has been una... | Custom Application Development | NaN | 12/09/2021 09:27:13 AM | 01/21/2022 08:48:06 AM | 01/21/2022 08:48:06 AM - System (Additional co... | False | false | category short_description ... |
335 | Software | Students are not able to upload documents to t... | Students are trying to go to their profile and... | Custom Application Development | SDES STU LEADERSHIP DEVELOP | 01/18/2022 03:53:46 PM | 02/09/2022 08:48:15 AM | 02/09/2022 08:48:15 AM - System (Additional co... | False | false | category ... |
df_sim
335 | 325 | 268 | 326 | 292 | |
---|---|---|---|---|---|
292 | 0.553778 | NaN | NaN | NaN | NaN |
326 | NaN | 0.543881 | 1.0 | NaN | NaN |
212 | NaN | NaN | NaN | 0.649939 | NaN |
268 | NaN | NaN | NaN | 0.984638 | NaN |
318 | NaN | NaN | NaN | 0.595599 | NaN |
325 | NaN | NaN | NaN | 1.000000 | NaN |
335 | NaN | NaN | NaN | NaN | 0.575907 |
Here we use a Naive Bayes classifier to attempt to make predictions. However, the training data only contains the handful of known false incidents (a single class), so the model always predicts "true": every incident is flagged as a false incident, producing nothing but false positives.
import nltk
from nltk.stem import WordNetLemmatizer
# Download the stopwords corpus before trying to load it
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
# The WordNet corpus also had to be installed; followed this guide: https://stackoverflow.com/questions/13965823/resource-corpora-wordnet-not-found-on-heroku
lemmatizer = WordNetLemmatizer()
#train_data
## CREDIT: https://www.analyticsvidhya.com/blog/2021/09/creating-a-movie-reviews-classifier-using-tf-idf-in-python/
train_X_non = train_data['short_description']# + " " + train_data['description'] # short description text of the known false incidents
train_y = train_data['FalseIncident'] # labels (all "true" for the known false incidents)
test_X_non = incidents['short_description']# + " " + incidents['description']
test_y = incidents['FalseIncident']
train_X=[]
test_X=[]
# Strip non-letters, lowercase, remove stopwords, and lemmatize a single text string
def processText(text):
    processed_text = re.sub('[^a-zA-Z]', ' ', text)
    processed_text = processed_text.lower()
    processed_text = processed_text.split()
    processed_text = [lemmatizer.lemmatize(word) for word in processed_text if word not in set(stopwords)]
    return [' '.join(processed_text)]
#text pre processing
for i in range(0, len(train_X_non)):
review = re.sub('[^a-zA-Z]', ' ', train_X_non.iloc[i])
review = review.lower()
review = review.split()
review = [lemmatizer.lemmatize(word) for word in review if not word in set(stopwords)]
review = ' '.join(review)
train_X.append(review)
#text pre processing
for i in range(0, len(test_X_non)):
review = re.sub('[^a-zA-Z]', ' ', test_X_non.iloc[i])
review = review.lower()
review = review.split()
review = [lemmatizer.lemmatize(word) for word in review if not word in set(stopwords)]
review = ' '.join(review)
test_X.append(review)
print(train_X[3])
[nltk_data] Downloading package stopwords to /home/strick/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
access dhcp reservation anythong dhcp
# tf-idf
tf_idf = TfidfVectorizer()
# Fit the vectorizer on the training data and transform it into a tf-idf matrix
X_train_tf = tf_idf.fit_transform(train_X)
print("n_samples: %d, n_features: %d" % X_train_tf.shape)
#transforming test data into tf-idf matrix
X_test_tf = tf_idf.transform(test_X)
print("n_samples: %d, n_features: %d" % X_test_tf.shape)
n_samples: 6, n_features: 40
n_samples: 355, n_features: 40
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
# REF: https://builtin.com/data-science/precision-and-recall
#naive bayes classifier
naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(X_train_tf, train_y)
#predicted y
y_pred = naive_bayes_classifier.predict(X_test_tf)
#Prediction is complete. Now, we print the classification report.
# Class labels are ordered alphabetically ('False', 'true'), so the display names must follow that order
print(metrics.classification_report(test_y, y_pred, target_names=['NotFalse', 'FalseIncident']))
precision recall f1-score support
NotFalse 0.00 0.00 0.00 349
FalseIncident 0.02 1.00 0.03 6
accuracy 0.02 355
macro avg 0.01 0.50 0.02 355
weighted avg 0.00 0.02 0.00 355
/home/strick/.local/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/home/strick/.local/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
/home/strick/.local/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
print("Confusion matrix:")
print(metrics.confusion_matrix(test_y, y_pred))
Confusion matrix:
[[ 0 349]
[ 0 6]]
Based on this confusion matrix, my model is not predicting well at all. It labels every incident as a false incident: the 6 actual false incidents are caught (true positives), but all 349 remaining incidents are also flagged as false incidents when they are not (false positives), and no incident is ever predicted to be a normal one.
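To make the numbers concrete, here is a minimal sketch (using only the counts from the confusion matrix above) of how the reported ~0.02 accuracy comes about:
# Counts taken from the confusion matrix above
# rows = actual ('False', 'true'), columns = predicted ('False', 'true')
tn, fp = 0, 349   # actual normal incidents: none predicted normal, all flagged as false incidents
fn, tp = 0, 6     # actual false incidents: all six correctly flagged
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 6 / 355
print(round(accuracy, 3))                    # ~0.017, i.e. the ~0.02 accuracy reported above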
# Let's do a prediction on an arbitrary, unrelated sentence
test = ["avengers are here to stay in the world of us"]
review = re.sub('[^a-zA-Z]', ' ', test[0])
review = review.lower()
review = review.split()
review = [lemmatizer.lemmatize(word) for word in review if not word in set(stopwords)]
test_processed =[ ' '.join(review)]
#test_processed = processText("This is unlike any kind of adventure movie my eyes")
test_processed
['avenger stay world u']
test_input = tf_idf.transform(test_processed)
test_input.shape
(1, 40)
res=naive_bayes_classifier.predict(test_input)[0]
res
'true'
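Even this unrelated sentence is classified as a false incident, which again reflects that the classifier was only ever shown one class. A minimal sketch (not run above; the setup below is an assumption, not the method used in this notebook) of training on both classes of the labeled incidents would look roughly like this:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
# Use all labeled incidents so both classes ("true" and "False") are represented
X = incidents['short_description']
y = incidents['FalseIncident']
# stratify keeps the handful of known false incidents split across train and test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))
With only six positive examples the model would still be dominated by the majority class, so collecting more labeled false incidents remains the real fix.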
from sklearn.cluster import KMeans
# Convert the similarity scores to a numpy array
similarity_scores_array = similarity_scores.values.reshape(-1, 1)
#similarity_scores_array = similarity_scores.dropna().values.reshape(-1, 1)
similarity_scores_array
def simplify_category(df):
df['category']=pd.get_dummies(df.category).drop('Software',axis=1)
return df
def simplify(df, col_name, col, value):
df[col_name]=pd.get_dummies(col).drop(value,axis=1)
return df
tmp = simplify_category(incidents)
def drop_features(df):
return df.drop(['short_description', 'description', 'sys_created_by', 'u_inc_dept', 'sys_created_on', 'closed_at', 'comments_and_work_notes', 'similar_incidents'], axis=1)
#tmp = drop_features(incidents)
tmp = simplify(tmp, 'quickly_closed', tmp.quickly_closed, 'false')
tmp = simplify(tmp, 'FalseIncident', tmp.FalseIncident, 'False')
# Initialize a k-means object with the desired number of clusters
k = 2
# n_init is set explicitly to avoid sklearn's FutureWarning about its changing default
kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10)
# Fit the k-means model to the tf-idf matrix and get the cluster assignment for each incident
#cluster_labels = kmeans.labels_
cluster_labels = kmeans.fit_predict(X_test_tf)
print(cluster_labels)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0
0 0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 1 0 0 0 0 1 1
0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0
0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
import matplotlib.pyplot as plt
# Filter the tf-idf rows belonging to cluster 0
filtered_label0 = X_test_tf[cluster_labels == 0]
filtered_label0 = filtered_label0.toarray()
# Plot the first two tf-idf features of the cluster-0 incidents
plt.scatter(filtered_label0[:, 0], filtered_label0[:, 1])
#plt.show()
<matplotlib.collections.PathCollection at 0x7fc1dd3b9090>
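Two raw tf-idf columns don't give a very informative picture. A hedged sketch (an assumption, not part of the analysis above) of a more readable visualization would project the tf-idf matrix down to two dimensions first, e.g. with TruncatedSVD, and color the points by cluster:
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt
# Project the sparse tf-idf matrix to 2-D for plotting
svd = TruncatedSVD(n_components=2, random_state=42)
points_2d = svd.fit_transform(X_test_tf)
# Color each incident by its k-means cluster assignment
plt.scatter(points_2d[:, 0], points_2d[:, 1], c=cluster_labels)
plt.show()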