ML Research Project and Hack Day project API endpoint to provide whether or not a given incident is false.

strick/false-incident-finder

Predicting False Incident Requests

CAP 5610 - Brian Strickland (1368280)

  • Collect the data
  • Clean the data
  • Feature extract to find known false incidents
  • Use a Frequency Matrix to create similarities between incidents based on short / long description (do experiments with other columns)

Collect Data to build training data

Data was collected by building a custom service portal page within ServiceNow and placing a Data Table by Instance widget onto the page with defined columns. This allowed over 200K incidents to be exported.

import pandas as pd
import numpy as np

#pd.options.display.float_format = '{:.20f}'.format


#incidents = pd.read_csv("incident_full.csv")
incidents = pd.read_csv("incident_and_comments.csv")

#incidents

Preprocess data

Drop Data

  • Remove unused columns from the data
  • Only use rows that pertain to the Custom Application Development group
  • Remove bogus data and handle NaN values
import re

nanColumns = ['description', 'short_description']
assignment_group = 'Custom Application Development'

# Filter down to development team items
incidents = incidents[incidents.assignment_group == assignment_group]

# Remove unnecessary features
#incidents = incidents.drop(['assignment_group', 'assigned_to', 'number', 'opened_by'], axis=1)
incidents = incidents.drop(['number', 'opened_by'], axis=1)

# Remove all NaN values from dataset
incidents = incidents.dropna(subset=nanColumns)

# Remove bogus data
incidents = incidents[incidents.description != 'asdf']

# Remove incidents with non string date
incidents = incidents[incidents['sys_created_on'].apply(lambda x: isinstance(x, str))]
incidents = incidents[incidents['closed_at'].apply(lambda x: isinstance(x, str))]

Clean Up Data

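# Function to clean special characters out of data
def clean_description(description):
    try:
        # re.sub returns a new string, so its result must be returned
        return re.sub("[^a-zA-Z0-9 ]", "", description)
    except TypeError:
        # Non-string values (e.g. the 0 used to fill NaN) are returned unchanged
        return description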
        
# Clean the short description and descriptions
incidents.short_description = incidents.short_description.fillna(0)
incidents.description = incidents.description.fillna(0)
incidents["short_description"] = incidents["short_description"].apply(clean_description)
incidents["description"] = incidents["description"].apply(clean_description)
incidents["FalseIncident"] = "False"

# set random seed for reproducibility
#np.random.seed(143)

# generate a random number between 0 and the length of the dataframe
#num_true = np.random.randint(0, len(incidents))

# set that many incidents to True for the "FalseIncident" column
#incidents.loc[np.random.choice(incidents.index, size=num_true), "FalseIncident"] = "true"
incidents
category short_description description assignment_group u_inc_dept sys_created_on closed_at comments_and_work_notes FalseIncident
0 Software The SAFE Form website is not properly generati... When individuals complete a SAFE Form, there i... Custom Application Development NaN 09/25/2017 04:07:49 PM 10/13/2017 08:48:07 AM 10/13/2017 08:48:07 AM - System (Additional co... False
1 Software Undergraduate Admissions Web App (OLA) file wa... The web application load in PS failed because ... Custom Application Development NaN 10/25/2017 08:26:34 AM 10/30/2017 02:48:11 PM 10/30/2017 02:48:11 PM - System (Additional co... False
2 Hardware Unable to access http://directory.sdes.ucf.edu... Unable to access site Custom Application Development NaN 10/27/2017 08:55:53 AM 11/01/2017 11:48:06 AM 11/01/2017 11:48:06 AM - System (Additional co... False
3 Software We are unable to enter redeemed vouchers into ... This is the first time this year that we are p... Custom Application Development NaN 10/27/2017 02:45:07 PM 11/01/2017 03:48:07 PM 11/01/2017 03:48:07 PM - System (Additional co... False
4 Software UA forms such as Residency, Reacts, Counselor ... UA forms such as Residency, Reacts, Counselor ... Custom Application Development NaN 10/31/2017 08:39:06 AM 11/03/2017 01:48:08 PM 11/03/2017 01:48:08 PM - System (Additional co... False
... ... ... ... ... ... ... ... ... ... ... ...
351 Software Trying to submit changes to UCF Phonebook and ... Details: Althea Robinson called to report she... Custom Application Development CCIE ADMINISTRATION 05/18/2022 11:55:32 AM 05/23/2022 04:48:07 PM 05/23/2022 04:48:07 PM - System (Additional co... False
352 Software I need access to following link below. When I ... User's relationship to UCF: Employee\n\nUser's... Custom Application Development COLLEGE OF BUSINESS DEAN 05/25/2022 08:57:12 AM 05/31/2022 02:48:00 PM 05/31/2022 02:48:00 PM - System (Additional co... False
353 Software Cannot log into COBA Test Management I have tried several times to log in but keep ... Custom Application Development NaN 06/03/2022 04:46:22 AM 06/23/2022 10:48:00 AM 06/23/2022 10:48:00 AM - System (Additional co... False
354 Software Knights Email Acount Login and Password Reset/... Incoming student Sydney Schumacher called in f... Custom Application Development NaN 06/14/2022 01:31:45 PM 06/20/2022 08:48:05 AM 06/20/2022 08:48:05 AM - System (Additional co... False
355 Software custom app e911.it.ucf.edu not pulling data custom app is displaying datatables error when... Custom Application Development UCF IT 06/30/2022 03:48:37 PM 07/18/2022 01:48:00 PM 07/18/2022 01:48:00 PM - System (Additional co... False

355 rows × 11 columns
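The cleaning step can be illustrated on a few made-up strings. This is only a sketch: the helper name `strip_special_characters` and the toy data are hypothetical and not part of the original notebook. The point it shows is that `re.sub` returns a new string rather than modifying its input.

```python
import re

import pandas as pd


def strip_special_characters(text):
    """Return text with everything except letters, digits, and spaces removed."""
    if not isinstance(text, str):
        return text  # leave NaN and other non-string values untouched
    # re.sub returns a new string; the input is never modified in place
    return re.sub(r"[^a-zA-Z0-9 ]", "", text)


# Toy data (hypothetical, not taken from the incident export)
toy = pd.DataFrame({"short_description": ["Can't log in!!", "Site down (404)"]})
toy["short_description"] = toy["short_description"].apply(strip_special_characters)
print(toy["short_description"].tolist())
```

Checking the type up front, instead of catching every exception, keeps unrelated errors visible.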

Feature Extraction

Here we will look at various methods to identify known false incidents and add a feature, FalseIncident, to each of those rows. The method below makes it possible to quickly mark the identified row(s) as false incidents.

# Helper function to set the FalseIncident column to true for all rows in the dataframe based on a feature and value.
def add_false_incidents(df, feature, value):
    
    df.loc[(df[feature] == value), 'FalseIncident'] = "true"
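As a rough illustration of how a helper like this behaves, the sketch below applies the same boolean-mask assignment to a three-row toy frame. The function name `flag_false_incidents` and the data are hypothetical stand-ins, not the notebook's own objects.

```python
import pandas as pd


def flag_false_incidents(df, feature, value):
    """Set FalseIncident to "true" on rows where df[feature] == value (in place)."""
    df.loc[df[feature] == value, "FalseIncident"] = "true"


# Toy data (hypothetical)
toy = pd.DataFrame({
    "category": ["Software", "Hardware", "Software"],
    "FalseIncident": ["False", "False", "False"],
})
flag_false_incidents(toy, "category", "Hardware")
print(toy["FalseIncident"].tolist())  # ['False', 'true', 'False']
```

Because the assignment goes through `df.loc`, the frame passed in is modified directly and nothing needs to be returned.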

Quickly Closed Incidents

Here we will explore incidents that were closed quickly (within an hour) and, if any are found, determine whether they are truly false incidents. From this we can gather some information about what makes quickly closed incidents false incidents, which we'll add to our scoring for identifying false incidents.

# Get incidents that were closed within 60 minutes
from datetime import datetime
def is_quickly_closed(row):
    
    date1 = datetime.strptime(row["sys_created_on"], '%m/%d/%Y %I:%M:%S %p')
    date2 = datetime.strptime(row["closed_at"], '%m/%d/%Y %I:%M:%S %p')

    diff_minutes = int((date2 - date1).total_seconds() / 60)


    if diff_minutes <= 60:
        return "true"
            
    return "false"
        

incidents["quickly_closed"] = incidents.apply(is_quickly_closed, axis=1)
incidents[incidents['quickly_closed'] == "true"]
category short_description description assignment_group u_inc_dept sys_created_on closed_at comments_and_work_notes FalseIncident quickly_closed
326 Software Unable to reset NID password User called in to report that he has been unab... Custom Application Development CCIE DEAN 12/09/2021 09:33:23 AM 12/09/2021 09:49:04 AM 12/09/2021 09:49:04 AM - Yacine Tazi (Addition... False true
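The same elapsed-time check could also be written without a row-wise `apply`. The sketch below is one possible vectorized variant, assuming the timestamp format used above; the toy frame reuses two timestamp pairs from the tables in this README. Unlike the function above, it stores booleans rather than the strings "true"/"false" and does not truncate to whole minutes.

```python
import pandas as pd

TIMESTAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"

# Two created/closed pairs copied from the incident tables shown in this README
toy = pd.DataFrame({
    "sys_created_on": ["12/09/2021 09:33:23 AM", "10/27/2017 08:55:53 AM"],
    "closed_at": ["12/09/2021 09:49:04 AM", "11/01/2017 11:48:06 AM"],
})

created = pd.to_datetime(toy["sys_created_on"], format=TIMESTAMP_FORMAT)
closed = pd.to_datetime(toy["closed_at"], format=TIMESTAMP_FORMAT)

# Elapsed time in minutes, computed for the whole column at once
toy["minutes_open"] = (closed - created).dt.total_seconds() / 60
toy["quickly_closed"] = toy["minutes_open"] <= 60
print(toy[["minutes_open", "quickly_closed"]])
```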

From here we can tell that only a very small number of incidents get closed in less than an hour. This specific incident is a password reset, so let's look at the comments to see whether it was user error or whether the support center helped them reset the password.

from IPython.display import display, HTML

def print_comments(df):
    # Iterate over the dataframe that was passed in rather than the global incidents
    for i,row in df[df['quickly_closed'] == "true"].iterrows():
        display( HTML( pd.DataFrame({'comments_and_work_notes': [row['comments_and_work_notes']]}).to_html().replace("\\n","<br>") ) )

Based on the following, we can determine that this wasn't an incident that needed to be resolved by the development team, but rather an expected outage that eventually resolved and enabled the customer to log in:

  1. There is an intermittent outage with Self Service Reset Tool when users are trying to reset their password using email. This is not consistent behavior and will resolve itself shortly.
  2. Was able to log in again

With this information, we can go ahead and flag this particular incident as a FalseIncident.

with pd.option_context('display.max_colwidth', None):
    print(incidents[incidents['quickly_closed'] == "true"].comments_and_work_notes)
326    12/09/2021 09:49:04 AM - Yacine Tazi (Additional comments)\nWas able to log in again\n\n12/09/2021 09:47:40 AM - System (Additional comments)\nWe are continuing to investigate the underlying issue:\n\n\nWHAT IS HAPPENING?\nThere is an intermittent outage with Self Service Reset Tool when users are trying to reset their password using email. This is not consistent behavior and will resolve itself shortly. \n\nWHO IS IMPACTED?\nAnyone that need to reset their NID password using the email functionality. \n\nWHAT ARE WE DOING ABOUT IT?\nWe are currently investigating this issue.\n\nWHAT HAPPENS NEXT?\nWe are currently investigating and will keep everyone posted once the issue is resolved. \n\nWHAT DO I NEED TO DO?\nShould users encounter this issue during password reset, please wait 15-20 minutes and try again. \n\n\n\n12/09/2021 09:34:35 AM - Diego Cruces (Work notes)\nRouting to the Custom Application Development team for further investigation.\n\n
Name: comments_and_work_notes, dtype: object

Add False Incident Feature

add_false_incidents(incidents, 'quickly_closed', 'true')
incidents[incidents['FalseIncident'] == "true"]
category short_description description assignment_group u_inc_dept sys_created_on closed_at comments_and_work_notes FalseIncident quickly_closed
326 Software Unable to reset NID password User called in to report that he has been unab... Custom Application Development CCIE DEAN 12/09/2021 09:33:23 AM 12/09/2021 09:49:04 AM 12/09/2021 09:49:04 AM - Yacine Tazi (Addition... true true

Category type is hardware

Incidents categorized as Hardware are assumed to be false incidents, since the software development team doesn't deal with hardware issues.

add_false_incidents(incidents, 'category', 'Hardware')
#incidents[incidents['FalseIncident'] == "true"]

# Set the selected indices to True
#incidents.loc[incidents['FalseIncident'] == False, 'FalseIncident'] = np.random.choice([True, False], size=incidents['FalseIncident'].shape[0], p=[0.5, 0.5])

# store to training data
train_data = incidents[incidents['FalseIncident'] == "true"].copy()
train_data
category short_description description assignment_group u_inc_dept sys_created_on closed_at comments_and_work_notes FalseIncident quickly_closed
2 Hardware Unable to access http://directory.sdes.ucf.edu... Unable to access site Custom Application Development NaN 10/27/2017 08:55:53 AM 11/01/2017 11:48:06 AM 11/01/2017 11:48:06 AM - System (Additional co... true false
11 Hardware Students are unable to upload forms to online ... Students are required to upload their involvem... Custom Application Development NaN 12/04/2017 01:13:26 PM 12/12/2017 11:48:05 AM 12/12/2017 11:48:06 AM - System (Additional co... true false
121 Hardware The Exchange Unified Messaging voicemail assig... The Exchange Unified Messaging voicemail assig... Custom Application Development UCF IT 03/13/2019 06:38:29 AM 03/21/2019 03:48:13 PM 03/21/2019 03:48:13 PM - System (Additional co... true false
210 Hardware Can't access DHCP reservations or do anythong ... I got the new URL my.it.ucf.edeu and I can get... Custom Application Development FINANCIAL AFFAIRS 02/19/2020 09:41:48 AM 02/24/2020 12:48:09 PM 02/24/2020 12:48:09 PM - System (Additional co... true false
292 Hardware We are not able to access lead.sdes.ucf.edu/ad... All computers in the office are getting the sa... Custom Application Development SDES STU LEADERSHIP DEVELOP 08/19/2021 08:47:17 AM 08/26/2021 11:48:14 AM 08/26/2021 11:48:14 AM - System (Additional co... true false
326 Software Unable to reset NID password User called in to report that he has been unab... Custom Application Development CCIE DEAN 12/09/2021 09:33:23 AM 12/09/2021 09:49:04 AM 12/09/2021 09:49:04 AM - Yacine Tazi (Addition... true true

Using Cosine Similarity

Create a Feature Matrix

Here we'll create a feature matrix based on the short description values of known false incidents (i.e. FalseIncident == "true"). From there we can compute a similarity score for all other submitted incidents to see if we can identify other false incidents.

from sklearn.feature_extraction.text import TfidfVectorizer

def createTfid(train_data):
    # Choose min and max word sequences (n-grams of 1 to 5 words)
    vectorizer = TfidfVectorizer(ngram_range=(1,5))

    #change this to only vectorize on known incidents
    #false_incidents = incidents[incidents['FalseIncident'] == "true"]
    false_incidents = train_data.copy()

    tfid_known_false_incidents = vectorizer.fit_transform(false_incidents['short_description'] )
    return tfid_known_false_incidents, false_incidents, vectorizer


# Choose min and max word sequences
#vectorizer = TfidfVectorizer(ngram_range=(1,5))

#change this to only vectorize on known incidents
#false_incidents = train_data.copy()

#tfid_known_false_incidents = vectorizer.fit_transform(false_incidents['short_description'] )
tfid_known_false_incidents, false_incidents, vectorizer = createTfid(train_data)
tfid_known_false_incidents
<6x316 sparse matrix of type '<class 'numpy.float64'>'
	with 339 stored elements in Compressed Sparse Row format>
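To make the shape of such a matrix concrete, here is a small sketch on two made-up descriptions. It uses `ngram_range=(1, 2)` instead of `(1, 5)` purely to keep the vocabulary small enough to read; the documents are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy descriptions (hypothetical)
docs = ["unable to reset password", "unable to access site"]

# Unigrams and bigrams only, to keep the vocabulary readable
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
toy_matrix = toy_vectorizer.fit_transform(docs)

# One row per document, one column per distinct n-gram
print(toy_matrix.shape)
print(sorted(toy_vectorizer.vocabulary_))
```

Each row is one description and each column is one distinct word or word sequence, which is why six short descriptions with n-grams up to length five already produce several hundred columns.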

Generate the Similarities

from sklearn.metrics.pairwise import cosine_similarity
# then compute similarities using cosine_similarity against all known false incidents
def search(incident, tfid_known_false_incidents, false_incidents, vectorizer):

    if incident.name in false_incidents.index:
       return
    
    desc = clean_description(incident["short_description"])
    query_vec = vectorizer.transform([desc]) 
    
    # compare the description to the knownIncidents list
    similarity = cosine_similarity(query_vec, tfid_known_false_incidents).flatten()

    # If there are any items with a > 0.5 similarity, add them to the list
    indices = np.where(similarity > 0.5)[0]

    # Remove the current item from the list so that you don't get a 1.0 similarity (i.e. itself)
    current_index = incident.name - 1
    indices = indices[indices != current_index]
    
    same_incident_indices = np.where(indices >= len(false_incidents))[0]
    indices = indices[indices < len(false_incidents)]

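    # Reverse the matches and copy so the score column can be added safely
    results = false_incidents.iloc[indices].iloc[::-1].copy()


    if not results.empty:
        # Add similarity score feature, reversed to stay aligned with the reversed rows
        results["similarity_score"] = similarity[indices][::-1]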

        return results

# add a new column to the incidents dataframe that contains non-empty dataframes with similar incidents
#incidents["similar_incidents"] = incidents.apply(search, axis=1)
#incidents["similar_incidents"] = incidents.apply(search, args=(tfid_known_false_incidents), axis=1)
incidents["similar_incidents"] = incidents.apply(search, args=(tfid_known_false_incidents,false_incidents, vectorizer ,), axis=1)



non_empty_similar_incidents = incidents.dropna(subset=["similar_incidents"])
similarity_scores = non_empty_similar_incidents["similar_incidents"].apply(lambda x: x["similarity_score"])
non_empty_similar_incidents
category short_description description assignment_group u_inc_dept sys_created_on closed_at comments_and_work_notes FalseIncident quickly_closed similar_incidents
212 Software User called in to report that he is unable to ... User called in to report that he is unable to ... Custom Application Development NaN 02/24/2020 12:45:58 PM 04/10/2020 08:48:08 AM 04/10/2020 08:48:08 AM - System (Additional co... False false category short_description ...
268 Software Unable to reset NID Password due to webpage ou... User's password has expired and they attempted... Custom Application Development NaN 03/08/2021 04:48:03 PM 03/15/2021 01:48:02 PM 03/15/2021 01:48:02 PM - System (Additional co... False false category short_description ...
272 Software NID Password reset for account (NID): da909465 User is having issues being able to reset his ... Custom Application Development HOSPITALITY MANAGEMENT DEAN 03/09/2021 01:30:01 PM 03/15/2021 01:48:08 PM 03/15/2021 01:48:08 PM - System (Additional co... False false category short_description ...
318 Software User is unable to reset knights mail due to it... User states when trying to reset his password ... Custom Application Development NaN 12/07/2021 07:31:51 PM 12/14/2021 02:48:04 PM 12/14/2021 02:48:04 PM - System (Additional co... False false category short_description ...
325 Software Unable to reset NID password User called in to report that she has been una... Custom Application Development NaN 12/09/2021 09:27:13 AM 01/21/2022 08:48:06 AM 01/21/2022 08:48:06 AM - System (Additional co... False false category short_description ...
335 Software Students are not able to upload documents to t... Students are trying to go to their profile and... Custom Application Development SDES STU LEADERSHIP DEVELOP 01/18/2022 03:53:46 PM 02/09/2022 08:48:15 AM 02/09/2022 08:48:15 AM - System (Additional co... False false category ...

Similarity Scores

The table below displays each incident that has a similarity score > 0.5 with any known false incident. Each NaN entry marks a known false incident whose similarity with the current row did not exceed the threshold.

similarity_scores
          326       292
212  0.536588       NaN
268  0.988003       NaN
272  0.517992       NaN
318  0.506493       NaN
325  1.000000       NaN
335       NaN  0.513134
#non_empty_similar_incidents['similar_incidents']
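The thresholding idea can be sketched in isolation. In the example below the known and candidate descriptions are made up, and the names `known`, `candidates`, and `THRESHOLD` are hypothetical; only the 0.5 cut-off is taken from the search function above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy data (hypothetical)
known = ["unable to reset nid password", "unable to access site"]
candidates = ["unable to reset nid password", "printer is out of toner"]

toy_vectorizer = TfidfVectorizer()
known_matrix = toy_vectorizer.fit_transform(known)

# Rows are candidates, columns are known false incidents
scores = cosine_similarity(toy_vectorizer.transform(candidates), known_matrix)

THRESHOLD = 0.5
matches = scores > THRESHOLD
print(np.round(scores, 3))
print(matches)
```

A candidate that shares no vocabulary with the known set is transformed to an all-zero vector, so its similarity to every known incident is 0.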

Finding False Incidents Through Recursive Similarity

def findFalseOnes(df):
    
    new = incidents.copy()
    tfid_known_false_incidents, false_incidents, vectorizer = createTfid(df)
    
    new["similar_incidents"] = new.apply(search, args=(tfid_known_false_incidents,false_incidents, vectorizer ,), axis=1)
    not_empty = new.dropna(subset=["similar_incidents"])
    sim_scores = not_empty["similar_incidents"].apply(lambda x: x["similarity_score"])
    
    return not_empty, sim_scores

p = non_empty_similar_incidents.drop(['similar_incidents'], axis=1)
#e,b = findFalseOnes(p)

count = 0
df_found = pd.DataFrame()
df_sim = pd.DataFrame()
while(not p.empty):
    p, b = findFalseOnes(p)
    
    if(p.empty):
        break
    count = count + p.shape[0]
    df_found = pd.concat([df_found, p])
    df_sim = pd.concat([df_sim, b])

    p = df_found


print(count)
df_found

7
category short_description description assignment_group u_inc_dept sys_created_on closed_at comments_and_work_notes FalseIncident quickly_closed similar_incidents
292 Hardware We are not able to access lead.sdes.ucf.edu/ad... All computers in the office are getting the sa... Custom Application Development SDES STU LEADERSHIP DEVELOP 08/19/2021 08:47:17 AM 08/26/2021 11:48:14 AM 08/26/2021 11:48:14 AM - System (Additional co... true false category ...
326 Software Unable to reset NID password User called in to report that he has been unab... Custom Application Development CCIE DEAN 12/09/2021 09:33:23 AM 12/09/2021 09:49:04 AM 12/09/2021 09:49:04 AM - Yacine Tazi (Addition... true true category ...
212 Software User called in to report that he is unable to ... User called in to report that he is unable to ... Custom Application Development NaN 02/24/2020 12:45:58 PM 04/10/2020 08:48:08 AM 04/10/2020 08:48:08 AM - System (Additional co... False false category short_description ...
268 Software Unable to reset NID Password due to webpage ou... User's password has expired and they attempted... Custom Application Development NaN 03/08/2021 04:48:03 PM 03/15/2021 01:48:02 PM 03/15/2021 01:48:02 PM - System (Additional co... False false category short_description ...
318 Software User is unable to reset knights mail due to it... User states when trying to reset his password ... Custom Application Development NaN 12/07/2021 07:31:51 PM 12/14/2021 02:48:04 PM 12/14/2021 02:48:04 PM - System (Additional co... False false category short_description ...
325 Software Unable to reset NID password User called in to report that she has been una... Custom Application Development NaN 12/09/2021 09:27:13 AM 01/21/2022 08:48:06 AM 01/21/2022 08:48:06 AM - System (Additional co... False false category short_description ...
335 Software Students are not able to upload documents to t... Students are trying to go to their profile and... Custom Application Development SDES STU LEADERSHIP DEVELOP 01/18/2022 03:53:46 PM 02/09/2022 08:48:15 AM 02/09/2022 08:48:15 AM - System (Additional co... False false category ...
df_sim
          335       325  268       326       292
292  0.553778       NaN  NaN       NaN       NaN
326       NaN  0.543881  1.0       NaN       NaN
212       NaN       NaN  NaN  0.649939       NaN
268       NaN       NaN  NaN  0.984638       NaN
318       NaN       NaN  NaN  0.595599       NaN
325       NaN       NaN  NaN  1.000000       NaN
335       NaN       NaN  NaN       NaN  0.575907
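One way to think about the repeated search is as growing a set of known false incidents until a round finds nothing new. The sketch below is a simplified variant of that idea, not the loop above: it refits TF-IDF on the whole known set each round, and the function name `expand_known_set` and the toy descriptions are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def expand_known_set(descriptions, seed_indices, threshold=0.5):
    """Grow the set of known indices until a round adds nothing new."""
    known = set(seed_indices)
    while True:
        unknown = [i for i in range(len(descriptions)) if i not in known]
        if not unknown:
            return known
        # Refit on the current known set so newly found wording joins the vocabulary
        vectorizer = TfidfVectorizer()
        known_matrix = vectorizer.fit_transform([descriptions[i] for i in sorted(known)])
        scores = cosine_similarity(
            vectorizer.transform([descriptions[i] for i in unknown]), known_matrix
        )
        newly_found = {i for i, row in zip(unknown, scores) if row.max() > threshold}
        if not newly_found:
            return known  # fixed point reached
        known |= newly_found


# Toy data (hypothetical): index 2 only becomes reachable after index 1 is added
toy_descriptions = [
    "reset nid password",
    "reset nid password email link",
    "email link broken",
    "printer out of toner",
]
found = expand_known_set(toy_descriptions, seed_indices=[0])
print(sorted(found))
```

Because the known set can only grow and is bounded by the number of incidents, the loop is guaranteed to stop. The trade-off is that every round can pull in incidents that are further from the original seeds.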

Using Naive Bayes Classifier

Here we use a Naive Bayes classifier to attempt to make predictions based on input. However, there is not enough data (the training set contains only the six known false incidents), so it always predicts a false incident, which yields false positives: incidents flagged as false that are not.
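The effect of training on a single label can be reproduced in a few lines. This is a minimal sketch with made-up training text; the variable names are hypothetical. When every training row carries the same label, the fitted classifier knows only that one class and returns it for any input.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (hypothetical): every row has the same label
toy_train_text = ["unable to reset password", "unable to access site"]
toy_train_labels = ["true", "true"]

toy_vectorizer = TfidfVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_vectorizer.fit_transform(toy_train_text), toy_train_labels)

# Only one class was ever seen, so it is the only possible prediction
print(toy_classifier.classes_)
toy_prediction = toy_classifier.predict(
    toy_vectorizer.transform(["avengers are here to stay"])
)
print(toy_prediction)
```

A training set that also includes incidents labeled "False" would be needed before the classifier could separate the two cases.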

Preprocess the text

import nltk
from nltk.stem import WordNetLemmatizer

# Download the stopword list before loading it
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')

# The WordNet corpus had to be installed separately, following this guide:  https://stackoverflow.com/questions/13965823/resource-corpora-wordnet-not-found-on-heroku
lemmatizer = WordNetLemmatizer()
#train_data

## CREDIT:  https://www.analyticsvidhya.com/blog/2021/09/creating-a-movie-reviews-classifier-using-tf-idf-in-python/

train_X_non = train_data['short_description']# + " " + train_data['description']   # incident text used as the input feature
train_y = train_data['FalseIncident']   # label ("true" = false incident, "False" = not a false incident)
test_X_non = incidents['short_description']# + " " + incidents['description']
test_y = incidents['FalseIncident']
train_X=[]
test_X=[]

def processText(text, i=0):
    processed_text = re.sub('[^a-zA-Z]', ' ', text[i])
    processed_text = processed_text.lower()
    processed_text = processed_text.split()
    processed_text = [lemmatizer.lemmatize(word) for word in processed_text if not word in set(stopwords)]
    processed_text = [' '.join(processed_text)]
    return processed_text

#text pre processing
for i in range(0, len(train_X_non)):
    review = re.sub('[^a-zA-Z]', ' ', train_X_non.iloc[i])
    review = review.lower()
    review = review.split()
    review = [lemmatizer.lemmatize(word) for word in review if not word in set(stopwords)]
    review = ' '.join(review)
    train_X.append(review)
    
#text pre processing
for i in range(0, len(test_X_non)):
    review = re.sub('[^a-zA-Z]', ' ', test_X_non.iloc[i])
    review = review.lower()
    review = review.split()
    review = [lemmatizer.lemmatize(word) for word in review if not word in set(stopwords)]
    review = ' '.join(review)
    test_X.append(review)
    
print(train_X[3])
[nltk_data] Downloading package stopwords to /home/strick/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


access dhcp reservation anythong dhcp
#tf idf
tf_idf = TfidfVectorizer()
#fitting tf idf on the training data and transforming it in one step
X_train_tf = tf_idf.fit_transform(train_X)

print("n_samples: %d, n_features: %d" % X_train_tf.shape)

#transforming test data into tf-idf matrix
X_test_tf = tf_idf.transform(test_X)
print("n_samples: %d, n_features: %d" % X_test_tf.shape)
n_samples: 6, n_features: 40
n_samples: 355, n_features: 40

Run algo

from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# REF:  https://builtin.com/data-science/precision-and-recall
#naive bayes classifier
naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(X_train_tf, train_y)
#predicted y
y_pred = naive_bayes_classifier.predict(X_test_tf)

#Prediction is complete. Now, we print the classification report.

# Labels sort as 'False', 'true', so 'False' (not a false incident) is listed first
print(metrics.classification_report(test_y, y_pred, target_names=['NotFalse', 'FalseIncident']))
               precision    recall  f1-score   support

     NotFalse       0.00      0.00      0.00       349
FalseIncident       0.02      1.00      0.03         6

     accuracy                           0.02       355
    macro avg       0.01      0.50      0.02       355
 weighted avg       0.00      0.02      0.00       355



/home/strick/.local/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
print("Confusion matrix:")
print(metrics.confusion_matrix(test_y, y_pred))
Confusion matrix:
[[  0 349]
 [  0   6]]

Based on this confusion matrix, my model is not predicting very well. It predicts that every incident is a false incident: there are 0 instances where I predicted an incident not to be false, 6 instances where I correctly predicted a known false incident, and 349 (all others) where I said it was a false incident but it's not!
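Reading the matrix is easier when the label order is fixed explicitly. The sketch below rebuilds label lists with the same counts as the matrix above (349 labeled "False", 6 labeled "true", everything predicted "true") and treats "true" as the positive class, which is an assumption made for this illustration.

```python
from sklearn.metrics import confusion_matrix

# Label lists rebuilt from the counts reported above
y_true_toy = ["False"] * 349 + ["true"] * 6
y_pred_toy = ["true"] * 355

# Rows are actual labels, columns are predicted labels, in the order given by `labels`
toy_cm = confusion_matrix(y_true_toy, y_pred_toy, labels=["False", "true"])
tn, fp, fn, tp = (int(v) for v in toy_cm.ravel())
print(toy_cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

With this ordering the top-right cell holds the false positives (incidents flagged as false that are not) and the bottom-right cell holds the correctly flagged false incidents.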

# Lets do a prediction
test = ["avengers are here to stay in the world of us"]


review = re.sub('[^a-zA-Z]', ' ', test[0])
review = review.lower()
review = review.split()
review = [lemmatizer.lemmatize(word) for word in review if not word in set(stopwords)]
test_processed =[ ' '.join(review)]

#test_processed = processText("This is unlike any kind of adventure movie my eyes")
test_processed
['avenger stay world u']
test_input = tf_idf.transform(test_processed)
test_input.shape
(1, 40)
res=naive_bayes_classifier.predict(test_input)[0]
res
'true'

Clustering Similarity Scores (NO), cluster based on full incident for unsupervised learning

from sklearn.cluster import KMeans

# Convert the similarity scores to a numpy array
similarity_scores_array = similarity_scores.values.reshape(-1, 1)
#similarity_scores_array = similarity_scores.dropna().values.reshape(-1, 1)
similarity_scores_array


def simplify_category(df):
    df['category']=pd.get_dummies(df.category).drop('Software',axis=1)
    return df

def simplify(df, col_name, col, value):
    df[col_name]=pd.get_dummies(col).drop(value,axis=1)
    return df
tmp = simplify_category(incidents)
def drop_features(df):
    return df.drop(['short_description', 'description', 'sys_created_by', 'u_inc_dept', 'sys_created_on', 'closed_at', 'comments_and_work_notes', 'similar_incidents'], axis=1)

#tmp = drop_features(incidents)

tmp = simplify(tmp, 'quickly_closed', tmp.quickly_closed, 'false')
tmp = simplify(tmp, 'FalseIncident', tmp.FalseIncident, 'False')
# Initialize a k-means object with the desired number of clusters
k = 2
kmeans = KMeans(n_clusters=k, init='k-means++')

# Fit the k-means model to the TF-IDF matrix of all incidents and
# get the cluster assignment for each incident
cluster_labels = kmeans.fit_predict(X_test_tf)

print(cluster_labels)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0
 0 0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 1 0 0 0 0 1 1
 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0
 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]


/home/strick/.local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  warnings.warn(
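The warning above can be avoided by passing `n_init` explicitly. The sketch below shows this on a handful of made-up descriptions; `n_init=10` matches the default named in the warning, while `random_state=0` and the toy documents are assumptions added so that repeated runs give the same clustering.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy descriptions (hypothetical)
toy_docs = [
    "unable to reset nid password",
    "password reset link not working",
    "website returns error on form upload",
    "form upload fails on website",
]
toy_X = TfidfVectorizer().fit_transform(toy_docs)

# n_init is set explicitly; random_state makes the run repeatable
toy_kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
toy_labels = toy_kmeans.fit_predict(toy_X)
print(toy_labels)
```

`fit_predict` both fits the model and returns the cluster labels, so a separate call to `fit` beforehand is not needed.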
import matplotlib.pyplot as plt
 
#filter rows of original data
filtered_label0 = X_test_tf[cluster_labels == 0]
filtered_label0

filtered_label0 = filtered_label0.toarray()

#plotting the results (first two TF-IDF features of cluster 0)
plt.scatter(filtered_label0[:, 0], filtered_label0[:, 1])
#plt.show()
<matplotlib.collections.PathCollection at 0x7fc1dd3b9090>
