docs(thesis): completed the Project Specifications Design and Prototype (#5)

* fix(sherlock): Fixed typo in metric name
* docs: Added first 2 chapters of PSPD
* Added Images for implementation
* Completed up to design goals
* Completed up to Summary of technology
* Before spelling fixing
* Wrote about Lazy-Koala Operator
* Wrote about n-tier
* Fixed Some typos
* Before showing to sir
* Completed all the sections
* Before formatting
* Fixed captions
* Before setting operator \ac
* Added Operator \ac
* Refactoring
* Fixed all the grammar mistakes
* Final version
* Add a missing colon
* Fixed the toc depth
* Removed .tex from github indexes
Showing 145 changed files with 3,549 additions and 9 deletions.
.gitattributes
@@ -0,0 +1,2 @@
*.tex linguist-detectable=false
*.ipynb linguist-detectable=false
.gitignore
@@ -2,10 +2,6 @@
**/.idea/**

# Project Proposal
!proposal/project-proposal.pdf

## Python
**/__pycache__/**
**/venv/**
@@ -0,0 +1,26 @@
\chapter*{List of Acronyms}

\begin{acronym}
% \acro{iaas}[IaaS]{Infrastructure as a Service}
\acro{sres}[SRE]{Site Reliability Engineer}
\acro{sli}[SLI]{Service Level Indicator}
\acro{sre}[SRE]{Site Reliability Engineering}
\acro{apm}[APM]{Application Performance Monitoring}
\acro{mttr}[MTTR]{Mean Time To Recovery}
% \acro{gan}[GAN]{Generative adversarial networks}
% \acro{hhmm}[HHMM]{Hierarchical hidden Markov model}
% \acro{fsl}[FSL]{Few-shot Learning}
% \acro{sdlc}[SDLC]{Software Development Life Cycle}
% \acro{ooad}[OOAD]{Object-oriented analysis and design}
\acro{ebpf}[eBPF]{Extended Berkeley Packet Filter}
% \acro{sla}[SLA]{Service-Level Agreement}
% \acro{saas}[SaaS]{Software as a service}
% \acro{vm}[VM]{Virtual Machine}
% \acro{cncf}[CNCF]{Cloud Native Computing Foundation}

\acro{aiops}[AIOps]{Artificial Intelligence for IT operations}
% \acro{sre}[SRE]{Site Reliability Engineering}
\acro{gazer}[Gazer]{Telemetry extraction agent}
\acro{sherlock}[Sherlock]{AI-engine}
\acro{lazy-koala-operator}[Operator]{Lazy Koala Resource Manager}
\end{acronym}
File renamed without changes
Binary file added (BIN, +2.85 KB): documentation/PSPD/assets/implementation/visualize-representation.png
File renamed without changes
Binary file added (BIN, +50.3 KB): documentation/PSPD/assets/literature-review/Container-orchestration-engines.png
Binary file added (BIN, +305 KB): documentation/PSPD/assets/literature-review/containers-vs-virtual-machines.jpg
Binary file added (BIN, +30.6 KB): documentation/PSPD/assets/literature-review/num-of-anomaly-detection-papers.jpg
File renamed without changes
Binary file added (BIN, +34.2 KB): documentation/PSPD/assets/requirement-specification/contex-digram.png
Binary file added (BIN, +47.8 KB): documentation/PSPD/assets/requirement-specification/poc-autoencoder.png
@@ -0,0 +1,18 @@
\chapter*{Abstract}

Cloud computing has shown considerable growth in the past few years due to its scalability and ease of use. With this change, a new programming paradigm called cloud-native was born. Cloud-native applications are often developed as a set of stand-alone microservices, yet they may depend on each other to provide a unified experience. Even though microservices introduce a lot of benefits when it comes to flexibility and scalability, they can be a nightmare to operate in production. Specifically, when operating a large system with hundreds of microservices talking to each other, the smallest problem can result in failures all around the system.

% Cloud computing is a steady rise for the past few years due to its scalability and ease of use. With this change, a new programming paradigm called cloud-native was born. Cloud-native applications are often developed as a set of stand-alone microservices yet, it could depend on each other to provide a unified experience.

% This helps different teams to work on different services which increases the development velocity. This works well for medium to large companies but over time this mesh of services could become very complicated to a point where it's very difficult for a single person to understand the entire system. When the system consists of thousands of individual services talking and depending on each other, the network layer of that system becomes chaotic. A failure in a single point could create a ripple effect across the entire system. When something like that happens it could take a considerable amount of time to zero in on the exact point of the failure.

The focus of this project is twofold. First, the authors introduce a robust Kubernetes-native toolkit that helps both researchers and developers collect and process service telemetry data with zero instrumentation. Second, the authors propose a novel way of detecting anomalies by encoding raw metric data into an image-like structure and using a convolutional autoencoder to learn the general data distribution for each service and detect outliers. Finally, a weighted graph is used along with the previously calculated anomaly scores to find possible root causes for any system anomaly.

Initial test results show that the telemetry extraction components are both resilient and lightweight even under sustained load, while the anomaly prediction algorithm appears to converge on its target learning goals.
\newline
\newline
\textbf{Keywords}:
AIOps, Monitoring, Disaster Recovery, eBPF, Kubernetes
\newline
\textbf{Subject Descriptors}:
• Computing methodologies $\rightarrow$ Machine learning $\rightarrow$ Learning paradigms $\rightarrow$ Unsupervised learning $\rightarrow$ Anomaly detection • Computer systems organization $\rightarrow$ Architectures $\rightarrow$ Distributed architectures $\rightarrow$ Cloud computing
@@ -0,0 +1,6 @@
\chapter*{Appendix}
\begin{appendices}
\input{chapters/appendix/use-case-description}
\input{chapters/appendix/poc}
\input{chapters/appendix/prometheus-dashboard}
\end{appendices}
@@ -0,0 +1,6 @@
\chapter{Proof of Concept Results}\label{appendix:poc-results}

\begin{figure}[H]
\includegraphics[width=14cm]{assets/appendix/poc-results.png}
\caption{Proof of concept results (self-composed)}
\end{figure}
6 changes: 6 additions & 0 deletions
documentation/PSPD/chapters/appendix/prometheus-dashboard.tex
@@ -0,0 +1,6 @@
\chapter{Prometheus Dashboard}\label{appendix:prometheus-dashboard}

\begin{figure}[H]
\includegraphics[width=16.5cm]{assets/appendix/prometheus-dashboard.png}
\caption{Prometheus dashboard with collected data (self-composed)}
\end{figure}
126 changes: 126 additions & 0 deletions
documentation/PSPD/chapters/appendix/use-case-description.tex
@@ -0,0 +1,126 @@
{\let\clearpage\relax\chapter{Use Case Descriptions}\label{appendix:use-case-description}}

\UseCaseDescription
{UC-01}
{Deploy Lazy Koala}
{Install the \ac{lazy-koala-operator} on a Kubernetes cluster}
{Reliability Engineer}
{\begin{CompactItemizes}
\item A running Kubernetes cluster.
\item kubectl installed and configured to talk to the cluster.
\item Helm CLI installed.
\end{CompactItemizes}}
{N/A}
{N/A}
{\begin{CompactEnumerate}
\item Add the Helm remote.
\item Run the helm install command.
\item Kube API acknowledges the changes.
\item Display the content of NOTES.txt.
\end{CompactEnumerate}}
{{\begin{CompactEnumerate}
\item Apply the Kubernetes manifest found in the code repository.
\item Kube API acknowledges the changes.
\end{CompactEnumerate}}
{\textbf{E1}: The \ac{lazy-koala-operator} couldn’t achieve the desired state
\vspace{-4mm}\begin{enumerate}
\item The \ac{lazy-koala-operator} retries to achieve the desired state with an exponential backoff
\vspace{-7mm}\end{enumerate}}
{\begin{CompactItemizes}
\item \ac{lazy-koala-operator} deployed on the cluster.
\item An instance of \ac{gazer} deployed on every node.
\item New permission rules registered with the Kube API.
\end{CompactItemizes}}}

\vspace{-2em}
\UseCaseDescription
{UC-02}
{Update Configuration}
{Add or remove a service from the monitored list.}
{Reliability Engineer}
{\begin{CompactItemizes}
\item kubectl installed and configured to talk to a Kubernetes cluster.
\item The Kubernetes cluster has the \ac{lazy-koala-operator} deployed.
\item Established port-forwarding connection with the \ac{lazy-koala-operator}.
\end{CompactItemizes}}
{N/A}
{N/A}
{\begin{CompactEnumerate}
\item Visit the forwarded port on the local machine.
\item Open the “Services” tab.
\item Click Attach Inspector.
\item Select the namespace and the service.
\item Click Attach.
\item Status update sent to the Kube API.
\end{CompactEnumerate}}
{{\begin{CompactEnumerate}
\item Visit the forwarded port on the local machine.
\item Open the “Services” tab.
\item Scroll to the relevant record.
\item Press the delete button next to the name.
\item Confirm the action.
\item Status update sent to the Kube API.
\end{CompactEnumerate}}
{\textbf{E1}: Kube API not available
\vspace{-4mm}\begin{enumerate}
\item Show an error to the user asking to retry in a bit.
\vspace{-7mm}\end{enumerate}}
{\begin{CompactItemizes}
\item A new Inspector resource is attached to the service.
\end{CompactItemizes}}}

\vspace{-2em}
\UseCaseDescription
{UC-03}
{Purge Lazy Koala}
{Remove Lazy Koala from a Kubernetes cluster.}
{Reliability Engineer}
{\begin{CompactItemizes}
\item kubectl installed and configured to talk to a Kubernetes cluster.
\item The Kubernetes cluster has the \ac{lazy-koala-operator} deployed.
\end{CompactItemizes}}
{N/A}
{N/A}
{\begin{CompactEnumerate}
\item Find the Helm release name.
\item Run helm uninstall <release name>.
\end{CompactEnumerate}}
{{\begin{CompactEnumerate}
\item Locate the Kubernetes manifest found in the code repository.
\item Run kubectl delete -f <manifest-file>.
\end{CompactEnumerate}}
{N/A}
{\begin{CompactItemizes}
\item All the resources provisioned by Lazy Koala, including the \ac{lazy-koala-operator} itself, are removed from the cluster.
\end{CompactItemizes}}}

\vspace{-2em}
\UseCaseDescription
{UC-11}
{Reconcile on modified resources}
{Whenever a resource owned by the \ac{lazy-koala-operator} gets modified, kubelet invokes the reconciliation loop on the \ac{lazy-koala-operator}.}
{Kubelet}
{\begin{CompactItemizes}
\item \ac{lazy-koala-operator} is deployed
\end{CompactItemizes}}
{Read the cluster state}
{N/A}
{\begin{CompactEnumerate}
\item Resources get modified.
\item Kubelet invokes a reconciliation loop on the \ac{lazy-koala-operator}.
\item Check if the change is interesting.
\item Update child resources accordingly.
\end{CompactEnumerate}}
{{\begin{CompactEnumerate}
\item Resources get modified.
\item Kubelet invokes a reconciliation loop on the \ac{lazy-koala-operator}.
\item Check if the change is interesting.
\item Stop execution.
\end{CompactEnumerate}}
{\textbf{E1}: Error while reconciling
\vspace{-4mm}\begin{enumerate}
\item Retry with exponential backoff.
\vspace{-7mm}\end{enumerate}}
{\begin{CompactItemizes}
\item Cluster in the new desired state.
\end{CompactItemizes}}}
3 changes: 3 additions & 0 deletions
documentation/PSPD/chapters/implementation/chapter-overview.tex
@@ -0,0 +1,3 @@
\section{Chapter Overview}

This chapter focuses on making the proposed system a reality. In this chapter, the author will talk about the tools and technologies they relied on to complete the working prototype, along with the reasoning behind those choices. Then the author will share their experience implementing the core functionality of the system in line with their design goals. Finally, the chapter will conclude with a self-reflection on achievements.
3 changes: 3 additions & 0 deletions
documentation/PSPD/chapters/implementation/chapter-summary.tex
@@ -0,0 +1,3 @@
\section{Chapter Summary}

In this chapter, the author shared their experiences and findings while implementing the proposed system. At the start, the author broke down the entire tech stack and explained all the tools and technologies used to build this project. Then the inner workings of the three core components were explained. Finally, the chapter concluded with a self-reflection where the author talked about the current results of the project.
92 changes: 92 additions & 0 deletions
documentation/PSPD/chapters/implementation/core-functionalities.tex
@@ -0,0 +1,92 @@
\section{Implementation of Core Functionalities}

This project contains three components that work together to make up the entire system. In this section, the inner workings of each of those components are explained.

\subsection{Lazy Koala Resource Manager (Operator)}

The \ac{lazy-koala-operator} is the heart of the entire project and is responsible for binding all other components together. In the context of Kubernetes, an operator is an agent running on the cluster that is responsible for keeping one or more dedicated resources in sync with their desired state.

For example, Kubernetes has a built-in resource named "Pod", which is the smallest deployable object in Kubernetes. When a system administrator asks the Kube-API to create a pod out of a certain Docker container, the Kube-API will create a resource object and attach it to the pod operator. Once that is done, the pod operator will parse the pod resource specification and create a pod out of it. If for some reason the pod crashes, or the administrator changes the specification of the pod, the operator will be notified and will re-run its reconciliation function to match the observed state with the desired state.

\begin{figure}[H]
\includegraphics[width=11.5cm]{assets/implementation/kubernetes-control-loop.png}
\caption{Kubernetes control loop \citep{hausenblas2019programming}}
% \label{fig:reconcile-loop}
\end{figure}

Coming back to the \ac{lazy-koala-operator}: it has a Custom Resource Definition (CRD) called Inspector, whose specification has three required values: a Deployment reference, a DNS reference, and a URL for downloading the model fine-tuned for that specific deployment. Once such a resource is deployed, the \ac{lazy-koala-operator} first gets the pods related to the Deployment and populates the "scrapePoints" data structure with the IP address of each pod. It then resolves the IP address mapped to the DNS reference and appends that to the "scrapePoints". Next, the \ac{lazy-koala-operator} compares the new scrapePoints hashmap with the one created in the previous iteration and identifies which points need to be added to and which need to be removed from the "gazer-config". After that, the \ac{lazy-koala-operator} pulls down the "gazer-config" ConfigMap, runs it through the calculated changelog, and sends a patch request to the kube-api with the new state. Since this ConfigMap is mounted into every \ac{gazer} instance via the Kubernetes volume system, the changes made here are instantly reflected in all \ac{gazer} instances. As a final step, an instance of \ac{sherlock} is provisioned with the model given in the specification.

Figure \ref{fig:reconcile-loop} shows a part of this reconciliation loop, which gets repeated every time there is a change to an existing resource or whenever the user creates a new variant of this resource.

\begin{figure}[H]
\includegraphics[height=15cm]{assets/implementation/reconcile-loop.png}
\caption{\ac{lazy-koala-operator} reconciliation loop (self-composed)}
\label{fig:reconcile-loop}
\end{figure}
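
To make the diffing step concrete, the following minimal Python sketch shows how the scrape points of two consecutive reconcile iterations could be compared; the names and data layout are illustrative only, not the \ac{lazy-koala-operator}'s actual code:

\begin{verbatim}
# Illustrative sketch: scrapePoints maps an IP address to the kind
# of scrape target it represents ("pod" or "service").
def diff_scrape_points(previous, current):
    """Return what must be added to / removed from gazer-config."""
    to_add = {ip: kind for ip, kind in current.items()
              if ip not in previous}
    to_remove = {ip: kind for ip, kind in previous.items()
                 if ip not in current}
    return to_add, to_remove

# Example: one pod was rescheduled and received a new IP.
previous = {"10.0.1.4": "pod", "10.96.0.12": "service"}
current = {"10.0.1.9": "pod", "10.96.0.12": "service"}
to_add, to_remove = diff_scrape_points(previous, current)
# to_add == {"10.0.1.9": "pod"}; to_remove == {"10.0.1.4": "pod"}
\end{verbatim}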

\subsection{Telemetry extraction agent (Gazer)}

\ac{gazer} is the telemetry extraction agent that gets scheduled to run on every node of the cluster using a Kubernetes DaemonSet. \ac{gazer} is implemented in Python with the help of a library called BCC, which acts as a frontend for the \ac{ebpf} API. \ac{gazer} contains two kernel probes that get submitted to the kernel space at startup.

The first probe is a TCP SYN backlog monitor that keeps track of the backlog size of the TCP SYN queue. Every TCP connection starts with a 3-way handshake, and the SYN packet is the first packet the client sends in this sequence; the entire request is kept on hold until this packet is acknowledged by the system. Hence, an unusually high SYN backlog is a strong signal of something going wrong. Figure \ref{fig:backlog-probe} showcases the core part of this probe and how the backlog size is calculated.

\begin{figure}[H]
\includegraphics[width=14cm]{assets/implementation/backlog-probe.png}
\caption{eBPF probe for collecting the TCP backlog (self-composed)}
\label{fig:backlog-probe}
\end{figure}
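
A minimal BCC sketch in the spirit of this probe is shown below, modeled on the upstream bcc tcpsynbl tool; it is an approximation for illustration, not \ac{gazer}'s exact probe:

\begin{verbatim}
# Illustrative BCC sketch of a SYN-backlog monitor (requires root
# and the bcc Python bindings; not Gazer's exact code).
import time
from bcc import BPF

prog = r"""
#include <net/sock.h>
BPF_HISTOGRAM(backlog_hist);

int trace_syn_recv(struct pt_regs *ctx) {
    struct sock *sk = (struct sock *)PT_REGS_PARM1(ctx);
    u32 backlog = sk->sk_ack_backlog;  // current SYN backlog size
    backlog_hist.increment(bpf_log2l(backlog));
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="tcp_v4_syn_recv_sock", fn_name="trace_syn_recv")
time.sleep(10)  # sample for a short window
b["backlog_hist"].print_log2_hist("backlog size")
\end{verbatim}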

Next is a tracepoint probe that gets invoked whenever the inet\_sock\_set\_state kernel function is called. This probe extracts five key data points from every TCP request: the transmitting and receiving IP addresses, the number of bytes sent and received, and finally the time taken to complete the entire request. All these data are shipped to userspace via a perf buffer. In userspace, these raw data get enriched with data collected from the kube-api.

As shown in Figure \ref{fig:gazer-enrich}, \ac{gazer} already has a list of IP addresses of interest, given by the \ac{lazy-koala-operator}, so it first checks whether the request was made from one of those IPs (only the transmitting IP is checked here, since every request produces a duplicate pair of entries, one for the request and one for the response, with all other attributes shared between the pair). If a match is found, the parser tries to identify the receiving IP address too. If the receiving IP also matches, the requests-received counter for that particular service is incremented. The parser then records the number of bytes sent and received and the time taken to complete the request under the identified service. Finally, these data points are exposed via an HTTP server so that the Prometheus scraper can read them and store them in the database, where they can be consumed by \ac{sherlock}.

\begin{figure}[H]
\includegraphics[width=14cm]{assets/implementation/gazer-enrich.png}
\caption{Code used to enrich TCP event data (self-composed)}
\label{fig:gazer-enrich}
\end{figure}
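
The userspace half of this pipeline can be sketched as follows, continuing from a BPF object b as in the previous sketch; the event fields, metric names, and watched-IP table are assumptions for illustration, with the real table being populated from gazer-config:

\begin{verbatim}
# Illustrative sketch of the enrichment loop (not Gazer's code).
# Assumes event.saddr/event.daddr arrive already converted to
# dotted-quad strings.
from prometheus_client import Counter, start_http_server

WATCHED_IPS = {"10.0.1.9": "checkout", "10.96.0.12": "payments"}

requests_received = Counter(
    "requests_received_total", "TCP requests received", ["service"])
bytes_sent = Counter("bytes_sent_total", "Bytes sent", ["service"])

def handle_event(cpu, data, size):
    event = b["events"].event(data)     # b is the BPF object
    src = WATCHED_IPS.get(event.saddr)  # transmitting IP only
    if src is None:
        return
    dst = WATCHED_IPS.get(event.daddr)
    if dst is not None:
        requests_received.labels(service=dst).inc()
    bytes_sent.labels(service=src).inc(event.bytes_sent)

start_http_server(8000)  # endpoint for the Prometheus scraper
b["events"].open_perf_buffer(handle_event)
while True:
    b.perf_buffer_poll()
\end{verbatim}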

\subsection{AI-engine (Sherlock)}

\ac{sherlock} is the AI engine that predicts anomaly scores for each service, which are in turn used by the \ac{lazy-koala-operator} to figure out possible root causes for a particular issue. At a high level, this works by polling service telemetry for a predetermined number of time steps and running it through a convolutional autoencoder that tries to reconstruct the input data sequence. The difference between the input and output sequences is called the reconstruction error, and this is used as the anomaly score for the specific service. Even though this process seems straightforward, a number of preprocessing steps have to be taken to make it easier for the model to converge on the learning goal.
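
In code, this scoring step could look like the following sketch, assuming a Keras-style autoencoder and mean squared error as the distance measure (one common choice):

\begin{verbatim}
# Illustrative sketch of the anomaly-scoring step.
import numpy as np

def anomaly_score(model, window):
    """window must match the model's input shape."""
    reconstruction = model.predict(window, verbose=0)
    # Mean squared reconstruction error acts as the anomaly score.
    return float(np.mean(np.square(window - reconstruction)))
\end{verbatim}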

Since the collected metric data has different units, each feature of the dataset has a different range. This makes the training process very inefficient, as the model first has to learn the concepts of scale and unit, and the backpropagation algorithm works best when the output values of the network are between 0 and 1 \citep{sola1997importance}. To normalize this dataset, a slightly modified version of the min-max normalization equation was used. This was done because, under most typical conditions, metric values fluctuate within a fixed and limited range. If the min-max normalization function were applied as-is, the model might be hypersensitive to the slightest fluctuation. Adding padding on both the high and low ends therefore acts as an attention mechanism that helps the model look for large variations rather than focusing on smaller ones.

\begin{figure}[H]
\includegraphics[width=12cm]{assets/implementation/normalize-data.png}
\caption{Data normalization function (self-composed)}
\label{fig:normalize-data}
\end{figure}
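
A minimal sketch of such a padded min-max normalization is given below; the padding fraction is an assumed value, not necessarily the one used in \ac{sherlock}:

\begin{verbatim}
# Illustrative sketch: min-max normalization with padding on both
# ends, so small fluctuations map to a narrow band of values.
import numpy as np

def normalize(series, padding=0.2):
    lo, hi = float(np.min(series)), float(np.max(series))
    span = (hi - lo) or 1.0  # guard against a constant series
    lo -= padding * span     # pad the low end
    hi += padding * span     # pad the high end
    return (series - lo) / (hi - lo)

x = np.array([120.0, 122.0, 118.0, 640.0])  # one raw metric
print(normalize(x))  # values now fall within the 0-1 range
\end{verbatim}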

\begin{figure}[H]
\centering
\begin{subfigure}[b]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{assets/implementation/before-normalization.png}
\caption{Before Normalization}
\label{fig:before-normalization}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=\textwidth]{assets/implementation/after-normalization.png}
\caption{After Normalization}
\label{fig:after-normalization}
\end{subfigure}
\hfill
\caption{Comparison of a data point before and after data normalization (self-composed)}
\end{figure}

% After the normalization dataset is formated to
During the requirement engineering process, it was found that even though RNNs tend to perform better with time-series data, convolutional autoencoders are very efficient at detecting anomalies in time-series data. So, after the normalization step, the metric data is encoded into an image-like structure that can be fed into a convolutional autoencoder.

\begin{figure}[H]
\includegraphics[height=7cm]{assets/implementation/visualize-representation.png}
\caption{Visualization of encoded time series (self-composed)}
\label{fig:visualize-representation}
\end{figure}