Network Severity detection at a specific location based on log data available

Abstract

The goal is to predict the severity of a network’s faults at a given time and location from the available log data. Each row in the main dataset (train.csv, test.csv) represents a location and a time point, identified by the “id” column, which is also the key used in the other data files. Using “id” as the primary key, we merged the other files into one full dataset, which was then used for data cleaning, analysis and modelling. More emphasis was put on data analysis than on modelling, since identifying the causes and consequences of fault severity was very important.
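As a rough illustration of the merge step, the sketch below joins small in-memory tables on “id” with pandas. The table contents, the column names other than “id” and fault_severity, and the aggregation of the log table are assumptions made for the example; the real files are larger and may be laid out differently.

```python
import pandas as pd

# Toy stand-ins for the data files (all values are invented for illustration).
train = pd.DataFrame({
    "id": [1, 2, 3],
    "location": ["location 1", "location 2", "location 1"],
    "fault_severity": [1, 3, 2],
})
severity_type = pd.DataFrame({
    "id": [1, 2, 3],
    "severity_type": ["severity_type 2", "severity_type 1", "severity_type 2"],
})
log_feature = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "log_feature": ["feature 68", "feature 82", "feature 68", "feature 203"],
    "volume": [6, 2, 1, 4],
})

# These two tables have one row per id, so a plain left join is enough.
merged = train.merge(severity_type, on="id", how="left")

# The log table has several rows per id; aggregate it first so that the
# join still yields exactly one row per id.
log_summary = log_feature.groupby("id", as_index=False).agg(
    n_log_features=("log_feature", "nunique"),
    total_volume=("volume", "sum"),
)
merged = merged.merge(log_summary, on="id", how="left")

print(merged)
```

Aggregating before joining is one possible design choice; it avoids duplicating the target rows when a file holds several records for the same id.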

Algorithm Description

Random Forest Classifier:

Random Forest is a supervised ensemble algorithm that trains many decision trees and combines their outputs. It can be used for both classification and regression problems. The prediction for new data is obtained by majority voting (classification) or by averaging the trees’ outputs (regression). Because the algorithm uses bagging, each tree is trained on a bootstrap sample of the training data, and the output for a given input comes from many trees rather than one. There is a second key difference between decision trees and random forests: a single decision tree considers all the features when choosing a split, whereas each tree in a random forest considers only a random subset of the features at each split. Random forests generally cope well with large, high-dimensional datasets.
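A minimal sketch of this idea with scikit-learn is shown below. It uses synthetic three-class data as a stand-in for the merged log features, and the hyperparameters (100 trees, square-root feature subsets) are illustrative assumptions, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the merged network log features.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Each of the 100 trees is grown on a bootstrap sample (bagging), and
# max_features="sqrt" limits every split to a random subset of the features.
rf_model = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", random_state=42
)
rf_model.fit(X_train, y_train)

# predict() returns the class chosen by combining the votes of all trees.
rf_accuracy = rf_model.score(X_test, y_test)
print(f"Random forest test accuracy: {rf_accuracy:.3f}")
```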

References:

https://www.geeksforgeeks.org/random-forest-regression-in-python/

Decision Tree Classifier:

A decision tree is a supervised machine learning algorithm that makes decisions through a tree-like structure. It predicts the outcome by testing the attributes one at a time. The process flows from the root node through internal decision nodes, each of which tests one attribute, down to a leaf node, which holds the prediction.
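To make the root-to-leaf flow concrete, the sketch below fits a shallow tree with scikit-learn on synthetic data and prints its rules. The data, the feature names f0 to f3 and the depth limit are assumptions chosen only to keep the printed tree short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Small synthetic 3-class dataset (illustrative only).
X, y = make_classification(
    n_samples=300,
    n_features=4,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

# A depth limit keeps the tree small enough to read.
tree_model = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_model.fit(X, y)

# Each indented line is one test on the way from the root to a leaf;
# lines containing "class:" are the leaves that hold the prediction.
rules = export_text(tree_model, feature_names=["f0", "f1", "f2", "f3"])
print(rules)
```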

Support Vector Machine:

Support vector machines are supervised learning algorithms that classify data points by drawing a linear or non-linear decision boundary, depending on the data. The boundary that separates the classes is called a hyperplane. Many hyperplanes may separate the training data, so the algorithm looks for the one with the maximum margin, that is, the largest distance to the nearest training points. A wide margin tends to generalise better to new data, and a new point is classified by which side of the hyperplane it falls on.
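The sketch below compares a linear and a non-linear (RBF kernel) boundary using scikit-learn’s SVC. The synthetic data, the scaling step and the values of C and gamma are assumptions for illustration rather than the configuration used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data (illustrative only).
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_classes=3, random_state=7
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=7
)

# SVMs are sensitive to feature scale, so each model standardises first.
models = {
    "linear": make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0)),
    "rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name} kernel test accuracy: {scores[name]:.3f}")
```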

References:

1. https://www.geeksforgeeks.org/support-vector-machine-algorithm/
2. https://scikit-learn.org/stable/modules/svm.html

Nearest Neighbour:

KNN, or K Nearest Neighbours, is a simple yet effective algorithm used in many machine learning applications. It is non-parametric: unlike many other algorithms, it makes no underlying assumption about the data, such as requiring a specific distribution. This makes it easy to understand and use. To predict on a new data point, KNN finds the k nearest neighbours of that point and takes a majority vote: whichever class is most common among those neighbours is assigned to the new point.
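The voting step can be sketched in a few lines of plain Python. The function name knn_predict, the toy points and the choice of Euclidean distance are assumptions for this example; a real project would more likely use a library implementation.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote among the k nearest neighbours.
    return Counter(nearest_labels).most_common(1)[0][0]


# Two well-separated toy clusters labelled with severities 1 and 3.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = [1, 1, 1, 3, 3, 3]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))  # prints 1
print(knn_predict(train_points, train_labels, (5.1, 5.0)))  # prints 3
```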

How to Execute?

Make sure you ticked the “Add to PATH” check boxes while installing Python and Anaconda.

Refer to this link if you are just starting and want to know how to install Anaconda.

If you already have Anaconda and want to see how to create an Anaconda environment, refer to this article on setting up Jupyter Notebook. You can skip the article if you already know how to install Anaconda, set up an environment and install from requirements.txt.

1. Install the prerequisites and software required to execute the code, as described in the articles linked above.

2. Press the Windows key and type “anaconda prompt”; a terminal opens.

3. Before executing the code, create a dedicated environment in which to install the libraries the project needs.

4. Type conda create --name “env_name”, e.g.: conda create --name project_1

5. Type conda activate “env_name”, e.g.: conda activate project_1

6. Go to the directory where your requirements.txt file is present: cd <>. E.g., if the file is on the D drive, first switch drives by typing d:

7. Then type cd d:\License-Plate-Recognition-main (CHANGE THE PATH AS PER YOUR PROJECT, THIS IS JUST AN EXAMPLE).

8. If your project is on the C drive, you can skip the drive switch in step 6 and go straight to cd, e.g., cd C:\Users\Hi\License-Plate-Recognition-main (again, change the path as per your project).

9. Run pip install -r requirements.txt or conda install --file requirements.txt (requirements.txt is a text file listing all the libraries required to execute the Python file. If it gives an error while installing libraries, you might need to install them individually.)

10. To run a .py file, make sure you are in the Anaconda terminal, in the folder where your executable file is saved. Then type python main.py in the terminal. Before running, open main.py and make sure to change the path of the dataset.

11. If you would like to run an .ipynb file, follow the link to set up and open Jupyter Notebook. You will be redirected to the local server, where you can select whichever .ipynb file you’d like to run, click on it, and execute each cell one by one by pressing Shift+Enter.

Please follow the above links on how to install and set up anaconda environment to execute files.

Data Description

The dataset was collected from the Kaggle repository and contains 61840 instances with 8 features. Multiple CSV files were merged into a single dataset, using ID as the primary key. The features with a high correlation coefficient are severity type, log_feature, location, event-type and volume. The target column is fault_severity, which takes the values 1, 2 and 3, with 1 being the minimum and 3 the maximum. Data balancing techniques were applied since the data was quite imbalanced, with one class dominating the others.
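The text does not say which balancing technique was applied. As one common option, the sketch below shows random oversampling with pandas and scikit-learn’s resample utility; the toy class counts are invented for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Imbalanced toy data: class 1 dominates (counts are invented).
data = pd.DataFrame({
    "volume": range(12),
    "fault_severity": [1] * 8 + [2] * 3 + [3] * 1,
})

# Size of the largest class; every class is brought up to this size.
majority_size = data["fault_severity"].value_counts().max()

balanced_parts = []
for severity, group in data.groupby("fault_severity"):
    # Sample with replacement until the class matches the largest one.
    balanced_parts.append(
        resample(group, replace=True, n_samples=majority_size, random_state=42)
    )
balanced = pd.concat(balanced_parts, ignore_index=True)

print(balanced["fault_severity"].value_counts().sort_index())
```

If this approach is used, it is usually applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.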

Final Results

1. Decision tree Confusion matrix

2. Random forest Confusion matrix

3. Support Vector Machine Confusion matrix

4. Gaussian NB Confusion matrix

5. KNN Confusion matrix

6. Improved Model (RF) Confusion matrix

Exploratory Data Analysis

1. Correlation Heatmap

2. Pie Chart

3. Pair Plot

4. Bar chart of Resources available

5. Decision Tree

6. Random forest

7. Support vector machine

8. KNearest Neighbours

9. Gaussian NB

Evaluation Metrics

Evaluation metrics are one of the most important parts of any machine learning or deep learning project, because they let us evaluate how well our network severity detection model performs on new, unseen data. Many evaluation metrics can be used to assess a model, such as ROC AUC, F1 score, recall and precision, each suited to particular kinds of problem. For this project we used the confusion matrix and the classification report, which let us evaluate not just the accuracy of the model but also other metrics such as precision, recall and F1 score.

Confusion matrix:
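A confusion matrix counts, for every actual class, how many samples were predicted as each class. The sketch below builds one with scikit-learn for a handful of hand-written labels; the labels are invented for illustration and are not results from this project.

```python
from sklearn.metrics import confusion_matrix

# Invented true and predicted severities for ten samples.
y_true = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [1, 1, 1, 2, 2, 2, 3, 3, 3, 1]

# Rows are the actual classes and columns the predicted classes,
# both in the order given by labels.
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])
print(cm)
# [[3 1 0]
#  [0 2 1]
#  [1 0 2]]
```

The diagonal holds the correct predictions (7 of 10 here); everything off the diagonal is a misclassification.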

Classification Report: The classification report helps us understand and evaluate how well the model performs. It consists of several evaluation metrics, such as:

Precision: The number of correct predictions for a class divided by the total number of predictions made for that class.

Precision = TP/(TP + FP)

Recall: The number of correct predictions for a class divided by the total number of actual observations of that class.

Recall = TP/(TP+FN)

F1_score: The harmonic mean of precision and recall.

F1 Score = 2*(Recall * Precision) / (Recall + Precision)
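The three formulas can be checked with a short worked example. The helper name precision_recall_f1 and the counts (8 true positives, 2 false positives, 4 false negatives) are assumptions chosen for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from its counts."""
    # Guard against division by zero when a class is never predicted or never occurs.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * (recall * precision) / total if total else 0.0
    return precision, recall, f1


# Invented counts for a single class.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.800 recall=0.667 f1=0.727
```

In a multi-class problem these values are computed once per class, treating that class as positive and all other classes as negative.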

Issues you may face while executing the code

You might face an issue while installing specific libraries; in that case, you may need to install the libraries manually. Example: pip install “module name/library”, i.e., pip install pandas
Make sure you have the specific version of Python the project needs (or the latest one), since a version mismatch can cause errors.
Add Python and Anaconda to the PATH environment variable so that Python files and the Anaconda environment can be run from any code editor.
Make sure to change the paths in the code to wherever your dataset/model is saved.
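A quick way to spot the first two issues is to print the Python version and list the libraries that cannot be found in the active environment. The module list below is an assumption; the authoritative list is whatever the project’s requirements.txt contains.

```python
import importlib.util
import sys

# Assumed list of import names; replace it with the ones from requirements.txt.
REQUIRED_MODULES = ["pandas", "numpy", "sklearn", "matplotlib", "seaborn"]


def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


print(f"Python {sys.version_info.major}.{sys.version_info.minor}")

missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules (install the matching packages individually with pip):")
    for name in missing:
        print(f"  {name}")
else:
    print("All listed modules are available.")
```

Note that the import name and the package name can differ; for example, the sklearn module is installed with pip install scikit-learn.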

Refer to the links below for more details on installing Python and Anaconda and how to configure them.

https://techieyantechnologies.com/2022/07/how-to-install-anaconda/

https://techieyantechnologies.com/2022/06/get-started-with-creating-new-environment-in-anaconda-configuring-jupyter-notebook-and-installing-libraries-using-requirements-txt-2/

Note:

All the required data has been provided over here. Please feel free to contact me for model weights and if you face any issues.

https://www.linkedin.com/in/abhinay-lingala-5a3ab7205/

Yes, you now have more knowledge than yesterday, Keep Going.