TechieYan Technologies

Road Transportation Analysis and Statistical Modelling


This is an end-to-end data science project that covers the core steps needed to meet a data science objective: data extraction, data cleaning, exploratory data analysis, modelling, and variable selection. The variable selection step evaluates how well the road transportation analysis model performs before and after selecting the best features from the raw data using the Exclusive Variable Selection algorithm. Statistical analysis was also performed to clean the data and extract insights that helped us build a better model.

Algorithm Description

Random Forest Classifier:

Random Forest is an ensemble algorithm that trains multiple decision trees in parallel and combines their outputs. It is a supervised algorithm and can be used for both classification and regression problems. The output for new data is obtained by majority voting (classification) or by averaging the trees' predictions (regression). Because the algorithm uses the bagging technique, each tree is trained on a random sample of the data, and many decision trees contribute to the output for a given input. This is a key difference between decision trees and random forests: while a decision tree considers all possible feature splits, a random forest only considers a random subset of the features at each split. Random forests work best with large, high-dimensional datasets.
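As a rough illustration of bagging and majority voting, here is a minimal pure-Python sketch. It is not the project's code: it uses one-split trees (decision stumps) instead of full decision trees, picks a random feature subset per tree rather than per split, and the names train_forest and predict_forest as well as the toy data are made up for the example. In practice a library implementation such as scikit-learn's RandomForestClassifier would normally be used.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature_ids):
    """Find the single (feature, threshold) split with the fewest training errors."""
    best = None
    for f in feature_ids:
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:
        # No usable split: always predict the majority class of the sample.
        majority = Counter(labels).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]


def predict_stump(stump, row):
    f, threshold, left_label, right_label = stump
    return left_label if row[f] <= threshold else right_label


def train_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bagging: draw a bootstrap sample (rows sampled with replacement).
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        # Each tree only sees a random subset of the features.
        feature_ids = rng.sample(range(len(rows[0])), n_features)
        forest.append(train_stump(sample_rows, sample_labels, feature_ids))
    return forest


def predict_forest(forest, row):
    # Majority vote over the predictions of all trees.
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Made-up toy data: class 0 has small values, class 1 has large values.
rows = [[1.0, 1.2], [1.5, 0.8], [2.0, 1.0], [8.0, 9.0], [8.5, 9.5], [9.0, 8.7]]
labels = [0, 0, 0, 1, 1, 1]
forest = train_forest(rows, labels)
print(predict_forest(forest, [0.5, 0.5]))
print(predict_forest(forest, [9.5, 9.5]))
```

Individual stumps can be wrong when their bootstrap sample happens to be unrepresentative; the vote across many trees is what makes the combined prediction stable.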



Nearest Neighbour:

KNN, or K Nearest Neighbours, is a basic yet efficient algorithm used in many machine learning applications. It is non-parametric, i.e. it does not make underlying assumptions about the data, such as requiring a specific distribution, the way some other algorithms do. This makes it easy to use and understand. To predict on new data, KNN finds the k nearest neighbours of the given point and takes a majority vote: whichever class is most common among those neighbours becomes the class of the new data point.
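The prediction step described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function name knn_predict, the choice of Euclidean distance, and the toy points are not taken from the project.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up toy data with two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
labels = ["low", "low", "low", "high", "high", "high"]
print(knn_predict(points, labels, (2.0, 2.0)))
print(knn_predict(points, labels, (8.2, 8.4)))
```

There is no training phase here: the algorithm keeps the whole dataset and does all the work at prediction time, which is why it can become slow on very large datasets.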




Sequential Feature Selector:

Sequential feature selector is a feature selection method that adds or removes features from the dataset one at a time. Features are selected based on the cross-validation score the estimator achieves when trained on them. Sequential feature selection works best with supervised learning algorithms; in the unsupervised case, the algorithm looks only at the independent variables rather than the desired output.
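To make the idea concrete, here is a minimal sketch of the forward variant (adding one feature at a time). The estimator and scoring are assumptions chosen to keep the example short: a 1-nearest-neighbour classifier scored by leave-one-out cross-validation. The names loo_accuracy and forward_select and the toy data are hypothetical; which library implementation the project used is not stated here, though scikit-learn ships one as SequentialFeatureSelector.

```python
import math


def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a 1-NN classifier using only `features`."""
    correct = 0
    for i, row in enumerate(rows):
        best_dist, best_label = math.inf, None
        for j, other in enumerate(rows):
            if i == j:
                continue  # hold out the row being predicted
            dist = sum((row[f] - other[f]) ** 2 for f in features)
            if dist < best_dist:
                best_dist, best_label = dist, labels[j]
        correct += best_label == labels[i]
    return correct / len(rows)


def forward_select(rows, labels, n_select):
    """Greedily add the feature that gives the best cross-validation score."""
    selected = []
    remaining = list(range(len(rows[0])))
    while remaining and len(selected) < n_select:
        # Score every candidate feature when added to the current selection.
        scored = [(loo_accuracy(rows, labels, selected + [f]), f) for f in remaining]
        best_score, best_feature = max(scored, key=lambda item: item[0])
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected


# Made-up toy data: only column 1 separates the two classes.
rows = [
    [5.0, 1.0, 3.0], [1.0, 1.2, 9.0], [9.0, 0.9, 1.0], [4.0, 1.1, 7.0],
    [5.5, 5.0, 2.5], [1.5, 5.2, 8.0], [8.0, 4.9, 1.5], [3.5, 5.1, 6.5],
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(forward_select(rows, labels, n_select=1))
```

Backward selection works the same way in reverse: start with all features and repeatedly drop the one whose removal hurts the score least.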



How to Execute?

Make sure you tick the "Add to PATH" checkboxes while installing Python and Anaconda.

Refer to this link if you are just starting and want to know how to install Anaconda.

If you already have Anaconda and want to know how to create an Anaconda environment, refer to this article: set up jupyter notebook. You can skip the article if you already know how to install Anaconda, set up an environment, and install from requirements.txt.

  1. Install the prerequisites and software required to run the code by following the blog linked above.
  2. Press the Windows key and type "anaconda prompt"; a terminal opens.
  3. Before running the road transportation analysis code, create a dedicated environment in which to install the libraries the project needs.
  • Type conda create --name "env_name", e.g.: conda create --name project_1
  • Type conda activate "env_name", e.g.: conda activate project_1
  4. Go to the directory where your requirements.txt file is present, using cd <>.
  5. E.g., if your file is on the D drive, first switch to that drive by typing d:
  6. Then change into the project folder:

cd d:\License-Plate-Recognitionmain    #CHANGE PATH AS PER YOUR PROJECT, THIS IS JUST AN EXAMPLE

  7. If your project is on the C drive, you can skip step 5 and go straight to step 6, e.g., cd C:\Users\Hi\License-Plate-Recognition-main
  8. Run pip install -r requirements.txt or conda install --file requirements.txt (requirements.txt is a text file listing all the libraries required to run this project. If you get an error while installing the libraries, you might need to install them individually.)
  9. To run a .py file, make sure you are in the Anaconda terminal and in the folder where your executable file is saved. Then type python main.py in the terminal. Before running, open the file and make sure to change the path of the dataset.
  10. To run an .ipynb file, please follow the link to set up and open Jupyter Notebook. You will be redirected to the local server, where you can select whichever .ipynb file you'd like to run, open it, and execute each cell one by one by pressing Shift+Enter.

Please follow the links above to install Anaconda and set up an environment for running the files.

Note: There are 4 different files, each of which serves a different purpose:

  • Preprocess.ipynb consists of all the data cleaning steps, which are necessary to build a clean and efficient model.
  • main.ipynb consists of the major steps and the exploratory data analysis, which help us understand the data and its behavior.
  • Variable_Selction.ipynb consists of data reduction/dimensionality reduction techniques, such as the Sequential feature selector method, to reduce the dimensions of the data and compare the model scores before and after dimensionality reduction.
  • Combined_main_var.ipynb combines main.ipynb and variable_selection.ipynb to make the flow clearer and more understandable for the audience.

Please run the files in the sequence above. Note that they need a reasonably powerful system to run.

Make sure to change the path of the dataset in the code.
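One way to make a wrong dataset path fail early with a readable message is to keep the path in a single variable and check it before loading anything. This is only a sketch: the path data/accidents.csv and the helper name resolve_dataset are placeholders, not names from the project.

```python
from pathlib import Path

# Hypothetical location; replace it with wherever you saved the dataset.
DATASET_PATH = Path("data") / "accidents.csv"


def resolve_dataset(path):
    """Return the absolute dataset path, or raise a clear error if it is missing."""
    path = Path(path).expanduser()
    if not path.is_file():
        raise FileNotFoundError(
            f"Dataset not found at {path.resolve()}; update DATASET_PATH."
        )
    return path.resolve()


try:
    print(resolve_dataset(DATASET_PATH))
except FileNotFoundError as err:
    print(err)
```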

Data Description

The dataset was downloaded from a Kaggle data repository. It has been pre-processed and cleaned to remove any bias while training. It consists of more than 2 lakh (200,000) data entries and around 47 columns. Some of the important features in the dataset are Severity, street, city, weather_timestamp, Country, Start_lat, Start_Log, End_lat, End_Lon, etc. These features help us find out which state or country had the most accidents over the span of years. This road transportation analysis gives an overview of what the traffic in a state or country is like and the actions that need to be taken in future.
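Counting accidents per location, as described above, boils down to grouping rows by one column. The sketch below shows the idea with the standard library only. The rows are made up purely for illustration, the exact column spellings (City, Severity) are assumptions, and count_by_column is a hypothetical helper rather than part of the project.

```python
import csv
import io
from collections import Counter

# Made-up sample rows, only for illustration; the real file is much larger.
SAMPLE_CSV = """City,Severity
CityA,2
CityA,3
CityB,2
CityA,2
CityB,4
"""


def count_by_column(csv_file, column):
    """Count how many rows share each value of `column`."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


city_counts = count_by_column(io.StringIO(SAMPLE_CSV), "City")
print(city_counts.most_common())
```

For the real dataset, the same function can be given an opened file instead of the in-memory sample.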


Final Results

  1. Model Training and Loading the model
  2. Sequential Feature Selector

Exploratory Data Analysis

  1. Missing values heatmap
  2. Coordinate clusters
  3. Heatmap of accidents

Issues you may face while executing the code

  1. You might face an issue while installing specific libraries; in that case, you might need to install the libraries manually. Example: pip install "module_name/library", i.e., pip install pandas
  2. Make sure you have the latest or the specific required version of Python, since a different version can cause a version mismatch.
  3. Add the Python and Anaconda paths to your environment variables so that Python files and the Anaconda environment can be run from any code editor.
  4. Make sure to change the paths in the code to match where your dataset/model is saved.
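A quick way to diagnose the first two issues is to print the Python version and check which libraries cannot be imported in the active environment. The sketch below assumes an example library list; replace it with the names from your requirements.txt. The helper name missing_modules is hypothetical. Note that the import name can differ from the pip package name (sklearn is installed as scikit-learn).

```python
import importlib.util
import sys


def missing_modules(module_names):
    """Return the names from `module_names` that cannot be imported here."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Assumed library list for illustration; use the names from requirements.txt.
required = ["pandas", "numpy", "sklearn"]

print("Python version:", sys.version.split()[0])
for name in missing_modules(required):
    print(f"Missing: {name} (install it with pip install <package name>)")
```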

Refer to the links below for more details on installing and configuring Python and Anaconda.


All the required data has been provided over here. Please feel free to contact me for model weights and if you face any issues.

Click Here For The Source Code And Associated Files.

Yes, you now have more knowledge than yesterday. Keep going.

+91 7075575787