Converting words to numbers using gensim word2vec (sports and outdoors dataset)

Abstract

Computers do not understand words, so we need to convert words to numbers before a computer can work with them. We use the gensim library to convert each word to a vector. These word embeddings, trained on the sports and outdoors dataset, help us find similar words. After generating the embeddings, we can check the similarity between two words and find the words most similar to a given word. These embeddings are basic building blocks for higher-level NLP projects in the future.

Methodology:

We used the gensim library to generate the word embeddings. Gensim is a Python library for NLP tasks such as training word embeddings. To read more about gensim and word embeddings, you can refer to its documentation: https://radimrehurek.com/gensim/models/word2vec.html.

Word embeddings: a representation of words as vectors, where words with similar meanings have similar vector representations.

To give you a simple explanation of word embeddings, let us take two words: India and China. Both are countries with strong militaries and large populations, so they are similar in some ways. The word embeddings generated for them might therefore look like India: [0, 0.9, 0.2, 0.85, 0.7] and China: [0, 0.8, 0.3, 0.80, 0.65].
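One common way to measure how similar two such vectors are is cosine similarity. The sketch below applies it to the illustrative India and China vectors from the example above, reading the last China value as 0.65. These are hand-written example numbers, not embeddings produced by a trained model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Illustrative vectors from the text (last China value read as 0.65).
india = [0, 0.9, 0.2, 0.85, 0.7]
china = [0, 0.8, 0.3, 0.80, 0.65]

similarity = cosine_similarity(india, china)
print(similarity)
```

The result is close to 1, which reflects that the two vectors point in nearly the same direction.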

We get these word embeddings by training the model on the data. Before training, we preprocess the data: we clean it, remove punctuation and stop words, and convert all words to lower case so that they are in a standardized form. After training, we get the individual word embeddings.
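The preprocessing steps described above could look roughly like the following. This is a minimal sketch using only the Python standard library; the stop-word list is a small hand-picked assumption, and the project's notebook may use a different list or a library for this step.

```python
import string

# Small hand-picked stop-word list (an assumption for illustration only).
STOP_WORDS = {
    "the", "a", "an", "is", "it", "and", "this",
    "for", "to", "of", "in", "was", "i", "my",
}

# Translation table that turns every punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Lowercase the text, drop punctuation, and remove stop words."""
    text = text.lower().translate(PUNCT_TO_SPACE)
    return [token for token in text.split() if token not in STOP_WORDS]


tokens = preprocess("This tent is GREAT for camping, and it was easy to pitch!")
print(tokens)  # ['tent', 'great', 'camping', 'easy', 'pitch']
```

Each review becomes a list of tokens, which is the input format that word2vec training expects.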

DOWNLOAD BASE PAPER

https://www.researchgate.net/publication/291153115_Using_Word2Vec_to_process_big_text_data

Data Description

The dataset contains reviews of Amazon products. It has 9 columns, including reviewerID, reviewerName and reviewText. The “reviewText” column is what we will use to generate the word embeddings; it contains the reviews, good or bad, given by users on a specific product.
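To illustrate pulling out just the reviewText column, here is a small sketch with the standard csv module. The sample rows are made up and show only the three column names mentioned above, and the function name load_review_texts is hypothetical. With the real dataset you would pass an opened CSV file instead of the in-memory sample.

```python
import csv
import io

# Made-up sample with only the three columns named in the text.
SAMPLE_CSV = """reviewerID,reviewerName,reviewText
A1,Sam,Great tent for the price.
A2,Lee,
A3,Kim,The helmet broke after a week.
"""


def load_review_texts(file_obj, column="reviewText"):
    """Return the non-empty values of one column from a CSV file object."""
    reader = csv.DictReader(file_obj)
    texts = []
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:  # skip rows with a missing or empty review
            texts.append(value)
    return texts


reviews = load_review_texts(io.StringIO(SAMPLE_CSV))
print(reviews)
```

Rows with an empty review are skipped, so they do not end up as empty sentences during training.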

How to Execute?

Before execution, there are some prerequisites that we need to download or install: the Anaconda environment, Python and a code editor.

Anaconda: Anaconda is a Python distribution that bundles many libraries and lets a data engineer create multiple environments and install the required libraries cleanly and easily.

Refer to this if you are just starting and want to know how to install Anaconda.

If you already have Anaconda and want to check how to create an Anaconda environment, refer to this article: set up jupyter notebook. You can skip the article if you already know how to install Anaconda, set up an environment and install from requirements.txt.

1. Install the necessary libraries from the requirements.txt file provided.

2. Go to the directory where your requirements.txt file is present.

e.g. cd C:\Users\Hi\word2vecgensim. This is just an example; set the appropriate path for your computer.

3. Run the command pip install -r requirements.txt or conda install --file requirements.txt. (requirements.txt is a text file listing all the libraries required to run this Python code. If you get an error while installing the libraries, you might need to install them individually.)

4. All the necessary files will get downloaded.

5. To run the code, start Jupyter Notebook and open the folder where your code is present.

6. When you run the sports_ word2vec.ipynb file, you get the appropriate results.

Results:

Issues you may face while executing the code

Make sure to change the path in the code. Give the full path of the dataset/CSV file you want to use.
Make sure you have the appropriate versions of the libraries listed in requirements.txt.
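Both checks can be done quickly from Python before running the notebook. The sketch below is an assumption-laden example: the package names other than gensim are guesses at what requirements.txt might contain, and the dataset path is a placeholder that you should replace with your own full path.

```python
from importlib import metadata
from pathlib import Path


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# gensim is named in the article; pandas and numpy are assumed examples.
for name in ("gensim", "pandas", "numpy"):
    print(name, installed_version(name) or "not installed")

# Placeholder path (hypothetical): replace with the full path to your CSV file.
dataset_path = Path("path/to/your/dataset.csv")
status = "found" if dataset_path.is_file() else "not found"
print(dataset_path.resolve(), status)
```

Compare the printed versions with the ones pinned in requirements.txt, and fix the path if the file is reported as not found.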

Click here to download the code and associated files.

TechieYan Technologies
