Computers do not understand words directly; we must first convert them to numbers. In this project we use the gensim library to convert words into vectors. Training word embeddings on the Amazon Cell Phones and Accessories dataset lets us find similar words: after generating the embeddings, we can measure the similarity between two words and retrieve the words most similar to a given word. These embeddings are basic building blocks for higher-level NLP projects.
We use the gensim library to generate the word embeddings. Gensim is a popular library for NLP tasks; to read more about gensim's Word2Vec implementation, see https://radimrehurek.com/gensim/models/word2vec.html.
Word embeddings – a representation of each word as a vector, where words with similar meanings have similar vector representations.
To build a simple intuition for word embeddings, take two words: India and China. Both are countries with strong militaries and large populations, so they are similar in some respects. Their word embeddings might therefore look like India – [0, 0.9, 0.2, 0.85, 0.7] and China – [0, 0.8, 0.3, 0.80, 0.65].
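We can make "similar vectors" concrete with cosine similarity, the measure Word2Vec itself uses. The snippet below computes it for the two toy vectors above; the five numbers are illustrative, not embeddings from a real trained model:

```python
import math

# Toy vectors from the text, not real trained embeddings.
india = [0, 0.9, 0.2, 0.85, 0.7]
china = [0, 0.8, 0.3, 0.80, 0.65]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

sim = cosine_similarity(india, china)  # close to 1: the vectors are very similar
```

A cosine similarity near 1 means the vectors point in almost the same direction, which is exactly what we expect for two words the model considers related.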
We obtain these word embeddings by training a model on the data. Before training, we preprocess the data: clean it, remove punctuation and stop words, and convert all words to lowercase so the text is in a standardized form. After training, we can look up the embedding of each individual word.