In the previous blog post, we looked at how to prepare the data for sentiment analysis. As a continuation, we will look at how to build a sentiment analyzer with a basic dictionary.
Yes, you read that right. In fact, similar keyword-based methods were used years ago in email services such as Gmail to filter messages as spam or not spam.
Let's start from first principles (a fancy phrase). Suppose we have a sentence, say a short product review.
We can split the sentence into words, count how many are positive and how many are negative, and if the positive words outnumber the negative ones, we call the sentence positive.
Well, this is not perfect but good enough for a basic application.
Before we can decide whether a word is positive or negative, we need a list of all the positive and negative words in English. Luckily for us, researchers have already done this exercise: the opinion lexicon compiled by Minqing Hu and Bing Liu is available directly through NLTK.
With that collection of positive and negative words, we will stand on the shoulders of giants and build our own sentiment analyzer.
Let's start by loading the data. You can learn about the data and what it contains from the previous blog post, but here is a sample of the same.
We will load the data using pandas. It contains two main fields: the review text, which we pulled from Amazon for a product, and a human-annotated rating.
Our main aim is to build an automated system whose ratings match the human annotations, or in some cases are even more accurate.
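As a sketch of this loading step, here is a tiny stand-in sample built inline (the filename `reviews.csv` and the column names `review` and `rating` are assumptions for illustration; the real data comes from the previous post):

```python
import pandas as pd

# In practice the data would be loaded from a file, e.g.:
# df = pd.read_csv("reviews.csv")  # hypothetical filename
# Here is a tiny inline sample with the same two fields.
df = pd.DataFrame(
    {
        "review": [
            "Great sound quality. Battery life is excellent.",
            "Terrible build. It broke in a week.",
            "It works. Nothing special about it.",
        ],
        "rating": [5, 1, 3],  # human-annotated, 1 to 5 stars
    }
)
print(df.head())
```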
Dictionary Based Sentiment Analysis
Now that we have loaded the data, it's time to build the analyzer using the NLTK library.
In the above code snippet, we have downloaded the opinion lexicon from NLTK, which contains lists of positive and negative words.
Using the dictionary, we can count the positive and negative words in a sentence and produce a score between -1 and 1, from most negative to most positive.
But a single review may contain multiple sentences, so we need to handle that case as well, for example by averaging the sentence scores.
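The two steps above can be sketched as follows. To keep the example self-contained, a tiny hand-picked word set stands in for the full NLTK opinion lexicon, and a simple regex split stands in for NLTK's sentence tokenizer; both are assumptions for illustration:

```python
import re

# Stand-in for the full opinion lexicon.
POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "poor"}

def sentence_score(sentence):
    """Score one sentence in [-1, 1] from its positive/negative word counts."""
    words = re.findall(r"[a-z']+", sentence.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return 0.0  # no opinion words found -> neutral
    return (pos - neg) / (pos + neg)

def review_score(review):
    """Average the sentence scores of a multi-sentence review."""
    sentences = [s for s in re.split(r"[.!?]+", review) if s.strip()]
    if not sentences:
        return 0.0
    return sum(sentence_score(s) for s in sentences) / len(sentences)
```

For example, `review_score("Great phone. Terrible battery.")` averages one fully positive and one fully negative sentence to 0.0.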
Using the above methods, let's run them across all the reviews, get a score for each review, and store the scores in the same DataFrame.
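A self-contained sketch of this step, again with a small stand-in lexicon and an inline sample DataFrame (both assumptions for illustration):

```python
import re
import pandas as pd

POSITIVE = {"good", "great", "excellent", "love"}   # stand-in lexicon
NEGATIVE = {"bad", "terrible", "awful", "broke"}

def review_score(review):
    """Average per-sentence (pos - neg) / (pos + neg) scores, in [-1, 1]."""
    scores = []
    for sentence in re.split(r"[.!?]+", review):
        words = re.findall(r"[a-z']+", sentence.lower())
        pos = sum(w in POSITIVE for w in words)
        neg = sum(w in NEGATIVE for w in words)
        if pos + neg:
            scores.append((pos - neg) / (pos + neg))
    return sum(scores) / len(scores) if scores else 0.0

df = pd.DataFrame(
    {
        "review": ["Great sound. I love it.", "Terrible build. It broke."],
        "rating": [5, 1],
    }
)
# Score every review and keep the result in the same DataFrame.
df["score"] = df["review"].apply(review_score)
print(df)
```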
To calculate our method's efficiency, we need to compare the human ratings (1 to 5) with the model scores (-1 to 1).
To solve this impedance mismatch, let us find common ground by normalizing both sets of values (predicted vs. human) to Negative (0), Neutral (1), and Positive (2).
To normalize the predicted value, we will use the following function. It simply maps any score above 0.2 to positive, any score below -0.2 to negative, and everything in between to neutral.
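A minimal sketch of such a thresholding function (the function name is an assumption for illustration):

```python
def normalize_prediction(score):
    """Map a model score in [-1, 1] to negative (0), neutral (1), positive (2)."""
    if score > 0.2:
        return 2   # positive
    if score < -0.2:
        return 0   # negative
    return 1       # neutral
```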
To normalize the target (human) ratings, we do something similar: map low ratings to negative, middle ratings to neutral, and high ratings to positive.
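One plausible mapping for 1-to-5 star ratings is sketched below; the exact cut-offs (1-2 negative, 3 neutral, 4-5 positive) are an assumption, not necessarily the original post's choice:

```python
def normalize_rating(rating):
    """Map a 1-to-5 star rating to negative (0), neutral (1), positive (2)."""
    if rating >= 4:
        return 2   # 4-5 stars -> positive
    if rating <= 2:
        return 0   # 1-2 stars -> negative
    return 1       # 3 stars -> neutral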
Visualize the Results
Once we have normalized the data, we can find out the accuracy of the model. Before that, we need to understand what accuracy means in machine learning.
Accuracy is defined as the percentage of correct predictions for the test data. It can be calculated easily by dividing the number of correct predictions by the number of total predictions.
Even though we could write our own accuracy function, let us use a standard library like scikit-learn to do so.
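A short sketch with scikit-learn's `accuracy_score`, using small toy label lists (the actual arrays would come from the normalized DataFrame columns):

```python
from sklearn.metrics import accuracy_score

# Toy normalized labels: 0 = negative, 1 = neutral, 2 = positive.
y_true = [2, 0, 1, 2]   # human annotations
y_pred = [2, 1, 1, 1]   # model predictions

# Fraction of predictions that match the human labels.
print(accuracy_score(y_true, y_pred))  # 2 correct out of 4 -> 0.5
```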
The output of this function looks like this.
In short, it shows that we got an accuracy of about 42%. As I said, this is not the ideal or best available method, but it is a simple technique that anyone starting out in NLP should know.
Let me show another way to visualize the same results, using a confusion matrix.
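A sketch with scikit-learn's `confusion_matrix`, again on toy labels; rows are the true classes and columns the predicted classes (0 = negative, 1 = neutral, 2 = positive):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2]   # human labels
y_pred = [1, 0, 1, 1, 1, 2]   # model predictions, leaning neutral

# Row i, column j = number of class-i samples predicted as class j.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)
```

In this toy example the middle (neutral) column collects most of the mass, which is exactly the kind of bias the real matrix revealed.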
It clearly shows that the model is biased towards predicting neutral rather than the other two classes.
In the next blog post, we will use SOTA (state of the art) models from Hugging Face Transformers to achieve human-level accuracy.