Sentiment analysis and NLP — Dataset preparation

May 16, 2021

Sentiment analysis plays a significant role in marketing. In this project, I try to automate the analysis of millions of reviews in the market. To show how it’s done, I use fictional video game review data from a Manning liveProject.

The primary goal of this project is to understand NLP with deep learning and to walk through the complete lifecycle of developing an application.

Dataset

Let us start with the first step in the project by downloading data from the web.

The downloaded data goes into the data folder. Video_Games_5.json.gz is a gzipped file, and we can extract it with the gunzip command on Linux. I found this link helpful: https://tecadmin.net/extract-gz-file-in-linux-command/

gunzip data/Video_Games_5.json.gz
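If gunzip is not available (for example, outside Linux), the same extraction can be sketched with Python's standard library. The helper name gunzip_file is made up for this illustration, and unlike gunzip it leaves the original .gz file in place.

```python
import gzip
import shutil


def gunzip_file(src_path, dst_path):
    """Decompress a .gz file to dst_path (hypothetical helper, not from the article)."""
    # Stream the bytes across so the whole file never sits in memory at once.
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

A call such as gunzip_file("data/Video_Games_5.json.gz", "data/Video_Games_5.json") would mirror the shell command. As a side note, pandas can usually read a .json.gz file directly, because read_json infers the compression from the file extension by default.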

Data analysis

Now that we have downloaded the dataset, we can analyze it with the help of pandas. The downloaded dataset is a JSON file, and read_json can be used to read it into a pandas DataFrame.

The data is not a single standard JSON document, though. It uses NDJSON [http://ndjson.org/], which is newline-delimited JSON: each line holds one JSON object. Pandas comes with an option to parse this format too.

import pandas as pd

df = pd.read_json('data/Video_Games_5.json', lines=True)

Passing lines=True tells read_json to parse the file as one JSON object per line.
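To make the effect of lines=True concrete, here is a small sketch on an in-memory string instead of the real file. The field names overall and reviewText and the three sample reviews are assumptions for illustration only. The second half parses the same text by hand with json.loads to show what "one object per line" means.

```python
import io
import json

import pandas as pd

# Three made-up reviews in NDJSON form: one JSON object per line.
ndjson_text = (
    '{"overall": 5, "reviewText": "Great game"}\n'
    '{"overall": 1, "reviewText": "Crashes on start"}\n'
    '{"overall": 4, "reviewText": "Fun but short"}\n'
)

# pandas parses each line as a separate record when lines=True.
df_lines = pd.read_json(io.StringIO(ndjson_text), lines=True)

# The manual equivalent: decode every non-empty line on its own.
records = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]
df_manual = pd.DataFrame(records)
```

Both data frames end up with one row per line of input.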

Rating and Review Columns

We will be using a supervised learning technique, so we need labeled examples: the written review text is the input, and the rating given by the reviewer is the label. I picked these two fields as the most essential ones quickly, and that may not hold in general. So with a pinch of salt, we can move on to the next step.
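A minimal sketch of keeping only those two fields could look like the following. The column names reviewText and overall are assumptions, as is the toy data frame standing in for the loaded reviews. Dropping rows without review text is an extra step I added for the sketch.

```python
import pandas as pd

# Toy stand-in for the loaded reviews; the column names are assumed.
reviews = pd.DataFrame({
    "reviewerID": ["A1", "A2", "A3"],
    "overall": [5, 2, 4],
    "reviewText": ["Loved it", None, "Solid sequel"],
})

# Keep the input text and the label, and drop rows that have no review text.
labeled = reviews[["reviewText", "overall"]].dropna(subset=["reviewText"])
```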

The next steps in the process are to:

Find the length of the dataset.
Find the balance of the samples.
Create a small dataset for training.
Set aside a large corpus for deep learning.

Based on the ratings, we need to provide the model with an equal amount of data from each class for training. For example, if we provide more positive than negative reviews, the model will be biased toward the class it saw most. To help the model generalize, it’s crucial to balance the data. To start with, let’s check how the data is distributed.
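One way to check the length and the balance is value_counts on the rating column. This is a sketch on a made-up series of ten ratings, and the column name overall is an assumption. On the real data, the same calls would run on the rating column of the loaded data frame.

```python
import pandas as pd

# Ten made-up ratings standing in for the real rating column.
ratings = pd.Series([5, 5, 5, 5, 4, 4, 3, 2, 1, 5], name="overall")

# Length of the dataset.
total = len(ratings)

# Number of reviews per rating, ordered from 1 to 5.
counts = ratings.value_counts().sort_index()

# The same distribution as a share of the whole dataset.
shares = ratings.value_counts(normalize=True).sort_index()
```

Calling counts.plot(kind="bar") would turn the counts into a bar chart, if matplotlib is installed.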

The rating distribution clearly shows that the data is skewed toward 5-star (positive) reviews, so using the complete data would bias the model toward positive predictions. So let us create a couple of training subsets.

Small corpus: this is useful for general training.
Large corpus: the full dataset has 497,577 reviews, so it’s better to reduce the number to 100K.

Small Corpus

For the small corpus, we can start with 1% of the data. To make a balanced dataset out of the large dataset, I pick a percentage from each rating category.

This way, every rating is represented in the subset.

Another advantage of the small corpus is faster training, which lets you experiment with multiple models before choosing one.
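Here is one possible sketch of building the small corpus. Taking the same percentage from every rating would keep the original skew, so this sketch swaps in a different approach: it takes an equal number of rows per rating, sized so the total is roughly 1% of the data. That is my interpretation of "balanced", not necessarily the exact method used here. The helper name balanced_sample, the column names, and the toy data are all assumptions.

```python
import pandas as pd


def balanced_sample(df, label_col, fraction, seed=42):
    """Sample about `fraction` of df with the same row count for every label."""
    n_classes = df[label_col].nunique()
    target_total = int(len(df) * fraction)
    per_class = max(1, target_total // n_classes)
    # Never ask for more rows than the rarest class can supply.
    per_class = min(per_class, int(df[label_col].value_counts().min()))
    return df.groupby(label_col).sample(n=per_class, random_state=seed)


# Toy data with a skew toward 5-star reviews (1,000 rows in total).
toy = pd.DataFrame({
    "overall": [5] * 600 + [4] * 200 + [3] * 100 + [2] * 50 + [1] * 50,
    "reviewText": [f"review {i}" for i in range(1000)],
})

# 1% of 1,000 rows is 10 rows, split evenly across the 5 ratings.
small_corpus = balanced_sample(toy, "overall", fraction=0.01)
```

Fixing random_state keeps the subset reproducible between runs, which helps when comparing models on the same small corpus.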

Large Corpus

Take a random sample of 100,000 reviews. This gives you a larger, more representative corpus for deep learning models.
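The random sample can be sketched with DataFrame.sample. The helper name sample_large_corpus and the placeholder data frame are assumptions. The placeholder only matches the real dataset in its row count (497,577).

```python
import pandas as pd


def sample_large_corpus(df, n=100_000, seed=42):
    """Return a reproducible random sample of at most n rows (hypothetical helper)."""
    n = min(n, len(df))  # guard against datasets smaller than n
    return df.sample(n=n, random_state=seed).reset_index(drop=True)


# Placeholder frame with the same number of rows as the full dataset.
full = pd.DataFrame({"review_id": range(497_577)})
large_corpus = sample_large_corpus(full)
```

Because the sample is purely random, it keeps roughly the same rating skew as the full dataset, unlike the balanced small corpus.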