Amazon Review Classification:

Test data: 18000 Amazon Reviews
Train data: 18000 reviews with class labels.
Goal: To predict class labels for the reviews in the test data.
To summarize, we have training data and test data. Each training review has a class label of either +1 or -1: +1 denotes a good review and -1 a bad one. The labels of the test reviews are unknown. The goal is to build a knowledge base from the training data that can predict those labels. To do this, each test review is compared against the training reviews to find the ‘k’ closest ones. The labels of those neighbors are then averaged, and the test review takes whichever label, positive or negative, dominates.
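
As a rough illustration of the voting step described above, here is a minimal sketch in Python. The function name predict_label and the tie-breaking rule are assumptions made for this example, not details taken from the project. Because the labels are +1 and -1, the sign of their sum (or average) is the same as a majority vote.

```python
def predict_label(neighbor_labels):
    """Predict +1 or -1 from the labels of the k closest training reviews.

    Hypothetical helper for illustration only.
    """
    # Labels are +1 or -1, so the sign of the sum matches the majority.
    total = sum(neighbor_labels)
    # Assumption: a tie (possible when k is even) is resolved as +1.
    return 1 if total >= 0 else -1
```

For example, neighbors labeled [1, 1, -1] would give +1, while [-1, -1, 1] would give -1. Choosing an odd k avoids ties altogether.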
Implementation: Used TfidfVectorizer to tokenize, pre-process, and extract features from the reviews. Then implemented a KNN algorithm that finds the k closest neighbors of each test review, checks their class labels, and predicts the label for the test review by majority vote.
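
The pipeline above could look roughly like the sketch below. It assumes scikit-learn's TfidfVectorizer with default settings and uses cosine similarity as the closeness measure; the write-up does not state which distance metric, value of k, or vectorizer options were actually used, so the function name knn_predict, the default k=3, and the tie-breaking rule are all assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def knn_predict(train_texts, train_labels, test_texts, k=3):
    """Label each test review by majority vote of its k nearest training reviews.

    Hypothetical sketch: cosine similarity and k=3 are assumed, not confirmed.
    """
    # Fit the vocabulary and IDF weights on the training reviews only.
    vectorizer = TfidfVectorizer()
    train_matrix = vectorizer.fit_transform(train_texts)
    test_matrix = vectorizer.transform(test_texts)

    # TfidfVectorizer L2-normalizes rows by default, so the dot product
    # of two rows equals their cosine similarity.
    similarities = (test_matrix @ train_matrix.T).toarray()

    labels = np.asarray(train_labels)
    predictions = []
    for row in similarities:
        # Indices of the k most similar training reviews.
        nearest = np.argsort(row)[-k:]
        # Labels are +1/-1, so the sign of the sum is the majority vote.
        vote = labels[nearest].sum()
        # Assumption: ties are resolved as +1.
        predictions.append(1 if vote >= 0 else -1)
    return predictions
```

This version builds the full test-by-train similarity matrix in memory, which is fine for a handful of reviews. For 18000 test reviews against 18000 training reviews, the comparison would likely need to be done in batches.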

Platform: Coded in Python.