Convolutional Neural Network (CNN) for Sentiment Analysis

Movie Review Dataset

The data used in this blog was collected from the imdb.com website by Bo Pang and Lillian Lee for their paper. The dataset comprises 1,000 positive and 1,000 negative movie reviews and has been used for several related NLP tasks. For classification, Support Vector Machines reach between 78% and 82% accuracy on this data, and the performance can be improved further with 10-fold cross validation. In this blog, we are going to build a CNN for classifying the reviews.

Data Preparation

Before we build the CNN, let's look at the data first. An example of a negative review is shown below. As we can see, it contains many negative words such as "asshole" and "stupid", which indicate that this is a negative review. So the first thing we need to do is to find these negative or positive words in the reviews, which together form a "vocabulary" of known words. In order to define this vocabulary, we need to do some cleaning of the reviews. Specifically, we are going to do the following:

  • Split reviews on white space
  • Remove all punctuation from words
  • Remove all words that are known stop words
  • Remove all words that are not purely comprised of alphabetical characters
  • Remove all words that have a length <= 1 character

more like a hugh asshole , but that's beside the point , which is : nine months includes too many needlessly stupid jokes that get laughs from the ten year olds in the audience while everyone else shakes his or her head in disbelief . so , anyway , grant finds out his girlfriend is pregnant and does his usual reaction ( fluttered eyelashes , nervous smiles ) . this paves the way for every possible pregnancy/child birth gag in the book , especially since grant's equally annoying friend's wife is also pregnant . the annoying friend is played by tom arnold , who provides most of the cacophonous slapstick , none of which is funny , such as a scene where arnold beats up a costumed " arnie the dinosaur " ( you draw your own parallels on that one ) in a toy store . the only interesting character in the movie is played by jeff goldblum , who should have hid himself away somewhere after the dreadful hideaway , as an artist with a fear of ( and simultaneous longing for ) commitment . not even robin williams , who plays a russian doctor who has recently decided to switch from veterinary medicine to obstetrics , has much humor .

from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens on white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # keep only tokens that are purely alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out tokens with length <= 1 character
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
    doc = load_doc(filename)
    tokens = clean_doc(doc)
    vocab.update(tokens)

# load all docs in a directory and add them to the vocabulary
def process_docs_vocab(directory, vocab, is_train):
    for filename in listdir(directory):
        # reviews whose filenames start with 'cv9' are held out as the test set
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
        path = directory + '/' + filename
        add_doc_to_vocab(path, vocab)

vocab = Counter()
process_docs_vocab('C:/Users/ZhenbangWang/Desktop/Website/public/post/cnn_mr/txt_sentoken/neg', vocab, True)
process_docs_vocab('C:/Users/ZhenbangWang/Desktop/Website/public/post/cnn_mr/txt_sentoken/pos', vocab, True)
# print the top words in the vocab
print(vocab.most_common(10))
vocab = vocab.keys()
vocab = set(vocab)
[('film', 7983), ('one', 4946), ('movie', 4826), ('like', 3201), ('even', 2262), ('good', 2080), ('time', 2041), ('story', 1907), ('films', 1873), ('would', 1844)]
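
To see what the cleaning does, we can run clean_doc on a short snippet of the review shown above. This is only a quick check and is not part of the original pipeline.

# quick check (illustrative only): clean a short snippet of the example review
snippet = "nine months includes too many needlessly stupid jokes that get laughs from the ten year olds in the audience"
print(clean_doc(snippet))  # punctuation, stop words and single-character tokens are removed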

Training data and Testing data

Now it is time to split the data into training and testing sets. Here we use 90% of the reviews for training and 10% for testing: reviews whose filenames start with 'cv9' are held out as the test set, and the rest are used for training.

# turn a doc into a clean string of vocabulary tokens
def clean_doc_vocab(doc, vocab):
    tokens = doc.split()
    # remove punctuation and keep only tokens that appear in the vocabulary
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    tokens = [w for w in tokens if w in vocab]
    # re-join the tokens into a single space-separated string
    tokens = ' '.join(tokens)
    return tokens

# load all docs in a directory and return them as cleaned strings
def process_docs(directory, vocab, is_train):
    documents = list()
    for filename in listdir(directory):
        # reviews whose filenames start with 'cv9' are held out as the test set
        if is_train and filename.startswith('cv9'):
            continue
        if not is_train and not filename.startswith('cv9'):
            continue
        path = directory + '/' + filename
        doc = load_doc(path)
        tokens = clean_doc_vocab(doc, vocab)
        documents.append(tokens)
    return documents
# load all training reviews
positive_docs = process_docs('C:/Users/ZhenbangWang/Desktop/Website/public/post/cnn_mr/txt_sentoken/pos', vocab, True)
negative_docs = process_docs('C:/Users/ZhenbangWang/Desktop/Website/public/post/cnn_mr/txt_sentoken/neg', vocab, True)
train_docs = negative_docs + positive_docs
# load all test reviews
positive_docs = process_docs('C:/Users/ZhenbangWang/Desktop/Website/public/post/cnn_mr/txt_sentoken/pos', vocab, False)
negative_docs = process_docs('C:/Users/ZhenbangWang/Desktop/Website/public/post/cnn_mr/txt_sentoken/neg', vocab, False)
test_docs = negative_docs + positive_docs
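
As a quick sanity check (not part of the original script), the split should give 1,800 training documents and 200 test documents:

# 900 negative + 900 positive training reviews, 100 + 100 test reviews
print(len(train_docs), len(test_docs))  # expected: 1800 200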

Encoding Text to Numerical Values and Zero-Padding

There is one more thing we need to do before we create the CNN: we need to transform the text data into numerical inputs, since the inputs to a neural network have to be numerical. Here we encode the training and testing documents as sequences of integers using the Tokenizer class in the Keras API. Since the reviews have different lengths (the number of words in each review varies), we also need to standardize the input size for the neural network. In this blog, we use zero-padding, i.e. we make all reviews the same length by appending zeros to the reviews with smaller word counts.
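
Before running this on the reviews, here is a toy illustration of what the Tokenizer and pad_sequences do; the two short sentences are made up purely for demonstration.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# toy illustration only: two made-up documents
toy_docs = ['good movie', 'not a good movie at all']
toy_tokenizer = Tokenizer()
toy_tokenizer.fit_on_texts(toy_docs)
toy_encoded = toy_tokenizer.texts_to_sequences(toy_docs)
print(toy_encoded)                                 # each word becomes an integer index
print(pad_sequences(toy_encoded, padding='post'))  # the shorter sequence gets trailing zeros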

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from numpy import array
from numpy import asarray
from numpy import zeros
# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(train_docs)
# sequence encode
encoded_docs_train = tokenizer.texts_to_sequences(train_docs)
# pad sequences
max_length = max([len(s.split()) for s in train_docs])
Xtrain = pad_sequences(encoded_docs_train, maxlen=max_length, padding='post')
# define training labels
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(test_docs)
# pad sequences
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define test labels
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])
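
At this point the padded arrays should have 1,800 training rows and 200 test rows, each of length max_length, which works out to 1,380 here and matches the input length shown in the model summary below:

print(Xtrain.shape, Xtest.shape)  # (1800, 1380) (200, 1380)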

Building the CNN

Now we can start building our neural network. We use a Convolutional Neural Network (CNN), as CNNs have proven to be successful at document classification problems. A conservative CNN configuration is used with 32 filters (parallel fields for processing words), a kernel size of 8, and a rectified linear ('relu') activation function. This is followed by a pooling layer that reduces the output of the convolutional layer by half. Next, the 2D output from the CNN part of the model is flattened into one long vector to represent the 'features' extracted by the CNN. The back-end of the model is a standard Multilayer Perceptron that interprets the CNN features. The output layer uses a sigmoid activation function to output a value between 0 and 1 for negative and positive sentiment in the review.

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
# define vocabulary size (largest integer value)
vocab_size = len(tokenizer.word_index) + 1
# define model
model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=max_length))
model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 1380, 100)         4427700   
_________________________________________________________________
conv1d (Conv1D)              (None, 1373, 32)          25632     
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 686, 32)           0         
_________________________________________________________________
flatten (Flatten)            (None, 21952)             0         
_________________________________________________________________
dense (Dense)                (None, 10)                219530    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 11        
=================================================================
Total params: 4,672,873
Trainable params: 4,672,873
Non-trainable params: 0
_________________________________________________________________
None
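
As a quick check on where these numbers come from, the shapes and parameter counts in the summary can be reproduced by hand. This is only a sanity check, not part of the training script.

# sanity check: reproduce the shapes and parameter counts from the summary above
embedding_dim = 100
vocab_size = 4427700 // embedding_dim      # 44277 words known to the tokenizer (+1 for padding index)
conv_params = 8 * embedding_dim * 32 + 32  # kernel_size * embedding_dim * filters + biases = 25632
conv_len = 1380 - 8 + 1                    # a 'valid' convolution shortens the sequence to 1373
pooled_len = conv_len // 2                 # max pooling halves it to 686
flat_len = pooled_len * 32                 # 686 * 32 = 21952 flattened features
dense_params = flat_len * 10 + 10          # 219530
output_params = 10 * 1 + 1                 # 11
total = vocab_size * embedding_dim + conv_params + dense_params + output_params
print(total)                               # 4672873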

Training the CNN

We use a binary cross entropy loss function because the problem we are learning is a binary classification problem. The efficient Adam implementation of stochastic gradient descent is used and we keep track of accuracy in addition to loss during training. The model is trained for 10 epochs, or 10 passes through the training data.

# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(Xtrain, ytrain, epochs=10, verbose=2)
# evaluate
loss, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Test Accuracy: %f' % (acc*100))
Epoch 1/10
57/57 - 6s - loss: 0.6917 - accuracy: 0.5206
Epoch 2/10
57/57 - 6s - loss: 0.5656 - accuracy: 0.7106
Epoch 3/10
57/57 - 6s - loss: 0.1097 - accuracy: 0.9778
Epoch 4/10
57/57 - 6s - loss: 0.0098 - accuracy: 1.0000
Epoch 5/10
57/57 - 6s - loss: 0.0026 - accuracy: 1.0000
Epoch 6/10
57/57 - 6s - loss: 0.0015 - accuracy: 1.0000
Epoch 7/10
57/57 - 6s - loss: 0.0011 - accuracy: 1.0000
Epoch 8/10
57/57 - 6s - loss: 8.2282e-04 - accuracy: 1.0000
Epoch 9/10
57/57 - 6s - loss: 6.7293e-04 - accuracy: 1.0000
Epoch 10/10
57/57 - 6s - loss: 5.6252e-04 - accuracy: 1.0000
Test Accuracy: 88.000000

In summary, we achieve 88% test accuracy for classifying the reviews. Compared with traditional ML methods like SVM, this improves the test accuracy by about 6 percentage points over the 82% figure mentioned above.
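
For reference, a minimal SVM baseline on the same cleaned documents could be sketched as follows. This is only an illustration under the assumption that scikit-learn is available; it is not the setup used in the original paper, so the exact accuracy will vary.

# illustrative only: a minimal bag-of-words SVM baseline on the same cleaned documents
# (assumes scikit-learn is installed; not the exact setup used by Pang and Lee)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

svm = make_pipeline(TfidfVectorizer(), LinearSVC())
svm.fit(train_docs, ytrain)
print('SVM Test Accuracy: %f' % (accuracy_score(ytest, svm.predict(test_docs)) * 100))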
