Sentiment Analysis using Doc2Vec
Word2Vec is dope. In short, it takes in a corpus and churns out a vector for each word in it. What’s so special about these vectors, you ask? Well, similar words end up near each other. Furthermore, these vectors represent how we use the words. For example, v_man - v_woman is approximately equal to v_king - v_queen, illustrating the relationship that “man is to woman as king is to queen”. This process, in NLP voodoo, is called word embedding. These representations have been applied widely. This is made even more awesome with the introduction of Doc2Vec, which represents not only words but entire sentences and documents. Imagine being able to represent an entire sentence using a fixed-length vector and proceeding to run all your standard classification algorithms. Isn’t that amazing?
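To make the analogy concrete, here is a toy sketch. The four vectors are hand-picked for illustration (they are not taken from any trained model, and real embeddings have around 100 dimensions); the point is only to show what “the two difference vectors point the same way” means.

```python
import math

# Hand-picked toy vectors (hypothetical, not learned from a corpus).
vectors = {
    "man":   [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king":  [3.0, 0.0, 0.9],
    "queen": [3.0, 1.0, 0.9],
}

def subtract(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

offset_common = subtract(vectors["man"], vectors["woman"])
offset_royal = subtract(vectors["king"], vectors["queen"])

# With these toy numbers the two offsets are identical, so the similarity is 1.
similarity = cosine(offset_common, offset_royal)
```

With learned vectors the two offsets are only approximately parallel, so the similarity would be high but below 1.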
However, Word2Vec documentation is shit. The C-code is nigh unreadable (700 lines of highly optimized, and sometimes weirdly optimized code). I personally spent a lot of time untangling Doc2Vec and crashing into ~50% accuracies due to implementation mistakes. This tutorial aims to help other users get off the ground using Word2Vec for their own research. We use Word2Vec for sentiment analysis by attempting to classify the Cornell IMDB movie review corpus (http://www.cs.cornell.edu/people/pabo/movie-review-data/).
The source code used in this demo can be found at https://github.com/linanqiu/word2vec-sentiments
gensim has a much more readable implementation of Word2Vec (and Doc2Vec). Bless those guys. We also use numpy for general array manipulation, and sklearn for the Logistic Regression classifier.
We can’t input the raw reviews from the Cornell movie review data repository. Instead, we clean them up by converting everything to lower case and removing punctuation. I did this via bash, and you can do this easily via Python, JS, or your favorite poison. This step is trivial.
The result is five documents:
test-neg.txt: 12500 negative movie reviews from the test data
test-pos.txt: 12500 positive movie reviews from the test data
train-neg.txt: 12500 negative movie reviews from the training data
train-pos.txt: 12500 positive movie reviews from the training data
train-unsup.txt: 50000 unlabelled movie reviews
Each file holds one review per line: every document sits on a single line, and documents are separated by newlines. This is extremely important, because our parser depends on this to identify sentences.
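The clean-up described above (lower-casing, stripping punctuation, one review per line) could be sketched in Python roughly like this. The helper names clean_review and write_corpus are made up for this example; the original clean-up was done in bash, so treat this as one possible equivalent rather than the script actually used.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DROP_PUNCTUATION = str.maketrans("", "", string.punctuation)

def clean_review(raw):
    """Lower-case a review, strip punctuation, and squash it onto one line."""
    text = raw.lower().translate(_DROP_PUNCTUATION)
    # split() with no arguments also swallows newlines and repeated spaces
    return " ".join(text.split())

def write_corpus(reviews, path):
    """Write one cleaned review per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        for review in reviews:
            handle.write(clean_review(review) + "\n")
```

For example, clean_review("Great movie!\nLoved it, really.") gives "great movie loved it really", which is safe to store as a single line.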
Feeding Data to Doc2Vec
Doc2Vec (the portion of gensim that implements the Doc2Vec algorithm) does a great job at word embedding, but a terrible job at reading in files. It only takes in a LabeledLineSentence object, which basically yields LabeledSentence objects; LabeledSentence is a class from gensim.models.doc2vec representing a single sentence. Why the “Labeled” word? Well, here’s how Doc2Vec differs from Word2Vec.
Word2Vec simply converts a word into a vector.
Doc2Vec not only does that, but also aggregates all the words in a sentence into a vector. To do that, it simply treats a sentence label as a special word, and does some voodoo on that special word. Hence, that special word is a label for a sentence.
So we have to format each sentence as a list of words paired with a label. LabeledSentence is simply a tidier way to do that. We don’t really need to care about how LabeledSentence works exactly; we just have to know that it stores those two things: a list of words and a label.
However, we need a way to convert our newline-separated corpus into a collection of LabeledSentences. The default constructor for the default LabeledLineSentence class in Doc2Vec can do that for a single text file, but can’t do that for multiple files. In classification tasks, however, we usually deal with multiple documents (test, training, positive, negative etc). Ain’t that annoying?
So we write our own LabeledLineSentence class. The constructor takes in a dictionary that defines the files to read and the label prefixes that sentences from each file should take on. Then, Doc2Vec can either read the collection directly via the iterator, or we can access the array directly. We also need a function to return a permuted version of the array of LabeledSentences. We’ll see why later on.
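A minimal sketch of such a class is below. To keep it runnable without gensim, it uses a namedtuple as a stand-in for gensim’s labelled-sentence type (named LabeledSentence in this post; newer gensim releases call it TaggedDocument). The field names, the label format and the method bodies are assumptions for illustration, not necessarily the code from the linked repository.

```python
import random
from collections import namedtuple

# Stand-in for gensim's labelled-sentence type (assumption: a word list plus
# a list of labels), so the sketch runs without gensim installed.
LabeledSentence = namedtuple("LabeledSentence", ["words", "tags"])

class LabeledLineSentence:
    def __init__(self, sources):
        # sources maps a file path to the label prefix for that file's lines
        self.sources = sources
        self.sentences = []

    def __iter__(self):
        # Stream one labelled sentence per line of every source file.
        for path, prefix in self.sources.items():
            with open(path, encoding="utf-8") as handle:
                for line_no, line in enumerate(handle):
                    label = "%s_%d" % (prefix, line_no)
                    yield LabeledSentence(line.split(), [label])

    def to_array(self):
        # Materialise the whole corpus as a list and remember it.
        self.sentences = list(self)
        return self.sentences

    def sentences_perm(self):
        # Return a shuffled copy; the stored order is left untouched.
        shuffled = list(self.sentences)
        random.shuffle(shuffled)
        return shuffled
```

Note that sentences_perm only has something to shuffle after to_array has been called once.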
Now we can feed the data files to LabeledLineSentence. As we mentioned earlier, LabeledLineSentence simply takes a dictionary with the file names as keys and the special prefixes for sentences from each file as values. The prefixes need to be unique, so that there is no ambiguity between sentences from different documents.
The prefixes will have a counter appended to them to label individual sentences in the documents.
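As an illustration, the dictionary and the resulting labels might look like the sketch below. The file names are the five corpus files listed earlier; the prefix strings and the label_for helper are hypothetical, and any unique strings would do.

```python
# File names come from the five corpus files; the prefixes are hypothetical
# and only need to be unique.
sources = {
    "test-neg.txt": "TEST_NEG",
    "test-pos.txt": "TEST_POS",
    "train-neg.txt": "TRAIN_NEG",
    "train-pos.txt": "TRAIN_POS",
    "train-unsup.txt": "TRAIN_UNS",
}

# Guard against two files sharing a prefix, which would make labels collide.
assert len(set(sources.values())) == len(sources)

def label_for(prefix, index):
    """Label of the index-th sentence (0-based) from a file with this prefix."""
    return "%s_%d" % (prefix, index)

first_negative_training_label = label_for(sources["train-neg.txt"], 0)
```

Here the first line of train-neg.txt would be labelled TRAIN_NEG_0, the second TRAIN_NEG_1, and so on.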
Building the Vocabulary Table
Doc2Vec requires us to build the vocabulary table (simply digesting all the words and filtering out the unique words, and doing some basic counts on them). So we feed it the array of sentences.
model.build_vocab takes an array of LabeledSentence objects, hence our to_array function in the LabeledLineSentence class.
If you’re curious about the parameters, do read the Word2Vec documentation. Otherwise, here’s a quick rundown:
min_count: ignore all words with total frequency lower than this. You have to set this to 1, since the sentence labels only appear once. Setting it any higher than 1 will miss out on the sentences.
window: the maximum distance between the current and predicted word within a sentence. This is simply the size of the context window that the model slides over the text (Word2Vec supports both CBOW and skip-gram training, and the window applies to either).
size: dimensionality of the feature vectors in output. 100 is a good number. If you’re extreme, you can go up to around 400.
sample: threshold for configuring which higher-frequency words are randomly downsampled
workers: use this many worker threads to train the model
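To see why min_count matters for the sentence labels, here is a toy sketch of vocabulary building as described above: count every token, treat each sentence label as one more token, and drop whatever falls below min_count. The function count_vocab is written for this example and is not gensim’s implementation.

```python
from collections import Counter

def count_vocab(sentences, min_count=1):
    """Count every token and keep those seen at least min_count times."""
    counts = Counter()
    for words, label in sentences:
        counts.update(words)
        counts[label] += 1  # the label is treated like one more token
    return {token: n for token, n in counts.items() if n >= min_count}

# Two tiny labelled sentences (hypothetical labels).
sentences = [
    (["good", "movie"], "TRAIN_POS_0"),
    (["bad", "movie"], "TRAIN_NEG_0"),
]

vocab_keep_all = count_vocab(sentences, min_count=1)
vocab_pruned = count_vocab(sentences, min_count=2)
```

With min_count=1 both labels survive; with min_count=2 only "movie" is left, so the sentence labels (which each occur once) would be lost.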
Now we train the model. The model is better trained if, in each training epoch, the sequence of sentences fed to the model is randomized. This is important: skipping this step gives you really shitty results. This is the reason for the sentences_perm method in our LabeledLineSentence class.
We train it for 10 epochs. If I had more time, I’d have done 20.
This process takes around 10 mins, so go grab some coffee.
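The shuffle-every-epoch loop can be sketched as follows. train_one_pass is a placeholder for whatever call performs one training pass over the sentences; here it is replaced by a function that just records the order it received, so the sketch runs without a model.

```python
import random

def train_with_shuffling(sentences, train_one_pass, epochs=10, seed=None):
    """Call train_one_pass once per epoch on a freshly shuffled copy."""
    rng = random.Random(seed)
    for _ in range(epochs):
        shuffled = list(sentences)  # copy, so the caller's order survives
        rng.shuffle(shuffled)
        train_one_pass(shuffled)

# Stand-in for the model's training call: record the order it was given.
seen_orders = []
corpus = ["s0", "s1", "s2", "s3", "s4"]
train_with_shuffling(corpus, seen_orders.append, epochs=10, seed=0)
```

Each of the ten recorded passes contains the same five sentences, just in a different order, while corpus itself is never reordered.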
Saving and Loading Models
To avoid training the model again, we can save it.
And load it.
Inspecting the Model
Let’s see what our model gives. It seems that it has kind of understood the word good, since the words most similar to good include positive words such as astounding. This is really awesome (and important), since we are doing sentiment analysis.
[(u'excellent', 0.49283793568611145), (u'rudolfs', 0.43693652749061584), (u'chattanooga', 0.4162406325340271), (u'dodgy', 0.4156044125556946), (u'hurts', 0.4120287299156189), (u'maytime', 0.4106871485710144), (u'problematic', 0.4097048342227936), (u'lousy', 0.40627604722976685), (u'humanization', 0.40399202704429626), (u'detaches', 0.40335381031036377)]
We can also prop the hood open and see what the model actually contains: a vector for each of the words and sentences in the model. We can access all of them using model.syn0 (for the geekier ones among you, syn0 is simply the weight matrix of the input layer of the shallow neural network, which is where the learned vectors live). However, we don’t want to use the entire syn0, since that contains the vectors for the words as well, and we are only interested in the ones for sentences.
Here’s a sample vector for the first sentence in the training set for negative reviews:
array([-0.29009071, -0.12225933, 0.31921104, 0.12690307, -0.78038144, 0.22722125, 0.20833154, 0.43868524, 0.23561548, -0.36413202, -0.15643257, 0.00461203, 0.08018852, -0.08258788, -0.2760058 , 0.10478264, -0.08799575, -0.37180164, -0.33013585, 0.18241297, 0.88061708, 0.31900761, 0.23834513, -0.10049706, 0.28829831, -0.10783304, 0.0262799 , 0.14395693, -0.10469725, -0.52104455, -0.65519089, 0.55344343, 0.01129826, 0.04632513, -0.15539262, -0.28914869, 0.33816752, -0.62154663, -0.02470196, 0.62869382, 0.11478873, -0.25910828, -0.08567604, -0.02737088, -0.07904764, -1.01183033, -0.17131189, 0.04178049, -0.25471294, -0.42550623, -0.58592063, -0.31924966, 0.17547569, 0.20776786, -0.34506091, 0.00154094, 0.30706513, 0.09242618, 0.02120452, -0.09602273, -0.40714192, 0.47554722, 0.36416441, 0.01952979, 0.6814065 , -0.15965664, -0.22644502, -0.24989423, 0.40612498, 0.27820811, -0.35662967, 0.27260619, 0.07444458, -0.40855888, 0.09954116, 0.11635212, -0.13925125, 0.18950833, 0.10479139, -0.17033359, 0.06540342, -0.01675605, 0.00690889, -0.06219884, 0.13443422, 0.05676769, -0.19048406, -0.08656956, 0.16460682, -0.18292029, 0.42254189, 0.23022693, -0.10003133, -0.52825296, 0.18979053, -0.10800292, 0.01646511, -0.00504603, 0.173182 , 0.19618541], dtype=float32)
Now let’s use these vectors to train a classifier. First, we must extract the training vectors. Remember that we have a total of 25000 training reviews, with equal numbers of positive and negative ones (12500 positive, 12500 negative).
Hence, we create numpy arrays (since the classifier we use only takes numpy arrays). There are two parallel arrays, one containing the vectors (train_arrays) and the other containing the labels.
We simply put the positive ones in the first half of the array, and the negative ones in the second half.
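A sketch of that layout is below. The trained model is replaced by a plain dictionary of random vectors keyed by sentence label, and the label names, the name train_labels and the tiny class size are all assumptions made so the example runs on its own.

```python
import numpy

VECTOR_SIZE = 100  # matches the size parameter discussed earlier
N_PER_CLASS = 5    # 12500 in the article; kept tiny so the sketch runs fast

# Stand-in for the trained model: maps a (hypothetical) sentence label to
# its vector. Real vectors would come from the Doc2Vec model instead.
rng = numpy.random.RandomState(0)
sentence_vectors = {}
for prefix in ("TRAIN_POS", "TRAIN_NEG"):
    for i in range(N_PER_CLASS):
        sentence_vectors["%s_%d" % (prefix, i)] = rng.rand(VECTOR_SIZE)

train_arrays = numpy.zeros((2 * N_PER_CLASS, VECTOR_SIZE))
train_labels = numpy.zeros(2 * N_PER_CLASS)

for i in range(N_PER_CLASS):
    # positives fill the first half, negatives the second half
    train_arrays[i] = sentence_vectors["TRAIN_POS_%d" % i]
    train_arrays[N_PER_CLASS + i] = sentence_vectors["TRAIN_NEG_%d" % i]
    train_labels[i] = 1
    train_labels[N_PER_CLASS + i] = 0
```

Row i of train_arrays and entry i of train_labels always describe the same review, which is what lets the classifier pair them up.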
The training array looks like this: rows and rows of vectors representing each sentence.
[[ 0.18376358 0.14441976 0.09568384 ..., -0.28158343 0.29901281 0.30613518] [-0.24681839 0.71076155 -0.40998992 ..., 0.39490703 -0.0812839 0.21507834] [-0.05608005 0.48018554 0.35442445 ..., 0.28135905 -0.3482835 0.23772541] ..., [-0.23058473 0.03048714 0.8894285 ..., -0.5808149 0.18430299 -0.08481987] [ 0.33824882 0.32103017 -0.24168059 ..., -0.55988938 0.61781228 -0.4085339 ] [ 0.3712599 -0.18826039 0.30162022 ..., -0.21118143 0.00217488 -0.4914557 ]]
The labels are simply category labels for the sentence vectors – 1 representing positive and 0 for negative.
[ 1. 1. 1. ..., 0. 0. 0.]
We do the same for testing data – data that we are going to feed to the classifier after we’ve trained it using the training data. This allows us to evaluate our results. The process is pretty much the same as extracting the vectors for the training data.
Now we train a logistic regression classifier using the training data.
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)
And find that we have achieved near 87% accuracy for sentiment analysis. This is rather incredible, given that we are only using a logistic regression classifier and a very shallow neural network.
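The train-then-score step can be sketched with sklearn as below. The document vectors are replaced by synthetic data (two well-separated Gaussian blobs), so the accuracy this produces says nothing about the 87% figure above; it only shows the shape of the calls. The helper make_split and the variable names are assumptions for this example.

```python
import numpy
from sklearn.linear_model import LogisticRegression

rng = numpy.random.RandomState(0)

def make_split(n_per_class):
    """Synthetic stand-in for document vectors: two shifted Gaussian blobs."""
    positives = rng.normal(loc=1.0, scale=1.0, size=(n_per_class, 100))
    negatives = rng.normal(loc=-1.0, scale=1.0, size=(n_per_class, 100))
    arrays = numpy.vstack([positives, negatives])
    labels = numpy.concatenate(
        [numpy.ones(n_per_class), numpy.zeros(n_per_class)]
    )
    return arrays, labels

train_arrays, train_labels = make_split(200)
test_arrays, test_labels = make_split(200)

classifier = LogisticRegression()
classifier.fit(train_arrays, train_labels)

# Mean accuracy on the held-out split (a fraction between 0 and 1)
accuracy = classifier.score(test_arrays, test_labels)
```

Scoring on a split the classifier never saw during fit is what makes the accuracy figure meaningful.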
Isn’t this fantastic? Hope I saved you some time!
- Doc2vec: https://radimrehurek.com/gensim/models/doc2vec.html
- Paper that inspired this: http://arxiv.org/abs/1405.4053