Sentiment Analysis using Twitter data

- 17 mins

Introduction

Sentiment analysis, also called opinion mining, can be defined as the use of natural language processing, text analysis, and computational linguistics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice-of-the-customer materials such as reviews and survey responses, online and social media, and healthcare materials, for applications that range from marketing to customer service to clinical medicine. More information about sentiment analysis can be found here.

The rise of social media has made sentiment analysis an interesting research area: users write about their views and emotions on just about anything, so social media data provide a very useful platform for sentiment analysis.

We will use tweets downloaded from https://www.crowdflower.com as our dataset and predict the emotions of the users from their tweets.

Purpose

To train a classifier that can predict emotion from the labeled dataset.

We start by importing pandas:

import pandas as pd
# load the dataset
tweets = pd.read_csv('text_emotion.csv')
# keep only the tweets labeled sadness, happiness or love
tweets = tweets[(tweets['sentiment'] == 'sadness') | (tweets['sentiment'] == 'happiness') | (tweets['sentiment'] == 'love')]
# check the dimensions of the dataset
tweets.shape
(14216, 4)

Let's check the first five tweets:

tweets.head()
tweet_id sentiment author content
1 1956967666 sadness wannamama Layin n bed with a headache ughhhh...waitin o...
2 1956967696 sadness coolfunky Funeral ceremony...gloomy friday...
6 1956968487 sadness ShansBee I should be sleep, but im not! thinking about ...
8 1956969035 sadness nic0lepaula @charviray Charlene my love. I miss you
9 1956969172 sadness Ingenue_Em @kelcouch I'm sorry at least it's Friday?

As shown, the dataset has already been labeled, so this is a supervised learning problem.

Next, we define a mapping from the sentiment categories (love, happiness, sadness) to numeric labels:

mapper = {
    'love':1,
    'happiness':2,
    'sadness':3,
}

We will now transform the labels to numeric values:

tweets.sentiment = tweets.sentiment.map(mapper)

Let's check the first five tweets again:

tweets.head()
tweet_id sentiment author content
1 1956967666 3 wannamama Layin n bed with a headache ughhhh...waitin o...
2 1956967696 3 coolfunky Funeral ceremony...gloomy friday...
6 1956968487 3 ShansBee I should be sleep, but im not! thinking about ...
8 1956969035 3 nic0lepaula @charviray Charlene my love. I miss you
9 1956969172 3 Ingenue_Em @kelcouch I'm sorry at least it's Friday?

Done! We have successfully transformed our labels to numeric values.

The features of interest to us are sentiment and content, so let's separate them:

X = tweets.content
y = tweets.sentiment

The next thing to do is to split our dataset into training and test sets.

# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.20, random_state=42)

Machine learning algorithms work on numeric values, so we will need to transform the tweet content as well.

from sklearn.feature_extraction.text import CountVectorizer
# instantiate the vectorizer
vect = CountVectorizer()

Now we fit and transform the content using the CountVectorizer instance.

# learn training data vocabulary, then use it to create a document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)

Similarly, we transform the test data using the vocabulary fitted on the training data:

# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_train_dtm
<11372x19354 sparse matrix of type '<type 'numpy.int64'>'
	with 138459 stored elements in Compressed Sparse Row format>
X_test_dtm
<2844x19354 sparse matrix of type '<type 'numpy.int64'>'
	with 31072 stored elements in Compressed Sparse Row format>
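
Each row of the document-term matrix corresponds to a tweet, and each of the 19,354 columns corresponds to a term in the training vocabulary. If you want to peek at that vocabulary, a minimal sketch is shown below (get_feature_names() is the method name in older scikit-learn releases; newer versions renamed it to get_feature_names_out()):

# inspect a small slice of the learned vocabulary
feature_names = vect.get_feature_names()           # older scikit-learn releases
# feature_names = vect.get_feature_names_out()     # scikit-learn >= 1.0
print(len(feature_names))        # 19354 terms
print(feature_names[5000:5010])  # a few terms from the middle of the vocabulary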

The Models and Evaluation

Naive Bayes Classifier
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

We train the model and record how long training takes:

%time nb.fit(X_train_dtm, y_train)
CPU times: user 21.8 ms, sys: 365 µs, total: 22.1 ms
Wall time: 39.5 ms

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Make predictions:

# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)
We check the accuracy of the Naive Bayes classifier:
# calculate accuracy of class predictions
from sklearn import metrics

metrics.accuracy_score(y_test, y_pred_class)
0.6413502109704642

Let's inspect the confusion matrix:

# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)
array([[311, 263, 169],
       [176, 619, 219],
       [ 41, 152, 894]], dtype=int64)
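
The confusion matrix alone can be hard to read, so per-class precision and recall give a clearer picture of which emotions the model confuses. A minimal sketch using the same y_test and y_pred_class (the target_names ordering follows our label mapping 1 = love, 2 = happiness, 3 = sadness):

# per-class precision, recall and F1-score for the Naive Bayes predictions
print(metrics.classification_report(y_test, y_pred_class,
                                    target_names=['love', 'happiness', 'sadness']))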

Let's inspect some of the misclassified tweets. With our numeric labels (1 = love, 2 = happiness, 3 = sadness), the condition y_test < y_pred_class selects tweets whose predicted label is higher than the true one:

X_test[y_test < y_pred_class]
30126                               is ready for summer!!!
23139    @dajuin Appending the #verticalchinese hash ta...
23008    Up, I slept in till 11.02 !!! Shooting a new v...
28758             @NJE112 hey mate fancy finden you on hea
30314    Headed to eat with my hubby n my mommy!!  So r...
3328     I think I love a part of me saying some cynica...
37732    OMG!!  Booth's hallucination in the latest epi...
23772    @PaulHarriott Hahaha!! Tv isn't really my thin...
4237     @almcheese alms ur tweets r all delicious. Hah...
36134    @ lovelytrinkets I like the way you worded tha...
27815    @SuzeBoozey LOL good to know the time in LA, i...
1016     Trying to learn how to do this &quot;twitter&q...
11616    @descendtorise i cant go on fb. i found out im...
34922    @officialTila YES!! I wanna come hang out with...
32556    @southernbell361 Yeah, Jimmy Fallon is back to...
25173            @afwife08 Good Morning, good wakeup music
36299    Last week I had 516 hits on my guinea pig pict...
30116    I have lived through our band's first performa...
1587     kewl the JB chat was awesome. I had to miss lo...
3606     @HospitalityMan  got the laptop but no golden ...
20977    Following @perth_aisa These guys could be shif...
6720     goodnight loveee, i have to attend extra class...
12909    @gfalcone601 tom is not with u!  hahahahahahha...
23664                               short work week for me
35125    @melissamark_ hehe go dye it beb! put more blo...
20589    @nuttychris Aww thank you for asking people to...
38111          @yaeljk NKOTB world is the best place to be
29400    I'm coming up with a new plan with my bestie. ...
22109                                      climbed snowdon
36825                Ill catch you at the very last second
                               ...                        
29010    @MikeHuntington oops, me and my drunken stupor...
26679    @jerzicua I didn't say I met *all* the awesome...
36168    going to bed. good night everyone! i love you ...
31374    Just Saw Confessions Of A Shopoholic...Totally...
29841    @Praxilla My kids count down the days till Sat...
11944    Re-pinging @NUNU_B: Is it pathetic that I .......
34216    thought Yes Man was good  had a blast with old...
1424     I really wanted to watch Mary Poppins and sing...
17988    @Keiyaunna grrrrri want you to come  kiss just...
21313             I am thoroughly enjoying my new ringtone
36791    After the sketchy moments at A mtn, the friend...
27023                                  Now english academy
1357     oooooh!  this guy is so amazing, he's so cute,...
35802    haven't twittered in forever. yeah, just had m...
30879                               Its 11:11  make a wish
37237                                       Living my life
26531    @sweetemmaxxx yeeh. i also have a thing for dr...
26397           @cuddle_bug68 Bill is super!! Thanks you!!
10726    O'Charleys? Pretty good. Especially when its f...
20863    NiGHT NiGHT MY TWiTTER LOVES ON THE PHONE THEN...
24818        @labella27 That is so sweet!! Have a good day
9841     i love my new phone but hate that i didnt get ...
22031    Am soo happy about today .. the going home bit...
29116    @esmeg Had one of the female servers tell her ...
246      I Can`t do 30 minutes of Treadmill  but done 3...
21775                 Monday blues? Not today, not for me.
37039                               happy birthday to me!!
19776    http://twitpic.com/6814w - So glad Sam is in a...
24515    @mikefoong lol i like challenges  the more imp...
5982     Don't you hate it when you're left with one sl...
Name: content, Length: 651, dtype: object
# print the tweets whose predicted label is lower than the true one
X_test[y_test > y_pred_class]
37675          Wishing @Blacksocialite a Happy Mothers Day
5783        @MikeandBobShow   Awww, that wasn't very nice.
17116    @dougiemcfly Have a great show Doug, have fun....
24494    congratulations penjiiii !!! Are we calling hi...
5330     phew! i am MELTING!  stupid fan - you picked a...
14297    Just saw up with my favorites and surprisingly...
8732     @theNetImp yeah... it's gonna be a long while ...
24700    @dotcompals yes... nice. I missed a lot of fun...
10123    A little 3 mile run done!  22.5 minutes...  I'...
34138                                 i'm addicted to home
35534    Happy anniversary to @tayloredot and @hellohou...
27096                                    @Monodi what lol?
26443                     @macpowell I'M AWAKE, AND HAPPY!
14252    @EshSoMajor how dare u? after all the love mak...
13416    @purelynarcotic Ohh.. #twpp is falling silent....
1517                          where's enthusiasm in meeeee
33232    @laurenconrad http://twitpic.com/4wh4d - LAURE...
17083    @SCHATJE TY. Just a long tiring day filled wit...
10275    @deevazquez Lol. She's offering me a case of g...
10294    @nippysweety Thanks very much for the #ff @nip...
39646    Finally going to bed after staying up reading ...
29538    Sitting outside in the cold by the ocean with ...
25053                           What can twitter do for me
35566                          Watchin TV. HAPPY MOMS DAY.
15232                                              @bfly13
34337    off to bed I go..have a great Mom's Day all yo...
19791    down side = i have a seat by myself  loner. pl...
4666     just woke up, its laura's last full day here  ...
5492     Awh, its the last day of the tour  I'll miss h...
21062     http://tinyurl.com/d7tb38 Nice Photoshop Effects
                               ...                        
28742                    @SweetWifey Her new cd is lovely.
19427    @RealRobBrydon Hi Rob, will you be coming back...
13668    saddest celeb story of the week  http://bit.ly...
32255               @kelsie_marie1 hey beautiful whats up?
32401                                @bunnyrenee thank you
1612     i mean monday take the test monday!!!! please ...
30659    @ShebaBaby should be a rebroadcast of last wee...
13178    Lula is not feeling well.  http://tinyurl.com/...
4228     @SuFiGirl35 well that's no way to start the da...
23909    @Mollieandme Hello, you are sooo awesome!! Don...
14829                       got beat by super LARRYonation
20101          @Petty01 Happy Birthday and cheers to PAO!!
34183     just received a personal email from Perez Hilton
13794    my baby man is 4 years old a week today  he's ...
32866    @JonathanRKnight Going so soon? I was hoping t...
9724     Poor baby's got his first booboo  that he caus...
26686    @mrsgizara  we are at FOrt Belvoir, on base.  ...
1935     Good Morning, it's beautiful here today. Shame...
2172             My truss failing  http://yfrog.com/154upj
39383    Basil will be the highlight of my day. I've ju...
23072    @sparklethots love that birdy nest! though i a...
32134    just met Sarah Kelly... wow... she is an amazi...
598      http://twitpic.com/663vr - Wanted to visit the...
12264             i want to see my bud mel  miss hur loads
39967    @McMedia Very well thank you! How are you, mor...
37402    @lilbumbles Happy Mother's Day to the only mot...
27773                    Morning workout sesh.  Love Life.
39305            happy mothers day to all moms out there!!
25014    will be practicing my smile today-it's gunna b...
32247         Happy Mother's Day to all you moms out there
Name: content, Length: 369, dtype: object
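
Another way to inspect errors, sketched below rather than taken from the original run, is to put each misclassified tweet next to its true and predicted labels in a small DataFrame:

# collect misclassified tweets with their true and predicted labels
errors = pd.DataFrame({'content': X_test.values,
                       'true': y_test.values,
                       'predicted': y_pred_class})
errors[errors.true != errors.predicted].head(10)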

Logistic Regression

# import and instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
# train the model using X_train_dtm
%time logreg.fit(X_train_dtm, y_train)
CPU times: user 1.97 s, sys: 1.23 s, total: 3.2 s
Wall time: 1.74 s

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Make predictions:

# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)
#calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)
0.6568213783403657

Decision Tree

from sklearn import tree
dtree = tree.DecisionTreeClassifier()
%timeit dtree.fit(X_train_dtm, y_train)
1 loop, best of 3: 8.7 s per loop
# make class predictions for X_test_dtm
y_pred_class = dtree.predict(X_test_dtm)
#calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)
0.5734880450070323

Random Forest


from sklearn.ensemble import RandomForestClassifier
rndForest = RandomForestClassifier()
%timeit rndForest.fit(X_train_dtm, y_train)
1 loop, best of 3: 16.4 s per loop

Make predictions:

# make class predictions for X_test_dtm
y_pred_class = rndForest.predict(X_test_dtm)
# calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)
0.5970464135021097

Extremely Randomized Trees (ExtraTreesClassifier)

from sklearn.ensemble import ExtraTreesClassifier
ext = ExtraTreesClassifier()
%timeit ext.fit(X_train_dtm, y_train)
1 loop, best of 3: 8.78 s per loop
# make class predictions for X_test_dtm
y_pred_class = ext.predict(X_test_dtm)
# calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)
0.6047819971870605
Support Vector Machine

from sklearn import svm

Linear Support Vector Machine

linear_svc = svm.SVC(kernel='linear',random_state=1,C=.5)
linear_svc.fit(X_train_dtm, y_train)
SVC(C=0.5, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=1, shrinking=True,
  tol=0.001, verbose=False)
# make class predictions for X_test_dtm
y_pred_class = linear_svc.predict(X_test_dtm)
# calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)
0.6417018284106891

Non-Linear Support Vector Machine

rbf_svc = svm.SVC(kernel='sigmoid')
%timeit rbf_svc.fit(X_train_dtm, y_train)
1 loop, best of 3: 57.8 s per loop
# make class predictions for X_test_dtm
y_pred_class = rbf_svc.predict(X_test_dtm)
# calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)
0.35654008438818563

Neural Network

from sklearn.neural_network import MLPClassifier
neu = MLPClassifier(solver='lbfgs', alpha=1e-5,hidden_layer_sizes=(5, 2), random_state=1)
neu.fit(X_train_dtm, y_train)
MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(5, 2), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
       solver='lbfgs', tol=0.0001, validation_fraction=0.1, verbose=False,
       warm_start=False)
# make class predictions for X_test_dtm
y_pred_class = neu.predict(X_test_dtm)
# calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)
0.5678621659634318

Imbalanced Dataset

The dataset is slightly imbalanced, which means the models may need a workaround for optimal results. Let's look at the class counts:

tweets.sentiment.value_counts()
2    5209
3    5165
1    3842
Name: sentiment, dtype: int64
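
One simple workaround, sketched below and not part of the original run, is to weight the classes inversely to their frequencies; scikit-learn exposes this through the class_weight='balanced' option:

# logistic regression with class weights inversely proportional to class frequencies
logreg_bal = LogisticRegression(class_weight='balanced')
logreg_bal.fit(X_train_dtm, y_train)
y_pred_bal = logreg_bal.predict(X_test_dtm)
metrics.accuracy_score(y_test, y_pred_bal)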

Summary

We evaluated different machine learning algorithms. Logistic Regression has the highest accuracy (about 66%), followed closely by the Linear Support Vector Machine and Naive Bayes (both around 64%). It would be interesting to investigate why the accuracy is low; many factors could contribute to this, including the class imbalance noted above.
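
A single train/test split is also a fairly noisy estimate; cross-validation would give a more reliable comparison. A rough sketch (assuming the raw X and y defined earlier are still available) of how the two best models could be compared with 5-fold cross-validation:

# compare the two best models with 5-fold cross-validation on the raw text
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

for name, clf in [('MultinomialNB', MultinomialNB()),
                  ('LogisticRegression', LogisticRegression())]:
    pipe = make_pipeline(CountVectorizer(), clf)   # the vectorizer is refit inside each fold
    scores = cross_val_score(pipe, X, y, cv=5)
    print(name, scores.mean())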


Mustapha Omotosho


Constant learner, machine learning enthusiast, huge Barcelona fan
