Create Twitter Wordcloud with Sentiments

You can also run this tutorial through the ThinkToStartR package:

ThinkToStart("SentimentCloud", "KEYWORD", NUMBER_OF_TWEETS, "DATUMBOX API KEY")

Hey everybody,

A few days ago I created a wordcloud from tweets about a recent German news topic. A lot of people asked me whether I could share the code I used to create it, so here it is.

In the end, the plot will look roughly like this:

[Image: Twitter Wordcloud R]

It uses tweets and the Datumbox Twitter Sentiment Analysis API.

Preparation:

First, let's get a Datumbox API key, as I described in this tutorial:

In the first step you need an API key. So go to the Datumbox website http://www.datumbox.com/ and register yourself. After you have logged in you can see your free API key here: http://www.datumbox.com/apikeys/view/
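If you don't want to paste the key into every script, here is a minimal sketch of one way to keep it outside the code. The environment variable name DATUMBOX_API_KEY and the helper read_api_key() are my own assumptions, not part of Datumbox or any package. The result is stored in db_key, the variable name the loop further down expects.

```r
# Hypothetical helper: read the Datumbox key from an environment variable.
# The variable name DATUMBOX_API_KEY is an assumption made for this sketch.
read_api_key <- function(var = "DATUMBOX_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set")
  }
  key
}

# Demo only: set a fake key for this session, then read it back.
# In real use you would set the variable outside the script, e.g. in ~/.Renviron.
Sys.setenv(DATUMBOX_API_KEY = "xxxxxxxxAPIxxxxxxxxxx")
db_key <- read_api_key()
```

Typing db_key <- "your key" directly works just as well; the helper only keeps the key out of code you might share.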

As always when we work with Twitter, we have to go through the authentication process, as I described here.

We also need some packages for this tutorial. They are all available on CRAN:


library(twitteR)
library(RCurl)
library(RJSONIO)
library(stringr)
library(tm)
library(wordcloud)

The last preparation step is to define two helper functions. The first cleans the text we pass to it by removing unwanted characters; the second sends the text to the Datumbox API.


clean.text <- function(some_txt)
{
  # remove retweet markers ("RT @user", "via @user") and any remaining @mentions
  some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)
  some_txt = gsub("@\\w+", "", some_txt)
  # remove punctuation and digits; what is left of a link starts with "http"
  some_txt = gsub("[[:punct:]]", "", some_txt)
  some_txt = gsub("[[:digit:]]", "", some_txt)
  some_txt = gsub("http\\w+", "", some_txt)
  # remove the "amp" left over from "&amp;" (whole word only, so "example" survives)
  some_txt = gsub("\\bamp\\b", "", some_txt)
  # collapse runs of spaces/tabs to one space so neighbouring words stay separate
  some_txt = gsub("[ \t]{2,}", " ", some_txt)
  some_txt = gsub("^\\s+|\\s+$", "", some_txt)

  # define "tolower error handling" function: returns NA if tolower() fails
  try.tolower = function(x)
  {
    tryCatch(tolower(x), error=function(e) NA)
  }

  some_txt = sapply(some_txt, try.tolower)
  # drop texts that are empty after cleaning
  some_txt = some_txt[some_txt != ""]
  names(some_txt) = NULL
  return(some_txt)
}

getSentiment <- function (text, key){

  text <- URLencode(text)

  # keep the spaces, drop the other percent-encoded characters that break the API,
  # then encode the spaces again
  text <- str_replace_all(text, "%20", " ")
  text <- str_replace_all(text, "%\\d\\d", "")
  text <- str_replace_all(text, " ", "%20")

  # truncate long texts to 360 characters
  if (str_length(text) > 360){
    text <- substr(text, 1, 360)
  }

  data <- getURL(paste("http://api.datumbox.com/1.0/TwitterSentimentAnalysis.json?api_key=", key, "&text=", text, sep=""))

  # call RJSONIO explicitly: other JSON packages also export a fromJSON()
  # that does not know the asText argument
  js <- RJSONIO::fromJSON(data, asText=TRUE)

  # get the sentiment label; use "" if the API returned no result
  sentiment = js$output$result
  if (is.null(sentiment)) sentiment = ""

  return(list(sentiment=sentiment))
}
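If you are wondering what getSentiment() actually sends to Datumbox, here is a small sketch that only builds the request URL and never calls the API. The helper build_request_url() and the key "YOUR_API_KEY" are made up for illustration, it uses base R instead of stringr, and it leaves out the truncation of long texts.

```r
# Hypothetical helper: builds the request URL for a text without sending it.
build_request_url <- function(text, key) {
  text <- utils::URLencode(text)
  text <- gsub("%20", " ", text, fixed = TRUE)  # protect the spaces
  text <- gsub("%\\d\\d", "", text)              # drop other encoded characters
  text <- gsub(" ", "%20", text, fixed = TRUE)  # encode the spaces again
  paste0("http://api.datumbox.com/1.0/TwitterSentimentAnalysis.json?api_key=",
         key, "&text=", text)
}

url <- build_request_url("love the new iphone", "YOUR_API_KEY")
print(url)
```

Printing the URL like this can help when a request fails, because you can see exactly which text reached the API.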

Let's start!

First we have to get some tweets:


tweets = searchTwitter("iPhone", 20, lang="en")

Then we extract the text from these tweets and remove all unwanted characters:


# get text
tweet_txt = sapply(tweets, function(x) x$getText())

# clean text
tweet_clean = clean.text(tweet_txt)
tweet_num = length(tweet_clean)

Now we create a data frame to hold all our data: the tweet text and the results of the sentiment analysis.


tweet_df = data.frame(text=tweet_clean, sentiment=rep("", tweet_num),stringsAsFactors=FALSE)

In the next step we apply the sentiment analysis function getSentiment() to every tweet text and save the result in our data frame. Then we delete all rows that don't have a sentiment label. This sometimes happens when unwanted characters survive our cleaning procedure.


# apply function getSentiment
# db_key must hold your Datumbox API key: db_key <- "your key"
for (i in seq_len(tweet_num))
{
  tmp = getSentiment(tweet_clean[i], db_key)

  tweet_df$sentiment[i] = tmp$sentiment

  # progress output
  print(paste(i, "of", tweet_num))
}

# delete rows with no sentiment
tweet_df <- tweet_df[tweet_df$sentiment!="",]
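A single failed request stops the whole loop, which is annoying with a few hundred tweets. Here is a minimal sketch of a more forgiving version. To keep it runnable without an API key it uses fake_sentiment(), a made-up stand-in for getSentiment(), and classify_all() is a hypothetical helper name of mine.

```r
# Hypothetical stand-in for getSentiment(): fails on one input to mimic an API error.
fake_sentiment <- function(text, key) {
  if (grepl("broken", text)) stop("API request failed")
  list(sentiment = if (grepl("love", text)) "positive" else "neutral")
}

# Classify every text; a failed or empty result becomes "" instead of stopping the run.
classify_all <- function(texts, key, classifier) {
  vapply(texts, function(txt) {
    res <- tryCatch(classifier(txt, key), error = function(e) NULL)
    if (is.null(res) || is.null(res$sentiment)) "" else res$sentiment
  }, character(1), USE.NAMES = FALSE)
}

texts <- c("love the camera", "broken request", "it is a phone")
sentiments <- classify_all(texts, "YOUR_API_KEY", fake_sentiment)

tweet_df <- data.frame(text = texts, sentiment = sentiments, stringsAsFactors = FALSE)
# delete rows with no sentiment, as in the tutorial
tweet_df <- tweet_df[tweet_df$sentiment != "", ]
print(tweet_df)
```

With the real function you would pass getSentiment as the classifier and db_key as the key.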

Now that we have our data we can start building the wordcloud.

The Wordcloud

First we get the distinct sentiment labels the API returned. With the Datumbox API these are positive, neutral and negative. We use them to divide the tweet texts into categories.


#separate text by sentiment
sents = levels(factor(tweet_df$sentiment))

The next line of code looks a little complicated, but all you need to know is that it generates a label for each sentiment category that includes the category's percentage share of the tweets.


# get the labels and percentages
labels <- sapply(sents, function(x) {
  share = sum(tweet_df$sentiment == x) / nrow(tweet_df) * 100
  paste(x, format(round(share, 2), nsmall=2), "%")
}, USE.NAMES=FALSE)
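To see what these labels look like, here is the same calculation on a tiny made-up data frame. The four tweets and their sentiments are invented for this example.

```r
# Toy data: four invented tweets with their sentiment labels.
tweet_df <- data.frame(
  text = c("love it", "hate it", "great phone", "it is a phone"),
  sentiment = c("positive", "negative", "positive", "neutral"),
  stringsAsFactors = FALSE
)

sents <- levels(factor(tweet_df$sentiment))

# One label per sentiment: name, share of all tweets in percent, percent sign.
labels <- sapply(sents, function(x) {
  share <- mean(tweet_df$sentiment == x) * 100
  paste(x, format(round(share, 2), nsmall = 2), "%")
}, USE.NAMES = FALSE)

print(labels)
# "negative 25.00 %" "neutral 25.00 %" "positive 50.00 %"
```

These strings later become the column names of the matrix, which is why they show up as the titles of the cloud.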

Then we create the so-called docs, one per category, and add the tweet texts to them:


nemo = length(sents)
emo.docs = rep("", nemo)
for (i in seq_len(nemo))
{
  tmp = tweet_df[tweet_df$sentiment == sents[i],]$text

  emo.docs[i] = paste(tmp, collapse=" ")
}
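If you prefer to avoid the loop, base R's tapply() can do the same grouping in one call. This is just an alternative sketch on made-up data, not the code the tutorial uses.

```r
# Toy data: four invented tweets with their sentiment labels.
tweet_df <- data.frame(
  text = c("love it", "hate it", "great phone", "it is a phone"),
  sentiment = c("positive", "negative", "positive", "neutral"),
  stringsAsFactors = FALSE
)

# Paste all texts of one sentiment into a single string per sentiment.
emo.docs <- tapply(tweet_df$text, tweet_df$sentiment, paste, collapse = " ")

print(emo.docs)
```

The result is named by sentiment and sorted the same way as levels(factor(...)), so it lines up with the labels.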

The next steps are the same ones you would use for a "normal" wordcloud. We just create a term-document matrix and call the function comparison.cloud() from the "wordcloud" package:


# remove stopwords
emo.docs = removeWords(emo.docs, stopwords("german"))
emo.docs = removeWords(emo.docs, stopwords("english"))
corpus = Corpus(VectorSource(emo.docs))
tdm = TermDocumentMatrix(corpus)
tdm = as.matrix(tdm)
colnames(tdm) = labels

# comparison word cloud
comparison.cloud(tdm, colors = brewer.pal(nemo, "Dark2"),
 scale = c(3,.5), random.order = FALSE, title.size = 1.5)
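In case the term-document matrix is new to you: it simply counts how often each word appears in each doc. Here is a hand-rolled sketch in base R on two invented docs. It only illustrates the idea and is not how tm builds the matrix internally.

```r
# Two invented docs, one per sentiment.
docs <- c(negative = "battery dies fast",
          positive = "love the camera love it")

# Split each doc into words and collect the vocabulary.
tokens <- strsplit(docs, " +")
terms <- sort(unique(unlist(tokens)))

# Count every term in every doc: rows are terms, columns are docs.
tdm <- vapply(tokens, function(tok) {
  as.integer(table(factor(tok, levels = terms)))
}, integer(length(terms)))
rownames(tdm) <- terms
colnames(tdm) <- names(docs)

print(tdm)
```

comparison.cloud() works on a matrix of this shape and plots each word under the column where it is most over-represented.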

Of course you can find the whole code on GitHub.

And if you want to stay up to date on my work and on R, analytics and machine learning, feel free to follow me on Twitter.


16 thoughts on “Create Twitter Wordcloud with Sentiments”

  1. Sorry, my first comment didn’t post. Thanks for posting this tutorial.

    If I was interested in modifying the code to only build the cloud out of tweets from a specific area, like a state or country, any ideas how I would do that?

    Thanks!

    • Hey datahappy,
      yes you can do so. You have to modify the searchTwitter function call like this:
      searchTwitter("iphone", since='2011-03-01', until='2011-03-02')

      Regards

    • Oh, sorry, that was the wrong code.
      The code has to look like this:
      searchTwitter("iphone", geocode='42.375,-71.1061111,10mi')

      For the geocode argument, the values are given in the format latitude,longitude,radius, where
      the radius can have either mi (miles) or km (kilometers) as a unit. For example: geocode='37.781157,-122.39720,1mi'

      Regards

  2. Thank you so much for this great resource! I’m very new to R and TwitteR and I’m having some trouble with entering my API key – how and where exactly do I enter it, and what is the ‘text’ that needs to be entered afterwards? It would be great if someone could provide an example with a fake key. Thank you in advance!

    • Hey Sarah,
      Sorry for the late answer.
      Actually you have to set your datumbox API key as db_key.
      So type in db_key <- "your key"
      The part at the top is just a function. It needs the API key and a text which should be analyzed. But this function is called in the for loop so you don't have to worry about that.
      Did this help you?
      Please feel free to ask further questions.
      Regards

      • Hi Julian

        appreciate your help – bear with me while i fix that

        so i have done
        db_key <- "xxxxxxxxAPIxxxxxxxxxx"
        then
        data <- getURL(paste("http://api.datumbox.com/1.0/TwitterSentimentAnalysis.json?api_key=", db_key, "&text=",text, sep=""))

        but still getting the following:

        tweets = searchTwitter("iPhone", 20, lang="en")
        Error in twInterfaceObj$doAPICall(cmd, params, "GET", …) :
        OAuth authentication is required with Twitter's API v1.1

        any idea? sorry for being such a pain

      • Don’t worry;)
        Before you can use the code you have to do the Twitter authentication. You can find that tutorial on my blog.
        After the authentication you can execute the code here in this tutorial.
        So the steps are:
        – Twitter authentication
        – define db_key
        – execute this code here in the tutorial

        Regards

  3. Pingback: ThinkToStartR package | julianhi's Blog

  4. oh dear sorry for not reading carefully your post

    on the Twitter authentication i am running the code but cant find the PIN number i am supposed to paste back into R, any idea why?

    cheers and thanks again for your help!

  5. Almost there!!!!

    almost been through the whole script, now stuck here:

    # apply function getSentiment
    sentiment = rep(0, tweet_num)
    for (i in 1:tweet_num)
    {
    tmp = getSentiment(tweet_clean[i], db_key)
    tweet_df$sentiment[i] = tmp$sentiment
    print(paste(i," of ", tweet_num))
    }

    when running i got the following error: Error in fromJSON(data, asText = TRUE) : unused argument (asText = TRUE)

    thanks

  6. Thank you very much. Could you help me in Emotion and Polarity function of Sentiment Analysis. I have R 3.1.1 and I guess sentiment analysis does not work with this version. Please help which version I need to install for Emotion and Polarity function to work on.
