Create a wordcloud with your Twitter Data

First, follow the steps described in my tutorial about Sentiment Analysis with Twitter, but stop before the section “The Analyzing”.

At this point we have our tweets:

tweets = searchTwitter("sustainable development", n=200, cainfo="cacert.pem")

We now have to extract the text from our tweets so that we can analyze it. We do this with:

library(plyr)   # laply() comes from the plyr package
tweets.text = laply(tweets, function(t) t$getText())
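
To see what this line does without a Twitter connection, here is a minimal sketch. The toy_tweets list and its contents are made up, and base R's sapply() stands in for plyr's laply(); real twitteR status objects carry far more than these stand-ins.

```r
# Made-up stand-ins for tweet objects: each one only offers a getText() function
toy_tweets <- list(
  list(getText = function() "First example tweet"),
  list(getText = function() "Second example tweet")
)

# Call getText() on every element and collect the results in a character vector
toy_text <- sapply(toy_tweets, function(tw) tw$getText())
print(toy_text)
```

The result is a plain character vector with one entry per tweet, which is the shape the cleaning step expects.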

Sometimes this text contains invalid characters that can make the later processing steps fail, so we have to remove them.
We can use a function from the Viralheat site to do so:

clean.text <- function(some_txt)
{
  # remove HTML entity remnants, retweet markers and @mentions
  some_txt = gsub("&amp", "", some_txt)
  some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)
  some_txt = gsub("@\\w+", "", some_txt)
  # remove punctuation and digits; links lose their punctuation here,
  # so the next pattern catches what is left of them (e.g. "httptco...")
  some_txt = gsub("[[:punct:]]", "", some_txt)
  some_txt = gsub("[[:digit:]]", "", some_txt)
  some_txt = gsub("http\\w+", "", some_txt)
  # collapse runs of spaces/tabs to a single space so that words stay separated
  some_txt = gsub("[ \t]{2,}", " ", some_txt)
  # trim leading and trailing whitespace
  some_txt = gsub("^\\s+|\\s+$", "", some_txt)

  # define "tolower error handling" function: returns NA if tolower() fails
  try.tolower = function(x)
  {
    tryCatch(tolower(x), error = function(e) NA)
  }

  some_txt = sapply(some_txt, try.tolower)
  # drop empty strings and the names added by sapply()
  some_txt = some_txt[some_txt != ""]
  names(some_txt) = NULL
  return(some_txt)
}

You just have to copy-paste this code into R and hit Enter. After that you can use the function to clean the text we extracted from the tweets:

clean_text = clean.text(tweets.text)
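
To get a feeling for what the cleaning does, here is a minimal sketch that runs some of the same gsub() steps by hand on a single made-up tweet. The tweet text and the variable names are invented for illustration, and the sketch collapses repeated whitespace to a single space so that neighbouring words stay separated.

```r
# Made-up sample tweet, for illustration only
tweet <- "@someuser Sustainable development &amp; growth in 2013! http://t.co/abc123"

s <- gsub("&amp", "", tweet)          # drop the HTML entity remnant
s <- gsub("@\\w+", "", s)             # drop @mentions
s <- gsub("[[:punct:]]", "", s)       # drop punctuation (the link becomes "httptcoabc123")
s <- gsub("[[:digit:]]", "", s)       # drop digits
s <- gsub("http\\w+", "", s)          # drop what is left of the link
s <- gsub("[ \t]{2,}", " ", s)        # collapse repeated spaces/tabs to one space
s <- gsub("^\\s+|\\s+$", "", s)       # trim leading and trailing whitespace

cleaned <- tolower(s)
print(cleaned)
```

Note the order: the link is only matched by "http\\w+" because the punctuation inside it has already been removed one step earlier.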

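We add this clean text to a so-called Corpus, the main structure in the tm package for storing collections of text documents. To build the Corpus from our character vector we wrap it in VectorSource(). The tm package has to be installed and loaded for this. This looks like this:

library(tm)   # provides Corpus(), VectorSource() and TermDocumentMatrix()
tweet_corpus = Corpus(VectorSource(clean_text))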

To go on we have to transform this Corpus into a so-called term-document matrix. This matrix holds, for every term, how often it occurs in each document of the collection (here: in each tweet).

tdm = TermDocumentMatrix(tweet_corpus, control = list(removePunctuation = TRUE,stopwords = c("machine", "learning", stopwords("english")), removeNumbers = TRUE, tolower = TRUE))
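
If the idea of a term-document matrix is new to you, the following sketch builds a tiny one by hand in base R. It does not use tm and skips stopword removal; the three documents and all variable names are made up for illustration.

```r
# Toy documents, made up for illustration
docs <- c("green energy policy", "energy policy reform", "green growth")

# Split each document into its words
tokens <- strsplit(docs, " ", fixed = TRUE)

# All distinct terms, sorted alphabetically
terms <- sort(unique(unlist(tokens)))

# One column per document: count how often each term occurs in it
toy_tdm <- sapply(tokens, function(words) {
  tabulate(match(words, terms), nbins = length(terms))
})
rownames(toy_tdm) <- terms
colnames(toy_tdm) <- paste0("doc", seq_along(docs))

print(toy_tdm)
```

Rows are terms, columns are documents, and each cell is a count. TermDocumentMatrix() produces the same kind of table for the tweets, stored in a sparse format.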

OK, now we have our tdm. All we have to do now is sort our words by frequency and put them in the wordcloud.
But first we have to install and load the wordcloud package:

install.packages(c("wordcloud","tm"),repos="http://cran.r-project.org")

library(wordcloud)
require(plyr)

m = as.matrix(tdm)   # convert the tdm to a plain matrix

word_freqs = sort(rowSums(m), decreasing=TRUE)   # total count per word, sorted in decreasing order

dm = data.frame(word=names(word_freqs), freq=word_freqs)   # build the data set for plotting

wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))   # and visualize the data
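
The frequency table behind the cloud can be tried out on a small hand-made matrix. This is only a sketch: m_toy and its counts are made up, and the steps mirror the rowSums() / sort() / data.frame() calls from the code above.

```r
# Toy term-document matrix, made up for illustration (rows = terms, columns = documents)
m_toy <- matrix(c(2, 0, 1,
                  1, 1, 0,
                  0, 0, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("energy", "policy", "growth"),
                                c("doc1", "doc2", "doc3")))

# Total count per word across all documents, largest first
toy_freqs <- sort(rowSums(m_toy), decreasing = TRUE)

# One row per word, ready to be handed to a plotting function
toy_dm <- data.frame(word = names(toy_freqs), freq = toy_freqs)
print(toy_dm)
```

The words with the highest totals end up at the top of the table and are drawn largest in the cloud.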

Ok here we have our wordcloud. If you want to save it to your computer you can do it with:

png("Cloud.png", width=12, height=8, units="in", res=300)

wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))

dev.off()
[Image: Cloud]
Now you can find the file Cloud.png in your R working directory. Enjoy your own clouds!
Info: In the cloud picture you can see that the word “amp” shows up often. It is not a real word but a leftover of the HTML entity “&amp;”, so it should not be in the cloud. Make sure your clean.text() function removes it, as the gsub("&amp", "", some_txt) line does.
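
If an unwanted word still slips through, another option (different from editing clean.text()) is to filter the frequency table before plotting. A minimal sketch with made-up counts; freq_table is a hypothetical stand-in for the dm data frame:

```r
# Made-up frequency table standing in for dm
freq_table <- data.frame(word = c("sustainable", "amp", "development"),
                         freq = c(10, 7, 5),
                         stringsAsFactors = FALSE)

# Words that should never appear in the cloud
unwanted <- c("amp")

# Keep only the rows whose word is not in the unwanted list
freq_clean <- freq_table[!(freq_table$word %in% unwanted), ]
print(freq_clean)
```

The filtered table can then be passed to wordcloud() in the same way as dm.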
