Cluster your Twitter Data with R and k-means

[Image: example rCharts plot of Twitter data]

Hello everybody!

Today I want to show you how you can get deeper insights into your Twitter followers with the help of R. Because I just completed the course “Machine Learning” by Prof. Andrew Ng on Coursera, I will use the k-means algorithm and cluster my Twitter followers by how many followers they have and how many people they follow.

Setup

Before we can get the Twitter data we have to authenticate with Twitter like I described here (a minimal sketch of that call follows the rCharts setup below). But before we go on we make sure that rCharts is installed:


require(rCharts)

If it is not installed you have to install it with:

require(devtools)
install_github('rCharts','ramnathv')
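For completeness, here is a minimal sketch of the authentication step, assuming a recent version of the twitteR package that provides setup_twitter_oauth(); the four keys are placeholders for the credentials of your own Twitter app, and the exact call may differ depending on your package version:

library(twitteR)

# placeholder credentials from your own Twitter app
consumer_key    <- "XXX"
consumer_secret <- "XXX"
access_token    <- "XXX"
access_secret   <- "XXX"

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)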

Get the Twitter Data

We get our data in three steps. First we get the user object, then we fetch the followers and friends of this user, and finally we merge everything into a dataframe containing all these users and their information.

user <- getUser("JulianHi") #Set the username

userFriends <- user$getFriends()
userFollowers <- user$getFollowers()
userNeighbors <- union(userFollowers, userFriends) #merge followers and friends

userNeighbors.df = twListToDF(userNeighbors) #create the dataframe
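Before transforming anything it is worth taking a quick look at the data; the column names below are the ones twListToDF in the twitteR package produces for user objects:

# quick sanity check of the merged data
nrow(userNeighbors.df)  # how many accounts we are going to cluster
head(userNeighbors.df[, c("screenName", "followersCount", "friendsCount")])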
 

If you plotted this data now, it wouldn't give you a lot of insight, as most of the data points would be squeezed into the lower left corner. It would look like this:

 

[Image: scatter plot of the raw (non-log) data]

So we apply a log transformation and use the log of all the values for our analysis.

To do so we first need to replace all 0 values in our dataframe with 1, because log(0) would result in -Inf values inside our dataframe.


userNeighbors.df[userNeighbors.df=="0"]<-1

Now we can apply the log transformation. For this we add the columns logFollowersCount and logFriendsCount to our dataframe, which contain the log() values of the original followersCount and friendsCount columns we received from Twitter.


userNeighbors.df$logFollowersCount <-log(userNeighbors.df$followersCount)

userNeighbors.df$logFriendsCount <-log(userNeighbors.df$friendsCount)

k-means

Now that we have our data we can start clustering it. Before we can use the k-means algorithm we have to decide how many clusters we want to end up with. For this we can use the so-called elbow method: it runs the k-means algorithm with different numbers of clusters and plots the within-cluster sum of squares for each. Based on this plot we can decide how many clusters to choose.

First we extract the relevant columns out of our dataframe and create a new one for our algorithm.


kObject.log <- data.frame(userNeighbors.df$logFriendsCount,userNeighbors.df$logFollowersCount)

Then we can create the elbow chart, and you will see why it is actually called that:

mydata <- kObject.log

# total within-cluster sum of squares for k = 1 (all points in one cluster)
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))

# ... and for k = 2 to 15 clusters
for (i in 2:15) wss[i] <- sum(kmeans(mydata,centers=i)$withinss)

plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

[Image: elbow chart]

The best number of clusters is now at the “elbow” of the graph. In this case that would be something around 4, so we will try to find 4 clusters in our data.


#Run the K Means algorithm, specifying 4 centers
user2Means.log <- kmeans(kObject.log, centers=4, iter.max=10, nstart=100)

#Add the vector of specified clusters back to the original vector as a factor

userNeighbors.df$cluster <- factor(user2Means.log$cluster)
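Before plotting, a quick optional check (not part of the original post) shows how the accounts are distributed over the clusters and where the cluster centers sit on the log scale:

# how many accounts ended up in each cluster
table(userNeighbors.df$cluster)

# cluster centers on the log scale
user2Means.log$centers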

Plot the data

Ok now that we have our data we can plot it with the rCharts library.

This has the advantage that we can easily create an interactive graph which gives us additional information like the actual number of followers and friends, as we can't read those directly off the axes after the log transformation.


p2 <- nPlot(logFollowersCount ~ logFriendsCount, group = 'cluster', data = userNeighbors.df, type = 'scatterChart')

p2$xAxis(axisLabel = 'Friends Count')

p2$yAxis(axisLabel = 'Followers Count')

p2$chart(tooltipContent = "#! function(key, x, y, e){
 return e.point.screenName + ' Followers: ' + e.point.followersCount +' Friends: ' + e.point.friendsCount
} !#")

p2

You can find an interactive example of such a plot here.
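If you want to host such an interactive chart yourself, rCharts can also write it to an HTML file; if I remember the API correctly this is done with the save() method, so check the arguments available in your rCharts version:

# write the interactive chart to an HTML file (options depend on your rCharts version)
p2$save('twitter_clusters.html')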

[Image: interactive rCharts plot of the clustered accounts]

You can find the code on my GitHub, and if you have any questions feel free to follow me on Twitter or write a comment 🙂

 

 

This post was inspired by http://rstudio-pubs-static.s3.amazonaws.com/5983_af66eca6775f4528a72b8e243a6ecf2d.html

Create Twitter Wordcloud with Sentiments

You can use this tutorial in the ThinkToStartR package with:

ThinkToStart("SentimentCloud", "KEYWORD", # of tweets, "DATUMBOX API KEY")

Hey everybody,

some days ago I created a wordcloud filled with tweets about a recent German news topic. A lot of people asked me for the code I used to create this cloud, so here it is.

In the end the plot will basically look like this:

[Image: Twitter wordcloud created with R]

It uses tweets and the Datumbox Twitter sentiment API.

Preparation:

First let's get a Datumbox API key like I described in this tutorial:

In the first step you need an API key. So go to the Datumbox website http://www.datumbox.com/ and register yourself. After you have logged in you can see your free API key here: http://www.datumbox.com/apikeys/view/
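I will store the key in a variable right away, because the analysis loop later in this post refers to it as db_key:

db_key <- "YOUR_DATUMBOX_API_KEY"  # replace with your personal key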

Like always when we want to work with Twitter we have to go through the authentication process like I described here.

Then we need some packages for this tutorial, but they are all available on CRAN:


library(twitteR)
library(RCurl)
library(RJSONIO)
library(stringr)
library(tm)
library(wordcloud)

The last preparation step is defining two functions which will help us a lot. The first one cleans the text we pass to it and removes unwanted characters, and the second one sends the text to the Datumbox API.


clean.text <- function(some_txt)
{
some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)
some_txt = gsub("@\\w+", "", some_txt)
some_txt = gsub("[[:punct:]]", "", some_txt)
some_txt = gsub("[[:digit:]]", "", some_txt)
some_txt = gsub("http\\w+", "", some_txt)
some_txt = gsub("[ \t]{2,}", "", some_txt)
some_txt = gsub("^\\s+|\\s+$", "", some_txt)
some_txt = gsub("amp", "", some_txt)
# define "tolower error handling" function
try.tolower = function(x)
{
y = NA
try_error = tryCatch(tolower(x), error=function(e) e)
if (!inherits(try_error, "error"))
y = tolower(x)
return(y)
}

some_txt = sapply(some_txt, try.tolower)
some_txt = some_txt[some_txt != ""]
names(some_txt) = NULL
return(some_txt)
}

getSentiment <- function (text, key){

text <- URLencode(text);

#save all the spaces, then get rid of the weird characters that break the API, then convert back the URL-encoded spaces.
text <- str_replace_all(text, "%20", " ");
text <- str_replace_all(text, "%\\d\\d", "");
text <- str_replace_all(text, " ", "%20");

if (str_length(text) > 360){
text <- substr(text, 0, 359);
}
##########################################

data <- getURL(paste("http://api.datumbox.com/1.0/TwitterSentimentAnalysis.json?api_key=", key, "&text=",text, sep=""))

js <- fromJSON(data, asText=TRUE);

# get mood probability
sentiment = js$output$result

###################################
return(list(sentiment=sentiment))
}
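Before running the full loop you can test the key and the function on a single made-up sentence (the example text is mine, and the expected output is only an illustration):

getSentiment("I really love my new phone", db_key)
# should return something like: list(sentiment = "positive")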

Let´s start!

First we have to get some tweets:


tweets = searchTwitter("iPhone", 20, lang="en")

Then we get the text from these tweets and remove all the unwanted chars:


# get text
tweet_txt = sapply(tweets, function(x) x$getText())

# clean text
tweet_clean = clean.text(tweet_txt)
tweet_num = length(tweet_clean)

Now we create a dataframe where we can save all our data, like the tweet text and the results of the sentiment analysis.


tweet_df = data.frame(text=tweet_clean, sentiment=rep("", tweet_num),stringsAsFactors=FALSE)

In the next step we apply the sentiment analysis function getSentiment() to every tweet text and save the result in our dataframe. Then we delete all rows which don't have a sentiment score; missing scores sometimes happen when unwanted characters survive our cleaning procedure and break the API call.


# apply function getSentiment
sentiment = rep(0, tweet_num)
for (i in 1:tweet_num)
{
  tmp = getSentiment(tweet_clean[i], db_key)

  tweet_df$sentiment[i] = tmp$sentiment

  print(paste(i," of ", tweet_num))
}

# delete rows with no sentiment
tweet_df <- tweet_df[tweet_df$sentiment!="",]
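As an optional sanity check you can look at how the remaining tweets are distributed over the sentiment classes before building the cloud:

# number of tweets per sentiment class
table(tweet_df$sentiment)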

Now that we have our data we can start building the wordcloud.

The Wordcloud

First we get the different sentiment classes the API returned. If you used the Datumbox API these will be positive, neutral and negative. We use them to divide the tweet texts into categories.


#separate text by sentiment
sents = levels(factor(tweet_df$sentiment))

The next line of code looks a little complicated, but it is enough to know that it generates a label for each sentiment category which includes that category's percentage of all tweets.


# get the labels and percents

labels <- lapply(sents, function(x) paste(x,format(round((length((tweet_df[tweet_df$sentiment ==x,])$text)/length(tweet_df$sentiment)*100),2),nsmall=2),"%"))
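If you prefer a more readable version, the same labels can be built step by step; this is an equivalent sketch, not the code from the original post:

labels <- sapply(sents, function(s) {
  share <- sum(tweet_df$sentiment == s) / nrow(tweet_df) * 100  # percentage of tweets in this class
  paste(s, format(round(share, 2), nsmall = 2), "%")
})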

Then we create a so-called document for each category and paste all tweet texts of that category into it:


nemo = length(sents)
emo.docs = rep("", nemo)
for (i in 1:nemo)
{
 tmp = tweet_df[tweet_df$sentiment == sents[i],]$text

 emo.docs[i] = paste(tmp,collapse=" ")
}

The next steps are the same you would use for a “normal” wordcloud. We just create a term-document matrix and call the function comparison.cloud() from the “wordcloud” package:


# remove stopwords
emo.docs = removeWords(emo.docs, stopwords("german"))
emo.docs = removeWords(emo.docs, stopwords("english"))
corpus = Corpus(VectorSource(emo.docs))
tdm = TermDocumentMatrix(corpus)
tdm = as.matrix(tdm)
colnames(tdm) = labels

# comparison word cloud
comparison.cloud(tdm, colors = brewer.pal(nemo, "Dark2"),
 scale = c(3,.5), random.order = FALSE, title.size = 1.5)

Of course you can find the whole code on GitHub.

And if you want to stay up to date about my work and the topics R, analytics and machine learning, feel free to follow me on Twitter.

What does Twitter think about Hoeneß?

 

 

 

[Image: wordcloud of tweets about Hoeneß]

 

 

 

This wordcloud was created from 1,000 tweets and a sentiment analysis with the Datumbox API.

Even if not every tweet is classified correctly, it still gives a good overall picture, both of the sentiment distribution and of the topics. It is also interesting that many international tweets were written and that this topic resonates strongly abroad as well.

Build your own Twitter Archive and Analyzing Infrastructure with MongoDB, Java and R [Part 2] [Update]

Hello everybody,

in my first tutorial I described how you can build your own MongoDB and use a Java program to mine Twitter, either via the search function and a loop or via the Streaming API. But so far you just have your tweets stored in a database and we couldn't get any insight out of them yet.

So in this tutorial we will take a look at how to connect to the MongoDB with R and analyze our tweets.

[Image: Twitter analysis infrastructure with MongoDB]

Start the MongoDB

To access the MongoDB I use the REST interface. This is the easiest way to access the database with R when you have just started with it. If you are a more advanced user, you can also use the rmongodb package and the code provided by the user abhishek; you can find that code at the end of this post.

[Image: the MongoDB daemon (mongod)]

So we have to start the MongoDB daemon. It is located in the folder “bin” and has the name “mongod”. So navigate to this folder and type in:

./mongod --rest

This way we start the server and enable the access via the REST interface.

R

Let´s take a look at our R code and connect to the Database.

First we need the two packages RCurl and rjson. So type in:

library(RCurl)
library(rjson)

The MongoDB REST interface normally runs on port 28017 (the default server port 27017 plus 1000). So make sure that no firewall or other program is blocking it.

So we have to define the path to the data base with:

database = "tweetDB"
collection = "Apple"
limit = "100"
db <- paste("http://localhost:28017/",database,"/",collection,"/?limit=",limit,sep = "")

tweetDB – name of your database

Apple – name of your collection

limit=100 – number of tweets you want to get

Ok now we can get our tweets with

tweets <- fromJSON(getURL(db))

And with that you have saved the tweets you received. You can now analyze them like I explained in other tutorials about working with R and Twitter.
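Before looping over the result it can help to peek at its structure; the REST interface wraps the documents in a rows field, and the field names inside each document are the ones the Java crawler from part 1 stored:

length(tweets$rows)          # number of documents returned
names(tweets$rows[[1]])      # fields of a single stored tweet
tweets$rows[[1]]$tweet_text  # text of the first stored tweet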

You can for example extract the text of your tweets and store it in a dataframe with:

limit <- as.numeric(limit)  # limit was a string for building the URL, convert it back to a number
tweet_df = data.frame(text=1:limit)
for (i in 1:limit){
  tweet_df$text[i] = tweets$rows[[i]]$tweet_text
}
tweet_df

If you have any questions feel free to ask or follow me on Twitter to get the newest updates about analytics with R and analytics of Social Data.

                                                                                                             

# install the package to connect to MongoDB
install.packages("rmongodb")
library(rmongodb)
# connect to MongoDB
mongo = mongo.create(host = "localhost")
mongo.is.connected(mongo)

mongo.get.databases(mongo)

mongo.get.database.collections(mongo, db = "tweetDB2") # "tweetDB2" is where the Twitter data is stored

library(plyr)
## create the empty data frame
df1 = data.frame(stringsAsFactors = FALSE)

## create the namespace
DBNS = "tweetDB2.#analytic"

## create the cursor we will iterate over, basically a select * in SQL
cursor = mongo.find(mongo, DBNS)

## create the counter
i = 1

## iterate over the cursor
while (mongo.cursor.next(cursor)) {
# iterate and grab the next record
tmp = mongo.bson.to.list(mongo.cursor.value(cursor))
# make it a dataframe
tmp.df = as.data.frame(t(unlist(tmp)), stringsAsFactors = F)
# bind to the master dataframe
df1 = rbind.fill(df1, tmp.df)
}

dim(df1)

Build your own Twitter Archive and Analyzing Infrastructure with MongoDB, Java and R [Part 1] [Update]

UPDATE: The Java program is now also available with the Streaming API. You can find it on my GitHub account.

Hey everybody,
you surely know the problems that appear when you want to work with the Twitter API. Twitter created a lot of different restrictions that take the fun out of the data mining process.
Another problem is that Twitter provides no way to analyze your data at a later time. You can't just start a Twitter search that gives you all the tweets ever written about your topic, and you can't get all tweets related to a specific event if there are a lot of them. So I always dreamed of my own archive filled with Twitter data. And then I saw MongoDB. The Mongo database, whose name comes from “humongous”, is an open-source document database and the leading NoSQL database.
This document-oriented structure makes it very easy to use, especially for our purpose, because everything revolves around the JSON format. And that is of course the format we get directly from Twitter. So we don't need to process our tweets; we can just save them into our database.

Structure

So let´s take a closer look at our structure

[Image: architecture diagram of the Twitter mining setup]

As you can see we need several steps. First we need to get the Twitter data and store it in the database, and then we need to find a way to get this data into R and start analyzing.

In this first tutorial I will show you how to set up this first part. We set up a MongoDB locally on your computer and write a Java crawler that gets the data directly from the Twitter API and stores it.

MongoDB

Installing MongoDB is as easy as using it. You just have to go to the MongoDB website and select the right precompiled files for your operating system.
http://www.mongodb.org/downloads
After downloading, unpack the folder. And that's it.
Then go to the folder and take a look into the “bin” subfolder. Here you can see the different executables. For our purposes we need the mongod and the mongo files.
mongod is the Mongo daemon. It is basically the server, and we need to start it every time we want to work with the database.
But we can make it even easier:
Just download the IntelliJ IDEA Java IDE http://www.jetbrains.com/idea/download/index.html
This cool and lightweight IDE has a nice third-party MongoDB plugin available which will help you a lot when working with the database. Of course there are plugins available for Eclipse or NetBeans, but I haven't tried them yet. Maybe you did?

After you have installed the IDE, download the MongoDB plugin and install it as well: http://plugins.jetbrains.com/plugin/7141
Then you can find the Mongo Explorer on the right side of your workspace.

[Image: the Mongo Explorer panel in the IDE]

Go to the settings of this plugin.
There we have to add the path to the Mongo executable. Then you have to add a server connection by clicking on the + at the end of your server list. Just leave all the settings as they are and click ok.

[Image: plugin settings with the path to the Mongo executable]

[Image: adding a server connection]

Now we have established the connection to our server. If you can't connect, try restarting the IDE or start the mongod script manually.
That was basically the MongoDB part. Let's take a look at our Java crawler.

Java

I will go through the code step by step. You can find the complete code on my github.
But before we start we have to download some additional JAR libraries that help us work with MongoDB on the one hand and the Twitter API on the other.
MongoDB Java driver: http://central.maven.org/maven2/org/mongodb/mongo-java-driver/2.11.3/mongo-java-driver-2.11.3.jar

Twitter4j package: http://twitter4j.org/archive/twitter4j-3.0.4.zip

Now add Twitter4j-core, Twitter4j-stream and the MongoDB driver to your project.

[Image: project libraries after adding the JARs]

Our Java program starts with a small menu that lets us enter the keyword we want to look for. The program then runs in a loop, searching Twitter every few seconds for new tweets and saving them to our database. So it only saves tweets while it is running, but this is perfect, for example, for monitoring a certain event.


public void loadMenu() throws InterruptedException {

         System.out.print("Please choose your Keyword:\t");

         Scanner input = new Scanner(System.in);
         String keyword = input.nextLine();

         connectdb(keyword);

         int i = 0;

         while(i < 1)
         {
             cb = new ConfigurationBuilder();
             cb.setDebugEnabled(true);
             cb.setOAuthConsumerKey("XXX");
             cb.setOAuthConsumerSecret("XXX");
             cb.setOAuthAccessToken("XXX");
             cb.setOAuthAccessTokenSecret("XXX");

             getTweetByQuery(true,keyword);
             cb = null;

             Thread.sleep(60 * 1000);              // wait

         }

     }

After the program has received a keyword it connects to the database with connectdb(keyword):


public void connectdb(String keyword)
     {
         try {

             initMongoDB();
             items = db.getCollection(keyword);

             //make the tweet_ID unique in the database
             BasicDBObject index = new BasicDBObject("tweet_ID", 1);
             items.ensureIndex(index, new BasicDBObject("unique", true));

         } catch (MongoException ex) {
             System.out.println("MongoException :" + ex.getMessage());
         }

     }
   public void initMongoDB() throws MongoException {
         try {
             System.out.println("Connecting to Mongo DB..");
             Mongo mongo;
             mongo = new Mongo("127.0.0.1");
             db = mongo.getDB("tweetDB");
         } catch (UnknownHostException ex) {
             System.out.println("MongoDB Connection Error :" + ex.getMessage());
         }
     }

The initMongoDB function connects to our local MongoDB server and creates a database instance called “db”. And here something cool happens: you can type in whatever database name you want. If this database doesn't exist, MongoDB automatically creates it for you and you can work with it like nothing happened.
The same thing happens when we call db.getCollection(keyword):
It automatically creates a collection if it doesn't exist. So no error messages anymore 😉
MongoDB is structured like:

Database -> Collections -> Documents
You could compare a collection to a table in a SQL database and the documents to the rows in this table; in our case the tweets.

But there are also two very important lines of code:


//make the tweet_ID unique in the database
             BasicDBObject index = new BasicDBObject("tweet_ID", 1);
             items.ensureIndex(index, new BasicDBObject("unique", true));

Here we create a unique index in our database. A tweet is only saved if its tweet_ID isn't already in the database; otherwise we would end up with duplicate entries.

But let´s get some tweets!

If you take a look at our main function again, it now creates a ConfigurationBuilder and sets the credentials for our Twitter API access. This ConfigurationBuilder is needed for the TwitterFactory provided by the Twitter4j package, which we use in the next function: getTweetByQuery


  public void getTweetByQuery(boolean loadRecords, String keyword) throws InterruptedException {

         TwitterFactory tf = new TwitterFactory(cb.build());
         Twitter twitter = tf.getInstance();

         if (cb != null) {

             try {
                 Query query = new Query(keyword);
                 query.setCount(100);
                 QueryResult result;
                 result = twitter.search(query);
                 System.out.println("Getting Tweets...");
                 List<Status> tweets = result.getTweets();

                 for (Status tweet : tweets) {
                     BasicDBObject basicObj = new BasicDBObject();
                     basicObj.put("user_name", tweet.getUser().getScreenName());
                     basicObj.put("retweet_count", tweet.getRetweetCount());
                     basicObj.put("tweet_followers_count", tweet.getUser().getFollowersCount());
                     basicObj.put("source",tweet.getSource());
                     basicObj.put("coordinates",tweet.getGeoLocation());

                     UserMentionEntity[] mentioned = tweet.getUserMentionEntities();
                     basicObj.put("tweet_mentioned_count", mentioned.length);
                     basicObj.put("tweet_ID", tweet.getId());
                     basicObj.put("tweet_text", tweet.getText());

                     try {
                         items.insert(basicObj);
                     } catch (Exception e) {
                         System.out.println("MongoDB Connection Error : " + e.getMessage());
                         //loadMenu();
                     }
                 }

             } catch (TwitterException te) {
                 System.out.println("te.getErrorCode() " + te.getErrorCode());
                 System.out.println("te.getExceptionCode() " + te.getExceptionCode());
                 System.out.println("te.getStatusCode() " + te.getStatusCode());
                 if (te.getStatusCode() == 401) {
                     System.out.println("Twitter Error : \nAuthentication credentials (https://dev.twitter.com/pages/auth) were missing or incorrect.\nEnsure that you have set valid consumer key/secret, access token/secret, and the system clock is in sync.");
                 } else {
                     System.out.println("Twitter Error : " + te.getMessage());
                 }

             }
         } else {
             System.out.println("MongoDB is not Connected! Please check mongoDB intance running..");
         }
     }

After creating our connection to Twitter we get some tweets and save their content to our database. But we select what we want to save, as we don't need all the information delivered by Twitter.
We loop over the tweets list, put the fields we want into a BasicDBObject with the help of Twitter4j and finally insert this object into our database if it doesn't already exist.

Then the program sleeps a few seconds and starts the whole loop again.

Settings and Usage

If you want to monitor an event you have two settings you can adjust to your needs: the time the loop waits and the number of tweets the search returns each time.
So if you monitor an event which will create a huge amount of tweets, you can increase the number of tweets returned and decrease the time the loop waits. But be careful, because if the program connects too often, Twitter will deny access.

So start monitoring some events or just random keywords by running the program and typing in your keyword. The program will automatically create a collection for you where the tweets are stored.

This was just the first step towards building our own Twitter archive and analysis infrastructure. In the next part I will talk about how to connect to your Twitter database with R and start analyzing your saved tweets.
I hope you enjoyed this first part, and please feel free to ask questions.
If you want to stay up to date about my blog please give me a like on Facebook, a +1 on Google+ or follow me on Twitter.

Part 2

Sentiment Analysis on Twitter with Datumbox API

Hey there!

[Image: Datumbox]

After my post about sentiment analysis using the Viralheat API I found another service. Datumbox offers a dedicated sentiment analysis for Twitter. But this API doesn't just offer sentiment analysis, it offers much more detailed analyses: “The currently supported API functions are: Sentiment Analysis, Twitter Sentiment Analysis, Subjectivity Analysis, Topic Classification, Spam Detection, Adult Content Detection, Readability Assessment, Language Detection, Commercial Detection, Educational Detection, Gender Detection, Keyword Extraction, Text Extraction and Document Similarity.”

But note:
Datumbox only offers sentiment analysis specifically tuned for tweets. All the other classifiers, like gender or topic, are built for longer texts rather than for short tweets, which have too few characters. So their results for tweets can be inaccurate.

But these are very interesting features and so I wanted to test them with R.

But before we start you should take a look at the authentication tutorial and go through the steps.

The API Key

In the first step you need an API key. So go to the Datumbox website http://www.datumbox.com/ and register yourself. After you have logged in you can see your free API key here: http://www.datumbox.com/apikeys/view/

[Image: Datumbox API keys page]

Ok, let´s go on with R.

Functions

The getSentiment() function

First import the needed packages for our analysis:


# load packages
 library(twitteR)
 library(RCurl)
 library(RJSONIO)
 library(stringr)

 



getSentiment <- function (text, key){



text <- URLencode(text);

#save all the spaces, then get rid of the weird characters that break the API, then convert back the URL-encoded spaces.
text <- str_replace_all(text, "%20", " ");
text <- str_replace_all(text, "%\\d\\d", "");
text <- str_replace_all(text, " ", "%20");


if (str_length(text) > 360){
text <- substr(text, 0, 359);
}
##########################################

data <- getURL(paste("http://api.datumbox.com/1.0/TwitterSentimentAnalysis.json?api_key=", key, "&text=",text, sep=""))

js <- fromJSON(data, asText=TRUE);

# get mood probability
sentiment = js$output$result

###################################

data <- getURL(paste("http://api.datumbox.com/1.0/SubjectivityAnalysis.json?api_key=", key, "&text=",text, sep=""))

js <- fromJSON(data, asText=TRUE);

# get subjectivity result
subject = js$output$result

##################################

data <- getURL(paste("http://api.datumbox.com/1.0/TopicClassification.json?api_key=", key, "&text=",text, sep=""))

js <- fromJSON(data, asText=TRUE);

# get topic classification result
topic = js$output$result

##################################
data <- getURL(paste("http://api.datumbox.com/1.0/GenderDetection.json?api_key=", key, "&text=",text, sep=""))

js <- fromJSON(data, asText=TRUE);

# get gender detection result
gender = js$output$result

return(list(sentiment=sentiment,subject=subject,topic=topic,gender=gender))
}




The getSentiment() function handles the queries we send to the API. It collects all the results we want, like sentiment, subjectivity, topic and gender, and returns them as a list. Every request has the same structure: the API always expects the API key and the text to be analyzed. It then returns a JSON object of the form:


{
  "output": {
    "status": 1,
    "result": "positive"
  }
}

So what we want to have is the “result”. We extract it with js$output$result where js is the saved JSON response.

The clean.text() function

We need this function because of the problems that occur when tweets contain certain characters, and to remove things like “@” mentions and “RT” markers.



clean.text <- function(some_txt)
{
some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)
some_txt = gsub("@\\w+", "", some_txt)
some_txt = gsub("[[:punct:]]", "", some_txt)
some_txt = gsub("[[:digit:]]", "", some_txt)
some_txt = gsub("http\\w+", "", some_txt)
some_txt = gsub("[ \t]{2,}", "", some_txt)
some_txt = gsub("^\\s+|\\s+$", "", some_txt)

# define "tolower error handling" function
try.tolower = function(x)
{
y = NA
try_error = tryCatch(tolower(x), error=function(e) e)
if (!inherits(try_error, "error"))
y = tolower(x)
return(y)
}

some_txt = sapply(some_txt, try.tolower)
some_txt = some_txt[some_txt != ""]
names(some_txt) = NULL
return(some_txt)
}

Let´s start

Ok now we have our functions, all packages and the API key.

In the first step we need the tweets. We do this with the searchTwitter() function as usual.



# harvest tweets
tweets = searchTwitter("iPhone", n=200, lang="en")


In my example I used the keyword “iPhone”. Of course you can use whatever you want.

In the next steps we have to extract the text from the tweets and clean it with the clean.text() function. We just call these functions with:


# get text
 tweet_txt = sapply(tweets, function(x) x$getText())

# clean text
 tweet_clean = clean.text(tweet_txt)

Then we count our tweets, and based on this we build a data frame which we will fill with the results of our analysis:




# how many tweets
tweet_num = length(tweet_clean)

# data frame (text, sentiment, subject, topic, gender)
tweet_df = data.frame(text=tweet_clean, sentiment=rep("", tweet_num),
subject=1:tweet_num, topic=1:tweet_num, gender=1:tweet_num, stringsAsFactors=FALSE)


Do the analysis

We come to our final step: the analysis. We call getSentiment() with the text of every tweet, wait for the answer and save it to our data frame. This can take some time. Just replace API_KEY with your Datumbox API key.



# apply function getSentiment
sentiment = rep(0, tweet_num)
for (i in 1:tweet_num)
{
tmp = getSentiment(tweet_clean[i], "API_KEY")

 tweet_df$sentiment[i] = tmp$sentiment

 tweet_df$subject[i] = tmp$subject
 tweet_df$topic[i] = tmp$topic
 tweet_df$gender[i] = tmp$gender
}



That's it! We saved all our results in the data frame and can take a look at our analysis.

text | sentiment | subject | topic | gender
shit your phone man wtf all ur memories and its a freaking iphone is it in the schl or with ur teacher | negative | subjective | Arts | male
fuck iphone i want the s then o | negative | subjective | Home & Domestic Life | female
stay home saturday night vscocam iphone picarts bored saturday stay postive reoverlay | negative | objective | Sports | female
why i love the mornings sunrise pic iphone now lets get crossfit wod goingcompass fitness | positive | subjective | Home & Domestic Life | female
iphone or stick with my bbhelp | positive | subjective | Home & Domestic Life | female

You can just display your data frame in R with:


tweet_df

Or you can save it to a CSV File with:


write.table(tweet_df, file="Analysis.csv", sep=",", row.names=FALSE)

Sentiment Analysis on Twitter with Viralheat API

Hi there!

Some time ago I published a post about doing a sentiment analysis on Twitter. I used two wordlists to do so: one with positive and one with negative words. For a first try at sentiment analysis this is surely a good way to start, but if you want more accurate sentiments you should use an external API. And that's what we do in this tutorial. But before we start you should take a look at the authentication tutorial and go through the steps.

The Viralheat API

The Viralheat sentiment API receives more than 300M calls per week, and this huge amount of calls makes the API better and better. Every time a user of the API notices that a tweet was analyzed incorrectly, let's say it was a positive tweet but the API said it was neutral, the user can correct it and the API can use this knowledge the next time.

Viralheat registration

You can use the Viralheat API with a free account. This account includes 1,000 calls per day, which should be enough to get started. Just go to the Viralheat Developer Center and register yourself: https://app.viralheat.com/developer

[Image: Viralheat Developer Center]

Then you can generate your free API key, which we'll need later.

Functions

The getSentiment() function

First import the needed packages for our analysis:

library(twitteR)
library(RCurl)
library(RJSONIO)
library(stringr)

The getSentiment() function handles the queries we send to the API, extracts the mood and the score from the JSON reply and returns them in a list.

getSentiment <- function (text, key){
library(RCurl);
library(RJSONIO);

text <- URLencode(text);

#save all the spaces, then get rid of the weird characters that break the API, then convert back the URL-encoded spaces.
text <- str_replace_all(text, "%20", " ");
text <- str_replace_all(text, "%\\d\\d", "");
text <- str_replace_all(text, " ", "%20");

if (str_length(text) > 360){
text <- substr(text, 0, 359);
}

data <- getURL(paste("https://www.viralheat.com/api/sentiment/review.json?api_key=", key, "&text=",text, sep=""))

js <- fromJSON(data, asText=TRUE);

# get mood probability
score = js$prob

# positive, negative or neutral?
if (js$mood != "positive")
{
if (js$mood == "negative") {
score = -1 * score
} else {
# neutral
score = 0
}
}

return(list(mood=js$mood, score=score))
}
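As a quick test you can send a single made-up sentence through the function before processing all tweets; replace API-KEY with your key, and note that the shown output is only an illustration:

getSentiment("This phone is amazing", "API-KEY")
# returns a list with the detected mood and a signed score, e.g. list(mood = "positive", score = 0.87)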

The clean.text() function

We need this function because of the problems that occur when tweets contain certain characters, and to remove things like “@” mentions and “RT” markers.

clean.text <- function(some_txt)
{
some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)
some_txt = gsub("@\\w+", "", some_txt)
some_txt = gsub("[[:punct:]]", "", some_txt)
some_txt = gsub("[[:digit:]]", "", some_txt)
some_txt = gsub("http\\w+", "", some_txt)
some_txt = gsub("[ \t]{2,}", "", some_txt)
some_txt = gsub("^\\s+|\\s+$", "", some_txt)

# define "tolower error handling" function
try.tolower = function(x)
{
y = NA
try_error = tryCatch(tolower(x), error=function(e) e)
if (!inherits(try_error, "error"))
y = tolower(x)
return(y)
}

some_txt = sapply(some_txt, try.tolower)
some_txt = some_txt[some_txt != ""]
names(some_txt) = NULL
return(some_txt)
}

Let´s start

Ok now we have our functions, all packages and the API key.

In the first step we need the tweets. We do this with the searchTwitter() function as usual.

# harvest tweets
tweets = searchTwitter("iphone5", n=200, lang="en")

In my example I used the keyword “iphone5”. Of course you can use whatever you want.

In the next steps we have to extract the text from the tweets and clean it with the clean.text() function. We just call these functions with:

tweet_txt = sapply(tweets, function(x) x$getText())
tweet_clean = clean.text(tweet_txt)
mcnum = length(tweet_clean)
tweet_df = data.frame(text=tweet_clean, sentiment=rep("", mcnum), score=1:mcnum, stringsAsFactors=FALSE)

Do the analysis

We come to our final step: the analysis. We call getSentiment() with the text of every tweet, wait for the answer and save it to our data frame. This can take some time. Just replace API-KEY with your Viralheat API key.

sentiment = rep(0, mcnum)
for (i in 1:mcnum)
{
tmp = getSentiment(tweet_clean[i], "API-KEY")
tweet_df$sentiment[i] = tmp$mood
tweet_df$score[i] = tmp$score
}

That's it! Now we have our analyzed tweets in the tweet_df data frame, and you can show your results with:

tweet_df

[Image: sentiment results data frame]

Note:

Sometimes the API breaks when it receives certain characters. I couldn't figure out why, but as soon as I know I will update this tutorial.

Please also note that sentiment analysis can only give you a rough overview of the mood.