Visualize Retweets with R

Hi there!

Today I want to show you how to plot a graph of the retweets of a Twitter account.

To start, you have to go through the Twitter authentication process, as I described in an earlier blog post you can find here.

Let's get started with the things we are really interested in.

Get the data

To start the data mining we need to load three packages.

library(twitteR)
library(igraph)
library(stringr)

To get the posts of a certain user, the twitteR package provides a cool function:

tweets = userTimeline("mashable", n=1000)

This gets us up to 1000 of the most recent tweets from the user account @mashable. They are saved in the variable tweets, and in the next step we can extract the text with

tweet_txt = sapply(tweets, function(x) x$getText())

But now we have to recognize the retweets in this huge amount of text.

# regular expressions to find retweets
# (search the extracted text, not the list of status objects)
grep("(RT|via)((?:\\b\\W*@\\w+)+)", tweet_txt,
ignore.case=TRUE, value=TRUE)

# which tweets are retweets
rt_patterns = grep("(RT|via)((?:\\b\\W*@\\w+)+)",
tweet_txt, ignore.case=TRUE)

# show retweets (these are the ones we want to focus on)
tweet_txt[rt_patterns] 
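
To see what the pattern actually matches, here is a minimal sketch that runs it on a few made-up tweet texts instead of live data. The vector sample_txt and its contents are invented for illustration, and perl=TRUE is added here only to make the regex flavour explicit; it is not part of the original call.

```r
# Hypothetical sample texts standing in for tweet_txt
sample_txt <- c(
  "RT @nasa: New images from the rover",
  "Great read on data visualization via @flowingdata",
  "Just a plain tweet with no mention",
  "Thanks @alice for the tip"
)

# Same pattern as in the post: "RT" or "via", followed by one or more @handles
rt_pattern <- "(RT|via)((?:\\b\\W*@\\w+)+)"

# Indices of the texts that look like retweets
rt_idx <- grep(rt_pattern, sample_txt, ignore.case = TRUE, perl = TRUE)

print(rt_idx)
print(sample_txt[rt_idx])
```

Only the first two texts match. A plain mention such as "Thanks @alice" is not counted, because the pattern requires "RT" or "via" directly in front of the handle.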

Now that we have our raw data we can go on analyzing it.

Analyzing the data

# we create lists to store user names
who_retweet = vector("list", length(rt_patterns))
who_post = vector("list", length(rt_patterns))

# loop over the retweets (seq_along is safe even if none were found)
for (i in seq_along(rt_patterns))
{
 # get tweet with retweet entity
 twit = tweets[[rt_patterns[i]]]
 # get retweet source
 poster = str_extract_all(twit$getText(),
 "(RT|via)((?:\\b\\W*@\\w+)+)")
 #remove ':'
 poster = gsub(":", "", unlist(poster))
 # name of retweeted user
 who_post[[i]] = gsub("(RT @|via @)", "", poster, ignore.case=TRUE)
 # name of retweeting user
 who_retweet[[i]] = rep(twit$getScreenName(), length(poster))
}

# and we flatten the lists into plain vectors
who_post = unlist(who_post)
who_retweet = unlist(who_retweet)
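
The extraction step inside the loop can be tried without a Twitter connection. The following is a rough sketch that uses base R's regmatches() and gregexpr() in place of str_extract_all(); the sample texts and the screen name "demo_account" are made up for illustration.

```r
# Hypothetical tweet texts, all assumed to come from one account
sample_txt <- c(
  "RT @nasa: New images from the rover",
  "Nice chart via @flowingdata",
  "RT @bob RT @carol: a retweet chain"
)
rt_pattern <- "(RT|via)((?:\\b\\W*@\\w+)+)"

# One character vector of matches per tweet, e.g. "RT @nasa"
matches <- regmatches(sample_txt, gregexpr(rt_pattern, sample_txt, perl = TRUE))

# Strip the "RT @" / "via @" prefix so only the user names remain
who_post <- unlist(lapply(matches, function(m) gsub("(RT @|via @)", "", m)))

# Every extracted name is paired with the (hypothetical) retweeting account
who_retweet <- rep("demo_account", length(who_post))

print(cbind(who_retweet, who_post))
```

The third text contributes two rows, one for each retweeted user, which is why the loop in the post repeats the screen name length(poster) times.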

Now we have created our so-called edge list: a list that shows the connections in our data, with one entry per pair of retweeting user and original poster. This is a very common way to describe a graph.
But our goal wasn't an edge list, it was a graph. So we have to turn our edge list into a nice graph.

# two column matrix of edges
retweeter_poster = cbind(who_retweet, who_post)

# generate graph
rt_graph = graph.edgelist(retweeter_poster)

# get vertex names
ver_labs = get.vertex.attribute(rt_graph, "name", index=V(rt_graph))
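
To get a feeling for what the graph object contains before plotting it, here is a small sketch on a toy edge list. The user names are invented, and the igraph calls are written with the newer function names (graph_from_edgelist() instead of graph.edgelist()), which is an assumption about the installed igraph version. The code is skipped if igraph is not available.

```r
# Toy edge list: column 1 = retweeting user, column 2 = original poster
edges <- cbind(
  c("demo_account", "demo_account", "demo_account"),
  c("nasa", "flowingdata", "nasa")
)

if (requireNamespace("igraph", quietly = TRUE)) {
  # Directed graph: each edge points from retweeter to original poster
  g <- igraph::graph_from_edgelist(edges, directed = TRUE)

  # Vertex names are taken from the entries of the edge list
  print(igraph::V(g)$name)

  # In-degree = how often each user was retweeted
  print(igraph::degree(g, mode = "in"))
}
```

The in-degree is a quick way to check the result: users that were retweeted more often have more incoming edges.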

Now there are just a few steps left to see our retweet graph.

# choose some layout
glay = layout.fruchterman.reingold(rt_graph)

# plot
par(bg="gray15", mar=c(1,1,1,1))
plot(rt_graph, layout=glay,
 vertex.color="gray25",
 vertex.size=10,
 vertex.label=ver_labs,
 vertex.label.family="sans",
 vertex.shape="none",
 vertex.label.color=hsv(h=0, s=0, v=.95, alpha=0.5),
 vertex.label.cex=0.85,
 edge.arrow.size=0.8,
 edge.arrow.width=0.5,
 edge.width=3,
 edge.color=hsv(h=.95, s=1, v=.7, alpha=0.5))
# add title
title("\nTweets from the User account @mashable: Who retweets whom",
 cex.main=1, col.main="gray95")

And here it is:
Our nice retweets graph!

[Figure: retweet graph for @mashable]

11 thoughts on “Visualize Retweets with R”

  1. Hi. Great article. I tried to repeat all your instructions, but in the loop that analyses the data, R shows:

    Error in check_string(string) : attempt to apply non-function

    Can you please give me some help?

  2. Hi julianhi…

    I encounter a problem while using the above function "userTimeline":

    tweets = userTimeline("tatadocomo", n=100)

    “SSL certificate problem, verify that the CA cert is OK. Details:\nerror:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed”
    Error in twInterfaceObj$doAPICall(cmd, params, method, …) :
    Error: SSL certificate problem, verify that the CA cert is OK. Details:
    error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

    Kindly help

    Regards
    Abhishek

  3. Hello. I am also hitting the SSL error, with the message as described above. I am using RStudio but have just replicated the issue in base R 3.0 (64bit).

    I suspect it is related to authentication, because including cainfo="cacert.pem" in the searchTwitter() command seems to make all the difference:
    > searchTwitter('cnn', cainfo="cacert.pem", n=100) # works fine
    > searchTwitter('cnn', n=100) # throws SSL error like userTimeline()

    Trouble is, I cannot see an equivalent parameter through which to pass this certificate to userTimeline()!

    Any advice much appreciated!

  4. ut <- userTimeline('test', 1500)
    tw.df <- twListToDF(ut)

    write.csv(tw.df, file="tweets.csv")

    for (i in 1:length(tw.df))
    {
    rt.df <- twListToDF(searchTwitter(tw.df[i,1], n=tw.df[i,12]))
    rt.df["originalTweet"] <- tw.df[i,1]
    user <- getUser(rt.df[i,11])
    user.df <- user$toDataFrame()

    if (i==1)
    {
    write.table(rt.df, file="retweetlog.csv", sep = ",", append=TRUE, col.names=TRUE, row.names=FALSE)
    write.table(user.df, file="userlog.csv", sep = ",", append=TRUE, col.names=TRUE, row.names=FALSE)
    }
    else
    {
    write.table(rt.df, file="retweetlog.csv", sep = ",", append=TRUE, col.names=FALSE, row.names=FALSE)
    write.table(user.df, file="userlog.csv", sep = ",", append=TRUE, col.names=FALSE, row.names=FALSE)
    }
    }

    While this code completes, I get warnings:

    Error in if (n <= 0) stop("n must be positive") :
    missing value where TRUE/FALSE needed
    In addition: Warning messages:
    1: In doRppAPICall("search/tweets", n, params = params, retryOnRateLimit = retryOnRateLimit, :
    19 tweets were requested but the API can only return 16
    2: In write.table(rt.df, file = "1_retweetlog-test2.csv", sep = ",", :
    appending column names to file
    3: In write.table(user.df, file = "1_userlog-test2.csv", sep = ",", :
    appending column names to file
    4: In doRppAPICall("search/tweets", n, params = params, retryOnRateLimit = retryOnRateLimit, :
    25 tweets were requested but the API can only return 24
    5: In doRppAPICall("search/tweets", n, params = params, retryOnRateLimit = retryOnRateLimit, :
    151 tweets were requested but the API can only return 135
    6: In doRppAPICall("search/tweets", n, params = params, retryOnRateLimit = retryOnRateLimit, :
    66 tweets were requested but the API can only return 62
    7: In doRppAPICall("search/tweets", n, params = params, retryOnRateLimit = retryOnRateLimit, :
    27 tweets were requested but the API can only return 23

    Are these associated with API limits or code errors? When running 1 tweet, seems fine, but trying max 1500 rarely returns any more than 150. When iterating through users, it is hard to understand how only 20 users are returned when I'm expecting 100s. Any help would be appreciated.

    • Hey Michael,
      Yes, your problems seem to be caused by the rate limits of the Twitter API. Twitter has a very complicated system that defines these limits, and they try to keep them secret. So the only thing you can do is test all your function calls and hope that they work.

      I hope I could help you.

      Regards

  5. Hi Julian,

    I tried to use the above code with minor changes. By using below search string, I am getting only 2 rows of data
    ———————————————————
    # regular expressions to find retweets
    grep("(RT|via)((?:\\b\\W*@\\w+)+)", dm_tweets,
    ignore.case=TRUE, value=TRUE)

    # which tweets are retweets
    rt_patterns = grep("(RT|via)((?:\\b\\W*@\\w+)+)",
    dm_txt, ignore.case=TRUE)

    # show retweets (these are the ones we want to focus on)
    dm_txt[rt_patterns]
    ————————————————————
    However, when I downloaded the contents into .csv file, there are many rows with retweetCount>0 (assuming then these are the retweets for a particular user account).

    I am not able to discern how this is possible. Could you please help?

    Also, I am not able to understand how we are searching for the retweets with the grep function. Can you please clarify how we are making use of the regular expression?

    Thanks

    • Hey Nikhil,
      in my example I extracted the raw text of the tweets and searched for patterns indicating a retweet. You could do the same and save whether or not the text indicates a retweet in a boolean vector. With this you could filter your Twitter list.
      I hope I could help you.

      Regards
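
The boolean-vector idea from the reply above could look roughly like the following sketch. The list tweet_list and its fields are hypothetical stand-ins for real status objects, and perl=TRUE is added only to make the regex flavour explicit.

```r
# Hypothetical stand-ins for a list of tweets
tweet_list <- list(
  list(id = 1, text = "RT @nasa: New images from the rover"),
  list(id = 2, text = "Plain status update"),
  list(id = 3, text = "Worth reading via @flowingdata")
)

# Extract the raw text of every tweet
tweet_txt <- vapply(tweet_list, function(x) x$text, character(1))

# TRUE where the text looks like a retweet, FALSE otherwise
is_retweet <- grepl("(RT|via)((?:\\b\\W*@\\w+)+)", tweet_txt,
                    ignore.case = TRUE, perl = TRUE)

# Keep only the tweets flagged as retweets
retweets_only <- tweet_list[is_retweet]

print(is_retweet)
print(length(retweets_only))
```

Unlike grep(), which returns the positions of the matches, grepl() returns one TRUE/FALSE value per tweet, so the result can be used directly to subset the list.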
