Text Data Mining with Twitter and R

Twitter is a favorite source of text data for analysis: it’s popular (there is a huge volume of tweets on every topic) and easily accessible through Twitter’s free, open APIs, which return easily consumable JSON and Atom formats.

Some people have used Twitter for sophisticated analysis such as predicting flu outbreaks and the stock market, but let’s start with something simpler and less ambitious: an introduction to text data mining using Twitter and R. We’ll download live data using the Twitter APIs, parse it, build a corpus, demonstrate some basic text processing, and plot a hierarchical agglomerative cluster, because everyone likes pictures. I query for a controversial topic, abortion, in hopes of visualizing the two sides of the debate.

There is a specialized R package called twitteR, but it isn’t available for Windows; fortunately, the generic XML package plus the Twitter search API documentation covers our needs.

### Read tweets from Twitter using ATOM (XML) format

# installation is required only once and is remembered across sessions
install.packages('XML')

# loading the package is required once each session
library(XML)

# initialize a storage variable for Twitter tweets
mydata.vectors <- character(0)

# paginate to get more tweets
for (page in c(1:15))
{
	# search parameter
	twitter_q <- URLencode('#prolife OR #prochoice')
	# construct a URL
	twitter_url <- paste('http://search.twitter.com/search.atom?q=', twitter_q, '&rpp=100&page=', page, sep='')
	# fetch remote URL and parse
	mydata.xml <- xmlParseDoc(twitter_url, asText=FALSE)
	# extract the titles
	mydata.vector <- xpathSApply(mydata.xml, '//s:entry/s:title', xmlValue, namespaces=c('s'='http://www.w3.org/2005/Atom'))
	# aggregate new tweets with previous tweets
	mydata.vectors <- c(mydata.vector, mydata.vectors)
}

# how many tweets did we get?
length(mydata.vectors)

Given the limits of the Twitter search API, you should now have a character vector of up to 1,500 elements, each representing one tweet.

### Use tm (text mining) package


# load the tm package (install once with install.packages('tm'))
library(tm)

# build a corpus
mydata.corpus <- Corpus(VectorSource(mydata.vectors))

# make each letter lowercase
mydata.corpus <- tm_map(mydata.corpus, tolower) # newer versions of tm require content_transformer(tolower)

# remove punctuation 
mydata.corpus <- tm_map(mydata.corpus, removePunctuation)

# remove generic and custom stopwords
my_stopwords <- c(stopwords('english'), 'prolife', 'prochoice')
mydata.corpus <- tm_map(mydata.corpus, removeWords, my_stopwords)

# build a term-document matrix
mydata.dtm <- TermDocumentMatrix(mydata.corpus)

# inspect a slice of the term-document matrix
inspect(mydata.dtm[1:10, 1:10])

# inspect most popular words
findFreqTerms(mydata.dtm, lowfreq=30)

The most popular terms include abortion, dont, funding, gop, parenthood, planned, prochoice, prolife, tcot, and women. Though not explicitly stated here, the words “planned” and “parenthood” form a collocation (words which occur together more often than by chance). Let’s see which words are associated with a given term.
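
Collocations can be spotted without any extra packages. Here is a minimal base-R sketch with made-up tweets; a serious check would also compare the observed pair count against the words’ independent frequencies and avoid pairing words across tweet boundaries.

```r
# count adjacent word pairs (bigrams) and find the most frequent one;
# the toy tweets below are invented for illustration
tweets <- c('defund planned parenthood now',
            'planned parenthood funding vote',
            'planned parenthood in the news')
words <- unlist(strsplit(tweets, ' '))
bigrams <- paste(head(words, -1), tail(words, -1))
sort(table(bigrams), decreasing = TRUE)[1]
# "planned parenthood" is the top bigram, occurring 3 times
```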

> findAssocs(mydata.dtm, 'fetus', 0.20) 
         fetus          child            247     believeing    brainwashed          cared 
          1.00           0.31           0.30           0.30           0.30           0.30 

The number under each word is an association score, so the search term always scores a perfect 1.00 with itself. The next most-associated term is “child,” and so on. In some applications, a stemmer or spell checker could help with the misspelled word “believeing.”
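
As a minimal illustration of the spell-checking idea, base R’s agrep() performs approximate (fuzzy) matching, which can map a misspelling onto a reference vocabulary; the vocabulary below is just the terms from the findAssocs output above.

```r
# 'believeing' is within edit distance 1 of 'believing' (one extra 'e'),
# so approximate matching recovers the intended word
vocab <- c('believing', 'brainwashed', 'cared', 'child', 'fetus')
agrep('believeing', vocab, max.distance = 1, value = TRUE)
# returns "believing"
```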

To make a hierarchical agglomerative cluster plot, we need to reduce the number of terms (which otherwise wouldn’t fit on a page or the screen) and build a data frame.

# remove sparse terms to simplify the cluster plot
# Note: tweak the sparse parameter to determine the number of words.
# About 10-30 words is good.
mydata.dtm2 <- removeSparseTerms(mydata.dtm, sparse=0.95)

# convert the sparse term-document matrix to a standard data frame
mydata.df <- as.data.frame(as.matrix(mydata.dtm2))

# inspect dimensions of the data frame
dim(mydata.df)

Now the data frame (a standard data structure in R) contains a bag of words (specifically, 1-grams), which are simple frequency counts. Though the word order is lost, the bag of words retains much information and is simple to use. The data frame is ready for cluster analysis using a cluster analysis function available in R core. The following code is basically copied from Robert I. Kabacoff’s “Cluster Analysis” page.
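
To make the bag-of-words shape concrete, here is a toy version (two invented documents) of the terms-by-documents frame the conversion above produces:

```r
# toy bag of words: rows are terms, columns are documents, cells are counts
docs <- c('planned parenthood funding', 'funding vote today')
words <- strsplit(docs, ' ')
terms <- sort(unique(unlist(words)))
counts <- sapply(words, function(w) table(factor(w, levels = terms)))
as.data.frame(counts)
```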

mydata.df.scale <- scale(mydata.df)
d <- dist(mydata.df.scale, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward") # use method="ward.D" in newer versions of R
plot(fit) # display dendrogram

groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")

Adjust the number of clusters (here five) to best fit the data, and now we have the plot:

The terms higher in the plot are more popular, and terms close to each other are more associated. For example, today there is a fear the US government will shut down, so the terms “budget,” “funding,” and “shutdown” appear together, but these are not associated with the term “woman.” The term “periodpiece” is a Twitter account (remember, punctuation was removed, including the @ which designates accounts), and the cluster with “periodpiece” and “life” is a semantic argument (example).
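
To read the cluster memberships directly rather than from the dendrogram, split the cutree assignments by group. A toy matrix stands in for mydata.df.scale here so the snippet runs on its own:

```r
# list which terms fall into each cluster (toy data for illustration)
m <- rbind(budget  = c(1, 1, 2),
           funding = c(1, 2, 1),
           woman   = c(8, 9, 8))
hc <- hclust(dist(m), method = 'ward.D')  # 'ward' in older versions of R
grp <- cutree(hc, k = 2)
split(names(grp), grp)
# budget and funding fall in one cluster; woman in the other
```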

Some possible next steps include:

  • K-means clustering
  • Remove hyperlinks from tweets
  • Basic word association plots (built in to the tm package but requires Rgraphviz which can be tricky to install)
  • Word association fans
  • Sentiment analysis: which hashtag has more positive mood?
  • Classification: to which side of the debate does a new tweet (without a hashtag) belong?
  • Find a happier topic
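
The first next step, K-means clustering, is nearly a one-liner on the same scaled data frame; kmeans() ships with base R’s stats package. A toy matrix stands in for scale(mydata.df) below:

```r
# k-means on term frequencies (toy data standing in for the real data frame)
set.seed(42)  # kmeans starts from random centers
m <- rbind(budget  = c(1, 1, 2),
           funding = c(1, 2, 1),
           woman   = c(8, 9, 8),
           life    = c(9, 8, 9))
km <- kmeans(m, centers = 2, nstart = 10)
split(rownames(m), km$cluster)
# budget/funding form one cluster, woman/life the other
```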

61 thoughts on “Text Data Mining with Twitter and R”

  1. Hi Andrew,
    Maybe you would like to try my recently released tm.plugin.webcorpus package from R-Forge.
    The following code should work for your example:
    install.packages("tm.plugin.webcorpus", repos="http://R-Forge.R-project.org")
    #you may also need to install tm, slam, RCurl, XML and Defaults
    c <- getCorpus('#prolife OR #prochoice', src = "twitter", n = 1500)

    • Mario: Nice! I see support for some APIs from Yahoo, Bing, Google, NYTimes, and Twitter. Any thoughts about adding support for easily pulling Facebook status updates from public (non-profile) pages? I’ve done this in Python from Facebook’s JSON API without an API key. Also I plan to look at your package tm.plugin.sentiment

      • The packages are still in alpha status – hope you like them ;-)
        Your facebook idea sounds great – I’ll take a look at it soon.

      • I am trying to pull Facebook status updates from public (non-profile) pages but don’t know exactly how to do it. Can you please send me your code or some help material regarding this?
        my id is (nadeem_rao21@yahoo.com).

    • Hi Mario, I’m trying to install the package that you mentioned above, but I can’t do it. The package is no longer available. Could you or someone else tell me where I can find it, please? I’m talking about the package tm.plugin.webcorpus. My e-mail is jrgsua83@gmail.com


  8. # Remove hyperlinks ('http\\S+' matches the URL through the next whitespace)
    for (j in 1:length(tweet.corpus)) tweet.corpus[[j]] <- gsub('http\\S+', '', tweet.corpus[[j]])

  9. Hi, Andrew
    For me, this is very useful post. It could be even more useful if I knew how to pass UTF-8 encoded words, like “Čačak” or “Београд”, to URLencode function. Do you have any suggestions?

  10. Great Article, Thanks.
    I tried to mine “gaddafi”, but most of the results were gibberish, probably because a great number of the tweets on Gaddafi are in Arabic. So, the question is, how is it possible to mine them? Thanks.

  11. Hello,

    I am looking for some docs on the usage of tm.plugin.sentiment, but can’t seem to find anything. Any ideas or examples on how to use it?


  12. Please, which R version can I use for tm.plugin.sentiment?
    Can someone be kind enough to tell me? I need a reply urgently.


  13. Does anybody have suggestions on how to load text files to *create* the vectors? I’m not sure on what format I need to start this process. My initial data is in Excel (one comment per row).


  15. x <- readLines("file.txt")
    x1<- as.vector(unlist(strsplit(x,split="\n")) )


    # build a corpus
    mydata.corpus<- Corpus(VectorSource(x1))

  16. Hey there,
    Just a couple of small comments, since you put K-means as the first item in your possible next steps. It seems to me it is not difficult to get a small amount of labelled tweets (but surely not enough to use a supervised algorithm), so you may wish to take a look into semi-supervised clustering.

    Another point is that I’m not entirely sure how you got to the number 30 (I guess it was just a visual guesstimation), you may wish to take a look into feature weighting for K-Means – perhaps (a bit biased here as this is in fact my paper… published in Pattern Recognition):

    Best of luck

  17. I’m new to R; I tried your script but I got this error:
    > twitter_url = paste(‘http://search.twitter.com/search.atom?q=’,twitter_q,’&rpp=100&page=’, page, sep=”)
    Errore in paste(“http://search.twitter.com/search.atom?q=”, twitter_q, :
    cannot coerce type ‘closure’ to vector of type ‘character’

    Any help?
    Thank you,

    • Check that the quotation marks didn’t get translated wrong when copying and pasting (retyping the quotation marks is a way to check), and check that the variable twitter_q contains the right data.

  19. Hi – This webpage has been very helpful. Can you tell me how I can get data for a particular range of dates using this code? That will help me with the project I’m trying to do right now. Thanks!

  20. Hello everyone,

    This article is sent from Programmer Heaven to us ignorants of the Ways of the Code. I am hoping that there could be more where that came from…

    I am a graduate at a Belgian university doing research in digital marketing, and I have bitten off a lot more than I can chew by getting involved in a very challenging project: gathering location data from Twitter. Let me explain.

    Due to circumstances that are now irreversible, in the following weeks I absolutely have to learn how to gather continuous Twitter location checkin data corresponding to two full months (preferably December 2012 and January 2013) that respect the following parameters:

    - the checkins (which could all come into Twitter via Foursquare or Gowalla, for instance) are limited to an area of 50 km from the center of Brussels
    - only the checkins from the supermarkets ‘delhaize’, ‘carrefour’ and ‘colruyt’ and the fast-food restaurants ‘mcdonalds’, ‘quick’ and ‘pizza hut’ are needed (for this I need to learn geotagging :) )
    - I need the timestamp for each tweet so that I can then use it to create graphs showing daily, weekly and monthly peaks of activity in the various locations

    My research supervisor (a marketing professor with limited code knowledge) has given me the link to this article and asked me to learn by myself how to use it as a springboard for my data collection. At this point I am a bit confused about the following:

    1. How do I adapt/build a code that helps me retrieve the data with the parameters I put above? Would anyone be willing to guide me?
    2. How should I go about collecting the data timewise ? Can it be automated somehow or, if not, how often do I have to run the code? For instance, can I retrieve twitter data from last week? last month? How far back can I go? Is there a limit?
    3. Where can I store the data I am collecting? Is there one file that R creates where I can have all my data?

    I realize that I am asking for a lot from you guys, but I am in a state of shock :) and any guidance would go a long way. I have had some C++ programming experience since it was my high-school major, but I have since followed a career in the humanities, so it’s a bit hard to get back into gear.

    Thanks in advance for any help I get :)

    Ciprian B

    • To get a friends/followers list, you need to use a different Twitter API. For an account with under 5000 friends, it is relatively easy to get in one API call (but you may need OAuth authentication). While building a Twitter-to-SQL archiver I’ve found that using a well-established library makes working with the Twitter APIs much easier, so consider twitteR for R (which I haven’t tried) or Mike Verdone’s Twitter API for Python. Even though it is in Python and I have to export the data to R, overall it’s easier to work with.

  21. Based on the clusters that are formed, if I were to get back the document ID from the term-document matrix (for example, to find the user who belongs to a cluster, based on the tweets), how would you trace back from the clusters?

  23. Pingback: Wie ich mit R und Tweets rummachte: Ein Protokoll | Schafott

    • Paulie, In the Twitter API version 1 it was easy to code a light-weight query, but Twitter API v2 requires OAuth. Because of this and various quirks in the Twitter API, I recommend you consider my new tool called tweets2sql, which is a Twitter archiver. It is Python based, and I run it on a daily basis to add the latest tweets to a SQL database, which then I can query from R, SAS, or any other tool.

      An alternative is the twitteR package on CRAN, but I still prefer tweets2sql because it is designed to cultivate a large history of tweets, it isn’t limited by Twitter’s one-week window on the search API (if you run tweets2sql regularly), it handles network errors, etc.

  25. Andrew – I received this error when I tried loading the data from the hashtags:
    Error in UseMethod("xpathApply") :
    no applicable method for 'xpathApply' applied to an object of class "NULL"

    • Is this the line

      mydata.vector <- xpathSApply

      If so, check that the variables passed to the function are not NULL. Maybe your query returned no results.

    • I also have the same problem. I think it has to do with the hashtag-symbol. Here’s what I get after trying to run the loop and then looking at ‘twitter_url’:

      1> twitter_url
      [1] "http://search.twitter.com/search.atom?q=%23prolife%20OR%20%23prochoice&rpp=100&page=1"

      Any suggestions would be greatly appreciated.
