Text Data Mining with Twitter and R

Twitter is a favorite source of text data for analysis: it’s popular (there is a huge volume of tweets on all topics), and it’s easily accessible through Twitter’s free, open APIs, which return results in JSON and ATOM formats.

Some people have used Twitter for sophisticated analysis, such as predicting flu outbreaks and the stock market, but let’s start with something simpler and less ambitious: an introduction to text data mining using Twitter and R. We’ll download live data using the Twitter APIs, parse it, build a corpus, demonstrate some basic text processing, and plot a hierarchical agglomerative clustering (because everyone likes pictures). I query for a controversial topic, abortion, in hopes of visualizing the two sides of the debate.

There is a specialized R package called twitteR, but it isn’t available for Windows. It’s easy, however, to substitute the generic XML package plus the Twitter search API documentation for our needs.

###
### Read tweets from Twitter using ATOM (XML) format
###

# installation is required only once and is remembered across sessions
install.packages('XML') 

# loading the package is required once each session
require(XML)

# initialize a storage variable for Twitter tweets
mydata.vectors <- character(0)

# paginate to get more tweets
for (page in c(1:15))
{
	# search parameter
	twitter_q <- URLencode('#prolife OR #prochoice')
	# construct a URL
	twitter_url <- paste('http://search.twitter.com/search.atom?q=', twitter_q, '&rpp=100&page=', page, sep='')
	# fetch remote URL and parse
	mydata.xml <- xmlParseDoc(twitter_url, asText=F)
	# extract the titles
	mydata.vector <- xpathSApply(mydata.xml, '//s:entry/s:title', xmlValue, namespaces =c('s'='http://www.w3.org/2005/Atom'))
	# aggregate new tweets with previous tweets
	mydata.vectors <- c(mydata.vector, mydata.vectors)
}

# how many tweets did we get?
length(mydata.vectors)

Based on the limits of the Twitter search API (at most 100 tweets per page across 15 pages), you should now have a character vector of 1,500 elements, one per tweet.
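As a quick sanity check (a minimal sketch that just prints raw text), inspect the first few tweets:

# peek at the first three tweets
head(mydata.vectors, 3)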

###
### Use tm (text mining) package
###

install.packages('tm')
require(tm)

# build a corpus
mydata.corpus <- Corpus(VectorSource(mydata.vectors))

# make each letter lowercase (content_transformer is required by newer versions of tm)
mydata.corpus <- tm_map(mydata.corpus, content_transformer(tolower))

# remove punctuation 
mydata.corpus <- tm_map(mydata.corpus, removePunctuation)

# remove generic and custom stopwords
my_stopwords <- c(stopwords('english'), 'prolife', 'prochoice')
mydata.corpus <- tm_map(mydata.corpus, removeWords, my_stopwords)

# build a term-document matrix
mydata.dtm <- TermDocumentMatrix(mydata.corpus)

# inspect the term-document matrix
mydata.dtm

# inspect most popular words
findFreqTerms(mydata.dtm, lowfreq=30)

The most popular terms include abortion, dont, funding, gop, parenthood, planned, prochoice, prolife, tcot, and women. Though not explicitly stated here, the words “planned” and “parenthood” are a collocation (words which occur together more often than by chance).
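As a rough check of that claim (a sketch only, assuming the raw tweets are still in mydata.vectors), compare how often the two words co-occur against what independence would predict:

# fraction of tweets containing each word, and both together
p.planned    <- mean(grepl('planned', mydata.vectors, ignore.case=TRUE))
p.parenthood <- mean(grepl('parenthood', mydata.vectors, ignore.case=TRUE))
p.both       <- mean(grepl('planned', mydata.vectors, ignore.case=TRUE) &
                     grepl('parenthood', mydata.vectors, ignore.case=TRUE))
# under independence, p.both would be near p.planned * p.parenthood;
# a ratio well above 1 suggests a collocation
p.both / (p.planned * p.parenthood)

Let’s see which words are associated with a term.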

> findAssocs(mydata.dtm, 'fetus', 0.20) 
         fetus          child            247     believeing    brainwashed          cared 
          1.00           0.31           0.30           0.30           0.30           0.30 

The number under each word is an association score (a correlation between term occurrences), so the search term is always perfectly associated with itself (1.00). The next most-associated term is “child,” and so on. In some applications, a stemmer or spell checker could help with the misspelled word “believeing.”
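As an aside (not applied in the rest of this walkthrough), here is a minimal stemming sketch using tm’s stemDocument; the SnowballC package is an assumption about your setup, since tm delegates the actual stemming to it:

# stem each word to its root; misspellings such as 'believeing'
# often collapse to the same stem ('believ') as 'believing'
install.packages('SnowballC')
require(SnowballC)
mydata.corpus.stemmed <- tm_map(mydata.corpus, stemDocument)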

To make a Hierarchical Agglomerative cluster plot, we need to reduce the number of terms (which otherwise wouldn’t fit on a page or the screen) and build a data frame.

# remove sparse terms to simplify the cluster plot
# Note: tweak the sparse parameter to determine the number of words.
# About 10-30 words is good.
mydata.dtm2 <- removeSparseTerms(mydata.dtm, sparse=0.95)

# convert the sparse term-document matrix to a standard data frame
mydata.df <- as.data.frame(as.matrix(mydata.dtm2))

# inspect dimensions of the data frame
nrow(mydata.df)
ncol(mydata.df)

Now the data frame (a standard data structure in R) contains a bag of words (specifically, 1-grams), which are simple frequency counts. Though word order is lost, the bag of words retains much information and is simple to use. The data frame is ready for cluster analysis using a cluster analysis function available in R core. The following code is basically copied from Robert I. Kabacoff’s “Cluster Analysis” page.

mydata.df.scale <- scale(mydata.df)
d <- dist(mydata.df.scale, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward") # newer R versions call this method "ward.D"
plot(fit) # display the dendrogram

groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw the dendrogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")

Adjust the quantity five (the k parameter) to best fit the data, and now we have the plot:

The terms higher in the plot are more popular, and terms close to each other are more associated. For example, today there is a fear the US government will shut down, so the terms “budget,” “funding,” and “shutdown” appear together, but these are not associated with the term “woman.” The term “periodpiece” is a Twitter account (remember, punctuation was removed, including the @ which designates accounts), and the cluster with “periodpiece” and “life” is a semantic argument.

Some possible next steps include:

  • K-means cluster (see the sketch after this list)
  • Remove hyperlinks from tweets
  • Basic word association plots (built in to the tm package but requires Rgraphviz which can be tricky to install)
  • Word association fans
  • Sentiment analysis: which hashtag has more positive mood?
  • Classification: to which side of the debate does a new tweet (without a hashtag) belong?
  • Find a happier topic
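For the first item, here is a minimal K-means sketch, not a tuned analysis: it reuses mydata.df.scale from above, and centers=5 simply mirrors the k=5 passed to cutree.

# K-means is sensitive to random starts, so fix the seed and use several starts
set.seed(42)
fit.km <- kmeans(mydata.df.scale, centers=5, nstart=25)
# list the terms assigned to each cluster
split(rownames(mydata.df.scale), fit.km$cluster)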

69 thoughts on “Text Data Mining with Twitter and R”

  1. Hi Andrew,
    Maybe you would like to try my recently released tm.plugin.webcorpus package from R-Forge.
    The following code should work for your example:
    install.packages("tm.plugin.webcorpus", repos="http://R-Forge.R-project.org")
    #you may also need to install tm, slam, RCurl, XML and Defaults
    library(tm.plugin.webcorpus)
    c <- getCorpus('#prolife OR #prochoice', src = "twitter", n = 1500)

    • Mario: Nice! I see support for some APIs from Yahoo, Bing, Google, NYTimes, and Twitter. Any thoughts about adding support for easily pulling Facebook status updates from public (non-profile) pages? I’ve done this in Python from Facebook’s JSON API without an API key. Also, I plan to look at your package tm.plugin.sentiment.

      • The packages are still in alpha status – hope you like them 😉
        Your facebook idea sounds great – I’ll take a look at it soon.

      • I am trying to pull Facebook status updates from public (non-profile) pages but don’t know exactly how to do it. Can you please send me your code or some help material regarding this?
        My id is (nadeem_rao21@yahoo.com).
        Regards.

    • Hi Mario, I’m trying to install the package that you mentioned above but I can’t do it. The package is no longer available. Could you or someone else tell me where I can find it, please? I’m talking about the package tm.plugin.webcorpus. My e-mail is jrgsua83@gmail.com

  8. # remove hyperlinks: match 'http' followed by any run of non-space characters
    for (j in seq_along(tweet.corpus)) tweet.corpus[[j]] <- gsub('http\\S+', '', tweet.corpus[[j]])

  9. Hi, Andrew
    For me, this is a very useful post. It could be even more useful if I knew how to pass UTF-8 encoded words, like “Čačak” or “Београд”, to the URLencode function. Do you have any suggestions?

  10. Great Article, Thanks.
    I tried to mine “gaddafi”, but most of the results were gibberish, probably because a great number of the tweets on gaddafi are in Arabic. So, the question is, how is it possible to mine them? Thanks.

  11. Hello,

    I am looking for some docs on the usage of tm.plugin.sentiment, but can’t seem to find anything. Any ideas or examples on how to use it?

    Thanks!

  12. Please, which R version can I use for tm.plugin.sentiment?
    Can someone be kind enough to tell me? I need a reply urgently.

    Thanks.

  13. Does anybody have suggestions on how to load text files to *create* the vectors? I’m not sure what format I need to start this process. My initial data is in Excel (one comment per row).

    Thanks!

  15. x <- readLines("file.txt") # readLines already returns one element per line

    require(tm)

    # build a corpus
    mydata.corpus <- Corpus(VectorSource(x))

  16. Hey there,
    Just a couple of small comments, since you put K-means as the first item in your possible next steps. It seems to me it is not difficult to get a small amount of labelled tweets (but surely not enough to use a supervised algorithm), so you may wish to take a look into semi-supervised clustering.

    Another point is that I’m not entirely sure how you got to the number 30 (I guess it was just a visual guesstimate); you may wish to take a look into feature weighting for K-means, perhaps (a bit biased here, as this is in fact my paper, published in Pattern Recognition):

    Minkowski metric, feature weighting and anomalous cluster initializing in K-Means clustering (PDF)

    Best of luck

  17. I’m new to R; I tried your script but I got this error:
    > twitter_url = paste(‘http://search.twitter.com/search.atom?q=’,twitter_q,’&rpp=100&page=’, page, sep=”)
    Errore in paste(“http://search.twitter.com/search.atom?q=”, twitter_q, :
    cannot coerce type ‘closure’ to vector of type ‘character’

    Any help?
    Thank you,
    Stefania

    • Check that the quotation marks didn’t get translated wrong when copying and pasting (retyping the quotation marks is a way to check), and check that the variable twitter_q contains the right data.

  18. Pingback: Analysis of #FNCE tweets – - nutsci.orgnutsci.org

  19. Hi – This webpage has been very helpful. Can you tell me how I can get data for a particular range of dates using this code? That will help me with the project I’m trying to do right now. Thanks!

  20. Hello everyone,

    This article is sent from Programmer Heaven to us ignorants of the Ways of the Code. I am hoping that there could be more where that came from…

    I am a graduate student at a Belgian university doing research in digital marketing, and I have bitten off a lot more than I can chew by getting involved in a very challenging project involving gathering location data from Twitter. Let me explain.

    Due to circumstances that are now irreversible, in the following weeks I absolutely have to learn how to gather continuous Twitter location checkin data corresponding to two full months (preferably December 2012 and January 2013) that respect the following parameters:

    - the checkins (which could all come into Twitter via Foursquare or Gowalla, for instance) are limited to an area of 50 km from the center of Brussels
    - only the checkins from the supermarkets ‘delhaize’, ‘carrefour’ and ‘colruyt’ and the fast-food restaurants ‘mcdonalds’, ‘quick’ and ‘pizza hut’ are needed (for this I need to learn geotagging 🙂 )
    - I need the timestamp for each tweet so that I can then use it to create some graphs showing daily, weekly and monthly peaks of activity in the various locations

    My research supervisor (a marketing professor with limited code knowledge) has given me the link to this article and asked me to learn by myself how to use it as a springboard for my data collection. At this point I am a bit confused about the following:

    1. How do I adapt/build code that helps me retrieve the data with the parameters I put above? Would anyone be willing to guide me?
    2. How should I go about collecting the data timewise? Can it be automated somehow or, if not, how often do I have to run the code? For instance, can I retrieve Twitter data from last week? Last month? How far back can I go? Is there a limit?
    3. Where can I store the data I am collecting? Is there one file that R creates where I can have all my data?

    I realize that I am asking a lot from you guys, but I am in a state of shock 🙂 and any guidance would go a long way. I have had some C++ programming experience since it was my high-school major, but since then I followed a career in the humanities, so it’s a bit hard to get back into gear.

    Thanks in advance for any help I get 🙂

    Gratefully,
    Ciprian B

    • To get a friends/followers list, you need to use a different Twitter API. For an account with under 5,000 friends, it is relatively easy to get in one API call (but you may need OAuth authentication). While building a Twitter-to-SQL archiver, I’ve found that using a well-established library makes working with Twitter APIs much easier, so consider using twitteR for R (which I haven’t tried) or Mike Verdone’s Twitter API for Python. Even though the latter is in Python and I have to export the data to R, overall it’s easier to work with.

  21. Based on the clusters that are formed, how would you trace back from a cluster to the document IDs in the term-document matrix (for example, to find the user a cluster belongs to, based on the tweets)?

    • Paulie, in Twitter API version 1 it was easy to code a lightweight query, but Twitter API v1.1 requires OAuth. Because of this and various quirks in the Twitter API, I recommend you consider my new tool called tweets2sql, which is a Twitter archiver. It is Python based, and I run it on a daily basis to add the latest tweets to a SQL database, which I can then query from R, SAS, or any other tool.

      An alternative is the twitteR package on CRAN, but I still prefer tweets2sql because it is designed to cultivate a large history of tweets, it isn’t limited by Twitter’s one-week window on the search API (if you run tweets2sql regularly), it handles network errors, etc.

  25. Andrew – I received this error when I tried loading the data from the hashtags:
    Error in UseMethod("xpathApply") :
    no applicable method for 'xpathApply' applied to an object of class "NULL"

    • Is this the line

      mydata.vector <- xpathSApply

      ?

      If so, check that the variables passed to the function are not NULL. Maybe your query returned no results.

    • I also have the same problem. I think it has to do with the hashtag symbol. Here’s what I get after trying to run the loop and then looking at 'twitter_url':

      1> twitter_url
      [1] "http://search.twitter.com/search.atom?q=%23prolife%20OR%20%23prochoice&rpp=100&page=1"
      1>

      Any suggestions would be greatly appreciated.

    • You’ll need to use the new Twitter API version 1.1 which requires OAuth authentication. If the list of friends/followers is large, be prepared for problems because of the resource limitations in the API.

  28. I have a problem with the code:

    [1]
    I/O warning : failed to load external entity "http://search.twitter.com/search.atom?q=%23data&rpp=100&result_type=recent&page=1"

    [2]
    Error in UseMethod("xpathApply") :
    no applicable method for 'xpathApply' applied to an object of class "NULL"

    When I open the link "http://search.twitter.com/search.atom?" in a browser, it shows me:
    This XML file does not appear to have any style information associated with it. The document tree is shown below.

    The Twitter REST API v1 is no longer active. Please migrate to API v1.1. https://dev.twitter.com/docs/api/1.1/overview.

    Please, I want to solve this 😦

  29. Hi,

    I have not been able to retrieve data from Twitter. I am using Windows and trying to use the command library(twitteR), but I am getting the error: Error in get_oauth_sig() : OAuth has not been registered for this session
    Please let me know what I should do now.
