Twitter is a favorite source of text data for analysis: it’s popular (there is a huge volume and variety of tweets on every topic), and it’s easily accessible using Twitter’s free, open APIs, which serve results in JSON and ATOM formats.
Some people have used Twitter for sophisticated analysis such as predicting flu outbreaks and the stock market, but let’s start with something simpler and less ambitious: an introduction to text data mining using Twitter and R. We’ll download live data using the Twitter APIs, parse it, build a corpus, demonstrate some basic text processing, and plot a hierarchical agglomerative cluster—because everyone likes pictures. I query for a controversial topic, abortion, in hopes of visualizing the two sides of the debate.
There is a specialized R package called twitteR, but it isn’t available for Windows. It’s easy to substitute the generic XML package plus the Twitter search API documentation for our needs.
###
### Read tweets from Twitter using ATOM (XML) format
###

# installation is required only once and is remembered across sessions
install.packages('XML')

# loading the package is required once each session
require(XML)

# initialize a storage variable for Twitter tweets
mydata.vectors <- character(0)

# paginate to get more tweets
for (page in c(1:15))
{
	# search parameter
	twitter_q <- URLencode('#prolife OR #prochoice')
	# construct a URL
	twitter_url <- paste('http://search.twitter.com/search.atom?q=', twitter_q, '&rpp=100&page=', page, sep='')
	# fetch remote URL and parse
	mydata.xml <- xmlParseDoc(twitter_url, asText=F)
	# extract the titles
	mydata.vector <- xpathSApply(mydata.xml, '//s:entry/s:title', xmlValue, namespaces=c('s'='http://www.w3.org/2005/Atom'))
	# aggregate new tweets with previous tweets
	mydata.vectors <- c(mydata.vector, mydata.vectors)
}

# how many tweets did we get?
length(mydata.vectors)
Based on the limits of the Twitter API (100 tweets per page times 15 pages), you should now have 1500 vectors representing 1500 tweets.
###
### Use the tm (text mining) package
###

install.packages('tm')
require(tm)

# build a corpus
mydata.corpus <- Corpus(VectorSource(mydata.vectors))

# make each letter lowercase
mydata.corpus <- tm_map(mydata.corpus, tolower)

# remove punctuation
mydata.corpus <- tm_map(mydata.corpus, removePunctuation)

# remove generic and custom stopwords
my_stopwords <- c(stopwords('english'), 'prolife', 'prochoice')
mydata.corpus <- tm_map(mydata.corpus, removeWords, my_stopwords)

# build a term-document matrix
mydata.dtm <- TermDocumentMatrix(mydata.corpus)

# inspect the term-document matrix
mydata.dtm

# inspect the most frequent words
findFreqTerms(mydata.dtm, lowfreq=30)
The most popular terms include abortion, dont, funding, gop, parenthood, planned, prochoice, prolife, tcot, women. Though not explicitly computed here, the words “planned” and “parenthood” form a collocation (words that occur together more often than by chance). Let’s see which words are associated with a given term.
> findAssocs(mydata.dtm, 'fetus', 0.20)
      fetus       child         247  believeing brainwashed       cared
       1.00        0.31        0.30        0.30        0.30        0.30
The number under each word is an association score; the search term is always perfectly associated with itself (1.00). The next most-associated term is “child,” and so on. In some applications, a stemmer or spell checker could help with misspelled words like “believeing.”
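Though not part of the original pipeline, a stemmer is easy to bolt on. A hedged sketch using tm’s stemDocument and the SnowballC package (an assumption, not what this post used):

```r
# Hedged sketch: stemming conflates inflected forms, shrinking the vocabulary
# and absorbing some variant spellings (requires the SnowballC package)
install.packages('SnowballC')
require(SnowballC)

# wordStem() shows the idea on a few standalone words
wordStem(c('believes', 'believing', 'funding'))

# applied to the corpus built above (mydata.corpus from the earlier code)
mydata.corpus <- tm_map(mydata.corpus, stemDocument)
```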
To make a hierarchical agglomerative cluster plot, we need to reduce the number of terms (which otherwise wouldn’t fit on a page or screen) and build a data frame.
# remove sparse terms to simplify the cluster plot
# Note: tweak the sparse parameter to determine the number of words.
# About 10-30 words is good.
mydata.dtm2 <- removeSparseTerms(mydata.dtm, sparse=0.95)

# convert the sparse term-document matrix to a standard data frame
mydata.df <- as.data.frame(inspect(mydata.dtm2))

# inspect dimensions of the data frame
nrow(mydata.df)
ncol(mydata.df)
Now the data frame (a standard data structure in R) contains a bag of words (specifically, 1-grams), which are simple frequency counts. Though the sentence structure is lost, the data frame retains much information and is simple to use. It is ready for cluster analysis using a cluster analysis function available in R core. The following code is basically copied from Robert I. Kabacoff’s “Cluster Analysis” page.
mydata.df.scale <- scale(mydata.df)
d <- dist(mydata.df.scale, method = "euclidean")  # distance matrix
fit <- hclust(d, method="ward")
plot(fit)                                         # display the dendrogram
groups <- cutree(fit, k=5)                        # cut tree into 5 clusters
# draw the dendrogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")
Adjust the number of clusters (five here) to best fit the data, and now we have the plot:
The terms higher in the plot are more popular, and terms close to each other are more associated. For example, today there is a fear the US government will shut down, so the terms “budget,” “funding,” and “shutdown” appear together, but these are not associated with the term “woman.” The term “periodpiece” is a Twitter account (remember, punctuation was removed, including the @ which designates accounts), and the cluster with “periodpiece” and “life” is a semantic argument (example).
Some possible next steps include:
- K-means cluster
- Remove hyperlinks from tweets
- Basic word association plots (built in to the tm package but requires Rgraphviz which can be tricky to install)
- Word association fans
- Sentiment analysis: which hashtag has more positive mood?
- Classification: to which side of the debate does a new tweet (without a hashtag) belong?
- Find a happier topic
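For the first next step, here is a minimal k-means sketch. A tiny toy matrix stands in for the mydata.df built above so the snippet runs on its own; the term and document names are made up for illustration:

```r
# Toy stand-in for mydata.df (rows = terms, columns = documents)
mydata.df <- data.frame(doc1 = c(5, 0, 2),
                        doc2 = c(4, 1, 0),
                        doc3 = c(0, 6, 1),
                        row.names = c('budget', 'fetus', 'women'))

set.seed(42)                                   # k-means is randomly initialized
fit.km <- kmeans(scale(mydata.df), centers = 2)
fit.km$cluster                                 # cluster label for each term
```

As with cutree() above, the number of centers is a knob to tune against the data.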
twitteR doesn’t work in Ubuntu 10.04 with the latest version of R.
Hi Andrew,
Maybe you would like to try my recently released tm.plugin.webcorpus package from R-Forge.
The following code should work for your example:
install.packages("tm.plugin.webcorpus", repos="http://R-Forge.R-project.org")
#you may also need to install tm, slam, RCurl, XML and Defaults
library(tm.plugin.webcorpus)
c <- getCorpus('#prolife OR #prochoice', src = "twitter", n = 1500)
Mario: Nice! I see support for some APIs from Yahoo, Bing, Google, NYTimes, and Twitter. Any thoughts about adding support for easily pulling Facebook status updates from public (non-profile) pages? I’ve done this in Python from Facebook’s JSON API without an API key. Also, I plan to look at your package tm.plugin.sentiment.
The packages are still in alpha status – hope you like them 😉
Your facebook idea sounds great – I’ll take a look at it soon.
I am trying to pull Facebook status updates from public (non-profile) pages but don’t know exactly how to do it. Can you please send me your code or some help material regarding this?
my id is (nadeem_rao21@yahoo.com).
Regards.
Rao, try a web search for Facebook Graph API Python
Hi Mario, I’m trying to install the package that you mentioned above, but I can’t: the package is no longer available. Could you or someone else tell me where I can find it, please? I’m talking about the package tm.plugin.webcorpus. My e-mail is jrgsua83@gmail.com
It’s now called tm.plugin.webmining
Pingback: My ongoing struggle with the Twitter API, R, … copy paste | Christina's LIS Rant
Pingback: Solution to my Twitter API – twitterR issues | Christina's LIS Rant
Pingback: links for 2011-06-28 « Personal Link Sampler
Pingback: Grep, sub, dictionary and new code in R | Accessibility of Ecological Language
Pingback: Text Data Mining with Twitter and R (via Heuristic Andrew) « beatsnpeace
Pingback: Wordclouds of tweets with R | Matteo Redaelli
# Remove hyperlink
for (j in 1:length(tweet.corpus)) tweet.corpus[[j]] <- gsub('http[a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z]', '', tweet.corpus[[j]])
I could not make that piece of code work. The following however did work:
for (j in 1:nrow(df))
df[j,2] <- gsub("http://t.co/.*", "", df[j,2])
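Since gsub() is vectorized, the explicit loop isn’t strictly necessary; the whole text column can be cleaned at once. A sketch with a hypothetical data frame (the column names are made up for illustration):

```r
# Hypothetical data frame: one tweet per row, text in the second column
df <- data.frame(id = 1:2,
                 text = c("check this http://t.co/abc123", "no link here"),
                 stringsAsFactors = FALSE)

# one vectorized call replaces the row-by-row loop; \\S+ stops at whitespace
# instead of deleting the rest of the tweet
df$text <- gsub("http://t\\.co/\\S+", "", df$text)
df$text
```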
Hi Triss,
I am new to this space of writing programs. I tried both of the code snippets mentioned above; I agree the first one does not work in my program either. However, when using your syntax, the console ends with a “+” sign, meaning there is still more that needs to be fed to the program.
Please guide me!
Regards
MJ
Hi MJ, the post above has problems because the web page is trying to interpret the code and convert it to a hyperlink. After the http, you want ‘//t.co/.*”, “”, df[j,2])’ without the single quotes.
WordPress has a special way of posting sourcecode using a special tag. The code would be something like this
MJ, Andrew’s post is exactly what I use in my cleaning routine.
Hi, Andrew
For me, this is very useful post. It could be even more useful if I knew how to pass UTF-8 encoded words, like “Čačak” or “Београд”, to URLencode function. Do you have any suggestions?
Great Article, Thanks.
I tried to mine “gaddafi”, but most of the results were gibberish, probably because a great number of tweets on Gaddafi are in Arabic. So, the question is, how is it possible to mine them? Thanks.
It depends what you are trying to do. First, could you exclude Arabic or limit to English?
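One crude way to limit to English is to keep only tweets whose characters are all ASCII. This is a rough heuristic, not a real language detector (it also discards English tweets containing any accented character); a sketch:

```r
# Toy data: one English tweet and one Arabic tweet
tweets <- c("Gaddafi in the news today",
            "\u0627\u0644\u0642\u0630\u0627\u0641\u064a")

# keep only tweets with no character outside the ASCII range
ascii.only <- tweets[!grepl("[^\x01-\x7F]", tweets)]
ascii.only
```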
Hello,
I am looking for some docs on the usage of tm.sentiment.plugin, but can’t seem to find anything. Any idea or examples on how to use it?
Thanks!
Not sure about that function, but there are some excellent instructions on sentiment analysis of tweets using R here: http://jeffreybreen.wordpress.com/2011/07/04/twitter-text-mining-r-slides/
Please which of the R version can I use for tm.plugin.sentiment.
Can someone be kind enough to tell me. I need a reply urgently.
Thanks.
R 2.14, R 2.12, or any recent version should work. I don’t see tm.plugin.sentiment in CRAN, but it seems you can get it here https://r-forge.r-project.org/R/?group_id=1048
Does anybody have suggestions on how to load text files to *create* the vectors? I’m not sure on what format I need to start this process. My initial data is in Excel (one comment per row).
Thanks!
x <- readLines("file.txt")
x1 <- as.vector(unlist(strsplit(x,split="\n")) )
mydata.corpus<- Corpus(VectorSource(x1))
should help!
Regards
Vijayan Padmanabhan
In addition to what Vijayan has said, you can also use read.csv for the Excel data (saved as CSV). You can check out my blog http://sivaanalytics.wordpress.com/2013/02/22/importing-excel-data-using-r-step-by-step/
I found some help:
http://www.johndcook.com/r_excel_clipboard.html
(there’s a ‘read.Clipboard()’ function!!!)
http://www.rdatamining.com/examples/text-mining
Pingback: Using R to search Twitter for analysis
x <- readLines("file.txt")
x1<- as.vector(unlist(strsplit(x,split="\n")) )
require(tm)
# build a corpus
mydata.corpus<- Corpus(VectorSource(x1))
Hey there,
Just a couple of small comments, since you put K-means as the first item in your possible next steps. It seems to me it is not difficult to get a small amount of labelled tweets (but surely not enough to use a supervised algorithm), so you may wish to take a look into semi-supervised clustering.
Another point is that I’m not entirely sure how you got to the number 30 (I guess it was just a visual guesstimation), you may wish to take a look into feature weighting for K-Means – perhaps (a bit biased here as this is in fact my paper… published in Pattern Recognition):
Click to access “Minkowski metric, feature weighting and anomalous cluster initializing in K-Means clustering” (PDF)
Best of luck
I’m new to R; I tried your script but I got this error:
> twitter_url = paste('http://search.twitter.com/search.atom?q=', twitter_q, '&rpp=100&page=', page, sep='')
Error in paste("http://search.twitter.com/search.atom?q=", twitter_q, :
  cannot coerce type 'closure' to vector of type 'character'
Any help?
Thank you,
Stefania
Check that the quotation marks didn’t get translated wrong when copying and pasting (retyping the quotation marks is a way to check), and check that the variable twitter_q contains the right data.
Thank you; it was my fault, I missed a quotation mark!
S.
Pingback: Analysis of #FNCE tweets – - nutsci.orgnutsci.org
Hi – This webpage has been very helpful. Can you tell me how I can get data for a particular range of dates using this code? That will help me with the project I’m trying to do right now. Thanks!
Hello everyone,
This article is sent from Programmer Heaven to us ignorants of the Ways of the Code. I am hoping that there could be more where that came from…
I am a graduate in a Belgian university doing research in digital marketing and I have bitten a lot more than I can chew by getting involved in a very challenging project involving gathering location data from Twitter. Let me explain.
Due to circumstances that are now irreversible, in the following weeks I absolutely have to learn how to gather continuous Twitter location checkin data corresponding to two full months (preferably December 2012 and January 2013) that respect the following parameters:
-the checkins (could all come via Foursquare or Gowalla into Twitter for instance) are limited to an area of 50 km from the Center of Brussels
- only the checkins from the supermarkets ‘delhaize’, ‘carrefour’ and ‘colruyt’ and the fast-food restaurants ‘mcdonalds’, ‘quick’ and ‘pizza hut’ are needed (for this I need to learn geotagging 🙂 )
-i need the timestamp for each tweet so that I can then use it to create some graphs showing daily, weekly and monthly peaks of activity in the various locations
My research supervisor (a marketing professor with limited code knowledge) has given me the link to this article and asked me to learn by myself how to use it as a springboard for my data collection. At this point I am a bit confused about the following:
1. How do I adapt/build a code that helps me retrieve the data with the parameters I put above? Would anyone be willing to guide me?
2. How should I go about collecting the data timewise ? Can it be automated somehow or, if not, how often do I have to run the code? For instance, can I retrieve twitter data from last week? last month? How far back can I go? Is there a limit?
3.Where can I store the data I am collecting. Is there one file that R creates where I can have all my data?
I realize that I am asking for a lot from you guys, but I am in a state of shock 🙂 and any guidance would go a long way. I have had some C++ programming experience since it was my high-school major, but since then I followed a career in the humanities, so it’s a bit hard to get back into gear.
Thanks in advance for any help I get 🙂
Gratefully,
Ciprian B
I need a particular user’s information and his/her relationship details (links) using twitteR.
To get a friends/followers list, you need to use a different Twitter API. For an account with under 5000 friends, it is relatively easy to get in one API call (but you may need OAuth authentication). While building a Twitter-to-SQL archiver, I’ve found that using a well-established library makes working with the Twitter APIs much easier, so consider using twitteR for R (which I haven’t tried) or Mike Verdone’s Twitter API for Python. Even though it is in Python and I have to export the data to R, overall it’s easier to work with.
Based on the clusters that are formed, if I were to get back the document ID from the term-document matrix (for example, to find the user who belongs to a cluster, based on the tweets), how would you trace back from the clusters?
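One way to approach this (a sketch, not from the original post): cluster the documents instead of the terms by transposing the term-document matrix, so cutree() returns a cluster label named by document ID. A toy matrix stands in for mydata.dtm2 here:

```r
# Toy term-document matrix (rows = terms, columns = documents)
m <- matrix(c(5, 0, 2, 4, 1, 0, 0, 6, 1), nrow = 3,
            dimnames = list(c('budget', 'fetus', 'women'),
                            c('doc1', 'doc2', 'doc3')))

d <- dist(scale(t(m)), method = "euclidean")  # distances between documents
fit <- hclust(d, method = "ward.D")           # "ward" was renamed "ward.D" in R 3.1
groups <- cutree(fit, k = 2)                  # named vector: document ID -> cluster
names(groups)[groups == 1]                    # document IDs in cluster 1
```

With the document IDs in hand, you can index back into the original tweet vector to see who wrote them.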
Pingback: Twitte-R: Attempts at Gleaning Insights from Twitter Searches | Deductions through Data
This worked great for me – thanks.
Pingback: Wie ich mit R und Tweets rummachte: Ein Protokoll | Schafott
Any chance of this being updated to get this working now that API v1 has been retired?
Paulie, in Twitter API version 1 it was easy to code a lightweight query, but Twitter API v1.1 requires OAuth. Because of this and various quirks in the Twitter API, I recommend you consider my new tool called tweets2sql, which is a Twitter archiver. It is Python-based, and I run it on a daily basis to add the latest tweets to a SQL database, which I can then query from R, SAS, or any other tool.
An alternative is the twitteR package on CRAN, but I still prefer tweets2sql because it is designed to cultivate a large history of tweets, it isn’t limited by Twitter’s one-week window on the search API (if you run tweets2sql regularly), it handles network errors, etc.
I’ll give it a go thanks for the quick response!
Pingback: Pappu Vs. Feku – Twitter Wars | TweetSent
Andrew – I received this error when I tried loading the data from the hashtags:
Error in UseMethod("xpathApply") :
  no applicable method for 'xpathApply' applied to an object of class "NULL"
Is this the line
?
If so, check that the variables passed to the function are not NULL. Maybe your query returned no results.
Hi, I have the same problem. Have you solved it?
I also have the same problem. I think it has to do with the hashtag-symbol. Here’s what I get after trying to run the loop and then looking at ‘twitter_url’:
1> twitter_url
[1] “http://search.twitter.com/search.atom?q=%23prolife%20OR%20%23prochoice&rpp=100&page=1”
1>
Any suggestions would be greatly appreciated.
Pingback: Feijoo e Beiras polarizan o Debate do Estado da Autonomía en Twitter | Calidonia Hibernia
I want code to get Twitter follower and friend lists in matrix format using R.
Try the twitteR package on CRAN.
Reblogged this on My Research Collections.
I did not get it in the form of an adjacency matrix. Is that possible in R? Please give me a clear guide.
Thanks for the clear intro to text analysis!
Pingback: #Allezlesbleus : R and Twitter | Data Science for Enthusiastic People
I want code for Twitter friend and follower lists in the form of an adjacency matrix.
You’ll need to use the new Twitter API version 1.1 which requires OAuth authentication. If the list of friends/followers is large, be prepared for problems because of the resource limitations in the API.
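Once you have the friend/follower lists from the API, building the adjacency matrix itself is straightforward in base R. A sketch with a made-up edge list (the user names are illustrative only):

```r
# Hypothetical follow relationships: 'from' follows 'to'
from <- c('alice', 'alice', 'bob')
to   <- c('bob', 'carol', 'carol')

# square adjacency matrix over all users seen in the edge list
users <- sort(unique(c(from, to)))
adj <- table(factor(from, levels = users), factor(to, levels = users))
adj['alice', 'bob']   # 1 if alice follows bob, 0 otherwise
```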
I have a problem with this code:
[1]
I/O warning : failed to load external entity “http://search.twitter.com/search.atom?q=%23data&rpp=100&result_type=recent&page=1”
[2]
Error in UseMethod(“xpathApply”) :
no applicable method for ‘xpathApply’ applied to an object of class “NULL”
When I put this link “http://search.twitter.com/search.atom?” in a search box:
Shows me:
This XML file does not appear to have any style information associated with it. The document tree is shown below.
The Twitter REST API v1 is no longer active. Please migrate to API v1.1. https://dev.twitter.com/docs/api/1.1/overview.
Please, I want to solve this 😦
@cs: Now I use tweets2sql in Python, which supports the new Twitter API and is more robust, and it is easy to get the Tweets from SQL (e.g., SQLite) into R.
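Getting archived tweets out of SQLite and into the corpus pipeline can be sketched with the RSQLite package. The database file name and table/column names below are assumptions for illustration, not tweets2sql’s actual schema:

```r
# Hedged sketch: read archived tweets from a SQLite database into R
# (requires the RSQLite and tm packages)
library(RSQLite)
library(tm)

con <- dbConnect(SQLite(), dbname = "tweets.db")          # hypothetical file
tweets <- dbGetQuery(con, "SELECT text FROM tweets")      # hypothetical table/column
dbDisconnect(con)

# then proceed exactly as in the post above
mydata.corpus <- Corpus(VectorSource(tweets$text))
```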
Hi,
I have not been able to retrieve data from Twitter. I am using Windows and trying to use the command library(twitteR), but I am getting the error: Error in get_oauth_sig() : OAuth has not been registered for this session
Please let me know what I should do now.
Great post. I’ve already shared your tips with a couple newbie bloggers. Thanks!