Delete rows from R data frame

heuristicandrew / October 8, 2009

Deleting rows from a data frame in R is easy once you combine a few simple operations. Let’s say you are working with the built-in data set airquality and need to remove rows where Ozone is NA (also called null, blank, or missing). The method is conceptually different from a SQL database, which has a dedicated DELETE command: in R, you delete rows by replacing the data frame with another data frame that lacks those rows.

Before we make any changes, let’s count the number of NA records:

summary(airquality$Ozone)
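
summary() reports the NA count as one item among the other statistics. As a small sketch (not part of the original post), the count can also be computed directly, because TRUE counts as 1 when summed:

```r
# is.na() gives TRUE/FALSE for each value; sum() counts the TRUEs
n_missing <- sum(is.na(airquality$Ozone))
n_missing

# The same idea applied to every column of the data frame at once
na_per_column <- colSums(is.na(airquality))
na_per_column
```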

The next step is identifying the rows. This code prints the rows where Ozone is NA using logical indexing (a TRUE/FALSE vector placed in the row position of the brackets):

airquality[is.na(airquality$Ozone),]

If you are a beginner, it’s worth analyzing this step in detail. Try running the inner part by itself:

is.na(airquality$Ozone)

This yields a long vector of TRUE and FALSE. When plugged into the data frame (the first code fragment), it tells R which rows to return. Since we want to remove the NA rows, we just need to reverse it using the Boolean NOT operator (!):

airquality[!is.na(airquality$Ozone),]
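
If the mechanics are still unclear, a toy vector may help. This is only an illustrative sketch with made-up values (the names ozone, keep and ozone_present are not from the original post):

```r
# A toy vector with two missing values (made-up data for illustration)
ozone <- c(41, NA, 12, NA, 18)

# TRUE where a value is present, FALSE where it is NA
keep <- !is.na(ozone)
keep

# Indexing with a logical vector keeps only the positions that are TRUE
ozone_present <- ozone[keep]
ozone_present
```

The data frame version works the same way, except the logical vector goes before the comma so that it selects rows.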

You just printed the desired data frame (where Ozone is not NA) to the screen. The last step (the only step you really need) is to “delete” the rows by recreating the data frame: just reassign the data frame from the filtered rows.

airquality <- airquality[!is.na(airquality$Ozone),]

To verify it worked, run:

summary(airquality$Ozone)

Now there are no NA records for Ozone, but there are 5 for Solar.R. To filter on two columns (variables) at a time, combine the conditions with Boolean logic:

airquality<-airquality[!is.na(airquality$Ozone) & !is.na(airquality$Solar.R),]
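
Here is a sketch of the same two-column filter with a verification step. It works on a copy (aq is a name chosen for this example) so the built-in data set is left alone. Note the single &, which compares element by element; && is meant for single TRUE/FALSE values.

```r
# Work on a copy so the built-in data set stays untouched
aq <- airquality

# TRUE only for rows where both columns have a value
keep <- !is.na(aq$Ozone) & !is.na(aq$Solar.R)
aq <- aq[keep, ]

# Verify: count the remaining rows and confirm neither column has an NA
nrow(aq)
anyNA(aq$Ozone)
anyNA(aq$Solar.R)
```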

32 thoughts on “Delete rows from R data frame”

erikr says:

October 8, 2009 at 3:35 pm

Good tip. If you want to subset a data.frame to only rows that have no missing information, you can use the “complete.cases” function, like the following…

> tmp
    a  b
1   1 NA
2   2  2
3   3  3
4   4  4
5   5  5
6   6  6
7   7  7
8   8  8
9   9  9
10 10 10
11 NA 11
> tmp[complete.cases(tmp),]
    a  b
2   2  2
3   3  3
4   4  4
5   5  5
6   6  6
7   7  7
8   8  8
9   9  9
10 10 10

Reply
erikr says:

October 8, 2009 at 3:36 pm

Well, that was a little bit mangled, sorry. The important bit is that for a data.frame called “tmp”, try

tmp[complete.cases(tmp),]
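
For anyone who wants to run this, here is a self-contained sketch. The data.frame() call is an assumed reconstruction based on the printed output in the comment above, not the commenter's original code.

```r
# Assumed reconstruction of the example data from the printed output
tmp <- data.frame(a = c(1:10, NA), b = c(NA, 2:11))

# Keep rows with no missing value in any column
complete_rows <- tmp[complete.cases(tmp), ]
complete_rows

# Keep rows where column "a" is present, ignoring NAs in other columns
a_present <- tmp[complete.cases(tmp["a"]), ]
a_present
```

Passing only some columns to complete.cases() is a way to get the two-column filter from the post without writing out each is.na() test.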

Reply
- Varun says:
  
  August 1, 2012 at 3:13 pm
  
  Great! Thank you Erick,
  
  Reply
heuristicandrew says:

October 8, 2009 at 3:42 pm

Nice, erikr. I hadn’t noticed complete.cases() before.

Reply
Bob Muenchen says:

January 4, 2011 at 8:34 am

An easy way to get rid of rows that contain any NAs is with na.omit:

> summary(airquality$Ozone)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
1.00 18.00 31.50 42.13 63.25 168.00 37.00

> temp summary(temp$Ozone)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 18.0 31.0 42.1 62.0 168.0

Cheers,
Bob

Reply
Bob Muenchen says:

January 4, 2011 at 8:35 am

Wow, what happened to that paste?? I’ll try again:

> summary(airquality$Ozone)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
1.00 18.00 31.50 42.13 63.25 168.00 37.00

> temp summary(temp$Ozone)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 18.0 31.0 42.1 62.0 168.0

Reply
Bob Muenchen says:

January 4, 2011 at 8:38 am

Grrr! Whatever is not pasting, I’ll just type it in:

This makes a copy with all rows containing NAs removed:

temp <- na.omit(airquality)

And this shows they're gone:

summary(temp$Ozone)
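
A slightly fuller sketch of the same idea, in case it helps to see how many rows went away. The variable dropped is a name chosen for this example; the "na.action" attribute is where na.omit() records the rows it removed.

```r
# Copy of the data with every row containing an NA removed
temp <- na.omit(airquality)

# Compare row counts before and after
nrow(airquality)
nrow(temp)

# na.omit() records which rows it dropped in the "na.action" attribute
dropped <- attr(temp, "na.action")
length(dropped)
```

Keep in mind that na.omit() looks at every column, so a row is dropped if any variable is missing, not just Ozone.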

Reply
- Andrew says:
  
  September 6, 2011 at 5:55 am
  
  clearly, the na.omit was working too well!
  
  Reply
callmeRK says:

September 26, 2011 at 9:27 pm

What if instead of NA’s we wanted to omit something else, say a string like “Wrong” or a 0?

Thanks

Reply
- heuristicandrew says:
  
  September 27, 2011 at 7:36 am
  
  callmeRK: One way is with the subset function
  
  airquality <- subset(airquality, Ozone > 50)
  ?subset
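
  To make that concrete for the "Wrong" and 0 cases in the question, here is a rough sketch on made-up data (the data frame and its column names status and count are invented for illustration):

```r
# Made-up example data: one text column and one numeric column
df <- data.frame(status = c("OK", "Wrong", "OK", "Wrong"),
                 count  = c(3, 5, 0, 7),
                 stringsAsFactors = FALSE)

# Drop rows where status is "Wrong"
no_wrong <- df[df$status != "Wrong", ]

# Drop rows where count is 0, this time with subset()
no_zero <- subset(df, count != 0)

# Drop rows that fail either test
clean <- df[df$status != "Wrong" & df$count != 0, ]
clean
```

  One difference worth knowing: if the column also contains NA, the bracket form returns NA-filled rows for those positions, while subset() silently drops them.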
  
  Reply
Calo says:

November 5, 2011 at 7:06 pm

Hi, pls forgive, also a “newbie” here.

What if a data.frame has a variable called “acsyr”? This field/variable holds three years, “2007,” “2008,” and “2009,” and each of those years has the number of observations recorded during that year. However, for some unknown reason, it has an additional year “2.” Yes, “2,” with 7 observations.

I want to delete just those 7 records. I’ve tried several examples from here and other places, but can’t get it to work.

Subsetting and creating a new data.frame is not efficient in this case because the data.frame is huge. It has about 9 million records and 24 fields/variables. Even though I have a pretty good laptop (8 GB of RAM, i7, 1 terabyte hard drive, Windows 7, etc.), I’ve experienced problems with memory and R.

I take that back, subsetting could work with this data.frame but I have another data.frame which is almost twice the size and for which I will need to do the same procedure. The data I’m using is actually the 3-year ACS from the Census. If anyone is familiar with that data, I stripped the first four characters from the “SERIALNO” from each observation, to capture the year (and later do a Cox proportional hazards model).

Any help would be greatly appreciated. THANKS!

Here are the results of table(s3yr$acsyr):

> table(s3yr$acsyr)

2 2007 2008 2009
7 2994658 3000655 3030727
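
For reference, the approach from the post applied to this case might look like the sketch below. It uses a tiny stand-in for the real data (only the names s3yr and acsyr come from the question; the values are invented), and it does still build a filtered copy while it runs, so it does not address the memory concern by itself.

```r
# Small stand-in for the real data; the values are invented
s3yr <- data.frame(acsyr = c(2007, 2, 2008, 2009, 2, 2007),
                   value = 1:6)

# Keep every row whose year is not 2, replacing the original data frame
s3yr <- s3yr[s3yr$acsyr != 2, ]

# Confirm the stray year is gone
table(s3yr$acsyr)
```

If acsyr is stored as a factor rather than a number, the level "2" would stay in the table with a count of 0 until droplevels() is applied.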

Reply
Calo says:

November 5, 2011 at 10:42 pm

The pasting of the table results didn’t show very well, let’s see if this shows it better:
2 2007 2008 2009
7 2994658 3000655 3030727

Reply
Calo says:

November 5, 2011 at 10:43 pm

Ahh, it comes out the same… each figure in the bottom row corresponds to each year from the first row. Anyhow, any help would be greatly, greatly appreciated.

Reply
Rodrigo says:

May 9, 2012 at 8:20 am

Hi! Thanks, it was very helpful. I have a simple question:
How can I “translate” NA to 0 or -1?
I use “for”, but I believe that there is a (much) better solution in R.
Thanks in advance.

Reply
- heuristicandrew says:
  
  May 9, 2012 at 8:55 am
  The is.na() function returns TRUE for each NA and FALSE otherwise; as numbers these are 1 and 0. For example,
```
x <- c(1,2,3,NA,5,6,NA,7)
is.na(x)
```
  Reply
  - Rodrigo says:
    
    May 9, 2012 at 10:37 am
    
    I see… but…
    i want the result:
    1,2,3,0,5,6,0,7
    A simple is.na(x) gives me [1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
  - heuristicandrew says:
    
    May 9, 2012 at 10:42 am
    Rodrigo: Use the ifelse() function like this, vectorized without a for loop:
    
    x2 <- ifelse(is.na(x), 0, x)
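
    For comparison, here is a sketch of ifelse() next to another common idiom, assigning directly into the NA positions (x3 is a name chosen for this example):

```r
x <- c(1, 2, 3, NA, 5, 6, NA, 7)

# Option 1: build a new vector with ifelse()
x2 <- ifelse(is.na(x), 0, x)

# Option 2: copy the vector, then overwrite only the NA positions
x3 <- x
x3[is.na(x3)] <- 0

x2
x3
```

    Both give the same result here; the second form changes only the selected elements, which can be handy for a single column of a data frame.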
Rodrigo says:

May 9, 2012 at 11:13 am

Thank you!

Reply
Linds says:

May 21, 2012 at 11:17 pm

Hi, I would like to delete or remove certain rows from a table or data frame. I can remove rows 1 thru 9 using tablename[-2,] (example would delete row 2). The problem is that this does not work for the 10th and higher rows for some reason! Thanks!

Reply
- heuristicandrew says:
  
  May 22, 2012 at 8:47 am
  I’m not sure what you mean. Try this:
```
(df<-data.frame(letters=letters[1:15], numbers=1:15))
df[-2,]
df[-10,]
```
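
  Two related points, sketched on the same example data: several rows can be dropped at once with -c(...), and a negative index built from which() misbehaves when nothing matches. The variable names below are chosen for this example.

```r
df <- data.frame(letters = letters[1:15], numbers = 1:15)

# Drop several rows by position in one step
df_small <- df[-c(2, 10), ]

# Caution: when which() finds nothing, the negative index is empty
# and R returns zero rows instead of all rows
no_match <- which(df$numbers > 100)
all_gone <- df[-no_match, ]

# A logical index handles the no-match case as expected
all_kept <- df[!(df$numbers > 100), ]

nrow(df_small)
nrow(all_gone)
nrow(all_kept)
```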
  Reply
ups says:

May 30, 2012 at 11:15 am

Hello, I am having a similar problem. The issue is that I have to use blank rows to identify which rows to eliminate. The rows do not appear as NA; they are empty, i.e. the content of the row [x,] is “”, so I have not managed to find a way to detect such rows automatically. Any help?
Thanks!

Reply
- heuristicandrew says:
  
  May 30, 2012 at 11:56 am
  Try this:
```
df<-data.frame(id=1:15) # numbers 1 through 15
(df$x <- ifelse(df$id %% 2==1, 'Odd', '')) # add column x where value is 'Odd' or blank
(df <- df[df$x != '',]) # remove where x is blank
```
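
  If the "blank" cells might also contain spaces or NA, a slightly more defensive test could look like this sketch (the data and the names has_text and df_clean are made up for illustration):

```r
# Made-up data: x is empty, whitespace-only, or NA in some rows
df <- data.frame(id = 1:5,
                 x  = c("Odd", "", "  ", NA, "Odd"),
                 stringsAsFactors = FALSE)

# trimws() strips surrounding spaces; nzchar() is TRUE for non-empty strings.
# The is.na() test is needed because nzchar() treats NA as non-empty by default.
has_text <- !is.na(df$x) & nzchar(trimws(df$x))

# Keep only the rows where x has real content
df_clean <- df[has_text, ]
df_clean
```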
  Reply
  - ups says:
    
    May 30, 2012 at 12:29 pm
    
    Thanks indeed. I managed to solve the problem differently: I read the table forcing blanks to appear as NA, then I find the row number of the first NA cell and eliminate the rest of the table from there on.
    
    GSH <- read.csv('GT_shopping_sp.csv', sep=',', na.strings="", row.names=NULL, skip=4, blank.lines.skip=FALSE, fill=TRUE, comment.char="")
    
    # keep the rows before the first NA in column 1
    GSH <- GSH[seq_len(which(is.na(GSH[,1]))[1] - 1),]
Ric says:

June 5, 2012 at 8:14 am

Hi, I have a dataset with over 320,000 rows. I am going through creating factor variables and checking the data. I would like to simply delete entire rows that have incorrect coding for individual variables. Are you able to assist?

Reply
- heuristicandrew says:
  
  June 5, 2012 at 9:25 am
  
  Ric: Your question doesn’t give enough information for a specific answer, so here is a general answer: as with the other examples on this page, the basic strategy is to create a new data frame (sometimes to replace the original data frame) which selects the rows you want to keep from the original data frame. So depending on what “incorrect coding” means, you will need to select those rows and reverse it—or select the rows with correct coding to keep.
  
  Reply
Ric says:

June 5, 2012 at 7:25 pm

Thanks for the reply. What I mean by “incorrect coding” is that, for example, a particular variable in the survey provides for only 5 possible answers (represented as values 1,2,3,4,9), but the data actually contains additional values (i.e. 0,5,8). Given earlier posts, I am clear on how to deal with NA values. These additional values do not contain any meaning for this survey. I am fine with creating new data frames, but I don’t know how to identify the rows associated with the incorrect codes, apart from scrolling through 320,000+ lines of data. If I know this then I think I can piece it all together. Cheers.

Reply
- heuristicandrew says:
  
  June 12, 2012 at 11:03 am
  I assume the “answers” are coded as factors, so this is how to limit to the “non-impossible” answers:
```
# create example data set
(good_and_bad <- data.frame(response=as.factor(c(1,2,3,4,9,0,5,8)), id=1:8))
# limit to "valid" response which have values 1,2,3,4,9 using %in% operator
(good <- good_and_bad[good_and_bad$response %in% c(1,2,3,4,9),])
```
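
  To see which rows hold the invalid codes before removing them, something like the following sketch may help. It reuses the example data; valid_codes, is_valid and bad_rows are names chosen for this example.

```r
good_and_bad <- data.frame(response = as.factor(c(1, 2, 3, 4, 9, 0, 5, 8)), id = 1:8)
valid_codes <- c(1, 2, 3, 4, 9)

# TRUE for rows with a valid code
is_valid <- good_and_bad$response %in% valid_codes

# Row numbers of the invalid responses, for inspection
bad_rows <- which(!is_valid)
bad_rows

# How often each invalid code occurs
table(droplevels(good_and_bad$response[bad_rows]))

# Keep the valid rows and drop the factor levels that are no longer used
good <- droplevels(good_and_bad[is_valid, ])
levels(good$response)
```

  Without droplevels(), the removed codes would linger as empty factor levels and show up with a count of 0 in later tables.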
  Reply
Claudia Penaloza says:

June 8, 2012 at 7:26 am

Hello,
You’ve been a great help to people here… I was wondering if you could lend me a hand too.
I want to select/remove 46 different rows from a data.frame which match a list of ID numbers. I know (of) SAS doing this through a “lookup table” but I have no idea how to do it in R.
Thanks!

Reply
- heuristicandrew says:
  
  June 11, 2012 at 11:45 am
  In SAS I would do a data step with a MERGE statement. In R, one way to perform this kind of set operation is to use the %in% operator like this:
```
set_a  <- data.frame(id=1:10, foo=letters[1:10])
set_b  <- data.frame(id=5:15)

# Keep every record in set_a that is also in set_b (i.e., the intersection).
set_intersection <- set_a[set_a$id %in% set_b$id,]

# Keep every record in set_a that is not in set_b (i.e., remove every record found in set_b).
set_diff <- set_a[!set_a$id %in% set_b$id,]
```
  In the first case, you could use the R merge() function. The R set operation functions (union, intersect, setdiff, setequal) may also be helpful.
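
  As a sketch of those alternatives on the same example data (merged and ids_only_in_a are names chosen for this example):

```r
set_a <- data.frame(id = 1:10, foo = letters[1:10])
set_b <- data.frame(id = 5:15)

# Inner join on the shared "id" column: the same rows as the %in% intersection
merged <- merge(set_a, set_b, by = "id")
merged

# IDs that appear in set_a but not in set_b
ids_only_in_a <- setdiff(set_a$id, set_b$id)
ids_only_in_a
```

  Note that setdiff() works on the ID vectors, not on whole data frames, so the result is a set of IDs that still has to be used to select rows.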
  Reply
  - Claudia Penaloza says:
    
    June 20, 2012 at 2:51 pm
    
    Thank you! That is exactly what I wanted and it worked perfectly.
- Taweesak Channgam says:
  
  June 26, 2012 at 8:58 pm
  
  Thank you very much for your answer. It worked perfectly.
  
  Reply
Irucka says:

March 12, 2013 at 2:05 am

Hi, what if I am using lapply (as I will have a list of data frames) and I need to remove a data frame (list element) because it does not have all of the necessary information?

For example, I need to have 4 columns in each data set to continue processing data, but I want to remove the data frames where one of the columns is missing. I know the position of the possible missing column and its name.

What’s a good way to do that?

For example, with the example data below I would like to remove files2 because it is missing dload_60000. I have about 149 data frames that are listed in a larger data frame and I want to remove each data frame where dload_60000 is missing.

I am including some example data below:

> dput(files1)
structure(list(station_id = c("21NC02WQ.C9819500", "21NC02WQ.C9819500",
"21NC02WQ.C9819500", "21NC02WQ.C9819500", "21NC02WQ.C9819500",
"21NC02WQ.C9819500", "21NC02WQ.C9819500", "21NC02WQ.C9819500",
"21NC02WQ.C9819500"), date = c("1994/10/01", "1994/10/02", "1994/10/03",
"1994/10/04", "1994/10/05", "1994/10/06", "1994/10/07", "1994/10/08",
"1994/10/09"), dflow = c(1.8718701299, 2.1674285714, 2.660025974,
3.2511428571, 3.2511428571, 3.2511428571, 3.3496623377, 3.5467012987,
4.0392987013), dload_60000 = c(2.3716883438, 2.7887027547, 3.4994094887,
4.3708679341, 4.3624166528, 4.3540447988, 4.4932052465, 4.7805706952,
5.5180594209)), .Names = c("station_id", "date", "dflow", "dload_60000"
), class = "data.frame", row.names = c(NA, -9L))

> dput(files2)
structure(list(station_id = c(2131000L, 2131000L, 2131000L, 2131000L,
2131000L, 2131000L, 2131000L, 2131000L, 2131000L), date = c("1994/10/01",
"1994/10/02", "1994/10/03", "1994/10/04", "1994/10/05", "1994/10/06",
"1994/10/07", "1994/10/08", "1994/10/09"), dflow = c(7000L, 6890L,
5830L, 5670L, 5850L, 4580L, 4870L, 6230L, 5710L)), .Names = c("station_id",
"date", "dflow"), class = "data.frame", row.names = c(NA, -9L
))

I want to thank you in advance.

Irucka

Reply
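
One possible approach to the question above, sketched under the assumption that the data frames are held in an ordinary R list: test each element for the required column, then keep only the elements that pass. The list name all_files and the shortened data values are made up for this example; only files1, files2 and dload_60000 come from the question.

```r
# Two small stand-ins for the data frames in the question (values shortened)
files1 <- data.frame(station_id = "21NC02WQ.C9819500",
                     date = c("1994/10/01", "1994/10/02"),
                     dflow = c(1.87, 2.17),
                     dload_60000 = c(2.37, 2.79))
files2 <- data.frame(station_id = 2131000L,
                     date = c("1994/10/01", "1994/10/02"),
                     dflow = c(7000L, 6890L))

# Hypothetical list holding all the data frames
all_files <- list(files1 = files1, files2 = files2)

# TRUE for each list element that contains the required column
has_dload <- vapply(all_files,
                    function(d) "dload_60000" %in% names(d),
                    logical(1))

# Keep only the data frames that have the column
complete_files <- all_files[has_dload]
names(complete_files)
```

This is the same "keep what you want" pattern as in the post, applied to list elements instead of rows.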