Delete rows from R data frame

Deleting rows from a data frame in R is easy by combining simple operations. Let’s say you are working with the built-in data set airquality and need to remove rows where the ozone is NA (also called null, blank or missing). The method is a conceptually different than a SQL database that has a dedicated delete command: in R deleting rows can be done simply by replacing the data frame with another data frame without those rows.

Before we make any changes, let’s count the number of NA records:

summary(airquality$Ozone)

The next step is identifying the rows. This code prints the rows where the Ozone is NA using a list comprehension:

airquality[is.na(airquality$Ozone),]

If you are a beginner, it’s worth analyzing this step in detail. Try running the inner part by itself:

is.na(airquality$Ozone)

This yields a long vector of TRUE and FALSE. When put plugged in to the data frame (the first code fragment), it tells R which rows to return. Since we want to remove the NA, we just need to reverse it using a boolean-not operator:

airquality[!is.na(airquality$Ozone),]

You just printed the desired data frame (where Ozone is not NA) to the screen. The last step (the only step you really need) is to “delete” the rows by recreating the data frame: just reassign the data frame from the filtered rows.

airquality <- airquality[!is.na(airquality$Ozone),]

To verify it worked, run:

summary(airquality$Ozone)

Now there are no NA records for Ozone, but there are 5 for Solar. To filter two columns (variables) at a time, combine them with boolean logic:

airquality<-airquality[!is.na(airquality$Ozone) & !is.na(airquality$Solar.R),]

About these ads

32 thoughts on “Delete rows from R data frame

  1. Good tip. If you want to subset a data.frame to only rows that have no missing information, you can use the “complete.cases” function, like the following…

    >tmp tmp
    a b
    1 1 NA
    2 2 2
    3 3 3
    4 4 4
    5 5 5
    6 6 6
    7 7 7
    8 8 8
    9 9 9
    10 10 10
    11 NA 11
    > tmp[complete.cases(tmp),]
    a b
    2 2 2
    3 3 3
    4 4 4
    5 5 5
    6 6 6
    7 7 7
    8 8 8
    9 9 9
    10 10 10

  2. Well, that was a little bit mangled, sorry. The important bit is that for a data.frame called “tmp”, try

    tmp[complete.cases(tmp),]

  3. An easy way to get rid of rows that contain any NAs is with na.omit:

    > summary(airquality$Ozone)
    Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
    1.00 18.00 31.50 42.13 63.25 168.00 37.00

    > temp summary(temp$Ozone)
    Min. 1st Qu. Median Mean 3rd Qu. Max.
    1.0 18.0 31.0 42.1 62.0 168.0

    Cheers,
    Bob

  4. Wow, what happened to that paste?? I’ll try again:

    > summary(airquality$Ozone)
    Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s
    1.00 18.00 31.50 42.13 63.25 168.00 37.00

    > temp summary(temp$Ozone)
    Min. 1st Qu. Median Mean 3rd Qu. Max.
    1.0 18.0 31.0 42.1 62.0 168.0

  5. Grrr! What is not pasting I’ll just type in:

    This makes a copy with all rows containing NAs removed:

    temp <- na.omit(airquality)

    And this shows they're gone:

    summary(temp$Ozone)

  6. Hi, pls forgive, also a “newbie” here.

    What if, in a data.frame, there is a variable called “acsyr.” This field/variable has three years, “2007,” “2008,” and “2009,” and each of those years counts the number of observations occurring during that year. However, by some unknown reason, it has an additional year “2.” Yes, “2,” with 7 observations.

    I want to delete just those 7 records. I’ve tried several examples from here and other places, but can’t get it to work.

    Subsetting and creating a new data.frame is not efficient in this case because the data.frame is huge. It has about 9 million records and 24 fields/variables. Even though I have a pretty good laptop (8 megs of RAM, i7, 1 terrabyte hard-drive, Windows 7, etc), I’ve experienced problems with memory and R.

    I take that back, subsetting could work with this data.frame but I have another data.frame which is almost twice the size and for which I will need to do the same procedure. The data I’m using is actually the 3-year ACS from the Census. If anyone is familiar with that data, I stripped the first four characters from the “SERIALNO” from each observation, to capture the year (and later do a Cox proportional hazards model).

    Any help would be greatly appreciated. THANKS!

    Here are the results of table(s3yr$acsyr):

    > table(s3yr$acsyr)

    2 2007 2008 2009
    7 2994658 3000655 3030727

  7. Ahh, it comes out the same… each figure in the bottom row corresponds to each year from the first row. Anyhow, any help would be greatly, greatly appreciated.

  8. Hi! thanks, it was very helpfull. I have a simple question :
    How can I “translate” NA to 0 or -1 ?
    I use “for” but i believe that there is a (lot) better solution in R@
    Thanks in advance.

    • The is.na() function translates NA to TRUE or FALSE which have values 1 and 0. For example,

      x <- c(1,2,3,NA,5,6,NA,7)
      is.na(x)
      
      • I see… but…
        i want the result:
        1,2,3,0,5,6,0,7
        A simple is.na(x) gives me [1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE

      • Rodrigo: Use the ifelse() function like this, vectorized without a for loop:

        x2 <- ifelse(is.na(x), 0, x)
        
  9. Hi, I would like to delete or remove certain rows from a table or data frame. I can remove rows 1 thru 9 using tablename[-2,] (example would delete row 2). The problem is that this does not work for the 10th and higher rows for some reason! Thanks!

    • I’m not sure what you mean. Try this:

      (df<-data.frame(letters=letters[1:15], numbers=1:15))
      df[-2,]
      df[-10,]
      
  10. Hello, I am having a similar problem, being the issue that I have to use blank rows to identify which rows to eliminate. Since the rows do not appear like NA and are empty, i.e. the content of the row [x,] is “” I am not managing to find a way to automatically detect such row. Any help?
    Thanks!

    • Try this:

      df<-data.frame(id=1:15) # numbers 1 through 15
      (df$x <- ifelse(df$id %% 2==1, 'Odd', '')) # add column x where value is 'Odd' or blank
      (df <- df[df$x != '',]) # remove where x is blank
      
      • Thanks indeed. I managed to solve the problem differently: I read the table forcing blanks to appear as , then I find the row number of the first NA cell and eliminate the rest of the table from there on.

        GSH= read.csv(‘GT_shopping_sp.csv’,sep=’,’, ,na.string=””, row.names=NULL, skip=4,blank.lines.skip = FALSE, fill = TRUE, comment.char = “”);

        GSH<-GSH[1:which(is.na(GSH[,1]))-1,]

  11. Hi, I have a dataset with over 320,000 rows. I am going through creating factor variables and checking the data. I would like to simply delete entire rows that have incorrect coding for individual variables. Are you able to assist?

    • Ric: Your question doesn’t give enough information for a specific answer, so here is a general answer: as with the other examples on this page, the basic strategy is to create a new data frame (sometimes to replace the original data frame) which selects the rows you want to keep from the original data frame. So depending on what “incorrect coding” means, you will need to select those rows and reverse it—or select the rows with correct coding to keep.

  12. Thanks for the reply. What I mean by “incorrect coding” is that for example a particular variable in the survey provides for only 5 possible answers (represented as values 1,2,3,4,9), but the data actually contains additional values (i.e. 0,5,8) – given earlier posts I am clear on how to deal with NA values. These additional values do not contain any meaning for this survey. I am fine in creating new data frames, but I don’t know how to identify the rows associated with the incorrect codes, part from scrolling through 320,000+ lines of data. If I know this then I think I can piece it all together. Cheers.

    • I assume the “answers” are coded as factors, so this is how to limit to the “non-impossible” answers:

      # create example data set
      (good_and_bad <- data.frame(response=as.factor(c(1,2,3,4,9,0,5,8)), id=1:8))
      # limit to "valid" response which have values 1,2,3,4,9 using %in% operator
      (good <- good_and_bad[good_and_bad$response %in% c(1,2,3,4,9),])
      
  13. Hello,
    You’ve been a great help to people here… I was wondering if you could lend me a hand too.
    I want to select/remove 46 different rows from a data.frame which match a list of ID numbers. I know (of) SAS doing this through a “lookup table” but I have no idea how to do it in R.
    Thanks!

    • In SAS I would do a data step with a MERGE statement. In R one way to perform these kind of set operations is to use the %in% operator like this

      set_a  <- data.frame(id=1:10, foo=letters[1:10])
      set_b  <- data.frame(id=5:15)
      
      # Keep every record in set_a that is also in set_b (i.e., the intersection).
      set_intersection <- set_a[set_a$id %in% set_b$id,]
      
      # Keep every record in set_a that not in set_b (i.e., remove every record found in set_b).
      set_diff <- set_a[!set_a$id %in% set_b$id,]
      

      In the first case, you could use the R merge() function. The R set operation functions (union, intersect, setdiff, setequal) may also be helpful.

  14. Hi, what if I am using lapply (as I will have a list of data frames) and I need to remove a data frame (list element) because it does not have all of the necessary information?

    For example, I need to have 4 columns in each data set to continue processing data, but I want to remove the data frames where one of the columns is missing. I know the position of the possible missing column and its name.

    What’s a good way to do that?

    For example, with the example data below I would like to remove files2 because it is missing dload_60000. I have about 149 data frames that are listed in a larger data frame and I want to remove each data frame where dload_60000 is missing.

    I am including some example data below:

    dput(files1)
    structure(list(station_id = c(“21NC02WQ.C9819500″, “21NC02WQ.C9819500″,
    “21NC02WQ.C9819500″, “21NC02WQ.C9819500″, “21NC02WQ.C9819500″,
    “21NC02WQ.C9819500″, “21NC02WQ.C9819500″, “21NC02WQ.C9819500″,
    “21NC02WQ.C9819500″), date = c(“1994/10/01″, “1994/10/02″, “1994/10/03″,
    “1994/10/04″, “1994/10/05″, “1994/10/06″, “1994/10/07″, “1994/10/08″,
    “1994/10/09″), dflow = c(1.8718701299, 2.1674285714, 2.660025974,
    3.2511428571, 3.2511428571, 3.2511428571, 3.3496623377, 3.5467012987,
    4.0392987013), dload_60000 = c(2.3716883438, 2.7887027547, 3.4994094887,
    4.3708679341, 4.3624166528, 4.3540447988, 4.4932052465, 4.7805706952,
    5.5180594209)), .Names = c(“station_id”, “date”, “dflow”, “dload_60000″
    ), class = “data.frame”, row.names = c(NA, -9L))

    > dput(files2)
    structure(list(station_id = c(2131000L, 2131000L, 2131000L, 2131000L,
    2131000L, 2131000L, 2131000L, 2131000L, 2131000L), date = c(“1994/10/01″,
    “1994/10/02″, “1994/10/03″, “1994/10/04″, “1994/10/05″, “1994/10/06″,
    “1994/10/07″, “1994/10/08″, “1994/10/09″), dflow = c(7000L, 6890L,
    5830L, 5670L, 5850L, 4580L, 4870L, 6230L, 5710L)), .Names = c(“station_id”,
    “date”, “dflow”), class = “data.frame”, row.names = c(NA, -9L
    ))

    I want to thank you in advance.

    Irucka

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s