R dplyr – Ignoring NA When Summing Multiple Columns

dataframedplyrmultiple-columnsr

I am summing across multiple columns, some that have NA. I am using

 dplyr::mutate

and then writing out the arithmetic sum of the columns to get the sum. But the columns have NA and I would like to treat them as zero. I was able to get it to work with rowSums (see below), but now using mutate. Using mutate allows to make it more readable, but can also allow me to subtract columns. The example is below.

require(dplyr)
data(iris)
iris <- tbl_df(iris)
iris[2,3] <- NA
iris <- mutate(iris, sum = Sepal.Length + Petal.Length)

How do I ensure that NA in Petal.Length is handled as zero in the above expression? I know using rowSums I can do something like:

iris$sum <- rowSums(DF[,c("Sepal.Length","Petal.Length")], na.rm = T)

but with mutate it is easier to set even diff = Sepal.Length – Petal.Length.
What would be a suggested way to accomplish this using mutate?

Note the post is similar to below stackoverflow posts.

Sum across multiple columns with dplyr

Subtract multiple columns ignoring NA

Best Answer

The problem with your rowSums is the reference to DF (which is undefined). This works:

mutate(iris, sum2 = rowSums(cbind(Sepal.Length, Petal.Length), na.rm = T))

For difference, you could of course use a negative: rowSums(cbind(Sepal.Length, -Petal.Length), na.rm = T)

The general solution is to use ifelse or similar to set the missing values to 0 (or whatever else is appropriate):

mutate(iris, sum2 = Sepal.Length + ifelse(is.na(Petal.Length), 0, Petal.Length))

More efficient than ifelse would be an implementation of coalesce, see examples here. This uses @krlmlr's answer from the previous link (see bottom for the code or use the kimisc package).

mutate(iris, sum2 = Sepal.Length + coalesce.na(Petal.Length, 0))

To replace missing values data-set wide, there is replace_na in the tidyr package.

@krlmlr's coalesce.na, as found here

coalesce.na <- function(x, ...) {
  x.len <- length(x)
  ly <- list(...)
  for (y in ly) {
    y.len <- length(y)
    if (y.len == 1) {
      x[is.na(x)] <- y
    } else {
      if (x.len %% y.len != 0)
        warning('object length is not a multiple of first object length')
      pos <- which(is.na(x))
      x[pos] <- y[(pos - 1) %% y.len + 1]
    }
  }
  x
}

Related Solutions

SQL Group By – Using Group By on Multiple Columns

Group By X means put all those with the same value for X in the one group.

Group By X, Y means put all those with the same values for both X and Y in the one group.

To illustrate using an example, let's say we have the following table, to do with who is attending what subject at a university:

Table: Subject_Selection

+---------+----------+----------+
| Subject | Semester | Attendee |
+---------+----------+----------+
| ITB001  |        1 | John     |
| ITB001  |        1 | Bob      |
| ITB001  |        1 | Mickey   |
| ITB001  |        2 | Jenny    |
| ITB001  |        2 | James    |
| MKB114  |        1 | John     |
| MKB114  |        1 | Erica    |
+---------+----------+----------+

When you use a group by on the subject column only; say:

select Subject, Count(*)
from Subject_Selection
group by Subject

You will get something like:

+---------+-------+
| Subject | Count |
+---------+-------+
| ITB001  |     5 |
| MKB114  |     2 |
+---------+-------+

...because there are 5 entries for ITB001, and 2 for MKB114

If we were to group by two columns:

select Subject, Semester, Count(*)
from Subject_Selection
group by Subject, Semester

we would get this:

+---------+----------+-------+
| Subject | Semester | Count |
+---------+----------+-------+
| ITB001  |        1 |     3 |
| ITB001  |        2 |     2 |
| MKB114  |        1 |     2 |
+---------+----------+-------+

This is because, when we group by two columns, it is saying "Group them so that all of those with the same Subject and Semester are in the same group, and then calculate all the aggregate functions (Count, Sum, Average, etc.) for each of those groups". In this example, this is demonstrated by the fact that, when we count them, there are three people doing ITB001 in semester 1, and two doing it in semester 2. Both of the people doing MKB114 are in semester 1, so there is no row for semester 2 (no data fits into the group "MKB114, Semester 2")

Hopefully that makes sense.

R Sorting DataFrame – How to Sort Rows by Multiple Columns

You can use the order() function directly without resorting to add-on tools -- see this simpler answer which uses a trick right from the top of the example(order) code:

R> dd[with(dd, order(-z, b)), ]
    b x y z
4 Low C 9 2
2 Med D 3 1
1  Hi A 8 1
3  Hi A 9 1

Edit some 2+ years later: It was just asked how to do this by column index. The answer is to simply pass the desired sorting column(s) to the order() function:

R> dd[order(-dd[,4], dd[,1]), ]
    b x y z
4 Low C 9 2
2 Med D 3 1
1  Hi A 8 1
3  Hi A 9 1
R>

rather than using the name of the column (and with() for easier/more direct access).

Best Answer

Related Solutions

SQL Group By – Using Group By on Multiple Columns

R Sorting DataFrame – How to Sort Rows by Multiple Columns

Related Question