Calculate row sums and col means with tidyR

Recently Someone asked a question on reddit: Adding a row and a column that are means for a set matrix

This is a very simple problem that many of us learned how to do only a few hours in to basic R. But since I’ve been learning tidyverse package, I figured why not do it tidy-fashion?

Let’s give it a try.

Here is the question:

I have the following sequence:
S<-seq(1,90, by=3)
I make a matrix that is the following:
matrix(S, nrow = 6, ncol = 5)
Now I am trying to do the following:
I want to calculate the means of the columns of the matrix and add them as a new row under the columns.
Next I want to calculate the sum of the rows of the matrix and add them as a new column on the right of the matrix.

Here is what their data looks like:

data_matrix = matrix(seq(1, 90, by = 3), nrow = 6, ncol = 5)
data_matrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1   19   37   55   73
## [2,]    4   22   40   58   76
## [3,]    7   25   43   61   79
## [4,]   10   28   46   64   82
## [5,]   13   31   49   67   85
## [6,]   16   34   52   70   88

They want to calculate the mean of each column and sum of each row.
Now if you know some basic R this can very easily be achieved:

data_df = data.frame(data_matrix)
data_df[,ncol(data_df)+1] = rowSums(data_df)
data_df[nrow(data_df)+1,] = colMeans(data_df)
head(data_df)
##   X1 X2 X3 X4 X5  V6
## 1  1 19 37 55 73 185
## 2  4 22 40 58 76 200
## 3  7 25 43 61 79 215
## 4 10 28 46 64 82 230
## 5 13 31 49 67 85 245
## 6 16 34 52 70 88 260

But how can we do this “tidy” way? Turns out this is more complicated than I originally thought.

First let’s breath some context into our data so we don’t get too bored:
Let’s say we had an exam in a class of 50 students. The exam consists of 5 questions. TAs recorded the score of each question for each student on an excel sheet. The spreadsheet data has 5 columns, each for a question and 50 rows, each for a student. Now we want to calculate:
1. Total score for each student (row sums)
2. Class average for each question (col means)

Let’s first creat this “spreadsheet”

scores_df = data.frame(matrix(sample(1:20,250, replace = TRUE), nrow = 50, ncol = 5))
head(scores_df)
##   X1 X2 X3 X4 X5
## 1 19 13  7  9 16
## 2  5 11 12 14 19
## 3  5 11  6  8 20
## 4  8  9  8  9  6
## 5  3 14  4  7  6
## 6  8 12  6 15 15

Here we have a data frame with 5 colmuns (5 questions) and 50 rows (50 students).
Now the next step is to for tidy data: key-value pairs.
First we need to identify the keys: in our dataset, each student-question pair uniquely identifies a value (score). So we can gather our date like this:

library(tidyverse)
scores_tibble = gather(rownames_to_column(scores_df, var = "student"), question, score, -student)
head(scores_tibble)
##   student question score
## 1       1       X1    19
## 2       2       X1     5
## 3       3       X1     5
## 4       4       X1     8
## 5       5       X1     3
## 6       6       X1     8

Here I did not unite() student and question into a single column so that later I can group by either student or question to calculate mean and sum.
Now to calculate question means:

summarise(group_by(scores_tibble, question), mean = mean(score))
## # A tibble: 5 x 2
##   question  mean
##   <chr>    <dbl>
## 1 X1       11   
## 2 X2       11.1 
## 3 X3        9.7 
## 4 X4        9.78
## 5 X5       10.8

Student sum scores:

summarise(group_by(scores_tibble, student), total = sum(score))
## # A tibble: 50 x 2
##    student total
##    <chr>   <int>
##  1 1          64
##  2 10         46
##  3 11         78
##  4 12         63
##  5 13         50
##  6 14         56
##  7 15         44
##  8 16         40
##  9 17         64
## 10 18         43
## # … with 40 more rows

Let’s get a single table with all information

scores_complete = group_by(scores_tibble, question) %>%
  mutate(question_average = mean(score)) %>%
    group_by(student) %>%
      mutate(student_total = sum(score))
scores_complete
## # A tibble: 250 x 5
## # Groups:   student [50]
##    student question score question_average student_total
##    <chr>   <chr>    <int>            <dbl>         <int>
##  1 1       X1          19               11            64
##  2 2       X1           5               11            61
##  3 3       X1           5               11            50
##  4 4       X1           8               11            40
##  5 5       X1           3               11            34
##  6 6       X1           8               11            56
##  7 7       X1           2               11            53
##  8 8       X1           1               11            42
##  9 9       X1          17               11            63
## 10 10      X1           8               11            46
## # … with 240 more rows

With a tidy table we can easily calculate stats we want and add them to the original data frame:

scores_df_complete = spread(scores_tibble, question, score) %>%
  left_join(summarise(group_by(scores_tibble, student), student_total = sum(score)), by = "student") %>%
    bind_rows(spread(summarise(group_by(scores_tibble, question), question_mean = mean(score)), question, question_mean))
head(scores_df_complete)
##   student X1 X2 X3 X4 X5 student_total
## 1       1 19 13  7  9 16            64
## 2      10  8 20  4  1 13            46
## 3      11 15 19 10 18 16            78
## 4      12  6 14 13 20 10            63
## 5      13  3  6 13 19  9            50
## 6      14 10 19 10 15  2            56
tail(scores_df_complete)
##    student X1    X2   X3    X4    X5 student_total
## 46      50 12 20.00 17.0 19.00  1.00            69
## 47       6  8 12.00  6.0 15.00 15.00            56
## 48       7  2  9.00 18.0 15.00  9.00            53
## 49       8  1 16.00 11.0  3.00 11.00            42
## 50       9 17  8.00  9.0 11.00 18.00            63
## 51    <NA> 11 11.08  9.7  9.78 10.84            NA
Avatar
Sichong Peng
PhD student

I study equine genetics/genomics at UC Davis Veterinary school. My primary interest is functional annotation of non-model organisms and its applications.