How to Calculate Cumulative Sum (cumsum) by Group in R

Cumulative sum by group means for each group, we calculate the running sum of values in the specific column that increases with each row within that group.

It is the following two-step process:

  1. First, divide the data into subgroups based on single or multiple grouping variables (categorical variables).
  2. Within each subgroup, calculate the sum of the current value and all preceding values within that group.

In the final output, you will have a new column that contains the cumulative sum of each row based on the groupings.

Here are three ways to calculate cumsum by group in R:

  1. Using base R’s ave()
  2. Using dplyr’s group_by() and mutate()
  3. Using data.table

For this tutorial, I will use the following data frame:

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000)
)

print(employee_data)

data frame in R

Method 1: Using base R’s ave()

For quick calculation without external packages in R, you can use the ave() function. It accepts the first argument as the value column, the second argument as the grouping column, and the third argument as the function where we apply the cumsum() function to the distinct groups.

Let’s calculate the cumulative sum of Salaries for each unique value in the Department column (HR, IT, Finance).

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000)
)

print(employee_data)

employee_data$Cumulative <- ave(
   employee_data$Salary,
   employee_data$Department,
   FUN = cumsum
)

print(employee_data)

cumsum by group using base R's ave() function

The above output image shows that it added a new column called “Cumulative” that holds all the cumsum.

The first value for HR is 60000 because it appears for the first time. There are no preceding values for HR.

The second HR’s cumsum salary is 130000, which is a cumulative value of 60000 + 70000.

The final cumsum for the HR department’s Salary is 192000. But how did we get there? Well, it is a sum of 60000 + 70000 + 62000.

The same principle applies to the IT and Finance departments.

Method 2: Using dplyr’s group_by() and mutate()

Using the dplyr::group_by() function, we can divide the data frame into groups and then use the dplyr::mutate() function to create a new column with the cumsum() function.

Let’s calculate the cumsum of the Salary Column group by the Department column.

Ensure that you have installed and imported the “dplyr” package at the head of your program file.

library(dplyr)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000)
)

print(employee_data)

employee_data <- employee_data %>%
  group_by(Department) %>%
  mutate(Cumulative = cumsum(Salary))

print(employee_data)

cumsum by group using dplyr packageThe dplyr functions return the output in a tibble, which is an extended version of the data frame.

Method 3: Using data.table

You can always rely on data tables for large datasets and efficient performance. They provide every method for statistical analysis and return perfect output.

Convert the input data frame into a data.table using the setDT() function, and then calculate the cumsum by department.

library(data.table)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000)
)

print(employee_data)

setDT(employee_data)[, Cumulative := cumsum(Salary), by = Department]

print(employee_data)

Using data.table package

Conclusion

The data.table is the fastest approach because it updates columns by (:=) reference operator instead of copying data, which makes it efficient.

For a medium-sized dataset, tidyverse works fine, and for small datasets, the ave() function is a good choice.

Leave a Comment