Calculating Cumulative Sum (cumsum) by Group in R

Cumulative sum by group means for each group, we calculate the running sum of values in the specific column that increases with each row within that group.

It is the following two-step process:

First, divide the data into subgroups based on single or multiple grouping variables (categorical variables).
Within each subgroup, calculate the sum of the current value and all preceding values within that group.

In the final output, you will have a new column that contains the cumulative sum of each row based on the groupings.

Here are three ways:

Using base R’s ave()
Using dplyr’s group_by() and mutate()
Using data.table

For this tutorial, I will use the following data frame:

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000)
)

print(employee_data)

Method 1: Using base R’s ave()

For quick calculation without external packages, use the ave() function. It accepts the first argument as the value column, the second argument as the grouping column, and the third argument as the function where we apply the cumsum() function to the distinct groups.

Let’s calculate the cumulative sum of Salaries for each unique value in the Department column (HR, IT, Finance).

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000)
)

print(employee_data)

employee_data$Cumulative <- ave(
   employee_data$Salary,
   employee_data$Department,
   FUN = cumsum
)

print(employee_data)

The above output image shows that it added a new column called “Cumulative” that holds all the cumsum.

The first value for HR is 60000 because it appears for the first time. There are no preceding values for HR.

The second HR’s cumsum salary is 130000, which is a cumulative value of 60000 + 70000.

The final cumsum for the HR department’s Salary is 192000. But how did we get there? Well, it is a sum of 60000 + 70000 + 62000.

The same principle applies to the IT and Finance departments.

Method 2: Using dplyr’s group_by() and mutate()

Using the dplyr::group_by() function, we can divide the data frame into groups and then use the dplyr::mutate() function to create a new column with the cumsum() function.

Let’s calculate the cumsum of the Salary Column group by the Department column.

Ensure that you have installed and imported the “dplyr” package at the head of your program file.

library(dplyr)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000)
)

print(employee_data)

employee_data <- employee_data %>%
  group_by(Department) %>%
  mutate(Cumulative = cumsum(Salary))

print(employee_data)

The dplyr functions return the output in a tibble, which is an extended version of the data frame.

Method 3: Using data.table

You can always rely on data tables for large datasets and efficient performance. They provide every method for statistical analysis and return perfect output.

Convert the input data frame into a data.table using the setDT() function, and then calculate the cumsum by department.

library(data.table)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000)
)

print(employee_data)

setDT(employee_data)[, Cumulative := cumsum(Salary), by = Department]

print(employee_data)

That’s all!

Krunal Lathiya

Krunal Lathiya is a seasoned Computer Science expert with over eight years in the tech industry. He boasts deep knowledge in Data Science and Machine Learning. Versed in Python, JavaScript, PHP, R, and Golang. Skilled in frameworks like Angular and React and platforms such as Node.js. His expertise spans both front-end and back-end development. His proficiency in the Python language stands as a testament to his versatility and commitment to the craft.

Method 1: Using base R’s ave()

Method 2: Using dplyr’s group_by() and mutate()

Method 3: Using data.table

Leave a Comment Cancel reply