R Advanced

Calculating Cumulative Sum (cumsum) by Group in R

Cumulative sum by group means for each group, we calculate the running sum of values in the specific column that increases with each row within that group.

It is the following two-step process:

  1. First, divide the data into subgroups based on single or multiple grouping variables (categorical variables).
  2. Within each subgroup, calculate the sum of the current value and all preceding values within that group.

In the final output, you will have a new column that contains the cumulative sum of each row based on the groupings.

Here are three ways:

  1. Using base R’s ave()
  2. Using dplyr’s group_by() and mutate()
  3. Using data.table

For this tutorial, I will use the following data frame:

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000)
)

print(employee_data)

Method 1: Using base R’s ave()

For quick calculation without external packages, use the ave() function. It accepts the first argument as the value column, the second argument as the grouping column, and the third argument as the function where we apply the cumsum() function to the distinct groups.

Let’s calculate the cumulative sum of Salaries for each unique value in the Department column (HR, IT, Finance).

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000)
)

print(employee_data)

employee_data$Cumulative <- ave(
   employee_data$Salary,
   employee_data$Department,
   FUN = cumsum
)

print(employee_data)

The above output image shows that it added a new column called “Cumulative” that holds all the cumsum.

The first value for HR is 60000 because it appears for the first time. There are no preceding values for HR.

The second HR’s cumsum salary is 130000, which is a cumulative value of 60000 + 70000.

The final cumsum for the HR department’s Salary is 192000. But how did we get there? Well, it is a sum of 60000 + 70000 + 62000.

The same principle applies to the IT and Finance departments.

Method 2: Using dplyr’s group_by() and mutate()

Using the dplyr::group_by() function, we can divide the data frame into groups and then use the dplyr::mutate() function to create a new column with the cumsum() function.

Let’s calculate the cumsum of the Salary Column group by the Department column.

Ensure that you have installed and imported the “dplyr” package at the head of your program file.

library(dplyr)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000)
)

print(employee_data)

employee_data <- employee_data %>%
  group_by(Department) %>%
  mutate(Cumulative = cumsum(Salary))

print(employee_data)

The dplyr functions return the output in a tibble, which is an extended version of the data frame.

Method 3: Using data.table

You can always rely on data tables for large datasets and efficient performance. They provide every method for statistical analysis and return perfect output.

Convert the input data frame into a data.table using the setDT() function, and then calculate the cumsum by department.

library(data.table)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000)
)

print(employee_data)

setDT(employee_data)[, Cumulative := cumsum(Salary), by = Department]

print(employee_data)

That’s all!

Recent Posts

cbind() Function: Binding R Objects by Columns

R cbind (column bind) is a function that combines specified vectors, matrices, or data frames…

1 week ago

rbind() Function: Binding Rows in R

The rbind() function combines R objects, such as vectors, matrices, or data frames, by rows.…

1 week ago

as.numeric(): Converting to Numeric Values in R

The as.numeric() function in R converts valid non-numeric data into numeric data. What do I…

2 weeks ago

Calculating Natural Log using log() Function in R

The log() function calculates the natural logarithm (base e) of a numeric vector. By default,…

3 weeks ago

Dollar Sign ($ Operator) in R

In R, you can use the dollar sign ($ operator)  to access elements (columns) of…

1 month ago

Calculating Absolute Value using abs() Function in R

The abs() function calculates the absolute value of a numeric input, returning a non-negative (only…

1 month ago