Cumulative sum by group means for each group, we calculate the running sum of values in the specific column that increases with each row within that group.
It is the following two-step process:
In the final output, you will have a new column that contains the cumulative sum of each row based on the groupings.
Here are three ways:
For this tutorial, I will use the following data frame:
employee_data <- data.frame(
Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000)
)
print(employee_data)
For quick calculation without external packages, use the ave() function. It accepts the first argument as the value column, the second argument as the grouping column, and the third argument as the function where we apply the cumsum() function to the distinct groups.
Let’s calculate the cumulative sum of Salaries for each unique value in the Department column (HR, IT, Finance).
employee_data <- data.frame(
Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000)
)
print(employee_data)
employee_data$Cumulative <- ave(
employee_data$Salary,
employee_data$Department,
FUN = cumsum
)
print(employee_data)
The above output image shows that it added a new column called “Cumulative” that holds all the cumsum.
The first value for HR is 60000 because it appears for the first time. There are no preceding values for HR.
The second HR’s cumsum salary is 130000, which is a cumulative value of 60000 + 70000.
The final cumsum for the HR department’s Salary is 192000. But how did we get there? Well, it is a sum of 60000 + 70000 + 62000.
The same principle applies to the IT and Finance departments.
Using the dplyr::group_by() function, we can divide the data frame into groups and then use the dplyr::mutate() function to create a new column with the cumsum() function.
Let’s calculate the cumsum of the Salary Column group by the Department column.
Ensure that you have installed and imported the “dplyr” package at the head of your program file.
library(dplyr)
employee_data <- data.frame(
Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000)
)
print(employee_data)
employee_data <- employee_data %>%
group_by(Department) %>%
mutate(Cumulative = cumsum(Salary))
print(employee_data)
You can always rely on data tables for large datasets and efficient performance. They provide every method for statistical analysis and return perfect output.
Convert the input data frame into a data.table using the setDT() function, and then calculate the cumsum by department.
library(data.table)
employee_data <- data.frame(
Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000)
)
print(employee_data)
setDT(employee_data)[, Cumulative := cumsum(Salary), by = Department]
print(employee_data)
That’s all!
Krunal Lathiya is a seasoned Computer Science expert with over eight years in the tech industry. He boasts deep knowledge in Data Science and Machine Learning. Versed in Python, JavaScript, PHP, R, and Golang. Skilled in frameworks like Angular and React and platforms such as Node.js. His expertise spans both front-end and back-end development. His proficiency in the Python language stands as a testament to his versatility and commitment to the craft.
R cbind (column bind) is a function that combines specified vectors, matrices, or data frames…
The rbind() function combines R objects, such as vectors, matrices, or data frames, by rows.…
The as.numeric() function in R converts valid non-numeric data into numeric data. What do I…
The log() function calculates the natural logarithm (base e) of a numeric vector. By default,…
In R, you can use the dollar sign ($ operator) to access elements (columns) of…
The abs() function calculates the absolute value of a numeric input, returning a non-negative (only…