In R, you can calculate the sum by group using the base aggregate(), dplyr’s group_by() with summarise(), or the data table package.
What do we mean when we say sum by group? Well, it means dividing the dataset into meaningful groups based on categorical variables and then calculating the summation of the values within each group.
For example, you may need to calculate the total number of quantities sold per category or product. Another example is summing the total revenue, which you can calculate as (Price × Quantity), for each group.
For this practical, we will use the df_sales_data data frame:
df_sales_data <- data.frame(
Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
Quantity = c(5, 10, 5, 2, 3, 12, 2),
stringsAsFactors = FALSE
)
print(df_sales_data)
If you are looking for a built-in solution that does not require packages, you should use the aggregate() function. It accepts a formula for summarization, a dataset, and a function to apply and returns data summarised by groups.
Let’s calculate the Sum of Quantities by Category:
library(data.table)
df_sales_data <- data.frame(
Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
Quantity = c(5, 10, 5, 2, 3, 12, 2),
stringsAsFactors = FALSE
)
print(df_sales_data)
# Calculate sum of quantities by category
aggregate(Quantity ~ Category, data = df_sales_data, FUN = sum)
Output
The above output image shows that we are calculating the total number of quantities based on each category.
Let’s calculate the sum of the Quantity group by Category and Product.
df_sales_data <- data.frame(
Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
Quantity = c(5, 10, 5, 2, 3, 12, 2),
stringsAsFactors = FALSE
)
print(df_sales_data)
# Sum by Multiple Groups
aggregate(Quantity ~ Category + Product, data = df_sales_data, FUN = sum)
Output
The above figure shows that we group data frames based on Category and Product and calculate the sum of quantities for each unique product.
Let’s calculate the sum of multiple columns (Quantity, Price) grouped by Category.
df_sales_data <- data.frame(
Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
Quantity = c(5, 10, 5, 2, 3, 12, 2),
stringsAsFactors = FALSE
)
print(df_sales_data)
# Sum Multiple Columns
aggregate(cbind(Quantity, Price) ~ Category, data = df_sales_data, FUN = sum)
Output
The above figure shows the sum of multiple columns categorywise.
The dplyr package provides an easy way to group and summarize data frame using the group_by() and summarise() functions.
The group_by() function groups data frame by categorical column.
The summarise() function applies aggregation functions like sum(), mean(), count(), etc. For example, we have already calculated the mean by group and count rows by group.
The output from dplyr functions is a tibble, which is an extended version of the data frame.
library(dplyr)
df_sales_data <- data.frame(
Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
Quantity = c(5, 10, 5, 2, 3, 12, 2),
stringsAsFactors = FALSE
)
print(df_sales_data)
# Using dplyr Approach
# Basic Grouped Sum
df_sales_data %>%
group_by(Category) %>%
summarize(Total_Quantity = sum(Quantity))
Output
Let’s calculate the sum of the total quantity and the sum of the total price grouped by Category and Product.
library(dplyr)
df_sales_data <- data.frame(
Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
Quantity = c(5, 10, 5, 2, 3, 12, 2),
stringsAsFactors = FALSE
)
print(df_sales_data)
# Grouped Sum with Multiple Columns
df_sales_data %>%
group_by(Category, Product) %>%
summarize(Total_Quantity = sum(Quantity), Total_Price = sum(Price))
Output
If you want to sum all numeric columns while grouping by one column, you can use group_by(), summarise(), across(), where(), and is.numeric() in dplyr.
library(dplyr)
df_sales_data <- data.frame(
Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
Quantity = c(5, 10, 5, 2, 3, 12, 2),
stringsAsFactors = FALSE
)
print(df_sales_data)
# Sum All Numeric Columns
df_sales_data %>%
group_by(Category) %>%
summarise(across(where(is.numeric), sum))
Output
The data.table package provides a highly optimized way to perform data manipulation.
library(data.table)
df_sales_data <- data.frame(
Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
Quantity = c(5, 10, 5, 2, 3, 12, 2),
stringsAsFactors = FALSE
)
print(df_sales_data)
# Convert to data.table
dt_sales <- as.data.table(df_sales_data)
# Basic Grouped Sum
dt_sales[, .(Total_Quantity = sum(Quantity)), by = Category]
Output
library(data.table)
df_sales_data <- data.frame(
Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
Quantity = c(5, 10, 5, 2, 3, 12, 2),
stringsAsFactors = FALSE
)
print(df_sales_data)
# Convert to data.table
dt_sales <- as.data.table(df_sales_data)
# Grouped Sum with Multiple Columns
dt_sales[, .(Total_Quantity = sum(Quantity), Total_Price = sum(Price)), by = .(Category, Product)]
Output
Krunal Lathiya is a seasoned Computer Science expert with over eight years in the tech industry. He boasts deep knowledge in Data Science and Machine Learning. Versed in Python, JavaScript, PHP, R, and Golang. Skilled in frameworks like Angular and React and platforms such as Node.js. His expertise spans both front-end and back-end development. His proficiency in the Python language stands as a testament to his versatility and commitment to the craft.
To calculate the percentage by the group in R, you need to combine various dplyr…
What does it mean when we say "mean by group"? It means grouping the data…
Whether you want to summarize the categorical data, identify patterns and trends, or calculate percentages…
The group_by() function from the dplyr package allows us to group data frames by one…
The dplyr::slice() function subsets rows by their position or index within a data frame. If…
R vectors are atomic, which means they have homogeneous data types. They are contiguous in…