What do we mean by “mean by group”? It means splitting the rows of a data frame into groups based on the values of one or more columns, and then calculating the mean (average) of a numeric column separately for each group.
Here are three ways to calculate the mean by group for single or multiple columns in an R data frame: the base R aggregate() function, dplyr's group_by() and summarise() functions, and the data.table package.
The mean is a statistical operation used in data summarization, comparative analysis, and preprocessing. It helps us condense data into meaningful summaries.
We will use the following sample data frame in the examples:
df_sales_data <- data.frame(
Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
Quantity = c(5, 10, 5, 2, 3, 12, 2),
stringsAsFactors = FALSE
)
print(df_sales_data)
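To make the definition concrete, here is a minimal base R sketch of what "mean by group" does under the hood: split one column by the grouping column, then average each piece. The variable names price_by_category and group_means are illustrative and not part of the original examples.

```r
# Same sample data as in the article, repeated so the sketch runs on its own
df_sales_data <- data.frame(
  Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
  Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
  Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
  Quantity = c(5, 10, 5, 2, 3, 12, 2),
  stringsAsFactors = FALSE
)

# Step 1: split the Price values into one vector per Category
price_by_category <- split(df_sales_data$Price, df_sales_data$Category)

# Step 2: take the mean of each vector separately
group_means <- sapply(price_by_category, mean)
print(group_means)

# tapply() performs the same split-and-apply in a single call
print(tapply(df_sales_data$Price, df_sales_data$Category, mean))
```

The functions covered in the rest of the article do the same job, but return a data frame (or tibble, or data.table) instead of a named vector.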
The aggregate() function, available in base R, splits the data into subgroups and computes a summary statistic for each one. Here, we split the data by one or more grouping columns and calculate the mean for each group.
In our df_sales_data dataset, we can group the data frame based on the Category column and find the mean of only one column (Price) for each Category.
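As used here, the aggregate() function accepts three arguments: a formula of the form value ~ group (for example, Price ~ Category), data (the data frame to summarize), and FUN (the function applied to each group, mean in our case).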
df_sales_data <- data.frame(
Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
Quantity = c(5, 10, 5, 2, 3, 12, 2),
stringsAsFactors = FALSE
)
print(df_sales_data)
# Mean Price by Category
aggregate(Price ~ Category, data = df_sales_data, FUN = mean)
Output
In our data frame, the mean Price is 1.8 for Bakery, about 2.33 for Dairy, and about 0.97 for Fruit.
You can calculate the mean of multiple columns (e.g., Price and Quantity) grouped by a single column (Category) using the aggregate() function.
df_sales_data <- data.frame(
Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
Quantity = c(5, 10, 5, 2, 3, 12, 2),
stringsAsFactors = FALSE
)
print(df_sales_data)
# Mean Price and Quantity by Category
aggregate(cbind(Price, Quantity) ~ Category, data = df_sales_data, FUN = mean)
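Output
The result has one row per Category, with a mean Price and mean Quantity of 1.8 and 3 for Bakery, about 2.33 and 5.33 for Dairy, and about 0.97 and 6.67 for Fruit.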
We can also calculate the mean of multiple columns (Price and Quantity) grouped by multiple categorical variables (Category and Product). To do so, join the grouping columns with + on the right-hand side of the formula.
df_sales_data <- data.frame(
Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
Quantity = c(5, 10, 5, 2, 3, 12, 2),
stringsAsFactors = FALSE
)
print(df_sales_data)
# Mean Price and Quantity group by Category and Product
aggregate(cbind(Price, Quantity) ~ Category + Product, data = df_sales_data, FUN = mean)
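Output
Each Category and Product combination forms its own group: Apple (Fruit) has a mean Price of 1.2 and a mean Quantity of 5, Banana (Fruit) 0.5 and 10, Bread (Bakery) 1.8 and 3, Butter (Dairy) 2.0 and 12, and Milk (Dairy) 2.5 and 2.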
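One detail worth knowing about the formula interface is how it treats missing values. The sketch below uses a small hypothetical data frame (not the article's sample data) with one NA in Price. By default, the formula interface drops the whole row containing the NA, while the data-frame interface of aggregate() (the x and by arguments) lets you pass na.rm = TRUE through to mean() so that only the missing cell is skipped.

```r
# Hypothetical data with a missing Price, used only for this illustration
df <- data.frame(
  Category = c("Fruit", "Fruit", "Dairy", "Dairy"),
  Price = c(1.2, NA, 2.5, 2.0),
  Quantity = c(5, 10, 2, 12),
  stringsAsFactors = FALSE
)

# Formula interface: the default na.action (na.omit) removes row 2 entirely,
# so its Quantity of 10 is also excluded from the Fruit mean
by_formula <- aggregate(cbind(Price, Quantity) ~ Category, data = df, FUN = mean)
print(by_formula)

# Data-frame interface: extra arguments are passed on to FUN,
# so na.rm = TRUE skips the NA within each column independently
by_cols <- aggregate(
  df[c("Price", "Quantity")],
  by = list(Category = df$Category),
  FUN = mean,
  na.rm = TRUE
)
print(by_cols)
```

The two calls agree on Price but differ on the Fruit Quantity, so it is worth checking for NA values before choosing an interface.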
With the dplyr package, you combine the group_by() and summarise() functions; the result is returned as a tibble.
The group_by() function accepts a data frame and one or more columns as arguments and groups the data frame by the unique values of those columns.
Then, summarise() collapses each group into a single row. We pass it an expression that calls mean() on the column of interest to calculate the mean for each group.
Before using dplyr, install it once in your environment and then load it in each session:
install.packages("dplyr")
library(dplyr)
Let’s calculate the mean of Price grouped by Category:
library(dplyr)
df_sales_data <- data.frame(
Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
Quantity = c(5, 10, 5, 2, 3, 12, 2),
stringsAsFactors = FALSE
)
print(df_sales_data)
# Calculate the mean of Price grouped by Category:
df_sales_data %>%
group_by(Category) %>%
summarise(Mean_Price = mean(Price, na.rm = TRUE))
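Output
The result is a tibble with one row per Category: Mean_Price is 1.8 for Bakery, about 2.33 for Dairy, and about 0.97 for Fruit.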
Let’s find the mean of Price and Quantity, grouped by Category.
library(dplyr)
df_sales_data <- data.frame(
Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
Quantity = c(5, 10, 5, 2, 3, 12, 2),
stringsAsFactors = FALSE
)
print(df_sales_data)
# Calculate the mean of Price and Quantity grouped by Category
df_sales_data %>%
group_by(Category) %>%
summarise(
Mean_Price = mean(Price, na.rm = TRUE),
Mean_Quantity = mean(Quantity, na.rm = TRUE)
)
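Output
The tibble now has a Mean_Price and a Mean_Quantity column: 1.8 and 3 for Bakery, about 2.33 and 5.33 for Dairy, and about 0.97 and 6.67 for Fruit.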
Let’s calculate the mean of Price and Quantity grouped by Category and Product:
library(dplyr)
df_sales_data <- data.frame(
Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
Quantity = c(5, 10, 5, 2, 3, 12, 2),
stringsAsFactors = FALSE
)
print(df_sales_data)
# Calculate the mean of Price and Quantity grouped by Category and Product
df_sales_data %>%
group_by(Category, Product) %>%
summarise(
Mean_Price = mean(Price, na.rm = TRUE),
Mean_Quantity = mean(Quantity, na.rm = TRUE),
# Return an ungrouped tibble and silence the regrouping message
.groups = "drop"
)
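Output
The result has one row per Category and Product pair. Because the two Apple rows and the two Milk rows are identical, each group mean equals the original Price and Quantity.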
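When there are many numeric columns, writing one mean() call per column gets repetitive. As a sketch of one alternative, dplyr (version 1.0.0 or later) provides across(), which applies the same function to several columns at once. The Mean_ prefix produced by .names is chosen here only to match the column names used above.

```r
library(dplyr)

df_sales_data <- data.frame(
  Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
  Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
  Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
  Quantity = c(5, 10, 5, 2, 3, 12, 2),
  stringsAsFactors = FALSE
)

# Apply mean() to Price and Quantity in a single across() call
result <- df_sales_data %>%
  group_by(Category) %>%
  summarise(
    across(c(Price, Quantity), ~ mean(.x, na.rm = TRUE), .names = "Mean_{.col}"),
    .groups = "drop"
  )

print(result)
```

Replacing c(Price, Quantity) with where(is.numeric) would select every numeric column, which may be convenient for wider data frames.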
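A weighted mean (also called a weighted average) is a mean where some values contribute more than others to the final result. In the example below, each Price is weighted by its Quantity, so products sold in larger quantities have more influence on the group mean.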
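As a quick check of the idea, the following base R sketch computes a weighted mean by hand for the three Dairy rows of the sample data, assuming Quantity is used as the weight, and compares it with the built-in weighted.mean() function. The variable names are illustrative.

```r
# Prices and quantities of the three Dairy rows in the sample data
price <- c(2.5, 2.0, 2.5)
quantity <- c(2, 12, 2)

# Weighted mean by hand: sum of (value * weight) divided by the sum of weights
manual <- sum(price * quantity) / sum(quantity)

# Base R's weighted.mean() computes the same quantity
builtin <- weighted.mean(price, quantity)

print(c(manual = manual, builtin = builtin, unweighted = mean(price)))
```

Butter, with 12 units at a Price of 2.0, pulls the weighted mean (2.125) below the unweighted mean (about 2.33).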
library(dplyr)
df_sales_data <- data.frame(
Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
Quantity = c(5, 10, 5, 2, 3, 12, 2),
stringsAsFactors = FALSE
)
print(df_sales_data)
# Calculate weighted mean
df_sales_data %>%
group_by(Category) %>%
summarise(weighted_price = weighted.mean(Price, Quantity))
Output
Weighted by Quantity, the mean Price is 1.8 for Bakery, 2.125 for Dairy, and 0.85 for Fruit.
Install the data.table package if you have not installed it already:
install.packages("data.table")
Let’s find the basic grouped mean of Price and Quantity by Category:
library(data.table)
df_sales_data <- data.frame(
Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
Quantity = c(5, 10, 5, 2, 3, 12, 2),
stringsAsFactors = FALSE
)
print(df_sales_data)
# Basic Grouped Mean using data.table
dt <- as.data.table(df_sales_data)
dt[, .(mean_price = mean(Price), mean_quantity = mean(Quantity)), by = Category]
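Output
The result is a data.table with one row per Category, listed in order of first appearance (Fruit, Dairy, Bakery), with the same group means as the earlier examples.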
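To group by more than one column, or to average several columns without naming each one, data.table offers the .SD and .SDcols idiom. The following is a minimal sketch under the assumption that all columns listed in cols are numeric; cols and result are illustrative names.

```r
library(data.table)

dt <- data.table(
  Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
  Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
  Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
  Quantity = c(5, 10, 5, 2, 3, 12, 2)
)

# Columns to average; .SD holds just these columns for each group
cols <- c("Price", "Quantity")

# Mean of every column in cols, grouped by Category and Product
result <- dt[, lapply(.SD, mean, na.rm = TRUE),
             by = .(Category, Product),
             .SDcols = cols]

print(result)
```

Here the output keeps the original column names (Price and Quantity) rather than adding a mean_ prefix.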
For small-to-medium-sized data frames, you can use aggregate() or dplyr’s group_by() and summarise() functions. For large data frames, you can use the data.table package.
Krunal Lathiya is a seasoned Computer Science expert with over eight years in the tech industry. He boasts deep knowledge in Data Science and Machine Learning. Versed in Python, JavaScript, PHP, R, and Golang. Skilled in frameworks like Angular and React and platforms such as Node.js. His expertise spans both front-end and back-end development. His proficiency in the Python language stands as a testament to his versatility and commitment to the craft.