R Advanced

How to Calculate the Mean by Group in R Data Frame

Here are three ways to calculate the mean by group for single or multiple columns in the R data frame:

  1. Using base R’s aggregate()
  2. Using dplyr’s group_by() and summarize()
  3. Using data.table package

What does it mean when we say “mean by group”? It means grouping the data based on the values of single or multiple columns and then calculating the mean (average) of those values, but doing so separately for each group.

We will use the sample data frame below to perform practicals:

df_sales_data <- data.frame(
  Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
  Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
  Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
  Quantity = c(5, 10, 5, 2, 3, 12, 2),
  stringsAsFactors = FALSE
)

print(df_sales_data)

Method 1: Using base R’s aggregate()

The aggregate() function splits the data into subgroups and calculates the summary for each group. In the context of our article, we can split the data based on the specific column(s) groups and calculate the mean for each group.

Mean of a single column grouped by a single column

In our df_sales_data dataset, we can group the data frame by the Category column and calculate the mean of only one column (Price) for each Category.

The aggregate() function accepts three arguments:

  1. Price ~ Category: The Price column values should be grouped by Category.
  2. df_sales_data: It is an input data frame
  3. FUN: We want to calculate the mean of Price grouped by Category.
df_sales_data <- data.frame(
  Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
  Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
  Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
  Quantity = c(5, 10, 5, 2, 3, 12, 2),
  stringsAsFactors = FALSE
)

print(df_sales_data)

# Mean Price by Category
aggregate(Price ~ Category, data = df_sales_data, FUN = mean)

Output

In our data frame,

  1. The bakery category has only one product, Bread, which has a price of 1.8, so the mean price is 1.8000000.
  2. The Dairy category has three products: Milk, Butter, and Milk, and their prices are 2.5, 2, and 2.5. The total price is 7, and if you divide 7 by 3, the mean is 2.333333.
  3. The Fruit category has three products: Apple, Banana, and Apple. Their respective prices are 1.2, 0.5, and 1.2. The mean of these prices is 0.9666667.

Mean of multiple columns grouped by a single column

You can calculate the mean of multiple columns (e.g., Price and Quantity) grouped by a single column (Category) using the aggregate() function.

df_sales_data <- data.frame(
  Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
  Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
  Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
  Quantity = c(5, 10, 5, 2, 3, 12, 2),
  stringsAsFactors = FALSE
)

print(df_sales_data)


# Mean Price and Quantity by Category
aggregate(cbind(Price, Quantity) ~ Category, data = df_sales_data, FUN = mean)

Output

Mean of multiple columns grouped by multiple variables

As the name suggests, we can calculate the mean of multiple columns (Price and Quantity) grouped by multiple categorical variables (Category and Product).

df_sales_data <- data.frame(
  Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
  Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
  Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
  Quantity = c(5, 10, 5, 2, 3, 12, 2),
  stringsAsFactors = FALSE
)

print(df_sales_data)

# Mean Price and Quantity group by Category and Product
aggregate(cbind(Price, Quantity) ~ Category + Product, data = df_sales_data, FUN = mean)

Output

Method 2: Using dplyr’s group_by() and summarise()

If you combine the group_by() and summarise() functions, the output will be in a tibble.

The group_by() method accepts a data frame and one or more columns as arguments, grouping the data frame based on the unique values of the provided columns.

Then, we use the summarise() function, which accepts a function—mean in our case—to calculate the mean for each group.

Before using dplyr, you must install it in your environment and then load it using the code below:

library(dplyr)

Mean of a Single Column Grouped by a Single Column

Let’s calculate the mean of Price grouped by Category:

library(dplyr)


df_sales_data <- data.frame(
  Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
  Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
  Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
  Quantity = c(5, 10, 5, 2, 3, 12, 2),
  stringsAsFactors = FALSE
)

print(df_sales_data)

# Calculate the mean of Price grouped by Category:
df_sales_data %>%
  group_by(Category) %>%
  summarise(Mean_Price = mean(Price, na.rm = TRUE))

Output

Mean of multiple columns grouped by a single column

Let’s find the mean of Price and Quantity, grouped by Category.

library(dplyr)


df_sales_data <- data.frame(
  Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
  Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
  Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
  Quantity = c(5, 10, 5, 2, 3, 12, 2),
  stringsAsFactors = FALSE
)

print(df_sales_data)

# Calculate the mean of Price and Quantity grouped by Category

df_sales_data %>%
  group_by(Category) %>%
  summarise(
    Mean_Price = mean(Price, na.rm = TRUE),
    Mean_Quantity = mean(Quantity, na.rm = TRUE)
  )

Output

Mean of Multiple Columns Grouped by Multiple Variables

Let’s calculate the mean of Price and Quantity grouped by Category and Product:

library(dplyr)


df_sales_data <- data.frame(
  Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
  Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
  Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
  Quantity = c(5, 10, 5, 2, 3, 12, 2),
  stringsAsFactors = FALSE
)

print(df_sales_data)

# Calculate the mean of Price and Quantity grouped by Category and Product
df_sales_data %>%
   group_by(Category, Product) %>%
   summarise(
   Mean_Price = mean(Price, na.rm = TRUE),
   Mean_Quantity = mean(Quantity, na.rm = TRUE)
 )

Output

 

Weighted mean

A weighted mean (also known as a weighted average) is a mean where some values contribute more significantly than others to the final result.

library(dplyr)


df_sales_data <- data.frame(
  Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
  Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
  Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
  Quantity = c(5, 10, 5, 2, 3, 12, 2),
  stringsAsFactors = FALSE
)

print(df_sales_data)

# Calculate weighted mean
df_sales_data %>%
  group_by(Category) %>%
  summarize(weighted_price = weighted.mean(Price, Quantity))

Output

Method 3: Using data.table

Install the data.table package if you have not installed it already:

Let’s find the basic grouped mean of Price and Quantity by Category:

library(dplyr)


df_sales_data <- data.frame(
  Product = c("Apple", "Banana", "Apple", "Milk", "Bread", "Butter", "Milk"),
  Category = c("Fruit", "Fruit", "Fruit", "Dairy", "Bakery", "Dairy", "Dairy"),
  Price = c(1.2, 0.5, 1.2, 2.5, 1.8, 2.0, 2.5),
  Quantity = c(5, 10, 5, 2, 3, 12, 2),
  stringsAsFactors = FALSE
)

print(df_sales_data)

# Basic Grouped Mean using data.table
dt <- as.data.table(df_sales_data)
dt[, .(mean_price = mean(Price), mean_quantity = mean(Quantity)), by = Category]

Output

That’s all!

Recent Posts

cbind() Function: Binding R Objects by Columns

R cbind (column bind) is a function that combines specified vectors, matrices, or data frames…

1 week ago

rbind() Function: Binding Rows in R

The rbind() function combines R objects, such as vectors, matrices, or data frames, by rows.…

1 week ago

as.numeric(): Converting to Numeric Values in R

The as.numeric() function in R converts valid non-numeric data into numeric data. What do I…

2 weeks ago

Calculating Natural Log using log() Function in R

The log() function calculates the natural logarithm (base e) of a numeric vector. By default,…

3 weeks ago

Dollar Sign ($ Operator) in R

In R, you can use the dollar sign ($ operator)  to access elements (columns) of…

1 month ago

Calculating Absolute Value using abs() Function in R

The abs() function calculates the absolute value of a numeric input, returning a non-negative (only…

1 month ago