R dplyr::summarise() or summarize() Function

The dplyr summarise() function (also spelled summarize()) aggregates data into a single summary value for each group, or for the entire dataset if it is ungrouped.

It collapses multiple rows into a concise statistical summary, such as a mean, sum, or count.

Developers often use summarize() with group_by(), which splits the data into groups based on one or more categorical variables (columns whose values define the groups). The summary functions, such as mean() or sum(), are then applied to each group separately, producing one summary per group.

The resulting data frame (or tibble) has a single row for ungrouped data, or one row per group for grouped data, holding the summary statistic(s) for the entire data frame or for that group.

Syntax

summarise(.data, ..., .by = NULL, .groups = NULL)

# OR

summarize(.data, ..., .by = NULL, .groups = NULL)

Parameters

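Name Value
.data An input data frame, tibble, or other supported table type.
... Name-value pairs of summary expressions, such as avg_price = mean(Share_Price). Each name becomes a column in the output.
.by An optional selection of columns to group by for this one operation, as an alternative to group_by().
.groups Controls the grouping structure of the result. Accepted values are "drop_last", "drop", "keep", and "rowwise".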
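As a rough illustration of the .by and .groups arguments, the sketch below reuses the stock data from this article. It assumes dplyr 1.1.0 or later, since .by is not available in older versions; the result names by_result, default_result, and kept_result are only illustrative.

```r
library(dplyr)

stock_market_data <- data.frame(
  Stock_Symbol = c("AAPL", "TSLA", "MSFT", "NVDA", "JPM", "AMZN", "GOOGL"),
  Industry = c(
    "Technology", "Automotive", "Technology",
    "Technology", "Banking", "E-commerce", "Technology"
  ),
  Share_Price = c(150.75, 720.46, 210.22, 200.36, 135.78, 3456.22, 2800.12)
)

# .by groups for this one call only; the result is never grouped
by_result <- stock_market_data %>%
  summarise(avg_price = mean(Share_Price), .by = Industry)

# Default behaviour ("drop_last"): with a single grouping column,
# the last (and only) grouping level is dropped from the result
default_result <- stock_market_data %>%
  group_by(Industry) %>%
  summarise(avg_price = mean(Share_Price))

# .groups = "keep" leaves the result grouped by Industry
kept_result <- stock_market_data %>%
  group_by(Industry) %>%
  summarise(avg_price = mean(Share_Price), .groups = "keep")

is_grouped_df(by_result)      # FALSE
is_grouped_df(default_result) # FALSE
group_vars(kept_result)       # "Industry"
```

Keeping the grouping is mainly useful when another grouped operation follows in the same pipeline; otherwise an ungrouped result is usually easier to work with.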

Example data frame

We will use the stock_market_data data frame.

library(dplyr)

stock_market_data <- data.frame(
  Stock_Symbol = c("AAPL", "TSLA", "MSFT", "NVDA", "JPM", "AMZN", "GOOGL"),
  Industry = c(
    "Technology", "Automotive", "Technology",
    "Technology", "Banking", "E-commerce", "Technology"
  ),
  Share_Price = c(150.75, 720.46, 210.22, 200.36, 135.78, 3456.22, 2800.12)
)

print(stock_market_data)

stock_market_data Data Frame

Basic summary of the data frame

We can produce a basic summary of the Share_Price column by calculating its mean across the entire data frame, without any grouping.

library(dplyr)

stock_market_data <- data.frame(
  Stock_Symbol = c("AAPL", "TSLA", "MSFT", "NVDA", "JPM", "AMZN", "GOOGL"),
  Industry = c(
    "Technology", "Automotive", "Technology",
    "Technology", "Banking", "E-commerce", "Technology"
  ),
  Share_Price = c(150.75, 720.46, 210.22, 200.36, 135.78, 3456.22, 2800.12)
)


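print(stock_market_data)

# Basic summary: mean share price for the whole data frame
stock_market_data %>%
  summarise(avg_price = mean(Share_Price))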

Output

Basic summary of data frame in R using summarise() function

The above output figure shows that we applied the aggregate function mean() to the Share_Price column to get the average price for the entire data frame.

You can read the result like this: across the seven stocks, one share costs about 1096.273 on average. The mean is pulled upward by the two most expensive stocks (AMZN and GOOGL), so most stocks in the list cost far less than this figure.
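To put the mean in context, a sketch like the following computes the minimum, median, and maximum next to it in one ungrouped summarise() call. The output column names and the price_spread variable are only illustrative.

```r
library(dplyr)

stock_market_data <- data.frame(
  Stock_Symbol = c("AAPL", "TSLA", "MSFT", "NVDA", "JPM", "AMZN", "GOOGL"),
  Industry = c(
    "Technology", "Automotive", "Technology",
    "Technology", "Banking", "E-commerce", "Technology"
  ),
  Share_Price = c(150.75, 720.46, 210.22, 200.36, 135.78, 3456.22, 2800.12)
)

# Several statistics for the same column, still a single output row
price_spread <- stock_market_data %>%
  summarise(
    min_price = min(Share_Price),       # 135.78
    median_price = median(Share_Price), # 210.22
    mean_price = mean(Share_Price),     # about 1096.273
    max_price = max(Share_Price)        # 3456.22
  )

print(price_spread)
```

The gap between the median (210.22) and the mean (about 1096.273) shows how strongly a couple of high prices skew the average.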

Grouped summary

A summary without groups is like counting your candy without sorting it by flavor.

Let’s calculate the average share price by industry:

library(dplyr)

stock_market_data <- data.frame(
  Stock_Symbol = c("AAPL", "TSLA", "MSFT", "NVDA", "JPM", "AMZN", "GOOGL"),
  Industry = c(
    "Technology", "Automotive", "Technology",
    "Technology", "Banking", "E-commerce", "Technology"
  ),
  Share_Price = c(150.75, 720.46, 210.22, 200.36, 135.78, 3456.22, 2800.12)
)


print(stock_market_data)

# Grouped Summary
stock_market_data %>%
  group_by(Industry) %>%
  summarise(avg_price = mean(Share_Price))

Output

Grouped Summary using summarise() and group_by()

The above figure shows that we calculated the average Share_Price for each Industry. The output is a 4 x 2 tibble: one row for each of the four industries, and two columns (Industry and avg_price).
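A grouped average is easier to interpret when you also know how many rows went into each group. The sketch below adds a count with n() and sorts the result with arrange(); the column name n_stocks and the variable by_industry are only illustrative.

```r
library(dplyr)

stock_market_data <- data.frame(
  Stock_Symbol = c("AAPL", "TSLA", "MSFT", "NVDA", "JPM", "AMZN", "GOOGL"),
  Industry = c(
    "Technology", "Automotive", "Technology",
    "Technology", "Banking", "E-commerce", "Technology"
  ),
  Share_Price = c(150.75, 720.46, 210.22, 200.36, 135.78, 3456.22, 2800.12)
)

by_industry <- stock_market_data %>%
  group_by(Industry) %>%
  summarise(
    n_stocks = n(),                # number of rows in each group
    avg_price = mean(Share_Price)  # average price within each group
  ) %>%
  arrange(desc(avg_price))         # most expensive industry first

print(by_industry)
```

Here Technology is the only industry whose average is based on more than one stock (four of them); the other three averages are simply the price of a single stock.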

Multiple summaries

Multiple summaries means calculating several statistics in one summarise() call. Here we calculate the average, minimum, and maximum share price for the data frame grouped by industry.

library(dplyr)

stock_market_data <- data.frame(
  Stock_Symbol = c("AAPL", "TSLA", "MSFT", "NVDA", "JPM", "AMZN", "GOOGL"),
  Industry = c(
    "Technology", "Automotive", "Technology",
    "Technology", "Banking", "E-commerce", "Technology"
  ),
  Share_Price = c(150.75, 720.46, 210.22, 200.36, 135.78, 3456.22, 2800.12)
)


print(stock_market_data)


# Multiple Summaries
stock_market_data %>%
  group_by(Industry) %>%
  summarise(
    avg_price = mean(Share_Price),
    min_price = min(Share_Price),
    max_price = max(Share_Price)
  )

Output

Multiple summaries using summarise() method in R

In this code, we calculated the average, minimum, and maximum share price (multiple summaries) of stocks based on Industry. The output is a 4 x 4 tibble: four industries, and the columns Industry, avg_price, min_price, and max_price.
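Expressions inside one summarise() call are evaluated in order, so a later summary can refer to one created earlier in the same call. As a small sketch, the price_range column below (an illustrative name, as is the ranges variable) is built from min_price and max_price.

```r
library(dplyr)

stock_market_data <- data.frame(
  Stock_Symbol = c("AAPL", "TSLA", "MSFT", "NVDA", "JPM", "AMZN", "GOOGL"),
  Industry = c(
    "Technology", "Automotive", "Technology",
    "Technology", "Banking", "E-commerce", "Technology"
  ),
  Share_Price = c(150.75, 720.46, 210.22, 200.36, 135.78, 3456.22, 2800.12)
)

ranges <- stock_market_data %>%
  group_by(Industry) %>%
  summarise(
    min_price = min(Share_Price),
    max_price = max(Share_Price),
    # Uses the two summaries defined just above
    price_range = max_price - min_price
  )

print(ranges)
```

Industries with a single stock get a price_range of 0, while Technology spans the gap between its cheapest and most expensive stock.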

summarise(across()) with Multiple Columns

The dplyr::across() function allows us to apply multiple summary functions to multiple columns simultaneously. It is helpful when we want to calculate min, max, count, and mean for multiple columns with the dplyr::summarise() function.

library(dplyr)

stock_market_data <- data.frame(
  Stock_Symbol = c("AAPL", "TSLA", "MSFT", "NVDA", "JPM", "AMZN", "GOOGL"),
  Industry = c(
    "Technology", "Automotive", "Technology",
    "Technology", "Banking", "E-commerce", "Technology"
  ),
  Share_Price = c(150.75, 720.46, 210.22, 200.36, 135.78, 3456.22, 2800.12)
)


print(stock_market_data)


# Summarize a column with multiple functions using across()
stock_market_data %>%
  summarise(
    across(
      Share_Price, # Column(s) to summarize
      list(mean = mean, max = max), # Functions to apply
      .names = "{.col}_{.fn}" # Custom naming for output columns
    ),
    unique_industries = n_distinct(Industry) # Additional summary
  )

Output

Output of summarise(across()) with Multiple Columns

In this code example, across() applied two functions, mean() and max(), to the Share_Price column, producing the columns Share_Price_mean and Share_Price_max.

We also counted the number of unique industries in our data frame with n_distinct().

So, a single summarise() call combined several aggregate functions. To summarize more columns the same way, list them in the first argument of across().
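The example data has only one numeric column, so to sketch across() over more than one column, the code below first derives a second one. Lot_Cost (the cost of 100 shares) is a hypothetical column added purely for illustration, as is the multi_summary variable.

```r
library(dplyr)

stock_market_data <- data.frame(
  Stock_Symbol = c("AAPL", "TSLA", "MSFT", "NVDA", "JPM", "AMZN", "GOOGL"),
  Industry = c(
    "Technology", "Automotive", "Technology",
    "Technology", "Banking", "E-commerce", "Technology"
  ),
  Share_Price = c(150.75, 720.46, 210.22, 200.36, 135.78, 3456.22, 2800.12)
)

multi_summary <- stock_market_data %>%
  # Hypothetical second numeric column: cost of a lot of 100 shares
  mutate(Lot_Cost = Share_Price * 100) %>%
  summarise(
    across(
      c(Share_Price, Lot_Cost),     # two columns to summarize
      list(mean = mean, max = max), # two functions applied to each column
      .names = "{.col}_{.fn}"       # e.g. Share_Price_mean, Lot_Cost_max
    )
  )

print(multi_summary)
```

Two columns and two functions give four output columns, named by combining the column name and the function name.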
