R dplyr::summarise() or summarize() Function

The dplyr summarise() (or summarize()) function aggregates data into a single summary value for each group, or for the entire dataset if it is ungrouped.

It collapses multiple rows into a concise statistical summary, such as a mean, sum, or count.

Developers often use summarize() with group_by(), which splits the data into groups based on one or more categorical variables (columns you can use for grouping).

The summary function(s), such as mean() and sum(), are then applied to each group separately.

The resulting data frame (tibble) has a single row for ungrouped data, or one row per group, holding the summary statistic(s) for the entire data frame or for that group.

Syntax

summarise(.data, ..., .by = NULL, .groups = NULL)

# OR

summarize(.data, ..., .by = NULL, .groups = NULL)

Parameters

Name Value
.data An input data frame, tibble, or other supported table type.
... Name-value pairs of summary expressions, such as avg_price = mean(Share_Price) or n = n(). Each name becomes a column in the output.
.by An optional selection of columns to group by for this one operation, as an alternative to group_by().
.groups It determines the grouping structure of the result: "drop_last", "drop", "keep", or "rowwise".
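As a rough illustration of how these parameters fit together, the sketch below summarises a small made-up data frame twice: once with .by and once with group_by() plus .groups = "drop". The data frame name stocks and the output column names are chosen only for this example, and the .by argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Small illustrative data frame (names and values chosen for this sketch)
stocks <- data.frame(
  Industry = c("Technology", "Automotive", "Technology", "Banking"),
  Share_Price = c(150.75, 720.46, 210.22, 135.78)
)

# ...: name-value pairs of summary expressions
# .by: group for this one call only; the result comes back ungrouped
by_result <- stocks %>%
  summarise(
    avg_price = mean(Share_Price),
    n_stocks = n(),
    .by = Industry
  )

# Same summaries with group_by(); .groups = "drop" removes all grouping
grouped_result <- stocks %>%
  group_by(Industry) %>%
  summarise(
    avg_price = mean(Share_Price),
    n_stocks = n(),
    .groups = "drop"
  )

print(by_result)
print(grouped_result)
```

Both results hold the same numbers. One visible difference is row order: .by keeps groups in order of first appearance, while group_by() sorts them by the grouping column.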

Example data frame

We will use the stock_market_data data frame.

library(dplyr)

stock_market_data <- data.frame(
  Stock_Symbol = c("AAPL", "TSLA", "MSFT", "NVDA", "JPM", "AMZN", "GOOGL"),
  Industry = c(
    "Technology", "Automotive", "Technology",
    "Technology", "Banking", "E-commerce", "Technology"
  ),
  Share_Price = c(150.75, 720.46, 210.22, 200.36, 135.78, 3456.22, 2800.12)
)

print(stock_market_data)

Basic summary of the data frame

Without any grouping, summarise() returns a single row. Here, we calculate the mean of the Share_Price column for the entire data frame.

library(dplyr)

stock_market_data <- data.frame(
  Stock_Symbol = c("AAPL", "TSLA", "MSFT", "NVDA", "JPM", "AMZN", "GOOGL"),
  Industry = c(
    "Technology", "Automotive", "Technology",
    "Technology", "Banking", "E-commerce", "Technology"
  ),
  Share_Price = c(150.75, 720.46, 210.22, 200.36, 135.78, 3456.22, 2800.12)
)


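print(stock_market_data)

# Basic summary: mean share price across all rows
stock_market_data %>%
  summarise(avg_price = mean(Share_Price))

Output

  avg_price
1  1096.273

The output shows that we used the aggregate function mean() on the Share_Price column to calculate the average price for the entire data frame.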

You can read the result as follows: across the seven companies in this data frame, one share costs about $1,096.27 on average.

Grouped summary

A summary without groups is like counting your candy without sorting it by flavor.

Let’s calculate the average share price by industry:

library(dplyr)

stock_market_data <- data.frame(
  Stock_Symbol = c("AAPL", "TSLA", "MSFT", "NVDA", "JPM", "AMZN", "GOOGL"),
  Industry = c(
    "Technology", "Automotive", "Technology",
    "Technology", "Banking", "E-commerce", "Technology"
  ),
  Share_Price = c(150.75, 720.46, 210.22, 200.36, 135.78, 3456.22, 2800.12)
)


print(stock_market_data)

# Grouped Summary
stock_market_data %>%
  group_by(Industry) %>%
  summarise(avg_price = mean(Share_Price))

Output

The result shows the average Share_Price for each Industry. The output is a 4 x 2 tibble: one row per industry and two columns (Industry and avg_price).
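A grouped average is easier to interpret when you also know how many rows each group contains. The sketch below extends the grouped summary with n() to count the stocks per industry and sorts the result with arrange(); the names industry_summary and n_stocks are chosen only for this example.

```r
library(dplyr)

stock_market_data <- data.frame(
  Stock_Symbol = c("AAPL", "TSLA", "MSFT", "NVDA", "JPM", "AMZN", "GOOGL"),
  Industry = c(
    "Technology", "Automotive", "Technology",
    "Technology", "Banking", "E-commerce", "Technology"
  ),
  Share_Price = c(150.75, 720.46, 210.22, 200.36, 135.78, 3456.22, 2800.12)
)

# Count the rows in each group and average the price, highest average first
industry_summary <- stock_market_data %>%
  group_by(Industry) %>%
  summarise(
    n_stocks = n(),
    avg_price = mean(Share_Price)
  ) %>%
  arrange(desc(avg_price))

print(industry_summary)
```

Here, Technology is the only industry with more than one stock, so its average is the only one that differs from a single share price.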

Multiple summaries

You can compute several summaries in a single summarise() call. Here, we calculate the average, minimum, and maximum share price for each industry.

library(dplyr)

stock_market_data <- data.frame(
  Stock_Symbol = c("AAPL", "TSLA", "MSFT", "NVDA", "JPM", "AMZN", "GOOGL"),
  Industry = c(
    "Technology", "Automotive", "Technology",
    "Technology", "Banking", "E-commerce", "Technology"
  ),
  Share_Price = c(150.75, 720.46, 210.22, 200.36, 135.78, 3456.22, 2800.12)
)


print(stock_market_data)


# Multiple Summaries
stock_market_data %>%
  group_by(Industry) %>%
  summarise(
    avg_price = mean(Share_Price),
    min_price = min(Share_Price),
    max_price = max(Share_Price)
  )

Output

In this code, we calculated the average, minimum, and maximum share price (multiple summaries) of the stocks in each industry. The output is a 4 x 4 tibble: one row per industry, with the Industry column plus three summary columns.

summarise(across()) with Multiple Columns

The dplyr::across() function allows us to apply multiple summary functions to multiple columns simultaneously.

It is helpful when we want to calculate statistics such as the min, max, and mean for several columns in a single dplyr::summarise() call.

library(dplyr)

stock_market_data <- data.frame(
  Stock_Symbol = c("AAPL", "TSLA", "MSFT", "NVDA", "JPM", "AMZN", "GOOGL"),
  Industry = c(
    "Technology", "Automotive", "Technology",
    "Technology", "Banking", "E-commerce", "Technology"
  ),
  Share_Price = c(150.75, 720.46, 210.22, 200.36, 135.78, 3456.22, 2800.12)
)


print(stock_market_data)


# Summarize a column with multiple functions, plus one extra summary
stock_market_data %>%
  summarise(
    across(
      Share_Price, # Column to summarize
      list(mean = mean, max = max), # Functions to apply
      .names = "{.col}_{.fn}" # Custom naming for output columns
    ),
    unique_industries = n_distinct(Industry) # Additional summary
  )

Output

In this code example, across() calculated the mean and the maximum of Share_Price, which produced the columns Share_Price_mean and Share_Price_max.

We also counted the number of unique industries in our data frame with n_distinct().

The result is a single row that combines several aggregate functions applied to more than one column.
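To see across() work on more than one column at once, the sketch below adds a second numeric column and selects every numeric column with where(is.numeric). The Volume column and its values are hypothetical and are not part of the article's dataset; the names stocks and multi_col are likewise chosen only for this example.

```r
library(dplyr)

stocks <- data.frame(
  Industry = c("Technology", "Automotive", "Technology", "Banking"),
  Share_Price = c(150.75, 720.46, 210.22, 135.78),
  Volume = c(1200, 800, 950, 400) # hypothetical values for illustration
)

# Apply mean() and max() to every numeric column within each industry
multi_col <- stocks %>%
  group_by(Industry) %>%
  summarise(
    across(
      where(is.numeric),            # Selects Share_Price and Volume
      list(mean = mean, max = max), # Functions to apply
      .names = "{.col}_{.fn}"       # e.g. Share_Price_mean, Volume_max
    )
  )

print(multi_col)
```

Each selected column yields one output column per function, so two columns and two functions give four summary columns next to Industry.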

Published by
Krunal Lathiya