How to Calculate Standard Deviation by Group in R

Calculating standard deviation (SD) by group in a data frame means computing the SD of a numeric column separately for each unique value of a categorical column (the grouping variable). It shows how the variability of the data differs across groups.

Here are three ways to calculate SD by group:

  1. Using dplyr
  2. Using data.table
  3. Using base R’s aggregate()

The demo data frame for this tutorial is as follows:

library(dplyr)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

print(employee_data)

Method 1: Using dplyr

The dplyr package provides data manipulation functions such as summarise() and group_by(). Through them, you can apply the sd() function to a group and get the standard deviation for each group. It is an excellent choice for balancing efficiency and code readability.

Install and import the “dplyr” package at the start of your R file.
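
As a minimal sketch of that setup step, the snippet below installs dplyr only when it is missing and then attaches it. It assumes a CRAN mirror is configured and that the machine has network access; neither is stated in this tutorial.

```r
# Install dplyr only if it is not already available
# (assumption: a CRAN mirror is configured for install.packages())
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Attach the package so group_by(), summarize(), and %>% are available
library(dplyr)
```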

Let’s calculate the variability of salaries in each department (HR, Finance, and IT):

library(dplyr)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

print(employee_data)

# Applying sd() function on Salary by Department
employee_data %>%
  group_by(Department) %>%
  summarize(
    Salary_sd = sd(Salary)
  )

The result has one row per department and a Salary_sd column holding the SD of Salary for that department. But how did we get those values? Here, SD measures how much salaries vary within each department.

For example,

  1. For the Finance department, the SD of salary is around 1414. The two employees earn 75000 and 77000, so the mean is 76000 and each salary deviates from it by 1000. The SD is about 1414 rather than 1000 because sd() computes the sample standard deviation, which divides the sum of squared deviations by n - 1: sqrt((1000^2 + 1000^2) / (2 - 1)) ≈ 1414.
  2. For the HR department, the SD of salary among its three employees is around 5292.
  3. For the IT department, the SD of salary among its two employees is around 3536.
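
The Finance figure can be reproduced by hand in base R. This is a small sketch for illustration; the variable names finance_salary, deviations, and manual_sd are made up for this example and are not part of the tutorial's data frame.

```r
# Finance salaries taken from the demo data frame
finance_salary <- c(75000, 77000)

n <- length(finance_salary)
deviations <- finance_salary - mean(finance_salary)  # -1000 and 1000

# Sample SD: sum of squared deviations divided by n - 1, then square root
manual_sd <- sqrt(sum(deviations^2) / (n - 1))

print(manual_sd)           # about 1414.21
print(sd(finance_salary))  # same value from the built-in function

# A group with a single row has no sample SD, so sd() returns NA
print(sd(75000))
```

The last line is worth keeping in mind for grouped summaries: any department with only one employee would show NA instead of a number.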

Using across() to apply SD to multiple columns

Our data frame contains two numeric columns: Salary and Bonus. We will calculate the SD of both of these columns grouped by Department.

library(dplyr)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

print(employee_data)

# Applying sd() function on Salary and Bonus columns by Department using across()

employee_data %>%
  group_by(Department) %>%
  summarize(across(c(Salary, Bonus), sd))
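
By default, across() keeps the original column names, so the output columns are still called Salary and Bonus. If you prefer names that say what they contain, one option is the .names argument of across(). The sketch below assumes a dplyr version that provides across() (1.0.0 or later); the result name sd_by_department is only illustrative.

```r
library(dplyr)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

# "{.col}_sd" appends "_sd" to each summarized column name
sd_by_department <- employee_data %>%
  group_by(Department) %>%
  summarize(across(c(Salary, Bonus), sd, .names = "{.col}_sd"))

print(sd_by_department)
```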

Method 2: Using data.table

Consider the data.table approach if your dataset is very large, because its grouped operations are typically faster and more memory-efficient than the alternatives shown here. However, its syntax is less readable than dplyr's.

First, convert the data frame into a data table using the as.data.table() function, and then find the SD of Salary by Department.

library(data.table)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

print(employee_data)


employee_data_table <- as.data.table(employee_data)

# Executing sd() function on Salary by Department using data.table

employee_data_table[
  , .(Salary_sd = sd(Salary)),
  by = Department
]
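
The same pattern extends to several columns. A common data.table idiom, sketched here under the assumption that you want both Salary and Bonus, is to apply sd() over .SD and restrict it with .SDcols. The name sd_table is only illustrative.

```r
library(data.table)

employee_data_table <- data.table(
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

# .SD holds the columns listed in .SDcols for each group;
# lapply() applies sd() to each of them
sd_table <- employee_data_table[
  , lapply(.SD, sd),
  by = Department,
  .SDcols = c("Salary", "Bonus")
]

print(sd_table)
```

Note that data.table returns groups in order of first appearance (HR, IT, Finance), not alphabetically.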

Method 3: Using aggregate()

If you don’t want to use any packages and prefer to stick with base R, you can use the aggregate() function. It groups the data by the specified column and applies a summary function (such as sd()) to the specified columns.

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

print(employee_data)

# Executing sd() function on Salary by Department using aggregate() function

aggregate(
  cbind(Salary) ~ Department,
  data = employee_data,
  FUN = sd
)

The output has one row per department (sorted alphabetically) with the SD of Salary, and the values match the dplyr and data.table results.

If your data set is small and you want quick statistics, you can always use the aggregate() function.
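
aggregate() can also summarize more than one column in a single call by listing them inside cbind() on the left side of the formula. The following is a small sketch using the same demo data; the name sd_base is only illustrative.

```r
employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

# cbind() on the left-hand side selects the columns to summarize;
# the right-hand side names the grouping column
sd_base <- aggregate(
  cbind(Salary, Bonus) ~ Department,
  data = employee_data,
  FUN = sd
)

print(sd_base)
```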

Visualization of variability

Let’s visualize the group variability using ggplot2:

library(dplyr)
library(ggplot2)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

result_dplyr <- employee_data %>%
  group_by(Department) %>%
  summarize(
    Salary_sd = sd(Salary)
  )

ggplot(result_dplyr, aes(x = Department, y = Salary_sd)) +
  geom_col() +
  labs(title = "Salary Standard Deviation by Department")

That’s all!
