How to Calculate Standard Deviation by Group in R

Calculating standard deviation (SD) by group in a data frame means computing the SD of a numeric column separately for each unique value of a categorical (grouping) column. It shows how the variability of the data differs across groups.

Here are three ways to calculate SD by group:

  1. Using dplyr
  2. Using data.table
  3. Using base R’s aggregate()

The demo data frame for this tutorial is as follows:

library(dplyr)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

print(employee_data)

Figure: The employee_data data frame

Method 1: Using dplyr

The dplyr package provides data manipulation functions such as group_by() and summarise(). Used together, they apply the sd() function within each group and return one standard deviation per group. This approach offers a good balance between efficiency and code readability.

Install the dplyr package if you have not already, and load it at the start of your R script.
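As a minimal sketch of that setup step, the snippet below installs dplyr only when it is missing and then loads it. The CRAN mirror URL passed to repos is an assumption made so the call also works in non-interactive sessions; substitute whichever mirror you normally use.

```r
# Install dplyr only if it is not already available.
# The repos value is an assumed CRAN mirror, not something the tutorial prescribes.
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr", repos = "https://cloud.r-project.org")
}

# Load the package so group_by() and summarize() are available
library(dplyr)
```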

Let’s calculate the variability of salaries in each department (HR, Finance, and IT):

library(dplyr)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

print(employee_data)

# Applying sd() function on Salary by Department
employee_data %>%
  group_by(Department) %>%
  summarize(
    Salary_sd = sd(Salary)
  )

Figure: Using the dplyr package to calculate sd() by group in R

The output shows one SD value per department. Each value measures how much the salaries vary within that department.

For example,

  1. For the Finance department, the SD of salary is around 1414. One employee earns 75000 and the other 77000, so each salary is 1000 away from the mean of 76000. The SD is around 1414 rather than 1000 because sd() computes the sample standard deviation, which divides the sum of squared deviations by n - 1 (here, 1) before taking the square root: sqrt((1000^2 + 1000^2) / 1) ≈ 1414.
  2. For the HR department, the variability in salary among its employees is around 5292.
  3. For the IT department, the variability in salary among its employees is around 3536.
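The Finance value can be checked by hand. The sketch below reproduces it step by step using the sample formula (dividing by n - 1), which is what sd() uses. The variable names are illustrative only.

```r
# Finance salaries from the demo data frame
finance_salary <- c(75000, 77000)

# Step 1: deviations from the group mean (76000)
deviations <- finance_salary - mean(finance_salary)

# Step 2: sum of squared deviations divided by n - 1 (sample variance)
sample_variance <- sum(deviations^2) / (length(finance_salary) - 1)

# Step 3: the square root of the variance is the standard deviation
manual_sd <- sqrt(sample_variance)

manual_sd           # about 1414.21
sd(finance_salary)  # the built-in function returns the same value
```

Because of the n - 1 denominator, a group with a single observation has no defined sample SD, and sd() returns NA for it.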

Using across() to apply SD to multiple columns

Our data frame contains two numeric columns: Salary and Bonus. We will calculate the SD of both of these columns grouped by Department.

library(dplyr)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

print(employee_data)

# Applying sd() function on Salary and Bonus columns by Department using across()

employee_data %>%
  group_by(Department) %>%
  summarize(across(c(Salary, Bonus), sd))

Figure: Using across() for multiple columns
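By default, across() keeps the original column names, so the result columns are still called Salary and Bonus. If you would rather label them as SDs, one option is the .names argument of across(), sketched below. The "_sd" suffix and the sd_by_department variable name are illustrative choices, not part of the original example.

```r
library(dplyr)

employee_data <- data.frame(
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

# "{.col}_sd" appends "_sd" to each input column name,
# producing Salary_sd and Bonus_sd
sd_by_department <- employee_data %>%
  group_by(Department) %>%
  summarize(across(c(Salary, Bonus), sd, .names = "{.col}_sd"))

sd_by_department
```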

Method 2: Using data.table

Consider the data.table approach if your dataset is very large, because it is generally faster and more memory-efficient for grouped operations. However, its syntax is not as readable as dplyr's.

First, convert the data frame into a data.table using the as.data.table() function, and then compute the SD of Salary by Department.

library(data.table)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

print(employee_data)


employee_data_table <- as.data.table(employee_data)

# Executing sd() function on Salary by Department using data.table

employee_data_table[
  , .(Salary_sd = sd(Salary)),
  by = Department
]

Figure: Using the data.table approach
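The same idea extends to several columns at once. A minimal sketch, assuming you want the SD of both Salary and Bonus: .SD refers to the subset of data for each group, and .SDcols restricts it to the named columns. The sd_cols and sd_by_department names are illustrative.

```r
library(data.table)

employee_data_table <- as.data.table(data.frame(
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
))

# Columns to summarise
sd_cols <- c("Salary", "Bonus")

# Apply sd() to every column in .SDcols, separately for each Department
sd_by_department <- employee_data_table[
  , lapply(.SD, sd),
  by = Department,
  .SDcols = sd_cols
]

sd_by_department
```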

Method 3: Using aggregate()

If you prefer to stay in base R without installing any packages, you can use the aggregate() function. It groups the data by the specified column and applies a summary function (such as sd()) to the specified columns.

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

print(employee_data)

# Executing sd() function on Salary by Department using aggregate() function

aggregate(
  Salary ~ Department,
  data = employee_data,
  FUN = sd
)

Figure: Using the aggregate() function

The output lists the same per-department SD values as the dplyr and data.table approaches.

If your data set is small and you want quick statistics, you can always use the aggregate() function.
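aggregate() can also summarise more than one column in a single call. As a sketch, wrapping the columns in cbind() on the left-hand side of the formula applies sd() to each of them. The sd_by_department name is illustrative.

```r
employee_data <- data.frame(
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

# cbind() on the left-hand side lists the columns to summarise;
# the right-hand side names the grouping column
sd_by_department <- aggregate(
  cbind(Salary, Bonus) ~ Department,
  data = employee_data,
  FUN = sd
)

sd_by_department
```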

Visualization of variability

Let’s visualize the group variability using ggplot2:

library(dplyr)
library(ggplot2)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

result_dplyr <- employee_data %>%
  group_by(Department) %>%
  summarize(
    Salary_sd = sd(Salary)
  )

ggplot(result_dplyr, aes(x = Department, y = Salary_sd)) +
  geom_col() +
  labs(title = "Salary Standard Deviation by Department")

Figure: Bar chart of salary standard deviation by department

That’s all!
