How to Calculate Standard Deviation by Group in R

Calculating the standard deviation (SD) by group in a data frame means computing the SD of a numeric column separately for each unique value of a categorical column. It shows how the variability of the data differs across groups.

Here are three ways to calculate SD by group:

Using dplyr
Using data.table
Using base R’s aggregate()

The demo data frame for this tutorial is as follows:

library(dplyr)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

print(employee_data)

Method 1: Using dplyr

The dplyr package provides data manipulation functions such as group_by() and summarize(). Grouping the data frame first and then calling sd() inside summarize() returns the standard deviation for each group. This approach balances efficiency with readable code.

Install the dplyr package if you have not already done so with install.packages("dplyr"), and load it with library(dplyr) at the start of your R script.

Let’s calculate the variability of salaries in each department (HR, Finance, and IT):

library(dplyr)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

print(employee_data)

# Applying sd() function on Salary by Department
employee_data %>%
  group_by(Department) %>%
  summarize(
    Salary_sd = sd(Salary)
  )

The output lists one row per department with its Salary_sd value. Each value measures how widely the salaries within that department are spread around the department's mean salary.

For example,

For the Finance department, the SD of salaries is around 1414. The two employees earn 75000 and 77000, so the mean is 76000 and each salary deviates from it by 1000. The result is about 1414 rather than 1000 because sd() computes the sample standard deviation, which divides the sum of squared deviations by n - 1 (here, 1) before taking the square root.
For the HR department, the SD of salaries is around 5292.
For the IT department, the SD of salaries is around 3536.
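
To make the Finance figure concrete, the following minimal sketch reproduces it by hand in base R. The variable names (finance, deviations, manual_sd) are illustrative and not part of the original tutorial.

```r
# Salaries of the two Finance employees from the demo data
finance <- c(75000, 77000)

n <- length(finance)                   # 2 observations
deviations <- finance - mean(finance)  # -1000 and 1000

# Sample SD: sum of squared deviations divided by (n - 1), then square root
manual_sd <- sqrt(sum(deviations^2) / (n - 1))

manual_sd    # about 1414.21
sd(finance)  # the built-in function gives the same value

# With a single observation, n - 1 is 0, so sd() returns NA
single_sd <- sd(75000)
single_sd
```

Because of the n - 1 denominator, any group that contains only one row will show NA in the grouped results.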

Using across() to apply SD to multiple columns

Our data frame contains two numeric columns: Salary and Bonus. We will calculate the SD of both of these columns grouped by Department.

library(dplyr)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

print(employee_data)

# Applying sd() function on Salary and Bonus columns by Department using across()

employee_data %>%
  group_by(Department) %>%
  summarize(across(c(Salary, Bonus), sd))
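
As a possible variation, across() also accepts a .names argument, which can be used to label the output columns, and a lambda lets you pass na.rm = TRUE to sd(). The sketch below assumes you want columns named Salary_sd and Bonus_sd; the result name sd_by_dept is illustrative. The demo data has no missing values, so na.rm only matters for real data with NAs.

```r
library(dplyr)

employee_data <- data.frame(
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

# SD of Salary and Bonus per department, ignoring missing values,
# with "_sd" appended to each output column name
sd_by_dept <- employee_data %>%
  group_by(Department) %>%
  summarize(
    across(
      c(Salary, Bonus),
      ~ sd(.x, na.rm = TRUE),
      .names = "{.col}_sd"
    )
  )

sd_by_dept
```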

Method 2: Using data.table

Consider the data.table approach if your dataset is very large, because its grouped operations are generally fast and memory-efficient. The trade-off is a terser syntax that is less readable than the dplyr approach.

First, convert the data frame into a data table using the as.data.table() function, and then find the SD of Salary by Department.

library(data.table)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

print(employee_data)


employee_data_table <- as.data.table(employee_data)

# Executing sd() function on Salary by Department using data.table

employee_data_table[
  , .(Salary_sd = sd(Salary)),
  by = Department
]
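
To cover several columns at once in data.table, one common pattern is lapply() over .SD with .SDcols restricting the columns. This is a minimal sketch under the same demo data; the names employee_dt and sd_dt are illustrative.

```r
library(data.table)

employee_dt <- data.table(
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

# .SD holds the columns named in .SDcols for each group;
# lapply() applies sd() to each of them
sd_dt <- employee_dt[
  , lapply(.SD, sd),
  by = Department,
  .SDcols = c("Salary", "Bonus")
]

sd_dt
```

The result keeps the original column names (Salary and Bonus), so rename them afterwards if you need to distinguish the SD values from the raw data.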

Method 3: Using aggregate()

If you don’t want to use any libraries or packages and stick with base R, you can use a function called aggregate(). It groups data based on the specified column and applies a statistical function (like sd()) to the specified columns.

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

print(employee_data)

# Executing sd() function on Salary by Department using aggregate() function

aggregate(
  Salary ~ Department,
  data = employee_data,
  FUN = sd
)

The output contains one row per department with the SD in the Salary column, and the values match the dplyr and data.table results.

If your data set is small and you want quick statistics, you can always use the aggregate() function.
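
If you need the SD of more than one column with aggregate(), the formula interface accepts cbind() on the left-hand side. The sketch below assumes the same demo data; sd_base is an illustrative name.

```r
employee_data <- data.frame(
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

# cbind() on the left-hand side tells aggregate() to summarize
# both Salary and Bonus for each Department
sd_base <- aggregate(
  cbind(Salary, Bonus) ~ Department,
  data = employee_data,
  FUN = sd
)

sd_base
```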

Visualization of variability

Let’s visualize the group variability using ggplot2:

library(dplyr)
library(ggplot2)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

result_dplyr <- employee_data %>%
  group_by(Department) %>%
  summarize(
    Salary_sd = sd(Salary)
  )

ggplot(result_dplyr, aes(x = Department, y = Salary_sd)) +
  geom_col() +
  labs(title = "Salary Standard Deviation by Department")
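
Another way to show the same variability, offered here only as a sketch, is to plot each department's mean salary with error bars of one SD on either side. The names salary_summary and p, and the choice of plus or minus one SD, are assumptions for illustration.

```r
library(dplyr)
library(ggplot2)

employee_data <- data.frame(
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000)
)

# Mean and SD of Salary for each department
salary_summary <- employee_data %>%
  group_by(Department) %>%
  summarize(
    Salary_mean = mean(Salary),
    Salary_sd = sd(Salary)
  )

# Points mark the mean; error bars span one SD below and above it
p <- ggplot(salary_summary, aes(x = Department, y = Salary_mean)) +
  geom_point(size = 3) +
  geom_errorbar(
    aes(ymin = Salary_mean - Salary_sd, ymax = Salary_mean + Salary_sd),
    width = 0.2
  ) +
  labs(title = "Mean Salary by Department (+/- 1 SD)", y = "Salary")

print(p)
```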

That’s all!

Krunal Lathiya
