Calculating the standard deviation (SD) by group in a data frame means finding the SD of a numeric column separately for each unique value of a categorical (grouping) column. It shows how the variability of the data differs across groups.
There are three common ways to calculate SD by group in R: with the dplyr package, with the data.table package, and with the aggregate() function from base R.
The demo data frame for this tutorial is as follows:
library(dplyr)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

print(employee_data)
The dplyr package provides data manipulation functions such as group_by() and summarise() (also spelled summarize()). Together, they let you apply the sd() function within each group and return one standard deviation per group. It is an excellent choice for balancing efficiency and code readability.
Install the dplyr package once with install.packages("dplyr"), then load it with library(dplyr) at the start of your R script.
Let’s calculate the variability of salaries in each department (HR, Finance, and IT):
library(dplyr)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

print(employee_data)

# Applying sd() function on Salary by Department
employee_data %>%
  group_by(Department) %>%
  summarize(
    Salary_sd = sd(Salary)
  )
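The output lists one SD per department, but how did we get each value? Here, SD measures the variability of Salary within each department: sd() takes the squared deviations from the group mean, sums them, divides by n - 1, and takes the square root.
For example, HR has three salaries (60000, 70000, and 62000) with a mean of 64000. The squared deviations sum to 56,000,000; dividing by n - 1 = 2 gives 28,000,000, and the square root of that is about 5291.5.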
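As a rough check on where such a value comes from, the HR figure can be reproduced step by step in base R. This is only an illustrative sketch: the variable names (hr_salaries, hr_mean, squared_dev, manual_sd) are made up for this example, and it relies on sd() computing the sample standard deviation with an n - 1 denominator.

```r
# HR salaries taken from the demo data frame
hr_salaries <- c(60000, 70000, 62000)

# Step 1: mean of the group
hr_mean <- mean(hr_salaries)

# Step 2: squared deviations from the group mean
squared_dev <- (hr_salaries - hr_mean)^2

# Step 3: divide by n - 1 (sample variance), then take the square root
manual_sd <- sqrt(sum(squared_dev) / (length(hr_salaries) - 1))

print(manual_sd)        # about 5291.503
print(sd(hr_salaries))  # the built-in function gives the same value
```

The same steps applied to the IT or Finance salaries should match the other rows of the grouped result.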
Our data frame contains two numeric columns: Salary and Bonus. We will calculate the SD of both of these columns grouped by Department.
library(dplyr)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

print(employee_data)

# Applying sd() function on Salary and Bonus columns by Department using across()
employee_data %>%
  group_by(Department) %>%
  summarize(across(c(Salary, Bonus), sd))
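With across(), the result columns keep their original names (Salary and Bonus), which can be confusing because they now hold SDs rather than raw values. One possible variation, sketched below, uses the .names argument of across() to add a suffix. The object name sd_by_dept and the "_sd" suffix are choices made for this example, and the sketch assumes dplyr 1.0.0 or later.

```r
library(dplyr)

employee_data <- data.frame(
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

# .names builds each output column name from the input column name
sd_by_dept <- employee_data %>%
  group_by(Department) %>%
  summarize(across(c(Salary, Bonus), sd, .names = "{.col}_sd"))

print(sd_by_dept)
```

The output should contain the columns Department, Salary_sd, and Bonus_sd, with one row per department.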
I recommend the data.table approach if your dataset is very large, because it is typically the fastest of the three. However, its syntax is not as readable as the dplyr approach.
First, convert the data frame into a data table using the as.data.table() function, then find the SD of Salary by Department.
library(data.table)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

print(employee_data)

employee_data_table <- as.data.table(employee_data)

# Executing sd() function on Salary by Department using data.table
employee_data_table[
  , .(Salary_sd = sd(Salary)),
  by = Department
]
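The data.table approach can also cover several columns at once. The sketch below is one way to do it, using lapply() over .SD with .SDcols to pick the columns. The names employee_dt, sd_cols, and sd_by_dept_dt are invented for this example.

```r
library(data.table)

employee_dt <- data.table(
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

# Columns to summarise; .SD holds only these columns within each group
sd_cols <- c("Salary", "Bonus")

sd_by_dept_dt <- employee_dt[
  , lapply(.SD, sd),
  by = Department,
  .SDcols = sd_cols
]

print(sd_by_dept_dt)
```

Note that the result columns keep the names Salary and Bonus even though they now contain SDs.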
If you prefer to avoid extra packages and stick with base R, you can use the aggregate() function. It groups the data by the specified column and applies a statistical function (like sd()) to the specified columns.
employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

print(employee_data)

# Executing sd() function on Salary by Department using aggregate() function
aggregate(
  cbind(Salary) ~ Department,
  data = employee_data,
  FUN = sd
)
The result is a data frame with one row per department, where the Salary column holds the SD for that department.
If your dataset is small and you want quick statistics without loading any packages, aggregate() is a convenient choice.
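aggregate() can handle more than one numeric column when several columns are listed inside cbind() on the left-hand side of the formula. The following is a small sketch of that idea; the object name sd_by_dept_base is made up for this example.

```r
employee_data <- data.frame(
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

# cbind() lists the columns to summarise; Department is the grouping column
sd_by_dept_base <- aggregate(
  cbind(Salary, Bonus) ~ Department,
  data = employee_data,
  FUN = sd
)

print(sd_by_dept_base)
```

The result is a regular data frame with the columns Department, Salary, and Bonus.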
Let’s visualize the group variability using ggplot2:
library(dplyr)
library(ggplot2)

employee_data <- data.frame(
  Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
  Bonus = c(5000, 8000, 7500, 6000, 9000, 5500, 7000)
)

result_dplyr <- employee_data %>%
  group_by(Department) %>%
  summarize(
    Salary_sd = sd(Salary)
  )

ggplot(result_dplyr, aes(x = Department, y = Salary_sd)) +
  geom_bar(stat = "identity") +
  labs(title = "Salary Standard Deviation by Department")
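A bar chart of SDs shows spread but not the typical salary. As one possible alternative, the sketch below plots the mean salary per department and draws error bars of one SD above and below it. The names salary_stats and salary_plot, and the choice of plus or minus one SD, are assumptions made for this illustration.

```r
library(dplyr)
library(ggplot2)

employee_data <- data.frame(
  Department = c("HR", "IT", "Finance", "HR", "IT", "HR", "Finance"),
  Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000)
)

# Mean and SD of Salary for each department
salary_stats <- employee_data %>%
  group_by(Department) %>%
  summarize(
    Salary_mean = mean(Salary),
    Salary_sd = sd(Salary)
  )

# Bars show the mean; error bars span mean - SD to mean + SD
salary_plot <- ggplot(salary_stats, aes(x = Department, y = Salary_mean)) +
  geom_col() +
  geom_errorbar(
    aes(ymin = Salary_mean - Salary_sd, ymax = Salary_mean + Salary_sd),
    width = 0.2
  ) +
  labs(
    title = "Mean Salary by Department (error bars: +/- 1 SD)",
    y = "Mean salary"
  )

print(salary_plot)
```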
That’s all!
Krunal Lathiya is a seasoned Computer Science expert with over eight years in the tech industry. He boasts deep knowledge in Data Science and Machine Learning. Versed in Python, JavaScript, PHP, R, and Golang. Skilled in frameworks like Angular and React and platforms such as Node.js. His expertise spans both front-end and back-end development. His proficiency in the Python language stands as a testament to his versatility and commitment to the craft.