If you want to find the maximum value within a specific subset of your data, you must find the maximum value within each group.
First, you need to group your data based on the values of one or more categorical variables (columns). The second step is identifying the maximum value of a specific numeric variable (numeric column) for each group.
Here are three approaches you can use to find the maximum value by group in R:
To illustrate the concepts in this tutorial, we use the following demo DataFrame, which is concise and robust for coding examples.
employee_data <- data.frame(
Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
Department = c("HR", "IT", "Finance", "Marketing", "IT", "HR", "Finance"),
Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
Location = c("TX", "CA", "NY", "IL", "CA", "CA", "NY")
)
print(employee_data)
Using dplyr is efficient because its main function is to modify the data frame meaningfully to get the desired output.
It provides a group_by() function to group the data into multiple subsets based on the value of a single or multiple columns. Then, we can use the summarise() function to find the maximum value in each subgroup.
Let’s get the maximum Salary by Department:
library(dplyr)
employee_data <- data.frame(
Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
Department = c("HR", "IT", "Finance", "Marketing", "IT", "HR", "Finance"),
Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
Location = c("TX", "CA", "NY", "IL", "CA", "CA", "NY")
)
print(employee_data)
# Using dplyr
# Maximum salary per single group
employee_data %>%
group_by(Department) %>%
summarize(max_salary = max(Salary))
Output
The above output figure shows a tibble of four rows and two columns. The Department is a categorical column containing unique departments; the next column is max_salary, which is the maximum Salary for each Department.
Let’s find the maximum Salary by Department (group 1) and location (group 2):
library(dplyr)
employee_data <- data.frame(
Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
Department = c("HR", "IT", "Finance", "Marketing", "IT", "HR", "Finance"),
Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
Location = c("TX", "CA", "NY", "IL", "CA", "CA", "NY")
)
print(employee_data)
# Maximum value by multiple groups
employee_data %>%
group_by(Department, Location) %>%
summarize(max_salary = max(Salary))
Output
When you combine multiple groups, R creates unique combinations of these groups (Department and Location) based on the max_salary for each subset.
A group with no data (e.g., Sales) will be omitted by default. To retain empty groups, use .drop = FALSE in the group_by() method.
Using dplyr’s slice_max() function, you can find the maximum value of Salary, and it will return the complete row containing that Salary.
library(dplyr)
employee_data <- data.frame(
Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
Department = c("HR", "IT", "Finance", "Marketing", "IT", "HR", "Finance"),
Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
Location = c("TX", "CA", "NY", "IL", "CA", "CA", "NY")
)
print(employee_data)
# Maximum value across all columns
employee_data %>%
slice_max(Salary, n = 1)
Output
If two employees in the same Department have the same maximum Salary, use the slice_max(Salary, n = Inf) function with n = Inf to return all ties.
When it comes to base R, aggregate() is the most popular function that allows grouping data and applying a function like max. You don’t need any third-party package to use it.
employee_data <- data.frame(
Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
Department = c("HR", "IT", "Finance", "Marketing", "IT", "HR", "Finance"),
Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
Location = c("TX", "CA", "NY", "IL", "CA", "CA", "NY")
)
# Get maximum salary per group
aggregate(Salary ~ Department, data = employee_data, max)
Output
employee_data <- data.frame(
Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
Department = c("HR", "IT", "Finance", "Marketing", "IT", "HR", "Finance"),
Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
Location = c("TX", "CA", "NY", "IL", "CA", "CA", "NY")
)
print(employee_data)
# Get maximum value by multiple groups
aggregate(Salary ~ Department + Location, data = employee_data, max)
Output
employee_data <- data.frame(
Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
Department = c("HR", "IT", "Finance", "Marketing", "IT", "HR", "Finance"),
Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
Location = c("TX", "CA", "NY", "IL", "CA", "CA", "NY")
)
print(employee_data)
# Get absolute maximum value across all columns
employee_data[which.max(employee_data$Salary), ]
Output
The above output shows that row 5 has the highest Salary of 85000, Department IT, Location CA, and Employee_ID E005.
The data.table is highly efficient when you have a large dataset. First, you need to convert the input data frame into a data.table and then you can use its functions to get the maximum value by group(s).
library(data.table)
employee_data <- data.frame(
Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
Department = c("HR", "IT", "Finance", "Marketing", "IT", "HR", "Finance"),
Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
Location = c("TX", "CA", "NY", "IL", "CA", "CA", "NY")
)
print(employee_data)
employee_data_dt <- data.table(employee_data)
# Maximum salary per group
employee_data_dt[, .(max_salary = max(Salary)), by = Department]
Output
In this code, we used data.table() function to convert the input data frame employee_data to data.table (employee_data_dt).
In the next step, we applied the max() function to the Salary column, grouped by Department, using the “by” argument. The above output is a data table, not a data frame. We get each Department’s highest Salary.
library(data.table)
employee_data <- data.frame(
Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
Department = c("HR", "IT", "Finance", "Marketing", "IT", "HR", "Finance"),
Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
Location = c("TX", "CA", "NY", "IL", "CA", "CA", "NY")
)
print(employee_data)
employee_data_dt <- data.table(employee_data)
# Maximum value by multiple groups
employee_data_dt[, .(max_salary = max(Salary)), by = .(Department, Location)]
Output
library(data.table)
employee_data <- data.frame(
Employee_ID = c("E001", "E002", "E003", "E004", "E005", "E006", "E007"),
Department = c("HR", "IT", "Finance", "Marketing", "IT", "HR", "Finance"),
Salary = c(60000, 80000, 75000, 70000, 85000, 62000, 77000),
Location = c("TX", "CA", "NY", "IL", "CA", "CA", "NY")
)
print(employee_data)
employee_data_dt <- data.table(employee_data)
# Maximum value across all columns
employee_data_dt[which.max(Salary), ]
Output
That’s all!
Krunal Lathiya is a seasoned Computer Science expert with over eight years in the tech industry. He boasts deep knowledge in Data Science and Machine Learning. Versed in Python, JavaScript, PHP, R, and Golang. Skilled in frameworks like Angular and React and platforms such as Node.js. His expertise spans both front-end and back-end development. His proficiency in the Python language stands as a testament to his versatility and commitment to the craft.
Whether you are reading or writing files via programs in the file system, it is…
When it comes to checking the data type of a variable, it depends on what…
The grepl() function (stands for "grep logical") in R searches for patterns within each element…
The zip() function creates a new zip archive file. You must ensure that the zip tool…
When working with file systems, checking the directory or file existence is always better before…
To create a grouped boxplot in R, we can use the ggplot2 library's aes() and…