Duplicate rows are rows whose values are identical across all columns in two or more rows. To avoid redundant data, we usually remove duplicates from a data frame. For example, if the same row appears three times in a data frame, we remove two of the rows because they are duplicates of one original row.
Based on the requirements of the project, we have to decide which occurrence to keep and which ones to drop. The row we keep can be the first or the last occurrence. If we keep the last occurrence, the first two rows are removed as duplicates; if we keep the first occurrence, the last two rows are removed.
Here are three ways to remove duplicate rows from an R data frame:
- Using !duplicated()
- Using unique()
- Using dplyr::distinct()
Method 1: Using !duplicated()
The duplicated() function returns a logical vector that marks every row that repeats an earlier row. Negating it with ! and using the result to subset the data frame keeps the first occurrence of each row and removes the later duplicates.
df <- data.frame(
name = c("Krunal", "Ankit", "Rushabh", "Dhaval", "Tejas", "Rushabh", "Dhaval", "Tejas"),
score = c(85, 90, 78, 92, 88, 78, 92, 88),
subject = c("Math", "Math", "History", "History", "Math", "History", "History", "Math"),
grade = c("10th", "11th", "11th", "10th", "10th", "11th", "10th", "10th")
)
df_unique <- df[!duplicated(df), ]
print(df_unique)
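Output
     name score subject grade
1  Krunal    85    Math  10th
2   Ankit    90    Math  11th
3 Rushabh    78 History  11th
4  Dhaval    92 History  10th
5   Tejas    88    Math  10th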
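To see why the negation is needed, it can help to look at the logical vector that duplicated() returns before it is inverted. The sketch below uses a smaller, made-up data frame; the name `small_df` and its values are illustrative only.

```r
# Small illustrative data frame; rows 3 and 4 repeat rows 1 and 2
small_df <- data.frame(
  name  = c("A", "B", "A", "B", "C"),
  score = c(1, 2, 1, 2, 3)
)

# TRUE marks a row that has already appeared earlier in the data frame
dup_flags <- duplicated(small_df)
print(dup_flags)
# [1] FALSE FALSE  TRUE  TRUE FALSE

# Negating the flags selects the rows to keep (rows 1, 2, and 5)
keep_rows <- small_df[!dup_flags, ]
print(keep_rows)

# Summing the flags counts how many rows would be dropped
n_dropped <- sum(dup_flags)
print(n_dropped)
```

Checking `sum(duplicated(df))` first is a cheap way to find out whether a data frame has any duplicates before you modify it.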
Keeping the last occurrence
If you need to remove all the duplicates except the last one, pass the fromLast = TRUE argument to the duplicated() function.
df <- data.frame(
name = c("Krunal", "Ankit", "Rushabh", "Dhaval", "Tejas", "Rushabh", "Dhaval", "Tejas"),
score = c(85, 90, 78, 92, 88, 78, 92, 88),
subject = c("Math", "Math", "History", "History", "Math", "History", "History", "Math"),
grade = c("10th", "11th", "11th", "10th", "10th", "11th", "10th", "10th")
)
df_unique_last <- df[!duplicated(df, fromLast = TRUE), ]
print(df_unique_last)
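Output
     name score subject grade
1  Krunal    85    Math  10th
2   Ankit    90    Math  11th
6 Rushabh    78 History  11th
7  Dhaval    92 History  10th
8   Tejas    88    Math  10th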
Removing all occurrences
If you want to remove every row that has a duplicate, including the original occurrence, flag duplicates from both directions and drop all flagged rows:
df_unique_all <- df[!(duplicated(df) | duplicated(df, fromLast = TRUE)), ]
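As a sketch of what this expression does, the example below rebuilds the same data frame and splits the condition into two named steps. The variable names `later_copies` and `earlier_copies` are introduced here for illustration.

```r
df <- data.frame(
  name = c("Krunal", "Ankit", "Rushabh", "Dhaval", "Tejas", "Rushabh", "Dhaval", "Tejas"),
  score = c(85, 90, 78, 92, 88, 78, 92, 88),
  subject = c("Math", "Math", "History", "History", "Math", "History", "History", "Math"),
  grade = c("10th", "11th", "11th", "10th", "10th", "11th", "10th", "10th")
)

# Scanning from the top flags the later copies (rows 6, 7, 8)
later_copies <- duplicated(df)

# Scanning from the bottom flags the earlier copies (rows 3, 4, 5)
earlier_copies <- duplicated(df, fromLast = TRUE)

# A row is dropped if it was flagged in either direction
df_unique_all <- df[!(later_copies | earlier_copies), ]
print(df_unique_all)
# Only rows 1 and 2 (Krunal and Ankit) remain
```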
Pros
- It is a straightforward base R method that does not require any packages.
- It lets you control which occurrence is kept (first or last).
- It works through ordinary logical indexing, so no intermediate object has to be created upfront.
- It can be faster than dplyr::distinct() on small-to-medium datasets, although this depends on the data.
Cons
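- It is less intuitive for beginners because of the negation (!) operator.
- Checking only some columns is less direct: you have to call duplicated() on a column subset and then index the full data frame.
- It treats NA values as equal to each other, so rows that match only because they share NA values are flagged as duplicates. Whether that is desirable depends on your use case.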
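The NA behaviour is easiest to judge on a tiny example. The sketch below uses a made-up data frame (`na_df` is a hypothetical name) and shows one possible workaround, which is to deduplicate only the rows that have no missing values. Whether that rule fits depends on what NA means in your data.

```r
# Rows 1 and 2 are identical only because both have NA in `value`
na_df <- data.frame(
  id    = c(1, 1, 2),
  value = c(NA, NA, 5)
)

# NA is matched to NA, so row 2 is flagged as a duplicate of row 1
na_flags <- duplicated(na_df)
print(na_flags)
# [1] FALSE  TRUE FALSE

# Possible workaround: drop a flagged row only if it has no missing values
is_complete <- complete.cases(na_df)
na_kept <- na_df[!(na_flags & is_complete), ]
print(na_kept)
# All three rows are kept, because the flagged row contains an NA
```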
Method 2: Using unique()
As the name suggests, the unique() function returns only the unique rows of a data frame: the first occurrence of each row is kept and the later duplicates are removed.
df <- data.frame(
name = c("Krunal", "Ankit", "Rushabh", "Dhaval", "Tejas", "Rushabh", "Dhaval", "Tejas"),
score = c(85, 90, 78, 92, 88, 78, 92, 88),
subject = c("Math", "Math", "History", "History", "Math", "History", "History", "Math"),
grade = c("10th", "11th", "11th", "10th", "10th", "11th", "10th", "10th")
)
df_unique <- unique(df)
print(df_unique)
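Output
     name score subject grade
1  Krunal    85    Math  10th
2   Ankit    90    Math  11th
3 Rushabh    78 History  11th
4  Dhaval    92 History  10th
5   Tejas    88    Math  10th
Rows 6, 7, and 8 are duplicates of rows 3, 4, and 5, so they have been removed from the output data frame.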
Pros
- It is a base R function for getting unique rows. No need to install other packages.
- It is a simple function call. No subsetting is required.
- It keeps the first occurrence of each row and preserves the original row order.
Cons
- It cannot deduplicate based on specific columns while keeping the remaining columns.
- It can become slow on larger data frames.
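If you want to stay in base R and still deduplicate on specific columns, one common workaround is to call duplicated() on a column subset and use the result to index the full data frame. A minimal sketch, using the subject and grade columns of the example data (the names `pair_seen` and `df_by_pair` are illustrative):

```r
df <- data.frame(
  name = c("Krunal", "Ankit", "Rushabh", "Dhaval", "Tejas", "Rushabh", "Dhaval", "Tejas"),
  score = c(85, 90, 78, 92, 88, 78, 92, 88),
  subject = c("Math", "Math", "History", "History", "Math", "History", "History", "Math"),
  grade = c("10th", "11th", "11th", "10th", "10th", "11th", "10th", "10th")
)

# Flag rows whose (subject, grade) pair has already been seen
pair_seen <- duplicated(df[, c("subject", "grade")])

# Index the full data frame so that all columns are kept
df_by_pair <- df[!pair_seen, ]
print(df_by_pair)
# Rows 1 to 4 remain: one row per subject/grade combination
```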
Method 3: Using dplyr::distinct()
The dplyr::distinct() function keeps the unique/distinct rows of a data frame. If there are duplicate rows, only the first row is preserved, and the others are removed.
library(dplyr)
df <- data.frame(
name = c("Krunal", "Ankit", "Rushabh", "Dhaval", "Tejas", "Rushabh", "Dhaval", "Tejas"),
score = c(85, 90, 78, 92, 88, 78, 92, 88),
subject = c("Math", "Math", "History", "History", "Math", "History", "History", "Math"),
grade = c("10th", "11th", "11th", "10th", "10th", "11th", "10th", "10th")
)
df_unique <- df %>% distinct()
print(df_unique)
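Output
     name score subject grade
1  Krunal    85    Math  10th
2   Ankit    90    Math  11th
3 Rushabh    78 History  11th
4  Dhaval    92 History  10th
5   Tejas    88    Math  10th
To remove duplicate rows based on a single column (variable), name that column in distinct():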
df %>% distinct(subject, .keep_all = TRUE)
To decide duplicates from several columns, list those columns in distinct(). In the line below, col1 and col2 are placeholder column names:
df %>% distinct(col1, col2, .keep_all = TRUE)
It returns the first row for each unique combination of col1 and col2. The `.keep_all = TRUE` argument is needed only when you name specific columns and want to keep the other columns in the output; without it, distinct() returns only the named columns.
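To make the effect of `.keep_all` concrete, here is a short sketch that applies distinct() to the example data using its real column names instead of the col1/col2 placeholders. It assumes the dplyr package is installed; the result names are illustrative.

```r
library(dplyr)

df <- data.frame(
  name = c("Krunal", "Ankit", "Rushabh", "Dhaval", "Tejas", "Rushabh", "Dhaval", "Tejas"),
  score = c(85, 90, 78, 92, 88, 78, 92, 88),
  subject = c("Math", "Math", "History", "History", "Math", "History", "History", "Math"),
  grade = c("10th", "11th", "11th", "10th", "10th", "11th", "10th", "10th")
)

# Default (.keep_all = FALSE): only the named column is returned
subjects_only <- df %>% distinct(subject)
print(subjects_only)

# .keep_all = TRUE: the first row for each subject, with all columns
first_per_subject <- df %>% distinct(subject, .keep_all = TRUE)
print(first_per_subject)

# Two columns: the first row for each subject/grade combination
first_per_pair <- df %>% distinct(subject, grade, .keep_all = TRUE)
print(first_per_pair)
```

With this data, `subjects_only` has two rows and one column, `first_per_subject` keeps the Krunal and Rushabh rows, and `first_per_pair` keeps the first four rows.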
Pros
- You can name the columns to check for duplicates directly in the call, which is less direct with the base R methods.
- If you are working with big data frames, I highly recommend the dplyr package because it is optimized for performance.
- It provides a clean syntax with pipes.
Cons
- It requires an external package (tidyverse or dplyr).
- It has a slight learning curve.
- It can be slightly slower than the !duplicated() approach on small datasets.
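Performance depends heavily on the size and shape of your data, so I suggest timing the three methods on something that resembles your own data frame. The sketch below is one rough way to do that with base R's system.time(); the synthetic data, its size, and the variable names are arbitrary assumptions, and a single run is only an indication, not a proper benchmark.

```r
set.seed(42)

# Synthetic data with many repeated rows; the size is an arbitrary choice
n <- 100000
big_df <- data.frame(
  id    = sample(1:1000, n, replace = TRUE),
  group = sample(c("a", "b", "c"), n, replace = TRUE)
)

# Time each base R method once
time_duplicated <- system.time(res_dup <- big_df[!duplicated(big_df), ])
time_unique     <- system.time(res_uni <- unique(big_df))
print(time_duplicated["elapsed"])
print(time_unique["elapsed"])

# dplyr is timed only if it is installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  time_distinct <- system.time(res_dis <- dplyr::distinct(big_df))
  print(time_distinct["elapsed"])
}
```

All three results should contain the same rows, so comparing `nrow()` across them is a quick sanity check before you trust the timings.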
That’s all!
Krunal Lathiya is a seasoned Computer Science expert with over eight years in the tech industry. He boasts deep knowledge in Data Science and Machine Learning. Versed in Python, JavaScript, PHP, R, and Golang. Skilled in frameworks like Angular and React and platforms such as Node.js. His expertise spans both front-end and back-end development. His proficiency in the Python language stands as a testament to his versatility and commitment to the craft.