How to Remove Duplicate Rows from a Data Frame in R

A duplicate row is a row whose values match another row's values in every column. To avoid redundant data, we usually want to remove duplicates from a data frame. For example, if the same row appears three times in a data frame, we remove two of them and keep one original.

Based on the requirements of our project, we have to decide which occurrence to keep: the first or the last. If we keep the last occurrence, the first two rows are removed as duplicates; if we keep the first occurrence, the last two rows are removed.

Here are three ways to remove duplicate rows in an R data frame:

  1. Using !duplicated()
  2. Using unique()
  3. Using dplyr::distinct()

Method 1: Using !duplicated()

The duplicated() function returns a logical vector that marks every row repeating an earlier row. Negating it with the logical negation operator (!) and using the result as a row index keeps the first occurrence of each row and removes the later duplicates.

df <- data.frame(
  name = c("Krunal", "Ankit", "Rushabh", "Dhaval", "Tejas", "Rushabh", "Dhaval", "Tejas"),
  score = c(85, 90, 78, 92, 88, 78, 92, 88),
  subject = c("Math", "Math", "History", "History", "Math", "History", "History", "Math"),
  grade = c("10th", "11th", "11th", "10th", "10th", "11th", "10th", "10th")
)

df_unique <- df[!duplicated(df), ]

print(df_unique)

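Output

     name score subject grade
1  Krunal    85    Math  10th
2   Ankit    90    Math  11th
3 Rushabh    78 History  11th
4  Dhaval    92 History  10th
5   Tejas    88    Math  10th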
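
To see why the negation works, it can help to inspect the logical vector that duplicated() returns before using it as an index. The sketch below reuses the sample data; dup_flags is only an illustrative variable name.

```r
df <- data.frame(
  name = c("Krunal", "Ankit", "Rushabh", "Dhaval", "Tejas", "Rushabh", "Dhaval", "Tejas"),
  score = c(85, 90, 78, 92, 88, 78, 92, 88),
  subject = c("Math", "Math", "History", "History", "Math", "History", "History", "Math"),
  grade = c("10th", "11th", "11th", "10th", "10th", "11th", "10th", "10th")
)

# TRUE marks a row that repeats an earlier row
dup_flags <- duplicated(df)
print(dup_flags)
# [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

# Count the duplicates, or look at them before dropping anything
n_dups <- sum(dup_flags)   # 3
dup_rows <- df[dup_flags, ] # rows 6, 7, and 8
```

Checking the flagged rows first is a cheap way to confirm that the rows about to be dropped are really the ones you expect.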

Keeping the last occurrence

If you need to remove all the duplicates except the last one, pass the fromLast = TRUE argument to the duplicated() function.

df <- data.frame(
  name = c("Krunal", "Ankit", "Rushabh", "Dhaval", "Tejas", "Rushabh", "Dhaval", "Tejas"),
  score = c(85, 90, 78, 92, 88, 78, 92, 88),
  subject = c("Math", "Math", "History", "History", "Math", "History", "History", "Math"),
  grade = c("10th", "11th", "11th", "10th", "10th", "11th", "10th", "10th")
)

df_unique_last <- df[!duplicated(df, fromLast = TRUE), ]

print(df_unique_last)

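Output

     name score subject grade
1  Krunal    85    Math  10th
2   Ankit    90    Math  11th
6 Rushabh    78 History  11th
7  Dhaval    92 History  10th
8   Tejas    88    Math  10th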

Removing all occurrences

If you want to remove every row that has a duplicate, including its first occurrence, combine the checks from both directions:

df_unique_all <- df[!(duplicated(df) | duplicated(df, fromLast = TRUE)), ]
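
Run against the sample data, this keeps only the rows that never repeat. A minimal sketch, with in_dup_group as an illustrative name:

```r
df <- data.frame(
  name = c("Krunal", "Ankit", "Rushabh", "Dhaval", "Tejas", "Rushabh", "Dhaval", "Tejas"),
  score = c(85, 90, 78, 92, 88, 78, 92, 88),
  subject = c("Math", "Math", "History", "History", "Math", "History", "History", "Math"),
  grade = c("10th", "11th", "11th", "10th", "10th", "11th", "10th", "10th")
)

# A row belongs to a duplicate group if it repeats an earlier row
# or is repeated by a later one
in_dup_group <- duplicated(df) | duplicated(df, fromLast = TRUE)

df_unique_all <- df[!in_dup_group, ]
print(df_unique_all)
# Only the Krunal and Ankit rows remain, because every other row appears twice
```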

Pros

  1. It is a straightforward base R method that does not require any packages.
  2. It gives control over which occurrence is kept (first or last).
  3. It works through logical indexing, so the same logical vector can be used to inspect or count duplicates before removing them.
  4. It can be faster than dplyr::distinct() on small-to-medium datasets.

Cons

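  1. It is less intuitive for beginners because of the negation (!) operator.
  2. Checking duplicates on only some columns requires subsetting those columns yourself.
  3. It treats NA values as equal to each other, so rows that match and have NA in the same columns count as duplicates (which may or may not suit your use case).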
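
The NA behavior mentioned above is easy to check on a tiny example. The data frame df_na below is made up purely for illustration:

```r
# Hypothetical data: the first two rows match, including the missing score
df_na <- data.frame(
  name = c("Krunal", "Krunal", "Ankit"),
  score = c(NA, NA, 90)
)

# NA is treated as equal to NA, so the second row is flagged
na_flags <- duplicated(df_na)
print(na_flags)
# [1] FALSE  TRUE FALSE

df_na_unique <- df_na[!na_flags, ]
print(df_na_unique)
```

If rows with missing values should never be merged, one option is to filter or handle them separately before de-duplicating.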

Method 2: Using unique()

As the name suggests, the unique() function keeps only the unique rows of a data frame and removes the duplicates. Like !duplicated(), it keeps the first occurrence of each row.

df <- data.frame(
  name = c("Krunal", "Ankit", "Rushabh", "Dhaval", "Tejas", "Rushabh", "Dhaval", "Tejas"),
  score = c(85, 90, 78, 92, 88, 78, 92, 88),
  subject = c("Math", "Math", "History", "History", "Math", "History", "History", "Math"),
  grade = c("10th", "11th", "11th", "10th", "10th", "11th", "10th", "10th")
)

df_unique <- unique(df)

print(df_unique)

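Output

     name score subject grade
1  Krunal    85    Math  10th
2   Ankit    90    Math  11th
3 Rushabh    78 History  11th
4  Dhaval    92 History  10th
5   Tejas    88    Math  10th

Rows 6, 7, and 8 of the original data frame are duplicates of rows 3, 4, and 5, so they have been removed from the output data frame.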

Pros

  1. It is a base R function for getting unique rows. No need to install other packages.
  2. It is a simple function call. No subsetting is required.
  3. It preserves the original order of the first occurrence.

Cons

  1. It cannot check for duplicates on specific columns while keeping the remaining columns.
  2. It can be slow on larger data frames.
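
When duplicates need to be judged on only some columns, a common base R workaround is to call duplicated() on those columns and use the result to index the full data frame. A sketch, assuming subject and grade are the key columns (chosen here only for illustration):

```r
df <- data.frame(
  name = c("Krunal", "Ankit", "Rushabh", "Dhaval", "Tejas", "Rushabh", "Dhaval", "Tejas"),
  score = c(85, 90, 78, 92, 88, 78, 92, 88),
  subject = c("Math", "Math", "History", "History", "Math", "History", "History", "Math"),
  grade = c("10th", "11th", "11th", "10th", "10th", "11th", "10th", "10th")
)

# Columns that define a duplicate (illustrative choice)
key_cols <- c("subject", "grade")

# Flag duplicates on the key columns, then index the full data frame
df_by_key <- df[!duplicated(df[, key_cols]), ]
print(df_by_key)
# Keeps rows 1 to 4: the first row for each subject/grade combination
```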

Method 3: Using the dplyr package’s distinct() function

The dplyr::distinct() function keeps the unique (distinct) rows of a data frame. If there are duplicate rows, only the first one is preserved, and the others are removed.

library(dplyr)

df <- data.frame(
  name = c("Krunal", "Ankit", "Rushabh", "Dhaval", "Tejas", "Rushabh", "Dhaval", "Tejas"),
  score = c(85, 90, 78, 92, 88, 78, 92, 88),
  subject = c("Math", "Math", "History", "History", "Math", "History", "History", "Math"),
  grade = c("10th", "11th", "11th", "10th", "10th", "11th", "10th", "10th")
)

df_unique <- df %>% distinct()

print(df_unique)

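Output

     name score subject grade
1  Krunal    85    Math  10th
2   Ankit    90    Math  11th
3 Rushabh    78 History  11th
4  Dhaval    92 History  10th
5   Tejas    88    Math  10th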

Use the following code to remove duplicate rows based on a single column (variable). With the sample data, it keeps the first row for each subject.

df %>% distinct(subject, .keep_all = TRUE)

If you want only certain columns to determine which rows are duplicates, list them in distinct(). Here, col1 and col2 are placeholder column names:

df %>% distinct(col1, col2, .keep_all = TRUE)

It returns the first row for each unique combination of col1 and col2. The `.keep_all = TRUE` argument is only needed when we specify certain columns and want to keep the other columns in the output; without it, distinct() returns only the named columns.
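
Applied to the sample data, the same pattern might look like the sketch below. The choice of subject and grade as the columns to compare is only for illustration, and the dplyr package must be installed.

```r
library(dplyr)

df <- data.frame(
  name = c("Krunal", "Ankit", "Rushabh", "Dhaval", "Tejas", "Rushabh", "Dhaval", "Tejas"),
  score = c(85, 90, 78, 92, 88, 78, 92, 88),
  subject = c("Math", "Math", "History", "History", "Math", "History", "History", "Math"),
  grade = c("10th", "11th", "11th", "10th", "10th", "11th", "10th", "10th")
)

# Without .keep_all, only the listed columns are returned
df_keys_only <- df %>% distinct(subject, grade)

# With .keep_all = TRUE, every column is kept for the first row
# of each subject/grade combination
df_by_key <- df %>% distinct(subject, grade, .keep_all = TRUE)
print(df_by_key)
```

The sample data contains four distinct subject/grade combinations, so both results have four rows; they differ only in how many columns they carry.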

Pros

  1. You can specify the columns used for duplicate checking directly in the function call, which the other methods do not support as conveniently.
  2. If you are working on big data frames, I highly recommend you use the dplyr package because it is optimized for performance.
  3. It provides a clean syntax with pipes.

Cons

  1. It requires an external package (dplyr, which is also part of the tidyverse).
  2. It has a slight learning curve.
  3. It can be slightly slower than the !duplicated() approach on small datasets.
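
Because the performance notes above depend on the data, the machine, and the package versions, it is worth timing the three methods on your own data. The following is only a rough sketch on synthetic data; big_df, its columns, and the row count are made up for illustration, and no particular ranking should be assumed.

```r
library(dplyr)

set.seed(42)
n <- 1e5

# Synthetic data with many repeated rows (illustrative only)
big_df <- data.frame(
  id = sample(1:1000, n, replace = TRUE),
  group = sample(letters, n, replace = TRUE)
)

# Time each method once; system.time() reports seconds
t_duplicated <- system.time(res_duplicated <- big_df[!duplicated(big_df), ])
t_unique <- system.time(res_unique <- unique(big_df))
t_distinct <- system.time(res_distinct <- distinct(big_df))

# Elapsed time per method
print(c(
  duplicated = t_duplicated[["elapsed"]],
  unique = t_unique[["elapsed"]],
  distinct = t_distinct[["elapsed"]]
))

# All three methods should keep the same number of rows
print(c(nrow(res_duplicated), nrow(res_unique), nrow(res_distinct)))
```

A single run like this is noisy, so repeating the measurement several times gives a more reliable comparison.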

That’s all!
