How to Remove Duplicates in R with Example

Data cleaning is one of the most tedious tasks a data scientist faces. To analyze a data set accurately, we often need to remove duplicates based on certain conditions, such as the values in a column. In this tutorial, we will see how to remove duplicate data based on column values and the different ways to do it efficiently.

How to Remove Duplicates in R

There are three approaches you can use to remove duplicates in R.

  1. Using duplicated(): It identifies the duplicate elements.
  2. Using unique(): It extracts the unique elements.
  3. Using the dplyr package’s distinct(): It keeps only the distinct rows of a data frame.

duplicated() in R

duplicated() is a built-in R function that determines which elements of a vector, or rows of a data frame, are duplicates of elements with smaller subscripts (that is, elements that appeared earlier). It returns a logical vector with TRUE for each duplicate.

rv <- c(11, 21, 46, 21, 19, 18, 19)

duplicated(rv)

Output

[1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE

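If an element has already appeared earlier in the vector, duplicated() returns TRUE for it. The first occurrence is always FALSE.

The result is a logical vector with one value per element, so the TRUE entries mark where the duplicates sit (here, positions 4 and 7).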
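If you need the numeric positions rather than a logical vector, one option is to wrap the call in which(). The short sketch below also shows the fromLast argument of duplicated(), which scans from the end so that the earlier occurrence is flagged instead. The variable names dup_positions and dup_from_last are only illustrative.

```r
rv <- c(11, 21, 46, 21, 19, 18, 19)

# which() converts the logical vector into the indices of the TRUE entries
dup_positions <- which(duplicated(rv))
print(dup_positions)   # 4 7

# fromLast = TRUE scans from the end, so the earlier occurrence is flagged
dup_from_last <- which(duplicated(rv, fromLast = TRUE))
print(dup_from_last)   # 2 5
```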

To extract the unique elements from a vector in R, index it with !duplicated(), where ! is logical negation.

rv <- c(11, 21, 46, 21, 19, 18, 19)

rv[!duplicated(rv)]

Output

[1] 11 21 46 19 18

Remove duplicate rows from data frame in R

To remove duplicate rows from a data frame, use !duplicated() as the row index, where ! is logical negation.

To create a data frame in R, use the data.frame() function.

provider <- data.frame(service_id = c(21, 19, 18, 46, 29, 18, 46),
 service_name = c("Netflix", "Disney+", "HBOMAX", "Hulu", "Peacock", "HBOMAX", "Hulu"),
 service_price = c(18, 10, 15, 7, 12, 15, 7),
 stringsAsFactors = FALSE)

print(provider)

Output

   service_id   service_name   service_price
1     21           Netflix          18
2     19           Disney+          10
3     18           HBOMAX           15
4     46           Hulu              7
5     29           Peacock          12
6     18           HBOMAX           15
7     46           Hulu              7

You can see that our data frame contains duplicate rows.

To remove duplicate rows based on the values in one column, pass that column to duplicated() and negate the result with !. The following example removes the duplicate rows based on the service_name column.

provider <- data.frame(service_id = c(21, 19, 18, 46, 29, 18, 46),
            service_name = c("Netflix", "Disney+", "HBOMAX", "Hulu", "Peacock", "HBOMAX", "Hulu"),
            service_price = c(18, 10, 15, 7, 12, 15, 7),
            stringsAsFactors = FALSE)

print(provider)

cat("======== After removing duplicate rows ==========", "\n")

provider[!duplicated(provider$service_name),]

Output

   service_id    service_name    service_price
1      21           Netflix           18
2      19           Disney+           10
3      18           HBOMAX            15
4      46           Hulu               7
5      29           Peacock           12
6      18           HBOMAX            15
7      46           Hulu               7

======== After removing duplicate rows ==========

   service_id    service_name    service_price
1      21           Netflix           18
2      19           Disney+           10
3      18           HBOMAX            15
4      46           Hulu               7
5      29           Peacock           12

The resulting data frame contains only five rows, which means the two duplicate rows have been removed.

In our example, we removed the duplicate rows based on the service_name column, but you can deduplicate on any column by passing that column to duplicated() instead.
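The same idea extends to more than one column. As a minimal sketch, you can pass a sub-data-frame of the chosen columns to duplicated(), which then compares those columns row by row. The names key_cols and deduped are illustrative and not part of the original example.

```r
provider <- data.frame(service_id = c(21, 19, 18, 46, 29, 18, 46),
            service_name = c("Netflix", "Disney+", "HBOMAX", "Hulu", "Peacock", "HBOMAX", "Hulu"),
            service_price = c(18, 10, 15, 7, 12, 15, 7),
            stringsAsFactors = FALSE)

# Columns that together decide whether two rows count as duplicates
key_cols <- c("service_id", "service_name")

# duplicated() on a data frame flags rows whose key columns repeat an earlier row
deduped <- provider[!duplicated(provider[, key_cols]), ]

print(deduped)
```

A row is dropped only when every column listed in key_cols matches an earlier row. The other columns are kept as they appear in the first occurrence.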

Keep in mind that the comparison is case-sensitive: duplicated() treats HULU and Hulu as two different values, so rows that differ only in letter case are not flagged as duplicates.
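If values that differ only in case should count as duplicates, one possible workaround is to normalize the case with tolower() before calling duplicated(). The small services data frame below is a made-up illustration, not data from the earlier example.

```r
# Hypothetical two-row data frame whose names differ only in letter case
services <- data.frame(service_id = c(46, 46),
            service_name = c("Hulu", "HULU"),
            stringsAsFactors = FALSE)

# Exact comparison: no duplicates are detected
exact_flags <- duplicated(services$service_name)
print(exact_flags)

# Compare lower-cased names instead; the original spelling is kept in the result
case_insensitive <- services[!duplicated(tolower(services$service_name)), ]
print(case_insensitive)
```

Only the comparison uses the lower-cased values, so the first spelling encountered ("Hulu" here) is the one that stays in the data frame.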

Using unique() function in R

To extract the unique items from a vector, data frame, or array-like object in R, use the unique() function.

rv <- c(11, 21, 46, 21, 19, 18, 19)

unique(rv)

Output

[1] 11 21 46 19 18

You can see that when we apply the unique() function to a vector, it removes the duplicate elements and returns a vector of the unique elements.

Extract unique rows from the data frame in R

unique() is a built-in R function that returns a vector, data frame, or array-like object with the duplicate elements or rows removed. For a data frame, it compares entire rows.

provider <- data.frame(service_id = c(21, 19, 18, 46, 29, 18, 46),
            service_name = c("Netflix", "Disney+", "HBOMAX", "Hulu", "Peacock", "HBOMAX", "Hulu"),
            service_price = c(18, 10, 15, 7, 12, 15, 7),
            stringsAsFactors = FALSE)

print(provider)
cat("======== After extracting unique rows ==========", "\n")
unique(provider)

Output

   service_id    service_name    service_price
1      21           Netflix           18
2      19           Disney+           10
3      18           HBOMAX            15
4      46           Hulu               7
5      29           Peacock           12
6      18           HBOMAX            15
7      46           Hulu               7

======== After extracting unique rows ==========

   service_id    service_name    service_price
1      21           Netflix           18
2      19           Disney+           10
3      18           HBOMAX            15
4      46           Hulu               7
5      29           Peacock           12

And we get the unique rows from the data frame.
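For whole-row comparisons, unique() and !duplicated() should give the same result, so which one you use is mostly a matter of style. The sketch below checks this on the provider data frame and also shows unique() applied to a single column, which returns a plain vector. The names via_unique, via_duplicated, and unique_names are illustrative.

```r
provider <- data.frame(service_id = c(21, 19, 18, 46, 29, 18, 46),
            service_name = c("Netflix", "Disney+", "HBOMAX", "Hulu", "Peacock", "HBOMAX", "Hulu"),
            service_price = c(18, 10, 15, 7, 12, 15, 7),
            stringsAsFactors = FALSE)

# Two ways of dropping rows that repeat an earlier row in every column
via_unique <- unique(provider)
via_duplicated <- provider[!duplicated(provider), ]

# Check that both approaches agree
print(all.equal(via_unique, via_duplicated))

# unique() on one column returns a vector of the distinct values
unique_names <- unique(provider$service_name)
print(unique_names)
```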

dplyr package’s distinct() function

The distinct() is a function of the dplyr package that can keep unique/distinct rows from the data frame. If there are duplicate rows, only the first row is preserved.

If the dplyr package is not installed on your system, install it first by running install.packages("dplyr"). Then you can use the distinct() function.

After installing, load it into your R session using the following code.

library(dplyr)

To get the unique rows from the data frame, use the following code.

provider %>% distinct()

See the complete code below.

library(dplyr)

provider <- data.frame(service_id = c(21, 19, 18, 46, 29, 18, 46),
                service_name = c("Netflix", "Disney+", "HBOMAX", "Hulu", "Peacock", "HBOMAX", "Hulu"),
                service_price = c(18, 10, 15, 7, 12, 15, 7),
                stringsAsFactors = FALSE)

print(provider)

cat("======== Using distinct() method to get unique rows ==========", "\n")

provider %>% distinct()

This returns the same five unique rows as the unique() example in the previous section.

To remove duplicate rows based on a single column (variable), use the following code.

provider %>% distinct(service_price, .keep_all = TRUE)

To remove duplicate rows based on multiple columns (variables), use the following code.

provider %>% distinct(service_price, service_name, .keep_all = TRUE)

It keeps the first row for each unique combination of service_price and service_name. The .keep_all = TRUE argument retains all the other columns in the result.
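To see what .keep_all changes, the sketch below runs distinct() on the same column with and without it. By default .keep_all is FALSE, so only the columns named in the call are returned. The names prices_only and full_rows are illustrative, and the code assumes dplyr is already installed.

```r
library(dplyr)

provider <- data.frame(service_id = c(21, 19, 18, 46, 29, 18, 46),
            service_name = c("Netflix", "Disney+", "HBOMAX", "Hulu", "Peacock", "HBOMAX", "Hulu"),
            service_price = c(18, 10, 15, 7, 12, 15, 7),
            stringsAsFactors = FALSE)

# Default (.keep_all = FALSE): only the service_price column is returned
prices_only <- provider %>% distinct(service_price)
print(names(prices_only))

# .keep_all = TRUE: every column is kept, using the first row for each price
full_rows <- provider %>% distinct(service_price, .keep_all = TRUE)
print(names(full_rows))
```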

Final Words

To remove duplicate elements from a vector or duplicate rows from a data frame, use the base R functions unique() or duplicated(). If you are working with a large data set and need to remove duplicate rows, consider the dplyr package’s distinct() function.
