Splitting Strings: A Beginner’s Guide to strsplit() in R

The strsplit() function in R splits elements of a character vector into a list of substrings based on a specified delimiter or regular expression pattern.

The figure above shows a character vector, string, being split into three substrings at the spaces between the words.

string <- "Hello dystopian world"

strsplit(string, split = " ")

# Output:

# [[1]] 

# [1] "Hello" "dystopian" "world"
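
Because strsplit() always returns a list, even for a single input string, the substrings usually have to be pulled out of the first list element. A minimal sketch of that step (the names parts and words are illustrative, not part of the original example):

```r
string <- "Hello dystopian world"

# strsplit() returns a list with one element per input string
parts <- strsplit(string, split = " ")
class(parts)    # "list"
length(parts)   # 1

# [[1]] extracts the character vector of substrings
words <- parts[[1]]
length(words)   # 3
words[2]        # "dystopian"
```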

Syntax

strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)

Parameters

Argument Description
x

An input character vector containing the strings to split.

split

A character vector whose elements are used as the delimiter. By default, each element is interpreted as a regular expression.

It can be a comma, a single character, or any other pattern that separates the pieces in your data.

If it is an empty string (""), each input string is split into individual characters.

fixed

It is a logical argument, either TRUE or FALSE. The default is FALSE.

If TRUE, split is matched exactly as a literal string rather than interpreted as a regular expression. This setting takes priority over perl.

Leave it as FALSE when split is meant to be a regular expression.

perl

It is a logical argument. The default is FALSE.

It enables Perl-compatible regular expressions.

useBytes

It is a logical argument. The default is FALSE.

If you set it to TRUE, matching is done byte by byte rather than character by character.
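
One detail of the split argument that a short sketch can make concrete: when split has more than one element, it is recycled along x, so each input string can get its own delimiter. The vectors below are made up for the illustration.

```r
x <- c("2024-01-15", "10:30:00")

# The first delimiter applies to x[1], the second to x[2]
pieces <- strsplit(x, split = c("-", ":"))

pieces[[1]]  # "2024" "01" "15"
pieces[[2]]  # "10" "30" "00"
```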

Splitting into individual characters

You can split a string into its individual characters by passing an empty string as the delimiter. Because strsplit() returns a list, use the unlist() function to convert the result into a character vector.

The figure above shows a single string being split into individual characters, using an empty string as the delimiter.

string <- "hello"

char_list <- strsplit(string, split = "")

char_vector <- unlist(char_list)

print(char_vector)

# Output

# [1] "h" "e" "l" "l" "o"
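
Once a string is broken into characters, ordinary vector functions apply to it. As a small sketch (the variable names are illustrative), the characters can be reversed and rejoined, or counted:

```r
word <- "hello"
chars <- strsplit(word, split = "")[[1]]

# Reverse the characters and glue them back together
reversed <- paste(rev(chars), collapse = "")
reversed            # "olleh"

# Count how often each character occurs
counts <- table(chars)
counts[["l"]]       # 2
```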

Multiple strings

This function is not limited to a single string; it splits every element of a character vector and returns one list element per input string.

# Multiple strings
strings <- c("one,two,three", "four,five,six")

# Split each string by ','
splitted <- strsplit(strings, split = ",")

# Print the result
print(splitted)

# Output:

# [[1]] 
# [1] "one" "two" "three" 

# [[2]] 
# [1] "four" "five" "six"
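
The list returned for several strings can be processed element by element. The sketch below is one possible way to do that, not the only one: it uses lengths(), vapply(), and do.call(rbind, ...) from base R, and the matrix step assumes every string yields the same number of pieces.

```r
strings <- c("one,two,three", "four,five,six")
parts <- strsplit(strings, split = ",")

# Number of pieces in each input string
lengths(parts)      # 3 3

# Second piece of every string, returned as a character vector
second <- vapply(parts, function(p) p[2], character(1))
second              # "two" "five"

# Stack the pieces into a matrix (assumes equal piece counts)
m <- do.call(rbind, parts)
dim(m)              # 2 3
```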

Passing delimiter

You can pass a delimiter (a symbol or special character) that separates the words or text in the data.

The above figure illustrates how we split a string based on the “&” delimiter.

string <- "Hello&dystopian&world"

strsplit(string, split = "&")

# Output:

# [[1]]
# [1] "Hello" "dystopian" "world"
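
Delimiters at the edges of a string, or doubled delimiters, behave in a way that is easy to overlook: a leading or repeated delimiter produces an empty string in the result, while a trailing delimiter does not. A short sketch with a made-up input:

```r
padded <- "&Hello&&world&"
result <- strsplit(padded, split = "&")[[1]]
result              # "" "Hello" "" "world"

# Drop the empty strings if they are not wanted
non_empty <- result[nzchar(result)]
non_empty           # "Hello" "world"
```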

Regular expression

Regular expressions are a compact and flexible way of describing patterns in strings. You can provide the specific regular expression you need to split the string.

The figure above shows a string being split with a regular expression as the separator: the pattern [0-9]+ matches each run of digits, and the string is split at those matches.

string <- "Hello19dystopian21world"

strsplit(string, split = "[0-9]+")

# Output:
# [[1]] 
# [1] "Hello" "dystopian" "world"
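
A regular expression is also a convenient way to split on several different delimiters at once. In this sketch (the input string is invented for the example), a character class treats any run of commas, semicolons, or spaces as a single separator:

```r
messy <- "red, green;blue   yellow"

# [,; ]+ matches one or more commas, semicolons, or spaces in a row
colours <- strsplit(messy, split = "[,; ]+")[[1]]
colours   # "red" "green" "blue" "yellow"
```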

Passing fixed = TRUE

If you set fixed = TRUE, it tells R to treat the split pattern as a fixed string, not as a regular expression.

fruits <- c("apple, banana, cherry", "orange, peach, grape")

split_fruits <- strsplit(fruits, ", ", fixed = TRUE)

print(split_fruits)

# Output
# [[1]] 
# [1] "apple" "banana" "cherry" 

# [[2]] 
# [1] "orange" "peach" "grape"
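
The difference made by fixed = TRUE is easiest to see when the delimiter is a regex metacharacter. A minimal sketch, using an invented version string and a dot as the delimiter:

```r
version <- "4.3.2"

# "." is a regex wildcard, so every character is treated as a delimiter
as_regex <- strsplit(version, split = ".")[[1]]
as_regex     # five empty strings

# Treat "." literally with fixed = TRUE ...
as_fixed <- strsplit(version, split = ".", fixed = TRUE)[[1]]
as_fixed     # "4" "3" "2"

# ... or escape it in the regular expression
escaped <- strsplit(version, split = "\\.")[[1]]
escaped      # "4" "3" "2"
```

The same applies to other metacharacters such as | or +.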

Handling NA

The strsplit() function handles NA values gracefully: when the input vector contains NA, the output list holds NA at the corresponding position, without raising an error or changing the structure of the result.

input_with_na <- c("data, info", NA, "red,blue")

output_including_na <- strsplit(input_with_na, ",")

print(output_including_na)

# Output:
# [[1]]
# [1] "data" " info"

# [[2]]
# [1] NA

# [[3]]
# [1] "red" "blue"
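
The output above still carries a leading space in " info", and the NA entry may need filtering before further processing. One possible clean-up, sketched with base R's trimws() (the names pieces, cleaned, and complete are illustrative):

```r
input_with_na <- c("data, info", NA, "red,blue")
pieces <- strsplit(input_with_na, ",")

# Remove stray whitespace around each piece; NA stays NA
cleaned <- lapply(pieces, trimws)
cleaned[[1]]          # "data" "info"
is.na(cleaned[[2]])   # TRUE

# Keep only the elements that came from non-missing input
complete <- cleaned[!is.na(input_with_na)]
length(complete)      # 2
```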

Perl-Compatible Regular Expressions

To use a Perl-compatible regular expression (PCRE), pass perl = TRUE.

complex_string <- "audi123bmw456bugatti"

perl_pattern_output <- strsplit(complex_string, "\\d+", perl = TRUE)

print(perl_pattern_output)

# Output
# [[1]]
# [1] "audi" "bmw" "bugatti"

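In this code, the regex \\d+ matches one or more digits, so the string is split at each numeric sequence. This particular pattern also works with the default regex engine; perl = TRUE matters for PCRE-only syntax such as lookaheads.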
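
As a sketch of a pattern that relies on PCRE syntax, the example below uses a negative lookahead to split on commas that are not followed by a space. The record string is invented for the illustration.

```r
record <- "Smith, John,42,New York, NY"

# ,(?! ) matches a comma only when the next character is not a space.
# The lookahead (?! ) is PCRE syntax, hence perl = TRUE.
fields <- strsplit(record, split = ",(?! )", perl = TRUE)[[1]]
fields   # "Smith, John" "42" "New York, NY"
```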

Byte-Level splitting

For non-standard encodings, you might have to split the string at the byte level.

non_standard_string <- "café"

# Splitting the string byte-by-byte
byte_split <- strsplit(non_standard_string, "", useBytes = TRUE)

print(byte_split)

# Output
# [[1]]
# [1] "c" "a" "f" "\xc3" "\xa9"

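In a UTF-8 session, the output looks like the above: é is stored as two bytes (0xC3 0xA9), so it is split into two elements. Byte-level splitting is useful when the input has an unknown or invalid encoding.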
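
To make the character-versus-byte distinction concrete, the sketch below compares the two counts for the same word. It writes é as the escape \u00e9 so that R stores the string as UTF-8; the variable names are illustrative.

```r
# "\u00e9" is é; the escape makes R store the string as UTF-8
word <- "caf\u00e9"

n_chars <- nchar(word, type = "chars")   # 4 characters
n_bytes <- nchar(word, type = "bytes")   # 5 bytes (é takes two bytes in UTF-8)

by_char <- strsplit(word, "")[[1]]
by_byte <- strsplit(word, "", useBytes = TRUE)[[1]]

length(by_char)   # 4
length(by_byte)   # 5
```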

That’s all!
