Everything You Need to Know About read.table() Function in R

Most real-world data lives outside R, in CSV files, Excel workbooks, plain-text files, or databases. To analyze it, we first need a function that brings it into the R environment, where we can manipulate it further. For delimited text files such as CSV and TSV, the base function read.table() does that job.

read.table()

The read.table() function reads a file in table format and creates a data frame from it, ready for analysis and manipulation. It is the starting point for many data analysis projects that begin with a delimited text file.

Syntax

# Reading a CSV file with a header row
# (colClasses = NA, the default, lets R infer each column's type)
df <- read.table("data.csv", header = TRUE, sep = ",", colClasses = NA, na.strings = "")

# Reading a tab-separated text file without a header
df <- read.table("data.txt", header = FALSE, sep = "\t")

Parameters

Argument	Description
file	A character string giving the path of the input file to read.
header	A logical value. If TRUE, the first line of the file is used as the column names of the data frame.
sep	The separator character that delimits the columns in the file.
colClasses	A character vector giving the class (data type) of each column.
na.strings	A character vector of strings that should be interpreted as missing values (NA).
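To see how these arguments work together without creating a file first, the data can be passed as a string through read.table()'s text argument. This is only a minimal sketch: the inline data and the names raw, id, score, and label are made up for illustration.

```r
# Illustrative inline data; the values are made up for this sketch.
raw <- "id,score,label
1,9.5,a
2,,b
3,7.25,"

df <- read.table(
  text = raw,        # read from a string instead of a file
  header = TRUE,     # the first line holds the column names
  sep = ",",         # columns are comma-delimited
  colClasses = c("integer", "numeric", "character"),
  na.strings = ""    # empty fields become NA
)

str(df)  # shows the class of each column and the first values
```

Reading from a string like this is handy for quick experiments; for real data, pass the file path as the first argument instead.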

Sample dataset

If you do not already have a data file, you can create a small CSV file from R with the command below. Skip this step if you have your own data source.

cat("Name,Age,City\nKrunal,31,Perth\nJane,30,London\nSunita,35,Ahmedabad\n", file = "data.csv")

Output CSV file

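This creates a file named data.csv in the current working directory with the following contents:

Name,Age,City
Krunal,31,Perth
Jane,30,London
Sunita,35,Ahmedabad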

Basic CSV Import with Header

We will now read the sample data.csv file, which has a header row, into R:

# Reading a CSV file
df <- read.table("data.csv", header = TRUE, sep = ",")

print(df)

Output

The result is a data frame with three rows and the columns Name, Age, and City. Because we passed header = TRUE, read.table() treats the first line of the file as column names instead of data. The sep = "," argument sets the comma as the column separator.
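A quick way to confirm that the header and separator were interpreted as intended is to inspect the column names, dimensions, and structure of the result. The sketch below recreates the same sample data in a temporary file; the use of tempfile() and the variable name path are assumptions made only to keep the example self-contained.

```r
# Write the sample data to a temporary file so the sketch does not
# touch the working directory (tempfile() is an illustrative choice).
path <- tempfile(fileext = ".csv")
cat("Name,Age,City\nKrunal,31,Perth\nJane,30,London\nSunita,35,Ahmedabad\n",
    file = path)

df <- read.table(path, header = TRUE, sep = ",")

names(df)  # column names taken from the first line of the file
dim(df)    # number of rows and columns
str(df)    # type that R inferred for each column
```

If names(df) shows V1, V2, V3 instead of the expected names, the header argument was most likely left at FALSE.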

Tab-Separated File (TSV) without Header

Let’s create a TSV (tab-separated values) file without a header and import it using the read.table() function in RStudio.

# Creating a sample TSV file (data.tsv)
cat("John\t25\tNew York\nYogita\t30\tDelhi\nPeter\t22\tParis\n", file = "data.tsv")

# Read the TSV file
df <- read.table("data.tsv", header = FALSE, sep = "\t")

print(df)

Output

In this code, we created a TSV file on the fly, imported it with read.table(), and printed the resulting data frame, all in a few lines.

Because the file has no header row and we passed header = FALSE, R assigns the default column names V1, V2, and V3. The sep = "\t" argument sets the tab character as the separator.
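When a file has no header, meaningful names can be supplied at import time through read.table()'s col.names argument instead of keeping V1, V2, and V3. A minimal sketch, assuming the names Name, Age, and City fit this data (the file itself does not state them) and using a temporary file for self-containment:

```r
# Recreate the headerless TSV in a temporary file.
path <- tempfile(fileext = ".tsv")
cat("John\t25\tNew York\nYogita\t30\tDelhi\nPeter\t22\tParis\n", file = path)

# col.names replaces the default V1, V2, V3 names.
# The names below are assumed for illustration.
df <- read.table(path, header = FALSE, sep = "\t",
                 col.names = c("Name", "Age", "City"))

print(df)
```

Note that "New York" stays in a single column even though it contains a space, because only the tab character is treated as a separator here.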

Specifying column classes

Let’s create a new CSV file on the fly whose columns hold different data types. After importing it as a data frame, we will inspect its structure.

# Create a sample CSV file with mixed data types (data_types.csv)
cat("Name,Age,Salary,IsActive\nJohn,25,50000,TRUE\nJane,30,60000,FALSE\n", file = "data_types.csv")

# Read the CSV with specific column classes
data <- read.table("data_types.csv", header = TRUE, sep = ",",
                    colClasses = c("character", "numeric", "numeric", "logical"))

str(data) # Check the structure of the data frame

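Output

The str() output reports a data frame with 2 observations of 4 variables: Name is character, Age and Salary are numeric, and IsActive is logical. These types come from the colClasses argument, a character vector that specifies the class of each column in order.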
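To see what colClasses actually changes, it can help to read the same file twice, once with R's default type inference and once with explicit classes, and then compare the column classes. This is a sketch only; the variable names inferred and explicit and the temporary file are assumptions made for illustration.

```r
# Recreate the mixed-type CSV in a temporary file.
path <- tempfile(fileext = ".csv")
cat("Name,Age,Salary,IsActive\nJohn,25,50000,TRUE\nJane,30,60000,FALSE\n",
    file = path)

# Let read.table() infer the column types.
inferred <- read.table(path, header = TRUE, sep = ",")

# Specify the type of every column explicitly.
explicit <- read.table(path, header = TRUE, sep = ",",
                       colClasses = c("character", "numeric", "numeric", "logical"))

sapply(inferred, class)  # whole numbers are inferred as integer
sapply(explicit, class)  # Age and Salary are numeric, as requested
```

Setting colClasses also avoids the type-guessing step, which can make reading large files faster and keeps column types stable when the data changes.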

Handling Missing Values

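By default, read.table() treats only the string "NA" as missing, although blank fields in numeric and logical columns are also read as NA. An empty field in a character column stays an empty string. To have empty fields in every column imported as NA, pass na.strings = "" as an argument.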

# Creating a sample CSV file with missing values (missing_data.csv)
cat("Name,Age,City\nJohn,25,\nJane,,London\nPeter,22,Paris\n", file = "missing_data.csv")

# Reading the CSV, specifying the missing value representation
# (use na.strings = c("", "NA", "?") to treat several strings as missing)
df <- read.table("missing_data.csv", header = TRUE, sep = ",", na.strings = "")

print(df)

Output

In the above code, na.strings = "" tells R to interpret empty fields as missing values. As a result, John's empty City is printed as <NA> (the form used for character columns) and Jane's empty Age is printed as NA.
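The effect of na.strings is easiest to see by reading the same file with and without it and then testing for NA. A minimal sketch, with the variable names default_df and na_df and the temporary file assumed for illustration:

```r
# Recreate the CSV with missing values in a temporary file.
path <- tempfile(fileext = ".csv")
cat("Name,Age,City\nJohn,25,\nJane,,London\nPeter,22,Paris\n", file = path)

# Default behaviour: only "NA" and blank numeric/logical fields are missing.
default_df <- read.table(path, header = TRUE, sep = ",")

# Treat empty fields in every column as missing.
na_df <- read.table(path, header = TRUE, sep = ",", na.strings = "")

is.na(default_df$Age)   # the blank Age is already NA by default
is.na(default_df$City)  # the blank City is an empty string, not NA
is.na(na_df$City)       # with na.strings = "", the blank City is NA
colSums(is.na(na_df))   # count of missing values per column
```

Counting missing values per column with colSums(is.na(...)) is a common first check after an import, before deciding how to handle the gaps.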

That’s all for today!

Krunal Lathiya
