How to Calculate Variance in R

To calculate the sample variance (a measure of spread) in R, use the built-in var() function. It divides by n − 1 rather than n (where n is the sample size), which gives an unbiased estimate. It does not calculate the population variance.

Variance measures how spread out a set of numbers is around the mean (average). A small variance means the data points lie close to the mean. A large variance means they are spread far from it.

A variance of 0 means that all the data points are identical. Variance can never be negative.
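As a quick check of this definition, the sketch below computes the sample variance by hand (sum of squared deviations from the mean, divided by n − 1) and compares it with var(). The data values and variable names are illustrative only.

```r
# Illustrative data; any numeric vector would do
x <- c(60, 55, 50, 65, 59)
n <- length(x)

# Sample variance by hand: squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

manual_var                     # 31.7
all.equal(manual_var, var(x))  # TRUE

# Identical values have no spread, so the variance is 0
constant_var <- var(c(7, 7, 7, 7))
constant_var                   # 0
```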

Syntax

var(x, y = NULL, na.rm = FALSE, use)

Parameters

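| Name | Description |
|---|---|
| x | A numeric vector, matrix, or data frame. |
| y | An optional second vector, matrix, or data frame for the covariance calculation. Defaults to NULL. |
| na.rm | Logical. FALSE by default; if TRUE, missing values (NA) are removed before the calculation. |
| use | An optional character string that specifies how to handle missing values when computing covariances: "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs". |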
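To illustrate how na.rm and use interact for matrix input, here is a minimal sketch. The matrix m and its values are made up for the example.

```r
# Illustrative matrix with one missing value in column "a"
m <- cbind(a = c(1, 2, 3, 4, NA), b = c(2, 4, 6, 8, 10))

# Default behaviour: every entry that involves column "a" is NA
cov_default <- var(m)

# Use only the complete rows (row 5 is dropped)
cov_complete <- var(m, use = "complete.obs")

# For this input, na.rm = TRUE gives the same result
cov_narm <- var(m, na.rm = TRUE)

cov_default
cov_complete
all.equal(cov_complete, cov_narm)  # TRUE
```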

Return value

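For a numeric vector, var() returns a single number: the sample variance. If y is also supplied, it returns the covariance between x and y. If x is a matrix or a data frame with more than one column, it returns the covariance matrix of the columns, with each column's variance on the diagonal.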
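The different return shapes can be seen in a short sketch. The vectors x and y and the data frame df are illustrative names, not part of any dataset used in this article.

```r
# Illustrative vectors; y is exactly twice x
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)

# Single vector: one number (the sample variance)
var_x <- var(x)      # 2.5

# Two vectors: the covariance between them
cov_xy <- var(x, y)  # 5

# Data frame with two numeric columns: a 2 x 2 covariance matrix
df <- data.frame(x = x, y = y)
cov_matrix <- var(df)

var_x
cov_xy
cov_matrix
```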

Variance of a numeric vector

vec <- c(60, 55, 50, 65, 59)

var(vec)

# Output: [1] 31.7

Handling NA values

If your data contains NA values, var() returns NA.

vec <- c(60, 55, 50, NA, 59)

var(vec)

# Output: [1] NA

To exclude them from the calculation, pass na.rm = TRUE to the function.

vec <- c(60, 55, 50, NA, 59)

var(vec, na.rm = TRUE)

# Output: [1] 20.66667

Using a real-life dataset

Step 1: Install the libraries

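We need the “tidyverse” package for this small project. It already includes “ggplot2”, so installing ggplot2 separately is only necessary if you do not install tidyverse.

install.packages("tidyverse")

# Optional: ggplot2 is already installed as part of tidyverse
install.packages("ggplot2")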

Step 2: Import and Load the Dataset

We will use Kaggle’s Google Stock Data.

library(tidyverse)
library(ggplot2)

google_data <- read_csv("./DataSets/GOOGL.csv")

head(google_data)

Step 3: Calculation of Variance

# Calculate variance for each numerical column
variance_data <- sapply(select(google_data, -Date), var)

variance_data
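If the CSV file is not at hand, the same step can be tried on a small made-up data frame. In this sketch, prices and all of its values are hypothetical, and Filter(is.numeric, ...) is a base R alternative to select() that keeps only the numeric columns.

```r
# Hypothetical stand-in for the stock data: a Date column plus numeric columns
prices <- data.frame(
  Date   = as.Date("2024-01-01") + 0:4,
  Open   = c(100, 102, 101, 105, 107),
  Close  = c(101, 101, 104, 106, 108),
  Volume = c(1200, 1500, 900, 2000, 1700)
)

# Keep only the numeric columns (the Date column is dropped)
numeric_cols <- Filter(is.numeric, prices)

# Sample variance of each remaining column, returned as a named vector
variance_data <- sapply(numeric_cols, var)

variance_data
```

Filtering by type avoids having to name the non-numeric columns one by one.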

Step 4: Visualize the Variance

To visualize the variance data, we can create a bar plot in which the x-axis shows the features (columns) of the dataset and the y-axis shows their variance values. This makes it easy to compare the variance across features.

variance_data <- sapply(select(google_data, -Date), var)

# Convert the named vector variance_data into a data frame for ggplot
variance_df <- data.frame(
  Features = names(variance_data),
  Variance = unname(variance_data)
)

# Plot the variance of each feature, largest first
ggplot(variance_df, aes(x = reorder(Features, -Variance), y = Variance)) +
  geom_col(fill = "skyblue") +
  labs(
    title = "Variance of Google Stock Features",
    x = "Features", y = "Variance Value"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Visualizing the data distribution can provide insights into its spread and central tendencies.

Histograms and boxplots are commonly used for this purpose.

# Plot a histogram for each numerical column
google_data %>%
  select(-Date) %>%
  # Reshape to long format: one row per (feature, value) pair
  pivot_longer(cols = everything(), names_to = "Features", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(fill = "blue", color = "black", bins = 50) +
  facet_wrap(~ Features, scales = "free_x") +
  labs(title = "Distribution of Google Stock Features", x = "Value", y = "Frequency")
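Boxplots are mentioned above but not shown, so here is a minimal sketch of one. It assumes a small made-up data frame (prices, with hypothetical values) and uses base R's boxplot() rather than ggplot2 so that it runs without extra packages.

```r
# Hypothetical price columns standing in for the real dataset
prices <- data.frame(
  Open  = c(100, 102, 101, 105, 107, 103, 110),
  Close = c(101, 101, 104, 106, 108, 102, 111)
)

# One box per column; the height of each box reflects the spread of that column
bp <- boxplot(
  prices,
  main = "Distribution of price columns (illustrative data)",
  ylab = "Value"
)

# boxplot() also returns the summary statistics it plotted
bp$names
```

A taller box and longer whiskers indicate a more spread-out column, which corresponds to a larger variance.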

Sample Variance vs. Population Variance

The difference between sample variance and population variance lies in the denominator of the calculation.

Population variance is calculated from data that covers the entire population and divides the sum of squared deviations by n, the number of elements. Sample variance is calculated from a sample drawn from that population and divides by n − 1 instead.

This n − 1 correction (Bessel's correction) barely matters for large samples, but it makes a noticeable difference when the sample is small.
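To see why the correction fades with sample size, note that dividing by n instead of n − 1 scales the result by (n − 1) / n. The sketch below tabulates that ratio for a few arbitrarily chosen sample sizes.

```r
# Arbitrary sample sizes chosen for illustration
n <- c(2, 5, 10, 100, 10000)

# Ratio of population-style variance (divide by n) to sample variance (divide by n - 1)
ratio <- (n - 1) / n

correction_table <- data.frame(n = n, ratio = ratio)
correction_table
# ratio: 0.5, 0.8, 0.9, 0.99, 0.9999
```

With only 2 observations the two values differ by a factor of two, while with 10,000 observations they are almost the same.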

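The formula of sample variance

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

The formula of population variance

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$

Here, $\bar{x}$ is the sample mean, $\mu$ is the population mean, and n is the number of elements.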

To calculate the population variance (where the division is by n), you need to do the calculation manually, because var() always divides by n − 1:

mean((x - mean(x)) ^ 2)

Here is a code example of this:

# Population variance: mean of the squared deviations from the mean (divides by n)
population_variance <- function(rv) {
  mean((rv - mean(rv))^2)
}

weights <- c(60, 55, 50, 65, 59)
population_variance(weights)

Output

[1] 25.36
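Another way to get the same number is to rescale the result of var() by (n − 1) / n. This is only a sketch; the function name population_variance_from_var is hypothetical, and it assumes the input has no missing values.

```r
# Hypothetical helper: rescale the sample variance to a population variance
population_variance_from_var <- function(x) {
  n <- length(x)
  var(x) * (n - 1) / n
}

weights <- c(60, 55, 50, 65, 59)

rescaled <- population_variance_from_var(weights)
rescaled  # 25.36

# Agrees with the direct definition: mean of squared deviations
all.equal(rescaled, mean((weights - mean(weights))^2))  # TRUE
```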

That’s it.
