
summary() Function: Producing Summary Statistics in R

summary() is a generic function that produces summary statistics for many kinds of R objects, including vectors, matrices, data frames, and model objects.

The above figure explains the summary for a data frame with three columns.

summary() produces a different kind of summary depending on the type of object:

  1. If the input is numeric, it returns the minimum, 1st quartile, median, mean, 3rd quartile, and maximum.
  2. If the input is categorical data such as factors, summary() returns a frequency table.
  3. For data frames, it applies the appropriate summary for each column.
  4. For model objects (like lm, glm), it produces a summary of the model fit, including coefficients, residuals, and significance tests.
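The cases listed above can be checked with a short sketch. The inputs are illustrative values, the model uses the built-in cars dataset, and the class names shown assume a reasonably recent R version (3.4 or later).

```r
# Numeric input: six-number summary of class "summaryDefault"
num_summary <- summary(c(2.5, 3.1, 4.8))
class(num_summary)

# Factor input: a named integer vector of level counts
fac_summary <- summary(factor(c("a", "b", "a")))
fac_summary

# Model input: an object of class "summary.lm"
fit <- lm(dist ~ speed, data = cars)  # cars ships with R (datasets package)
class(summary(fit))

# List the summary() methods available in the current session
methods(summary)
```

Because summary() is generic, the method that runs is chosen from the class of the object, which is why the same call gives such different results.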

Syntax

summary(object, ...)

Parameters

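Argument    Description
object      An R object to summarize, such as a vector, a factor, a matrix, a data frame, a list, or a model object.
...         Additional arguments affecting the summary produced; which ones apply depends on the method.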
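Extra arguments after object are handed to the method that does the work. As one example, the factor method accepts maxsum, which limits how many levels are listed; the factor below is made-up data for illustration.

```r
# A factor with five levels and unequal counts
plan <- factor(c("a", "a", "a", "b", "b", "c", "d", "e"))

# Default: one count per level
summary(plan)

# maxsum = 3: keep the two most frequent levels and pool the rest as "(Other)"
pooled <- summary(plan, maxsum = 3)
pooled
```

Other methods take different arguments, so it is worth checking the help page of the specific method (for example ?summary.data.frame) before relying on one.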

Summary of data frame

To summarize a data frame, pass it to summary(), which summarizes each column according to its type (numeric, factor, or character).

df <- data.frame(
  service_id = 1:5,
  service_name = c("Netflix", "Disney+", "HBOMAX", "Hulu", "Peacock"),
  service_price = c(18, 10, 15, 7, 12),
  stringsAsFactors = FALSE
)

summary(df)

summary(df) returns a column-wise summary of the data frame, with one block of output for each of the three columns.

The first column, service_id, is an integer vector, so its summary lists the minimum, first quartile, median, mean, third quartile, and maximum.

The second column, service_name, is a character vector, so its summary reports only Length, Class, and Mode.

The third column, service_price, is numeric, so it gets the same six statistics as the first column, computed from its own values.
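The per-column behavior can be confirmed by inspecting the column classes and summarizing columns one at a time. This is a minimal sketch that rebuilds the same data frame; the factor conversion at the end is an added illustration, not part of the original example.

```r
df <- data.frame(
  service_id = 1:5,
  service_name = c("Netflix", "Disney+", "HBOMAX", "Hulu", "Peacock"),
  service_price = c(18, 10, 15, 7, 12),
  stringsAsFactors = FALSE
)

# Column classes decide which kind of summary each column gets
col_classes <- sapply(df, class)
col_classes

# Summaries of individual columns
price_summary <- summary(df$service_price)  # six-number summary
price_summary
summary(df$service_name)                    # Length, Class, Mode

# Converting the character column to a factor yields level counts instead
name_counts <- summary(factor(df$service_name))
name_counts
```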

Vector

 

For a numeric vector with no NA values, summary() returns the minimum, first quartile (Q1), median, mean, third quartile (Q3), and maximum.

vec <- 1:5

summary(vec)

# Output:
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#       1       2       3       3       4       5
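The six values are the default quantiles plus the mean. The sketch below rebuilds them by hand with quantile() and mean(); the variable names manual and s are only for illustration.

```r
vec <- 1:5

# summary() on a numeric vector combines quantile() and mean()
five_numbers <- quantile(vec)   # 0%, 25%, 50%, 75%, 100%
average <- mean(vec)

manual <- c(five_numbers[1:3], average, five_numbers[4:5])
names(manual) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
manual

# Individual statistics can be pulled out of the result by name
s <- summary(vec)
s[["Median"]]
```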

Vector with NA values

If a vector contains missing values (NA), summary() also reports how many there are in an extra NA's entry.

vec_with_na <- c(1, 2, NA, 4, 5, NA)

summary(vec_with_na)

# Output:
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
#    1.00    1.75    3.00    3.00    4.25    5.00       2

The NA's entry shows that the input vector contains two missing values. The remaining statistics are computed from the four non-missing values only.
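This differs from calling mean() or median() directly, which return NA unless na.rm = TRUE is supplied. A small comparison on the same vector:

```r
vec_with_na <- c(1, 2, NA, 4, 5, NA)

# mean() propagates NA unless na.rm = TRUE is given
mean(vec_with_na)                # NA
mean(vec_with_na, na.rm = TRUE)  # 3

# summary() drops the NAs for the statistics and counts them separately
s <- summary(vec_with_na)
s[["Mean"]]   # 3
s[["NA's"]]   # 2

# The same count, computed directly
sum(is.na(vec_with_na))          # 2
```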

Empty vector

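For an empty numeric vector there is nothing to summarize, so every statistic is NA except the mean, which is NaN.

summary(numeric(0))

# Output:
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#      NA      NA      NA     NaN      NA      NA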

Factor / Categorical data

When you pass a factor to summary(), it returns a frequency table with the count of each level of the factor.

gender_factor <- factor(c("male", "female", "female", "male", "female"))

summary(gender_factor)

# Output:
# female   male
#      3      2

The above output shows that female appears 3 times and male appears 2 times in the factor.
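Counts are reported per level, not per observed value, so a level that never occurs still appears with a count of 0. The sketch below adds a hypothetical third level, "other", to show this, and contrasts the factor with a plain character vector.

```r
gender_factor <- factor(
  c("male", "female", "female", "male", "female"),
  levels = c("female", "male", "other")  # "other" is a hypothetical unused level
)

# Unused levels are still reported, with a count of 0
counts <- summary(gender_factor)
counts

# table() gives the same counts as a table object
table(gender_factor)

# A plain character vector is not counted; it gets Length, Class, Mode
summary(c("male", "female", "female"))
```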

List

For a list, summary() returns a table with one row per element and the columns Length, Class, and Mode.

 

vec <- 1:5

# Named lst rather than list to avoid masking the base list() function
lst <- list(vec)

summary(lst)

# Output:
#      Length Class  Mode
# [1,] 5      -none- numeric
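Since a list summary only describes the elements, statistics for each element need a separate step. One way, sketched here with an illustrative named list, is to apply summary() element-wise with lapply().

```r
# A named list with elements of different types (names are illustrative)
lst <- list(ids = 1:5, label = "streaming", prices = c(18, 10, 15, 7, 12))

# One row per element; the element names become the row labels
list_summary <- summary(lst)
list_summary

# To get statistics for each element instead, apply summary() element-wise
per_element <- lapply(lst, summary)
per_element$prices
```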

Matrix

For a matrix, summary() works column-wise, just as it does for a data frame: a matrix with two columns produces two summaries, one per column.

rv <- c(11, 18, 19, 21)

mtrx <- matrix(rv, nrow = 2, ncol = 2)

summary(mtrx)
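The result of summary(mtrx) is a formatted table intended for printing. When the numbers themselves are needed, a possible alternative is to apply summary() over the columns, as in this sketch (col_stats is an illustrative name).

```r
rv <- c(11, 18, 19, 21)
mtrx <- matrix(rv, nrow = 2, ncol = 2)  # filled column by column

# summary() labels unnamed matrix columns V1, V2, ...
summary(mtrx)

# apply() over columns returns the same statistics as a numeric matrix,
# which is easier to index programmatically
col_stats <- apply(mtrx, 2, summary)
col_stats["Mean", ]     # 14.5 20.0
col_stats["Median", 2]  # 20
```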

Summary of the linear regression model

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered an explanatory variable, and the other is a dependent variable.

A common use of summary() is to report fit statistics for statistical models.

set.seed(93274)

l_x <- rnorm(1000)
l_y <- rnorm(1000) + l_x

mod <- lm(l_y ~ l_x)

summary(mod)

Output

Call:
lm(formula = l_y ~ l_x)

Residuals:
    Min      1Q  Median      3Q     Max
-3.7337 -0.6964 -0.0047  0.7333  3.3489

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.02159    0.03292  -0.656    0.512
l_x          1.00156    0.03262  30.707   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.041 on 998 degrees of freedom
Multiple R-squared: 0.4858, Adjusted R-squared: 0.4853
F-statistic: 942.9 on 1 and 998 DF, p-value: < 2.2e-16

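The summary of the regression model reports the residual quantiles, the coefficient estimates with their standard errors and p-values, the residual standard error, R-squared, and the F-statistic.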
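The printed report is convenient to read, but the same values can also be extracted from the summary object for further use. A minimal sketch, refitting the same model; mod_summary and coef_table are illustrative names.

```r
set.seed(93274)

l_x <- rnorm(1000)
l_y <- rnorm(1000) + l_x

mod <- lm(l_y ~ l_x)
mod_summary <- summary(mod)

# Coefficient table as a numeric matrix: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(mod_summary)
coef_table["l_x", "Estimate"]

# Goodness-of-fit values stored on the summary object
mod_summary$r.squared
mod_summary$adj.r.squared
mod_summary$sigma        # residual standard error

# With a single predictor, the F-statistic equals the squared slope t value
mod_summary$fstatistic[["value"]]
coef_table["l_x", "t value"]^2
```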

For more detailed or specific summaries, other functions like str(), table(), or specialized packages for statistical modeling might be necessary.
