The aggregate() function in R splits a data frame into subsets defined by the by argument, applies the function given in the FUN argument to each column of every subset, and returns the results as a data frame.
aggregate in R
To aggregate data in R, use the aggregate() function. aggregate() is a built-in function (part of the stats package) that splits the data into subsets, computes a summary statistic for each subset, and returns the result in a convenient form, typically a data frame.
The aggregate() function accepts the by parameter, which must be a list. However, because data frames are treated as (named) lists of columns, one or more columns of a data frame can also be passed as the by parameter.
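To make the point about by concrete, here is a minimal sketch. The data frame scores and its columns are hypothetical and exist only for illustration. It passes the grouping variable once as an explicit named list and once as a one-column data frame, and the two calls should produce the same result.

```r
# Hypothetical toy data, used only for illustration.
scores <- data.frame(
  group = c("a", "a", "b", "b"),
  score = c(1, 3, 10, 20)
)

# by given as an explicit named list
by_list <- aggregate(scores["score"], by = list(group = scores$group), FUN = mean)

# by given as a one-column data frame; single-bracket indexing keeps it a list
by_cols <- aggregate(scores["score"], by = scores["group"], FUN = mean)

print(by_list)

# Check that both forms of by give the same aggregated data frame
print(isTRUE(all.equal(by_list, by_cols)))
```

Note that scores["group"] (single brackets) is a data frame and therefore a list, whereas scores$group alone is a plain vector and would have to be wrapped in list().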
The most basic use of aggregate() is with base summary functions such as mean and sd. Comparing the mean or another property across sample groups is one of its most common applications.
To aggregate data in R, we need to specify three things in the code:
- The data that we want to aggregate.
- The variable to group by within the data.
- The calculation function to apply to the groups.
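The three ingredients listed above can be sketched as follows. The data frame samples and its values are made up for illustration. The example compares the mean and standard deviation of two groups.

```r
# Hypothetical measurements for two sample groups (illustrative values only).
samples <- data.frame(
  group  = c("control", "control", "control", "treated", "treated", "treated"),
  height = c(10, 12, 14, 20, 22, 24)
)

# 1. the data to aggregate:   samples["height"]
# 2. the variable to group by: samples$group
# 3. the function to apply:    mean, then sd
group_means <- aggregate(samples["height"], by = list(group = samples$group), FUN = mean)
group_sds   <- aggregate(samples["height"], by = list(group = samples$group), FUN = sd)

print(group_means)
print(group_sds)
```

Passing samples["height"] rather than the whole data frame keeps the character column group out of the calculation, so mean and sd are applied only to numeric data.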
Let’s see the syntax of the aggregate() function.
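# Data frame method
aggregate(x, by, FUN, ..., simplify = TRUE, drop = TRUE)
# Formula method
aggregate(formula, data, FUN, ..., subset, na.action = na.omit)
# Time series method
aggregate(x, nfrequency = 1, FUN = sum, ndeltat = 1, ts.eps = getOption("ts.eps"), ...)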
x: It is an R object.
by: It is a list of grouping items, each as long as the variables in the data frame x.
FUN: It is a function to compute the summary statistics.
drop: It is a logical value indicating whether to drop unused combinations of grouping values.
formula: It is a formula, such as y ~ x or cbind(y1, y2) ~ x1 + x2, where the y variables are numeric data to be split into groups according to the grouping x variables.
data: It is a data frame (or list) from which the variables in the formula should be taken.
subset: It is an optional vector defining a subset of observations to be used.
na.action: It is a function that indicates what should happen when the data contain NA values. The default, na.omit, drops incomplete observations.
ndeltat: It is a new fraction of the sampling period between consecutive observations; must be a divisor of the sampling interval of x.
…: They are the further arguments passed to or used by methods.
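As a rough sketch of the formula interface described above, the example below uses the built-in ChickWeight dataset together with a small hypothetical data frame called toy. It shows a single grouping variable, two grouping variables combined with subset, and the default na.action.

```r
# ChickWeight ships with R (datasets package).

# Mean weight for each measurement time
mean_by_time <- aggregate(weight ~ Time, data = ChickWeight, FUN = mean)

# Two grouping variables, restricted to day 10 onwards via subset
mean_by_diet_time <- aggregate(weight ~ Diet + Time, data = ChickWeight,
                               FUN = mean, subset = Time >= 10)

# Hypothetical toy data with a missing value (illustration only).
toy <- data.frame(g = c("a", "a", "b"), y = c(1, NA, 5))

# With the default na.action = na.omit, the row containing NA is dropped
# before FUN is applied, so group "a" is summarised from one observation.
toy_means <- aggregate(y ~ g, data = toy, FUN = mean)

print(head(mean_by_time))
print(head(mean_by_diet_time))
print(toy_means)
```

The left-hand side of the formula names the column to summarise and the right-hand side names the grouping columns, so no explicit by list is needed.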
Let’s create a data frame using the data.frame() function and then apply the aggregate() function to summarize it. We will count how many times each unique value appears in the data frame.
First, define a data frame.
df <- data.frame(value = c(11, 11, 11, 11, 19, 19, 19, 19, 19, 19, 21, 21, 21))
Now, pass the data frame to the aggregate() function. We group by the values themselves and use length as FUN, so the result for each group is the number of rows that hold that value.
df <- data.frame(value = c(11, 11, 11, 11, 19, 19, 19, 19, 19, 19, 21, 21, 21))
total_appearances <- aggregate(x = df, by = list(unique.values = df$value), FUN = length)
Here, we want to find how many times each numeric value appears. See the complete R program.
df <- data.frame(value = c(11, 11, 11, 11, 19, 19, 19, 19, 19, 19, 21, 21, 21))
total_appearances <- aggregate(x = df, by = list(unique.values = df$value), FUN = length)
total_appearances
  unique.values value
1            11     4
2            19     6
3            21     3
As the output shows, there are three unique numeric values: 11, 19, and 21.
The value 11 appears 4 times, 19 appears 6 times, and 21 appears 3 times.
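As a quick cross-check of these counts, the sketch below compares the aggregate() result with base R's table(), which counts occurrences of each distinct value directly. Using table() is an addition for verification only, not part of the original example.

```r
df <- data.frame(value = c(11, 11, 11, 11, 19, 19, 19, 19, 19, 19, 21, 21, 21))
total_appearances <- aggregate(x = df, by = list(unique.values = df$value), FUN = length)

# table() counts how often each distinct value occurs
counts <- table(df$value)
print(counts)

# Compare the counts from both approaches (as.vector drops the names)
counts_match <- all(total_appearances$value == as.vector(counts))
print(counts_match)
```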
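Applying aggregate() to a dataset in R
R comes with many built-in datasets, and one of them is ChickWeight.
Let’s group ChickWeight by the values of the weight column and apply the mean function to every column within each group. Because Chick and Diet are factors, mean() cannot average them and returns NA (with a warning) for those columns.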
aggdata <- aggregate(ChickWeight, by = list(gw = ChickWeight$weight), FUN = mean)
print(head(aggdata))
  gw weight      Time Chick Diet
1 35     35 2.0000000    NA   NA
2 39     39 0.2500000    NA   NA
3 40     40 0.0000000    NA   NA
4 41     41 0.0000000    NA   NA
5 42     42 0.1333333    NA   NA
6 43     43 0.0000000    NA   NA
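The NA entries in the output above come from applying mean to the factor columns Chick and Diet. One possible variation, shown here as a sketch rather than as the original example's intent, is to pass only the numeric weight column as x and group by Diet, which gives the mean weight per diet without any NA columns.

```r
# Summarise only the numeric weight column, grouped by Diet.
mean_weight <- aggregate(
  x   = data.frame(weight = ChickWeight$weight),
  by  = list(Diet = ChickWeight$Diet),
  FUN = mean
)

print(mean_weight)
```

Restricting x to the columns that FUN can handle avoids the warnings and keeps the result limited to the summary of interest.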
That is it for the aggregate() function in R.