Efficient Multivariate Summary in R: A Guide to Analyzing Multiple Independent Variables
In my previous post, I introduced how to summarize data, such as mean, standard deviation, and standard error. However, at that moment, I demonstrated how to summarize only one variable.
□ Streamlined Data Summary in R STUDIO: Enhancing Bar Graphs with Error Bars
Now, let’s discuss this further with a dataset.
if(!require(readr)) install.packages("readr")
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/yield%20component_nitrogen.csv"
dataA= data.frame(read_csv(url(github), show_col_types= FALSE))
head(dataA, 3)
Genotype Block Nitrogen GN AGW Yield
1 CV1 I N1 21488 43.9 902.4
2 CV1 II N1 23707 41.7 944.9
3 CV1 III N1 20817 45.6 907.7
.
.
.
I would like to summarize the Yield data, including the mean, standard deviation, and standard error. I’ll use ddply()
dataB= ddply (dataA, c("Genotype","Nitrogen"), summarise, mean=mean(Yield),
sd=sd(Yield), n=length(Yield), se=sd/sqrt(n))
dataB
Genotype Nitrogen mean sd n se
1 CV1 N0 475.9667 146.39646 3 84.52204
2 CV1 N1 918.3333 23.15952 3 13.37115
3 CV2 N0 603.8000 41.99190 3 24.24404
4 CV2 N1 865.1667 229.03985 3 132.23622
5 CV3 N0 485.1667 105.48423 3 60.90135
6 CV3 N1 960.6333 84.80627 3 48.96292
7 CV4 N0 640.1000 47.77457 3 27.58266
8 CV4 N1 846.5000 80.20916 3 46.30878
9 CV5 N0 701.0000 86.74001 3 50.07937
10 CV5 N1 966.4667 63.15262 3 36.46118
Now, I also want to summarize variables GN and AGW. Do you think it is a good approach to summarize all variables?
dataC= ddply (dataA, c("Genotype","Nitrogen"), summarise, mean=mean(GN),
sd=sd(GN), n=length(GN), se=sd/sqrt(n))
dataD= ddply (dataA, c("Genotype","Nitrogen"), summarise, mean=mean(AGW),
sd=sd(AGW), n=length(AGW), se=sd/sqrt(n))
dataB$GN=dataC$mean
dataB$GN_se=dataC$se
dataB$AGW=dataD$mean
dataB$AGW_se=dataD$se
colnames(dataB)[3]=c("Yield")
colnames(dataB)[6]=c("Yield_se")
dataB
Genotype Nitrogen Yield sd n Yield_se GN GN_se AGW AGW_se
CV1 N0 475.9667 146.39646 3 84.52204 10250.33 1659.8340 46.93333 0.71724783
CV1 N1 918.3333 23.15952 3 13.37115 22004.00 873.2539 43.73333 1.12891295
CV2 N0 603.8000 41.99190 3 24.24404 11618.33 498.8525 52.06667 0.06666667
CV2 N1 865.1667 229.03985 3 132.23622 17975.67 2172.0740 48.90000 1.47986486
CV3 N0 485.1667 105.48423 3 60.90135 11677.00 1202.7329 43.13333 1.18649812
CV3 N1 960.6333 84.80627 3 48.96292 24803.00 922.3407 41.60000 0.55677644
CV4 N0 640.1000 47.77457 3 27.58266 13364.00 748.6497 49.13333 1.76477887
CV4 N1 846.5000 80.20916 3 46.30878 17466.33 732.2956 50.50000 0.77674535
CV5 N0 701.0000 86.74001 3 50.07937 15419.00 730.9747 45.96667 0.72188026
CV5 N1 966.4667 63.15262 3 36.46118 26012.33 1237.1395 39.10000 1.20554275
It seems alright, but we can save time using a much simpler code. I’ll introduce how to summarize multiple variables at once. I’ll use the following code, dplyr::summarize().
library(dplyr)
dataB= data.frame(dataA %>%
group_by(Genotype, Nitrogen) %>%
dplyr::summarize(across(c(GN, AGW, Yield),
.fns = list(Mean = mean,
SD = sd,
n = length,
se = ~ sd(.)/sqrt(length(.))))))

How about when missing values exist? Here is another example.
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/fertilizer_fungicide_treatment_with_missing_values.csv"
dataA= data.frame(read_csv(url(github), show_col_types= FALSE))
head(dataA, 3)
Fertilizer Fungicide Yield Height
1 Control Yes 12.2 45
2 Control Yes 12.4 NA
3 Control Yes 11.9 42
.
.
.
In this data, there are some missing values. I’ll use the same code, dplyr::summarize().
library(dplyr)
dataB= data.frame(dataA %>%
group_by(Fertilizer, Fungicide) %>%
dplyr::summarize(across(c(Yield, Height),
.fns= list(Mean= mean,
SD= sd,
n= length,
se= ~ sd(.)/sqrt(length(.))))))
When there are missing values, the mean and standard deviation and standard error cannot be calculated. In this case, we can add na.rm= TRUE to calculate the mean and others even though missing values exist.
library(dplyr)
dataB = data.frame(dataA %>%
group_by(Fertilizer, Fungicide) %>%
dplyr::summarize(across(c(Yield, Height),
.fns= list(Mean=~mean(., na.rm= TRUE),
SD= ~sd(., na.rm= TRUE),
n=~length(.),
se=~sd(.,na.rm= TRUE) / sqrt(length(.))))))

We aim to develop open-source code for agronomy ([email protected])
© 2022 – 2025 https://agronomy4future.com – All Rights Reserved.
Last Updated: 11/12/2023