Efficient Multivariate Summary in R: A Guide to Analyzing Multiple Independent Variables

Efficient Multivariate Summary in R: A Guide to Analyzing Multiple Independent Variables

In my previous post, I introduced how to summarize data, such as mean, standard deviation, and standard error. However, at that moment, I demonstrated how to summarize only one variable.

Streamlined Data Summary in R STUDIO: Enhancing Bar Graphs with Error Bars

Now, let’s discuss this further with a dataset.

if(!require(readr)) install.packages("readr")
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/yield%20component_nitrogen.csv"
dataA= data.frame(read_csv(url(github), show_col_types= FALSE))

head(dataA, 3)
  Genotype Block Nitrogen    GN  AGW  Yield
1       CV1     I       N1 21488 43.9  902.4
2       CV1    II       N1 23707 41.7  944.9
3       CV1   III       N1 20817 45.6  907.7
.
.
.

I would like to summarize the Yield data, including the mean, standard deviation, and standard error. I’ll use ddply()

dataB= ddply (dataA, c("Genotype","Nitrogen"), summarise, mean=mean(Yield), 
              sd=sd(Yield), n=length(Yield), se=sd/sqrt(n))

dataB
   Genotype Nitrogen     mean        sd n        se
1       CV1       N0 475.9667 146.39646 3  84.52204
2       CV1       N1 918.3333  23.15952 3  13.37115
3       CV2       N0 603.8000  41.99190 3  24.24404
4       CV2       N1 865.1667 229.03985 3 132.23622
5       CV3       N0 485.1667 105.48423 3  60.90135
6       CV3       N1 960.6333  84.80627 3  48.96292
7       CV4       N0 640.1000  47.77457 3  27.58266
8       CV4       N1 846.5000  80.20916 3  46.30878
9       CV5       N0 701.0000  86.74001 3  50.07937
10      CV5       N1 966.4667  63.15262 3  36.46118

Now, I also want to summarize variables GN and AGW. Do you think it is a good approach to summarize all variables?

dataC= ddply (dataA, c("Genotype","Nitrogen"), summarise, mean=mean(GN), 
              sd=sd(GN), n=length(GN), se=sd/sqrt(n))

dataD= ddply (dataA, c("Genotype","Nitrogen"), summarise, mean=mean(AGW), 
              sd=sd(AGW), n=length(AGW), se=sd/sqrt(n))

dataB$GN=dataC$mean 
dataB$GN_se=dataC$se
dataB$AGW=dataD$mean 
dataB$AGW_se=dataD$se
colnames(dataB)[3]=c("Yield")
colnames(dataB)[6]=c("Yield_se")

dataB
Genotype Nitrogen    Yield        sd n  Yield_se       GN     GN_se      AGW     AGW_se
     CV1       N0 475.9667 146.39646 3  84.52204 10250.33 1659.8340 46.93333 0.71724783
     CV1       N1 918.3333  23.15952 3  13.37115 22004.00  873.2539 43.73333 1.12891295
     CV2       N0 603.8000  41.99190 3  24.24404 11618.33  498.8525 52.06667 0.06666667
     CV2       N1 865.1667 229.03985 3 132.23622 17975.67 2172.0740 48.90000 1.47986486
     CV3       N0 485.1667 105.48423 3  60.90135 11677.00 1202.7329 43.13333 1.18649812
     CV3       N1 960.6333  84.80627 3  48.96292 24803.00  922.3407 41.60000 0.55677644
     CV4       N0 640.1000  47.77457 3  27.58266 13364.00  748.6497 49.13333 1.76477887
     CV4       N1 846.5000  80.20916 3  46.30878 17466.33  732.2956 50.50000 0.77674535
     CV5       N0 701.0000  86.74001 3  50.07937 15419.00  730.9747 45.96667 0.72188026
     CV5       N1 966.4667  63.15262 3  36.46118 26012.33 1237.1395 39.10000 1.20554275

It seems alright, but we can save time using a much simpler code. I’ll introduce how to summarize multiple variables at once. I’ll use the following code, dplyr::summarize().

library(dplyr)
dataB= data.frame(dataA %>%
  group_by(Genotype, Nitrogen) %>%
  dplyr::summarize(across(c(GN, AGW, Yield), 
                          .fns = list(Mean = mean, 
                                      SD = sd, 
                                      n = length,
                                      se = ~ sd(.)/sqrt(length(.))))))

How about when missing values exist? Here is another example.

library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/fertilizer_fungicide_treatment_with_missing_values.csv"
dataA= data.frame(read_csv(url(github), show_col_types= FALSE))

head(dataA, 3)
   Fertilizer Fungicide Yield Height
1     Control       Yes  12.2     45
2     Control       Yes  12.4     NA
3     Control       Yes  11.9     42
.
.
.

In this data, there are some missing values. I’ll use the same code, dplyr::summarize().

library(dplyr)
dataB= data.frame(dataA %>%
                  group_by(Fertilizer, Fungicide) %>%
                  dplyr::summarize(across(c(Yield, Height), 
                          .fns= list(Mean= mean, 
                            SD= sd, 
                             n= length,
                            se= ~ sd(.)/sqrt(length(.))))))

When there are missing values, the mean and standard deviation and standard error cannot be calculated. In this case, we can add na.rm= TRUE to calculate the mean and others even though missing values exist.

library(dplyr)
dataB = data.frame(dataA %>%
                   group_by(Fertilizer, Fungicide) %>%
                   dplyr::summarize(across(c(Yield, Height), 
                                     .fns= list(Mean=~mean(., na.rm= TRUE), 
                                      SD= ~sd(., na.rm= TRUE), 
                                      n=~length(.),
                                      se=~sd(.,na.rm= TRUE) / sqrt(length(.))))))

We aim to develop open-source code for agronomy ([email protected])

© 2022 – 2025 https://agronomy4future.com – All Rights Reserved.

Last Updated: 11/12/2023

Comments are closed.