[R package] Compute Cumulative Summaries of Grouped Data (Feat. datacume)

[R package] Compute Cumulative Summaries of Grouped Data (Feat. datacume)

When analyzing data, we sometimes need to analyze cumulative data. When calculating cumulative data, grouping is important, and it takes time to perform the grouping and calculations. To simplify this process, I developed an R package called datacume().

Let’s upload the dataset.

if(!require(readr)) install.packages("readr")
library(readr)

github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/refs/heads/main/biomass_cumulative.csv"
df=data.frame(read_csv(url(github),show_col_types = FALSE))
set.seed(100)

print(df[sample(nrow(df),5),])
    Treatment Block Plant Branch Days Biomass
503   Control   III     5      1   66  360.73
358   Control     I     5      4   59  345.46
470   Control     I     8      3   66  403.13
516   Control     I     1      4   73  565.40
98    Drought   III     3      2   59  162.15
.
.
.

This dataset contains biomass measurements across different treatments over time, recorded for various plants and their branches. I want to calculate the cumulative biomass over time. To do this accurately, I’ll first calculate the average biomass of the branches within each plant, since the number of branches varies between plants. This step ensures consistency before calculating the cumulative biomass.

First, let’s start by using Excel. We’ll use a PivotTable to calculate the average biomass of the branches for each plant.

After that, we’ll calculate the cumulative biomass over time (by day).

Next, I’ll calculate another average — this time by Treatment and Day — using the cumulative biomass data. This will show how biomass accumulates over time across different treatments. This step provides a clear overview of the biomass accumulation patterns for each treatment group over time.

This process can be time-consuming, which is why the datacume() package helps streamline the workflow and save time.

1. Install package

if(!require(remotes)) install.packages("remotes")
if (!requireNamespace("datacume", quietly = TRUE)) {
  remotes::install_github("agronomy4future/datacume", force= TRUE)
}
library(remotes)
library(datacume)

After installing the package, you can type ?datacume in the R console to view detailed information and documentation about the package.

2. Run the package

Let’s run the package.

if(!require(dplyr)) install.packages("dplyr")
library(dplyr)

dataA= datacume(
              data= df,
              group_vars= c("Treatment", "Block", "Plant"),
              time_var= Days,
              response_vars= c("Biomass"))

print(dataA[sample(nrow(dataA),5),])
  Treatment Block Plant  Days Biomass Cumulative_Biomass
1 Control   III       1    52    309.               449.
2 Drought   III       2    66    168.               292.
3 Control   II        4    59    207.               567.
4 Control   I         8    59    361.               996.
5 Control   III       5    59    423.               575.
.
.
.

This code performs the same steps as the PivotTable process. Additionally, the function calculates cumulative biomass based on the Days variable.

Next, I’ll calculate the average of the cumulative biomass grouped by Treatment and Days.

summary=data.frame(dataA %>%
                     group_by(Treatment, Days) %>%
                     dplyr::summarize(across(c(Cumulative_Biomass),
                                 .fns= list(Mean=~mean(., na.rm= TRUE),
                                   SD= ~sd(., na.rm= TRUE),
                                    n=~length(.),
                                   se=~sd(.,na.rm= TRUE) / sqrt(length(.))))))%>%
  ungroup()

print(summary[sample(nrow(summary),5),])
   Treatment Days Cumulative_Biomass_Mean Cumulative_Biomass_SD
5    Control   73               1445.3170              331.1185
3    Control   59                723.4944              231.9299
10   Drought   73                641.8063              168.4534
4    Control   66               1106.6503              263.5810
8    Drought   59                341.1837              149.3478
   Cumulative_Biomass_n Cumulative_Biomass_se
5                    29              61.48716
3                    29              43.06831
10                   27              32.41887
4                    26              51.69249
8                    26              29.28952

Let’s compare this result with the one calculated in Excel. The results are the same, which confirms that the code accurately calculates the cumulative values.

3. Create figure

if(!require(ggplot2)) install.packages("ggplot2")
library(ggplot2)

Fig1= ggplot() +
  geom_jitter(data= dataA,
              aes(x= as.factor(Days), y= Cumulative_Biomass, fill= Treatment, shape=Treatment),
              width=0.2, alpha=0.5, size=2, color="grey75") +
  geom_errorbar(data= summary, aes(x=as.factor(Days), y= Cumulative_Biomass_Mean,
                ymin= Cumulative_Biomass_Mean - Cumulative_Biomass_se,
                ymax= Cumulative_Biomass_Mean + Cumulative_Biomass_se),
                width=0.2, linewidth=0.5, color="black") +
  geom_line(data= summary, aes(x=as.factor(Days), y=Cumulative_Biomass_Mean,
            color=Treatment, group=Treatment)) +
  geom_point(data= summary, aes(x=as.factor(Days), y=Cumulative_Biomass_Mean,
             fill=Treatment, shape=Treatment), size=2) +
  scale_color_manual(values= c("darkred","cadetblue"))+
  scale_fill_manual(values= c("darkred","cadetblue"))+
  scale_shape_manual(values= c(24,21))+
  scale_y_continuous(breaks = seq(0,3000,500), limits = c(0,3000)) +
  labs(x="Days from planting", y="Cumulative biomass (g/plant)") +
  theme_classic(base_size=18, base_family="serif")+
  theme(legend.position=c(0.85,0.88),
        legend.title=element_blank(),
        legend.key.size=unit(0.5,'cm'),
        legend.key=element_rect(color=alpha("white",.05), fill=alpha("white",.05)),
        legend.text=element_text(size=15),
        legend.background= element_rect(fill=alpha("white",.05)),
        panel.border= element_rect(color="black", fill=NA, linewidth=0.5),
        panel.grid.major= element_line(color="grey90", linetype="dashed"),
        axis.line=element_line(linewidth=0.5, colour="black"))

options(repr.plot.width=5.5, repr.plot.height=5)
print(Fig1)

ggsave("Fig1.png", plot= Fig1, width=5.5, height= 5, dpi= 300)

How about different grouping?

In the previous case, I calculated the average biomass of branches within each plant. Now, I’ll calculate the average biomass across plants to reduce data variability. This means the data will be grouped by Treatment and Block.

if(!require(dplyr)) install.packages("dplyr")
library(dplyr)

dataB= datacume(
              data= df,
              group_vars= c("Treatment", "Block"),
              time_var= Days,
              response_vars= c("Biomass"))

print(dataB[sample(nrow(dataB),5),])  
  Treatment Block  Days Biomass Cumulative_Biomass
1 Control   IV       52    398.               553.
2 Drought   I        73    196.               930.
3 Drought   IV       52    138.               269.
4 Control   II       73    402.              1775.
5 Control   IV       66    469.              1533.   

If I create a figure based on this grouped dataset, it will look like the example below.

Averaging more variables reduces data variability, and this is a key concept in statistics and data analysis. Averaging helps eliminate random fluctuations or “noise” in the data, making the underlying trend more visible. Especially in small samples, individual measurements may be erratic. Averaging smooths them out, giving a more reliable estimate. On the other hand, averaging hides the differences between data points. Important variations or outliers may be lost. Also, when different groups or conditions are averaged together, you can mask important subgroup trends.

The left figure shows the average number of branches and cumulative biomass over time (by days), while the right figure shows the average number of plants and cumulative biomass. When averaging by branches, the cumulative biomass was calculated for each plant within blocks for each treatment. When averaging by plants, the cumulative biomass was calculated for each block within each treatment. By grouping the data pattern is slighly different.

full code: https://github.com/agronomy4future/r_code/blob/main/Compute_Cumulative_Summaries_of_Grouped_Data.ipynb

We aim to develop open-source code for agronomy ([email protected])

© 2022 – 2025 https://agronomy4future.com – All Rights Reserved.

Last Updated: 07/Sep/2025

Comments are closed.