Data filtering using R Studio

March 5, 2023 JK

When you conduct statistical analysis, you might want to include or exclude part of the data. For example, consider the following dataset.

This dataset shows how yield, grain number (GN) and average grain weight (AGW) differ between two nitrogen fertilizer treatments (N0, N1) in five genotypes (CV1 – CV5). That is, there are 10 treatments [Genotype (5) x Nitrogen (2) = 10]. The replicates are 3 blocks, and therefore there are 30 experimental units [10 treatments x 3 blocks = 30].
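The design arithmetic (5 genotypes x 2 nitrogen levels x 3 blocks) can be checked with a small sketch. The `design` grid below is a hypothetical stand-in that only contains the factor columns, not the measured values.

```r
# Hypothetical design grid: factor columns only, no measurements
design <- expand.grid(
  Block    = c("I", "II", "III"),
  Nitrogen = c("N1", "N0"),
  Genotype = paste0("CV", 1:5),
  stringsAsFactors = FALSE
)

# Number of distinct Genotype x Nitrogen combinations (treatments)
n_treatments <- nrow(unique(design[, c("Genotype", "Nitrogen")]))

# Number of experimental units (treatments x blocks)
n_units <- nrow(design)

print(n_treatments)  # 10
print(n_units)       # 30

# Each Genotype x Nitrogen cell should contain 3 blocks
print(table(design$Genotype, design$Nitrogen))
```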

What if we want to analyze only the N1 condition, or only CV1? I’ll introduce how to filter data in R Studio.

Let’s load the data described above.

# to upload data
if(!require(readr)) install.packages("readr")
library (readr)

github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/yield%20component_nitrogen.csv"
dataA=data.frame(read_csv(url(github),show_col_types=FALSE))

head(dataA, 3)
  Genotype Block Nitrogen    GN  AGW Yield
1      CV1     I       N1 21488 43.9 902.4
2      CV1    II       N1 23707 41.7 944.9
3      CV1   III       N1 20817 45.6 907.7
.
.
.

1. subset

I’d like to filter on one level of a factor. For example, there are two ways to select N1.

N1= subset (dataA, Nitrogen=="N1")
N1= subset (dataA, Nitrogen!="N0")

head(N1, 5)
  Genotype Block Nitrogen    GN  AGW  Yield
1      CV1     I       N1 21488 43.9  902.4
2      CV1    II       N1 23707 41.7  944.9
3      CV1   III       N1 20817 45.6  907.7
7      CV2     I       N1 22072 51.6 1119.7
8      CV2    II       N1 14675 46.5  675.7
.
.
.
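The two filters give the same result here only because Nitrogen has exactly two levels. A minimal sketch of that point, using a hypothetical data frame `toy` with the same factor layout as dataA (no measured values) and an invented third level "N2":

```r
# Hypothetical stand-in for dataA: same factor layout, no measurements
toy <- data.frame(
  Genotype = rep(paste0("CV", 1:5), each = 6),
  Block    = rep(c("I", "II", "III"), times = 10),
  Nitrogen = rep(rep(c("N1", "N0"), each = 3), times = 5),
  stringsAsFactors = FALSE
)

# With two levels, "equal to N1" and "not equal to N0" select the same rows
a <- subset(toy, Nitrogen == "N1")
b <- subset(toy, Nitrogen != "N0")
print(identical(a, b))  # TRUE

# Add one row with an invented third level, "N2"
toy3 <- rbind(toy, data.frame(Genotype = "CV1", Block = "I", Nitrogen = "N2",
                              stringsAsFactors = FALSE))

# Now the two filters differ: != "N0" also keeps the N2 row
n_equal     <- nrow(subset(toy3, Nitrogen == "N1"))
n_not_equal <- nrow(subset(toy3, Nitrogen != "N0"))
print(c(n_equal, n_not_equal))  # 15 16
```

So `==` is the safer choice when more levels might be added later.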

How about filtering on several factors at once? For example, I want to select CV1 under N1.

CV1_N1= subset (dataA, Genotype=="CV1" & Nitrogen=="N1")

print(CV1_N1)
  Genotype Block Nitrogen    GN  AGW Yield
1      CV1     I       N1 21488 43.9 902.4
2      CV1    II       N1 23707 41.7 944.9
3      CV1   III       N1 20817 45.6 907.7

How about selecting one level and excluding another? For example, I want N1 but not CV1.

CV1X_N1= subset (dataA, Genotype!="CV1" & Nitrogen=="N1")

print(CV1X_N1)
   Genotype Block Nitrogen    GN  AGW  Yield
7       CV2     I       N1 22072 51.6 1119.7
8       CV2    II       N1 14675 46.5  675.7
9       CV2   III       N1 17180 48.6  800.1
13      CV3     I       N1 23126 42.0  930.0
14      CV3    II       N1 26307 42.3 1056.5
15      CV3   III       N1 24976 40.5  895.4
19      CV4     I       N1 17601 50.1  845.1
20      CV4    II       N1 18662 52.0  927.4
21      CV4   III       N1 16136 49.4  767.0
25      CV5     I       N1 26927 40.1 1032.0
26      CV5    II       N1 23564 40.5  906.0
27      CV5   III       N1 27546 36.7  961.4

How about selecting two levels within the same factor? For example, I want to select both CV1 and CV3. So, I used the code below.

CV1_CV3= subset (dataA, Genotype=="CV1" & Genotype=="CV3")

print(CV1_CV3)
[1] Genotype Block    Nitrogen GN       AGW      Yield   
<0 rows> (or 0-length row.names)

But no rows are selected. This is because & keeps a row only when both conditions are true for that same row, and no row can have Genotype equal to both CV1 and CV3 at the same time.

We can solve this problem using the | (OR) operator, which keeps a row when at least one condition is true.
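The row-by-row logic can be sketched with a short, made-up vector (the name `genotype` and its three values are only for illustration):

```r
# Made-up vector of genotype labels, one per row
genotype <- c("CV1", "CV2", "CV3")

# Each comparison returns one TRUE/FALSE per row
is_cv1 <- genotype == "CV1"   # TRUE  FALSE FALSE
is_cv3 <- genotype == "CV3"   # FALSE FALSE TRUE

# & is TRUE only where both comparisons are TRUE in the same position
both <- is_cv1 & is_cv3
print(both)       # FALSE FALSE FALSE
print(any(both))  # FALSE, so subset() keeps no rows
```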

CV1_CV3= subset (dataA,Genotype=="CV1" | Genotype=="CV3")

print(CV1_CV3)
   Genotype Block Nitrogen    GN  AGW  Yield
1       CV1     I       N1 21488 43.9  902.4
2       CV1    II       N1 23707 41.7  944.9
3       CV1   III       N1 20817 45.6  907.7
4       CV1     I       N0 13570 47.7  645.0
5       CV1    II       N0  8593 47.6  393.1
6       CV1   III       N0  8588 45.5  389.8
13      CV3     I       N1 23126 42.0  930.0
14      CV3    II       N1 26307 42.3 1056.5
15      CV3   III       N1 24976 40.5  895.4
16      CV3     I       N0 11967 41.8  480.3
17      CV3    II       N0 13600 45.5  593.0
18      CV3   III       N0  9464 42.1  382.2

Alternatively, we can exclude the other three genotypes.

CV1_CV3= subset (dataA, Genotype!="CV2" & Genotype!="CV4" & Genotype!="CV5")
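When several levels are involved, base R's `%in%` operator is a more compact way to write the same filters. This is an alternative to the code above, sketched on a hypothetical data frame `toy` that mimics the factor layout of dataA without the measured values.

```r
# Hypothetical stand-in for dataA: same factor layout, no measurements
toy <- data.frame(
  Genotype = rep(paste0("CV", 1:5), each = 6),
  Block    = rep(c("I", "II", "III"), times = 10),
  Nitrogen = rep(rep(c("N1", "N0"), each = 3), times = 5),
  stringsAsFactors = FALSE
)

# Version with |, as in the text
by_or <- subset(toy, Genotype == "CV1" | Genotype == "CV3")

# Same selection with %in%: keep rows whose Genotype is in the given set
by_in <- subset(toy, Genotype %in% c("CV1", "CV3"))

# Same selection by exclusion: drop rows whose Genotype is in the given set
by_exclusion <- subset(toy, !(Genotype %in% c("CV2", "CV4", "CV5")))

print(nrow(by_in))                      # 12
print(identical(by_or, by_in))          # TRUE
print(identical(by_or, by_exclusion))   # TRUE
```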

How about selecting two levels within the same factor, plus a level of another factor? For example, I want to select both CV1 and CV3, and then select N1. The parentheses matter here: they make sure the | condition is evaluated before &.

CV1_CV3_N1= subset (dataA, (Genotype=="CV1" | Genotype=="CV3") & Nitrogen=="N1")

print(CV1_CV3_N1)
   Genotype Block Nitrogen    GN  AGW  Yield
1       CV1     I       N1 21488 43.9  902.4
2       CV1    II       N1 23707 41.7  944.9
3       CV1   III       N1 20817 45.6  907.7
13      CV3     I       N1 23126 42.0  930.0
14      CV3    II       N1 26307 42.3 1056.5
15      CV3   III       N1 24976 40.5  895.4

There is no single correct way to do this. We can also simply use the code below.

N1= subset (dataA, Nitrogen=="N1") 
CV1_CV3_N1= subset(N1, Genotype=="CV1" | Genotype=="CV3")

First, we select N1, and then, within N1, we select both CV1 and CV3. We can shorten the code as below.

CV1_CV3_N1= subset(subset (dataA, Nitrogen=="N1"), Genotype=="CV1" | Genotype=="CV3")

In summary, these three pieces of code all give the same result: CV1 and CV3 under N1.

#1
CV1_CV3_N1= subset (dataA, (Genotype=="CV1" | Genotype=="CV3") & Nitrogen=="N1")

#2
N1= subset (dataA, Nitrogen=="N1")
CV1_CV3_N1= subset(N1, Genotype=="CV1" | Genotype=="CV3")

#3
CV1_CV3_N1= subset(subset(dataA, Nitrogen=="N1"), Genotype=="CV1" | Genotype=="CV3")
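A quick way to confirm that the three versions return the same rows is to compare the results directly. The sketch below assumes a hypothetical data frame `toy` with the same factor layout as dataA (no measured values); the names `v1`, `v2` and `v3` are only for illustration.

```r
# Hypothetical stand-in for dataA: same factor layout, no measurements
toy <- data.frame(
  Genotype = rep(paste0("CV", 1:5), each = 6),
  Block    = rep(c("I", "II", "III"), times = 10),
  Nitrogen = rep(rep(c("N1", "N0"), each = 3), times = 5),
  stringsAsFactors = FALSE
)

# Version 1: one combined condition
v1 <- subset(toy, (Genotype == "CV1" | Genotype == "CV3") & Nitrogen == "N1")

# Version 2: two steps with an intermediate object
N1 <- subset(toy, Nitrogen == "N1")
v2 <- subset(N1, Genotype == "CV1" | Genotype == "CV3")

# Version 3: nested subset() calls
v3 <- subset(subset(toy, Nitrogen == "N1"), Genotype == "CV1" | Genotype == "CV3")

# all.equal() compares values, column names and row names
same_12 <- isTRUE(all.equal(v1, v2))
same_13 <- isTRUE(all.equal(v1, v3))
print(c(same_12, same_13))
print(v1)
```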

2. dplyr package

Now, let’s use the dplyr package.

if(!require(dplyr)) install.packages("dplyr")
library (dplyr)

Now, I’d like to select N1.

#1
dataB= dataA %>% filter (Nitrogen=="N1")
#2
dataB= dataA %>% filter (Nitrogen!="N0")

print (dataB)
   Genotype Block Nitrogen    GN  AGW  Yield
1       CV1     I       N1 21488 43.9  902.4
2       CV1    II       N1 23707 41.7  944.9
3       CV1   III       N1 20817 45.6  907.7
4       CV2     I       N1 22072 51.6 1119.7
5       CV2    II       N1 14675 46.5  675.7
6       CV2   III       N1 17180 48.6  800.1
7       CV3     I       N1 23126 42.0  930.0
8       CV3    II       N1 26307 42.3 1056.5
9       CV3   III       N1 24976 40.5  895.4
10      CV4     I       N1 17601 50.1  845.1
11      CV4    II       N1 18662 52.0  927.4
12      CV4   III       N1 16136 49.4  767.0
13      CV5     I       N1 26927 40.1 1032.0
14      CV5    II       N1 23564 40.5  906.0
15      CV5   III       N1 27546 36.7  961.4

How about selecting CV1 and N1? It’s similar to subset().

#subset()
CV1_N1= subset (dataA, Genotype=="CV1" & Nitrogen=="N1")

#dplyr
CV1_N1= dataA %>% filter (Genotype=="CV1" & Nitrogen!="N0")

How about selecting both CV1 and CV3?

#subset()
CV1_CV3= subset (dataA, Genotype=="CV1" | Genotype=="CV3")

#dplyr
CV1_CV3= dataA %>% filter(Genotype=="CV1"| Genotype=="CV3")

How about selecting both CV1 and CV3, and then N1?

#subset()
CV1_CV3_N1= subset (dataA, (Genotype=="CV1" | Genotype=="CV3") & Nitrogen=="N1")

#dplyr
CV1_CV3_N1= dataA %>% filter ((Genotype=="CV1" | Genotype=="CV3") & Nitrogen=="N1")
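In dplyr, conditions separated by commas inside filter() are combined with AND, so the same selection can be written without & and, together with `%in%`, without |. A minimal sketch, assuming dplyr is installed and using a hypothetical data frame `toy` in place of dataA:

```r
library(dplyr)

# Hypothetical stand-in for dataA: same factor layout, no measurements
toy <- data.frame(
  Genotype = rep(paste0("CV", 1:5), each = 6),
  Block    = rep(c("I", "II", "III"), times = 10),
  Nitrogen = rep(rep(c("N1", "N0"), each = 3), times = 5),
  stringsAsFactors = FALSE
)

# Version from the text: explicit | and &
with_and <- toy %>%
  filter((Genotype == "CV1" | Genotype == "CV3") & Nitrogen == "N1")

# Alternative: %in% for the genotypes, comma (AND) for the nitrogen level
with_comma <- toy %>%
  filter(Genotype %in% c("CV1", "CV3"), Nitrogen == "N1")

print(with_comma)
print(isTRUE(all.equal(with_and, with_comma)))
```

Note that filter() renumbers the rows from 1, while subset() keeps the original row numbers, as the printed outputs in this post show.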

Now, let’s export this data to Excel.

if(!require(writexl)) install.packages("writexl")
library (writexl)

write_xlsx(CV1_CV3_N1,"C:/Users/Usuari/Desktop/CV1_3_N1.xlsx")

We aim to develop open-source code for agronomy.

Last Updated: 01/03/2025
