In RStudio, how to exclude missing values (NA)?
I’ll create a small dataset.
Genotype= c("A", "B", "C", "D", "E")
Yield= c(100, 120, 130, NA, 110)
dataA= data.frame(Genotype,Yield)
dataA
  Genotype Yield
1        A   100
2        B   120
3        C   130
4        D    NA
5        E   110
The yield for genotype D is missing, so it is recorded as NA. Now I’ll calculate the mean yield across all genotypes.
mean(dataA$Yield)
[1] NA
As shown above, mean() returns NA because the data contain an NA. To obtain the mean yield, we need to exclude the NA. Using subset(), we can simply exclude genotype D.
dataB= subset (dataA, Genotype!="D")
mean(dataB$Yield)
[1] 115
A much simpler way is to pass the argument na.rm=TRUE, which drops NA values inside mean() and makes subset() unnecessary.
mean(dataA$Yield, na.rm=TRUE)
[1] 115
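
Excluding Genotype "D" by name only works when you already know which row is missing. A minimal sketch of a more general filter, using is.na() from base R on the same example data (rebuilt here so the block runs on its own; the names mean_subset, mean_narm and sum_narm are only illustrative):

```r
# Rebuild the example data so this block is self-contained
Genotype= c("A", "B", "C", "D", "E")
Yield= c(100, 120, 130, NA, 110)
dataA= data.frame(Genotype, Yield)

# Keep only rows where Yield is not NA, without naming the genotype
dataB= subset(dataA, !is.na(Yield))
mean_subset= mean(dataB$Yield)

# na.rm=TRUE is also accepted by other summary functions, e.g. sum()
mean_narm= mean(dataA$Yield, na.rm=TRUE)
sum_narm= sum(dataA$Yield, na.rm=TRUE)

print(c(mean_subset, mean_narm, sum_narm))
```

Both approaches should give the same mean, because they drop the same single row.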

When the dataset is small and there is only one variable, we can simply delete or ignore NA values. But what should we do when the dataset is large and contains several variables? Let’s load another dataset.
if(!require(readr)) install.packages("readr")
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/chlorophyll_contents_on_leaves.csv"
df= data.frame(read_csv(url(github), show_col_types=FALSE))
colnames(df)[4]=c("Days")
colnames(df)[5]=c("ch")
colnames(df)[6]=c("ch_se")
colnames(df)[7]=c("green")
colnames(df)[8]=c("green_se")
print(head(df,3))
       Location Genotype Treatment Days    ch ch_se green green_se
1 Northern area      CV1   Control   21 44.64  1.81  0.03     0.02
2 Northern area      CV1   Control   22 43.50  2.81  0.05     0.02
3 Northern area      CV1   Control   23 42.35  3.81  0.06     0.02
.
.
.
Now there are several independent variables (Location, Genotype, Treatment) as well as dependent variables. First, let’s check which levels each non-numeric variable contains.
sapply(df, function(x) if (!is.numeric(x)) list(UniqueValues= unique(x)))
$Location
$Location$UniqueValues
[1] "Northern area" "Southern area"
$Genotype
$Genotype$UniqueValues
[1] "CV1" "CV2"
$Treatment
$Treatment$UniqueValues
[1] "Control" "Stress_2" "Stress_1"
$Days
NULL
$ch
NULL
$ch_se
NULL
$green
NULL
$green_se
NULL
There are 2 locations, 2 genotypes, and 3 treatments, giving 12 treatment combinations in total. Over the 20 days after planting, chlorophyll content (ch) and loss of greenness on leaves (green) were measured, and the standard error of each measurement (ch_se and green_se) is also included. Let’s check for NA values.
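
Before counting NA values, the 12 combinations mentioned above can be enumerated. This is only a sketch: it rebuilds the factor levels from the printed output and uses expand.grid() from base R, so it counts possible combinations, not the ones actually observed in the file (combos and n_combos are illustrative names).

```r
# Factor levels copied from the unique values printed above
Location= c("Northern area", "Southern area")
Genotype= c("CV1", "CV2")
Treatment= c("Control", "Stress_1", "Stress_2")

# All possible Location x Genotype x Treatment combinations
combos= expand.grid(Location=Location, Genotype=Genotype, Treatment=Treatment)
n_combos= nrow(combos)
cat("Number of possible combinations:", n_combos, "\n")

# On the real data, the observed combinations could be counted with:
# nrow(unique(df[c("Location", "Genotype", "Treatment")]))
```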
1) How many data rows contain NA values?
num_rows_with_na= sum(!complete.cases(df))
cat("Number of rows with NA values:", num_rows_with_na)
or
num_rows_with_na= sum(rowSums(is.na(df)) > 0)
cat("Number of rows with NA values:", num_rows_with_na)
Number of rows with NA values: 31
In this data, 31 rows have NA values. Both approaches count a row as soon as at least one of its columns is NA, regardless of the values in the other columns.
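
To make the row-level logic concrete, here is a minimal sketch on a hypothetical three-row data frame (toy is not part of the dataset above). It uses the same complete.cases() and is.na() calls, plus which() to locate the affected rows.

```r
# Hypothetical toy data: the NA values sit in different columns
toy= data.frame(a= c(1, NA, 3),
                b= c("x", "y", NA),
                c= c(10, 20, 30))

# TRUE only for rows with no NA in any column
is_complete= complete.cases(toy)

# Positions of the rows that contain at least one NA
rows_with_na= which(!is_complete)

# Number of NA cells in each row
na_per_row= rowSums(is.na(toy))

print(is_complete)
print(rows_with_na)
print(na_per_row)
```

Rows 2 and 3 are each counted once, even though their NA values are in different columns.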

2) How many variables contain NA values?
Next, I want to see how many variables have NA values.
num_variables_with_na= sum(colSums(is.na(df)) > 0)
cat("Number of variables with NA values:", num_variables_with_na)
Number of variables with NA values: 2
There are two variables containing NA values. Let’s see what they are.
colSums(is.na(df))
Location Genotype Treatment Days ch ch_se green green_se
0 0 0 0 0 28 0 31
The standard errors of chlorophyll content (ch_se) and loss of greenness on leaves (green_se) contain NA values.
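
When a dataset has many columns, it can help to list only the affected column names and the share of missing values in each. A minimal sketch on a hypothetical four-row data frame (the values in toy are made up for illustration; only the column names mirror the dataset above):

```r
# Hypothetical toy data with NA values in two of three columns
toy= data.frame(ch= c(44.6, 43.5, 42.4, 41.2),
                ch_se= c(1.8, NA, 3.8, 2.1),
                green_se= c(0.02, NA, NA, 0.02))

# Names of the columns that contain at least one NA
na_cols= names(toy)[colSums(is.na(toy)) > 0]

# Percentage of NA values per column (mean of TRUE/FALSE is a proportion)
na_pct= round(colMeans(is.na(toy)) * 100, 1)

print(na_cols)
print(na_pct)
```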

3) How to discard NA values?
I’ll delete all rows where Loss_of_greenness_on_leaves_Std_error (green_se) is NA, because this variable contains the most NA values.
1. Base R functions
df_na_trit= df[complete.cases(df$green_se),]
or
df_na_trit= df[!is.na(df$green_se), ]
colSums(is.na(df_na_trit))
Location Genotype Treatment Days ch ch_se green green_se
0 0 0 0 0 0 0 0
2. Using dplyr
if(!require(dplyr)) install.packages("dplyr")
library(dplyr)
df_na_trit= df %>%
filter(!is.na(green_se))
colSums(is.na(df_na_trit))
Location Genotype Treatment Days ch ch_se green green_se
0 0 0 0 0 0 0 0
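Now none of the variables contain NA values, even though we filtered only on green_se. This is because every row with an NA in ch_se also has an NA in green_se: green_se has 31 NA values and the dataset has 31 rows with NA in total, so the 28 rows with NA in ch_se are all among them. Removing NA rows for one variable therefore also removes the NA values of any other variable whose missing rows overlap with it.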
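
Whether filtering on one column also clears another depends on how the missing rows overlap, so it is worth checking instead of assuming. A minimal sketch on a hypothetical five-row data frame (toy and its values are made up; na.omit() is the base R function that drops rows with NA in any column):

```r
# Hypothetical toy data: every NA in ch_se is also an NA in green_se
toy= data.frame(id= 1:5,
                ch_se= c(1.8, NA, 3.8, NA, 2.1),
                green_se= c(0.02, NA, NA, NA, 0.02))

# Is every row with NA in ch_se also NA in green_se?
nested= all(is.na(toy$green_se[is.na(toy$ch_se)]))

# Filter on green_se only
by_green= toy[!is.na(toy$green_se), ]

# Drop rows with NA in any column
by_any= na.omit(toy)

print(nested)
print(nrow(by_green))
print(nrow(by_any))
```

When nested is TRUE, the two filters keep the same rows. If it were FALSE, filtering on green_se alone would leave some NA values in ch_se.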

We aim to develop open-source code for agronomy
© 2022 – 2025 https://agronomy4future.com – All Rights Reserved.
Last Updated: 17/May/2024