In R Studio, how to exclude missing values (NA)?

I’ll create a small dataset.

Genotype= c("A", "B", "C", "D", "E")
Yield= c(100, 120, 130, NA, 110)
dataA= data.frame(Genotype,Yield)

dataA
  Genotype Yield
1        A   100
2        B   120
3        C   130
4        D    NA
5        E   110

The yield for genotype D is missing, so it is recorded as NA. Now I’ll calculate the mean yield across all genotypes.

mean(dataA$Yield)
[1] NA

As you can see above, we can’t calculate the mean due to the NA: by default, mean() returns NA whenever the input contains a missing value. To obtain the mean yield, we should exclude the NA. Using subset(), we can simply exclude Genotype D:

dataB= subset (dataA, Genotype!="D")
mean(dataB$Yield)
[1] 115
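
Hard-coding the genotype name works here, but it requires knowing in advance which row is missing. A minimal sketch of a more general filter, assuming the same small dataset as above (the name dataB_general is only illustrative), tests the Yield column itself with is.na():

```r
# Recreate the small example data so this block runs on its own
Genotype= c("A", "B", "C", "D", "E")
Yield= c(100, 120, 130, NA, 110)
dataA= data.frame(Genotype, Yield)

# Keep only rows whose Yield is not NA, whichever genotype they belong to
dataB_general= subset(dataA, !is.na(Yield))

# Mean of the remaining four yields
mean_yield= mean(dataB_general$Yield)
print(mean_yield)
```

This gives the same result as excluding Genotype D by name, but it keeps working if a different genotype, or more than one, has a missing yield.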

However, a much simpler way is to use the argument na.rm=TRUE, which tells mean() to drop NA values before calculating, so there is no need for subset().

mean(dataA$Yield, na.rm=TRUE)
[1] 115
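
The na.rm argument is not specific to mean(). As a short sketch using the same yield values (the object names below are only illustrative), other base R summary functions such as sum(), sd() and max() accept it too, and na.omit() is an alternative that removes NA from the vector first:

```r
Yield= c(100, 120, 130, NA, 110)

# Without na.rm, a single NA propagates to the result
sum_with_na= sum(Yield)

# With na.rm=TRUE, NA is dropped before the calculation
total_yield= sum(Yield, na.rm=TRUE)
sd_yield= sd(Yield, na.rm=TRUE)
max_yield= max(Yield, na.rm=TRUE)

# na.omit() strips NA from the vector, so no na.rm is needed afterwards
mean_omit= mean(na.omit(Yield))

print(c(total= total_yield, sd= sd_yield, max= max_yield, mean= mean_omit))
```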

When the dataset is small and there is only one variable, we can simply delete or ignore NA values. However, what should we do when the dataset is large and contains several variables? Let’s load another dataset.

if(!require(readr)) install.packages("readr")
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/main/chlorophyll_contents_on_leaves.csv"
df= data.frame(read_csv(url(github), show_col_types=FALSE))
colnames(df)[4]=c("Days")
colnames(df)[5]=c("ch")
colnames(df)[6]=c("ch_se")
colnames(df)[7]=c("green")
colnames(df)[8]=c("green_se")

print(head(df,3))
       Location Genotype Treatment Days    ch ch_se green green_se
1 Northern area      CV1   Control   21 44.64  1.81  0.03     0.02
2 Northern area      CV1   Control   22 43.50  2.81  0.05     0.02
3 Northern area      CV1   Control   23 42.35  3.81  0.06     0.02
.
.
.

Now, there are several independent variables (Location, Genotype, Treatment), as well as dependent variables. First, let’s check which levels each non-numeric variable contains.

sapply(df, function(x) if (!is.numeric(x)) list(UniqueValues= unique(x)))

$Location
$Location$UniqueValues
[1] "Northern area" "Southern area"

$Genotype
$Genotype$UniqueValues
[1] "CV1" "CV2"

$Treatment
$Treatment$UniqueValues
[1] "Control"  "Stress_2" "Stress_1"

$Days
NULL

$ch
NULL

$ch_se
NULL

$green
NULL

$green_se
NULL

There are 2 locations, 2 genotypes, and 3 treatments, giving a total of 12 treatment combinations. Over the 20 days after planting, chlorophyll content (ch) and loss of greenness on leaves (green) were measured, and the standard error of each measurement (ch_se and green_se) is also included. Let’s check for NA values.

1) How many data rows contain NA values?

num_rows_with_na= sum(!complete.cases(df))
cat("Number of rows with NA values:", num_rows_with_na)

or

num_rows_with_na= sum(rowSums(is.na(df)) > 0)
cat("Number of rows with NA values:", num_rows_with_na)

Number of rows with NA values: 31

In this data, 31 rows contain at least one NA value. Both approaches count a row once if any of its columns is NA, regardless of how many columns are missing or what the other columns contain.
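
To make the difference between counting rows and counting individual NA cells concrete, here is a small sketch on a made-up data frame (toy and its values are hypothetical, not taken from the chlorophyll dataset):

```r
# Hypothetical 4-row example: row 2 has two NAs, rows 3 and 4 have one each
toy= data.frame(
  ch_se= c(1.8, NA, 2.1, NA),
  green_se= c(0.02, NA, NA, 0.03)
)

# complete.cases() is TRUE only for rows with no NA in any column
complete.cases(toy)

# Rows with at least one NA: row 2 is counted once, not twice
rows_with_na= sum(!complete.cases(toy))

# Total number of NA cells across the whole data frame
cells_with_na= sum(is.na(toy))

cat("Rows with NA:", rows_with_na, " NA cells:", cells_with_na, "\n")
```

Here three rows contain NA values but there are four NA cells in total, because the second row is missing in both columns.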

2) How many variables contain NA values?

Next, I want to see how many variables have NA values.

num_variables_with_na= sum(colSums(is.na(df)) > 0)
cat("Number of variables with NA values:", num_variables_with_na)

Number of variables with NA values: 2

There are two variables containing NA values. Let’s see what they are.

colSums(is.na(df))
Location   Genotype  Treatment   Days   ch   ch_se   green   green_se 
      0          0           0      0    0      28       0         31

The standard errors for chlorophyll content (ch_se) and for loss of greenness on leaves (green_se) contain NA values: 28 and 31, respectively.
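
When a data frame has many columns, it can be easier to list only the affected columns and to look at the share of missing values rather than raw counts. A minimal sketch on a made-up data frame (toy and its values are hypothetical) could look like this:

```r
# Hypothetical example: ch_se has 1 NA, green_se has 2 NAs, ch has none
toy= data.frame(
  ch= c(44.6, 43.5, 42.4, 41.2),
  ch_se= c(1.8, NA, 2.8, 3.8),
  green_se= c(0.02, NA, NA, 0.03)
)

# Names of the columns that contain at least one NA
na_columns= names(which(colSums(is.na(toy)) > 0))
print(na_columns)

# Percentage of NA values per column (is.na() gives TRUE/FALSE, so the
# column mean is the proportion of missing values)
na_percent= colMeans(is.na(toy)) * 100
print(na_percent)
```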

3) How to discard `NA` values?

I’ll remove every row in which Loss_of_greenness_on_leaves_Std_error (green_se) is NA, because this column contains the most NA values.

1. Base R functions

df_na_trit= df[complete.cases(df$green_se),]

or

df_na_trit= df[!is.na(df$green_se), ]

colSums(is.na(df_na_trit))
Location   Genotype  Treatment   Days   ch   ch_se   green   green_se 
      0          0           0      0    0       0       0          0

2. Using dplyr

if(!require(dplyr)) install.packages("dplyr")
library(dplyr)

df_na_trit= df %>% 
            filter(!is.na(green_se))

colSums(is.na(df_na_trit))
Location   Genotype  Treatment   Days   ch   ch_se   green   green_se 
      0          0           0      0    0       0       0          0

Now, none of the variables contain any NA values. Filtering removes whole rows, not individual cells, so every row in which green_se was NA was dropped together with its values in all other columns. In this dataset, all 28 rows where ch_se was NA were among those 31 rows, so the NA values in ch_se disappeared as well.
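
This outcome depends on how the NA values are arranged in this particular dataset. As a sketch on a made-up data frame (toy and its values are hypothetical), filtering on one column can leave NA values in another, in which case base R's na.omit() removes every row that has an NA in any column:

```r
# Hypothetical example: row 4 has an NA in ch_se but not in green_se
toy= data.frame(
  Days= c(21, 22, 23, 24),
  ch_se= c(1.8, NA, 2.8, NA),
  green_se= c(0.02, NA, NA, 0.03)
)

# Filtering on green_se only keeps rows 1 and 4
step1= toy[!is.na(toy$green_se), ]

# ch_se still contains one NA (row 4)
colSums(is.na(step1))

# na.omit() drops rows with an NA in any column, leaving only row 1
all_clean= na.omit(toy)
colSums(is.na(all_clean))
```

So it is worth re-running colSums(is.na()) after filtering, as done above, rather than assuming the other columns are clean.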

We aim to develop open-source code for agronomy

Last Updated: 17/May/2024
