How to select/delete specific variables using R STUDIO?

□ How to select and delete specific columns using R STUDIO?

In my previous post, I explained how to select or delete specific columns. This time, I’ll explain how to select or delete specific variables, that is, the rows that hold particular values within a column. Once again, I’ll generate a new set of data.

Genotype=rep(c("CV1","CV2","CV3"), times=5)
Yield=c(20,25,28,35,25,26,34,57,36,44,29,36,41,25,29)
dataA=data.frame(Genotype,Yield)

head(dataA, 10)
   Genotype Yield
1       CV1    20
2       CV2    25
3       CV3    28
4       CV1    35
5       CV2    25
6       CV3    26
7       CV1    34
8       CV2    57
9       CV3    36
10      CV1    44
.
.
.

If I want to divide the data by genotype, I use the code below.

cv1=subset(dataA, Genotype=="CV1")
cv2=subset(dataA, Genotype=="CV2")
cv3=subset(dataA, Genotype=="CV3")
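
The three subset() calls above can also be collapsed into one step. This is a minimal sketch, assuming base R's split(), which returns a named list holding one data frame per genotype. The name byGenotype is only illustrative and is not used elsewhere in this post.

```r
# Rebuild the example data so this block runs on its own
Genotype=rep(c("CV1","CV2","CV3"), times=5)
Yield=c(20,25,28,35,25,26,34,57,36,44,29,36,41,25,29)
dataA=data.frame(Genotype,Yield)

# split() returns a named list: one data frame per value of Genotype
byGenotype=split(dataA, dataA$Genotype)

# The list is named after the genotypes
names(byGenotype)

# Pull one genotype out of the list; same rows as subset(dataA, Genotype=="CV1")
cv1=byGenotype$CV1
print(cv1)
```

This is handy when there are many genotypes, because no separate line per genotype is needed.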

But what if I simply want to delete all instances of the CV2 genotype? The code is below.

dataB= subset(dataA, Genotype!="CV2")

print(dataB)
   Genotype Yield
1       CV1    20
3       CV3    28
4       CV1    35
6       CV3    26
7       CV1    34
9       CV3    36
10      CV1    44
12      CV3    36
13      CV1    41
15      CV3    29

Alternatively, the code below is also a valid option.

dataC=subset(dataA, Genotype=="CV1" | Genotype=="CV3")
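
To confirm that both approaches keep exactly the same rows, the two results can be compared directly. This is a small check, assuming the genotype data created earlier (it is rebuilt here so the block runs on its own).

```r
# Rebuild the example data so this block runs on its own
Genotype=rep(c("CV1","CV2","CV3"), times=5)
Yield=c(20,25,28,35,25,26,34,57,36,44,29,36,41,25,29)
dataA=data.frame(Genotype,Yield)

# Approach 1: drop CV2
dataB=subset(dataA, Genotype!="CV2")
# Approach 2: keep CV1 or CV3
dataC=subset(dataA, Genotype=="CV1" | Genotype=="CV3")

# all.equal() reports TRUE when the two data frames match
sameResult=isTRUE(all.equal(dataB, dataC))
print(sameResult)
```

The two approaches only agree because the data contains just three genotypes. If a fourth genotype were present, the first approach would keep it and the second would drop it.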

How about deleting multiple values at once?

Country=c("Spain","Canada","USA","Korea","Netherlands","Denmark","France",
"UK","Japan","Germany")
Income=c("40k","50k","60k","45k","55k","70k","50k","55k","55k","50k")
dataA=data.frame(Country,Income)

print(dataA)
       Country Income
1        Spain    40k
2       Canada    50k
3          USA    60k
4        Korea    45k
5  Netherlands    55k
6      Denmark    70k
7       France    50k
8           UK    55k
9        Japan    55k
10     Germany    50k

Let’s assume that the data above lists the salaries of postdoctoral researchers by country. Upon examining the data, I found inaccuracies for Spain, France, and Japan, so I would like to delete those rows. We can chain conditions with & to delete several values at once: a row is kept only if its country is not Spain and not France and not Japan.

dataB=subset(dataA, Country!="Spain" & Country!="France" & Country!="Japan")

print(dataB)
       Country Income
2       Canada    50k
3          USA    60k
4        Korea    45k
5  Netherlands    55k
6      Denmark    70k
8           UK    55k
10     Germany    50k

Alternatively, the code below is also a valid option.

dataB=subset(dataA,!(Country %in% c("Spain","France","Japan")))

print(dataB)
       Country Income
2       Canada    50k
3          USA    60k
4        Korea    45k
5  Netherlands    55k
6      Denmark    70k
8           UK    55k
10     Germany    50k
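
The two versions give the same result because "not in the set {Spain, France, Japan}" means the same as "not Spain and not France and not Japan". This sketch shows it on the plain logical vectors that subset() works with. The names keepAnd and keepIn are only illustrative.

```r
Country=c("Spain","Canada","USA","Korea","Netherlands","Denmark","France",
"UK","Japan","Germany")

# TRUE for rows to keep: not Spain AND not France AND not Japan
keepAnd=Country!="Spain" & Country!="France" & Country!="Japan"

# The same condition written with %in% and a single negation
keepIn=!(Country %in% c("Spain","France","Japan"))

# Both logical vectors are element-by-element the same
print(identical(keepAnd, keepIn))

# Countries that survive the filter
print(Country[keepAnd])
```

The %in% form is usually easier to maintain when the list of values to delete grows.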

Or we can delete the rows using filter() from the dplyr package.

if(!require(dplyr)) install.packages("dplyr")
library (dplyr)

dataC= dataA %>%
  filter(!(Country %in% c("Spain","France","Japan")))

What is the difference between `&` and `|`?

I have a dataset that looks like the following. Let’s say these are the math and English scores of 8 students from different countries.

name=c("Jack","Kate","John","Jane","David","Min","Hyuk","Jisoo")
math=c(90,85,95,75,80,90,90,85)
eng=c(85,90,90,88,95,85,87,88)
country=c("USA","Spain","France","Germany","Netherlands", rep("Korea",3))
gender=c(rep(c("Male","Female"),times=4))
enroll=c(rep(c("Yes","No"),each=4))
grade=data.frame(name,math,eng,country,gender,enroll)

print(grade)
   name math eng     country gender enroll
1  Jack   90  85         USA   Male    Yes
2  Kate   85  90       Spain Female    Yes
3  John   95  90      France   Male    Yes
4  Jane   75  88     Germany Female    Yes
5 David   80  95 Netherlands   Male     No
6   Min   90  85       Korea Female     No
7  Hyuk   90  87       Korea   Male     No
8 Jisoo   85  88       Korea Female     No

Now, I would like to exclude David, people from Korea, and all male students. So, I used the code below.

dataA= subset (grade, name!="David" & country!="Korea" & gender!="Male")

print(dataA)
  name math eng country gender enroll
2 Kate   85  90   Spain Female    Yes
4 Jane   75  88 Germany Female    Yes

Now, I would like to include only Jack, David, and Jisoo. The code is below.

dataB= subset (grade, name=="Jack" | name=="David" | name=="Jisoo")

print(dataB)
   name math eng     country gender enroll
1  Jack   90  85         USA   Male    Yes
5 David   80  95 Netherlands   Male     No
8 Jisoo   85  88       Korea Female     No

Why did I use | and not &?

Think about this!!
The condition is checked one row at a time, and each row has only one name. A row whose name is Jack cannot also be David and Jisoo, so code like dataB=subset(grade, name=="Jack" & name=="David" & name=="Jisoo") is never TRUE for any row and returns an empty data frame. The | operator keeps a row when at least one of the conditions is TRUE, which is what lets us select multiple values in the same column.
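
The point can be checked by counting the rows each condition returns. This is a small sketch using a cut-down version of the grade data (only name and math). The object names noRows and allRows are only illustrative.

```r
name=c("Jack","Kate","John","Jane","David","Min","Hyuk","Jisoo")
math=c(90,85,95,75,80,90,90,85)
grade=data.frame(name,math)

# == combined with & : no row can carry three names at once, so nothing is selected
noRows=subset(grade, name=="Jack" & name=="David" & name=="Jisoo")
print(nrow(noRows))

# == combined with | : a row is kept if it matches any one of the names
dataB=subset(grade, name=="Jack" | name=="David" | name=="Jisoo")
print(nrow(dataB))

# != combined with | : every row differs from at least one name, so nothing is removed
allRows=subset(grade, name!="Jack" | name!="David" | name!="Jisoo")
print(nrow(allRows))
```

As a rule of thumb: to keep several values, combine == with |. To delete several values, combine != with &.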

We can also use filter() from the dplyr package.

if(!require(dplyr)) install.packages("dplyr")
library (dplyr)

grade1= grade %>%
  filter((name %in% c("Jack","David","Jisoo")))

print(grade1)
   name math eng     country gender enroll
1  Jack   90  85         USA   Male    Yes
2 David   80  95 Netherlands   Male     No
3 Jisoo   85  88       Korea Female     No
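
Notice that the filter() output is numbered 1, 2, 3, while the subset() output kept the original row numbers 1, 5, 8. The same selection can also be written in base R with logical indexing, and the row numbers can be reset by hand if needed. This is a sketch on a cut-down version of the grade data, not part of the workflow above.

```r
name=c("Jack","Kate","John","Jane","David","Min","Hyuk","Jisoo")
math=c(90,85,95,75,80,90,90,85)
eng=c(85,90,90,88,95,85,87,88)
grade=data.frame(name,math,eng)

# Logical indexing: keep the rows whose name is in the list, keep all columns
grade1=grade[grade$name %in% c("Jack","David","Jisoo"), ]

# The original row numbers are preserved at this point
originalRows=rownames(grade1)
print(originalRows)

# Setting the row names to NULL renumbers the rows from 1
rownames(grade1)=NULL
print(grade1)
```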
