□ How to select and delete specific columns using R STUDIO?
In my previous post, I explained how to select or delete specific columns. This time, I’ll explain how to select or delete specific values within a column, that is, the rows that contain them. Once again, I’ll generate a new set of data.
Genotype=rep(c("CV1","CV2","CV3"), times=5)
Yield=c(20,25,28,35,25,26,34,57,36,44,29,36,41,25,29)
dataA=data.frame(Genotype,Yield)
head(dataA, 10)
Genotype Yield
1 CV1 20
2 CV2 25
3 CV3 28
4 CV1 35
5 CV2 25
6 CV3 26
7 CV1 34
8 CV2 57
9 CV3 36
10 CV1 44
. . .
If I want to divide the data by genotype, I use the code below.
cv1=subset(dataA, Genotype=="CV1")
cv2=subset(dataA, Genotype=="CV2")
cv3=subset(dataA, Genotype=="CV3")
But what if I simply want to delete all instances of the CV2 genotype? The code is below.
dataB= subset(dataA, Genotype!="CV2")
print(dataB)
Genotype Yield
1 CV1 20
3 CV3 28
4 CV1 35
6 CV3 26
7 CV1 34
9 CV3 36
10 CV1 44
12 CV3 36
13 CV1 41
15 CV3 29
Alternatively, the code below is also a valid option.
dataC=subset(dataA, Genotype=="CV1" | Genotype=="CV3")
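As a quick sanity check, the sketch below rebuilds the data and compares the two approaches. The identical() and nrow() calls and the variable name same are my additions for illustration; they are not part of the original workflow.

```r
# Rebuild the example data
Genotype=rep(c("CV1","CV2","CV3"), times=5)
Yield=c(20,25,28,35,25,26,34,57,36,44,29,36,41,25,29)
dataA=data.frame(Genotype,Yield)

# Approach 1: exclude CV2
dataB=subset(dataA, Genotype!="CV2")

# Approach 2: keep CV1 or CV3
dataC=subset(dataA, Genotype=="CV1" | Genotype=="CV3")

# Compare the two results and count the remaining rows
same=identical(dataB, dataC)
print(same)
print(nrow(dataB))
```

With only three genotypes in the data, excluding CV2 and keeping CV1 or CV3 describe the same set of rows, so either form can be used.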

How about deleting multiple values at once?
Country=c("Spain","Canada","USA","Korea","Netherlands","Denmark","France",
"UK","Japan","Germany")
Income=c("40k","50k","60k","45k","55k","70k","50k","55k","55k","50k")
dataA=data.frame(Country,Income)
print(dataA)
Country Income
1 Spain 40k
2 Canada 50k
3 USA 60k
4 Korea 45k
5 Netherlands 55k
6 Denmark 70k
7 France 50k
8 UK 55k
9 Japan 55k
10 Germany 50k
Let’s assume that the data above lists the salaries of postdoctoral researchers by country. However, upon examining the data, I found inaccuracies for Spain, France, and Japan, so I would like to delete those rows. We can chain conditions with the & operator to delete multiple values at once.
dataB=subset(dataA, Country!="Spain" & Country!="France" & Country!="Japan")
print(dataB)
Country Income
2 Canada 50k
3 USA 60k
4 Korea 45k
5 Netherlands 55k
6 Denmark 70k
8 UK 55k
10 Germany 50k
Alternatively, the code below is also a valid option.
dataB=subset(dataA,!(Country %in% c("Spain","France","Japan")))
print(dataB)
Country Income
2 Canada 50k
3 USA 60k
4 Korea 45k
5 Netherlands 55k
6 Denmark 70k
8 UK 55k
10 Germany 50k
Or we can delete them using filter() from the dplyr package.
if(!require(dplyr)) install.packages("dplyr")
library (dplyr)
dataC= dataA %>%
filter(!(Country %in% c("Spain","France","Japan")))
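To see how the & chain and the %in% form relate, here is a minimal base R sketch. The names drop_list, byAnd, and byIn are hypothetical, chosen only for this illustration.

```r
# Rebuild the example data
Country=c("Spain","Canada","USA","Korea","Netherlands","Denmark","France",
          "UK","Japan","Germany")
Income=c("40k","50k","60k","45k","55k","70k","50k","55k","55k","50k")
dataA=data.frame(Country,Income)

# Hypothetical vector holding the countries to remove
drop_list=c("Spain","France","Japan")

# Chain of != conditions joined with &
byAnd=subset(dataA, Country!="Spain" & Country!="France" & Country!="Japan")

# Negated %in% against the vector
byIn=subset(dataA, !(Country %in% drop_list))

# Compare the two results
print(identical(byAnd, byIn))
print(byIn$Country)
```

Keeping the values in a vector makes the list easier to extend later. One difference to keep in mind: if Country contained a missing value (NA), the & version would drop that row while the %in% version would keep it, because NA != "Spain" evaluates to NA and %in% never returns NA.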

What is the difference between & and |?
I have a dataset that looks like the following. Let’s say these are the math and English scores of eight students from different countries.
name=c("Jack","Kate","John","Jane","David","Min","Hyuk","Jisoo")
math=c(90,85,95,75,80,90,90,85)
eng=c(85,90,90,88,95,85,87,88)
country=c("USA","Spain","France","Germany","Netherlands", rep("Korea",3))
gender=c(rep(c("Male","Female"),times=4))
enroll=c(rep(c("Yes","No"),each=4))
grade=data.frame(name,math,eng,country,gender,enroll)
print(grade)
name math eng country gender enroll
1 Jack 90 85 USA Male Yes
2 Kate 85 90 Spain Female Yes
3 John 95 90 France Male Yes
4 Jane 75 88 Germany Female Yes
5 David 80 95 Netherlands Male No
6 Min 90 85 Korea Female No
7 Hyuk 90 87 Korea Male No
8 Jisoo 85 88 Korea Female No
Now, I would like to exclude David, people from Korea, and all male students. So, I used the code below.
dataA= subset (grade, name!="David" & country!="Korea" & gender!="Male")
print(dataA)
name math eng country gender enroll
2 Kate 85 90 Spain Female Yes
4 Jane 75 88 Germany Female Yes
Now, I would like to include only Jack, David, and Jisoo. The code is below.
dataB= subset (grade, name=="Jack" | name=="David" | name=="Jisoo")
print(dataB)
name math eng country gender enroll
1 Jack 90 85 USA Male Yes
5 David 80 95 Netherlands Male No
8 Jisoo 85 88 Korea Female No
Why did I use | and not &?
Think about it!
Each row holds only one name, so no row can be Jack, David, and Jisoo at the same time. Code like dataB=subset(grade, name=="Jack" & name=="David" & name=="Jisoo") is therefore FALSE for every row and returns an empty data frame. The | operator keeps a row when at least one of the conditions is TRUE, so it lets us select multiple values from the same column.
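The difference is easy to confirm by running both versions. This is a minimal sketch on a reduced copy of the grade data (only name, math, and eng); withAnd and withOr are hypothetical names used for illustration.

```r
# Reduced copy of the grade data
name=c("Jack","Kate","John","Jane","David","Min","Hyuk","Jisoo")
math=c(90,85,95,75,80,90,90,85)
eng=c(85,90,90,88,95,85,87,88)
grade=data.frame(name,math,eng)

# & requires every condition to hold in the same row
withAnd=subset(grade, name=="Jack" & name=="David" & name=="Jisoo")

# | requires at least one condition to hold in a row
withOr=subset(grade, name=="Jack" | name=="David" | name=="Jisoo")

print(nrow(withAnd))  # 0 rows: no name equals three values at once
print(nrow(withOr))   # 3 rows: Jack, David, Jisoo
```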
We can also use filter() from the dplyr package.
if(!require(dplyr)) install.packages("dplyr")
library (dplyr)
grade1= grade %>%
filter((name %in% c("Jack","David","Jisoo")))
print(grade1)
name math eng country gender enroll
1 Jack 90 85 USA Male Yes
2 David 80 95 Netherlands Male No
3 Jisoo 85 88 Korea Female No
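Notice that the filter() result is numbered 1, 2, 3, while the earlier subset() result kept the original row numbers 1, 5, 8. If you want the same renumbering in base R, one option is to reset the row names, as in this sketch. It assumes a reduced copy of the grade data, and before and after are hypothetical names.

```r
# Reduced copy of the grade data
name=c("Jack","Kate","John","Jane","David","Min","Hyuk","Jisoo")
math=c(90,85,95,75,80,90,90,85)
eng=c(85,90,90,88,95,85,87,88)
grade=data.frame(name,math,eng)

# subset() keeps the original row names
dataB=subset(grade, name %in% c("Jack","David","Jisoo"))
before=rownames(dataB)
print(before)  # "1" "5" "8"

# Setting the row names to NULL renumbers them from 1
rownames(dataB)=NULL
after=rownames(dataB)
print(after)   # "1" "2" "3"
```

Whether to renumber is a matter of preference: the original row numbers are handy for tracing rows back to the source data, while sequential numbers are tidier for reporting.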
