[R package] Cook’s Distance Diagnostics and Outlier Detection (Feat. datacooks)

In my previous post, I explained how to calculate Cook’s Distance step by step, and noted that in R you can simply use the function cooks.distance()
.
However, this simple function only provides the Cook’s Distance values. In my previous post, I explained the formula for Cook’s Distance step by step, including how to compute residuals, leverage, and internal studentized residuals.

To make it easier to calculate these values, I recently developed an R package called datacooks()
.

□ datacooks()
Before installing datacooks(), please download Rtools (https://cran.r-project.org/bin/windows/Rtools), and install the following package.
if(!require(remotes)) install.packages("remotes") if (!requireNamespace("datacooks", quietly = TRUE)) { remotes::install_github("agronomy4future/datacooks", force= TRUE) } library(remotes) library(datacooks)
After installing datacooks()
, type ?datacooks
to view the code explanation.

Let’s practice using an actual dataset.
if(!require(readr)) install.packages("readr")
library(readr)
github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/refs/heads/main/fruit_size.csv"
df=data.frame(read_csv(url(github),show_col_types = FALSE))
set.seed(100)
print(df[sample(nrow(df),5),])
treatment block weight_g length_cm diameter_mm area_cm2
102 B III 39.4 9.0 30 58.35
112 B III 78.9 10.0 44 72.23
151 B II 57.5 10.5 50 110.89
4 A I 28.9 8.0 34 42.67
55 A I 92.2 10.5 65 93.56
.
.
.
This dataset contains fruit size traits such as weight, length, diameter, and area. First, I would like to fit a model between weight and area.
model= lm(area_cm2 ~ weight_g, data= df)
summary (model)
Call:
lm(formula = area_cm2 ~ weight_g, data = df)
Residuals:
Min 1Q Median 3Q Max
-46.066 -5.894 -2.390 2.197 69.957
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.01093 2.14532 8.395 2.73e-14 ***
weight_g 0.98325 0.04071 24.152 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 14.57 on 155 degrees of freedom
Multiple R-squared: 0.7901, Adjusted R-squared: 0.7887
F-statistic: 583.3 on 1 and 155 DF, p-value: < 2.2e-16
TThis linear regression is significant, with an R2 of 0.79. Let’s visualize the fitted model.
if(!require(ggplot2)) install.packages("ggplot2")
if(!require(ggpmisc)) install.packages("ggpmisc")
library(ggplot2)
library(ggpmisc)
Fig= ggplot (df, aes (x=weight_g, y=area_cm2)) +
geom_smooth(method=lm, level=0.95, se=FALSE, linetype=1,
linewidth=0.5, color="red", formula=y~x) +
geom_point(stroke=1, size=4, fill="grey", shape=21) +
# R-squared
stat_poly_eq(aes(label=paste(..rr.label.., sep= "~~~")),
label.x=0.9, label.y=0.6, rr.digits=3,
formula=y ~ x, parse=TRUE, size=5) +
scale_x_continuous(breaks = seq(0,200,50), limits = c(0,200)) +
scale_y_continuous(breaks = seq(0,200,50), limits = c(0,200)) +
labs(x="Fruit weight (g)", y=bquote("Fruit area ("~cm^2*")")) +
theme_classic(base_size=18, base_family="sans")+
theme(legend.position=c(0.89,0.13),
legend.title=element_text(size=11),
legend.key.size=unit(0.2,'cm'),
legend.key=element_rect(color=alpha("white",.05),
fill=alpha("white",.05)),
legend.text=element_text(size=11),
legend.background= element_rect(fill=alpha("white",.05)),
panel.grid.major= element_line(color="grey90", linetype="dashed"),
axis.line=element_line(linewidth=0.5, colour="black"))
Fig+windows(width=5.5, height=5)
ggsave("C:/Users/agron/Desktop/Coding_Output/Fig.jpg",
Fig, width=5.5*2.54, height=5*2.54, units="cm", dpi=300)

Some data points deviate markedly from the fitted line, so I would like to identify outliers mathematically using Cook’s Distance. With datacooks()
, you simply provide the model and a threshold; the function will optionally discard the outliers based on this threshold.
The common cutoff for defining outliers is 4/(n−p)
, where n
is the number of observations and p
is the number of parameters in the model. For example, in lm(area_cm2 ~ weight_g, data= df)
, p=2 (intercept + weight_g). If you want a stricter criterion for detecting outliers, you may choose a value smaller than 4.
df1= datacooks(model, threshold= 4, clean= FALSE)
set.seed(100)
print(df1[sample(nrow(df1),5),])
treatment block weight_g length_cm diameter_mm area_cm2 prediction residual
102 B III 39.4 9.0 30 58.35 56.75088 1.599123
112 B III 78.9 10.0 44 72.23 95.58915 -23.359147
151 B II 57.5 10.5 50 110.89 74.54765 36.342346
4 A I 28.9 8.0 34 42.67 46.42678 -3.756780
55 A I 92.2 10.5 65 93.56 108.66634 -15.106337
70 B I 140.4 12.5 50 168.83 156.05886 12.771141
98 B II 22.2 8.0 27 37.21 39.83902 -2.629022
leverage ISR CooksD category
102 0.006556083 0.1101403 4.002795e-05 normal
112 0.015726165 -1.6163482 2.087117e-02 normal
151 0.007732711 2.5045778 2.444231e-02 normal
4 0.008219032 -0.2589666 2.778833e-04 normal
55 0.024298731 -1.0498727 1.372492e-02 normal
70 0.078519191 0.9133190 3.553894e-02 outlier
98 0.010180246 -0.1814062 1.692297e-04 normal
When you run datacooks(model, threshold= 4, clean= FALSE)
and assign the result to a dataset such as df1
, the function automatically adds several diagnostic columns:
– prediction (the fitted values from the model),
– residuals (the distance between each point and the fitted line),
– leverage,
– internally studentized residuals, and
– Cook’s distance, which is computed from the leverage and studentized residuals.
If you call
datacooks(model)
without additional arguments, it uses the default settings ofthreshold = 4
andclean = FALSE
.
It also adds a category
column that labels each observation as either "outlier"
or "normal"
based on the model and chosen threshold. Let’s visualzie this dataset.
if(!require(ggplot2)) install.packages("ggplot2")
if(!require(ggpmisc)) install.packages("ggpmisc")
library(ggplot2)
library(ggpmisc)
Fig= ggplot (df1, aes (x=weight_g, y=area_cm2)) +
geom_smooth(method=lm, level=0.95, se=FALSE, linetype=1, linewidth=0.5, color="red", formula=y~x) +
geom_point(aes(fill=category, shape=category), stroke=1, size=4) +
# R-squared
stat_poly_eq(aes(label=paste(..rr.label.., sep= "~~~")),
label.x=0.9, label.y=0.6, rr.digits=3,
formula=y ~ x, parse=TRUE, size=5) +
scale_fill_manual(values= c ("grey","red")) +
scale_shape_manual(values= c (21,21)) +
geom_text(aes(family="serif", x=30, y=0, label= paste("outliers # =", sum(df1$category== "outlier"))), size=5, color="black") +
scale_x_continuous(breaks = seq(0,200,50), limits = c(0,200)) +
scale_y_continuous(breaks = seq(0,200,50), limits = c(0,200)) +
labs(x="Fruit weight (g)", y=bquote("Fruit area ("~cm^2*")")) +
theme_classic(base_size=18, base_family="sans")+
theme(legend.position=c(0.89,0.13),
legend.title=element_text(size=11),
legend.key.size=unit(0.2,'cm'),
legend.key=element_rect(color=alpha("white",.05),
fill=alpha("white",.05)),
legend.text=element_text(size=11),
legend.background= element_rect(fill=alpha("white",.05)),
panel.grid.major= element_line(color="grey90", linetype="dashed"),
axis.line=element_line(linewidth=0.5, colour="black"))
Fig+windows(width=5.5, height=5)
ggsave("C:/Users/agron/Desktop/Coding_Output/Fig.jpg",
Fig, width=5.5*2.54, height=5*2.54, units="cm", dpi=300)

datacooks()
identifies 10 data points as outliers. Let’s discard these outliers.
df2= datacooks(model, threshold= 4, clean= TRUE)
Fig= ggplot (df2, aes (x=weight_g, y=area_cm2)) +
geom_smooth(method=lm, level=0.95, se=FALSE, linetype=1, linewidth=0.5, color="red", formula=y~x) +
geom_point(aes(fill=category, shape=category), stroke=1, size=4) +
# R-squared
stat_poly_eq(aes(label=paste(..rr.label.., sep= "~~~")),
label.x=0.9, label.y=0.6, rr.digits=3,
formula=y ~ x, parse=TRUE, size=5) +
scale_fill_manual(values= c ("grey","red")) +
scale_shape_manual(values= c (21,21)) +
geom_text(aes(family="serif", x=30, y=0, label= paste("outliers # =", sum(df2$category== "outlier"))), size=5, color="black") +
scale_x_continuous(breaks = seq(0,200,50), limits = c(0,200)) +
scale_y_continuous(breaks = seq(0,200,50), limits = c(0,200)) +
labs(x="Fruit weight (g)", y=bquote("Fruit area ("~cm^2*")")) +
theme_classic(base_size=18, base_family="sans")+
theme(legend.position="none",
legend.title=element_text(size=11),
legend.key.size=unit(0.2,'cm'),
legend.key=element_rect(color=alpha("white",.05),
fill=alpha("white",.05)),
legend.text=element_text(size=11),
legend.background= element_rect(fill=alpha("white",.05)),
panel.grid.major= element_line(color="grey90", linetype="dashed"),
axis.line=element_line(linewidth=0.5, colour="black"))
Fig+windows(width=5.5, height=5)
ggsave("C:/Users/agron/Desktop/Coding_Output/Fig.jpg",
Fig, width=5.5*2.54, height=5*2.54, units="cm", dpi=300)

Now that all outliers have been removed, the R2 has increased. Let’s evaluate the model again using the df2
dataset.
model= lm(area_cm2 ~ weight_g, data= df)
summary (model)
Call:
lm(formula = area_cm2 ~ weight_g, data = df2)
Residuals:
Min 1Q Median 3Q Max
-24.843 -4.108 -1.045 2.632 37.734
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.45154 1.38810 10.41 <2e-16 ***
weight_g 1.04103 0.02908 35.80 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.559 on 145 degrees of freedom
Multiple R-squared: 0.8984, Adjusted R-squared: 0.8977
F-statistic: 1282 on 1 and 145 DF, p-value: < 2.2e-16


Let’s try different models.
df_clean= na.omit(df) # There are some missing values, so I deleted all missing values of the data.
model= lm(area_cm2 ~ diameter_mm, data= df_clean)
df3= datacooks(model, threshold= 4, clean= FALSE)
df4= datacooks(model, threshold= 4, clean= TRUE)

If some datapoints deviated from the fitting, and if you think it’s outliers, you can use datacooks()
, and easily discard outlies.
■ code summary: https://github.com/agronomy4future/r_code/blob/main/Cooks_Distance_Diagnostics_and_Outlier_Detection.ipynb

We aim to develop open-source code for agronomy ([email protected])
© 2022 – 2025 https://agronomy4future.com – All Rights Reserved.
Last Updated: 17/08/2025