[R package] Cook’s Distance Diagnostics and Outlier Detection (Feat. datacooks)

[R package] Cook’s Distance Diagnostics and Outlier Detection (Feat. datacooks)


In my previous post, I explained how to calculate Cook’s Distance step by step, and noted that in R you can simply use the function cooks.distance().

However, this simple function only provides the Cook’s Distance values. In my previous post, I explained the formula for Cook’s Distance step by step, including how to compute residuals, leverage, and internal studentized residuals.

To make it easier to calculate these values, I recently developed an R package called datacooks().

□ datacooks()

Before installing datacooks(), please download Rtools (https://cran.r-project.org/bin/windows/Rtools), and install the following package.

if(!require(remotes)) install.packages("remotes")
if (!requireNamespace("datacooks", quietly = TRUE)) {
  remotes::install_github("agronomy4future/datacooks", force= TRUE)
}
library(remotes)
library(datacooks)

After installing datacooks(), type ?datacooks to view the code explanation.


Let’s practice using an actual dataset.

if(!require(readr)) install.packages("readr")
library(readr)

github="https://raw.githubusercontent.com/agronomy4future/raw_data_practice/refs/heads/main/fruit_size.csv"
df=data.frame(read_csv(url(github),show_col_types = FALSE))
set.seed(100)

print(df[sample(nrow(df),5),])
    treatment block weight_g length_cm diameter_mm area_cm2
102         B   III     39.4       9.0          30    58.35
112         B   III     78.9      10.0          44    72.23
151         B    II     57.5      10.5          50   110.89
4           A     I     28.9       8.0          34    42.67
55          A     I     92.2      10.5          65    93.56
.
.
.

This dataset contains fruit size traits such as weight, length, diameter, and area. First, I would like to fit a model between weight and area.

model= lm(area_cm2 ~ weight_g, data= df)
summary (model)

Call:
lm(formula = area_cm2 ~ weight_g, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-46.066  -5.894  -2.390   2.197  69.957 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 18.01093    2.14532   8.395 2.73e-14 ***
weight_g     0.98325    0.04071  24.152  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.57 on 155 degrees of freedom
Multiple R-squared:  0.7901,	Adjusted R-squared:  0.7887 
F-statistic: 583.3 on 1 and 155 DF,  p-value: < 2.2e-16

TThis linear regression is significant, with an R2 of 0.79. Let’s visualize the fitted model.

if(!require(ggplot2)) install.packages("ggplot2")
if(!require(ggpmisc)) install.packages("ggpmisc")
library(ggplot2)
library(ggpmisc)

Fig= ggplot (df, aes (x=weight_g, y=area_cm2)) +
  geom_smooth(method=lm, level=0.95, se=FALSE, linetype=1, 
              linewidth=0.5, color="red", formula=y~x) +
  geom_point(stroke=1, size=4, fill="grey", shape=21) +
  
  # R-squared
  stat_poly_eq(aes(label=paste(..rr.label.., sep= "~~~")),
               label.x=0.9, label.y=0.6, rr.digits=3,
               formula=y ~ x, parse=TRUE, size=5) +
  
  scale_x_continuous(breaks = seq(0,200,50), limits = c(0,200)) +
  scale_y_continuous(breaks = seq(0,200,50), limits = c(0,200)) +
  labs(x="Fruit weight (g)", y=bquote("Fruit area ("~cm^2*")")) +
  theme_classic(base_size=18, base_family="sans")+
  theme(legend.position=c(0.89,0.13),
        legend.title=element_text(size=11),
        legend.key.size=unit(0.2,'cm'),
        legend.key=element_rect(color=alpha("white",.05), 
                                fill=alpha("white",.05)),
        legend.text=element_text(size=11),
        legend.background= element_rect(fill=alpha("white",.05)),
        panel.grid.major= element_line(color="grey90", linetype="dashed"),
        axis.line=element_line(linewidth=0.5, colour="black"))

Fig+windows(width=5.5, height=5)

ggsave("C:/Users/agron/Desktop/Coding_Output/Fig.jpg", 
       Fig, width=5.5*2.54, height=5*2.54, units="cm", dpi=300)

Some data points deviate markedly from the fitted line, so I would like to identify outliers mathematically using Cook’s Distance. With datacooks(), you simply provide the model and a threshold; the function will optionally discard the outliers based on this threshold.

The common cutoff for defining outliers is 4/(n−p), where n is the number of observations and p is the number of parameters in the model. For example, in lm(area_cm2 ~ weight_g, data= df), p=2 (intercept + weight_g). If you want a stricter criterion for detecting outliers, you may choose a value smaller than 4.

df1= datacooks(model, threshold= 4, clean= FALSE)

set.seed(100)
print(df1[sample(nrow(df1),5),])
    treatment block weight_g length_cm diameter_mm area_cm2 prediction   residual
102         B   III     39.4       9.0          30    58.35   56.75088   1.599123
112         B   III     78.9      10.0          44    72.23   95.58915 -23.359147
151         B    II     57.5      10.5          50   110.89   74.54765  36.342346
4           A     I     28.9       8.0          34    42.67   46.42678  -3.756780
55          A     I     92.2      10.5          65    93.56  108.66634 -15.106337
70          B     I    140.4      12.5          50   168.83  156.05886  12.771141
98          B    II     22.2       8.0          27    37.21   39.83902  -2.629022
       leverage        ISR       CooksD category
102 0.006556083  0.1101403 4.002795e-05   normal
112 0.015726165 -1.6163482 2.087117e-02   normal
151 0.007732711  2.5045778 2.444231e-02   normal
4   0.008219032 -0.2589666 2.778833e-04   normal
55  0.024298731 -1.0498727 1.372492e-02   normal
70  0.078519191  0.9133190 3.553894e-02  outlier
98  0.010180246 -0.1814062 1.692297e-04   normal

When you run datacooks(model, threshold= 4, clean= FALSE) and assign the result to a dataset such as df1, the function automatically adds several diagnostic columns:
– prediction (the fitted values from the model),
– residuals (the distance between each point and the fitted line),
– leverage,
– internally studentized residuals, and
– Cook’s distance, which is computed from the leverage and studentized residuals.

If you call datacooks(model) without additional arguments, it uses the default settings of threshold = 4 and clean = FALSE.

It also adds a category column that labels each observation as either "outlier" or "normal" based on the model and chosen threshold. Let’s visualzie this dataset.

if(!require(ggplot2)) install.packages("ggplot2")
if(!require(ggpmisc)) install.packages("ggpmisc")
library(ggplot2)
library(ggpmisc)

Fig= ggplot (df1, aes (x=weight_g, y=area_cm2)) +
    geom_smooth(method=lm, level=0.95, se=FALSE, linetype=1, linewidth=0.5, color="red", formula=y~x) +
    geom_point(aes(fill=category, shape=category), stroke=1, size=4) +

  # R-squared
  stat_poly_eq(aes(label=paste(..rr.label.., sep= "~~~")),
               label.x=0.9, label.y=0.6, rr.digits=3,
               formula=y ~ x, parse=TRUE, size=5) +

  scale_fill_manual(values= c ("grey","red")) + 
  scale_shape_manual(values= c (21,21)) +
  geom_text(aes(family="serif", x=30, y=0, label= paste("outliers # =", sum(df1$category== "outlier"))), size=5, color="black") +
  scale_x_continuous(breaks = seq(0,200,50), limits = c(0,200)) +
  scale_y_continuous(breaks = seq(0,200,50), limits = c(0,200)) +
  labs(x="Fruit weight (g)", y=bquote("Fruit area ("~cm^2*")")) +
  theme_classic(base_size=18, base_family="sans")+
  theme(legend.position=c(0.89,0.13),
        legend.title=element_text(size=11),
        legend.key.size=unit(0.2,'cm'),
        legend.key=element_rect(color=alpha("white",.05), 
                                fill=alpha("white",.05)),
        legend.text=element_text(size=11),
        legend.background= element_rect(fill=alpha("white",.05)),
        panel.grid.major= element_line(color="grey90", linetype="dashed"),
        axis.line=element_line(linewidth=0.5, colour="black"))

Fig+windows(width=5.5, height=5)

ggsave("C:/Users/agron/Desktop/Coding_Output/Fig.jpg", 
       Fig, width=5.5*2.54, height=5*2.54, units="cm", dpi=300)

datacooks() identifies 10 data points as outliers. Let’s discard these outliers.

df2= datacooks(model, threshold= 4, clean= TRUE)
Fig= ggplot (df2, aes (x=weight_g, y=area_cm2)) +
  geom_smooth(method=lm, level=0.95, se=FALSE, linetype=1, linewidth=0.5, color="red", formula=y~x) +
  geom_point(aes(fill=category, shape=category), stroke=1, size=4) +
  
  # R-squared
  stat_poly_eq(aes(label=paste(..rr.label.., sep= "~~~")),
               label.x=0.9, label.y=0.6, rr.digits=3,
               formula=y ~ x, parse=TRUE, size=5) +
  
  scale_fill_manual(values= c ("grey","red")) + 
  scale_shape_manual(values= c (21,21)) +
  geom_text(aes(family="serif", x=30, y=0, label= paste("outliers # =", sum(df2$category== "outlier"))), size=5, color="black") +
  scale_x_continuous(breaks = seq(0,200,50), limits = c(0,200)) +
  scale_y_continuous(breaks = seq(0,200,50), limits = c(0,200)) +
  labs(x="Fruit weight (g)", y=bquote("Fruit area ("~cm^2*")")) +
  theme_classic(base_size=18, base_family="sans")+
  theme(legend.position="none",
        legend.title=element_text(size=11),
        legend.key.size=unit(0.2,'cm'),
        legend.key=element_rect(color=alpha("white",.05), 
                                fill=alpha("white",.05)),
        legend.text=element_text(size=11),
        legend.background= element_rect(fill=alpha("white",.05)),
        panel.grid.major= element_line(color="grey90", linetype="dashed"),
        axis.line=element_line(linewidth=0.5, colour="black"))

Fig+windows(width=5.5, height=5)

ggsave("C:/Users/agron/Desktop/Coding_Output/Fig.jpg", 
       Fig, width=5.5*2.54, height=5*2.54, units="cm", dpi=300)

Now that all outliers have been removed, the R2 has increased. Let’s evaluate the model again using the df2 dataset.

model= lm(area_cm2 ~ weight_g, data= df)
summary (model)

Call:
lm(formula = area_cm2 ~ weight_g, data = df2)

Residuals:
    Min      1Q  Median      3Q     Max 
-24.843  -4.108  -1.045   2.632  37.734 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 14.45154    1.38810   10.41   <2e-16 ***
weight_g     1.04103    0.02908   35.80   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.559 on 145 degrees of freedom
Multiple R-squared:  0.8984,	Adjusted R-squared:  0.8977 
F-statistic:  1282 on 1 and 145 DF,  p-value: < 2.2e-16

Let’s try different models.

df_clean= na.omit(df) # There are some missing values, so I deleted all missing values of the data.

model= lm(area_cm2 ~ diameter_mm, data= df_clean)
df3= datacooks(model, threshold= 4, clean= FALSE)
df4= datacooks(model, threshold= 4, clean= TRUE)

If some datapoints deviated from the fitting, and if you think it’s outliers, you can use datacooks(), and easily discard outlies.

 code summary: https://github.com/agronomy4future/r_code/blob/main/Cooks_Distance_Diagnostics_and_Outlier_Detection.ipynb

We aim to develop open-source code for agronomy ([email protected])

© 2022 – 2025 https://agronomy4future.com – All Rights Reserved.

Last Updated: 17/08/2025

Comments are closed.