DSS Blog

"EDA and Beyond: Understanding, Detecting, and Interpreting Outliers"

Written by Meetu Malhotra | Jan 21, 2025 2:26:49 PM

Imagine a party where everyone is dressed casually except for one person in a superhero outfit. You notice that person right away because they stand out from the crowd. In data analysis, that superhero guest is what we call an outlier.

Outliers could be mistakes, undiscovered treasures, or simply unusual but valid observations. Exploratory Data Analysis (EDA) is the detective's toolkit we use to find them. It is the first stage of any data analysis project and lays the foundation for reliable modeling.

EDA allows us to examine and analyze data to identify trends, relationships between variables, and, yes, the odd ones out. But how can we spot these anomalies? What are they trying to tell us? Why should we care?

Let us examine outliers in more detail, including their types, uses, and potential to revolutionize our understanding of and approach to data. Whether you are an experienced analyst or an inquisitive student, discovering those superhero moments in data is exciting.

Definition of an outlier:  

Outliers are data points that differ markedly from the rest of the data. They may signal inaccuracies, system instability, or unexpected insights. They can arise for several reasons, such as incorrect data entry, application malfunction, true data variability, or fraudulent behavior. Because they can distort results, it is critical to identify them and deal with them appropriately.

Types of outliers:  

 

Application of outlier detection:

Outlier detection has a wide range of uses across several fields. It is used in banking [1] to identify fraudulent credit card transactions, and in healthcare [2] to find unusual patterns in patient data to help with monitoring and diagnosis. It contributes to the detection of spam and fake accounts on social media [3], and it aids in the analysis of user buying patterns in the e-commerce industry to enhance recommendation systems [4]. Furthermore, it strengthens overall defense against attacks by assisting in the detection of malware [5] and other security risks.

Handling Outliers:  

There are multiple ways to detect and handle outliers: visualization techniques (scatter plots, box plots, time series plots), machine learning methods such as Isolation Forests for multivariate data, statistical methods such as seasonal decomposition for time series data, and the Z-score method or comparable approaches.

In this article, we will focus on the Z-score approach for outlier detection.
The Z-score approach is a popular and straightforward way to find outliers. It computes the mean and standard deviation of the data, then flags observations that deviate from the mean by more than a specified number of standard deviations.
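
For reference, the standard Z-score of a single observation can be written as follows. The symbols are the usual textbook notation rather than anything taken from the article's own figure: x is the observation, μ is the mean of the data, and σ is its standard deviation.

$$
z = \frac{x - \mu}{\sigma}
$$

A Z-score of 2, for example, means the observation lies two standard deviations above the mean, while a Z-score of -2 means it lies two standard deviations below it.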


  

An observation whose absolute Z-score is greater than three (|Z-score| > 3) is commonly treated as an outlier. However, the exact threshold may change depending on the needs of the business.
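
As a minimal sketch of this rule, the function below flags values whose absolute Z-score exceeds a configurable threshold. The function name `zscore_outliers`, the sample readings, and the choice of the population standard deviation (`pstdev`) are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return (value, z) pairs whose absolute Z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    scored = [(v, (v - mu) / sigma) for v in values]
    return [(v, z) for v, z in scored if abs(z) > threshold]


# Hypothetical data: 19 ordinary readings and one extreme value.
readings = [50] * 19 + [500]
flagged = zscore_outliers(readings)
for value, z in flagged:
    print(f"value={value}, z={z:.2f}")
```

Passing a different `threshold` (for example 2.5 or 4) is one way to adapt the cut-off to a particular business need.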

Limitations of Z-Score: 

  1. The Z-score approach assumes that the data is normally distributed. If the data is skewed or follows a different distribution, the Z-score may fail to detect true outliers or mistakenly label normal points as outliers.
  1. It is sensitive to extreme outliers. Because extreme outliers skew the mean and inflate the standard deviation, they affect every Z-score in the dataset. This lessens the ability to identify mild outliers.
  1. It is typically applied to univariate data. For datasets with multiple variables (multivariate outliers), it might miss complex relationships between variables.
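
The second limitation can be seen in a small experiment. The sketch below uses made-up numbers (20 ordinary values, one mild outlier, one extreme outlier) and the population standard deviation. Both choices are assumptions for illustration. It compares the Z-score of the mild outlier with and without the extreme value present.

```python
from statistics import mean, pstdev


def zscore_of(value, values):
    """Z-score of `value` relative to the mean and population SD of `values`."""
    return (value - mean(values)) / pstdev(values)


typical = [100] * 20        # hypothetical ordinary observations
mild, extreme = 160, 5000   # hypothetical mild and extreme outliers

without_extreme = typical + [mild]
with_extreme = typical + [mild, extreme]

z_mild_alone = zscore_of(mild, without_extreme)   # above the threshold of 3
z_mild_masked = zscore_of(mild, with_extreme)     # far below 3 in magnitude
z_extreme = zscore_of(extreme, with_extreme)      # above the threshold of 3

print(f"mild outlier, no extreme value:   {z_mild_alone:.2f}")
print(f"mild outlier, with extreme value: {z_mild_masked:.2f}")
print(f"extreme outlier:                  {z_extreme:.2f}")
```

Once the extreme value is added, it inflates the standard deviation so much that the mild outlier no longer crosses the threshold, even though nothing about that observation changed.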

Implementation of Z-score:

Let us say we have a simple dataset that contains daily sales data for a product from 2024-01-01 to 2024-04-09, spanning 100 days. The objective of this analysis is to identify outliers in the sales figures using the Z-score method. In other words, applying the Z-score to this dataset will show us any unexpected spikes or drops in sales.

The steps in the code are:

  1. Read the dataset – sales.csv file.

  1. Plot the sales data to check the distribution.

 

  1. Calculate standard deviation (SD_sales) and mean (mean_sales) of the dataset. Use standard deviation and mean to calculate z-score. The code snippet below shows these calculations:
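
Since the original snippet is not reproduced here, the following is only a rough sketch of how the read and calculate steps might look using Python's standard library. The column names `date` and `sales`, the helper names, the synthetic stand-in for sales.csv, and the use of the population standard deviation are all assumptions, so the resulting Z-scores will not match the figures quoted in this article. The plotting step is left out.

```python
import csv
import io
from datetime import date, timedelta
from statistics import mean, pstdev


def build_sample_csv():
    """Build a hypothetical 100-day stand-in for sales.csv with one spike."""
    lines = ["date,sales"]
    start = date(2024, 1, 1)
    for i in range(100):
        sales = 800 if i == 60 else 200 + (i % 7) * 10
        lines.append(f"{start + timedelta(days=i)},{sales}")
    return "\n".join(lines)


def find_sales_outliers(csv_file, threshold=3.0):
    """Read date,sales rows and return (date, sales, z) where |z| > threshold."""
    rows = [(r["date"], float(r["sales"])) for r in csv.DictReader(csv_file)]
    sales = [s for _, s in rows]
    mean_sales = mean(sales)
    SD_sales = pstdev(sales)  # population SD; stdev() gives the sample SD
    scored = [(d, s, (s - mean_sales) / SD_sales) for d, s in rows]
    return [row for row in scored if abs(row[2]) > threshold]


# With a real file, pass open("sales.csv", newline="") instead of StringIO.
outliers = find_sales_outliers(io.StringIO(build_sample_csv()))
for day, sales, z in outliers:
    print(f"{day}: sales={sales:.0f}, z={z:.2f}")
```

Keeping the threshold as a parameter makes it easy to tighten or relax the definition of an outlier without touching the calculation itself.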

 

Interpretation of Z-Score: 

In the above screenshot, the Z-score for the observation with sales of 800 is approximately 6.31, meaning it lies about 6.31 standard deviations above the mean. Because this Z-score is well above the threshold of 3, that day's sales are far above average, making the observation a significant outlier in the dataset.