Assume that everyone is dressed casually at a party, except for one person who is wearing a superhero outfit. You notice that person right away since they do not fit in the crowd. This superhuman visitor is referred to as an outlier in data analysis.
Outliers could be mistakes, hidden treasures, or simply unusual but genuine values. Exploratory Data Analysis (EDA) is the detective's toolkit we use to find them. It is the first stage of any data analysis project and lays the foundation for reliable modeling.
EDA allows us to examine and analyze data to identify trends, connections between various variables, and—yes—the odd ones out. However, how can we spot these anomalies? What are they trying to say? Why should we give a damn?
Let us examine outliers in more detail, including their types, uses, and potential to revolutionize our understanding of and approach to data. Whether you are an experienced analyst or an inquisitive student, discovering those superhero moments in data is exciting.
Outliers are data points that differ markedly from the rest of the data. They may be signs of inaccuracies, system instability, or unexpected discoveries. They can arise for several reasons, such as incorrect data entry, application malfunction, true variability in the data, or fraudulent behavior. Because they can distort results, it is critical to identify them and deal with them appropriately.
Types of outliers:
Outlier detection has a wide range of uses across fields. It is used in banking [1] to identify fraudulent credit card transactions, and in healthcare [2] to find unusual patterns in patient data to help with monitoring and diagnosis. It contributes to the detection of spam and fake accounts on social media [3], and it aids the analysis of user buying patterns in e-commerce to enhance recommendation systems [4]. It also strengthens defenses against attacks by helping to detect malware [5] and other security risks.
There are multiple ways to detect outliers: visualization techniques (scatter plots, box plots, time series plots), machine learning methods such as Isolation Forest for multivariate data, and statistical methods such as seasonal decomposition for time series data or the Z-score method and comparable approaches.
In this article we will focus on the Z-score approach for outlier detection.
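The Z-score approach is a popular and straightforward way to find outliers. It first computes the mean and standard deviation of the data, then expresses each observation as a Z-score: the observation minus the mean, divided by the standard deviation. Observations that deviate from the mean by more than a chosen number of standard deviations are flagged.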
An observation whose absolute Z-score is greater than three, that is |Z-score| > 3, is commonly treated as an outlier. However, the exact threshold may change depending on the needs of the business.
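The rule above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the function names `z_scores` and `find_outliers` and the sample list are assumptions made for this example, and the population standard deviation is assumed (the article does not say which variant it uses).

```python
from statistics import fmean, pstdev


def z_scores(values):
    """Return the Z-score of every value: (x - mean) / standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # assumption: population standard deviation
    if std == 0:
        # All values are identical, so nothing deviates from the mean.
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]


def find_outliers(values, threshold=3.0):
    """Return (index, value, z) for every point with |z| > threshold."""
    scored = zip(values, z_scores(values))
    return [(i, x, z) for i, (x, z) in enumerate(scored) if abs(z) > threshold]


# Hypothetical sample: twenty ordinary readings and one extreme value.
data = [10] * 20 + [100]
result = find_outliers(data)
print(result)
```

One practical caveat: with the population standard deviation, no point in a sample of n values can have an absolute Z-score above the square root of n - 1, so the |Z-score| > 3 rule can never fire on samples of ten or fewer observations.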
Let us say we have a simple dataset that contains daily sales data for a product from 2024-01-01 to 2024-04-09, spanning 100 days. The objective of this analysis is to identify outliers in the sales figures using the Z-score method. In other words, applying the Z-score method to this dataset will reveal any unexpected spikes or drops in sales.
Various steps in the code are:
In the above screenshot, the Z-score for the observation with sales of 800 is approximately 6.31. This high Z-score (greater than 3) indicates that the sales on that day are far above the average, making it a significant outlier in the dataset.
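A comparable analysis can be sketched as follows. The sales figures here are synthetic stand-ins generated with a fixed seed, not the article's actual data, so the Z-score it prints will differ from the 6.31 reported above. The baseline of roughly 200 units per day, the spread, and the date of the spike are all assumptions chosen for illustration.

```python
import random
from datetime import date, timedelta
from statistics import fmean, pstdev

random.seed(42)  # fixed seed so the synthetic data is repeatable

# 100 consecutive days: 2024-01-01 through 2024-04-09.
start = date(2024, 1, 1)
days = [start + timedelta(days=i) for i in range(100)]

# Hypothetical daily sales around 200 units, with one injected spike of 800.
sales = [round(random.gauss(200, 20)) for _ in days]
spike_day = date(2024, 2, 20)  # arbitrary day chosen for the example
sales[days.index(spike_day)] = 800

# Z-score of each day: (sales - mean) / population standard deviation.
mean, std = fmean(sales), pstdev(sales)
flagged = []
for day, amount in zip(days, sales):
    z = (amount - mean) / std
    if abs(z) > 3:
        flagged.append((day, amount, round(z, 2)))

for day, amount, z in flagged:
    print(day, amount, z)
```

Only the injected spike should be flagged here, because the remaining days stay within a few standard deviations of the mean. Note that the spike itself inflates both the mean and the standard deviation, which is one reason the Z-score of an extreme value depends heavily on the rest of the dataset.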