What is an outlier, and how can you detect and handle outliers in your dataset?
When it comes to working with data, you can’t just jump straight into the analysis phase. More often than not, data analysts (or indeed anyone who deals with data as part of their job) will receive “dirty” data that needs to be cleaned before it can be analyzed.
Dirty data is essentially any data that needs to be manipulated or worked on in some way before it’s fit for analysis. Dirty data may be incomplete—for example, a survey with certain questions left blank—it may contain duplicates, it may be out of date, or it may contain outliers.
In data analytics, clean, quality data is essential to running meaningful and reliable analyses. If you go right ahead and analyze dirty data, you risk skewing the results and drawing false conclusions—which is not only bad practice but can be costly in the long run!
So, data cleaning is an absolutely essential step in the data analysis process. In fact, it’s estimated that data experts spend around 60% of their time on data cleaning. You can learn more about what data cleaning is and why it’s so important in this guide. For now, though, let’s focus on outliers.
What is an outlier?
In statistics and data analytics, an outlier is a data point that differs significantly from the other observations in your dataset. Let's say you collect data on how long it takes a group of college students to run five kilometers. You notice that most of your data falls within the range of 30 to 60 minutes, but that one person has recorded a time of just ten minutes. This would be considered an outlier.
Likewise, if you have a dataset containing exam scores where most people have scored between 50 and 80 but one person has scored 2, this would constitute an outlier.
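To make this concrete, here is a minimal sketch of how the box-plot (interquartile range) rule would flag that score of 2. The scores themselves are invented to match the example above; the 1.5 × IQR fences are the standard convention used by box plots.

```python
import statistics

# Invented exam scores matching the example above:
# most fall between 50 and 80, but one person scored 2.
scores = [62, 55, 71, 78, 50, 66, 80, 59, 2]

# The 1.5 * IQR rule used by box plots: anything beyond the fences
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged as an outlier.
q1, _, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [s for s in scores if s < lower_fence or s > upper_fence]
print(outliers)  # the score of 2 is flagged
```

Note that the fences depend only on the middle of the distribution (the quartiles), which is exactly why this rule is robust: a single extreme value can't drag the fences toward itself the way it drags the mean.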
What causes outliers in a dataset?
Outliers may be the result of an error, but that’s not always the case. It’s therefore crucial to understand not only how to detect outliers in a dataset, but also to determine the best way to handle them. Some common causes of outliers include:
- Human error when entering the data (for example, a typo)
- Intentional outliers, i.e. dummy values inserted deliberately to test detection methods
- Sampling errors as a result of extracting or combining data from multiple sources
- Natural outliers—this is when outliers occur “naturally” in the data, as opposed to being the result of an error. Natural outliers are sometimes referred to as novelties.
How to detect and handle outliers in your dataset
There are several methods you can use to detect outliers in your dataset. Without going into too much detail, these include finding the Z-Score, using the DBSCAN technique, and creating a box plot (a type of data visualization), to name just a few. Analysts can also use Excel or Python (an increasingly popular programming language) to detect outliers. It all depends on the kind of data you’re working with and the tools you’re using.
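As a taste of one of those methods, here is a minimal Z-score sketch in Python, using the invented run-time data from the earlier example. The Z-score measures how many standard deviations a point sits from the mean; a common rule of thumb flags |z| > 3, though with a small sample a threshold of 2 is often more practical. Both cut-offs are conventions, not laws.

```python
import statistics

# Invented five-kilometer run times in minutes, echoing the earlier
# example: most fall between 30 and 60, one runner clocks just 10.
times = [35, 42, 50, 38, 55, 47, 60, 33, 10]

mean = statistics.mean(times)
stdev = statistics.stdev(times)  # sample standard deviation

# Flag any point more than 2 standard deviations from the mean.
flagged = [t for t in times if abs((t - mean) / stdev) > 2]
print(flagged)  # the ten-minute time is flagged
```

One caveat worth knowing: because the outlier itself inflates the mean and standard deviation, the plain Z-score can miss outliers in small or heavily contaminated datasets, which is one reason analysts also reach for IQR-based rules or density methods like DBSCAN.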
You can learn more about how to detect and handle outliers in your dataset in the video at the top of this page. This is a recording of a once-live event hosted by expert data scientist Dana Daskalova. In the video, Dana explains in more detail what outliers are, why they might pose problems for data analysis, and, most importantly, shares some useful, actionable techniques for detecting and handling them.
Handling outliers is just one of the many tasks associated with the varied, hands-on role of the data analyst. If you’d like to learn more about what it’s like to work in the field, check out this guide on what it takes to become a data analyst, explore the average data analyst salary, and compare some of the best data analytics certification programs on the market right now.
Course Reviewer & Writer, CareerFoundry Data Analytics Tutor
Dana Daskalova started her career as a data scientist from scratch. Initially a humanities alumna, she embraced statistics and mathematics during her studies at the University of Vienna, and began tutoring others in the field. After graduating in Vienna, where she also worked as a freelance research analyst, she joined a management consulting agency in London, got acquainted with behavioural science, and started applying statistical modelling to predict customer behaviour for various retail and tech giants. Later on, Dana acquired in-depth knowledge of risk assessment and credit scoring while working as a data modeller for Experian. Something was missing, though! Destiny brought her to CareerFoundry, where she’s written and reviewed courses for the Data Analytics Program, and tutors aspiring data analysts. All of these experiences have made teaching and helping others a real passion for Dana. In her free time, she loves street photography and digging into medical data and research.