{"id":9769,"date":"2021-09-23T12:12:54","date_gmt":"2021-09-23T10:12:54","guid":{"rendered":"https:\/\/careerfoundry.inbearbeitung.de\/en\/?p=9769"},"modified":"2023-09-28T12:13:45","modified_gmt":"2023-09-28T10:13:45","slug":"what-is-an-outlier","status":"publish","type":"post","link":"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/what-is-an-outlier\/","title":{"rendered":"What Is an Outlier?"},"content":{"rendered":"<p><span style=\"font-weight: 400;\"><strong>When it comes to working in <a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/what-is-data-analytics\/\" target=\"_blank\" rel=\"noopener\">data analytics<\/a>\u2014whether that\u2019s as a data analyst or in a role that involves data in another capacity\u2014there&#8217;s a long process involved, way before the actual analysis phase begins.<\/strong> <\/span><\/p>\n<p><span style=\"font-weight: 400;\">In fact, up to two-thirds of the time taken in the data analytics process is spent cleaning what\u2019s known as \u201cdirty\u201d data: data that needs to be edited, worked on, or otherwise manipulated before it\u2019s suitable for analysis.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">During the cleaning phase, a data analyst may find outliers in the \u201cdirty\u201d data, which leads to either removing them from the dataset entirely, or handling them in another way. 
This raises the question: what is an outlier?<\/span><\/p>\n<p><span style=\"font-weight: 400;\">If you&#8217;re interested, why not try CareerFoundry&#8217;s <strong><a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/short-courses\/become-a-data-analyst\/\">free, 5-day data analytics course<\/a><\/strong>?\u00a0Otherwise, to skip ahead, just use the clickable menu:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><strong><a href=\"#what-is-an-outlier\">What is an outlier?<\/a><\/strong><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><strong><a href=\"#outliers-in-datasets\">How do outliers end up in datasets?<\/a><\/strong><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><strong><a href=\"#identifying-outliers\">How can you identify outliers?<\/a><\/strong>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><strong><a href=\"#visualizations\">Using visualizations<\/a><\/strong><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"2\"><strong><a href=\"#statistical-methods\">Using statistical methods<\/a><\/strong><\/li>\n<\/ol>\n<\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><strong><a href=\"#when-to-remove-outliers\">When should you remove outliers?<\/a><\/strong><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><strong><a href=\"#wrap-up\">Wrap-up and next steps<\/a><\/strong><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">With that, let\u2019s begin.<\/span><\/p>\n<h2 id=\"what-is-an-outlier\"><span style=\"font-weight: 400;\">1. What is an outlier?<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">In data analytics, outliers are values within a dataset that vary greatly from the others\u2014they\u2019re either much larger, or significantly smaller. Outliers may indicate variability in a measurement, an experimental error, or a novelty. 
<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In a real-world example, the average height of a giraffe is about 16 feet tall. However, there have been recent discoveries of two giraffes that stand at 9 feet and 8.5 feet, respectively. These two giraffes would be considered outliers in comparison to the general giraffe population.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">When going through <\/span><a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/the-data-analysis-process-step-by-step\/\"><span style=\"font-weight: 400;\">the process of data analysis<\/span><\/a><span style=\"font-weight: 400;\">, outliers can cause anomalies in the results obtained. This means that they require some special attention and, in some cases, will need to be removed in order to analyze data effectively.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">There are two main reasons why giving outliers special attention is a necessary aspect of the data analytics process:<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Outliers may have a negative effect on the result of an analysis<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Outliers\u2014or their behavior\u2014may be the information that a data analyst requires from the analysis<\/span><\/li>\n<\/ol>\n<h3><span style=\"font-weight: 400;\">Types of outliers<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">There are two kinds of outliers:<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><\/p>\n<ul>\n<li aria-level=\"1\"><span style=\"font-weight: 400;\">A<\/span><b> univariate outlier<\/b><span style=\"font-weight: 400;\"> is an extreme value that relates to just one variable. 
For example, <\/span><a href=\"https:\/\/www.rips-irsp.com\/articles\/10.5334\/irsp.289\/\" rel=\"noopener\"><span style=\"font-weight: 400;\">Sultan K\u00f6sen is currently the tallest man alive<\/span><\/a><span style=\"font-weight: 400;\">, with a height of 8ft, 2.8 inches (251cm). This case would be considered a univariate outlier as it\u2019s an extreme case of just one factor: height.\u00a0<\/span><\/li>\n<li aria-level=\"1\"><span style=\"font-weight: 400;\">A <\/span><b>multivariate outlier <\/b><span style=\"font-weight: 400;\">is a combination of unusual or extreme values for at least two variables. For example, if you\u2019re looking at both the height and weight of a group of adults, you might observe that one person in your dataset is 5ft 9 inches tall\u2014a measurement that would fall within the normal range for this particular variable. You may also observe that this person weighs 110lbs. Again, this observation alone falls within the normal range for the variable of interest: weight. However, when you consider these two observations in conjunction, you have an adult who is 5ft 9 inches and weighs 110lbs\u2014a surprising combination. 
That\u2019s a multivariate outlier.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">Besides the distinction between univariate and multivariate outliers, you\u2019ll see outliers categorized as any of the following:<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><\/p>\n<ul>\n<li aria-level=\"1\"><b>Global outliers (otherwise known as point outliers)<\/b><span style=\"font-weight: 400;\"> are single data points that lie far from the rest of the data distribution.\u00a0<\/span><\/li>\n<li aria-level=\"1\"><b>Contextual outliers (otherwise known as conditional outliers)<\/b><span style=\"font-weight: 400;\"> are values that significantly deviate from the rest of the data points in the same context, meaning that the same value may not be considered an outlier if it occurred in a different context. Outliers in this category are commonly found in time series data.\u00a0<\/span><\/li>\n<li aria-level=\"1\"><b>Collective outliers<\/b><span style=\"font-weight: 400;\"> are a subset of data points that, taken together, deviate markedly from the rest of the dataset, even if the individual values would not look unusual on their own.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">Now that we know what an outlier is, let\u2019s take a look at how they end up in datasets in the first place.<\/span><\/p>\n<h2 id=\"outliers-in-datasets\"><span style=\"font-weight: 400;\">2. 
How do outliers end up in datasets?<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Now that we\u2019ve learned what outliers are, it\u2019s worth asking: how do they end up in datasets in the first place?\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Here are some of the more common causes of outliers in datasets:<\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Human error while manually entering data, such as a typo<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Intentional outliers, such as dummy values included in a dataset to test detection methods<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Sampling errors that arise from extracting or mixing data from inaccurate or disparate sources<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Data processing errors that arise from data manipulation, or unintended mutations of a dataset<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Measurement errors as a result of instrumental error<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Experimental errors, from the data extraction process or experiment planning or execution<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Natural outliers, which occur \u201cnaturally\u201d in the dataset rather than resulting from any of the errors listed above. These naturally occurring outliers are known as novelties<\/span><\/li>\n<\/ul>\n<h2 id=\"identifying-outliers\"><span style=\"font-weight: 400;\">3. 
How can you identify outliers?<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Now that you know how each type of outlier is categorized, let\u2019s move on to figuring out how to identify them in your datasets. You can learn how to detect and handle them in our <a href=\"https:\/\/www.youtube.com\/watch?v=dKjHd7i-jB4\" target=\"_blank\" rel=\"noopener\">video seminar on outliers<\/a>, presented by expert data scientist Dana Daskalova.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">With small datasets, it can be easy to spot outliers manually (for example, with a set of data being 28, 26, 21, 24, 78, you can see that 78 is the outlier) but when it comes to large datasets or <\/span><a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/what-is-big-data\/\"><span style=\"font-weight: 400;\">big data<\/span><\/a><span style=\"font-weight: 400;\">, other tools are required.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We\u2019ll discuss some of the methods commonly used to identify outliers with visualizations or statistical methods, but there are many others available for implementation into your data analytics process. The method that you end up using will depend on the type of dataset you\u2019re working with, as well as the tools you\u2019re working with.<\/span><\/p>\n<h3 id=\"visualizations\"><span style=\"font-weight: 400;\">How to identify outliers using visualizations<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">In data analytics, analysts create <\/span><a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/what-is-data-visualization\/\"><span style=\"font-weight: 400;\">data visualizations<\/span><\/a><span style=\"font-weight: 400;\"> to present data graphically in a meaningful and impactful way, in order to present their findings to relevant stakeholders. 
These visualizations can easily show trends, patterns, and outliers from a large set of data in the form of maps, graphs and charts.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">You can read more about the <\/span><a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/data-visualization-types\/\"><span style=\"font-weight: 400;\">different types of data visualizations in this article<\/span><\/a><span style=\"font-weight: 400;\">, but here are two that a data analyst could use in order to easily find outliers.\u00a0<\/span><\/p>\n<h4><span style=\"font-weight: 400;\">Identifying outliers with box plots<\/span><\/h4>\n<figure id=\"attachment_9792\" aria-describedby=\"caption-attachment-9792\" style=\"width: 600px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-9792\" src=\"http:\/\/careerfoundry.inbearbeitung.de\/en\/wp-content\/uploads\/2021\/09\/Elements_of_a_boxplot_en-1024x409.jpg\" alt=\"Illustration of the elements of a boxplot, showing outliers (to the left of the diagram)\" width=\"600\" height=\"240\" title=\"\" srcset=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-content\/uploads\/2021\/09\/Elements_of_a_boxplot_en-1024x409.jpg 1024w, https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-content\/uploads\/2021\/09\/Elements_of_a_boxplot_en-300x120.jpg 300w, https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-content\/uploads\/2021\/09\/Elements_of_a_boxplot_en-768x307.jpg 768w, https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-content\/uploads\/2021\/09\/Elements_of_a_boxplot_en.jpg 1280w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><figcaption id=\"caption-attachment-9792\" class=\"wp-caption-text\">Elements of a boxplot, showing outliers (to the left). By Ruediger85 (changed language). 
Original by RobSeb (Own work) [CC-BY-SA-3.0], via <a href=\"https:\/\/commons.wikimedia.org\/wiki\/File:Elements_of_a_boxplot_en.svg\" rel=\"noopener\">Wikimedia Commons<\/a><\/figcaption><\/figure>\n<p><span style=\"font-weight: 400;\">Visualizing data as a box plot makes it very easy to spot outliers. A box plot shows a \u201cbox\u201d spanning the interquartile range (from the lower quartile to the upper quartile, with a line inside the box marking the median). The \u201cwhiskers\u201d extend from the box to the smallest and largest values that are not classed as outliers; a common convention places that cutoff at 1.5 times the interquartile range beyond each quartile. Any outliers are drawn as individual points beyond the whiskers. If the box sits closer to the upper whisker, the data has a longer tail toward low values, which is where extreme values are most likely to appear. Likewise, if the box sits closer to the lower whisker, extreme values are most likely to appear at the high end. Box plots can be produced easily using <\/span><a href=\"https:\/\/support.microsoft.com\/en-us\/office\/create-a-box-plot-10204530-8cdf-40fe-a711-2eb9785e510f\" rel=\"noopener\"><span style=\"font-weight: 400;\">Excel<\/span><\/a><span style=\"font-weight: 400;\"> or in Python, using a module such as <\/span><a href=\"https:\/\/plotly.com\/python\/box-plots\/\" rel=\"noopener\"><span style=\"font-weight: 400;\">Plotly<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<h4><span style=\"font-weight: 400;\">Identifying outliers with scatter plots<\/span><\/h4>\n<p><span style=\"font-weight: 400;\">As the name suggests, scatter plots show the values of a dataset \u201cscattered\u201d across two axes, one for each of two variables. Outliers stand out as the data points that lie furthest from the regression line (a single line that best fits the data). 
As with box plots, these types of visualizations are also easily produced using <\/span><a href=\"https:\/\/support.microsoft.com\/en-us\/topic\/present-your-data-in-a-scatter-chart-or-a-line-chart-4570a80f-599a-4d6b-a155-104a9018b86e\" rel=\"noopener\"><span style=\"font-weight: 400;\">Excel<\/span><\/a><span style=\"font-weight: 400;\"> or in <\/span><a href=\"https:\/\/jakevdp.github.io\/PythonDataScienceHandbook\/04.02-simple-scatter-plots.html\" rel=\"noopener\"><span style=\"font-weight: 400;\">Python<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<h3 id=\"statistical-methods\"><span style=\"font-weight: 400;\">How to identify outliers using statistical methods<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">In this section, we\u2019ll describe some commonly used statistical methods for finding outliers. A data analyst may use a statistical method to assist with machine learning modeling, which can be improved by identifying, understanding, and\u2014in some cases\u2014removing outliers.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We\u2019ll discuss three methods commonly used to identify outliers, but there are many more that may be more or less useful to your analyses.<\/span><\/p>\n<h4><span style=\"font-weight: 400;\">Identifying outliers with DBSCAN\u00a0<\/span><\/h4>\n<figure id=\"attachment_9790\" aria-describedby=\"caption-attachment-9790\" style=\"width: 600px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-9790\" src=\"http:\/\/careerfoundry.inbearbeitung.de\/en\/wp-content\/uploads\/2021\/09\/DBSCAN-Illustration.jpg\" alt=\"Illustration of a DBSCAN cluster analysis\" width=\"600\" height=\"434\" title=\"\" srcset=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-content\/uploads\/2021\/09\/DBSCAN-Illustration.jpg 1280w, https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-content\/uploads\/2021\/09\/DBSCAN-Illustration-300x217.jpg 300w, 
https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-content\/uploads\/2021\/09\/DBSCAN-Illustration-1024x740.jpg 1024w, https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-content\/uploads\/2021\/09\/DBSCAN-Illustration-768x555.jpg 768w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><figcaption id=\"caption-attachment-9790\" class=\"wp-caption-text\">Chire, <a href=\"https:\/\/creativecommons.org\/licenses\/by-sa\/3.0\/\" rel=\"noopener\">CC-BY-SA-3.0<\/a>, via <a href=\"https:\/\/commons.wikimedia.org\/wiki\/File:DBSCAN-Illustration.svg\" rel=\"noopener\">Wikimedia Commons<\/a><\/figcaption><\/figure>\n<p>The above illustration shows a DBSCAN <a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/what-is-cluster-analysis\/\" target=\"_blank\" rel=\"noopener\">cluster analysis<\/a>. Points around A are core points. Points B and C are not core points, but are density-connected via the cluster of A (and thus belong to this cluster). Point N is noise, since it is neither a core point nor reachable from a core point.<\/p>\n<p><span style=\"font-weight: 400;\">DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering method that\u2019s used in machine learning and data analytics applications. By grouping points according to how densely they are packed, DBSCAN can reveal clusters in a dataset, and it can also be applied to detect outliers.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">DBSCAN is a non-parametric, density-based clustering algorithm that finds and groups together points that are closely packed. 
Outliers are marked as points that lie alone in low-density regions, far away from their neighbors.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Data analysts and those working with data mining and machine learning will surely come across DBSCAN\u2014it\u2019s an algorithm that\u2019s been around since 1996 and, having won a \u2018test of time award\u2019 at a leading data mining conference, it seems like it\u2019s going to remain an industry standard. Implementations of DBSCAN are available in <\/span><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.cluster.DBSCAN.html\" rel=\"noopener\"><span style=\"font-weight: 400;\">scikit-learn<\/span><\/a><span style=\"font-weight: 400;\">, in the <\/span><a href=\"https:\/\/cran.r-project.org\/web\/packages\/dbscan\/index.html\" rel=\"noopener\"><span style=\"font-weight: 400;\">dbscan package for R<\/span><\/a><span style=\"font-weight: 400;\">, and in the Python library <\/span><a href=\"https:\/\/github.com\/annoviko\/pyclustering\" rel=\"noopener\"><span style=\"font-weight: 400;\">pyclustering<\/span><\/a><span style=\"font-weight: 400;\">.\u00a0<\/span><\/p>\n<p><b>Read more: <\/b><a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/best-machine-learning-languages\/\"><span style=\"font-weight: 400;\">What\u2019s the Best Language for Machine Learning?<\/span><\/a><\/p>\n<h4><span style=\"font-weight: 400;\">Identifying outliers by finding the Z-Score\u00a0<\/span><\/h4>\n<p><span style=\"font-weight: 400;\">Z-score\u2014sometimes called the standard score\u2014is defined on <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/Standard_score\" rel=\"noopener\"><span style=\"font-weight: 400;\">Wikipedia<\/span><\/a><span style=\"font-weight: 400;\"> as \u201cthe number of standard deviations by which the value of a raw score (i.e., an observed value or data point) is above or below the mean value of what is being observed or measured.\u201d<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Computing a z-score helps 
describe any data point by placing it in relation to the standard deviation and mean of the whole group of data points. Positive z-scores correspond to raw scores above the mean, whereas negative z-scores correspond to raw scores below it. Once every value has been converted, the z-scores themselves have a mean of 0 and a standard deviation of 1. Note that this rescaling does not change the shape of the distribution, so it does not make the data normally distributed.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Outliers are found by looking for data points whose z-scores are too far from 0 (the mean). In many cases, the \u201ctoo far\u201d threshold is 3: any data point with a z-score above +3 or below -3 is considered an outlier.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Z-scores are often used in stock market data. Z-scores can be calculated in <\/span><a href=\"https:\/\/www.howtogeek.com\/400178\/how-to-calculate-a-z-score-using-microsoft-excel\/\" rel=\"noopener\"><span style=\"font-weight: 400;\">Excel<\/span><\/a><span style=\"font-weight: 400;\">, in <\/span><a href=\"https:\/\/www.r-bloggers.com\/2020\/02\/how-to-compute-the-z-score-with-r\/\" rel=\"noopener\"><span style=\"font-weight: 400;\">R<\/span><\/a><span style=\"font-weight: 400;\">, or with the <\/span><a href=\"https:\/\/www.socscistatistics.com\/tests\/ztest\/zscorecalculator.aspx\" rel=\"noopener\"><span style=\"font-weight: 400;\">Quick Z-Score Calculator<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<h4><span style=\"font-weight: 400;\">Identifying outliers with the Isolation Forest algorithm<\/span><\/h4>\n<p><span style=\"font-weight: 400;\">Isolation Forest\u2014otherwise known as iForest\u2014is another anomaly detection algorithm. 
The authors of the algorithm relied on two quantitative properties of anomalous data points to isolate them from normal data points in a dataset: they are \u201cfew\u201d in number and have attribute values that are \u201cdifferent\u201d from those of normal instances.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To find these outliers, Isolation Forest builds a set of \u201cIsolation Trees\u201d by repeatedly splitting the data on randomly chosen features and split values. Because outliers are few and different, they tend to be separated from the rest of the data after fewer splits, so they are the points with shorter average path lengths across the trees.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Isolation Forest is used predominantly in machine learning. If you\u2019d like to use the algorithm in your analyses, an implementation released by the algorithm\u2019s author is available on <\/span><a href=\"https:\/\/sourceforge.net\/projects\/iforest\/\" rel=\"noopener\"><span style=\"font-weight: 400;\">SourceForge<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<h2 id=\"when-to-remove-outliers\"><span style=\"font-weight: 400;\">4. When should you remove outliers?<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">It may seem natural to want to remove outliers as part of the <\/span><a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/what-is-data-cleaning\/\"><span style=\"font-weight: 400;\">data cleaning process<\/span><\/a><span style=\"font-weight: 400;\">. But in reality, sometimes it&#8217;s best\u2014even absolutely necessary\u2014to keep outliers in your dataset.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Removing outliers solely due to their place in the extremes of your dataset may create inconsistencies in your results, which would be counterproductive to your goals as a data analyst. These inconsistencies may lead to reduced statistical significance in an analysis.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">But what do we mean by statistical significance? 
Let\u2019s take a look.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">A quick introduction to hypothesis testing and statistical significance (p-value)<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">When you collect and analyze data, you\u2019re looking to draw conclusions about a wider population based on your sample of data. For example, if you\u2019re interested in the eating habits of the New York City population, you\u2019ll gather data on a sample of that population (say, 1000 people). When you analyze this data, you want to determine if your findings can be applied to the wider population, or if they just occurred within this particular sample by chance (or due to another influencing factor). You do this by calculating the statistical significance of your findings.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This is part of <\/span><b>hypothesis testing<\/b><span style=\"font-weight: 400;\">.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">With hypothesis testing, you start with two hypotheses: the <\/span><b>null hypothesis<\/b><span style=\"font-weight: 400;\"> and the <\/span><b>alternative hypothesis<\/b><span style=\"font-weight: 400;\">. Based on the statistical significance of your findings, you\u2019ll either reject the null hypothesis in favor of the alternative or fail to reject it. The null hypothesis states that there is no relationship between the two variables you\u2019re looking at. The alternative hypothesis states that there is one.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Let\u2019s explain with an example. Imagine you\u2019re looking at the relationship between people\u2019s self-esteem (measured as a score out of 100) and their coffee consumption (measured in terms of cups per day). These are your two variables: self-esteem and coffee consumption. 
When analyzing your data, you find that there does indeed appear to be a correlation (or a relationship) between self-esteem and coffee consumption. For instance, higher coffee consumption correlates with a higher self-esteem score. Is this a fluke finding? Or do people who drink more coffee really tend to have higher self-esteem?<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To evaluate the strength of your findings, you\u2019ll need to determine if the relationship between the two variables is statistically significant. There are <\/span><a href=\"https:\/\/www.machinelearningplus.com\/statistics\/statistical-significance-tests-r\/\" rel=\"noopener\"><span style=\"font-weight: 400;\">several different tests used to calculate statistical significance<\/span><\/a><span style=\"font-weight: 400;\">, depending on the type of data you have. We won\u2019t go into detail here, but essentially, you run the appropriate significance test in order to find the p-value.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The p-value is a measure of probability: it tells you how likely it would be to see results at least as extreme as yours if the null hypothesis were true. A p-value of less than 0.05 is conventionally taken as strong evidence against the null hypothesis; in other words, results like yours would occur less than 5% of the time through chance alone if there were no real relationship. In this case, your findings can be deemed <\/span><b>statistically significant<\/b><span style=\"font-weight: 400;\">. If, on the other hand, your significance test finds a p-value greater than 0.05, your findings are deemed <\/span><b>not statistically significant<\/b><span style=\"font-weight: 400;\">. They may have just occurred by chance.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Removing outliers without good reason can skew your results in a way that impacts the p-value, thus making your findings unreliable. 
So: it\u2019s essential to think carefully before simply removing outliers from your dataset!<\/span><\/p>\n<p><span style=\"font-weight: 400;\">While evaluating potential outliers to remove from your dataset, consider the following:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Is the outlier a measurement error or data entry error? If so, correct it manually where possible. If it\u2019s unable to be corrected, it should be considered incorrect, and thus legitimately removed from the dataset.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Is the outlier a natural part of the data population being analyzed? If not, you should remove it.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Can you explain your reasoning for removing an outlier? If not, you should not remove it. When removing outliers, you should provide documentation of the excluded data points, giving reasoning for your choices.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">If there is disagreement within your group about the removal of an outlier (or a group of outliers), it may be useful to perform two analyses: the first with the dataset intact, and the second with the outliers removed. Compare the results and see which one has provided the most useful and realistic insights.<\/span><\/p>\n<h2 id=\"wrap-up\"><span style=\"font-weight: 400;\">5. Wrap-up and next steps<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">In this article, we\u2019ve covered the basic definition of an outlier, as well as its possible categorizations. 
We then looked at how outliers may end up in a dataset, covered some commonly used methods of identifying them, and discussed whether or not it\u2019s appropriate to remove them in order to create useful insights for your organization.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Handling outliers is a fascinating and sometimes complicated process, which makes the world of data analytics all the more exciting! If you\u2019d like to learn more about what it\u2019s like to work as a data analyst, check out our <\/span><strong><a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/short-courses\/become-a-data-analyst\/\">free, 5-day data analytics short course<\/a><\/strong><span style=\"font-weight: 400;\">.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">You could also check out some of the other articles in our series about data analytics:<\/span><\/p>\n<ul>\n<li><strong><a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/monte-carlo-method\/\">What Is the Monte Carlo Method?<\/a><\/strong><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><strong><a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/what-is-data-processing\/\">What is Data Processing? A Beginner\u2019s Guide<\/a><\/strong><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><strong><a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/what-is-time-series-data\/\">What Is Time Series Data and How Is It Analyzed?<\/a><\/strong><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>If you&#8217;re new to the field of data analytics, it won&#8217;t be long until you start hearing about outliers. But what is an outlier, and what do you need to do with them once you find them? 
Find out here.<\/p>\n","protected":false},"author":120,"featured_media":9772,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_lmt_disableupdate":"yes","_lmt_disable":"","footnotes":""},"categories":[3],"tags":[],"class_list":["post-9769","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-analytics"],"acf":{"homepage_category_featured":false},"modified_by":"Matthew Deery","_links":{"self":[{"href":"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-json\/wp\/v2\/posts\/9769","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-json\/wp\/v2\/users\/120"}],"replies":[{"embeddable":true,"href":"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-json\/wp\/v2\/comments?post=9769"}],"version-history":[{"count":4,"href":"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-json\/wp\/v2\/posts\/9769\/revisions"}],"predecessor-version":[{"id":29185,"href":"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-json\/wp\/v2\/posts\/9769\/revisions\/29185"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-json\/wp\/v2\/media\/9772"}],"wp:attachment":[{"href":"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-json\/wp\/v2\/media?parent=9769"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-json\/wp\/v2\/categories?post=9769"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-json\/wp\/v2\/tags?post=9769"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}