{"id":3773,"date":"2020-11-17T01:00:00","date_gmt":"2020-11-17T00:00:00","guid":{"rendered":"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/uncategorized\/what-is-data-cleaning\/"},"modified":"2023-09-14T14:33:16","modified_gmt":"2023-09-14T12:33:16","slug":"what-is-data-cleaning","status":"publish","type":"post","link":"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/what-is-data-cleaning\/","title":{"rendered":"What Is Data Cleaning and Why Does It Matter?"},"content":{"rendered":"<p id=\"effective-data-cleaning-is-a-vital-part-of-the-data-analytics-process-but-what-is-it-why-is-it-important-and-how-do-you-do-it-read-on-to-find-out\"><strong>Effective data cleaning is a vital part of the data analytics process. But what is it, why is it important, and how do you do it?\u00a0<\/strong><\/p>\n<p>Good data hygiene is so important for business. For starters, it\u2019s good practice to keep on top of your data, ensuring that it\u2019s accurate and up-to-date. However, data cleaning is also a vital part of the data analytics process. If your data has inconsistencies or errors, you can bet that your results will be flawed, too. And when you\u2019re making business decisions based on those insights, it doesn\u2019t take a genius to figure out what might go wrong!<\/p>\n<p>In a field like marketing, bad insights can mean wasting money on poorly targeted campaigns. In a field like <a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/healthcare-data-analytics\/\" target=\"_blank\" rel=\"noopener\">healthcare<\/a> or the sciences, it can quite literally mean the difference between life and death. In this article, I\u2019ll explore exactly what data cleaning is and why it\u2019s so vital to get it right. 
We\u2019ll also provide an overview of the key steps you should take when cleaning your data.<\/p>\n<p>Why not get familiar with data cleaning and the rest of the data analytics process in our <a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/short-courses\/become-a-data-analyst\/\"><strong>free 5-day data short course<\/strong><\/a>?<\/p>\n<p>We\u2019ll answer the following questions:<\/p>\n<ol>\n<li><a href=\"#what-is-data-cleaning\">What is data cleaning?<\/a><\/li>\n<li><a href=\"#why-is-data-cleaning-important\">Why is data cleaning important?<\/a><\/li>\n<li><a href=\"#how-to-clean-your-data-step-by-step\">How do you clean data?<\/a><\/li>\n<li><a href=\"#data-cleaning-tools\">What are some of the most useful data cleaning tools?<\/a><\/li>\n<\/ol>\n<p>First up\u2026<\/p>\n<h2 id=\"what-is-data-cleaning\">1. What is data cleaning?<\/h2>\n<p>Data cleaning (sometimes also known as data cleansing or data wrangling) is an important early step in <a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/the-data-analysis-process-step-by-step\/\" target=\"_blank\" rel=\"noopener\">the data analytics process<\/a>.<\/p>\n<p>This crucial exercise, which involves preparing and validating data, usually takes place before your core analysis. Data cleaning is not just a case of removing erroneous data, although that\u2019s often part of it. The majority of work goes into detecting rogue data and (wherever possible) correcting it.<\/p>\n<h3>What is rogue data?<\/h3>\n<p>\u2018Rogue data\u2019 includes things like incomplete, inaccurate, irrelevant, corrupt or incorrectly formatted data. The process also involves deduplicating, or \u2018deduping\u2019. 
This effectively means merging or removing identical data points.<\/p>\n<h3>Why is it important to correct rogue data?<\/h3>\n<p>The answer is straightforward enough: if you don\u2019t, they\u2019ll impact the results of your analysis.<\/p>\n<p>Since data analysis is commonly used to inform business decisions, results need to be accurate. In this case, it might seem safer simply to remove rogue or incomplete data. But this poses problems, too: an incomplete dataset will also impact the results of your analysis. That\u2019s why one of the main aims of data cleaning is to keep as much of a dataset intact as possible. This helps improve the reliability of your insights.<\/p>\n<p>Data cleaning is not only important for data analysis. It\u2019s also important for general business housekeeping (or \u2018data governance\u2019). The <a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/big-data-examples\/\">sources of big data<\/a> are dynamic and constantly changing. Regularly maintaining databases, therefore, helps you keep on top of things. This has several additional benefits, which we\u2019ll cover in the next section.<\/p>\n<p><strong>Want to try your hand at cleaning a dataset?\u00a0<\/strong>You may be interested in this introductory tutorial to data cleaning, hosted by Dr. Humera Noor Minhas.<\/p>\n<style>.embed-container { position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden; max-width: 100%; } .embed-container iframe, .embed-container object, .embed-container embed { position: absolute; top: 0; left: 0; width: 100%; height: 100%; }<\/style>\n<div class=\"embed-container\"><iframe src=\"https:\/\/www.youtube.com\/embed\/kNl7YDN-_js\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<h2 id=\"why-is-data-cleaning-important\">2. Why is data cleaning important?<\/h2>\n<p>A common refrain you\u2019ll hear in the world of data analytics is: \u2018garbage in, garbage out\u2019. 
This maxim, so often used by data analysts, even has its own acronym\u2026 GIGO. But what does it mean?<\/p>\n<p>Essentially, GIGO means that if the quality of your data is sub-par, then the results of any analysis using those data will also be flawed. Even if you follow every other step of the data analytics process to the letter, if your data is a mess, it won\u2019t make a difference.<\/p>\n<p>For this reason, the importance of properly cleaning data can\u2019t be overstated. It\u2019s like creating a foundation for a building: do it right and you can build something strong and long-lasting. Do it wrong, and your building will soon collapse. This mindset is why good data analysts will spend anywhere from <strong>60-80% of their time<\/strong> carrying out data cleaning activities. Beyond data analytics, good data hygiene has several other benefits. Let\u2019s look at those now.<\/p>\n<h3 id=\"key-benefits-of-data-cleaning\">Key benefits of data cleaning<\/h3>\n<p>As we\u2019ve covered, data analysis requires effectively cleaned data to produce accurate and trustworthy insights. But clean data has a range of other benefits, too:<\/p>\n<ul>\n<li><strong>Staying organized:<\/strong> Today\u2019s businesses collect lots of information from clients, customers, product users, and so on. These details include everything from addresses and phone numbers to bank details and more. Cleaning this data regularly means keeping it tidy. It can then be stored more effectively and securely.<\/li>\n<li><strong>Avoiding mistakes:<\/strong> Dirty data doesn\u2019t just cause problems for data analytics. It also affects daily operations. For instance, marketing teams usually have a customer database. If that database is in good order, they\u2019ll have access to helpful, accurate information. 
If it\u2019s a mess, mistakes are bound to happen, such as\u00a0<a href=\"https:\/\/www.dailyrecord.co.uk\/news\/scottish-news\/snp-huge-leaflet-blunder-100000-15059539\" target=\"_blank\" rel=\"noopener\">using the wrong name in personalized mail outs<\/a>.<\/li>\n<li><strong>Improving productivity:<\/strong> Regularly cleaning and updating data means rogue information is quickly purged. This saves teams from having to wade through old databases or documents to find what they\u2019re looking for.<\/li>\n<li><strong>Avoiding unnecessary costs:<\/strong> Making business decisions with bad data can lead to expensive mistakes. But bad data can incur costs in other ways too. Simple things, like processing errors, can quickly snowball into bigger problems. Regularly checking data allows you to detect blips sooner. This gives you a chance to correct them before they require a more time-consuming (and costly) fix.<\/li>\n<li><strong>Improved mapping:<\/strong> Increasingly, organizations are looking to improve their internal data infrastructures. For this, they often hire data analysts to carry out data modeling and to build new applications. Having clean data from the start makes it far easier to collate and map, meaning that a solid data hygiene plan is a sensible measure.<\/li>\n<\/ul>\n<h3>Data quality<\/h3>\n<p>Key to data cleaning is the concept of data quality. Data quality measures the objective and subjective suitability of any dataset for its intended purpose.<\/p>\n<p>There are a number of characteristics that affect the quality of data including accuracy, completeness, consistency, timeliness, validity, and uniqueness. 
You can <a href=\"\/en\/blog\/data-analytics\/what-is-data-quality\/\">learn more about data quality in this full article<\/a>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-9288\" src=\"http:\/\/careerfoundry.inbearbeitung.de\/en\/wp-content\/uploads\/2020\/11\/data-analysts-cleaning-data.jpeg\" alt=\"Data analyst colleagues working together to clean data\" width=\"1200\" height=\"600\" title=\"\" srcset=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-content\/uploads\/2020\/11\/data-analysts-cleaning-data.jpeg 1200w, https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-content\/uploads\/2020\/11\/data-analysts-cleaning-data-300x150.jpeg 300w, https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-content\/uploads\/2020\/11\/data-analysts-cleaning-data-1024x512.jpeg 1024w, https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-content\/uploads\/2020\/11\/data-analysts-cleaning-data-768x384.jpeg 768w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><\/p>\n<h2 id=\"how-to-clean-your-data-step-by-step\">3. How to clean your data (step-by-step)<\/h2>\n<p>So far, we\u2019ve covered what data cleaning is and why it\u2019s important. In this section, we\u2019ll explore the practical aspects of effective data cleaning. Since there are multiple approaches you can take for completing each of these tasks, we\u2019ll focus instead on the high-level activities.<\/p>\n<h3 id=\"step-1-get-rid-of-unwanted-observations\">Step 1: Get rid of unwanted observations<\/h3>\n<p>The first stage in any data cleaning process is to remove the observations (or data points) you don\u2019t want. This includes irrelevant observations, i.e. those that don\u2019t fit the problem you\u2019re looking to solve.<\/p>\n<p>For instance, if we were running an analysis on vegetarian eating habits, we could remove any meat-related observations from our data set. This step of the process also involves removing duplicate data. 
Duplicate data commonly occurs when you combine multiple datasets, scrape data online, or receive it from third-party sources.<\/p>\n<h3 id=\"step-2-fix-structural-errors\">Step 2: Fix structural errors<\/h3>\n<p>Structural errors usually emerge as a result of poor data housekeeping. They include things like typos and inconsistent capitalization, which often occur during manual data entry.<\/p>\n<p>Let\u2019s say you have a dataset covering the properties of different metals. \u2018Iron\u2019 (uppercase) and \u2018iron\u2019 (lowercase) may appear as separate classes (or categories). Ensuring that capitalization is consistent makes that data much cleaner and easier to use. You should also check for mislabeled categories.<\/p>\n<p>For instance, \u2018Iron\u2019 and \u2018Fe\u2019 (iron\u2019s chemical symbol) might be labeled as separate classes, even though they\u2019re the same. Other things to look out for are the use of underscores, dashes, and other rogue punctuation!<\/p>\n<h3 id=\"step-3-standardize-your-data\">Step 3: Standardize your data<\/h3>\n<p>Standardizing your data is closely related to fixing structural errors, but it takes it a step further. Correcting typos is important, but you also need to ensure that every cell type follows the same rules.<\/p>\n<p>For instance, you should decide whether values should be all lowercase or all uppercase, and keep this consistent throughout your dataset. Standardizing also means ensuring that things like numerical data use the same unit of measurement.<\/p>\n<p>As an example, combining miles and kilometers in the same dataset will cause problems. Even dates have different conventions, with the US putting the month before the day, and Europe putting the day before the month. Keep your eyes peeled; you\u2019ll be surprised what slips through.<\/p>\n<h3 id=\"step-4-remove-unwanted-outliers\">Step 4: Remove unwanted outliers<\/h3>\n<p>Outliers are data points that dramatically differ from others in the set. 
They can cause problems with certain types of data models and analysis.<\/p>\n<p>For instance, while <a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/what-is-a-decision-tree\/\" target=\"_blank\" rel=\"noopener\">decision tree algorithms<\/a>\u00a0are generally accepted to be quite robust to outliers, outliers can easily skew a linear regression model. While outliers can affect the results of an analysis, you should always approach removing them with caution.<\/p>\n<p>Only remove an outlier if you can prove that it is erroneous, e.g. if it is obviously due to incorrect data entry, or if it doesn\u2019t match a comparison \u2018gold standard\u2019 dataset.<\/p>\n<h3 id=\"step-5-fix-contradictory-data-errors\">Step 5: Fix contradictory data errors<\/h3>\n<p>Contradictory (or cross-set) data errors are another common problem to look out for. Contradictory errors are where you have a full record containing inconsistent or incompatible data.<\/p>\n<p>An example could be a log of athlete racing times. If the column showing the total amount of time spent running isn\u2019t equal to the sum of the individual race times, you\u2019ve got a cross-set error.<\/p>\n<p>Another example might be a pupil\u2019s grade score being associated with a field that only allows options for \u2018pass\u2019 and \u2018fail\u2019, or an employee\u2019s taxes being greater than their total salary.<\/p>\n<h3 id=\"step-6-type-conversion-and-syntax-errors\">Step 6: Type conversion and syntax errors<\/h3>\n<p>Once you\u2019ve tackled other inconsistencies, the content of your spreadsheet or dataset might look good to go.<\/p>\n<p>However, you need to check that everything is in order behind the scenes, too. Type conversion means making sure each value is stored as the correct data type. A simple example is that numbers are numerical data, whereas currency uses a currency value. 
You should ensure that numbers are appropriately stored as numerical data, text as text input, dates as objects, and so on.<\/p>\n<p>In case you missed any part of Step 2, you should also remove syntax errors\/white space (erroneous gaps before, in the middle of, or between words).<\/p>\n<h3 id=\"step-7-deal-with-missing-data\">Step 7: Deal with missing data<\/h3>\n<p>When data is missing, what do you do? There are three common approaches to this problem.<\/p>\n<p>The first is to <strong>remove the entries<\/strong> associated with the missing data. The second is to <strong>impute (or guess) the missing data<\/strong>, based on other, similar data. In most cases, however, both of these options negatively impact your dataset in other ways. Removing data often means losing other important information. Guessing data might reinforce existing patterns, which could be wrong.<\/p>\n<p>The third option (and often the best one) is to <strong>flag the data as missing<\/strong>. To do this, ensure that empty fields have the same value, e.g. \u2018missing\u2019 or \u20180\u2019 (if it\u2019s a numerical field). Then, when you carry out your analysis, you\u2019ll at least be taking into account that data is missing, which in itself can be informative.<\/p>\n<h3 id=\"step-8-validate-your-dataset\">Step 8: Validate your dataset<\/h3>\n<p>Once you\u2019ve cleaned your dataset, the final step is to validate it. Validating data means checking that the process of making corrections, deduping, standardizing (and so on) is complete.<\/p>\n<p>This often involves using scripts that check whether or not the dataset agrees with validation rules (or \u2018check routines\u2019) that you have predefined. You can also carry out validation against existing, \u2018gold standard\u2019 datasets.<\/p>\n<p>This all sounds a bit technical, but all you really need to know at this stage is that validation means checking the data is ready for analysis. 
If there are still errors (which there usually will be) you\u2019ll need to go back and fix them\u2026there\u2019s a reason why data analysts spend so much of their time cleaning data!<\/p>\n<h2 id=\"data-cleaning-tools\">4. Data cleaning tools<\/h2>\n<p>Now we\u2019ve covered the steps of the data cleaning process, it\u2019s clear that this is not a manual task. So, what tools might help? The answer depends on factors like the data you\u2019re working with and the systems you\u2019re using. But here are some baseline tools to get to grips with.<\/p>\n<h3 id=\"microsoft-excel\">Microsoft Excel<\/h3>\n<p>MS Excel has been a staple of computing since its launch in 1985. Love it or loathe it, it remains a popular data-cleaning tool to this day. Excel comes with many inbuilt functions to automate the data cleaning process, from deduping to replacing numbers and text, shaping columns and rows, or <a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/how-to-use-concatenate-function-in-excel\/\">combining data from multiple cells<\/a>. It\u2019s also relatively easy to learn, making it the first port of call for most new data analysts.<\/p>\n<h3 id=\"programming-languages\">Programming languages<\/h3>\n<p>Often, data cleaning is carried out using scripts that automate the process. This is essentially what Excel can do, using pre-existing functions. However, carrying out specific batch processing (running tasks without end-user interaction) on large, complex datasets often means writing scripts yourself.<\/p>\n<p>This is usually done with programming languages like <a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/what-is-python\/\">Python<\/a>, Ruby, <a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/sql-cheat-sheet\/\">SQL<\/a>, or\u2014if you\u2019re a real coding whizz\u2014R (which is more complex, but also more versatile). 
While more experienced data analysts may code these scripts from scratch, many ready-made libraries exist. Python, in particular, has a tonne of data cleaning libraries that can speed up the process for you, such as\u00a0<a href=\"https:\/\/pandas.pydata.org\/\" rel=\"noopener\">Pandas<\/a> and\u00a0<a href=\"https:\/\/numpy.org\/\" rel=\"noopener\">NumPy<\/a>.<\/p>\n<h3 id=\"visualizations\">Visualizations<\/h3>\n<p>Using data visualizations can be a great way of spotting errors in your dataset. For instance, a bar plot is excellent for visualizing unique values and might help you spot a category that has been labeled in multiple different ways (like our earlier example of \u2018Iron\u2019 and \u2018Fe\u2019). Likewise, scatter graphs can help spot outliers so that you can investigate them more closely (and remove them if needed).<em>\u00a0<\/em><\/p>\n<h3 id=\"proprietary-software\">Proprietary software<\/h3>\n<p>Many companies are cashing in on the data analytics boom with proprietary software. Much of this software is aimed at making data cleaning more straightforward for non-data-savvy users. Since there are tonnes of applications out there (many of which are tailored to different industries and tasks) we won\u2019t list them here. But we encourage you to go and see what\u2019s available. To get you started, play around with some of the free, open-source tools. Popular ones include <a href=\"https:\/\/openrefine.org\/\" rel=\"noopener\">OpenRefine<\/a> and <a href=\"https:\/\/www.trifacta.com\/start-wrangling\/\" rel=\"noopener\">Trifacta<\/a>.<\/p>\n<p>You\u2019ll find <a href=\"\/en\/blog\/data-analytics\/best-data-cleaning-tools\/\">a more thorough comparison of some of the best data cleaning tools in this guide<\/a>.<\/p>\n<h2 id=\"final-thoughts\">Final thoughts<\/h2>\n<p>Data cleaning is probably the most important part of the data analytics process. 
Good data hygiene isn\u2019t just about data analytics, though; it\u2019s good practice to maintain and regularly update your data anyway. Clean data is a core tenet of data analytics and the field of data science more generally.<\/p>\n<p>In this post, we\u2019ve learned that:<\/p>\n<ul>\n<li><strong>Clean data is hugely important for data analytics:<\/strong> Using dirty data will lead to flawed insights. As the saying goes: \u2018Garbage in, garbage out.\u2019<\/li>\n<li><strong>Data cleaning is time-consuming:<\/strong> With great importance comes great time investment. Data analysts spend anywhere from 60-80% of their time cleaning data.<\/li>\n<li><strong>Data cleaning is a complex process:<\/strong> Data cleaning means removing unwanted observations, outliers, fixing structural errors, standardizing, dealing with missing information, and validating your results. This is not a quick or manual task!<\/li>\n<li><strong>There are tools out there to help you:<\/strong> Fear not, tools like MS Excel and programming languages like Python are there to help you clean your data. There are also many proprietary software tools available.<\/li>\n<\/ul>\n<p>Why not try your hand at data analytics with our\u00a0<a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/short-courses\/become-a-data-analyst\/\">free, five-day data analytics short course<\/a>? Alternatively, read the following to find out more:<\/p>\n<ul>\n<li><a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/different-types-of-data-analysis\/\">What are the different types of data analysis?<\/a><\/li>\n<li><a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/difference-between-quantitative-and-qualitative-data\/\">Quantitative vs. 
qualitative data: What\u2019s the difference?<\/a><\/li>\n<li><a href=\"https:\/\/careerfoundry.inbearbeitung.de\/en\/blog\/data-analytics\/data-analysis-techniques\/\">The 7 most useful data analytics methods and techniques<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Effective data cleaning is a vital part of the data analytics process. But what is it, why is it important, and how do you do it? Find out in this guide.<\/p>\n","protected":false},"author":101,"featured_media":360,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_lmt_disableupdate":"yes","_lmt_disable":"","footnotes":""},"categories":[3],"tags":[],"class_list":["post-3773","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-analytics"],"acf":{"homepage_category_featured":false},"modified_by":"Matthew Deery","_links":{"self":[{"href":"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-json\/wp\/v2\/posts\/3773","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-json\/wp\/v2\/users\/101"}],"replies":[{"embeddable":true,"href":"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-json\/wp\/v2\/comments?post=3773"}],"version-history":[{"count":3,"href":"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-json\/wp\/v2\/posts\/3773\/revisions"}],"predecessor-version":[{"id":28938,"href":"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-json\/wp\/v2\/posts\/3773\/revisions\/28938"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-json\/wp\/v2\/media\/360"}],"wp:attachment":[{"href":"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-json\/wp\/v2\/media?parent=3773"}],"wp:ter
m":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-json\/wp\/v2\/categories?post=3773"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/careerfoundry.inbearbeitung.de\/en\/wp-json\/wp\/v2\/tags?post=3773"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}