Data quality underpins many key aspects of data analytics, including data cleaning. Read on to find out what data quality is and why it’s so important.
They say that a bad worker shouldn’t blame their tools. The exception to this rule might be data analytics. When your main tool is data, it’s vital that it’s of high enough quality to help you complete the task at hand. Carrying out a financial analysis? Measuring customer sentiment on a new product? If your data quality is poor, your insights won’t be of much use.
This underscores the importance of high-quality data. The difficulty is that data quality isn’t easy to measure: while there are objective criteria to meet, assessing it requires judgment, too. In this post, we’ll explore the topic in more detail. By the end, you should understand what data quality is and how it works.
- What is data quality (and why is it important)?
- How do you measure data quality?
- Five data quality best practices
- Data quality in summary
Let’s start by defining what “data quality” actually means.
1. What is data quality (and why is it important)?
When learning about data analytics, you may have come across the term “data quality.” But what does “data quality” mean?
Data quality is how we describe the state of any given dataset. It measures objective elements such as completeness, accuracy, and consistency. But it also measures more subjective factors, such as how well-suited a dataset is to a particular task. This subjective aspect makes determining data quality challenging at times. Even so, data quality is an important concept that underpins both data analytics and data science.
If data quality is high, you can use a dataset for its intended purpose. This might be to make key spending decisions, improve operations, or inform future growth. Yet if data quality is low, all these areas are negatively affected. You might spend money on the wrong things. Operations may become more cumbersome. Your future plan could sink the business. These are extreme examples but they highlight the importance of good data quality: not just in preparation for data analysis, but as general practice in your ongoing data governance.
One measure of data quality is how well it’s been cleaned (deduplicated, corrected, validated, and so on). But context is also an important factor. Datasets that are high quality for one task may be completely useless for another. They might lack key observations or be in a format that is useless for a different job. To reduce this gray area, we can determine data quality using several measures. Let’s cover these next.
2. How do you measure data quality?
As ever in data analytics, few problems have a straightforward solution. Measuring data quality is no exception! But this is what we love about the field—it’s always challenging us to think creatively.
Determining whether a dataset is of high quality involves looking at its characteristics and deciding whether it meets your, or your organization’s, needs. While there’s always wiggle room for what makes a dataset high quality, a good baseline is to look at the six characteristics of quality data. These are:

- Validity
- Accuracy
- Completeness
- Consistency
- Uniformity
- Relevance

Now let’s look at each in detail.
Validity is the degree to which a dataset conforms to a defined format or set of rules. These rules, or constraints, are easy to enforce with modern data capture systems, e.g. online forms. Since forms are a common source of data capture (one we’re all familiar with) let’s use them to highlight a few examples:
- Data type: In an online form, values must match the expected data type, e.g. numbers > numerical, true/false > Boolean, and so on.
- Range: Data must fall within a particular range. Ever tried putting a false year of birth into a form (e.g. 1700)? It will tell you this is invalid because it falls outside the accepted date range.
- Mandatory data: It’s happened to us all. You hit submit and the form comes back at you with an angry, red warning to say you can’t leave field ‘X’ empty. This is mandatory data. In online forms, it includes things like email addresses and customer ID numbers.
- Regular expression patterns: Sometimes data is invalid if it doesn’t follow defined conventions. For instance, dates of birth often need to be in the MM-DD-YYYY format. Phone numbers are also often considered invalid when they include a space.
Examples of data validity are wide-ranging, but this hopefully gives you a good overview. Today, data validity is often monitored using automated system rules (essentially, rules that tell you how a form can be filled out). However, it’s difficult to manage validity if data come from Excel spreadsheets or outdated systems (both of which often lack validity rules). Luckily, it’s usually possible to improve data validity via data cleaning.
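To make these rules concrete, here’s a minimal sketch in Python of how a form’s validity checks might look. The field names, the accepted year range, and the rules themselves are illustrative assumptions, not taken from any real system:

```python
import re

def validate_record(record):
    """Apply illustrative validity rules to one form record.
    Returns a list of violations (empty means the record is valid)."""
    errors = []

    # Data type: year_of_birth must be a number (an integer here)
    yob = record.get("year_of_birth")
    if not isinstance(yob, int):
        errors.append("year_of_birth must be a number")
    # Range: a year like 1700 falls outside the accepted window
    elif not 1900 <= yob <= 2025:
        errors.append("year_of_birth outside accepted range")

    # Mandatory data: email may not be left empty
    if not record.get("email"):
        errors.append("email is required")

    # Regular expression pattern: dates of birth must follow MM-DD-YYYY
    if not re.fullmatch(r"\d{2}-\d{2}-\d{4}", record.get("dob", "")):
        errors.append("dob must be in MM-DD-YYYY format")

    return errors
```

In practice, validity rules like these are enforced by the data capture system itself, but the same logic can be replicated in a cleaning script when the source (say, an old spreadsheet) has no rules of its own.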
Accuracy is a simple measure of whether your data are correct. This could be anything from your date of birth to your bank balance, eye color, or geographical location. Data accuracy is important for the obvious reason that if data are incorrect, they’ll hurt the results of any analysis (and subsequent business decisions). Unfortunately, accuracy is often hard to measure, since there’s rarely an existing ‘gold standard’ dataset to test it against.
Data completeness is how exhaustive a dataset is. In short, do you have all the information needed to complete your task? Identifying an incomplete dataset isn’t always as easy as looking for empty cells. Let’s say you have a database of customer contact details, missing half the surnames. If you wanted to list the customers alphabetically, the dataset would be incomplete. But if your only aim was to analyze customer dialing codes to determine geographical locations, surnames wouldn’t matter. Like inaccurate data, incomplete data are challenging to fix. This is because it’s not always possible to infer missing data from what you already have.
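The idea that completeness depends on the task can be sketched in a few lines of Python. The customer records and field names below are invented purely for illustration:

```python
def completeness(records, required_fields):
    """Share of records where every task-relevant field is filled in."""
    complete = sum(
        1 for record in records
        if all(record.get(field) not in (None, "") for field in required_fields)
    )
    return complete / len(records)

# A contact list where half the surnames are missing
customers = [
    {"first": "Ada", "surname": "Lovelace", "dialing_code": "+44"},
    {"first": "Grace", "surname": "", "dialing_code": "+1"},
]

# Incomplete for an alphabetical listing by surname...
print(completeness(customers, ["surname"]))       # 0.5
# ...but complete for a dialing-code analysis
print(completeness(customers, ["dialing_code"]))  # 1.0
```

The same dataset scores differently depending on which fields the task actually requires, which is exactly why completeness can’t be judged in the abstract.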
Data consistency refers to whether your data match information from other sources. This determines its reliability. For instance, if you work at a doctor’s surgery, you may find patients with two phone numbers or postal addresses. The data here are inconsistent. It’s not always possible to return to the source, so determining data consistency requires smart thinking. You may be able to infer which data are correct by looking at the most recent entry, or by determining reliability in some other way.
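One way of “determining reliability in some other way” is to fall back on timestamps. The sketch below, using made-up patient data, keeps the most recently updated entry when records conflict:

```python
from datetime import date

# Two conflicting phone numbers recorded for the same patient
entries = [
    {"patient_id": 42, "phone": "555-0101", "updated": date(2021, 3, 1)},
    {"patient_id": 42, "phone": "555-0199", "updated": date(2023, 8, 15)},
]

# A simple heuristic: trust the entry with the most recent update date
latest = max(entries, key=lambda entry: entry["updated"])
print(latest["phone"])  # 555-0199
```

This is only a heuristic: the newest entry isn’t guaranteed to be the correct one, which is why consistency checks often still need human judgment.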
Data uniformity looks at units of measure, metrics, and so on. For instance, imagine you’re combining two datasets on people’s weight. One dataset uses the metric system, the other, imperial. For the data to be of any use during analysis, all the measurements must be uniform, i.e. all in kilograms or all in pounds. This means converting it all to a single unit. Luckily, this aspect of data quality is easier to manage. It doesn’t mean filling in gaps or determining accuracy…phew!
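Converting everything to a single unit is easy to script. Here’s a small sketch for the weight example, using an approximate kilograms-to-pounds conversion factor:

```python
POUNDS_PER_KILOGRAM = 2.2046226218  # approximate conversion factor

def to_kilograms(weight, unit):
    """Normalize a weight reading to kilograms."""
    if unit == "kg":
        return weight
    if unit == "lb":
        return weight / POUNDS_PER_KILOGRAM
    raise ValueError(f"unknown unit: {unit!r}")

# Mixed readings from a metric dataset and an imperial one
readings = [(70.0, "kg"), (154.3, "lb")]
uniform = [round(to_kilograms(weight, unit), 1) for weight, unit in readings]
print(uniform)  # [70.0, 70.0]
```

Raising an error on unknown units is a deliberate choice: silently passing an unrecognized measurement through would reintroduce exactly the non-uniformity you’re trying to remove.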
Data relevance is a more subjective measure of data quality. It looks at whether data is sufficiently complete, uniform, consistent (and so on) to fulfill its given task. Another aspect of data relevance, though, is timeliness. Is the data available when you need it? Is it accessible to everyone who requires it? For instance, if you’re reporting to the Board with quarterly profits and losses, you need the most up-to-date information. With only the previous quarter’s figures, you’ll have lower-quality data and can therefore only offer lower-quality insights.
3. Five data quality best practices
As we’ve covered above, there are many measures for determining data quality. One thing we can agree on, though, is that high-quality datasets are those which are fit for their intended purpose, whether in operations, decision-making, or for future business planning. As a new data analyst, here are five best practices for ensuring high-quality data.
1. Good data governance
The best way to ensure high data quality is an effective data governance framework. Good data governance includes carefully crafted data policies and standards, shaped with input from senior management and other stakeholders. Ideally, this should be overseen by a data governance committee. A solid governance framework brings a sense of order to data management. Ensuring data quality then becomes a process rather than a standalone job. It also makes it easier to spot when data quality deviates from the agreed standards. You might want to introduce a data quality log to highlight, track, and resolve any issues.
2. Regular data cleaning
Maintaining, or ‘cleaning,’ data is an important aspect of the data analytics process. But effective data cleaning isn’t only something you should do before carrying out an analysis. As part of business ‘housekeeping’, you should clean data regularly. This means removing unwanted observations and outliers, fixing structural errors, tackling missing data, and so on. This process helps identify problems as they arise. It also helps deduce better ways of storing and collecting your data, improving its overall quality from the start. You’ll find a step-by-step guide to data cleaning here.
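As a sketch of what one pass of regular cleaning might look like (the field names and rules here are illustrative, not a prescribed recipe), this small Python routine fixes structural errors, drops incomplete rows, and removes duplicate observations:

```python
def clean(records):
    """A minimal cleaning pass over customer records."""
    seen, cleaned = set(), []
    for record in records:
        # Fix structural errors: stray whitespace, inconsistent casing
        name = (record.get("name") or "").strip().title()
        email = (record.get("email") or "").strip().lower()
        # Tackle missing data: here we simply drop incomplete rows
        if not name or not email:
            continue
        # Remove duplicate observations
        key = (name, email)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "email": email})
    return cleaned
```

Note that dropping rows is just one way of tackling missing data; depending on the task, imputing values or going back to the source may be better.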
3. Data profiling
For any data analyst responsible for quality management, data profiling is a vital part of the role. Data profiling involves looking at data sources to collect statistics or insights. This might sound similar to data analysis, but it’s not quite the same. Whereas data analysis draws insights to inform, say, business operations, data profiling looks at data on a deeper, structural level (in isolation from its intended uses). Data profiling informs data quality in many ways. You might use it to identify flawed capture techniques, or it could help you determine if data quality is high enough for a given task, before importing it into a database.
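A first pass at data profiling can be as simple as computing column-level statistics, structural facts about the data rather than business insights. This sketch works on a list of dictionaries; dedicated profiling tools go much further:

```python
def profile(records):
    """Column-level profile: non-null count, missing rate,
    and number of distinct values per column."""
    columns = {key for record in records for key in record}
    stats = {}
    for column in sorted(columns):
        values = [record.get(column) for record in records]
        present = [v for v in values if v not in (None, "")]
        stats[column] = {
            "non_null": len(present),
            "missing_rate": round(1 - len(present) / len(records), 2),
            "distinct": len(set(present)),
        }
    return stats
```

Statistics like these can flag a flawed capture technique (say, a column that is 90% empty) before the data ever reaches a database or an analysis.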
4. Cross-departmental support
Structure and processes are vital for ensuring high-quality data. But one thing many people overlook is input from across the business. Data analysts are often seen as data ‘gatekeepers’. While this can be true, data quality is subjective enough that it needs at least some input from colleagues outside the data team. Input from managers will help you to create key performance indicators (KPIs) that meet everybody’s needs. Most businesses pool many data sources, too. As such, input from senior management helps solve issues as they arise. The takeaway here? Communication is key.
5. Data quality reporting
You should log all the above activities and regularly report on them. This helps measure data quality KPIs and will shape your data quality issue log. This is vital for documenting and (hopefully!) solving problems. Data quality reporting also helps identify common themes relating to how you collect, store, and process your data. Reports also provide transparency about data quality governance. This helps build trust between team members. A final bonus is that you can provide these reports as interactive dashboards or visualizations. Never turn down a chance to flex your other data analytics skills!
4. In summary
In this post, we’ve looked at what data quality is and why it’s important. We’ve learned that:
- Data quality can be measured objectively (is it free from errors, typos, mistakes, etc.?) and subjectively (is it suitable for its intended task?)
- Data quality is key to data analytics and is particularly important for data cleaning.
- We usually explore data quality via six characteristics: Validity, accuracy, completeness, consistency, uniformity, and relevance.
- Data quality best practice includes implementing a governance framework, data cleaning, data profiling, fostering management support, and regular reporting.
Data quality is one small part of the fascinating, interconnected web that is data analytics. From data scraping to data cleaning, carrying out analyses, and helping shape business insights, data analytics is a problem-solving field with plenty of career opportunities. Learn the basics with our free, five-day data analytics short course, or read the following to explore further: