Data wrangling is the transformation of raw data into a format that is easier to use. But what exactly does it involve? In this post, we find out.
Manipulation is at the core of data analytics. We don’t mean the sneaky kind, of course, but the data kind! Scraping data from the web, carrying out statistical analyses, creating dashboards and visualizations—all these tasks involve manipulating data in one way or another. But before we can do any of these things, we need to ensure that our data are in a format we can use. This is where the most important form of data manipulation comes in: data wrangling.
In this post, we explore data wrangling in detail. When you’ve finished reading, you’ll be able to answer:
- What is data wrangling (and why is it important)?
- Data wrangling vs. data cleaning: what’s the difference?
- What is the data wrangling process?
- What tools do data wranglers use?
1. What is data wrangling and why is it important?
Data wrangling is a term often used to describe the early stages of the data analytics process. It involves transforming and mapping data from one format into another. The aim is to make data more accessible for things like business analytics or machine learning. The data wrangling process can involve a variety of tasks. These include things like data collection, exploratory analysis, data cleansing, creating data structures, and storage.
Data wrangling is time-consuming. In fact, it can take up to about 80% of a data analyst’s time. This is partly because the process is fluid, i.e. there aren’t always clear steps to follow from start to finish. However, it’s also because the process is iterative and the activities involved are labor-intensive. What you need to do depends on things like the source (or sources) of the data, their quality, your organization’s data architecture, and what you intend to do with the data once you’ve finished wrangling it.
Why is data wrangling important?
Insights gained during the data wrangling process can be invaluable. They will likely affect the future course of a project. Skipping or rushing this step will result in poor data models that impact an organization’s decision-making and reputation. So, if you ever hear someone suggesting that data wrangling isn’t that important, you have our express permission to tell them otherwise!
Unfortunately, because data wrangling is sometimes poorly understood, its significance can be overlooked. High-level decision-makers who prefer quick results may be surprised by how long it takes to get data into a usable format. Unlike the results of data analysis (which often provide flashy and exciting insights), there’s little to show for your efforts during the data wrangling phase. And as businesses face budget and time pressures, this makes a data wrangler’s job all the more difficult. The job involves careful management of expectations, as well as technical know-how.
2. Data wrangling vs. data cleaning: what is the difference?
Some people use the terms ‘data wrangling’ and ‘data cleaning interchangeably. This is because they’re both tools for converting data into a more useful format. It’s also because they share some common attributes. But there are some important differences between them:
Data wrangling refers to the process of collecting raw data, cleaning it, mapping it, and storing it in a useful format. To confuse matters (and because data wrangling is not always well understood) the term is often used to describe each of these steps individually, as well as in combination.
- Data cleaning,meanwhile, is a single aspect of the data wrangling process. A complex process in itself, data cleaning involves sanitizing a data set by removing unwanted observations, outliers, fixing structural errors and typos, standardizing units of measure, validating, and so on. Data cleaning tends to follow more precise steps than data wrangling…albeit, not always in a very precise order! You can learn more about the data cleaning process in this post.
The distinction between data wrangling and data cleaning is not always clear-cut. However, you can generally think of data wrangling as an umbrella task. Data cleaning falls under this umbrella, alongside a range of other activities. These can involve planning which data you want to collect, scraping those data, carrying out exploratory analysis, cleansing and mapping the data, creating data structures, and storing the data for future use.
3. What is the data wrangling process?
The exact tasks required in data wrangling depend on what transformations you need to carry out to get a dataset into better shape. For instance, if your source data is already in a database, this will remove many of the structural tasks. But if it’s unstructured data (which is much more common) then you’ll have more to do.
The following steps are often applied during data wrangling. But the process is an iterative one. Some of the steps may not be necessary, others may need repeating, and they will rarely occur in the same order. But you still need to know what they all are!
Extracting the data
Not everybody considers data extraction part of the data wrangling process. But in our opinion, it’s a vital aspect of it. You can’t transform data without first collecting it. This stage requires planning. You’ll need to decide which data you need and where to collect them from. You’ll then pull the data in a raw format from its source. This could be a website, a third-party repository, or some other location. If it’s raw, unstructured data, roll your sleeves up, because there’s work to do! You can learn how to scrape data from the web in this post.
Carrying out exploratory data analysis (EDA)
EDA involves determining a dataset’s structure and summarizing its main features. Whether you do this immediately, or wait until later in the process, depends on the state of the dataset and how much work it requires. Ultimately, EDA means familiarizing yourself with the data so you know how to proceed. You can learn more about exploratory data analysis in this post.
Structuring the data
Freshly collected data are usually in an unstructured format. This means they lack an existing model and are completely disorganized. Unstructured data are often text-heavy but may contain things like ID codes, dates, numbers, and so on. To structure your dataset, you’ll usually need to parse it. In this context, parsing means extracting relevant information. For instance, you might parse HTML code scraped from a website, pulling out what you need and discarding the rest. The result might be a more user-friendly spreadsheet containing the useful data with columns, headings, classes, and so on.
Cleaning the data
Once your dataset has some structure, you can start applying algorithms to tidy it up. You can automate a range of algorithmic tasks using tools like Python and R. They can be used to identify outliers, delete duplicate values, standardize systems of measurement, and so on. You can learn about the data cleaning process in detail in this post.
Enriching the data
Once your dataset is in good shape, you’ll need to check if it’s ready to meet your requirements. At this stage, you may want to enrich it. Data enrichment involves combining your dataset with data from other sources. This might include internal systems or third-party providers. Your goal could be to accumulate a greater number of data points (to improve the accuracy of an analysis). Or it could simply be to fill in gaps…Say, by combining two databases of customer info where one contains telephone numbers, and the other doesn’t.
Validating the data
Validating your data means checking it for consistency, quality, and accuracy. We can do this using pre-programmed scripts that check the data’s attributes against defined rules. This is also a good example of an overlap between data wrangling and data cleaning—validation is key to both. Because you’ll likely find errors, you may need to repeat this step several times.
Publishing the data
Last but not least, it’s time to publish your data. This means making the data accessible by depositing them into a new database or architecture. End-users might include data analysts, engineers, or data scientists. They may use the data to create business reports and other insights. Or they might further process it to build more complex data structures, e.g. data warehouses. After this stage, the possibilities are endless!
4. What tools do data wranglers use?
Data wranglers use many of the same tools applied in data cleaning. These include programming languages like Python and R, software like MS Excel, and open-source data analytics platforms likeKNIME. Programming languages can be difficult to master but they are a vital skill for any data analyst. However, Python is not that difficult to learn and it allows you to write scripts for very specific tasks. We share some tips for learning Python in this post.
There are also visual data wrangling tools out there. The general aim of these is to make data wrangling easier for non-programmers and to speed up the process for experienced ones. Tools like Trifacta and OpenRefine can help you transform data into clean, well-structured formats.
A word of caution, though. While visual tools are more intuitive, they are sometimes less flexible. Because their functionality is more generic, so they don’t always work as well on complex datasets. As a rule, the larger and more unstructured a dataset, the less effective these tools will be. Beginners should aim to combine programming expertise (scripting) with proprietary tools (for high-level wrangling).
Data wrangling is vital to the early stages of the data analytics process. Before carrying out a detailed analysis, your data needs to be in a usable format. And that’s where data wrangling comes in. In this post, we’ve learned that:
- Data wrangling involves transforming and mapping data from a raw form into a more useful, structured format.
- Data wrangling can be used to prepare data for everything from business analytics to ingestion by machine learning algorithms.
- The terms ‘data wrangling’ and ‘data cleaning’ are often used interchangeably—but the latter is a subset of the former.
- While the data wrangling process is loosely defined, it involves tasks like data extraction, exploratory analyses, building data structures, cleaning, enriching, and validating; and storing data in a usable format.
- Data wranglers use a combination of visual tools like OpenRefine, Trifacta or KNIME, and programming tools like Python, R, and MS Excel.
The best way to learn about data wrangling is to dive in and have a go. For a hands-on introduction to some of these techniques, why not try out our free, five-day data analytics short course? To learn more about data analytics, check out the following: