Spreadsheets are still in style
The use of electronic spreadsheets date back to 1979 and are still one of the most popular ways to review and manipulate data. Today, Microsoft’s ubiquitous spreadsheet program Excel has over 750 million users and is used in some of the largest companies worldwide. I personally use Excel almost every day because it can sometimes lead to quicker results than spinning up Python or loading the data into a database.
One of the reasons Excel is so popular is because it is jam-packed with features and functions that can be used to clean, aggregate, pivot, and graph data. In this article, we’ll go over the 10 features and functions for using data analysis in Excel I think every analyst needs to know:
- Pivot tables and pivot charts
- Conditional formatting
- Remove duplicates
- DAYS and NETWORKDAYS
You can click any of the features to skip ahead to them.
Before we get started…
In order to show the power of data analysis in Excel, we need some data to play around with and graph. I am using the Customer Personality Analysis dataset from Kaggle in these examples. You can download it to follow along.
1. Pivot tables and pivot charts
One of my favorite features in Excel is pivot charts and, as a close follow-up feature, pivot tables. Pivot charts visualize the data expressed in a pivot table, giving us insight at a glance. Pivot tables provide a simple approach to reformatting columns and rows, transforming them into groupings, statistics, or summaries. We can create a chart along with the table using the PivotChart feature under Insert.
Let’s say we wanted to count the records grouped by Marital_Status. Using a pivot table makes that super simple, because it transforms the data and aggregates it for us.
To create a pivot chart and pivot table, first select the range of data you want to include then click Insert > PivotChart > PivotChart & PivotTable. The Create PivotTable editor will pop up.
The selected range will appear in the Table/Range field by default. Click OK and the pivot table will generate.
In the PivotChart Fields, drag Marital_Status from the field list into the Axis (Categories) box. Then drag Marital_Status from the field list into the Values box. Since the Marital_Status data is a string, the Values aggregation should default to Count. If the data type were numeric, the aggregation defaults to Sum.
The pivot table and chart should populate. You can add additional dimensions or filters by dragging new fields into the corresponding box. It only takes a few clicks to aggregate data and visualize it in Excel, which is why it is such a widely-used tool.
2. Conditional formatting
Thinking about it, I probably use conditional formatting more than any other feature in Excel. Conditional formatting allows you to highlight or hide cells based on a rule you specify. Apply the rules to one cell or multiple cells in the same worksheet. It is useful for highlighting outliers, duplicates, or patterns in data.
Let’s say we want to highlight all Year_Birth values greater than 1987 in the dataset.
Select the Year_Birth column and click Conditional Formatting > Highlight Cells Rules > Greater Than. The editor will pop up:
Enter 1987 and click OK. The cells in the column with a value greater than 1987 will turn light red.
If we decide we want to adjust the conditional formatting rule we just created, follow path Conditional Formatting > Conditional Formatting Rules Manager.
From the manager we are able to create new rules or edit existing ones. It is possible to have multiple rules affecting the spreadsheet.
3. Remove duplicates
Data is often messy, so it is important that you know how to remove duplicates. Using conditional formatting rules, you can highlight the duplicate data to review it before deleting it. The Remove Duplicates feature is available under Data > Data Tools > Remove Duplicates.
Highlight the dataset in Excel and click Remove Duplicates. The Remove Duplicates editor appears:
The editor allows us to select columns that should be included when deleting the duplicates. Make sure the My data has headers checkbox is marked if the column names are not displaying.
Click OK. The duplicates will be dropped from the dataset. It will tell us how many unique values remain.
XLOOKUP is like a combination of VLOOKUP and HLOOKUP, since it can go either vertically or horizontally to lookup a value from a range. It essentially allows you to use a selected range as a lookup table and return a “looked up” result to a cell. The syntax is as follows:
Let’s say we want to look up the Year_Birth based on an entered ID value. In cell AD2, enter an ID value—for example, 8755. In cell AE2, enter the XLOOKUP formula:
The lookup_value is the value we want looked up by the function, so we enter AD2.
The lookup_array is a column or row that contains the lookup value, so we enter A2:A2241 since that will give us an array of IDs.
The return_array is the column or row that contains the value we want to return, so we select B2:B2241 since that will give us Year_Birth values.
The completed formula will look like this: =XLOOKUP(AD2, A2:A2241, B2:B2241)
Enter in different IDs and the corresponding Year_Birth will return.
Lookup functions are very powerful, and you can even join data from different sheets or nest lookup functions within each other. For example, you could sum the value of multiple lookups.
The IFERROR function is used to create a custom error message when a formula results in an error. For example, we can use it to wrap our XLOOKUP function so it returns a clear message if the ID isn’t found. The syntax is relatively simple.
Going back to the XLOOKUP function, if we enter an ID in to AD2 that doesn’t exist in the lookup array, AE2 outputs #N/A. Instead, let’s return “ID Not Found.” For the value in the IFERROR function, use the XLOOKUP function. It should look like this:
=IFERROR(XLOOKUP(AD2, A2:A2241, B2:B2241), “ID Not Found”)
In addition to text, you can point the value_if_error at another cell too. If you target a blank cell as the value for value_if_error, 0 will appear in the cell.
Similar to the lookup functions, MATCH can be used when you need the position of a value in a range instead of the value itself.
This is the syntax for MATCH:
=MATCH(lookup_value, lookup_array, [match_type])
When writing the function, it is important to know which match type to use. Although it is optional and defaults to 1, the available options are -1, 0, or 1.
- -1: Finds the smallest value that is greater than or equal to lookup_value.
- 0: Finds the first value that is exactly equal to lookup_value.
- 1: Finds the largest value that is less than or equal to lookup_value.
If we want to find the first time the looked up birth year occurs, add a the following MATCH formula to cell AF2:
The COUNTBLANK function is fairly intuitive, but it is an important function for data wrangling in analytics because many machine learning algorithms are sensitive to nulls. By knowing how many values are null, you have a better understanding of how to approach them. For example, if a lot of values are null you should drop the column. If few values are null you should impute a value to fill the null. COUNTBLANK counts the number of empty cells in a range. The syntax is simple:
We want to count the number of nulls in the Income column, so add this formula to cell AG2:
8. DAYS and NETWORK DAYS
DAYS and NETWORKDAYS are separate functions, but they are similar enough to not warrant individual entries in my list. The DAYS function simply returns the number of days between two dates, whereas the NETWORKDAYS function is slightly different in that it excludes weekends and specified holidays. It only returns the number of working days between two dates. The syntax for both is easy to remember:
=NETWORKDAYS(start_date, end_date, [holidays])
At my job, one of the things I analyze is usage data. I use these functions for things like counting the number of days since a user has logged in or used a tool. Since our software is used in schools, sometimes it makes sense to exclude weekends and holidays from our usage data so I’m thankful for the NETWORKDAYS function.
The RANK function orders a number by its size relative to other values in a list and returns the desired rank. That means the rank of the number would be its position if the list becomes sorted by ascending or descending order. For example, sort Income in descending order so the biggest value is at the top and that could be rank 1. RANK gives duplicate numbers the same rank, but cumulatively counts. That means if two values are rank four, the next rank will be six, not five (1,2,3,4,4,6). The syntax for RANK takes 3 arguments.
=RANK(number, ref, [order])
It is important to note that [order] can be set to 0 for descending and 1 (or greater) for ascending.
We want to rank the income (column E) of our customers, so we will use the following formula in cell AH2:
=RANK(E2, E2:E2241, 0)
In the bottom right corner of cell AH2, click the square and drag it down to the last row of data and the formula will automatically copy allowing us to quickly generate a rank for each income value.
The last function on my top 10 list is SUMPRODUCT. It is great when you need to do arithmetic on specific groups of values. It can be used to add, subtract, multiply or divide the selected numerical values for corresponding ranges. Although it sounds complicated, the logic is fairly intuitive once you try it. The syntax, however, is not very intuitive.
=SUMPRODUCT(array1, [array2], [array3], …)
We want to sum the KidHome for all rows where Education equals Graduation (C2) and Marital_Status equals Single (D2). We will add this formula to cell AI:
If we wanted to see a different Education value, PhD for example, we could change C2 to C6 in the formula.
Summary and next steps
This is by no means an exhaustive list of every feature and function Excel offers, but these 10 functions help me analyze and clean data without the hassle of booting up Python or loading the data in a SQL server. Pivot charts are great for small data sets, especially if you need to share the data with non-technical people. Even though Excel seems like a cumbersome tool in a world of big data, it remains one of the most popular data analytics tools on the planet.
To learn more about data analytics, try out this free 5-day short course, or check out some of our other data analytics articles: