What is secondary data analysis? How do you carry it out? Find out in this post.
Historically, the only way data analysts could obtain data was to collect it themselves. This type of data is often referred to as primary data and is still a vital resource for data analysts.
However, technological advances over the last few decades mean that much past data is now readily available online for data analysts and researchers to access and utilize. This type of data—known as secondary data—is driving a revolution in data analytics and data science.
Primary and secondary data share many characteristics. However, there are some fundamental differences in how you prepare and analyze secondary data. This post explores the unique aspects of secondary data analysis. We’ll briefly review what secondary data is before outlining how to source, collect and validate them. We’ll cover:
- What is secondary data analysis?
- How to carry out secondary data analysis (5 steps)
- Summary and further reading
Ready for a crash course in secondary data analysis? Let’s go!
1. What is secondary data analysis?
Secondary data analysis uses data collected by somebody else. This contrasts with primary data analysis, which involves a researcher collecting predefined data to answer a specific question. Secondary data analysis has numerous benefits, not least that it is a time- and cost-effective way of obtaining data without doing the research yourself.
It’s worth noting here that secondary data may be primary data for the original researcher. It only becomes secondary data when it’s repurposed for a new task. As a result, a dataset can simultaneously be a primary data source for one researcher and a secondary data source for another. So don’t panic if you get confused! We explain exactly what secondary data is in this guide.
In reality, the statistical techniques used to carry out secondary data analysis are no different from those used to analyze other kinds of data. The main differences lie in collection and preparation. Once the data have been reviewed and prepared, the analytics process continues more or less as it usually does. For a recap on what the data analysis process involves, read this post.
In the following sections, we’ll focus specifically on the preparation of secondary data for analysis. Where appropriate, we’ll refer to primary data analysis for comparison.
2. How to carry out secondary data analysis
Step 1: Define a research topic
The first step in any data analytics project is defining your goal. This is true regardless of the data you’re working with, or the type of analysis you want to carry out. In data analytics lingo, this typically involves defining:
- A statement of purpose
- Research design
Defining a statement of purpose and a research approach are both fundamental building blocks for any project. However, for secondary data analysis, the process of defining these differs slightly. Let’s find out how.
Step 2: Establish your statement of purpose
Before beginning any data analytics project, you should always have a clearly defined intent. This is called a ‘statement of purpose.’ A healthcare analyst’s statement of purpose, for example, might be: ‘Reduce admissions for mental health issues relating to Covid-19.’ The more specific the statement of purpose, the easier it is to determine which data to collect, analyze, and draw insights from.
A statement of purpose is helpful for both primary and secondary data analysis. It’s especially relevant for secondary data analysis, though. This is because there are vast amounts of secondary data available. Having a clear direction will keep you focused on the task at hand, saving you from becoming overwhelmed. Being selective with your data sources is key.
Step 3: Design your research process
After defining your statement of purpose, the next step is to design the research process. For primary data, this involves determining the types of data you want to collect (e.g. quantitative, qualitative, or both) and a methodology for gathering them.
For secondary data analysis, however, your research process will more likely be a step-by-step guide outlining the types of data you require and a list of potential sources for gathering them. It may also include (realistic) expectations of the output of the final analysis. This should be based on a preliminary review of the data sources and their quality.
Once you have both your statement of purpose and research design, you’re in a far better position to narrow down potential sources of secondary data. You can then start with the next step of the process: data collection.
Step 4: Locate and collect your secondary data
Collecting primary data involves devising and executing a complex strategy that can be very time-consuming to manage. The data you collect, though, will be highly relevant to your research problem.
Secondary data collection, meanwhile, avoids the complexity of defining a research methodology. However, it comes with additional challenges. One of these is identifying where to find the data. This is no small task because there are a great many repositories of secondary data available. Your job, then, is to narrow down potential sources. As already mentioned, it’s necessary to be selective, or else you risk becoming overloaded.
Some popular sources of secondary data include:
- Government statistics, e.g. demographic data, censuses, or surveys, collected by government agencies/departments (like the US Bureau of Labor Statistics).
- Technical reports summarizing completed or ongoing research from educational or public institutions (colleges or government).
- Scientific journals that outline research methodologies and data analysis by experts in fields like the sciences, medicine, etc.
- Literature reviews of research articles, books, and reports, for a given area of study (once again, carried out by experts in the field).
- Trade/industry publications, e.g. articles and data shared in trade publications, covering topics relating to specific industry sectors, such as tech or manufacturing.
- Online resources: Repositories, databases, and other reference libraries with public or paid access to secondary data sources.
Once you’ve identified appropriate sources, you can go about collecting the necessary data. This may involve contacting other researchers, paying a fee to an organization in exchange for a dataset, or simply downloading a dataset for free online.
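Once a dataset is downloaded, a sensible first move is simply to load it and take a quick look at its shape and contents before committing to it. Here’s a minimal sketch using pandas; the inline CSV (with hypothetical regions and admission counts) stands in for the file you’d actually download, which you would pass to `read_csv` by path instead.

```python
import io

import pandas as pd

# Stand-in for a downloaded secondary dataset (hypothetical values).
csv_text = """region,year,admissions
North,2020,1450
South,2020,1320
North,2021,1610
South,2021,1275
"""

# In practice you'd pass the downloaded file's path to read_csv.
df = pd.read_csv(io.StringIO(csv_text))

# A first look: how many rows and columns, which columns, sample rows.
print(df.shape)             # (4, 3)
print(df.columns.tolist())  # ['region', 'year', 'admissions']
print(df.head())
```

Even this quick inspection can tell you whether the dataset covers the regions, time frame, and variables your statement of purpose calls for.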
Step 5: Evaluate your secondary data
Secondary data is usually well-structured, so you might assume that once you have your hands on a dataset, you’re ready to dive in with a detailed analysis. Unfortunately, that’s not the case!
First, you must carry out a careful review of the data. Why? To ensure that they’re appropriate for your needs. This involves two main tasks:
- Evaluating the secondary dataset’s relevance
- Assessing its broader credibility
Both these tasks require critical thinking skills. However, they aren’t heavily technical. This means anybody can learn to carry them out.
Let’s now take a look at each in a bit more detail.
Evaluating the secondary dataset’s relevance
The main point of evaluating a secondary dataset is to see if it is suitable for your needs. This involves asking some probing questions about the data, including:
What was the data’s original purpose?
Understanding why the data were originally collected will tell you a lot about their suitability for your current project. For instance, was the project carried out by a government agency or a private company for marketing purposes? The answer may provide useful information about the population sample, the data demographics, and even the wording of specific survey questions. All this can help you determine if the data are right for you, or if they are biased in any way.
When and where were the data collected?
Over time, populations and demographics change. Identifying when the data were first collected can provide invaluable insights. For instance, a dataset that initially seems suited to your needs may be out of date.
On the flip side, you might want past data so you can draw a comparison with a present dataset. In this case, you’ll need to ensure the data were collected during the appropriate time frame. It’s worth mentioning that secondary data are the sole source of past data. You cannot collect historical data using primary data collection techniques.
Similarly, you should ask where the data were collected. Do they represent the geographical region you require? Does geography even have an impact on the problem you are trying to solve?
What data were collected and how?
A final report from a past analytics project is great for summarizing key characteristics or findings. However, if you’re planning to use those data for a new project, you’ll need the original documentation. At the very least, this should include access to the raw data and an outline of the methodology used to gather them. This can be helpful for many reasons. For instance, you may find raw data that weren’t relevant to the original analysis, but which might benefit your current task.
What questions were participants asked?
We’ve already touched on this, but the wording of survey questions—especially for qualitative datasets—is significant. Questions may deliberately be phrased to preclude certain answers. A question’s context may also impact the findings in a way that’s not immediately obvious. Understanding these issues will shape how you perceive the data.
What is the form/shape/structure of the data?
Finally, to practical issues. Is the structure of the data suitable for your needs? Is it compatible with other sources or with your preferred analytics approach? This is purely a structural issue. For instance, if a dataset of people’s ages is saved as categorical rather than numerical variables, this could potentially impact your analysis. In general, reviewing a dataset’s structure helps you better understand how the data are categorized, allowing you to account for any discrepancies. You may also need to tidy the data to ensure they are consistent with any other sources you’re using.
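A structural check like this takes only a few lines. The sketch below assumes pandas and a hypothetical dataset where ages arrived as text rather than numbers, a common mismatch when repurposing someone else’s data:

```python
import pandas as pd

# Hypothetical secondary dataset: ages stored as text, not numbers.
df = pd.DataFrame({"age": ["34", "52", "28", "41"]})

# Inspect the structure before analyzing.
print(df.dtypes)  # 'age' is dtype 'object' (text), not numeric

# Convert to a numeric type so statistical functions behave as expected;
# errors="coerce" turns unparseable entries into NaN rather than failing.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

print(df["age"].mean())  # 38.75
```

Catching type mismatches like this early saves you from subtle errors later, such as averages computed over string-sorted values or merges that silently fail.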
This is just a sample of the types of questions you need to consider when reviewing a secondary data source. The answers will have a clear impact on whether the dataset—no matter how well presented or structured it seems—is suitable for your needs.
Assessing secondary data’s credibility
After identifying a potentially suitable dataset, you must double-check the credibility of the data. Namely, are the data accurate and unbiased? To figure this out, here are some key questions you might want to ask:
What are the credentials of those who carried out the original research?
Do you have access to the details of the original researchers? What are their credentials? Where did they study? Are they an expert in the field or a newcomer? Data collection by an undergraduate student, for example, may not be as rigorous as that of a seasoned professor.
And did the original researcher work for a reputable organization? What other affiliations do they have? For instance, if a researcher who works for a tobacco company gathers data on the effects of vaping, this represents an obvious conflict of interest! Questions like this help determine how thorough or qualified the researchers are and if they have any potential biases.
Do you have access to the full methodology?
Does the dataset include a clear methodology, explaining in detail how the data were collected? This should be more than a simple overview; it must be a clear breakdown of the process, including justifications for the approach taken. This allows you to determine if the methodology was sound. If you find flaws (or no methodology at all) it throws the quality of the data into question.
How consistent are the data with other sources?
Do the secondary data match similar findings from other sources? If not, that doesn’t necessarily mean the data are wrong, but it does warrant closer inspection. Perhaps the collection methodology differed between sources, or maybe the data were analyzed using different statistical techniques. Or perhaps unaccounted-for outliers are skewing the analysis. Identifying all these potential problems is essential. A flawed or biased dataset can still be useful but only if you know where its shortcomings lie.
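A basic version of this consistency check can be automated. The sketch below, using pandas and two hypothetical series that are meant to measure the same thing, compares summary statistics and flags outliers using the common 1.5 × IQR rule of thumb:

```python
import pandas as pd

# Two hypothetical secondary sources reporting the same measurement.
source_a = pd.Series([102, 98, 101, 99, 103, 100])
source_b = pd.Series([101, 97, 100, 250, 102, 99])  # 250 looks suspect

# Sanity check: do the summary statistics broadly agree?
print(source_a.mean(), source_b.mean())

# Flag values outside 1.5 * IQR -- a common rule of thumb for outliers.
q1, q3 = source_b.quantile([0.25, 0.75])
iqr = q3 - q1
mask = (source_b < q1 - 1.5 * iqr) | (source_b > q3 + 1.5 * iqr)
print(source_b[mask].tolist())  # [250]
```

A flagged value isn’t automatically wrong, but it tells you exactly which entries deserve a closer look before you trust the dataset.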
Have the data been published in any credible research journals?
Finally, have the data been used in well-known studies or published in any journals? If so, how reputable are the journals? In general, you can judge a dataset’s quality based on where it has been published. If in doubt, check out the publication in question on the Directory of Open Access Journals. The directory has a rigorous vetting process, only permitting journals of the highest quality. Meanwhile, if you found the data via a blurry image on social media without cited sources, then you can justifiably question its quality!
Again, these are just a few of the questions you might ask when determining the quality of a secondary dataset. Consider them as scaffolding for cultivating a critical thinking mindset; a necessary trait for any data analyst!
Presuming your secondary data holds up to scrutiny, you should be ready to carry out your detailed statistical analysis. As we explained at the beginning of this post, the analytical techniques used for secondary data analysis are no different from those for any other kind of data. Rather than go into detail here, check out the different types of data analysis in this post.
3. Secondary data analysis: Key takeaways
In this post, we’ve looked at the nuances of secondary data analysis, including how to source, collect and review secondary data. As discussed, much of the process is the same as it is for primary data analysis. The main difference lies in how secondary data are prepared.
Carrying out a meaningful secondary data analysis involves spending time and effort exploring, collecting, and reviewing the original data. This will help you determine whether the data are suitable for your needs and if they are of good quality.
Why not get to know more about what data analytics involves with this free, five-day introductory data analytics short course? And, for more data insights, check out these posts: