Dabbling with data analytics? Presuming you’ve got the basics under your belt, the next step is to get hands-on with a practice project. Whether you’re figuring out your interests, seeking some sample projects to populate your portfolio, or simply wanting to play around, you’ll find lots of possible approaches.
In this post, we look at five of the best data analytics projects for beginners. The ideas follow the basic steps of the data analytics process, from data collection and cleaning to exploratory analysis and more.
We’ve focused on projects that can use Python, since it’s the most popular data analytics programming language (and also one that’s free to use and easy to learn). The Python Package Index hosts a vast collection of libraries of reusable code to help you on the journey. We’ve also recommended some other tools and links you might find helpful. We’ll cover web scraping, data wrangling, exploratory data analysis, data visualization, and machine learning.
Ready to cut your teeth on data? Then let’s get going!
1. Web scraping
As a beginner data analyst, you won’t have a specialism. However, there’s one thing you’ll need to get started: data. While datasets come from numerous sources, every analyst needs the ability to extract data from websites. Sure, you could do this manually, but why bother when you could create some code to do it for you? Enter web scraping.
What is web scraping?
Web scraping is an excellent way of automating repetitive data collection tasks, such as pulling reviews or product descriptions from e-commerce sites. You could collect these by hand, but that would take a long time, whereas some Python code can execute the task in a matter of minutes.
How does web scraping work?
Web scraping involves creating a looping algorithm. This crawls a website, finds selected elements in a site’s HTML code, then scrapes these data and exports them into a .txt file. If this project is for your portfolio, be sure to document each step you take. Some ideas include:
- Scrape a subreddit
- Scrape a review site, like Trustpilot or Tripadvisor
- Scrape financial data from Google, Yahoo! or Nasdaq
- Scrape a job site like Indeed or LinkedIn
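To make the idea concrete, here is a minimal sketch of the find-and-export steps described above. It assumes a made-up snippet of HTML (the class names review, rating, and text are invented for illustration) and uses only Python's standard library so it runs without installing anything. In a real project you would download the page first and would probably reach for a library such as BeautifulSoup or Scrapy instead.

```python
import tempfile
from html.parser import HTMLParser
from pathlib import Path

# Invented sample markup; a real project would download this from a live page.
SAMPLE_HTML = """
<ul>
  <li class="review"><span class="rating">5</span><p class="text">Fast delivery</p></li>
  <li class="review"><span class="rating">2</span><p class="text">Arrived damaged</p></li>
  <li class="review"><span class="rating">4</span><p class="text">Good value</p></li>
</ul>
"""


class ReviewParser(HTMLParser):
    """Collects the rating and text of every element marked class="review"."""

    def __init__(self):
        super().__init__()
        self.reviews = []    # one dict per review found
        self._field = None   # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "review":
            # A new review starts: add an empty record to fill in.
            self.reviews.append({"rating": "", "text": ""})
        elif css_class in ("rating", "text") and self.reviews:
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside a rating or text element.
        if self._field:
            self.reviews[-1][self._field] += data.strip()

    def handle_endtag(self, tag):
        self._field = None


parser = ReviewParser()
parser.feed(SAMPLE_HTML)  # the parser loops over every tag in the markup

# Export one tab-separated line per review to a .txt file.
lines = [f"{review['rating']}\t{review['text']}" for review in parser.reviews]
output_path = Path(tempfile.gettempdir()) / "reviews.txt"
output_path.write_text("\n".join(lines), encoding="utf-8")
print(f"Saved {len(lines)} reviews to {output_path}")
```

The file is written to your system's temporary folder purely to keep the example tidy; point output_path wherever suits your project.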
How can I get started?
There are lots of great step-by-step YouTube tutorials explaining how to scrape the web using Python. Since web scraping often spills over into other areas of expertise, there are a few additional skills you might need, such as HTML and XPath (a querying language). We found this excellent web scraping cheatsheet to get you started, though.
- Popular Python libraries for scraping the web: BeautifulSoup, Selenium, Scrapy
- Other web scraping tools: Parsehub, Octoparse
- GitHub: Trustpilot review scraper, Bloomberg stock data scraper
2. Data wrangling
Whether you’ve scraped data from the web or have downloaded a raw dataset from a free repository, the next most important skill for any data analyst is data wrangling. This helps put data into a structured state that makes it more suitable for extracting insights. Once again, you can automate many aspects of this process, making it an ideal data analytics project for beginners.
What is data wrangling?
Data wrangling (also known as data munging, and closely related to data cleaning) is the process of tidying, structuring, transforming, and storing raw data in a format, typically rows and columns, that is more valuable for data analysis. Without these steps, raw data is often messy, inconsistent, or unstructured, which makes it hard to use for analytics purposes.
How does data wrangling work?
Data wrangling involves various processes. For instance, you might need to merge two or more datasets, group and de-duplicate data, concatenate, fuzzy match, and more. These varied tasks make it a great data analytics project for beginners, since they require varying levels of expertise. You can also automate many of these tasks using Python and MS Excel. Learn more about data wrangling in this post.
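Here is a rough sketch of a few of those tasks (de-duplicating, merging, fuzzy matching, and grouping) on a tiny invented dataset. All the rows, names, and the 0.8 similarity cutoff are assumptions made for illustration. It sticks to the standard library so it runs anywhere; with pandas you would more likely use DataFrame.merge, drop_duplicates, and groupby for the same jobs.

```python
from difflib import get_close_matches

# Invented raw listings: inconsistent casing, a duplicate row, and a typo.
listings = [
    {"id": 1, "neighbourhood": "Brooklyn", "price": "120"},
    {"id": 2, "neighbourhood": "brooklyn ", "price": "95"},
    {"id": 2, "neighbourhood": "brooklyn ", "price": "95"},   # exact duplicate
    {"id": 3, "neighbourhood": "Manhatan", "price": "210"},   # misspelled
]

# A second invented dataset to merge in, keyed by listing id.
hosts = {1: "Ana", 2: "Ben", 3: "Chloe"}

# Reference spellings used for fuzzy matching.
KNOWN_NEIGHBOURHOODS = ["Brooklyn", "Manhattan", "Queens"]


def clean(row):
    """Tidy one row: fix the name, convert the price, attach the host."""
    name = row["neighbourhood"].strip().title()
    # Fuzzy match: pick the closest known spelling if it is similar enough.
    match = get_close_matches(name, KNOWN_NEIGHBOURHOODS, n=1, cutoff=0.8)
    return {
        "id": row["id"],
        "neighbourhood": match[0] if match else name,
        "price": float(row["price"]),
        "host": hosts.get(row["id"]),   # merge step: look up the host by id
    }


# De-duplicate by id, keeping the first occurrence of each listing.
seen_ids = set()
tidy = []
for row in listings:
    if row["id"] in seen_ids:
        continue
    seen_ids.add(row["id"])
    tidy.append(clean(row))

# Group by neighbourhood and compute the average price.
prices_by_area = {}
for row in tidy:
    prices_by_area.setdefault(row["neighbourhood"], []).append(row["price"])
average_price = {area: sum(p) / len(p) for area, p in prices_by_area.items()}

for row in tidy:
    print(row)
print(average_price)
```

Documenting each decision (why a duplicate was dropped, which cutoff you picked for fuzzy matching) is what turns a script like this into a portfolio piece.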
How can I get started?
The first thing you’ll need for this beginner data analytics project is a raw dataset to manipulate. Perhaps you’ve scraped it yourself. If so, great! If not, don’t worry—here are some example data sources you might want to try:
- New York City Airbnb open data
- American National Election Studies
- Amazon Web Services public datasets
Next up, check out some sample data-wrangling guides for beginners. The data science community, Kaggle, runs regular competitions and provides free access to the outcome code, an excellent tool for helping you get started. You can also find some great YouTube tutorials for data wrangling in Python, as well as some written guides.
- Popular Python libraries for data wrangling: pandas, NumPy, datacleaner
- Other data wrangling tools: OpenRefine, MS Power Query
- GitHub: Code repository for data wrangling with Python
3. Exploratory data analysis (EDA)
Often, exploratory data analysis (EDA) connects to data wrangling. But treat it as a standalone task and it can be a useful project for beginners. Practicing EDA will hone skills like data modeling and outlier detection, which you can apply in many other data contexts. It’s also your first real opportunity to start thinking critically as a data analyst.
What is exploratory data analysis?
Exploratory data analysis (EDA) is how we summarize a dataset’s main characteristics. EDA helps identify patterns or trends within data, providing you with your first real insights. These have two main objectives. Firstly, they help determine what further wrangling or cleaning is required. Secondly, they help shape your initial hypotheses and test early assumptions about what the data are telling you.
We go into more detail about exploratory data analysis in this article.
How does exploratory data analysis work?
There are two broad types of exploratory data analysis. The first is univariate analysis, which explores one variable at a time. The second is bivariate or multivariate analysis, which explores the relationship between two or more variables. We can further subdivide these categories into graphical and non-graphical analyses. The former includes the creation of visualizations such as scatter plots, or box and whisker diagrams. The latter tends to rely on tables and statistics. Whichever approach you take, you can carry out all of these tasks using Python. And since EDA shares a fair amount of crossover with data wrangling (for example, identifying missing values), you’ll likely get to practice two tasks in one!
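As a small non-graphical illustration, the sketch below counts missing values, summarizes one variable on its own (univariate), and then measures how two variables move together (bivariate) using a Pearson correlation. The rows are invented, and dropping incomplete rows is just one possible way of handling gaps.

```python
from math import sqrt
from statistics import mean, median

# Invented sample: nights booked and price paid, with one missing price.
rows = [
    {"nights": 1, "price": 80},
    {"nights": 2, "price": 95},
    {"nights": 3, "price": None},
    {"nights": 4, "price": 140},
    {"nights": 5, "price": 150},
]

# Crossover with wrangling: count the gaps, then set incomplete rows aside.
missing_prices = sum(1 for row in rows if row["price"] is None)
complete = [row for row in rows if row["price"] is not None]
nights = [row["nights"] for row in complete]
prices = [row["price"] for row in complete]

# Univariate analysis: describe a single variable.
price_summary = {
    "count": len(prices),
    "mean": mean(prices),
    "median": median(prices),
}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Bivariate analysis: how strongly do nights and price move together?
correlation = pearson(nights, prices)

print("Missing prices:", missing_prices)
print("Price summary:", price_summary)
print("Correlation between nights and price:", round(correlation, 3))
```

A value close to 1 suggests the two variables rise together, which would be a natural prompt for a scatter plot as the graphical follow-up.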
How can I get started?
First up, you’ll need to choose a dataset (see section 2). Next, select an appropriate Python library. At this point, you should decide why you’re practicing this task. For instance, some Python packages automate the whole EDA process for you. However, you may want to show that you understand each step of the process. If so, it’s perhaps better to roll your sleeves up and use Python to carry out the tasks manually. (For example, by determining the five-number summary, which identifies key statistics in a dataset.)
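For instance, a five-number summary (minimum, first quartile, median, third quartile, maximum) can be worked out by hand in a few lines. This is a sketch on invented numbers. Note that there are several conventions for computing quartiles, so other tools may give slightly different values; the 1.5 × IQR rule used to flag outliers here is a common convention rather than the only option.

```python
from statistics import quantiles

# Invented sample values, for example nightly prices in tens of dollars.
values = [12, 7, 3, 9, 15, 5, 21, 8, 10]


def five_number_summary(data):
    """Return the minimum, quartiles, and maximum of a list of numbers."""
    ordered = sorted(data)
    # "inclusive" treats the data as the whole population of interest.
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    return {
        "min": ordered[0],
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": ordered[-1],
    }


summary = five_number_summary(values)

# Flag possible outliers: anything more than 1.5 x IQR beyond the quartiles.
iqr = summary["q3"] - summary["q1"]
low_fence = summary["q1"] - 1.5 * iqr
high_fence = summary["q3"] + 1.5 * iqr
outliers = [v for v in values if v < low_fence or v > high_fence]

print(summary)
print("Possible outliers:", outliers)
```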
Any EDA project will involve loading your dataset into an appropriate tool, before using text, tables, and visualizations to summarize it. Next, you should fill in any gaps in the data and present your findings using a tool like Jupyter Notebook (which shows your code and the output in an interactive document).
To understand how best to start this beginner data analytics project, make sure you explore the tasks in more detail first, as they vary in complexity. Check out this post to learn more about exploratory data analysis, then seek out appropriate tutorials to support you.
- Popular Python libraries for EDA: Sweetviz, dataprep.eda
- Other EDA tools: MS Excel, Trifacta
- GitHub: pandas profiling, autoviz
4. Data visualization
One of the most creative parts of data analytics is playing with visuals. As part of the data wrangling and EDA processes, you’ll often create basic graphs and tables to represent insights. But as you progress with your primary data analysis, you can take data visualization even further.
What is data visualization?
Data visualization (or data viz) is the graphical representation of data or statistics. While beautiful data is a glory to behold, data viz plays more than an aesthetic role. It helps communicate insights and it highlights patterns that aren’t easy to spot in a table of numbers. It’s used in data wrangling, EDA, and many other aspects of the data analytics process.
How does data visualization work?
There are numerous types of data visualization, from box and scatter plots, to line graphs, pie charts, histograms, and maps. Different types of data viz lend themselves better to different types of datasets. There are many tools (not just Python) designed with data viz in mind. These suit everyone from experienced programmers to those who have never coded in their lives. Data viz is great fun to explore, and if you’re looking for ways to give your portfolio some pizazz, a data visualization project is a great place to start.
How can I get started?
As ever, you’ll need to have a dataset ready to go. From here, we’d recommend keeping things simple. For instance, why not use a single dataset to create the same data visualization using various tools? This can help you decide on a tool that you like. For example, you could create a scatter plot of the same dataset using Plotly, matplotlib, and pandas.
Alternatively, you could choose a single tool and use it to create numerous different data visualization types. For instance, choose a Python library, such as Seaborn, and use it to produce a boxplot, scatterplot, heatmap, line graph, or multivariate plot—but be aware you might need more than one dataset depending on your chosen visualization!
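As a starting point, here is a minimal sketch that draws two chart types from one small dataset. It uses matplotlib rather than Seaborn (so it assumes matplotlib is installed), and the numbers, labels, and file name are all invented for illustration.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # draw off-screen so the script also runs without a display
import matplotlib.pyplot as plt

# Invented sample data: nights booked and price paid.
nights = [1, 2, 3, 4, 5, 6, 7, 8]
price = [80, 95, 90, 120, 135, 150, 170, 320]

# One figure with two side-by-side charts.
fig, (ax_scatter, ax_box) = plt.subplots(1, 2, figsize=(9, 3.5))

# Scatter plot: shows the relationship between two variables.
ax_scatter.scatter(nights, price)
ax_scatter.set_xlabel("Nights booked")
ax_scatter.set_ylabel("Price (USD)")
ax_scatter.set_title("Scatter plot: price vs. nights")

# Box plot: shows the spread of a single variable, including outliers.
ax_box.boxplot(price)
ax_box.set_ylabel("Price (USD)")
ax_box.set_title("Box plot: price distribution")

fig.tight_layout()
output_path = Path(tempfile.gettempdir()) / "price_charts.png"
fig.savefig(output_path, dpi=150)
print(f"Chart saved to {output_path}")
```

Recreating the same two charts in another library is then a quick way to compare how the tools feel to work with.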
Once you’ve nailed these, why not take things a step further, producing a more sophisticated visualization in Tableau?
- Popular Python libraries for data visualization: Seaborn, matplotlib, Plotly
- Other data visualization tools: Tableau Public, Grafana, Datawrapper
- GitHub: Data viz tools for the web
5. Machine learning
So far, we’ve suggested beginner projects that hone your basic data analytics skills. But what if you fancy a project that takes your skills a step further? Something like machine learning? While this might sound like a highly advanced skillset, even expert machine learning engineers have to start somewhere—meaning that you can master this skill, too.
What is machine learning?
A form of artificial intelligence, machine learning (ML) is all about predictive modeling. ML algorithms are designed to improve their predictions the more data they have access to. While ML underpins many aspects of modern data science, there are projects out there that introduce the basics even for beginners.
How does machine learning work?
Machine learning algorithms work by ingesting massive datasets, parsing them, and analyzing them to spot patterns. So far, so familiar! Using these patterns, the algorithm can update how it makes predictions about future data. This might sound daunting. But while advanced machine learning requires a high level of math and statistics, beginners can create basic ML algorithms using models like logistic regression. Better yet, all this is possible using Python.
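To show what "updating how it makes predictions" can look like, below is a bare-bones logistic regression trained with gradient descent, written in plain Python. Treat it as a learning sketch under simplifying assumptions: the study-hours data is invented, there is a single input feature, and the learning rate and number of passes are arbitrary choices. In practice you would normally use a library such as scikit-learn.

```python
import math

# Invented training data: hours studied and whether the student passed (1) or not (0).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Center the feature around its mean, which helps gradient descent settle quickly.
mean_hours = sum(hours) / len(hours)
centered = [h - mean_hours for h in hours]


def sigmoid(z):
    """Squash any number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


weight, bias = 0.0, 0.0
learning_rate = 0.1

for _ in range(2000):
    # 1. Predict with the current parameters.
    predictions = [sigmoid(weight * x + bias) for x in centered]
    # 2. Measure how far off each prediction is.
    errors = [p - y for p, y in zip(predictions, passed)]
    # 3. Nudge the parameters in the direction that reduces the error.
    grad_weight = sum(e * x for e, x in zip(errors, centered)) / len(centered)
    grad_bias = sum(errors) / len(errors)
    weight -= learning_rate * grad_weight
    bias -= learning_rate * grad_bias


def predict_probability(hours_studied):
    """Estimated probability of passing after the given hours of study."""
    return sigmoid(weight * (hours_studied - mean_hours) + bias)


print("Chance of passing after 1 hour:", round(predict_probability(1.0), 3))
print("Chance of passing after 4.5 hours:", round(predict_probability(4.5), 3))
```

Each pass through the loop is one round of "learning": predict, compare against the known answers, and adjust.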
How can I get started?
This project will require a bit of prep. Even if you’re not diving in at the deep end, machine learning requires a solid base of algebra, statistics, and calculus. If you enjoy data analytics, there’s a good chance you’ll already have a flair for math! And if you’ve tried any of the other data analytics projects on our list, you’ll already be familiar with Python.
Perhaps the best-known beginner machine learning project works with a famous dataset of passengers on the Titanic. The project aims to create a simple machine learning model that predicts which passengers survived the disaster and which did not, based on features such as their age, gender, and class. One of the reasons this is a great project for beginners is that Kaggle runs a machine learning competition using the dataset. Kaggle provides a great tutorial to get started, and a discussion forum providing help and tips. And because we know what actually happened to each passenger, it’s possible to measure the success of your algorithm against the actual outcome. The project is challenging but fun, too.
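Measuring success against the actual outcome can be as simple as counting how often the model was right. The sketch below uses a handful of invented labels (1 for survived, 0 for did not survive), not real Titanic records, and reports accuracy alongside a breakdown of the kinds of hits and misses.

```python
# Invented labels for eight passengers: 1 = survived, 0 = did not survive.
actual = [1, 0, 0, 1, 1, 0, 0, 1]
predicted = [1, 0, 1, 1, 0, 0, 0, 1]

pairs = list(zip(actual, predicted))

# Accuracy: the share of passengers whose outcome was predicted correctly.
correct = sum(1 for a, p in pairs if a == p)
accuracy = correct / len(actual)

# Break the results down by type of hit and miss.
true_positives = sum(1 for a, p in pairs if a == 1 and p == 1)
true_negatives = sum(1 for a, p in pairs if a == 0 and p == 0)
false_positives = sum(1 for a, p in pairs if a == 0 and p == 1)
false_negatives = sum(1 for a, p in pairs if a == 1 and p == 0)

print(f"Accuracy: {accuracy:.0%}")
print("True positives:", true_positives, "| True negatives:", true_negatives)
print("False positives:", false_positives, "| False negatives:", false_negatives)
```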
When you’re finished, it’s up to you if you want to submit your work to Kaggle’s competition. Whether you do or not, once you’ve completed your first machine learning project, you’ll have the foundations you need to expand your expertise. Using similar skills, you can branch into natural language processing, for example. This can help with fake news prediction, spam filtering, and more.
- Popular Python libraries for machine learning: scikit-learn, TensorFlow, PyTorch
- GitHub: machine learning for beginners
Wrap up and further reading
So there we have it! Five data analytics project ideas for beginners. Whether you’re dabbling for personal purposes or are planning on publicizing your portfolio, these projects can help you figure out which area of data analytics most interests you. There are plenty of options to get your teeth into.
To learn more about a possible future career in data analytics, sign up for this free, 5-day data analytics short course (delivered right to your inbox) or check out the following posts: