How data is stored, retrieved, and transformed is a critical question that those in data analytics need to consider, whether you’re setting up data pipelines, generating reports, or running machine learning models. Regardless of which industry you belong to, it’s helpful to gain a deeper understanding of software systems and programs that lay the foundation of analytics architectures.
For many of us, the world of databases and query languages sounds like a foreign concept. As you will soon learn, there are many types of databases out there, which can make them seem challenging to learn more about. We’ll take a look at one of the most common databases you’ll likely end up encountering: the relational database.
In this post, we’ll explain the core principles that define what a relational database is, how it is modeled, and help you understand the tradeoffs between relational and non-relational databases.
We’ll cover the following topics—feel free to use the clickable menu to skip ahead, too, if you like:
- What is a relational database?
- The relational database model
- Examples of relational databases
- Benefits and challenges of relational databases
- Relational vs. non-relational databases
- How do relational database management systems use SQL?
Databases can get technical pretty fast, so let’s kick things off with a simple definition!
1. What is a relational database?
A relational database sounds intimidating, but at the very simplest level, you can think of a database as just a place to store data. In data analytics, software engineering, and data engineering, databases play a very important and specific role. What this looks like in practice can vary in complexity, but in those fields of work, relational databases are a specialized form of database that are used to store and retrieve data from applications.
Relational databases are a popular type of database as they can cover a wide variety of industry use cases. Even if you have not had the chance to work with one yet, you might have already heard of industry favorites like MySQL, PostgreSQL, and MongoDB.
So what is a relational database exactly? If you are familiar with Excel spreadsheets, they work in conceptually similar ways. You can think of a database as having the same row and column structure, but they take the concept of a table and add certain features that make them much more powerful for data retrieval and analysis.
Databases typically hold multiple tables and they can be linked together if they have overlapping columns, or in industry jargon, a primary key. This interlinking architecture is immensely useful for companies to make sense of different forms of data generated by their lines of businesses.
2. The relational database model
To better understand the relational database, let us first understand why its emergence sparked such a paradigm shift in computer science. It was created by E. F. Codd at IBM in the 1970s, and he outlined the model in a paper that described several rules to structure data storage. Tables have a unique primary key for each row, and tables have a relationship to each other when they have columns for the unique key. When the primary key is used in another table, it is known as the foreign key there.
In addition to primary keys and column names, relational databases also define the types of data in each column (e.g. string, integer, floats). These attributes, called the relational schema, structure how new entries are added to a database.
This might feel too abstract, and it can be difficult to conceptually imagine why this matters for you if you are interested in analytics, so let’s dive into an example of how they are used by businesses.
Consider a typical company, a Shoe Company, that sells shoes online. On the backend, they will store customer information across different databases depending on what data is stored when a customer makes a purchase on their website. The Customer table might have columns like Customer ID, Mailing Address, Purchase ID, Purchased Item ID, Transaction Amount, and the Marketing table might have columns like Promotional Campaign, Customer ID, Transaction Amount, Purchase ID.
In this simplified database, Customer ID is the primary and foreign key of the two tables—we can now analyze the impact that marketing has on generating sales. For example, if data analysts want to evaluate the success of Shoe Company’s promotional campaign that offered discounts during a seasonal holiday, they can join these tables by Customer ID before being able to run a statistical analysis on the campaign’s impact.
Hence, we can see how the schema structures relations between tables that enable us to access data across multiple tables without needing to reorganize any of the tables.
3. Examples of relational databases
Now that we understand the logic behind relational databases, we can turn to the different types used by companies and what management systems are in place to run relational databases. When companies like Facebook or Google run their services, they have to create and maintain thousands of different databases, which would be incredibly difficult without the use of database programs to automate some of the management.
You might have encountered some of them at work. Some of the more popular ones that companies use include MySQL, Microsoft SQL, PostgreSQL, and MongoDB. To better familiarize yourself with industry favorites, check out this list curated by the industry research team at Gartner that features reviews and ratings of public and private cloud database management systems.
4. Benefits and challenges of relational databases
Relational databases offer many benefits, which explains why they have dominated the market for databases over many decades now. They are great because they are organized from the start: the schema is a set of rigid rules that determines what types of data are allowed to be stored, and where they go (whether it is in a different table or certain column). This makes it relatively straightforward to update tables with new entries without having to overhaul entire applications or data pipelines.
This organized structure also makes it easy to access what you need for analysis through a programming language like SQL. If you have a background in SQL, or have read through our helpful explainer on how to write a query, it is easy to see how SQL can be used with relational databases to empower even the non-technically inclined to perform powerful queries on their own.
Another important consideration is that it is the best way one can optimize the amount of space needed to store data, which is incredibly important for companies in the digital age when they are collecting reams and reams of data but storage is costly. Relational databases can help reduce cost by efficiently storing data, something that is appreciated by business stakeholders when proposing the budget needed to launch a new database.
In addition to these top features, they also guarantee disaster recovery by making it easy to backup and restore the full database during system failure. Many cloud solutions offer continuous mirroring which further minimizes any data loss.
The organized nature of relational databases comes at a cost: speed. They are not the fastest to work with! This can be a problem when dealing with massive datasets, which is something that most companies need to work with. This problem worsens when the number of tables grows. Hence, they do have a reputation for being a slower type of database to work with.
5. Relational vs. non-relational databases
Now that we have discussed what a relational database is and how it is structured, you might probably be able to guess the fundamental difference between relational and non-relational databases. It comes down to how we can organize the stored data.
Non-relational databases are unstructured. As you do not need to specify a schema upfront, this format offers unparalleled flexibility in what kinds of data you can add to a database. In fact, they can even be completely unrelated to each other! We can add documents alongside images and text—there are no restrictions on the types of data that can be added. For example, NoSQL databases do not have a schema but store data within one type of data structure—most commonly JSON. Broadly speaking, the different types of non-relational databases you can have include ones with dictionary-like key-values, documents, wide columns, and graphs.
You might be wondering when users might favor a non-relational database. These reasons generally have to do with the downsides of using relational databases. For one, the unstructured nature of non-relational databases gives it greater speed. This matters when data retrieval needs to be optimized, say if we need to generate analytical reports frequently.
Non-relational databases are also helpful when we are not entirely sure about what sort of data we will need to record, or if the nature of the data will evolve over time. We can be flexible about changes in format since we don’t need a schema.
Next, we’ll look at how relational database management systems make use of SQL.
6. How do relational database management systems use SQL?
You might have noticed that SQL is often mentioned alongside databases. That’s because SQL queries make it easy for both technical and non-technical analysts to transform tables, merge datasets, and retrieve data from databases in just a few lines of code.
Let’s take a look at how this works in practice by using the example of the Shoe Company’s database with two tables, Customer and Marketing. Recall that the Customer table has these columns: Customer ID, Mailing Address, Purchase ID, Purchased Item ID, Transaction Amount; the Marketing table has these columns like Promotional Campaign, Customer ID, Transaction Amount, Purchase ID.
Imagine you’re an analyst tasked with finding out whether a back-to-school sale promotional campaign had any impact on the sale of shoes. There are many ways you could approach this question. One way would be to look at whether the sale increased how much your top five spending customers bought, compared to a period with no sale. You could write a SQL query to find the top five customers by transaction during a specific promotional campaign, and it would look something like this:
SELECT Customer_ID, Sum(Transaction_Amount) FROM Customer LEFT JOIN Marketing ON Customer.Customer_ID = Marketing.Customer_ID WHERE Promotional_Campaign = back_to_school GROUP BY 1 ORDER BY 2 LIMIT 5
What we did here was join both the Customer and the Marketing table based on the shared primary key, Customer_ID. From there, we could easily retrieve the subset of customers who made a purchase during the back_to_school campaign, and sum the transactions to find the top five customers by transaction amount.
This example illustrates how relational databases make it super easy to pull all the data you need to answer a business question from different tables while reducing duplication on the data storage end. Data analysts can thus get the data they need from simply by writing SQL queries to retrieve, clean, and transform data. Check out this article we published that explains why knowing SQL is a useful analytics skill to have since most companies use it to optimize accessing their relational databases.
The world of databases is vast and complex; if you’re interested in learning more about career pathways in data analytics, consider gaining a deeper knowledge of the ins-and-outs of different types of databases. It’s also equally important to pair this knowledge with some experience of a database query language like SQL. As we’ve learnt here, relational databases are pretty nifty pieces of technology that form the building blocks of how we do analytics.
To summarize the core concepts that are essential to doing great work as an analyst, keep these core concepts in mind:
- As relational databases use a pre-defined schema, advance planning is key.
- It can be helpful to familiarize yourself with popular relational databases such as MySQL, Microsoft SQL, PostgresSQL, and MongoDB.
- Consider the business use case carefully when deciding between a relational and non-relational database. Is flexibility or efficiency more important? How important is disaster recovery? Will you be dealing with diverse data types?
- Understanding how to craft SQL is essential to making the most out of a relational database as it can help you join and transform the tables you need for analysis.
Want to learn more about the world of relational databases, and data analytics as a whole? Why not try out this free, 5-day data analytics short course? You might also be interested in the following articles: