What is data science? And what does a data scientist actually do? Here’s your ultimate guide.
When it comes to decision-making, data is vital. This is true on a personal level, but absolutely essential on an organizational level—an overwhelming majority of today’s businesses and organizations rely on data-driven decision making and strategic plans in order to achieve their goals.
So who deals with this, and how? A range of roles work with data in order to glean insights from it: from data analyst, to data scientist, to data engineer, and more.
In this article, we’ll focus on the fascinating field of data science. We’ll ask the following questions:
- What is data science?
- What is data science used for?
- What does a data scientist do?
- What is the life cycle of a data science project?
- What tools do data scientists use?
- What skills do you need to become a data scientist?
- How do I become a data scientist?
- What are some of the best data science courses?
- What is the average data scientist salary?
- Key takeaways and further reading
Now, let’s get started!
1. What is data science?
According to Japanese statistician Chikio Hayashi, data science is a “concept to unify statistics, data analysis, informatics, and their related methods” in order to “understand and analyze actual phenomena” with data.
So, what does this mean in plainer language? It means that data science is a multidisciplinary field that uses a variety of methods to make use of the vast amounts of data available to us and extract valuable insights that help drive future decision-making for individuals and organizations.
So, what are some of these disciplines, then? In data science, skills such as data analytics, machine learning, computer science and artificial intelligence are all employed in order for data scientists to research and identify areas of interest within their organization.
2. What is data science used for?
Data science is used to provide the data-driven insights that help answer questions that stakeholders in an organization may have. But what does this really mean? In this section, we’ll list some of the common business applications for data science.
Predicting individual consumer behavior
In order to determine how much revenue a retailer can expect from an individual customer, data scientists will look to produce a metric known as ‘customer lifetime value’, or CLV. They apply predictive analytics to various aspects of the customer’s retail experience, building up an accurate picture for the retailer that allows it to offer greater personalization and enhance the retail experience for the consumer.
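As a rough illustration of the metric, one common simplified formula multiplies average order value, purchase frequency, and the expected length of the customer relationship. The function name and the figures below are hypothetical, and real customer lifetime value models are usually predictive and far more detailed than this sketch.

```python
def customer_lifetime_value(avg_order_value, orders_per_year, years_retained):
    """Simplified estimate: expected revenue from one customer over the relationship."""
    return avg_order_value * orders_per_year * years_retained


# Hypothetical customer: spends 40 per order, orders 6 times a year, stays for 3 years.
clv = customer_lifetime_value(40.0, 6, 3)
print(clv)  # 720.0
```

A retailer could compare this figure across customer segments to decide where personalization is most worthwhile.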
Increasing security and protecting information
As you can probably tell by now, a lot of the data that is collected and stored is of a sensitive nature.
We see the word encryption being used a lot online. When data is encrypted, it means that the plain text (a user’s address, for example) is scrambled, using an encryption algorithm, into an unreadable format, otherwise known as ciphertext. Only authorized parties who hold the correct key are able to “unscramble” this data for use.
Detecting fraud
Fraud is rife in the banking and insurance industries (and other related industries), to the point that organizations will employ large teams to detect and resolve issues related to fraud.
However, instances of fraud are so frequent that a human team is often not enough to catch them all—and this is where data science comes into play. Data scientists will use machine learning and predictive analytics to detect fraudulent transactions or claims, giving team members more time to focus on resolving issues and saving an organization both time and money.
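To give a feel for automated flagging, here is a minimal sketch that marks transactions whose amounts sit far from the typical value. It uses a simple statistical rule (a modified z-score based on the median), not the machine learning models a real fraud team would train, and the function name, threshold, and transaction amounts are all hypothetical.

```python
from statistics import median


def flag_suspicious(amounts, threshold=3.5):
    """Return the amounts whose modified z-score exceeds the threshold."""
    med = median(amounts)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = median(abs(a - med) for a in amounts)
    if mad == 0:
        return []  # no spread at all, so nothing stands out
    return [a for a in amounts if 0.6745 * abs(a - med) / mad > threshold]


# Hypothetical card transactions; one amount is wildly out of pattern.
transactions = [25.0, 40.0, 32.5, 28.0, 35.0, 30.0, 4999.0]
suspicious = flag_suspicious(transactions)
print(suspicious)  # [4999.0]
```

In practice, a flagged transaction would be passed to a human reviewer instead of being blocked outright.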
3. What does a data scientist do?
In short, a data scientist will collect, analyze, and interpret large amounts of data—known as big data—in order to provide insights into an organization’s operations. They may develop statistical models that analyze these large amounts of data, detecting trends, patterns, relationships and outliers in datasets.
So, what is big data? We cover it in detail in our big data guide, but in short: it’s huge swathes of unstructured data that are generated whenever we do, well, anything! Think about how connected we are, using phones, social media, ATMs, and so on. Every time you interact with a piece of technology, a bit of data is created.
Now, think about how many other people are doing the same interaction with the same piece of technology. All these bits of data are being generated at once, and often in such a way that is too fast and complex to be processed in a traditional—or structured—manner.
Data scientists make sense of these swathes of unstructured data, uncover insights, and present them to stakeholders and decision-makers, who use them to identify business and operational risks, to predict consumer behavior, and to improve overall operations.
On a day-to-day basis, the specific tasks of a data scientist will be dependent on the type of organization they’re working in, as well as the specific goals they’re working towards. Broadly speaking, however, they’ll likely encounter some—or all—of the following tasks and responsibilities:
- Performing preliminary research of the organization and the industry as a whole, in order to identify areas for improvement and opportunities for growth.
- Identifying relevant data sets, then pulling the data they need for their project.
- Scrubbing data in order to make the data set uniform and usable. This may involve tidying up similar terms, and removing outliers where necessary.
- Using exploratory data techniques to get an idea of the characteristics of the obtained data.
- Creating data models to visually represent the data structures.
- Interpreting data, creating visualizations and making recommendations, which are presented to the relevant stakeholders.
These tasks are rooted in the data science life cycle, which we’ll get into next.
4. What is the life cycle of a data science project?
As we’ve explored while looking at some of the use cases for data science, projects can vary greatly depending on their goal, whether that’s predicting future trends, increasing security and protecting sensitive information, or making processes more efficient.
In many cases though, the life cycle for a data science project will loosely follow the same framework, known as OSEMN (pronounced like awesome!), an acronym that stands for Obtain, Scrub, Explore, Model, iNterpret. It’s not a perfect acronym, sure, but it’s easier to remember and pronounce than OSEMI. Let’s go into what each part of the acronym means now.
Obtain
Just like with the data analytics process, the life cycle for a data science project begins with obtaining data. Data is, of course, a data scientist’s bread and butter: they can’t do anything without it! In many data science projects, the data scientist will need to pull data from many sources, perhaps querying databases that require specific query syntax (like SQL) or scraping data from websites. Languages like Python or R are often used for the retrieval of data.
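Here is a small sketch of what pulling data with SQL from Python can look like. It uses Python’s built-in sqlite3 module with an in-memory database standing in for a real data source; the table name, columns, and rows are hypothetical.

```python
import sqlite3

# In-memory database standing in for a real data source (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 20.0), (1, 35.0), (2, 15.0)],
)

# Obtain: pull the total spend per customer with a SQL query.
rows = conn.execute(
    "SELECT customer_id, SUM(amount) FROM orders "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()
conn.close()

print(rows)  # [(1, 55.0), (2, 15.0)]
```

The same pattern applies to a production database: only the connection step changes, while the query and the fetch stay much the same.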
Scrub
You’ll hear a lot about data cleaning or data scrubbing in relation to data science. It’s a critical part of the data science life cycle, as ‘clean’ data is a lot simpler to analyze than ‘noisy’ or ‘irregular’ data. Data cleaning is also crucial for obtaining accurate insights.
But what does it mean to clean data? Well, in its ‘original’ form, a data set is often unorganized and messy. It’s a data scientist’s role to make the data set readable and uniform, and to remove outliers where necessary. A good example of data that should be ‘cleaned’ is the way users self-report their location on social media.
Sydney, Australia can be written as any of the following: Sydney, Syd, SYD, Sydders, The Emerald City, Gadigal Land, Eora Country…and so on. A data scientist would consolidate these names and nicknames under one name to ensure consistency across the data set.
Again, programming languages such as Python are often used for this task, especially considering the volume of data that needs to be processed.
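The Sydney example can be sketched in a few lines of Python. The alias table below is hypothetical and hand-written purely for illustration; on a real project it would be built up from the values actually found in the data.

```python
# Hypothetical alias table, stored in lowercase for case-insensitive matching.
SYDNEY_ALIASES = {
    "sydney", "syd", "sydders", "the emerald city", "gadigal land", "eora country",
}


def normalize_location(raw):
    """Map known Sydney nicknames to one canonical name; otherwise just trim spaces."""
    cleaned = raw.strip()
    if cleaned.lower() in SYDNEY_ALIASES:
        return "Sydney"
    return cleaned


reported = ["Sydney", " SYD", "Sydders", "The Emerald City", "Melbourne"]
normalized = [normalize_location(r) for r in reported]
print(normalized)  # ['Sydney', 'Sydney', 'Sydney', 'Sydney', 'Melbourne']
```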
Explore
Once your data set (or group of data sets) has been cleaned, it’s time to perform some exploratory data analysis (or EDA).
In this stage, a data scientist will look at the cleaned data as a whole and make sense of its characteristics before deciding on how to model it. Many exploratory data analysis techniques make use of visual devices, such as graphs, plots, and other visualizations to quickly show trends and anomalies, or even missing or incorrect data that wasn’t spotted during the cleaning process.
Learn more: What Is Exploratory Data Analysis?
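Before reaching for plots, a first pass at exploration is often just a handful of summary statistics. The sketch below uses Python’s built-in statistics module on a hypothetical column of ages that still contains one missing value, the kind of gap that can slip past the cleaning stage.

```python
from statistics import mean, median, stdev

# Hypothetical cleaned column with one missing value (None).
ages = [23, 31, 27, None, 45, 38, 29]

present = [a for a in ages if a is not None]
summary = {
    "count": len(present),
    "missing": len(ages) - len(present),
    "min": min(present),
    "max": max(present),
    "mean": round(mean(present), 2),
    "median": median(present),
    "stdev": round(stdev(present), 2),  # sample standard deviation
}
print(summary)
```

A summary like this quickly shows the range and spread of a variable, and whether any values are missing, before you decide how to model it.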
Model
This is where things get interesting! In this stage of the life cycle, a data scientist will use this collected, cleaned, and explored data to create a data model: a visual representation of the types of data gathered and the relationships between them.
By providing this visual structure to the data, organizations can see clearly how data is stored. This also makes it easier for stakeholders to retrieve relevant data as necessary.
There are three types of data models:
- Conceptual data model: the most basic type of data model, this provides an overview of the different entities and their potential attributes. Here, an entity represents a set of things, persons, or concepts relevant to the data and the organization. An attribute is a characteristic or other identifying information that further describes an entity.
- Logical data model: a little more complex than a conceptual model, a logical data model includes the relationship between entities, as well as the data types of the attributes and entities. A relationship is an association between two entities.
- Physical data model: the most complex of the three types, this is the last data model created before producing an actual database. As well as including all of the information in a conceptual and logical data model, a physical data model also highlights the schema of the database.
Figure: A logical data model showing the relationship between vehicles and their owners in a data set (by Ethacke1, CC BY-SA 4.0, via Wikimedia Commons)
In this stage of the data science life cycle, any of these data models can be used, depending on the needs of the organization at the time. Data models are often seen as ‘living’ for this reason—they’re able to change as necessary. However, as the project matures, you’ll find that there will be a natural progression in data modeling from conceptual, to logical, to physical—before a database is built.
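To make the progression from logical to physical a little more concrete, here is a minimal sketch based on a vehicles-and-owners example. The dataclasses stand in for a logical model (entities, typed attributes, and a relationship), while the SQL schema stands in for a physical model targeting one specific database, SQLite in this case. All entity, table, and column names are hypothetical.

```python
import sqlite3
from dataclasses import dataclass


# Logical model: entities, typed attributes, and the owner-vehicle relationship.
@dataclass
class Owner:
    owner_id: int
    name: str


@dataclass
class Vehicle:
    vehicle_id: int
    model: str
    owner_id: int  # relationship: each vehicle belongs to one owner


# Physical model: the concrete schema for one database engine (SQLite here).
SCHEMA = """
CREATE TABLE owner (
    owner_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);
CREATE TABLE vehicle (
    vehicle_id INTEGER PRIMARY KEY,
    model      TEXT NOT NULL,
    owner_id   INTEGER NOT NULL REFERENCES owner(owner_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
conn.close()
print(tables)  # ['owner', 'vehicle']
```

A conceptual model would sit one step earlier still: just the names "owner" and "vehicle" and their likely attributes, with no data types or keys.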
Interpret
For the decision-makers and other stakeholders of the organization, this is the most important stage of the data science life cycle. Here, a data scientist will use the data models built in the previous stage to draw meaningful conclusions and come up with actionable insights that allow these decision-makers and stakeholders to decide what the next steps are for the organization. This is usually done through data visualization, which can be achieved using tools such as Tableau, D3.js, or Plotly, to name a few.
5. What tools do data scientists use?
When working with big data, data scientists use a variety of tools to streamline aspects of the data science process and make sense of the vast amounts of data they obtain. Here are some of the tools you’ll come across when working in the field:
Python
No list of data science tools would be complete without Python. A programming language with a wide range of uses, Python is a must-have and must-know for anyone working with data. Python focuses on readability, and its general popularity in the tech field means many programmers are already familiar with it.
In addition, it has a huge range of resource libraries suited to many tasks associated with the data science life cycle. For example, the NumPy and pandas libraries are great for streamlining highly computational tasks, as well as supporting general data manipulation. Libraries like Scrapy are used to scrape data from the web, while Matplotlib is excellent for data visualization and reporting. Python’s main drawback is its speed—it is memory intensive and slower than many languages. Generally speaking though, when it comes to building software from scratch, Python’s benefits far outweigh its drawbacks.
D3.js
D3.js is an open-source JavaScript library for building interactive data visualizations in the browser. While D3 has a steep learning curve, once mastered it offers full control over your visualizations. This means you can tweak them to interact in any way you want, making it excellent for nuanced reporting. However, D3 is only suited to visualizations, and can’t be used for other parts of the data science process, such as data cleaning. It does have a great support community though, which has led to many books and online tutorials becoming available to help you upskill.
Apache Spark
First developed in 2009 at UC Berkeley’s AMPLab and later donated to the non-profit Apache Software Foundation, Apache Spark is an open-source software framework that allows data scientists to quickly process vast data sets. Designed to analyze unstructured big data, Spark distributes computationally heavy analytics tasks across many computers.
While other similar frameworks exist (for example, Apache Hadoop), Spark is exceptionally fast. By keeping data in memory (RAM) instead of repeatedly reading from and writing to disk, it can run some workloads up to 100 times faster than Hadoop MapReduce, which is why it’s often used for the development of data-heavy machine learning models. Spark even has a library of machine learning algorithms, MLlib, including classification, regression, and clustering algorithms, to name a few, making it very useful for data scientists.
6. What skills do you need to become a data scientist?
As a mid-level to senior role, data science requires a high level of proficiency, backed by demonstrable experience, in a variety of hard and soft skills. Here are some of the most important skills for a data scientist to possess:
Statistics and mathematics
It goes without saying that anyone looking to work with big data needs to have a strong foundation grounded in mathematics and statistics—including descriptive statistics and probability theory—in order to make informed business decisions from data.
Programming
As we’ve pointed out earlier in this article, data scientists work with tools that streamline the data science life cycle. In order to use these tools, a data scientist will need strong programming skills. Python and R are absolutely necessary, but knowledge of other programming languages will definitely be valued by prospective employers.
Machine learning methods
As a data scientist, having a thorough understanding of machine learning methods is critical for the data science life cycle, especially in the areas of predictive analytics and data mining. There are many machine learning methods out there to learn and build upon, but a good understanding of both supervised and unsupervised techniques will stand you in good stead for many data science roles.
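If the difference between supervised and unsupervised learning feels abstract, this toy sketch may help. The supervised function learns from examples that already carry labels (a one-nearest-neighbor classifier), while the unsupervised function finds two groups in unlabeled values on its own (a one-dimensional k-means with two clusters). Both are deliberately simplified, pure-Python illustrations with hypothetical names and data; real projects would typically use a machine learning library.

```python
def nearest_neighbor_predict(labeled, x):
    """Supervised: predict the label of the closest labeled example."""
    _, label = min(labeled, key=lambda pair: abs(pair[0] - x))
    return label


def two_means(values, iterations=10):
    """Unsupervised: split unlabeled values into two clusters (1-D k-means, k=2).

    Assumes the values contain at least two distinct numbers.
    """
    lo, hi = min(values), max(values)  # initial centroids
    for _ in range(iterations):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        # Move each centroid to the mean of the values assigned to it.
        lo, hi = sum(low) / len(low), sum(high) / len(high)
    return sorted(low), sorted(high)


# Supervised: labels are provided up front.
labeled = [(1.0, "small"), (2.0, "small"), (9.0, "large"), (10.0, "large")]
prediction = nearest_neighbor_predict(labeled, 8.2)
print(prediction)  # large

# Unsupervised: no labels, the grouping is discovered from the values alone.
clusters = two_means([1.0, 2.0, 1.5, 9.0, 10.0, 9.5])
print(clusters)  # ([1.0, 1.5, 2.0], [9.0, 9.5, 10.0])
```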
Data modeling and analytics
This is a hard skill that comes with training, but is built upon the soft skill of critical thinking. On a daily basis, a data scientist will need to be able to analyze data, create models and run tests that will gather new insights and predict possible outcomes.
Data visualization
When working in any data role, understanding how to create effective data visualizations is an absolute must. After all, if you’re not able to communicate your findings in a way that’s easily understood by the end user, then these data-driven business decisions simply won’t happen. Having strong skills in showing complex data findings using graphs, charts, or other visual representations will take you far in your career.
Communication
In addition to presenting your data findings visually, you should be able to communicate them verbally to your organization’s stakeholders and decision-makers. Being a successful data scientist will require you to present your findings and confidently back up the decisions you’ve made along the way.
Some of these skills will come to you naturally, but some will need to be learned. You could take a course at a data analytics school if the programming and other technical skills are new to you, or look into data analytics internships if you’ve got the technical part down.
7. How do I become a data scientist?
Data scientists are essential in basically every successful organization operating today. Data scientists use data to help inform decision-making, predict future trends and patterns, and identify areas of interest for an organization. As such, the data scientist job title is normally regarded as a mid-level to senior level role.
There’s no one-size-fits-all approach to becoming a data scientist, but if you’re looking to seriously enter the field, you could consider one of the following routes:
- Earn a bachelor’s degree in computer science, mathematics, IT, business, or another related field (up to 4 years);
- Earn a master’s degree in data, or another related field (approximately 2 years);
- Earn a certification from a reputable data science bootcamp (anywhere from 29 hours up to 8 months);
- Gain related experience in the field you’re interested in working in, then take a data science course to upskill and enter the field.
If you’re currently working as a data analyst, the career path towards becoming a data scientist is slightly more linear. We’ve got an in-depth guide for moving across data disciplines here: How to Make the Transition From Data Analyst to Data Scientist
8. What are some of the best data science courses?
Just as there’s no one-size-fits-all approach to becoming a data scientist, the same holds true when it comes to deciding on a data science course. The best data science course is the one that suits your individual needs and objectives. It pays dividends to do your own research, but here we’ll briefly introduce a handful of the best courses on the market:
Ideal for career-changers with some existing data science knowledge
If you’re looking for an immersive, interactive, and intensive learning experience geared towards optimal employability after graduation, you may be interested in the General Assembly online data science course.
Working in real-time via an interactive classroom, you’ll learn from instructors and work alongside fellow students, getting to grips with statistical modeling, decision trees, random forests, and more. This is an intermediate-level course with some prerequisites to apply: you’ll be expected to have some proficiency in Python, as well as possess a strong mathematical background.
Ideal for career-changers with no prior experience
Okay, so this one isn’t a data science course, but if you’re completely new to the industry, the CareerFoundry Data Analytics Program teaches many of the fundamental concepts crucial to data science.
You’ll start your data journey from the very beginning: learning how to prepare and analyze data, before moving on to SQL, Python, and interactive dashboards. In addition to a project-based curriculum with a strong focus on portfolio building, students benefit from a unique dual mentorship model.
You’ll work with both a mentor and a tutor, as well as an expert career coach—and, if you don’t find a job within six months of graduating, you’ll get your money back. You can also try a free introductory short course to test out the curriculum before committing to the full program.
Ideal for career-changers with experience in statistics and programming
The Springboard Data Science Career Track is a six-month course run on a part-time basis.
It promises a project-based curriculum, unlimited one-to-one mentorship, and a job guarantee (or your money back). The curriculum is split into 18 units, covering topics such as data wrangling, storytelling with data, statistics, and machine learning. This is another intermediate-level course with some prerequisites to entry: you’ll need six months of coding experience under your belt, and to have basic proficiency with probability and descriptive statistics.
If you’re a software developer or analyst, or working within a related discipline and looking to move into data science, this may be a good option.
Learn more: The Best Data Science Bootcamps
9. What is the average data scientist salary?
As you can see, a career in data science allows you to make a real impact, by providing decision-makers with the data-driven insights they need to move an organization forward. It’s a pivotal role in any organization, and it comes with great responsibility—and yes, often a great salary, too.
Glassdoor lists ‘data scientist’ as the second-best job in America for 2021, with a median base salary of $113,735. Of course, this is the median—meaning that there are likely to be many organizations that cannot offer as healthy a base salary, especially if you’re just starting out in the field.
Payscale, another salary aggregator, lists the median data scientist annual salary as being closer to $96,565. Similarly, the U.S. Bureau of Labor Statistics lists the median data scientist salary as being around $98,230.
Another salary aggregator, Built In, is the most generous of the sources we looked at, reporting the median base salary for a data scientist in the U.S. as being $122,000. It also showed their lowest recorded salary as being $50,000, and the highest being a whopping $345,000! It just goes to show that the range of salaries for a data scientist in the U.S. is quite broad, but generally quite healthy across the board.
As with any role, the type of salary you can expect will depend on the organization, the job’s location, and your level of seniority, among many other factors. When applying for jobs, keep these factors in mind, and don’t be afraid to negotiate your salary if you feel you can provide more value than the role initially advertised.
10. Key takeaways and further reading
The question we sought to answer here was: what is data science? Well, data science is a broad discipline, which uses many methods, techniques, and systems to extract useful information from the unfathomably large amount of data that exists around us.
In this article, we’ve given a broad overview of the topic for those who are interested in eventually working in the field as a data scientist. We covered the basic definition of data science, its applications in business, the responsibilities of a data scientist, and the data science life cycle. We then looked at what it takes to become a data scientist: the skills required, the routes to entry into the field, our picks for online data science courses, and some insights into average data scientist salaries in the U.S.
While it can be overwhelming to take in all at once, we’re hoping you can return to this guide and use it as needed as you delve into the world of data science. In a time when data-driven decisions are more important than ever, data scientist roles are likely to remain in demand and, with the right training, passion, and determination, anyone can make the career change into the field.