In today’s economy, the role of the data analyst is booming.
But with this evolving role come fresh challenges, including the growing complexity of data. Unstructured text, in particular, poses a major challenge. From emails to social media comments, customer service documents, product reviews, and surveys, text data is everywhere and makes up a growing share of the data we produce every day.
So how can we make sense of it? Enter text analysis.
Text analysis is a powerful machine-learning technique used to interpret large amounts of text data in various ways. Used well, it allows data analysts to quickly classify text, categorize topics, and measure customer sentiment, among other things.
But how exactly do data analysts use text analysis? What are its use cases, and how might you get started? In this beginner’s guide, we’ll discuss all the fundamentals of text analysis to get you up to speed.
- What is text analysis, and why is it important?
- What’s the difference between text analysis and text mining?
- Text analysis methods and techniques
- Common text analysis use cases
- How to analyze text data (step by step)
- Next steps
All geared up for the basics of text analysis? Then let’s dive in.
1. What is text analysis, and why is it important?
Text analysis (sometimes referred to as text mining) involves analyzing unstructured text documents to determine their underlying structure. Doing so allows data analysts to explore everything from the topics covered in text documents to the sentiment of the language used. This approach uncovers trends and patterns that would be difficult or impossible to detect manually, which makes text analysis a powerful tool for exploring large volumes of data.
The potential applications of text analysis are vast and varied. Using text analysis, a data analyst can do anything from analyzing political discourse to identifying topics of discussion in a particular subject area, detecting fraud and security threats, or managing spam. Within digital marketing, in particular, text analysis can also generate powerful insights into customer preferences that can inform marketing decisions. Understandably, all these potential applications make text analysis a sought-after tool in any data analyst’s arsenal.
2. What’s the difference between text analysis and text mining?
In casual usage, people sometimes refer to text analysis as text mining. In most practical cases, using the terms synonymously is no problem. But in certain situations, it’s necessary to be aware of the distinction.
So what’s the difference?
In its broadest sense, ‘text analysis’ is a two-step process. It focuses, first, on bringing order to unstructured text data and, second, on extracting meaningful insights. Breaking this down, however, we can also describe the first step in this process as text mining. Text mining involves taking unstructured data and bringing order to it.
Once you’ve brought structure to your dataset, it becomes possible to select specific elements to explore further, to uncover relevant trends and patterns. This second step is text analysis in the narrower sense.
So, text mining could involve taking a large, unstructured set of documents extracted from the web and organizing them into categories, topics, or other groupings that are useful for your needs. Meanwhile, you’ll use text analysis to determine, for example, which words are most commonly associated with which categories. This might help you gain deeper insights into the topics covered and their relevance to your business.
Because text analysis, in its broadest sense, involves both text mining and text analysis in the narrower sense, people often use the terms interchangeably. The difference only really matters when diving into the specifics. But it’s not something to lose sleep over!
3. Text analysis methods and techniques
We now know what text analysis involves. But beyond the high-level concept, what specific techniques and methods do data analysts use? In this section, we explore some of the most common ones.
Topic modeling
Data analysts use topic modeling to find different topics present within text data. It helps them understand things like customer feedback and market trends. For instance, an analyst might use topic modeling to identify the most common issues discussed in customer service emails, allowing teams to prioritize issues and design customer-handling workflows.
Sentiment analysis
Data analysts may use sentiment analysis to understand how customers feel about a product or service. For example, by analyzing ‘positive’ and ‘negative’ keywords in customer reviews, data analysts can determine if customers are satisfied or unhappy with the product, or specific features included within it. This concept underpins many social listening tools.
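To make the keyword idea concrete, here’s a minimal sketch of lexicon-based sentiment scoring. The word lists, the `sentiment_score` and `label` function names, and the sample reviews are all made up for illustration; real tools rely on much larger lexicons or trained models.

```python
import re

# Hypothetical mini-lexicons; a real project would use a far larger, curated list.
POSITIVE = {"great", "love", "excellent", "fast", "easy"}
NEGATIVE = {"bad", "slow", "broken", "hate", "poor"}


def sentiment_score(text: str) -> int:
    """Return (number of positive words) minus (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


def label(text: str) -> str:
    """Turn the numeric score into a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


reviews = [
    "I love this phone, the camera is excellent.",
    "Battery life is poor and the screen arrived broken.",
    "It arrived on Tuesday.",
]
labels = [label(review) for review in reviews]
print(labels)
```

Bear in mind that simple keyword counting ignores negation (‘not great’) and sarcasm, which is one reason production tools go well beyond it.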
Text summarization
Text summarization involves automatically producing a shorter version of a text document, providing an overview of a document’s main ideas. A company might use text summarization to draw out key points from a complex series of technical documents, for example.
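One simple way to do this is extractive summarization: score each sentence by how frequent its words are across the whole document, then keep the top scorers. The sketch below assumes that approach. The `summarize` function, the stop word list, and the sample text are illustrative, and modern summarizers often generate new sentences rather than picking existing ones.

```python
import re
from collections import Counter

# A deliberately small, hypothetical stop word list.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it",
              "for", "on", "with", "that", "this"}


def content_words(text):
    """Lowercase the text and keep only words that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def summarize(text: str, max_sentences: int = 1) -> str:
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        # Average word frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


manual = (
    "The pump controller regulates pump pressure. "
    "Pressure alarms trigger when the pump overheats. "
    "Contact support for warranty details."
)
summary = summarize(manual)
print(summary)
```

Here the first sentence wins because ‘pump’ and ‘pressure’ are the most frequent content words in the text.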
Text classification
Text classification automatically classifies text into predefined categories. A customer service help desk might use text classification to automatically assign labels to customer emails, such as ‘billing’ or ‘technical issue’, helping them to sort their workload. The automated category sections in email software such as Gmail or Outlook also use this type of text analysis.
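As a rough illustration of the help desk example, here’s a tiny Naive Bayes classifier written from scratch. Naive Bayes is just one common choice for text classification; we’re not claiming it’s what any particular email product uses. The training emails, labels, and class name are invented for this sketch.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Start from the log prior, then add a log likelihood per word.
            score = math.log(doc_count / total_docs)
            for word in tokenize(text):
                if word not in self.vocabulary:
                    continue  # ignore words never seen during training
                score += math.log((counts[word] + 1) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled emails; real training sets are far larger.
emails = [
    "I was charged twice on my invoice",
    "Please refund the payment on my card",
    "The app crashes when I log in",
    "I get an error message after the update",
]
labels = ["billing", "billing", "technical issue", "technical issue"]

classifier = NaiveBayesClassifier().fit(emails, labels)
prediction = classifier.predict("Why was my card charged twice?")
print(prediction)
```

With only four training emails this is a toy, but the same fit-then-predict pattern applies when you move to a library implementation and a realistic dataset.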
Language translation
We’ve all occasionally used tools like Google Translate or DeepL, right? These are prime examples of translation software that uses text analysis to quickly (and more or less accurately!) translate documents or text from one language into another.
Named entity recognition
Named entity recognition (NER) is a fancy term for identifying and classifying specific names, places, events, organizations, and other categories or entities within a text document. For instance, data analysts might use it to track news articles for mentions of particular companies or products.
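The simplest way to track mentions like this is a gazetteer: a lookup list of known names and their types. The sketch below assumes that approach. The entity list, the `find_entities` function, and the sample article text are hypothetical, and statistical NER models go further by recognizing names they have never seen before.

```python
import re
from collections import Counter

# Hypothetical gazetteer mapping known entity names to entity types.
GAZETTEER = {
    "Walmart": "ORGANIZATION",
    "GlaxoSmithKline": "ORGANIZATION",
    "Google Assistant": "PRODUCT",
    "London": "LOCATION",
}


def find_entities(text):
    """Return (name, type) pairs for every known entity, in order of appearance."""
    # Put longer names first so multi-word entities are matched whole.
    names = sorted(GAZETTEER, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return [(m.group(1), GAZETTEER[m.group(1)]) for m in pattern.finditer(text)]


article = (
    "Walmart now takes orders through Google Assistant. "
    "Analysts in London say Walmart is early."
)
entities = find_entities(article)
mentions = Counter(name for name, _entity_type in entities)
print(entities)
print(mentions.most_common(1))
```

Counting the matches, as `mentions` does here, gives you a quick view of which companies or products a set of articles talks about most.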
Natural language processing (NLP)
Technically, all text analytics uses NLP, but it’s a focus in its own right, too. In particular, there’s much work being done to improve human-computer interaction, and NLP plays a crucial role here.
Using NLP algorithms, data engineers are creating ever more sophisticated chatbots and personal voice assistants that understand and respond to customer queries, even when they use casual language and colloquialisms. NLP is also driving the current explosion in generative AI.
As you can see from these techniques, text analysis is a powerful and diverse tool. Data analysts might use it for anything from a task as niche as automatically creating ‘tweet-length’ summaries of a long-form article to something as prosaic as email categorization. However, most text analytics tasks fall under one of the techniques outlined here.
4. Common text analysis use cases
In section 3, we provided examples of the different text analysis techniques. In this section, we explore a few real-world cases in which text analysis is used. Let’s take a look.
Text analysis in pharmaceuticals
Recently, pharmaceutical companies have begun using text analysis for widely varied tasks. They can use it to gain deeper insight into clinical trial participants or to identify interactions and side effects of different drugs in the medical literature, for example.
Pharma giant GlaxoSmithKline (GSK) recently used text analysis to understand why some parents vaccinate their children and others don’t. Since pharmaceutical regulations restrict consumer interaction, GSK analyzed online message boards, focusing on terms like ‘safety’ and ‘comfort’ to better grasp parental sentiment regarding vaccines.
Not only did this bypass the interaction issue, but it also helped GSK obtain more candid perspectives on parents’ concerns than they would get through more formal surveys, in which people are more inclined to conceal their true feelings. GSK is now using this data to improve its vaccine messaging and address parental concerns.
Text analysis in retail
Text analysis is also a key tool for data analysts working in retail. The most obvious use in retail is for customer sentiment. However, text analysis can go much further than that. Take Walmart, for example, which has been using AI and text analytics to push forward its customer experience in other ways.
Using cutting-edge NLP models, Walmart is anticipating customer needs and providing intuitive personal feedback in various formats. These formats include online chatbots, Walmart’s help center, and interactive voice response (IVR) telephone systems.
Furthermore, Walmart is also one of the first retailers to integrate its products and services directly with existing personal voice assistants. Via Google Assistant, customers can now shop seamlessly with Walmart, placing orders for products, which are then delivered straight to their door. Heard of the internet of things? This is a prime example of how to do it right.
Text analysis in finance
Banks and other financial institutions have been using text analysis for years. Early adoption was possible because financial information has to be highly structured, which makes it relatively simple for software to read.
More recently, NLP has broadened what can be made searchable to more unstructured free text. This makes what’s known as ‘enterprise search’ (quickly and easily finding information within an organization’s data) far easier, with many products now available on the market.
A good example would be NLP software that categorizes internal legal documents to seek out finance or fraud-related excerpts. In turn, internal users such as compliance teams could search for relevant information more quickly and cost-effectively than by trawling documents by hand.
Text analysis in recruitment
Text analysis is used to automate aspects of the recruitment process, for example, by reading job applications and resumes. By analyzing text for relevant keywords and phrases, companies can quickly create a shortlist of potential candidates.
There have been some unfortunate cases where these kinds of algorithms have discriminated against particular groups. While this is always something to be mindful of, text analytics can also help mitigate the issue. For example, AI is being used to detect bias in job descriptions, helping to ensure that companies are not discriminating against applicants based on age, gender, race, or other protected characteristics.
5. How to analyze text data (step by step)
So, you’ve identified a use case for text analysis. Next up, how do you go about it? While the specific steps you’ll take will vary depending on your approach and need, here’s a brief guide to get you started.
Collect your data
The first step is to gather the text data you need. For instance, your organization might have internal documentation, such as customer service documents, which could be useful for training a model. You might also want to supplement this internal data with text data from the web. In this case, you could use software like Scrapy or BeautifulSoup to scrape web pages, or you might use APIs to mine text data directly from public databases.
If you’re seeking text data from books, articles, or documents, you could code a web crawler to search online libraries or archives. Or you might just use pre-existing document mining tools to extract the data. In short, there are many data collection methods. But this offers a taste of some of them.
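Whichever tool you use, the core task is the same: pull the readable text out of a page’s markup. Here’s a minimal sketch using Python’s built-in `html.parser` module as a dependency-free stand-in for a library like BeautifulSoup. The `TextExtractor` class and the sample page are hypothetical, and the HTML is an inline string rather than a live web page.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# A made-up page standing in for a downloaded review page.
page = """
<html><head><style>p { color: red; }</style></head>
<body><h1>Product reviews</h1>
<p>Great blender, very <b>quiet</b> and fast</p>
<script>trackVisit();</script>
</body></html>
"""
text = extract_text(page)
print(text)
```

Before scraping a real site, check its terms of use and any API it offers, since an API is usually the more reliable route.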
Prepare your data
Once you’ve got your data, you need to prepare it for analysis. This is a complex process that involves tasks such as removing unwanted outliers and normalizing your dataset.
Besides standard cleaning tasks, however, there are also some specific steps that you’ll need to take when preparing text data. These include:
- Tokenizing the data: Your unstructured text data should be broken into individual words and phrases, known as tokens.
- Normalizing the data: All unstructured data needs normalizing before it can be analyzed. For text data, this can mean ensuring that all words are in the same format, for example, by converting everything to lowercase.
- Stemming or lemmatizing the data: Both techniques reduce words to a base form. Stemming strips word endings using simple rules, while lemmatization uses vocabulary and context to return a word’s dictionary form (its lemma). A simple stemmer might reduce ‘driving’ to ‘driv’, for instance, while its lemma would be ‘drive’.
- Removing stop words: Common words such as ‘the’, ‘a’, and ‘an’ are known as stop words. Removing them reduces unnecessary noise in your dataset.
- Vectorizing the data: Vectorizing text data involves assigning numerical values to words and phrases. This gives machine learning models a numerical representation that they can process and extract patterns from.
- Creating a vocabulary: It’s necessary to create a vocabulary of all the unique words in a dataset. This will help you understand the contents of your text data when interpreting insights.
There are other preparation tasks you can carry out on text data, but these are the most important ones for raw unstructured data.
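The preparation steps above can be sketched in a few lines of Python. Everything here is illustrative: the stop word list is deliberately tiny, `crude_stem` is a toy suffix-stripper rather than a real stemming algorithm, and the function names and sample sentences are our own. Note that the vocabulary is built before vectorizing, because each vector needs one position per vocabulary word.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to", "was"}  # deliberately tiny


def tokenize(text):
    """Lowercase the text (normalizing) and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Strip a few common suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def build_vocabulary(token_lists):
    """Map each unique token to a column index, in sorted order."""
    unique = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(unique)}


def vectorize(tokens, vocabulary):
    """Bag-of-words vector: one count per vocabulary entry."""
    counts = Counter(tokens)
    return [counts[tok] for tok in vocabulary]


documents = ["The driver was driving a van.", "Drivers loved the vans!"]
token_lists = [preprocess(doc) for doc in documents]
vocabulary = build_vocabulary(token_lists)
vectors = [vectorize(tokens, vocabulary) for tokens in token_lists]

print(token_lists)
print(vocabulary)
print(vectors)
```

In practice you’d likely swap these hand-rolled helpers for a library’s tokenizer, stemmer or lemmatizer, and vectorizer, but the order of operations stays much the same.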
Analyze your data
Now it’s time to use the text analysis techniques outlined in section 3.
Most of the work takes place during text preparation, which makes the analysis much easier, faster, and more accurate. With text data appropriately structured, you can carry out things like sentiment analysis, topic modeling, or text summarization (to name a few).
All these activities involve looking for patterns in the language, the structure of the text, what topics are being discussed, and so on. At a more granular level, analyzing text data can help you identify unexpected relationships between different words, phrases, and contexts. Ultimately, this will provide new insights into the information and ideas conveyed.
Visualize your findings
Once you’ve analyzed your data, you need to present results in an easy-to-digest format. Visualizing data allows you (and your audience) to clearly see patterns and trends you might otherwise miss.
Depending on the data, visualization might involve creating bar charts, line graphs, or word clouds. For instance, a word’s frequency might be shown in a word cloud by the size of the word (a higher frequency represented by larger words and a lower frequency represented by smaller ones) with colors representing sentiment (lighter colors representing positive sentiments and darker colors representing negative ones). However you choose to visualize your text analysis, visualizations should always reveal something new.
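As a quick illustration of a frequency chart, here’s a sketch that draws a bar chart in plain text. In practice you’d more likely reach for a plotting or word cloud library; we’ve kept this one dependency-free. The `text_bar_chart` function and the sample tokens are made up for the example.

```python
from collections import Counter


def text_bar_chart(counts, top_n=5, width=20):
    """Render the top_n most frequent words as rows of '#' characters."""
    top = counts.most_common(top_n)
    if not top:
        return ""
    longest = max(len(word) for word, _ in top)
    peak = top[0][1]  # the highest count sets the full bar width
    rows = []
    for word, count in top:
        bar = "#" * max(1, round(count / peak * width))
        rows.append(f"{word.ljust(longest)} | {bar} {count}")
    return "\n".join(rows)


# Hypothetical tokens, as they might look after preparing customer complaints.
tokens = ["delivery", "late", "delivery", "refund", "late", "delivery", "box", "delivery"]
chart = text_bar_chart(Counter(tokens), top_n=3)
print(chart)
```

Even a chart this basic makes it obvious at a glance that ‘delivery’ dominates the complaints, which is exactly the kind of pattern a visualization should surface.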
Decision-making time!
Finally, it’s time to make decisions based on your analysis. Depending on your needs, this might involve using the insights to inform product development, for example, or using customer sentiment to craft a tailored marketing campaign.
Remember: always take time to understand what the analysis is telling you and what its limitations might be. A sentiment analysis, for instance, should always be considered in context with other factors like competitor analysis, market trends, and other feedback.
6. Next steps
There we go! A complete beginner’s guide to text analysis!
In this article, we’ve explored what text analysis is and why it’s important. We’ve also provided examples of different text analytics techniques, practical use cases, and a step-by-step guide to getting started with analyzing text data.
If you enjoy linguistics, have an analytical mindset, and relish the challenge of getting to grips with things like machine learning, text analytics could be for you. As well as being a useful tool for any data analyst, it’s fast becoming an important subset of data analytics in its own right.
To learn about more data analytics topics, why not try CareerFoundry’s completely free, self-paced data analytics course?