What is topic modeling, and how can it help analyze customer data?

Throughout the early 2000s, "big data" often led to even bigger headaches. Now, most organizations want to know how they can meaningfully navigate and apply their sea of data.

Topic modeling meets this growing demand for fast and contextualized data summaries across various formats. It's a form of "unsupervised" machine learning (ML), so it doesn't require labeled training data or manual tagging up front.

Let’s dive into everything you need to know about the subject, including learning how to measure topic modeling accuracy. We'll also explain how topic modeling is making waves in computer science and language modeling.

What is topic modeling?

Topic modeling is an artificial intelligence (AI) advancement that companies can use to enhance customer experience and improve business operations. It allows them to harness the power of big data rather than be overwhelmed by it.

Topic modeling is a form of unsupervised machine learning (ML) that uses natural language processing (NLP). It uncovers hidden themes or topics within a collection of text documents, called a corpus.

Compared to a manual review, topic modeling is a virtually effortless way to understand what large volumes of unstructured data are about.

Rather than relying on a person to read everything, a topic modeler magically (sort of) determines what themes run through a collection of data. It attempts to infer the most likely topics underlying the data without human involvement.


How does topic modeling work?

Consider how a document or website's "search" feature requires you to know what you're searching for. Topic modeling doesn't need a point of reference like this.

Instead, it works by:

Determining the most common word clusters throughout documents (without prompting)
Comparing word clusters between multiple sets of data
Contextualizing word clusters to determine semantically connected themes

Context is critical in topic modeling because a topic modeler goes beyond just ranking the frequency of words and phrases. The end goal is to rank how often certain topics come up.
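As a rough illustration of the difference between ranking single words and looking at word clusters, the sketch below counts both individual words and pairs of words that share a document. The documents and function names are hypothetical, and real topic modelers use far more sophisticated statistics than this.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy documents; a real corpus would be far larger.
docs = [
    "interest rates affect debt and credit",
    "earning credit lowers interest on debt",
    "trust and interest grow in a friendship",
    "friendship depends on trust",
]

def word_counts(documents):
    """Count how often each word appears across all documents."""
    return Counter(word for doc in documents for word in doc.split())

def cooccurrence_counts(documents):
    """Count how often each pair of distinct words shares a document."""
    pairs = Counter()
    for doc in documents:
        unique_words = sorted(set(doc.split()))
        pairs.update(combinations(unique_words, 2))
    return pairs

counts = word_counts(docs)
pairs = cooccurrence_counts(docs)

# "interest" is frequent overall, but the pairs show it keeps two kinds of company.
print(counts["interest"])
print(pairs[("debt", "interest")], pairs[("interest", "trust")])
```

Frequency alone says "interest" matters. The pair counts hint at why: it appears alongside "debt" in some documents and alongside "trust" in others.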

For example, the topic modeler might determine the following four terms appear most often in a data sample:

Interest
Earning credit
Accounting for
Trust

The topic may seem to be about banking or finance, but what if terms like money, debt, and budget are absent? Suddenly, the original terms could refer to dating, friendship, or the psychology of relationships in general.

You can't always determine a topic from word frequency. Topic modeling involves a certain amount of guesswork, even if a language-modeling algorithm does that guessing.

As in the example, unclear results could mean one of two things:

A need for more data (see the FAQ section at the bottom)
A different method is more appropriate, such as topic classification

For these reasons, topic modeling isn't perfect and relies on estimates. This is where different forms of topic modeling come in, which sort and categorize data differently.

Types of topic modeling

Topic modeling is based on natural language processing (NLP), a branch of computer science concerned with how computers process and analyze human language.

This raises the question: what is topic modeling in NLP?

NLP draws from various algorithmic tools to model different aspects of language. Topic modeling fits into NLP as a form of abstraction, meaning it aims to unveil the latent topics behind a collection of text.

Naturally, topic modeling programs use several methods, each with its own strengths.

Latent Semantic Analysis (LSA)

LSA attempts to model language as we commonly use it.

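Under the hood, it applies a matrix technique called singular value decomposition (SVD) to a table of word counts per document. Its groupings have been compared against how people sort words in human tests, and it attempts to gauge topic coherence by analyzing which words are and aren't used.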

In the earlier example, LSA would likely place significant weight on the telling absence of words most closely related to finance.

Even though we sometimes use the most common words in one context, the lack of other expected words calls that context into question.

Determining the most likely topic requires balancing the actual language with estimates about what other language should also be there. If it is, the context is strong; if it isn't, the context is weak, so the algorithm will rank that topic as less likely.
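The idea of weighing expected-but-missing words can be sketched with a deliberately simple heuristic. This is not LSA itself, which works on a decomposed word-count matrix. It is only a toy score, and the topic names, companion words, and function name are all assumptions made for illustration.

```python
def context_strength(document_words, expected_words):
    """Return the share of expected companion words found in the document."""
    expected = set(expected_words)
    if not expected:
        return 0.0
    present = set(document_words) & expected
    return len(present) / len(expected)

# Hypothetical companion words for each candidate topic.
candidate_topics = {
    "finance": {"money", "debt", "budget", "bank"},
    "relationships": {"friendship", "dating", "partner", "feelings"},
}

sample = "interest and trust matter when dating a new partner".split()

scores = {
    topic: context_strength(sample, expected)
    for topic, expected in candidate_topics.items()
}
best_topic = max(scores, key=scores.get)
print(scores, best_topic)
```

Here "interest" and "trust" appear, but none of the finance companions do, so the finance context scores as weak and the relationships context comes out ahead.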

The best time to use LSA is when analyzing conversational and readable data, such as:

Customer reviews
Testimonials
Survey answers
Long-form articles or books
Articles and blogs for a general audience
Audio or video transcriptions

Latent Dirichlet Allocation (LDA)

Much like LSA, LDA compares the frequency of words, word clusters, and their connecting themes.

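However, LDA takes a more probability-driven approach. It treats each document as a mixture of topics and each topic as a probability distribution over words, emphasizing hard, statistics-driven data over natural language.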

LDA still compares data with syntax, phrasing consistency, and other matters important to all NLP studies, but LSA represents these qualities better.

By contrast, LDA places statistical probability of word clusters at the core of its topic modeling algorithm. It also presents topic modeling reports in a more information-dense chart.
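To make the probability-driven approach more concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling, one common way to estimate the model. It uses only the standard library. The function name, the default values for `alpha`, `beta`, and `iterations`, and the toy documents are assumptions chosen for illustration, and a production system would normally rely on an established library instead.

```python
import random

def fit_lda(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit a tiny LDA model with collapsed Gibbs sampling.

    Returns (doc_topic, topic_word, vocab): doc_topic[d][k] is the estimated
    share of topic k in document d, and topic_word[k][w] is the estimated
    probability of vocab[w] under topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic_counts = [[0] * num_topics for _ in docs]
    topic_word_counts = [[0] * vocab_size for _ in range(num_topics)]
    topic_totals = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic_counts[d][k] += 1
            topic_word_counts[k][word_id[word]] += 1
            topic_totals[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic_counts[d][k] -= 1
                topic_word_counts[k][w] -= 1
                topic_totals[k] -= 1
                # Weight each topic by its fit to this document and this word.
                weights = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_word_counts[t][w] + beta)
                    / (topic_totals[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_word_counts[k][w] += 1
                topic_totals[k] += 1

    # Turn the final counts into smoothed probability estimates.
    doc_topic = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
        for doc, row in zip(docs, doc_topic_counts)
    ]
    topic_word = [
        [(c + beta) / (topic_totals[k] + vocab_size * beta) for c in row]
        for k, row in enumerate(topic_word_counts)
    ]
    return doc_topic, topic_word, vocab

# Hypothetical, pre-tokenized toy documents.
toy_docs = [
    "debt budget money debt interest".split(),
    "money budget debt credit".split(),
    "friendship trust dating trust".split(),
    "dating friendship trust interest".split(),
]
doc_topic, topic_word, vocab = fit_lda(toy_docs)
for shares in doc_topic:
    print([round(s, 2) for s in shares])
```

Each printed row is one document's estimated topic mix. Because sampling is random, the topic numbering is arbitrary, and results on a corpus this small should not be read as reliable.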

LDA is a better method for analyzing customer data related to:

Dense, data-driven analytics
Fields with precise language (e.g., science, law, and any kind of technical writing)
Any type of quantified data where text merely supports or presents hard, measurable data

Using the Python programming language for topic modeling

One of the benefits of Python is its relatively readable, English-like syntax. This helps make Python a popular programming language for topic modeling.

It also offers numerous text-mining libraries built specifically for NLP.

While a guide on using Python for topic modeling is beyond the scope of this article, we wanted to mention its utility for anyone going deep into the technical aspects.

Topic classification

Like topic modeling, topic classification mines data for common phrases, but it works in the opposite way.

Unlike topic modeling, topic classification is a form of "supervised" machine learning, so the user must supply predefined tags or labeled examples for it to function.

The user begins by manually entering certain keyphrases as tags in the topic classifier.

A topic classifier program uses these keyphrases to:

Search the data (or sets of data)
Identify all instances of the keyphrases
Tag passages related to the keyphrase wherever they are found
Compare tagged passages with each other using many of the same language modeling algorithms as topic modeling

It's much more complex than a simple "search" function. Topic classification uses rule-based systems, which differentiate topics semantically.

The topic classifier can conveniently categorize portions of the text under separate tags, even if the text doesn't contain the keyphrase but only the topic implied by it.
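A heavily simplified, rule-based tagger might look like the sketch below. The tags, keyphrases, and function name are hypothetical. Unlike the classifiers described above, this version only matches literal keyphrases and cannot recognize a topic that is merely implied.

```python
# Hypothetical tags and keyphrases; in practice the user supplies these.
TAG_KEYPHRASES = {
    "billing": ["refund", "invoice", "charged twice"],
    "shipping": ["delivery", "tracking number", "arrived late"],
}

def tag_text(text, tag_keyphrases=TAG_KEYPHRASES):
    """Return the sorted list of tags whose keyphrases appear in the text."""
    lowered = text.lower()
    return sorted(
        tag
        for tag, phrases in tag_keyphrases.items()
        if any(phrase in lowered for phrase in phrases)
    )

print(tag_text("I was charged twice and my delivery arrived late."))
print(tag_text("Great product, thanks!"))
```

Matching on plain substrings keeps the sketch short, but it also means "refund" matches "refunded" and would miss a complaint that never uses any listed phrase.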

Gradually, machine learning is replacing rule-based systems, performing the same functions with less and less required input.

There are also hybrid systems with topic classifiers that use machine learning when possible.

The user can use the topic classification system to double-check the work of the machine learning system.

However it's accomplished, what's important is simplifying customer data analysis.

Topic classification allows companies to mine previously unwieldy amounts of raw data for greater context and meaning.

Of course, this applies equally to topic modeling. So how do you know when topic modeling or classification is the right option?

Topic modeling vs. topic classification

While topic classification requires more work than topic modeling, topic classification provides more accurate results.

Topic modeling basically estimates the most relevant keyphrases for you—but how can you be sure they really are the most relevant keyphrases?

After all, won't a topic modeler find the words "the," "and," or "is" more often than almost anything else? Of course, it goes beyond such simplicity by contextualizing word clusters according to themes.
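One common first step for getting past words like "the," "and," and "is" is to filter out a list of so-called stop words before counting. The sketch below assumes a tiny, hypothetical stop-word list and a made-up review. Real lists are much longer and usually come from an NLP library.

```python
from collections import Counter

# A small, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "and", "is", "a", "of", "to", "in"}

def top_terms(text, n=3, stop_words=STOP_WORDS):
    """Return the n most frequent words after dropping stop words."""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in stop_words)
    return [word for word, _ in counts.most_common(n)]

review = "The battery is great and the battery life is the best. The screen is dim."
print(top_terms(review, n=1))
```

Without the filter, "the" would be the most frequent word in this review. With it, "battery" rises to the top.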

But it still raises the question: How certain can you be in the topic modeler's conclusions?

With topic modeling, you still need to review the word clusters.

For example, the modeler might report "computers" as the most relevant topic when a manual review would show that "computer science," more specifically, was the core subject.

Even if a particular phrase occurs often, sheer frequency doesn't prove it's the main subject.

Broadly, deciding between topic modeling and topic classification comes down to three main trade-offs.

In each pair below, the first item describes topic modeling and the second describes topic classification:

Generic vs. specific terms
Speed vs. accuracy
Automation vs. manual effort

With this in mind, topic modeling and topic classification each have their place. You just need to be sure which tool is right for the job. The following rules of thumb should help:

If you know what you're looking for, use topic classification
If you need a quick estimate, use topic modeling
If you have a short list of possible key phrases, use topic classification to narrow it down
If you have large volumes of data and only know what a small portion is about, try topic modeling—but look for word clusters overlapping with known tags

Of course, it's always possible to use topic modeling, then use topic classification to test and review those results or narrow them down using manual searches.
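One way to check a topic modeler's word clusters against tags you already know is to measure how much they overlap. The sketch below uses Jaccard similarity (shared words divided by all words in either set). The cluster, tag names, and tag vocabularies are hypothetical.

```python
def jaccard(a, b):
    """Return the Jaccard similarity of two collections of words."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical word cluster produced by a topic modeler.
cluster = {"refund", "invoice", "charge", "card"}

# Hypothetical known tags and the words associated with them.
known_tags = {
    "billing": {"refund", "invoice", "payment"},
    "shipping": {"delivery", "tracking", "courier"},
}

overlaps = {tag: jaccard(cluster, words) for tag, words in known_tags.items()}
closest_tag = max(overlaps, key=overlaps.get)
print(overlaps, closest_tag)
```

A cluster that overlaps strongly with a known tag is a candidate for that tag. A cluster that overlaps with nothing may point to a topic you haven't tagged yet.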

Use cases and applications

Topic modeling is useful when you have more data than you can read or even skim through, but you still need to know broadly what it's about. This is quite common.

Consider how many different types of data your organization might use on a given day:

Survey results
Product descriptions
Customer reviews and feedback
Articles, white papers, and reports
Legal documents
Internal reports
Meeting minutes
Text-based communication, including email, SMS, web chat, call transcriptions, and message boards and forums

It can seem like pulling it all together is an endless, futile test of your ability to compare apples and oranges.

As a "format-agnostic" text-comparison method, topic modeling can see through differences and automatically classify text into a clearer, searchable form.

Topic modeling can dramatically simplify the analysis of:

Customer service
CRM data
Market research
Focus groups
Customer feedback (e.g., product reviews, company ratings, direct messages)
Survey results
Product testing
Sales call transcriptions

Examples of topic modeling and topic classification

Consider the following scenarios, plus how either topic modeling or topic classification can help solve the customer data issue:

Topic modeling in customer service

Imagine acquiring a new company, then discovering its customer support ticket system is in serious disarray and severely backlogged.

Your parent company uses a totally different system, and it's unfeasible to redo or merge the disorganized system into yours.

A topic modeler can automatically parse the disorganized support tickets, tagging each one with its most likely topics. Assigning support tickets to customer service reps with relevant skills becomes as easy as compiling all support tickets with a given tag.
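Once tickets carry topic tags, compiling them by tag is a small grouping step. The sketch below assumes the tagging has already happened. The ticket IDs, tags, and function name are hypothetical.

```python
from collections import defaultdict

def group_by_tag(tagged_tickets):
    """Map each tag to the list of ticket IDs that carry it."""
    groups = defaultdict(list)
    for ticket_id, tags in tagged_tickets:
        for tag in tags:
            groups[tag].append(ticket_id)
    return dict(groups)

# Hypothetical output of a topic modeler: (ticket ID, assigned tags).
tagged_tickets = [
    ("T-101", ["billing"]),
    ("T-102", ["login", "billing"]),
    ("T-103", ["shipping"]),
]

queues = group_by_tag(tagged_tickets)
print(queues["billing"])
```

Each resulting list can then be handed to the reps best suited to that topic. A ticket with several tags appears in several queues.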

Topic classification with customer feedback

A new product release is imminent, and you're reviewing the last possible round of customer feedback before you can make any last changes.

If you had more time, topic modeling would greatly help determine what your customers considered most important.

Instead, you know you can only address a short list of possible concerns, so you use topic classification to tag feedback according to those matters you can actually affect. You'll quickly see which product feature under your control has attracted the most customer interest.

Topic modeling for sales call transcriptions

As subjective as sales call evaluations can be, they don't have to be a complete mystery.

Simply run your sales call transcriptions through a topic modeling algorithm and let your customers tell you—in more ways than one—exactly what issues are at the top of their list.

You'll have a compelling list of topics most likely driving customer buying decisions, which you can further hone by testing each topic with future sales attempts.

FAQs

What are topics in topic modeling?

Topics are a text sample's main subjects or themes, as determined by the language modeling algorithm. Often, the topics are unique word clusters used most frequently, but not always.

For instance, word clusters are sometimes semantically related to a different overarching theme that's left unstated but heavily implied.

Word clusters can even be misleading, such as when a word has two meanings and one of them bears no relation to the true topic.

A prime example is using "fast" in an exercise science article. It can mean [A] performing an exercise quickly or [B] "fasting" by not eating food for an extended time.

What's structural topic modeling (STM)?

STM is a form of topic modeling used mainly in social science. It incorporates document metadata, such as author or date, to help researchers examine how frequently discussed topics vary across documents, even with widely differing phrasing.

What's the sample size for topic modeling—and how many documents do you need?

The size of the sample in topic modeling is very important. Generally, people use topic modeling when the data volume is so high that manually reviewing it is totally unfeasible (usually, recommendations are for anything over 1,000 pages).

Still, topic modeling can work with a single page of text if it’s highly organized. The structure and cohesiveness of a sample also affect the quality of topic modeling.

