Intelligent Audio Transcript Analytics – NLP Anthology (Part 1)

Intelligent Audio Transcript Analytics: The Next Big Thing for Scaling Industrial NLP Applications.

Over the last few years, Natural Language Processing (NLP) has made significant strides in mastering language models to understand the nuances of different languages, dialects, and voices. NLP has unlocked countless new possibilities, and the market has corroborated this with a high rate of adoption. However, many in the industry, across the Fortune 500, are still skeptical about implementing NLP-based tools to derive value from texts; instead, they rely on their experts to do this manually, resulting in low efficiency, inconsistency, and challenges at scale.

Tredence is helping enterprises address these challenges through our AI CoE. We’ve recently partnered with a Fortune 100 Research and Advisory client to solve several challenging NLP problems using audio transcripts for conversations between the client’s analysts and their customers (usually Business Directors and above). These problems have potential use cases, including Research Guidance, Evolving Categorizations, Automated Reports, and Process Automation.

This is the first post in a four-part blog series. It gives an overview of the problem, the motivation behind the solution, and some of the challenges faced during the solution’s development.

The Problem: Rising Metadata and Lack of Actionable Insights

With companies using a host of call monitoring and recording applications, a large amount of unstructured call data is generated every day. However, a manual approach is inherently resource-constrained and cannot extract valuable insights from this data at scale.

NLP solutions can play a vital role in mining call data, categorizing it, and providing actionable insights. For example, they can be applied to call transcripts to quickly extract the key topics covered with little or no human input. Further, using such a solution to understand call transcripts can improve workplace efficiency, reduce human capital costs, and improve training and feedback for employees. It can also help identify business problems algorithmically, making it easier for the organization to deploy resources in an evidence-based manner.

We have built an NLP-enabled Audio Transcript Analytics Solution that helps systematically understand business calls using three key components:

  • Key Concepts Identification
  • Natural Language Intent Extraction
  • Multi-label Document Tagging

We will discuss each component in detail in the next three posts of this series.

Our solution has been successfully applied to the transcription needs of many Fortune 500 industrial clients across multiple domains.

The tools can be combined to form a full-spectrum Natural Language Understanding and Processing System that can be customized for new domains relatively easily.

Data & Present Framework

Roughly 100,000 analyst-client calls, each lasting 30 to 40 minutes, take place every year. Before our solution was deployed, domain experts had to analyze each call transcript and extract its key elements themselves.

Before we discuss the critical components used by our Audio Transcript Analytics Solution, let’s glance at some of the challenges.


  • Ambiguity is inherent to human language, and speech-to-text conversion adds further problems for NLP systems in the form of transcription errors: incorrect words, spelling errors, and incorrect sentence segmentation.
  • The lack of speaker segregation in the text hinders the application of NLP algorithms to the client-spoken segments.
  • Off-topic conversations or casual talk also significantly reduce the algorithms’ effectiveness. To address this issue, we developed a Casual Talk Removal method that treats casual talk identification as a sentence classification problem, using:
  • A supervised approach: We trained an ensemble model on nearly 10,000 sentences using quantitative features derived from each sentence, such as its position in the transcript and its counts of tokens, stop words, entities, person names, and geographic locations. We observed that the sentence’s position is the most important feature, since transcripts have a high density of casual talk at the beginning. This approach performed well at classifying sentences near the beginning of the call transcript.
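The per-sentence features described for the supervised approach could be derived along the following lines. This is a minimal sketch, not the production implementation: the tiny word lists, the function names, and the regular-expression tokenizer are all assumptions made for illustration, and a real system would rely on a full stop-word list and a named-entity recognizer.

```python
import re

# Hypothetical, tiny lexicons for illustration only (assumptions, not from
# the original solution).
STOP_WORDS = {"a", "an", "the", "is", "are", "how", "you", "i", "to",
              "in", "of", "and", "it", "was", "our", "we", "on"}
PERSON_NAMES = {"john", "maria"}
LOCATIONS = {"boston", "london"}


def tokenize(sentence):
    """Lowercase a sentence and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def sentence_features(sentence, index, total):
    """Return quantitative features for one sentence of a transcript."""
    tokens = tokenize(sentence)
    return {
        # 0.0 for the first sentence, 1.0 for the last one
        "relative_position": index / max(total - 1, 1),
        "token_count": len(tokens),
        "stop_word_count": sum(t in STOP_WORDS for t in tokens),
        "person_name_count": sum(t in PERSON_NAMES for t in tokens),
        "location_count": sum(t in LOCATIONS for t in tokens),
    }


def transcript_features(sentences):
    """Build one feature dictionary per sentence, in transcript order."""
    total = len(sentences)
    return [sentence_features(s, i, total) for i, s in enumerate(sentences)]


transcript = [
    "Hi John, how is the weather in Boston?",
    "It was raining all week.",
    "We want to discuss our cloud migration roadmap.",
]
features = transcript_features(transcript)
```

Each feature dictionary, paired with a casual or non-casual label, would then be one training row for whichever ensemble classifier is chosen.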

However, this approach had two significant limitations:

  • It required a sizeable labeled corpus to train the model.
  • It had poor classification accuracy in later sections of the transcript.

To overcome these limitations, we developed an unsupervised method to classify casual talk sentences.

  • An unsupervised approach: We first removed certain information from the sentences, such as people’s names, geographic location names, and certain stop words. We then used part-of-speech (POS) tags, such as nouns and proper nouns, together with IDF (inverse document frequency) values at the sentence level to classify casual talk.
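One plausible way to combine POS tags and IDF values is sketched below. The scoring rule is an assumption, since the exact rule is not spelled out here: each sentence is scored by the summed IDF of its remaining nouns and proper nouns, and sentences with little informative noun content are flagged as casual talk. The sentences are assumed to arrive already tagged with Universal POS tags by any tagger; the removal list, the threshold, and the function names are hypothetical.

```python
import math
from collections import Counter

CONTENT_TAGS = {"NOUN", "PROPN"}  # Universal POS tags: nouns, proper nouns
# Hypothetical removal list: names, locations, and stop words (assumption).
REMOVE = {"john", "boston", "the", "a"}


def clean(sentence):
    """Keep lowercase noun and proper-noun tokens not on the removal list."""
    return [tok.lower() for tok, tag in sentence
            if tag in CONTENT_TAGS and tok.lower() not in REMOVE]


def idf_table(cleaned_sentences):
    """Smoothed IDF per token, treating each sentence as one document."""
    n = len(cleaned_sentences)
    df = Counter(tok for sent in cleaned_sentences for tok in set(sent))
    return {tok: math.log((1 + n) / (1 + count)) + 1
            for tok, count in df.items()}


def casual_flags(tagged_sentences, threshold=2.5):
    """Flag sentences whose summed noun IDF falls below the threshold.

    The threshold is an illustrative assumption and would need tuning.
    """
    cleaned = [clean(s) for s in tagged_sentences]
    idf = idf_table(cleaned)
    scores = [sum(idf[tok] for tok in sent) for sent in cleaned]
    return [score < threshold for score in scores], scores


tagged = [
    [("Hi", "INTJ"), ("John", "PROPN"), ("how", "ADV"), ("is", "AUX"),
     ("Boston", "PROPN")],
    [("Our", "PRON"), ("cloud", "NOUN"), ("migration", "NOUN"),
     ("budget", "NOUN"), ("doubled", "VERB")],
    [("The", "DET"), ("vendor", "NOUN"), ("contract", "NOUN"),
     ("covers", "VERB"), ("cloud", "NOUN"), ("storage", "NOUN")],
]
flags, scores = casual_flags(tagged)
```

In this toy input, the greeting loses all of its tokens once names and locations are removed, so it scores zero and is flagged, while the two business sentences keep several nouns and are not. No labeled data is needed, which is the main appeal over the supervised approach.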

We hope you liked our approach to call data analysis and our framework for removing ambiguity and casual talk from call transcripts so that meaningful analysis can be performed.
