Key Concept Extraction from NLP Anthology (Part 2)

Automated Keyword Extraction from Articles using NLP | by Sowmya Vivek | Analytics Vidhya | Medium

Key Concept Extraction: Intelligent Audio Transcript Analytics Extracting Key Phrases for Scaling Industrial NLP Applications

The COVID‐19 pandemic that hit us last year brought a massive cultural shift, causing millions of people across the world to switch to remote work environments overnight and use various collaboration tools and business applications to overcome communication barriers.

However, this generates humongous amounts of data in audio format. Converting this data to text format provides a massive opportunity for businesses to distill meaningful insights.

One of the essential steps for an in-depth analysis of voice data is ‘Key Concept Extraction,’ which determines the business calls’ main topics. Once the identification is accurately completed, it leads to many downstream applications.

One way to extract key concepts is to use Topic Modelling, which is an unsupervised machine learning technique that clusters words into topics by detecting patterns and recurring words. However, it cannot guarantee precise results and may present many transcription errors when converting audio to text.

Let’s glance at the existing toolkits that can be used for topic modelling.

Some Selected Topic Modelling (TM) Toolkits

  • Stanford TMT : It is designed to help social scientists or researchers analyze massive datasets with a significant textual component and monitor word usage.

Stanford Topic Modeling Toolbox

  • VISTopic : It is a hierarchical visual analytics system for analyzing extensive text collections using hierarchical latent tree models.

VISTopic: A visual analytics system for making sense of large document collections using hierarchical topic modeling - ScienceDirect

  • MALLET : It is a Java-based package that includes sophisticated tools for document classification, NLP, TM, information extraction, and clustering for analyzing large amounts of unlabelled text.

MALLET homepage

  • FiveFilters : It is a free software solution that builds a list of the most relevant terms from any given text in JSON format.

fivefilters ( · GitHub

  • Gensim : It is an open-source TM toolkit implemented in Python that leverages unstructured digital texts, data streams, and incremental algorithms to extract semantic topics from documents automatically.

GitHub - RaRe-Technologies/gensim: Topic Modelling for Humans

Anteelo’s AI Center of Excellence (AI CoE)

Our AI CoE team has developed a custom solution for key concept extraction that addresses the challenges we discussed above. The whole pipeline can be broken down into four stages, which follow the “high recall to high precision” system design using a combination of rules and state-of-the-art language models like BERT.


Intro to Automatic Keyphrase Extraction

1) Phrase extraction : The pipeline starts with basic text pre-processing, eliminating redundancies, lowercasing texts, and so on. Next, use specific rules to extract meaningful phrases from the texts.

2) Noise removal: This stage of the pipeline uses the above-extracted phrases to remove noisy phrases based on signals mentioned below:

  • Named Entity Recognition (NER): Certain NER such as quantity, time, and location type that are most likely to be noise for the given task are dropped from the set of phrases.
  • Stop-words: Dynamically generated list of stop words and phrases obtained from casual talk removal [refer to the first blog of the series for details regarding casual talk removal (CTR) module] are used to identify noisy phrases.
  • IDF: IDF values of phrases are used to remove common recurring phrases, which are part of the usual greetings in an audio call.

3) Phrase normalization: After removing the noise, the pipeline proceeds to combine semantically and syntactically similar phrases. To learn phrase embedding, the module uses state-of-the-art BERT language model and domain trained word embeddings. For example, “Price Efficiency Across Enterprise” and “Business-Venture Cost Optimization” will be clubbed together by this pipeline as they essentially mean the same.

4) Phrase ranking: This is the last and final stage of the pipeline, which ranks the final set of phrases using various metadata such as frequency, number of similar phrases, and linguistic POS patterns. These metadata signals are not comprehensive, and other signals may be added based on any additional data present.

error: Content is protected !!