February 5, 2019

Key Concept Extraction from NLP Anthology (Part 2)

Key Concept Extraction: Intelligent Audio Transcript Analytics Extracting Key Phrases for Scaling Industrial NLP Applications

The COVID‐19 pandemic that hit us last year brought a massive cultural shift, causing millions of people across the world to switch to remote work environments overnight and use various collaboration tools and business applications to overcome communication barriers.

This shift generates enormous amounts of data in audio format. Converting that audio to text gives businesses a significant opportunity to distill meaningful insights.

One of the essential steps for an in-depth analysis of voice data is ‘Key Concept Extraction,’ which identifies the main topics of business calls. Once these topics are identified accurately, they feed many downstream applications.

One way to extract key concepts is to use Topic Modelling, an unsupervised machine learning technique that clusters words into topics by detecting patterns and recurring words. However, it cannot guarantee precise results, and the many transcription errors introduced when converting audio to text make its output even less reliable.

Let’s glance at the existing toolkits that can be used for topic modelling.

Some Selected Topic Modelling (TM) Toolkits

Stanford TMT : It is designed to help social scientists or researchers analyze massive datasets with a significant textual component and monitor word usage.

VISTopic : It is a hierarchical visual analytics system for analyzing extensive text collections using hierarchical latent tree models.

MALLET : It is a Java-based package that includes sophisticated tools for document classification, NLP, TM, information extraction, and clustering for analyzing large amounts of unlabelled text.

FiveFilters : It is a free software solution that builds a list of the most relevant terms from any given text and returns it in JSON format.

Gensim : It is an open-source TM toolkit implemented in Python that processes unstructured digital texts using data streaming and incremental algorithms to extract semantic topics from documents automatically.

Anteelo’s AI Center of Excellence (AI CoE)

Our AI CoE team has developed a custom solution for key concept extraction that addresses the challenges we discussed above. The whole pipeline can be broken down into four stages, which follow the “high recall to high precision” system design using a combination of rules and state-of-the-art language models like BERT.

Pipeline:

1) Phrase extraction : The pipeline starts with basic text pre-processing, such as eliminating redundancies and lowercasing the text. Next, it applies specific rules to extract meaningful phrases from the text.
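
The post does not spell out the extraction rules, so the following is only a minimal sketch of one common rule: split each clause at stop words and keep the runs of words in between (a RAKE-style heuristic). The stop-word list, the function names, and the max_words limit are all assumptions made for illustration, not the AI CoE team's actual rules.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {
    "a", "an", "the", "and", "or", "but", "to", "of", "in", "on", "for",
    "with", "is", "are", "was", "we", "our", "you", "i", "it", "this",
    "that", "need", "will", "can", "about", "so", "let", "us",
}


def preprocess(text):
    """Lowercase the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def extract_phrases(text, max_words=4):
    """Return candidate phrases: maximal runs of non-stop-words within a clause."""
    phrases = []
    # Punctuation ends a clause, so a phrase never spans a comma or full stop.
    for clause in re.split(r"[.,;:!?]", preprocess(text)):
        current = []
        for token in re.findall(r"[a-z0-9][a-z0-9'-]*", clause):
            if token in STOP_WORDS:
                # A stop word closes the phrase collected so far.
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(token)
        if current:
            phrases.append(" ".join(current))
    # Very long runs are usually not clean phrases, so drop them.
    return [p for p in phrases if len(p.split()) <= max_words]


transcript = (
    "We need to discuss the cloud migration plan, "
    "and the quarterly budget review is on Friday."
)
candidates = extract_phrases(transcript)
```

This stage deliberately favours recall: it keeps weak candidates such as single words and leaves it to the later stages to filter them out.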

2) Noise removal: This stage takes the phrases extracted above and removes the noisy ones based on the signals listed below:

Named Entity Recognition (NER): Phrases tagged with certain entity types, such as quantity, time, and location, are most likely noise for this task and are dropped from the set of phrases.
Stop-words: A dynamically generated list of stop words and phrases, obtained from the casual talk removal (CTR) module (see the first blog of the series for details), is used to identify noisy phrases.
IDF: The inverse document frequency (IDF) values of phrases are used to remove common recurring phrases, such as the usual greetings in an audio call.
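
The three signals above can be sketched as a simple filter. This is an illustration under stated assumptions, not the team's implementation: a small regular expression stands in for a real NER tagger, the stop-phrase set is passed in by hand rather than generated by the CTR module, and the smoothed IDF formula and the min_idf threshold are hypothetical choices.

```python
import math
import re

# Crude stand-in for an NER tagger (assumption): flags phrases that contain
# a number, a weekday, or an am/pm marker as quantity/time noise.
QUANTITY_OR_TIME = re.compile(
    r"\b(\d+(\.\d+)?|monday|tuesday|wednesday|thursday|friday|saturday|sunday|am|pm)\b"
)


def idf(phrase, calls):
    """Smoothed inverse document frequency of a phrase over a list of calls.

    Each call is represented as the set of phrases extracted from it.
    """
    doc_freq = sum(1 for phrases in calls if phrase in phrases)
    return math.log((1 + len(calls)) / (1 + doc_freq)) + 1


def remove_noise(candidates, calls, stop_phrases, min_idf=1.3):
    """Keep only the candidates that pass all three noise checks."""
    kept = []
    for phrase in candidates:
        if phrase in stop_phrases:            # stop-word signal
            continue
        if QUANTITY_OR_TIME.search(phrase):   # NER-like signal
            continue
        if idf(phrase, calls) < min_idf:      # phrase occurs in almost every call
            continue
        kept.append(phrase)
    return kept


# Made-up corpus of four calls, each reduced to its set of extracted phrases.
calls = [
    {"good morning", "cloud migration plan"},
    {"good morning", "quarterly budget review"},
    {"good morning", "friday", "pricing model"},
    {"good morning", "thanks everyone"},
]
candidates = [
    "good morning", "cloud migration plan", "friday",
    "thanks everyone", "quarterly budget review",
]
clean = remove_noise(candidates, calls, stop_phrases={"thanks everyone"})
```

Here "good morning" is dropped because it appears in every call and so has the lowest possible IDF, which is how a greeting would be caught without being listed explicitly.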

3) Phrase normalization: After removing the noise, the pipeline combines semantically and syntactically similar phrases. To learn phrase embeddings, the module uses the BERT language model and domain-trained word embeddings. For example, “Price Efficiency Across Enterprise” and “Business-Venture Cost Optimization” will be grouped together by this stage, as they mean essentially the same thing.
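
Once every phrase has an embedding, grouping similar phrases comes down to comparing vectors. The sketch below assumes a greedy grouping by cosine similarity, which the post does not confirm. The three-dimensional vectors are invented placeholders for real BERT phrase embeddings, and group_similar and the 0.8 threshold are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_similar(phrases, embed, threshold=0.8):
    """Greedy grouping: a phrase joins the first group whose representative
    (the group's first member) is at least `threshold` similar to it."""
    groups = []
    for phrase in phrases:
        for group in groups:
            if cosine(embed(phrase), embed(group[0])) >= threshold:
                group.append(phrase)
                break
        else:
            # No existing group is close enough, so start a new one.
            groups.append([phrase])
    return groups


# Toy vectors made up for illustration; a real pipeline would obtain them
# from BERT and domain-trained word embeddings.
TOY_VECTORS = {
    "price efficiency across enterprise": (0.9, 0.1, 0.0),
    "business-venture cost optimization": (0.8, 0.2, 0.1),
    "cloud migration plan": (0.0, 0.1, 0.9),
}
groups = group_similar(list(TOY_VECTORS), TOY_VECTORS.get)
```

With these toy vectors the two cost-related phrases end up in one group and "cloud migration plan" in its own. Greedy grouping depends on the order of the input, so a production system might prefer a proper clustering algorithm.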

4) Phrase ranking: This is the final stage of the pipeline, which ranks the remaining phrases using metadata such as frequency, number of similar phrases, and linguistic POS patterns. This list of signals is not exhaustive, and other signals may be added based on any additional data available.
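
One simple way to combine these signals is a weighted score. The post does not say how the signals are combined, so the weights, the list of preferred POS patterns, and the PhraseStats structure below are assumptions for illustration only. The POS pattern is taken as an input that a POS tagger would supply.

```python
from dataclasses import dataclass


@dataclass
class PhraseStats:
    phrase: str
    frequency: int       # how often the phrase occurs in the call
    similar_count: int   # how many similar phrases were merged into it
    pos_pattern: str     # e.g. "ADJ NOUN", as produced by a POS tagger


# Hypothetical set of patterns that tend to mark topic-like noun phrases.
PREFERRED_PATTERNS = {"NOUN NOUN", "ADJ NOUN", "NOUN NOUN NOUN", "ADJ NOUN NOUN"}


def score(stats, w_freq=1.0, w_sim=0.5, pos_bonus=2.0):
    """Weighted sum of the three signals; the weights are illustrative only."""
    bonus = pos_bonus if stats.pos_pattern in PREFERRED_PATTERNS else 0.0
    return w_freq * stats.frequency + w_sim * stats.similar_count + bonus


def rank_phrases(all_stats):
    """Return the phrase records ordered from highest to lowest score."""
    return sorted(all_stats, key=score, reverse=True)


ranked = rank_phrases([
    PhraseStats("going forward", 6, 0, "VERB ADV"),
    PhraseStats("quarterly budget review", 3, 4, "ADJ NOUN NOUN"),
    PhraseStats("cloud migration plan", 5, 2, "NOUN NOUN NOUN"),
])
top_phrases = [s.phrase for s in ranked]
```

A linear score like this is easy to extend: any additional signal becomes one more weighted term.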

Karan Makan

Technology Engineer and Entrepreneur. Currently working with international clients and helping them scale their products through different ventures. Has over 8 years of experience and a strong background in Internet Product Management, Growth & Business Strategy.
