Speaker Separation – Bring Your Call Recordings Up-to-Snuff for AI-fueled Speech Analytics

By Mingren Xiang

Automatically identify customer & agent speakers within single channel audio recordings

We are well within the “Age of the Customer”. The first step toward competing successfully in this era is capturing the Voice-of-the-Customer. Many organizations are sitting on a gold mine of customer intelligence that they have already captured: their contact center call recordings. Unfortunately, call recording wasn’t designed for analytics and insight. Many recording platforms pre-date the rise of AI-fueled speech analytics, which automatically transcribes and analyzes call recordings. Call recording was initially designed for storage and archiving, often to meet specific regulatory needs. As a result, storage efficiency was prioritized over recording quality, so recordings are often highly compressed, which degrades the audio. Many older systems also use a single channel for audio from both the caller and the agent.

Why Mono Recording Poses Challenges for Speech Analytics

The primary issue is that a transcript of a mono recording cannot tell which speech came from the agent and which came from the caller. It is therefore difficult to zero in on customer satisfaction or agent performance: there is no way to pinpoint whether the caller or the agent is responsible for what was said, or for the associated sentiment and acoustic measures.

So, what should you do if you are already invested in a mono call recording system? Not to worry. Check out CallMiner Speaker Separation.

Macrosoft’s CallMiner Services

Download Macrosoft’s CallMiner Implementation Services brochure to learn more about how we can help you achieve your Customer Satisfaction goals.

What is CallMiner Speaker Separation?

CallMiner Speaker Separation is voice biometrics-based software that divides mono recordings into speaker channels representing the agent and customer portions of a call, which improves speech analytics effectiveness.

A “passive enrollment” process trains on a group of calls with the same agent. The system identifies the most prevalent talker across those calls (assuming the customer changes from call to call) and assigns that voice as the agent’s voiceprint. Once a voiceprint of the agent is created, the system assumes any speaker who is NOT the agent is the customer. During the speech-to-text process, each part of the conversation is then attributed to the correct speaker, which improves transcription readability, reduces the time to target agent or customer issues via speaker filtering, and enables speaker-targeted topic mining.
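The enrollment idea described above can be illustrated with a small Python sketch. This is not CallMiner's implementation, which the article does not describe in detail. It assumes each call has already been split into anonymous speakers, each summarized by a numeric voice embedding; the embeddings, the `0.8` similarity threshold, and the function names `enroll_agent` and `label_segments` are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def enroll_agent(calls, threshold=0.8):
    """Build an agent voiceprint from calls handled by the same agent.

    calls: list of calls; each call is a list of per-speaker embeddings
    (hypothetical input, produced by some upstream diarization step).
    The voice that recurs in the most calls is taken to be the agent,
    on the assumption that the customer changes from call to call.
    """
    best_matches = []
    for call in calls:
        for candidate in call:
            matches = []
            for other in calls:
                if not other:
                    continue
                # Closest speaker in this call to the candidate voice.
                closest = max(other, key=lambda e: cosine(candidate, e))
                if cosine(candidate, closest) >= threshold:
                    matches.append(closest)
            if len(matches) > len(best_matches):
                best_matches = matches
    if not best_matches:
        raise ValueError("no speakers available for enrollment")
    # Average the matched embeddings into a single voiceprint.
    dims = len(best_matches[0])
    return [sum(m[i] for m in best_matches) / len(best_matches) for i in range(dims)]


def label_segments(segments, voiceprint, threshold=0.8):
    """Label each (text, embedding) segment as agent or customer.

    Any voice that does not match the agent voiceprint is treated
    as the customer, mirroring the rule described in the article.
    """
    return [
        ("agent" if cosine(emb, voiceprint) >= threshold else "customer", text)
        for text, emb in segments
    ]
```

In this sketch, `enroll_agent` would run once over a batch of an agent's calls, and `label_segments` would then run on each new call's transcript segments. A production system would need real voice embeddings and a threshold tuned on actual audio.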

Why Speaker Separation is Helpful to Your Organization

  • Transcript Usability – Speaker separation enables speaker-associated search, with tags automatically applied for categories. Issues such as customer satisfaction and escalations are easily identified. Agent performance, including compliance and sentiment, is also clear with automated scoring.
  • Topic Discovery – Trending issues that may or may not have been identified already are revealed based on agent or customer utterances. Topic circles, sized by call volume and split between agent and customer speakers, support root cause awareness and action.
  • Accuracy – Speaker Separation between an agent and a caller is highly reliable if good quality audio is available, with the following considerations:
    • Call duration should be longer than 5 seconds.
    • A third person talking, or background overtalk, is assigned to whichever voice is most similar, agent or customer.
    • Hold music is a factor, especially with a heavy percentage of vocals or the use of instrumentals.
  • Efficiency – Storage requirements are the same as for mono, since only the transcripts are separated between caller and agent. Voiceprint processing overhead is likely 5% or less, compared to 25% and perhaps significantly more required for stereo call recording. Speaker separation also eliminates the need for a stereo call recording upgrade.
  • Unobtrusive – Passive voiceprint enrollment means agents always remain in service. If speaker separation fails, the transcription content remains intact; it is simply not associated with a speaker identity.

By Mingren Xiang | April 7th, 2020 | CallMiner

About the Author

Mingren Xiang

Mingren is a Data Science professional at Macrosoft. He is Macrosoft's technical lead in voice and conversational analytics using the CallMiner suite of utilities. The practice includes both partnering with CallMiner to deliver speech analytic solutions and developing customized NLP applications. Mingren has a Master of Science from the University of Wisconsin-Milwaukee.

Aside from leading the speech analytics practice at Macrosoft, Mingren focuses his research on Deep Learning applications for medical image processing. He presented his master's thesis on training a CNN (Convolutional Neural Network) based Encoder-Decoder model to reconstruct CT scans using only one X-ray image. Such a task remains one of the hardest challenges in the computer graphics community.

Coming from a strong computer science background, Mingren is also proficient in multiple programming languages such as Java, Python, C/C++, various JavaScript libraries, and SQL. His specialty in software development is using APIs to create functional backend services with web development frameworks such as Java Spring and Django in Python.
