Voice identification

Igor Roid • January 10, 2024

Live voice vs pre-recorded voice

The Client


The client is one of the world’s leading service providers in the field of customer experience. The company provides digital B2B solutions that improve its clients’ interactions with their customers, such as web platforms, CRM systems, and support portals. As one of the largest service companies in its field, it operates a vast, distributed information processing ecosystem, which makes the security of its customers’ data a major concern.



The Problem


The client offers different security options across its services, from classic approaches such as password protection and MFA to various biometric solutions. Voice recognition is one of the more specialized biometric authentication tools and is used in a limited set of services. Voice-based authentication was already in place, but we needed to address the case where a scammer uses someone’s pre-recorded voice samples to gain access.



Challenges


Single sample vs multisample

The best user experience means asking the user for a single voice sample per authentication session, whereas collecting multiple samples significantly increases accuracy.


Integration with the existing voice recognition solution

The client already had a voice recognition solution, and our task was to extend it with sample validation without disrupting the existing functionality.



The Solution


Rather than only modifying the services already in place, we decided to build an independent service in order to minimize intrusion into the existing codebase. We still had to modify some outputs to connect them to the new service, so we implemented a chain of responsibility that answers three questions step by step: Does the voice belong to the speaker? Does the speaker say what is expected? Is the speech sample live or pre-recorded? For this purpose we used a model based on a convolutional neural network with a set of inner layers that distort the input data. This allowed us to simulate the randomness that is common in audio data, such as the distortions that tend to occur when audio is transferred over a network.
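The three-step chain described above could be sketched roughly as follows. This is a minimal illustration, not the production code: the `AuthRequest` type, the `run_chain` helper, and the three placeholder checks are hypothetical names, and the real checks would call the existing voice recognition services and the new liveness model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class AuthRequest:
    """Hypothetical request object; field names are assumptions."""
    user_id: str
    expected_phrase: str
    audio: bytes


# A check returns None to pass the request on, or a rejection reason to stop.
Check = Callable[[AuthRequest], Optional[str]]


def run_chain(request: AuthRequest, checks: List[Tuple[str, Check]]) -> Tuple[bool, str]:
    """Run the checks in order and stop at the first one that rejects."""
    for name, check in checks:
        reason = check(request)
        if reason is not None:
            return False, f"{name}: {reason}"
    return True, "accepted"


# Placeholder checks standing in for the real services (assumptions).
def speaker_matches(request: AuthRequest) -> Optional[str]:
    return None  # existing speaker verification would be called here


def phrase_matches(request: AuthRequest) -> Optional[str]:
    return None  # existing phrase verification would be called here


def is_live(request: AuthRequest) -> Optional[str]:
    # Toy rule for the demo only; the real step would query the CNN model.
    if request.audio.startswith(b"REPLAY"):
        return "sample looks pre-recorded"
    return None


CHAIN: List[Tuple[str, Check]] = [
    ("speaker", speaker_matches),
    ("phrase", phrase_matches),
    ("liveness", is_live),
]

print(run_chain(AuthRequest("u1", "open sesame", b"LIVE-audio"), CHAIN))
print(run_chain(AuthRequest("u1", "open sesame", b"REPLAY-audio"), CHAIN))
```

Keeping each question as a separate link means the new liveness step can be appended to the chain without changing the logic of the two steps that already existed.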


We tested a series of models with various depths and augmentation approaches to find the best accuracy for a reasonably sized neural network. The input data was transformed into a numeric array. We rejected multi-sampling as the default option in favor of user experience, which lowered our accuracy. Overall, we achieved an accuracy of ~92%, which is relatively low compared to real-world expectations.
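To give a sense of what distorting the numeric input array can look like, here is a small standalone sketch. It is an assumption-laden stand-in: the actual distortion lived in layers inside the network, whereas this hypothetical `distort` function works on a plain list of samples in the range [-1, 1], and the noise level, gain range, drop probability, and chunk size are illustrative values rather than the ones used in the project.

```python
import random
from typing import List


def distort(
    samples: List[float],
    rng: random.Random,
    noise_std: float = 0.01,   # assumed noise level
    drop_prob: float = 0.05,   # assumed chance of losing a chunk
    chunk: int = 160,          # assumed chunk size in samples
) -> List[float]:
    """Return a randomly distorted copy of the audio samples."""
    gain = rng.uniform(0.8, 1.2)  # random volume change for the whole sample
    out: List[float] = []
    for start in range(0, len(samples), chunk):
        block = samples[start:start + chunk]
        if rng.random() < drop_prob:
            # Simulate a chunk lost in transit by replacing it with silence.
            out.extend([0.0] * len(block))
        else:
            # Apply gain and additive Gaussian noise, clipped to [-1, 1].
            out.extend(
                max(-1.0, min(1.0, s * gain + rng.gauss(0.0, noise_std)))
                for s in block
            )
    return out


rng = random.Random(42)          # fixed seed keeps the demo repeatable
clean = [0.5] * 1000             # dummy signal standing in for real audio
noisy = distort(clean, rng)
print(len(noisy), min(noisy), max(noisy))
```

Applying a fresh random distortion each time a sample is used means the model rarely sees exactly the same input twice.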


On the other hand, the typical error was a false negative (live speech recognized as recorded), and the flow allowed the user to retry the authentication procedure, which eliminated the flaw in most cases. We kept multisampling for cases where a second attempt is requested. So if the first attempt is rejected, the second attempt involves two models: the one used before and one that can leverage the multisample approach. As a result, we have three votes to make a final decision. False positives (pre-recorded speech identified as live) were negligible.
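The three-vote decision could be sketched as a simple majority. This rests on an interpretation: the three votes are assumed to be the first-attempt result plus the two model results from the second attempt, and the function and constant names are hypothetical.

```python
from typing import Sequence

LIVE = "live"
RECORDED = "recorded"


def majority_vote(votes: Sequence[str]) -> str:
    """Return LIVE only if more than half of the votes say LIVE."""
    live_votes = sum(1 for vote in votes if vote == LIVE)
    return LIVE if live_votes * 2 > len(votes) else RECORDED


def final_decision(first_attempt: str, retry_single: str, retry_multi: str) -> str:
    """Combine the first attempt with both second-attempt models (assumed scheme)."""
    return majority_vote([first_attempt, retry_single, retry_multi])


# A rejected first attempt is overturned only if both retry models agree.
print(final_decision(RECORDED, LIVE, LIVE))
print(final_decision(RECORDED, LIVE, RECORDED))
```

Under this assumed scheme, a first-attempt rejection still counts as one vote, so a single "live" result on the retry is not enough to grant access.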


Obstacles


Audio quality

Our solution had to stick to the commonly used sample rate of 44.1 kHz, which was sufficient for the problem at hand. Unfortunately, with a bit depth lower than the default (16-bit) or a sample rate of 24 kHz, recognition accuracy dropped quickly. On the bright side, most modern recording devices, such as phone microphones or Windows hardware, provide the required level of audio quality. But if audio quality is reduced to save bandwidth on a poor network, the number of false results may grow. This is considered an edge case with little impact on the business.
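A quality gate along these lines could reject unsuitable samples before they reach the model. This is a sketch under stated assumptions: it presumes the audio arrives as WAV data, treats 44.1 kHz and 16-bit as hard minimums, and the `meets_quality` and `make_silent_wav` names are hypothetical. It uses only Python's standard `wave` module.

```python
import io
import wave

MIN_SAMPLE_RATE_HZ = 44_100   # assumed minimum, taken from the 44.1 kHz figure
MIN_SAMPLE_WIDTH_BYTES = 2    # 2 bytes per sample = 16-bit


def meets_quality(wav_bytes: bytes) -> bool:
    """Check the WAV header against the assumed minimum quality."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return (
            wav.getframerate() >= MIN_SAMPLE_RATE_HZ
            and wav.getsampwidth() >= MIN_SAMPLE_WIDTH_BYTES
        )


def make_silent_wav(rate_hz: int, width_bytes: int, frames: int = 100) -> bytes:
    """Build a tiny mono WAV of silence, used here only for the demo."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(width_bytes)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00" * width_bytes * frames)
    return buffer.getvalue()


print(meets_quality(make_silent_wav(44_100, 2)))  # meets both minimums
print(meets_quality(make_silent_wav(24_000, 2)))  # sample rate too low
print(meets_quality(make_silent_wav(44_100, 1)))  # 8-bit, bit depth too low
```

Checking the header first is cheap, and it lets the service ask the user to record again instead of returning a likely false result.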


Microphone effects

Audio effects such as noise cancellation introduce additional artifacts into the audio sample, which significantly reduces the effectiveness of the neural network. We discovered this effect at a later stage of development, so we had to generate an extended training set to teach our model how to handle these artifacts.



The Value


As a result of our work, the client was able to integrate our solution seamlessly into their product, with undisrupted behavior guaranteed for existing systems. Our solution closed another potential vulnerability for the client, which raised their overall security, and a one-click mechanism for enabling or disabling the feature added flexibility. The solution adapts to changing input audio conditions (such as sound effects or feedback-loop echo), while the constraints on audio quality cover virtually all everyday use cases.

VSBD - Software Development
