AI voice cloning tools are getting surprisingly cheap. In this article, we look in depth at why.
In 2020, researchers at MIT, working with a company called Respeecher, created a fictional film of Richard Nixon announcing the failure of the Apollo 11 Moon landing. A behind-the-scenes video shows the time-consuming procedure needed to duplicate Nixon’s voice. The MIT researchers first gathered hundreds of brief recordings of Nixon’s speech, then recorded a voice actor delivering the identical phrases. The programme then altered the actor’s reading of Nixon’s alternate Moon landing address so that it sounded more like Nixon.
The results of this procedure appear to be excellent: Respeecher was awarded a contract to duplicate James Earl Jones’ Darth Vader voice in upcoming Star Wars productions last year. But the price is steep. Recently, I contacted Respeecher to try their service, and they responded by telling me that “a project typically requires several weeks and fees from 4-digit up 6-digit in $USD.”
Since I didn’t have thousands of dollars to spend, I chose a lesser-known company called Play.ht. I uploaded a thirty-minute recording of myself reading a text of my choice and only had to wait a few hours.
I didn’t need to employ a voice actor because Play.ht offers text-to-speech services. Once trained on my voice, the programme could quickly produce lifelike human speech from written text. And best of all, there was no cost to me: Play.ht’s free plan allowed me to duplicate my voice. Commercial plans start at $39 a month.
Realistic text-to-speech programmes like Play.ht are challenging to create because people pronounce the same word differently depending on the circumstances. We vary a word based on the words that come before or after it in a phrase, and we follow intricate, mostly unconscious rules about which words in a sentence to emphasise.
There are also completely arbitrary variations in how people pronounce words. Sometimes we need to pause to catch our breath or consider what we’re saying, or we simply become sidetracked. Any system that always pronounces a word or sentence in the same way will therefore sound somewhat robotic.
Respeecher, a voice-to-voice system, can follow the lead of the voice actor who provided the original audio, so it needn’t worry as much about these problems. A text-to-speech system, by contrast, must understand human speech well enough to know when to pause, which words to emphasise, and so on.
According to Play.ht, their technology makes use of a transformer, a type of neural network introduced by Google in 2017 and used as the basis for many generative A.I. systems since then. (The transformer is the T in GPT, OpenAI’s family of large language models.)
A transformer model’s strength lies in its capacity to “pay attention” to many parts of its input at once. When Play.ht’s model creates the audio for a new word, it doesn’t simply “think about” the word being said or the one that came before; it also considers the overall structure of the sentence. As a result, it can adjust the intensity, pace, and other aspects of the speech to mimic the speech patterns of the voice it was trained on.
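To make the idea of “paying attention” a little more concrete, here is a minimal sketch in Python of scaled dot-product attention, the core operation inside a transformer. This is not Play.ht’s code. The four words and their two-dimensional vectors are made up for illustration, and a real model learns separate query, key, and value projections over much larger vectors.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query over a whole sequence."""
    scale = math.sqrt(len(query))
    # Compare the query with every position in the input, not just its neighbour.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Blend the value vectors according to those weights.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical toy vectors for a four-word phrase (assumed for illustration only).
words = ["we", "chose", "the", "moon"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]

# Ask how much the last word should draw on each word in the phrase.
weights, output = attend(vectors[3], vectors, vectors)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```

Every word in the phrase receives some weight, which is the sense in which the model considers the whole sentence rather than only the current word and the one before it.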
Challenges Faced by Voice Cloning AI
Play.ht is designed to create lengthy audio files from scratch, such as a whole audiobook. Descript’s Overdub feature, on the other hand, is intended to insert brief sentences into an existing audio recording. It is much harder to spot a synthetic voice in a short audio snippet, so I believe Overdub’s voices are very realistic for this application.
Descript also makes use of A.I. technologies to improve audio in other ways. For instance, a function called Studio Sound uses artificial intelligence to transform regular audio, possibly made with a subpar microphone in a loud environment, into something that sounds like a studio recording. Not only does it eliminate background noise, but it also subtly changes the speaker’s voice to make it sound as though a better microphone was used to record the audio.
Descript can also work in the opposite direction: by subtly introducing background noise, it ensures that a new audio clip you add to an existing recording has the same “room tone” as the surrounding audio.
Tools like these greatly reduce the time-consuming post-production work needed to publish high-quality audio content, which is a godsend for independent creative workers. They may, however, also benefit criminals and other troublemakers.