The Morning After: Microsoft’s VALL-E AI can replicate a voice from a three-second sample
Microsoft’s latest research in text-to-speech AI centers on a new model, VALL-E. While multiple services can already create copies of your voice, they usually demand far more recorded audio to work from. Microsoft claims its model can simulate someone’s voice from just a three-second audio sample. The speech can match both the timbre and emotional tone of the speaker – even the acoustics of a room. It could one day be used for customized or high-end text-to-speech applications, but like deepfakes, there are risks of misuse.
Researchers trained VALL-E on 60,000 hours of English-language speech from 7,000-plus speakers in Meta’s Libri-Light audio library. The results aren’t perfect: Some samples sound tinny and machine-like, while others are surprisingly realistic.
Microsoft isn’t making the code open source, possibly due to the inherent risks. In the paper, the company said: “Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating.”
We’ve all seen the 1992 movie Sneakers, right? Right?!
– Mat Smith