Microsoft releases VALL-E, a new text-to-speech model that can produce speech in any voice from just a 3-second audio sample
Text-to-speech (TTS) technology has come a long way since its early days more than five decades ago. The first English text-to-speech system was developed in 1968 by Japanese researcher Noriko Umeda at the Electrotechnical Laboratory in Japan, generating a computer voice from plain text.
With advances in speech recognition, processing power, and artificial intelligence, text-to-speech technology has become ubiquitous. Today, it is used in voice-controlled assistants such as Alexa and Siri.
In 1998, Microsoft released Sam (Speech Articulation Module) TTS Generator, the well-known text-to-speech voice included with Windows XP. It was a speech synthesizer, with an online interface, for applications that use the Microsoft Speech API (SAPI). Microsoft Sam also offered natural-sounding text-to-speech that matched the intonation and emotion of human voices.
Fast forward more than two decades, and Microsoft has just released VALL-E, a new zero-shot text-to-speech model that can mimic a person's voice from a three-second sample. The new model is a major step toward more natural-sounding TTS systems since Microsoft released Sam in 1998.
According to mPost, which first spotted the new TTS system, VALL-E is a transformer-based TTS model that can generate speech in any voice after hearing only a three-second sample of that voice. That is a significant improvement over previous models, which required a much longer training period to generate a new voice.
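To give a sense of how little audio a three-second sample is, here is a minimal Python sketch that trims a prompt of that length from a longer recording. It does not call VALL-E, which has no public API referenced here. The helper name take_prompt and the synthetic test tone are hypothetical, and the 16 kHz sample rate is an assumption based on the rate at which LibriSpeech audio is distributed.

```python
import math

PROMPT_SECONDS = 3        # prompt length reported for VALL-E
SAMPLE_RATE_HZ = 16_000   # assumption: LibriSpeech audio is sampled at 16 kHz


def take_prompt(waveform, sample_rate=SAMPLE_RATE_HZ, seconds=PROMPT_SECONDS):
    """Return the first `seconds` of a mono waveform (hypothetical helper)."""
    n_samples = int(sample_rate * seconds)
    if len(waveform) < n_samples:
        raise ValueError("recording is shorter than the requested prompt")
    return waveform[:n_samples]


# Stand-in for a real recording: five seconds of a 220 Hz sine tone.
waveform = [
    math.sin(2 * math.pi * 220 * t / SAMPLE_RATE_HZ)
    for t in range(5 * SAMPLE_RATE_HZ)
]

prompt = take_prompt(waveform)
print(len(prompt))  # 48000 samples, i.e. 3 s at 16 kHz
```

Under these assumptions, the entire voice sample the model hears is 48,000 audio samples.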
According to a statement posted on the demo site hosted on GitHub, Microsoft said its initial experimental results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. “In addition, we find VALL-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis,” Microsoft wrote.
Below is the text of one of the LibriSpeech samples:
“They moved thereafter cautiously about the hut groping before and about them to find something to show that Warrenton had fulfilled his mission.”
The demo page provides four audio clips for this sentence: Speaker Prompt, Ground Truth, Baseline, and VALL-E.