ReadSpeaker’s Deep-Learning, Smart-Talking TTS Voices

Things have changed a lot since HAL 9000 interacted with Dr. David Bowman in the movie “2001: A Space Odyssey”. What used to be considered nascent technology is now becoming mainstream. An article in The Economist recently stated that “voice is far more convenient and natural than any other means of communication” and that it is, in fact, so powerful because “it can also be used while doing something else (driving, working out or walking down the street). It can extend the power of computing to people unable, for one reason or another, to use screens and keyboards. And it could have a dramatic impact not just on computing, but on the use of language itself”. ReadSpeaker TTS voices deliver high quality and market-leading accuracy. But ReadSpeaker is also pushing the technological boundaries of text to speech even further, shaping the future of speech.

Today’s generation of text-to-speech (TTS) voices, such as the ones offered by ReadSpeaker, uses a synthesis technique called unit selection synthesis. Although the resulting speech sounds very natural, it requires recording many hours of speech with a professional speaker, which is costly. And in order to avoid “glitches” at the points where speech units are pasted together, the speech is recorded in a very neutral speaking style with little variation in pitch.
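
To make the idea concrete, here is a minimal Python sketch of the search at the heart of unit selection. The data structures and cost functions are hypothetical placeholders, not ReadSpeaker’s implementation; a production system searches a very large database of recorded units with far richer cost functions.

```python
# Minimal, hypothetical sketch of the unit selection search (not
# ReadSpeaker's implementation).

def select_units(targets, candidates, target_cost, join_cost):
    """Pick one recorded unit per target sound so that the total
    target cost (how well a unit matches the desired sound) plus
    join cost (how smoothly consecutive units concatenate, i.e. how
    audible the "glitch" would be) is minimal, via a Viterbi-style
    dynamic-programming search."""
    # best[i][u] = (lowest cost of any path ending in unit u at step i, backpointer)
    best = [{u: (target_cost(targets[0], u), None) for u in candidates[0]}]
    for i in range(1, len(targets)):
        layer = {}
        for u in candidates[i]:
            prev = {p: best[i - 1][p][0] + join_cost(p, u) for p in candidates[i - 1]}
            p_best = min(prev, key=prev.get)
            layer[u] = (prev[p_best] + target_cost(targets[i], u), p_best)
        best.append(layer)
    # Trace the cheapest path back from the final step
    unit = min(best[-1], key=lambda u: best[-1][u][0])
    path = [unit]
    for i in range(len(targets) - 1, 0, -1):
        unit = best[i][unit][1]
        path.append(unit)
    return list(reversed(path))
```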

Since the 1970s, researchers have been developing rule-based synthesizers to translate the linguistic properties of an input text into acoustic features of the speech (pitch, spectrum, duration, voicing). They then used a vocoder to translate those acoustic features into speech. The rules were very complex, but not complex enough to describe human speech accurately, so the result sounded very robotic, with stilted intonation. In the early 2000s, the rules were replaced by trained Hidden Markov Models (HMMs), which had already been used successfully in speech recognition. This made it possible to generate speech using a limited acoustic database for training. From a training set of linguistic properties paired with matching acoustic features, the HMMs learn to group together linguistic properties that produce similar acoustic features. The trained HMM model thus replaces the hand-crafted rules. The output speech sounded more natural than the rule-based vocoded speech, although there were still some artefacts, such as buzziness.
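
The statistical idea can be illustrated with a deliberately simplified Python sketch: frames that share the same linguistic context are pooled, and the model stores their average acoustic features. Real HMM-based systems use decision-tree-clustered, context-dependent states and generate smooth parameter trajectories, so treat this only as an analogy.

```python
# Deliberately simplified sketch of the statistical idea behind HMM-based
# synthesis: pool the frames that share a linguistic context and store the
# average acoustic features for that context.
from collections import defaultdict
import numpy as np

def train_context_means(frames):
    """frames: iterable of (linguistic_context, acoustic_feature_vector) pairs."""
    buckets = defaultdict(list)
    for context, features in frames:
        buckets[context].append(features)
    return {context: np.mean(feats, axis=0) for context, feats in buckets.items()}

def predict(model, contexts):
    """Map each frame's linguistic context to its learned mean acoustics;
    a vocoder would then turn these parameters into an audible waveform."""
    return np.stack([model[context] for context in contexts])
```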

We’re now seeing an exciting revolution in TTS, which is set to change the future of speech. In the last few years, computers have become increasingly powerful and deep learning has become increasingly popular. The mapping of linguistic properties to acoustic features is now handled by Deep Neural Networks (DNNs) instead of HMMs. An iterative learning process minimizes the objectively measurable differences between the predicted acoustic features and those observed in the training set.
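
As a rough illustration (not ReadSpeaker’s actual architecture), such an acoustic model can be sketched in a few lines of PyTorch: a feed-forward network maps per-frame linguistic features to acoustic features, and each training step nudges the weights to reduce the mean squared error against the recorded data. The dimensions and layer sizes below are arbitrary.

```python
# Illustrative PyTorch sketch of a DNN acoustic model (dimensions and
# architecture are arbitrary, not ReadSpeaker's actual system).
import torch
import torch.nn as nn

LINGUISTIC_DIM = 300   # e.g. phone identity, stress, position-in-word features
ACOUSTIC_DIM = 64      # e.g. spectral coefficients, log-F0, voicing flag

model = nn.Sequential(
    nn.Linear(LINGUISTIC_DIM, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, ACOUSTIC_DIM),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(linguistic_batch, acoustic_batch):
    """One iteration: predict acoustic features, measure the error against
    the recorded data, and update the network weights to reduce it."""
    optimizer.zero_grad()
    predicted = model(linguistic_batch)
    loss = loss_fn(predicted, acoustic_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```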

Several companies are working on this technology. A startup called Lyrebird claims to be able to train a new DNN TTS voice using only one minute of speech. They have quite an impressive demo featuring Barack Obama, Donald Trump, and Hillary Clinton. However, the trained ear can hear many artefacts in these examples. These are due to several factors, including the quality of the recorded speech, the accuracy of the annotation of the database (i.e., where each sound is located), and the accuracy of the acoustic feature extraction.

Accurately extracting acoustic features is easier for some voices than for others. For instance, female voices can be more challenging due to the inherently higher pitch. But with a slightly larger acoustic database, a robust linguistic preprocessing module to determine the correct linguistic properties of the sentences in the acoustic database, and high-quality annotations and acoustic feature extraction, we can get much closer to natural sounding speech.
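
By way of illustration, one such acoustic feature, the fundamental frequency (F0, or pitch) contour, can be extracted with the open-source librosa library; this is not necessarily the tooling used in a production pipeline, and the file name below is hypothetical. Pitch tracking of this kind is one of the steps that tends to be harder for higher-pitched voices.

```python
# Example of extracting the F0 (pitch) contour of a recording with librosa.
import librosa

audio, sample_rate = librosa.load("recording.wav", sr=16000)  # hypothetical file
f0, voiced_flag, voiced_prob = librosa.pyin(
    audio,
    fmin=librosa.note_to_hz("C2"),   # ~65 Hz, below a typical speaking pitch
    fmax=librosa.note_to_hz("C6"),   # ~1047 Hz, well above speaking pitch
    sr=sample_rate,
)
```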

Baidu’s Deep Voice also looks promising, as does Google’s WaveNet. However, like the other players, they don’t have an end-to-end system yet. The difference between WaveNet and Deep Voice on the one hand and ReadSpeaker’s and Lyrebird’s systems on the other is that the first two use acoustic features from previous frames as input features alongside the linguistic features. While that produces even more natural-sounding speech, it also increases the complexity of the system considerably. For WaveNet, generating a single sentence can currently take over a minute.
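
A schematic sketch shows why this autoregressive setup is slow: every new frame (or sample) is conditioned on output that has just been generated, so synthesis cannot be parallelised across time. The model and its inputs below are placeholders, not WaveNet or Deep Voice themselves.

```python
# Schematic sketch of autoregressive synthesis; `model` and its inputs are
# placeholders, not the actual WaveNet or Deep Voice implementation.
def autoregressive_synthesis(model, linguistic_features, history_length):
    generated = []
    for frame_features in linguistic_features:        # one step per frame/sample
        context = generated[-history_length:]         # previously generated output
        next_output = model.predict(frame_features, context)
        generated.append(next_output)                 # steps cannot run in parallel
    return generated
```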

The advantage of the new DNN TTS method is that the acoustic database needed is much smaller than for a unit selection voice. Indeed, the new, smart TTS voices can be created from just a few minutes of recordings. The DNN TTS method is also more flexible. We can record more expressive speech, and then control which words we want to emphasize and how much emphasis to put on them. We can direct the pitch up or down to signal a question or a declarative sentence. The result is even more human-like.
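
One way such control can be exposed, sketched below with purely hypothetical feature names, is to make emphasis and sentence type part of the input features: the same trained network then renders the same words with different prosody simply because its inputs differ.

```python
# Hypothetical illustration of prosody control: emphasis and sentence type are
# just extra input features, so the same trained network can say the same words
# with different intonation. Feature names are invented for this example.
def linguistic_features(word, emphasis=0.0, is_question=False):
    return {
        "word": word,
        "emphasis": emphasis,               # 0.0 = neutral, 1.0 = strongly stressed
        "question_intonation": is_question, # pitch rises toward the end if True
    }

words = ["you", "ordered", "the", "red", "one"]
neutral_statement = [linguistic_features(w) for w in words]
emphatic_question = [
    linguistic_features(w, emphasis=1.0 if w == "red" else 0.0, is_question=True)
    for w in words
]
```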

There is still work to be done before DNN TTS matches the quality of unit selection synthesis. ReadSpeaker is engaged in bringing the world innovative solutions that overcome the hurdles to making DNN TTS a viable commercial product.

Compare the following ReadSpeaker TTS voices with their counterparts created from just a few minutes of speech using DNN technology:

Listen to our TTS voice Sophie | Listen to DNN Sophie

Listen to our TTS voice Mark | Listen to DNN Mark

While we prepare DNN TTS voices for you, ReadSpeaker TTS voices are ready to power your business today – just as they do for countless customers around the world. Get in touch with the ReadSpeaker team to find out more.