Text-to-speech technology has improved greatly in recent years. Today, a synthetic voice reading a short text can often fool many listeners into thinking that they are actually listening to a human being.
Making a Computer Talk
The most common approach to making a computer talk is to record an actor reading a text and then reuse small clips of those recordings to create new sentences. One might think that the actor simply reads every word of a language and that the computer strings these recordings together to form new sentences. It is a nice idea, but it would not work well in practice: it would be almost impossible to cover all the words and names in a language, and words are pronounced slightly differently depending on their position in a sentence.
Instead, the actor reads a carefully designed script that tries to capture the richness of the language and of the actor’s speech in a limited number of sentences. Still, many thousands of sentences are needed. The actor reads them under the strict supervision of a trained phonetician, who makes sure they come out as intended and that the style stays consistent. This process takes several weeks. The recorded sentences are then analyzed in detail so that the computer can use them to create entirely new sentences.
When the computer is given a text to read, it first translates characters that aren’t words, such as digits, into words. It then looks up the pronunciation of each word in a digital pronunciation dictionary. Finally, it tries to choose the recorded clips, from all the recorded sentences, that best match the text, and pieces them together into new computer-generated speech. These clips can be whole words, but more commonly they are shorter units, such as syllables or even smaller pieces. When this process works as intended, the result can sound very similar to recorded human speech. In a sense it is, of course, recorded speech. The process is usually not perfect, though, and small errors can sometimes be noticed.
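The three steps described above (turning digits into words, looking up pronunciations, and choosing recorded clips) can be illustrated with a small Python sketch. Everything in it is a toy assumption made for illustration: the dictionaries, the phoneme symbols, the clip names, and the function names are hypothetical, and the clip selection uses a simple greedy longest-match rule rather than the more careful "best match" search a real system would perform.

```python
import re

# Hypothetical toy data. A real system would use a large pronunciation
# dictionary and clips cut from many thousands of recorded sentences.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

PRONUNCIATIONS = {
    "room": ["r", "uw", "m"],
    "two": ["t", "uw"],
    "is": ["ih", "z"],
    "free": ["f", "r", "iy"],
}

# Recorded clips, keyed by the phoneme sequence each one covers.
# The values stand in for audio data.
CLIPS = {
    ("r", "uw", "m"): "clip_room",
    ("t", "uw"): "clip_two",
    ("ih", "z"): "clip_is",
    ("f", "r"): "clip_fr",
    ("iy",): "clip_iy",
}


def normalize(text):
    """Step 1: split text into lowercase words, spelling out each digit."""
    tokens = re.findall(r"[a-z]+|\d", text.lower())
    return [DIGIT_WORDS.get(token, token) for token in tokens]


def to_phonemes(words):
    """Step 2: look up each word in the pronunciation dictionary."""
    phonemes = []
    for word in words:
        # Unknown words raise KeyError in this sketch.
        phonemes.extend(PRONUNCIATIONS[word])
    return phonemes


def select_clips(phonemes):
    """Step 3: cover the phoneme sequence with clips, longest match first."""
    longest = max(len(key) for key in CLIPS)
    chosen = []
    i = 0
    while i < len(phonemes):
        for size in range(min(longest, len(phonemes) - i), 0, -1):
            key = tuple(phonemes[i:i + size])
            if key in CLIPS:
                chosen.append(CLIPS[key])
                i += size
                break
        else:
            raise ValueError(f"no recorded clip covers {phonemes[i]!r}")
    return chosen


if __name__ == "__main__":
    words = normalize("Room 2 is free")
    print(select_clips(to_phonemes(words)))
```

With this toy data, "Room 2 is free" becomes the words room, two, is, free. The first three words are each covered by a single clip, while "free" has to be pieced together from two shorter clips, which mirrors the point that clips are often smaller than whole words.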
Customization and Continuous Improvement
Since the computer’s “understanding” of the text it reads is extremely limited, the generated speech will have a more or less unnatural character and might seem a bit “robotic”. Even when the words themselves sound fine, the overall stress pattern, or prosody, might give the impression that the speaker doesn’t really understand the text.
We work continuously to improve text-to-speech technology and voices, and we encourage feedback from our customers and end users. On top of using the best voices available, we add our own layer of improvements, along with both general and customer-specific customizations. Our linguists, who have long experience in speech synthesis, work with transcriptions to tweak the pronunciation and reading of the spoken text, which greatly helps our customers optimize the quality of the text to speech.
A great number of people are helped by text to speech today. As the technology matures further, it will benefit even more users. To learn how to use text-to-speech technology in your daily life, download our free ebook:
Improve Your Life with Text to Speech