The most common approach for making a computer talk is to record an actor reading text and then to reuse small clips of the recordings to create new sentences. One might think that the actor simply reads every word of the language and that the computer then strings these recordings together when creating new sentences. While a nice idea, it would not work well in practice.
It would be almost impossible to cover all the words and names in a language. Additionally, words are pronounced slightly differently depending on their location in a sentence. Instead, the actor reads a carefully created script that tries to capture the richness of the language and the actor’s speech in a limited number of sentences.
Still, many thousands of sentences are needed. The actor reads these sentences under strict supervision by a trained phonetician to make sure they come out as intended and that the style stays consistent. This process takes several weeks. The recorded sentences are then analyzed in detail so that the computer can use them to create entirely new sentences.
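One plausible way to build such a script (a common technique, though not necessarily the one any particular system uses) is a greedy selection: from a large pool of candidate sentences, repeatedly pick the one that adds the most sound units not yet covered. The sketch below uses made-up candidate sentences, with pairs of adjacent letters standing in for pairs of speech sounds:

```python
# Greedy script selection: cover as many sound units as possible
# in a small number of sentences. Letters stand in for phonemes here;
# the candidate "sentences" are invented for illustration.

def units(sentence):
    """Adjacent-letter pairs, a stand-in for pairs of speech sounds."""
    s = sentence.replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}

def greedy_script(candidates, max_sentences):
    covered, script = set(), []
    for _ in range(max_sentences):
        # Pick the candidate that adds the most not-yet-covered units.
        best = max(candidates, key=lambda c: len(units(c) - covered))
        if not units(best) - covered:
            break  # nothing new to gain
        script.append(best)
        covered |= units(best)
        candidates = [c for c in candidates if c != best]
    return script, covered

candidates = ["a cab", "bad ace", "cede", "bead", "dab"]
script, covered = greedy_script(candidates, 3)
```

With these toy candidates, "bad ace" is chosen first because it contributes the most new letter pairs, and each later pick is judged only by what it adds on top of the sentences already chosen.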
When the computer is given a text to read, it will first translate characters that aren't words, such as digits, into words. Then it will look up the pronunciation of each word in a digital pronunciation dictionary. Finally, it will choose, from all the recorded sentences, the clips that best match the text and piece them together to create new computer-generated speech. These clips can be whole words, but more commonly they are shorter bits, like syllables or even smaller units.
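The three steps above can be sketched in miniature. Everything in this sketch is made up for illustration: the tiny pronunciation dictionary, the phoneme symbols, and the clip names that stand in for real audio recordings (a real system stores audio and scores many competing clips, rather than keeping one clip per sound):

```python
# A minimal sketch of the pipeline: normalize text, look up
# pronunciations, then select recorded clips. All data is invented.

DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three",
               "4": "four", "5": "five"}

# Hypothetical pronunciation dictionary: word -> list of phonemes.
PRON_DICT = {
    "i": ["AY"],
    "have": ["HH", "AE", "V"],
    "two": ["T", "UW"],
    "cats": ["K", "AE", "T", "S"],
}

# Hypothetical clip inventory: phoneme -> name of a recorded clip.
CLIP_INVENTORY = {
    "AY": "clip_017", "HH": "clip_042", "AE": "clip_003",
    "V": "clip_101", "T": "clip_055", "UW": "clip_073",
    "K": "clip_009", "S": "clip_088",
}

def normalize(text):
    """Step 1: translate non-word characters (here, digits) into words."""
    return [DIGIT_WORDS.get(tok, tok) for tok in text.lower().split()]

def to_phonemes(words):
    """Step 2: look up each word in the pronunciation dictionary."""
    phonemes = []
    for w in words:
        if w not in PRON_DICT:
            # The failure mode described later: a missing dictionary entry.
            raise KeyError(f"'{w}' not in pronunciation dictionary")
        phonemes.extend(PRON_DICT[w])
    return phonemes

def select_clips(phonemes):
    """Step 3: pick a recorded clip per unit and string them together."""
    return [CLIP_INVENTORY[p] for p in phonemes]

words = normalize("I have 2 cats")
clips = select_clips(to_phonemes(words))
```

Here "2" is first rewritten as "two", each word is expanded into its phonemes, and each phoneme is mapped to a clip; playing the clips back to back would be the concatenation step.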
When this process works as intended, the result can sound very similar to recorded human speech. In a sense, it is of course recorded speech. The process is usually not perfect, though, and small errors can sometimes be noticed. A common cause is that a name or word wasn't included in the computer's pronunciation dictionary.