An article by Ludmila Menert and Wanda Peters of the ReadSpeaker team
“Why does this word sound good when it’s read on its own, yet bad when it occurs in a sentence?”
This is one of the questions our customers ask most often when they encounter pronunciation problems in our products which convert text to speech online. In our answer we first explain how human speakers handle combining speech sounds, and then how we tackle it in our software for online text to speech.
An assumption implied in the above question is that a given word should always sound the same, whether spoken in isolation or embedded in a sentence. But is that so? When listening closely to natural human speech, one quickly realizes that the pronunciation of a word is slightly different depending on the context in which it is spoken. During speech, our articulators, which include our tongue, lips, teeth and hard palate, either anticipate the next sound or carry over the qualities of the previous sound, so that in effect every speech sound subtly changes according to its environment. This is called coarticulation. Take the final sound of “have”. It sounds like “v” in isolation, but like “f” in the phrase “I have to”. The position within a phrase also affects the pronunciation. For example, “John” pronounced on its own is longer and has a falling pitch, but in “John Wayne” the vowel is shorter and its tone stays level.
Sometimes sounds which are fully realized in a single word are omitted in connected speech. Try saying “act spontaneously” and you will realize that you do not pronounce the final t of “act” the way you do when you pronounce that word on its own. Maybe you do not even pronounce it at all. This is called elision. In short, words can sound different depending on the words they follow or precede and depending on their place within a sentence. Even though we might not be aware of it, reconciling these differences is part of our speech recognition faculty. When listening to connected speech, our brain automatically takes phenomena such as coarticulation and elision (and there are others too) into account in order to process and understand what is being said. If synthetic speech were created without taking these phenomena into account, the result would sound so unnatural as to be almost unintelligible. This is why speech cannot be synthesized by simply pasting together recorded words.
So, how do we go about creating natural-sounding synthetic speech? We use a method of pasting together clips, taken from a database that we generate from recorded sentences. But the clips we use are not words. They are sometimes larger and sometimes smaller than single words. They can be whole sentences, or they can be as small as half of a speech sound (called a half-phone). We call the half-phones stored in the database units. All the units in the database are meticulously labeled. When our text-to-speech software is given a text to synthesize, it selects which strings of units, or clips, should be taken from the database to produce the best-sounding result. These clips are then strung together to form new speech. If, by chance, the text to be synthesized contains a sentence which was also among the recorded sentences with the exact same words, the software can clip the whole sentence. Of course, that sentence will sound 100% natural in the text-to-speech output. More often, smaller clips have to be used, often as small as individual units.
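To make the idea of a labeled unit inventory a little more concrete, here is a minimal Python sketch. It is only an illustration: the names HalfPhone and to_half_phones, the "L"/"R" half markers and the phone labels are assumptions made for this example, and they do not describe the actual format of the ReadSpeaker database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HalfPhone:
    """One unit of a (hypothetical) database: half of a speech sound."""
    phone: str        # label of the speech sound, e.g. "ae"
    half: str         # "L" = first half, "R" = second half
    sentence_id: int  # recorded sentence the unit was cut from
    index: int        # position of the unit within that sentence


def to_half_phones(phones, sentence_id):
    """Split each phone of a recorded sentence into two labeled units."""
    units = []
    for phone in phones:
        for half in ("L", "R"):
            units.append(HalfPhone(phone, half, sentence_id, len(units)))
    return units


# The word "cat" as three phones becomes six half-phone units.
units = to_half_phones(["k", "ae", "t"], sentence_id=0)
print([f"{u.phone}.{u.half}" for u in units])
# ['k.L', 'k.R', 'ae.L', 'ae.R', 't.L', 't.R']
```

A clip is then simply a run of consecutive units from the same recorded sentence, which is why each unit in this sketch remembers its sentence and position.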
As explained above, to achieve a natural-sounding result, each sound has to be extracted from a recorded text where it occurs in the appropriate context. The software looks in the database for a unit, i.e. a half-phone (half a consonant sound or half a vowel sound), that had the same context in the originally recorded sentences as we now need for the new text. In other words, it looks for a unit that was recorded with the same properties in terms of degree of prominence, stress, and position within the word, syllable, and phrase, and which had equal or similar neighboring sounds. Wherever possible, the software will try to choose sequences of units that were originally recorded together, i.e. the clips will have more than one unit. The better the match the software finds in the database, the smoother the joints between the clips will be and the more fluent the resulting synthesized speech.
Figure 1: Schematic representation of the use of half-phones in text-to-speech software
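The selection step described above can be sketched in Python as a search for the cheapest path through the candidate units. This is a simplified illustration under stated assumptions, not ReadSpeaker's implementation: the names Unit, label_units, target_cost, join_cost and select_units are invented for the example, the context is reduced to stress and the two neighboring sounds, and every mismatch or non-adjacent joint simply costs 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    label: str      # half-phone label, e.g. "ae.L"
    before: str     # neighboring unit before it ("#" = silence)
    after: str      # neighboring unit after it
    stressed: bool
    sent: int       # recorded sentence id (-1 for a target description)
    pos: int        # position within that sentence


def label_units(labels, stressed=True, sent=-1):
    """Attach context labels to a sequence of half-phone labels."""
    padded = ["#", *labels, "#"]
    return [Unit(padded[i + 1], padded[i], padded[i + 2], stressed, sent, i)
            for i in range(len(labels))]


def target_cost(target, unit):
    """Count the context properties on which the unit differs from the target."""
    return ((target.before != unit.before) + (target.after != unit.after)
            + (target.stressed != unit.stressed))


def join_cost(a, b):
    """Units that were recorded next to each other join for free."""
    return 0 if (a.sent == b.sent and b.pos == a.pos + 1) else 1


def select_units(targets, database):
    """Dynamic-programming search for the cheapest sequence of units."""
    candidates = [[u for u in database if u.label == t.label] for t in targets]
    if any(not c for c in candidates):
        raise ValueError("no recorded unit for at least one target")
    # Each layer maps a candidate to (cheapest cost so far, predecessor).
    layers = [{u: (target_cost(targets[0], u), None) for u in candidates[0]}]
    for t, cands in zip(targets[1:], candidates[1:]):
        prev, layer = layers[-1], {}
        for u in cands:
            best = min(prev, key=lambda p: prev[p][0] + join_cost(p, u))
            layer[u] = (prev[best][0] + join_cost(best, u) + target_cost(t, u), best)
        layers.append(layer)
    # Trace the cheapest path back from the last layer.
    unit = min(layers[-1], key=lambda u: layers[-1][u][0])
    path = [unit]
    for layer in reversed(layers[1:]):
        unit = layer[unit][1]
        path.append(unit)
    return path[::-1]


# Two tiny "recordings": the end of b + "at" and the start of "cad".
database = (label_units(["b.R", "ae.L", "ae.R", "t.L"], sent=0)
            + label_units(["k.R", "ae.L", "ae.R", "d.L"], sent=1))
targets = label_units(["k.R", "ae.L", "ae.R", "t.L"])  # part of "cat"
chosen = select_units(targets, database)
print([(u.sent, u.pos) for u in chosen])
# [(1, 0), (1, 1), (0, 2), (0, 3)]
```

In this toy run the search takes the first two units from recording 1 and the last two from recording 0, so the result consists of two clips of two units each, with a single joint between them. Each half of the vowel comes from the recording in which its outer neighbor matches the new text.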
When we succeed, synthesized words sound natural, both in isolation and as part of a sentence, and listeners never notice the fact that they sound slightly different. Listeners will only notice when something’s not right with the sound of a word. The first thing to check then is how the text-to-speech software pronounces the word in isolation, in case it doesn’t “know” the correct pronunciation. This is unlikely, except maybe in the case of some foreign word or an unusual name, and such errors are usually clearly audible in the sentence context already. More often, the word will sound fine when synthesized on its own.
Customers often check this for themselves. Then the customer will come to us with the question: “Why does this word sound good when it is read on its own, yet bad when it occurs in a sentence?” When the sound of a synthesized word in a sentence is wrong in some way, this can have several causes. It can sound unnatural because a vowel is too short or too long, or because a syllable is stressed too much or too little. Or the transition from one sound to the next can sound jerky. We call this a glitch. In all cases, the probable cause is a misfit of the selected speech clips. Sometimes perfectly matching clips are simply not available in the database, and sometimes the software made an unfortunate choice. In order for us at ReadSpeaker to find the cause of the problem and improve the pronunciation, we need to investigate not just the word, but the exact and complete sentence where the error occurs, and compare all the contextual parameters of the speech clips selected by the software with those of the sentence being synthesized. If you are interested in reading related articles on how text-to-speech software converts text to voice, here are some links: