Using Female Voices in Speech Synthesis

International Women’s Day was Sunday, 8 March. This year’s theme was “Make It Happen.” Around the globe, events celebrated how far women’s rights have come, while making clear the need to keep fighting for equal rights around the world.

At ReadSpeaker, in honour of International Women’s Day, we’d like to discuss something a bit nearer and dearer to our organisation’s goals: female voices in text-to-speech.

When looking at the voices available for text-to-speech, a pattern becomes evident: female voices are more common than male voices, especially for languages on which less development has been done. We pride ourselves at ReadSpeaker on offering more than 100 different voices in over 40 languages. For some of these languages, however, a male voice is not part of our current portfolio; a female voice is the only option.

Whether one would rather listen to a male or a female voice is, of course, an individual preference. These preferences may also depend on certain aspects of the voice. For example, one may overwhelmingly prefer a female voice, but only if the voice is calm, controlled, and rather level in terms of tone.

However, there are good reasons why female voices are more common in certain languages for text-to-speech. Our physical anatomy leads to major differences in our voices. Males and females have vocal folds of different sizes, resulting from the difference in size of the larynx between the sexes. The male vocal folds are between 17mm and 25mm in length, while the female vocal folds are between 12.5mm and 17.5mm in length (Source: Thurman et al. 2000).

This anatomical difference leads to a difference in average pitch. The average pitch (fundamental frequency, or f0) for a standard male voice is approximately 125Hz and for a standard female voice, approximately 200Hz (Source: National Center for Voice and Speech). The f0 of an adult male will range from 85 to 155Hz, and that of an adult female from 165 to 255Hz (Source: Michigan State University). The range of the adult female voice is therefore wider than that of the adult male: 90Hz compared to 70Hz.
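
The comparison of the two ranges is simple subtraction, and it can be checked with a short Python sketch. The names F0_RANGES_HZ and f0_span are our own, chosen purely for illustration; the figures are the ones quoted above.

```python
# Adult f0 ranges in Hz, as quoted in the text (Michigan State University).
# The dictionary and function names are illustrative assumptions.
F0_RANGES_HZ = {
    "male": (85, 155),
    "female": (165, 255),
}


def f0_span(low_hz, high_hz):
    """Return the width of an f0 range in Hz."""
    return high_hz - low_hz


# Width of each range: upper bound minus lower bound.
spans = {voice: f0_span(*bounds) for voice, bounds in F0_RANGES_HZ.items()}

for voice, span in spans.items():
    print(f"{voice}: {span} Hz")
```

Running this prints a span of 70 Hz for the male range and 90 Hz for the female range, matching the figures in the paragraph above.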

Generally speaking, this wider range makes it easier to create a synthesized voice for use in text-to-speech. The larger differences in fundamental frequency make distinguishing between phonemes easier, and therefore the synthesizer is better able to generate speech based on these identified phonemes and chunks of text (unit selection synthesis). For more information on how speech is synthesized from text, please refer to our previous article, “How text to speech is made”. Simply put: “female voices are more economically interesting in the context of text-to-speech synthesis” (Source: Dutoit 2001).

