Go to Menu
Celebrating 25 Years of Voice! 🎉

e-Learning Voices: Text to Speech or Voice Actors?

April 8, 2022 by Amy Foxwell
eLearning Voices: Text to Speech or Voice Actors?

For companies building out their e-learning curriculum, we can’t stress enough the importance of adding an audio component to materials. Learning and development (L&D) professionals know that the inclusion of audio is an integral part of course design and delivery. Audio enables learners to more effectively absorb key points or entire sections by having them read aloud.

Listening acts as an enhancement for many learners, like those with dyslexia and visual impairments. It can also help others who have low levels of literacy or who are learning a second language. Even busy professionals or individuals who travel frequently may find it useful to have access to another version of content beyond the written word. After all, studies show that bimodal forms of delivery (for example, both audio and visual) help with information recall, reading comprehension, memorization, and decoding skills.

So when the decision is made to integrate audio into your e-learning materials, where do you start?

You have two choices: Hiring voice actors to create your e-learning voice over, or leveraging text-to-speech (TTS) technology. In most cases, TTS is the better choice—and research backs up this claim.

e-Learning Voices: Challenges with Human Voice Actors

Traditionally, companies have hired professional voice talents to speech-enable their content. This is a complicated prospect: e-learning creators must consider and align everything from cost to talent availability, accent, gender, and language.

At the end of the day, all these considerations add up to a hefty cost, both in terms of time and money. The expense and time commitment involved in using a live voice talent may exceed your company’s comfort level. This is especially common when you require multiple voices and languages.

Some companies take a different approach by utilizing in-house resources for their voice-over work, but this has downfalls too. Companies experience employee turnover and employees get sick. Illness, unexpected time off, and other human factors can make your in-house voice talent unavailable for scheduled voice recordings—delaying the project in the process. Plus, not every company has the ability to provide translated materials to their multilingual workforce through a self-service approach.

Benefits of TTS e-Learning Voices Vs. Voice Actors

The smartest option for companies looking to minimize cost and maximize consistency, quality, and ease is TTS. These synthetic voices allow you to feature a consistent, recognizable voice across your e-learning material.

Text to speech allows you to edit your audio whenever you need, without the extra expense of booking a studio and a voice actor. And with recent developments in the field of speech synthesis, today’s TTS even allows for adjustments in pitch, stress, pace, and pauses in the audio. These developments help your human-like voice truly emote, providing an engaging experience as your content is read aloud.

A recent study looked into why many instructional designers are opting for TTS rather than using human voice talent in e-learning materials. The reasons varied, but overall the consensus was that TTS is simpler to maintain and update. As one respondent who had used human voice talent in the past relayed:

“We found that the added production time, and having to schedule around voice-over, plus re-doing entire segments for one small correction, (e.g. to get the sound to match), was prohibitive both cost- and time-wise.”

With TTS, the process of voice creation for e-learning materials remains firmly in your control.

Creating voiceovers for elearning

Within e-learning settings, TTS proves to be invaluable in a variety of ways, allowing you to:

  1. Produce multilingual content for your diverse workforce.
  2. Integrate multiple voices at a fraction of the cost of hiring multiple voice actors.
  3. Develop scripts to inform the creation of final products recorded by human actors.
  4. Stream audio or integrate text to speech locally into mobile applications, installed software applications, or hardware devices so that anyone can digitally listen to your text content with a click of a Listen button.

Of course, one important question remains: Is a TTS e-learning voiceover as effective as a human voice recording?

Multiple researchers have investigated this question, and their results may surprise traditionalists.

Studies Show Modern TTS Voices Match Human Voices in Learning Scenarios

In 2017, researchers Scotty D. Craig and Noah L. Schroeder published “Reconsidering the voice effect when learning from a virtual human.

In the study, the researchers used Microsoft’s classic speech engine as a baseline and NeoSpeech’s (now under the ReadSpeaker brand) “Kate” voice as the representative of a modern engine. A human voice was used as a high-end control. Animated “virtual humans” used each of these voices to present information to a sample of learners.

The study found that the modern TTS voice performed as well or better than the human voice. One of the study’s conclusions makes this clear:

“The modern voice engine produced significantly more learning on transfer outcomes, had greater training efficiency, and was rated at the same level as an agent with a human voice for facilitating learning and credibility while outperforming the older speech engine. These results call into question previous results using older voice engines and the claims of the voice effect.”

This finding challenges earlier assumptions about the comparative effectiveness of human and synthetic speech in education, a now-discredited argument researchers describe as the “voice effect.” Here’s what that means—and why advanced TTS technology has rendered the argument obselete.

Updating the “Voice Effect” in e-Learning Environments

The “voice effect,” or “voice principle,” is a theory by psychology professor Richard E. Mayer of the University of California, Santa Barbara. As Mayer writes in the Cambridge Handbook of Multimedia Learning:

“The voice principle is that people learn more deeply when the words in a multimedia message are spoken in a human voice rather than in a machine voice.”

Mayer found compelling evidence to support this conclusion. However, four of the studies that support the voice effect were published at least 10 years ago. Since then voice technology has been advancing rapidly, and text-to-speech software has greatly improved.

Early research by Mayer et al.—most notably in the 2003 study “Multimedia learning: Role of speaker’s voice”—did indeed show that human voices outperformed synthetic voices. However, the results of a similar 2012 study by Mayer and DaPra, using more advanced voice technology, indicated no differences in learning between groups that had agents with human voices or those with synthetic voices. This suggests that the voice effect no longer exists.

Mayer, the founder of the voice effect theory, isn’t the only researcher who’s found that it might be outdated.

We’ve already mentioned Craig and Schroeder’s “Reconsidering the voice effect when learning from a virtual human,” from 2017. The following year, Craig and Schroeder published a follow-up study, “Text-to-Speech Software and Learning: Investigating the Relevancy of the Voice Effect.

Again, they compared the performance of an advanced TTS voice with an older speech engine and a human speaker. Again, the researchers found no statistical differences in terms of learner “perceptions, learning outcomes, or cognitive efficiency measures.”

As Craig and Schroeder conclude:

“Our results imply that software technologies may have reached the point where they can credibly and effectively deliver the narration for multimedia learning environments.”

As we publish, nearly five years have passed since this conclusion. During that time, TTS technology has continued to leap forward. Today’s TTS voices are even more lifelike than the ones Craig and Schroeder studied.

These findings have wide-reaching repercussions for educational institutions and instructional designers. Dynamically updated TTS solutions are more cost-effective, and much faster, than recording human voices. If they work just as well (or better), the e-learning designer’s choice is clear.

Ongoing studies are now showing that voice engines have reached an acceptable level of performance for use within learning technologies. These findings point to the opportunity to use synthetic voices to develop more dynamic and less expensive learning technologies for improved learning outcomes. In other words, when you design e-learning materials, opt for TTS voices over human voice actors.

TTS e-Learning Voices from ReadSpeaker

At ReadSpeaker, we have a passion for developing the most advanced, high-quality TTS voices. In fact, third-party industry observers rate ReadSpeaker TTS voices as being the most accurate on the market.

We use a multi-phase process to generate voices that are near-to-human in sound. Our voices are based on real recorded speech, which is then sampled and processed. A rich mark-up is created, in which each word, phoneme, and stress is annotated, using a powerful combination of artificial intelligence and machine learning technologies. Our state-of-the-art methodologies are also augmented with expert linguistic capability.

In addition to offering top-of-the-line TTS, which includes over 50 languages and 200 voice options to serve the unique needs of our clients, ReadSpeaker TTS provides:

  • Default and customer-specific pronunciation dictionaries
  • Prosody adjustments and SSML support
  • Ongoing assistance from on-staff linguists to continually improve TTS pronunciation

Want to see what ReadSpeaker can do for your e-learning content? We offer speech production, online web reading, and embedded text-to-speech solutions to accommodate any e-learning project.

Related articles
Start using Text to Speech today

Make your products more engaging with our voice solutions.

Contact us