Vox Daily The Official Voices.com Blog

Will Text-To-Speech Ever Replace Custom Voice-Over Recordings?


By Stephanie Ciccarelli

March 5, 2013



Everywhere you go, it seems there is a voice talking to you. New technologies incorporate the human voice in many ways, and many of these applications rely on text-to-speech (TTS).

Do you think that TTS poses a threat to custom voice recordings? Why or why not?

Be sure to state your case in today's VOX Daily!

I May Be Biased, But...

Recently I had an opportunity to defend the intrinsic value and necessity of custom voice-over recordings in a debate over whether or not text-to-speech would ever fully replace the need for actors to record custom voice-overs in studio.

As the co-founder of a company that specializes in recording spoken-word messages, I am, it is safe to say, more than a little biased. That said, there are many reasons why text-to-speech will never fully replace the human voice.

In a nutshell, these factors include:

- The sheer number of different languages, dialects, accents, vocabularies and manners of speech (linguistics)
- The complexity of cultural, historical and societal nuance, and the understanding of context (social)
- The need to customize or brand for corporate purposes by picking a specific voice (customization)
- The cognitive ability to know and see the 'big picture' when telling a story or making an argument (suspension of disbelief)
- Artistic direction that can be interpreted and internalized, making a read more believable (performance)

A Closer Look

When you consider that there are 6,800+ spoken languages in use today, the prospect of text-to-speech replacing the human voice, with its delivery, correct pronunciation, tone, nuance and so on, is difficult to comprehend, let alone achieve.

There are so many things that a computer program cannot infer or know. Information, when interpreted, could be expressed in myriad ways depending on the situation, context and audience. It is up to the individual performing the script to properly assess what it is that they are reading and to know how best to convey that information to the intended audience. This is what makes custom voice-overs so effective.

The voice artist uses discernment and all of the tools at their disposal to act like a detective, as it were: becoming educated on the subject, developing a character and determining how best to present the message to those meant to hear it. There are unspoken sentiments that the human voice can express in a performance, and these would not come across as effectively, artistically or technically, if TTS were the go-to solution.

Something else to consider is intent. An educated voice actor makes choices, whereas an untrained actor makes guesses. The actor draws on their own experiences (method acting) and combines those with the information in front of them to craft a read that is both accurate and persuasive (emotion). The computer program could be considered untrained in the sense that the selections it makes are based upon formulas and not upon heart knowledge. Head knowledge is important, but heart knowledge is critical to comprehending and communicating effectively.

What Do You Think?

Will text-to-speech ever be on par with custom voice-over recording?

Looking forward to hearing from you!

Best wishes,




    Couldn't agree more, Stephanie! For a lot of the "press 1" stuff, TTS may be satisfactory. However, not only do you make great arguments from the voice actor's side (I've done it for years!) but I've learned that for the listener, inflection and understanding are key to a) keeping them engaged and b) getting the proper response. And isn't that the point?


      Any musical instrument under the sun has been sampled, and entire symphony orchestras can come out of a can. Yet, people are still buying real Steinways, and there are plenty of musicians who make a very decent living.

      Do I think that we’ll ever see the time when Stravinsky’s “Rite of Spring,” as performed on virtual instruments, will win a Grammy? Will a laboratory ever be able to produce a recording of Bach’s solo cello suites that rivals the depth of Yo-Yo Ma’s interpretation?

      No way!

      There’s still hope for the most subtle, most flexible, most surprising and unique of all instruments: the human voice.

      Here’s the rub: robots have a hard time emoting. They can patiently and dispassionately guide you to the next exit, but they have a hard time expressing even the most basic of feelings such as fear, anger, hurt, guilt and… love.

      The inimitable subtleties of the human voice can leave us... speechless.


        I don't think automated speech synthesis will ever replace a human voice completely.

        I can however foresee a future where the speech synthesis gets so good that it *can* convey emotion, and imitate accent well enough that for mass media production it will be a cheaper, easily customisable alternative to human voice-overs.

        It is not too much of a stretch to see the same split between synthesized and live-acted voice as we have today between synthesized and live music. The live version will be considered more artistic - better quality for those who can appreciate the difference - and the mass media will churn out synthesized but popular garbage.

        That said - I very much doubt that will happen in *any* of our lifetimes. I'm estimating at least 80-100 years out for this scenario.

        Posted by:
        • Megan McVey
        • March 5, 2013 3:03 PM

          TTS won't completely replace the human voice, IMO. That being said, I have actually seen a recording company called Learning Ally, which recorded materials for the blind and deaf, shut its doors earlier this year because of products containing TTS such as Dragon becoming mainstream.

          However, the voice-over industry comprises many different avenues, such as animation, books on record, and documentaries, and areas such as these need clear, concise emotion to be effective. Therefore, if you replace a human voice with TTS in these areas, all you would get is a monotonous drone that largely lacks the emotion you would receive from a human voice.

          As a trained actor and voice actor, I think you make a good point about how we make choices in portraying emotion, Stephanie. Computers, even those with AI, don't have that ability. This alone is enough for me to remain confident that our voices will continue to be the standard for decades to come.


            I don't think I'll be able to trust a robotic voice coming out of TTS technology. I would still prefer the warmth and real emotion in a real person's voice.

            Posted by:
            • Trina
            • May 19, 2013 10:08 PM
