How hard would it be for a development studio to create a text to speech (TTS) engine that mimics a fictional character? What about making an engine that could do a variety of voices?* Assume you’ve got full cooperation from the voice actor who does the voice for that character.
*I know many GPS devices with voice directions offer you a variety of voices, but I don’t know if that’s a bunch of different engines (which would make changing to a different voice a resource-intensive operation) or just different profiles for the same engine.
I use Natural Reader and it does a pretty good job. If I want different voices I would have to pay for them.
I’ve done a little bit of research into text to speech, mostly with the idea of incorporating different voices into some of my own software. I didn’t get very far, because I wanted a variety of voices that were easily selectable/configurable and sounded reasonably close to a real human, and the software had to be free or cheap. I wasn’t able to find anything that fit those requirements.
What I found was that there are cheap and simple text to speech generators that are fairly easy to customize with new voices, but they all sound extremely robotic. On the other end of the scale, there are fairly realistic voices that apparently require a huge amount of work to create.
For example, if you look at something like Festival Text To Speech, you see that they took 2 hours of recorded speech to generate their “Alan” voice and 3 hours of recorded speech to generate their “Nina” voice. You also find that their documentation for the project, like that of many Linux projects, is horrible and lacking in detail. I get the impression, though, that a lot of computer processing and human labor goes into turning those hours of recordings into a usable TTS voice.
For a commercial project, I found this one using Google:
https://www.cereproc.com/en/support/faqs/voicecreation
They have this to say:
They also claim that they can build custom TTS voices with as little as 40 minutes of recorded and transcribed speech, but they recommend 4 hours of speech for good quality results.
So, basically, it can be done, but it’s going to be expensive. You need at least 4 hours of your voice actor’s time (as well as the recording studio’s time), transcription services, and then the services of the TTS provider.
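To put rough numbers on that, here is a minimal Python sketch of the cost breakdown above. Every rate in it is a made-up placeholder, not a real quote from any actor, studio, or TTS provider, and the `VoiceBuildEstimate` class and its field names are just my own illustration. The only figure taken from the thread is the 4 hours of recorded speech.

```python
from dataclasses import dataclass


@dataclass
class VoiceBuildEstimate:
    """Rough cost model for one custom TTS voice (hypothetical helper)."""
    recorded_hours: float                   # usable speech; 4 is the recommended figure above
    session_hours_per_recorded_hour: float  # assumption: retakes and breaks stretch the session
    actor_rate_per_hour: float              # placeholder rate
    studio_rate_per_hour: float             # placeholder rate
    transcription_rate_per_hour: float      # placeholder, charged per recorded hour
    provider_fee: float                     # placeholder flat fee for building the voice

    def session_hours(self) -> float:
        # Time the actor and the studio are actually booked for.
        return self.recorded_hours * self.session_hours_per_recorded_hour

    def total(self) -> float:
        # Actor and studio are billed for the whole session,
        # transcription only for the speech that is kept.
        booked = self.session_hours() * (
            self.actor_rate_per_hour + self.studio_rate_per_hour
        )
        transcription = self.recorded_hours * self.transcription_rate_per_hour
        return booked + transcription + self.provider_fee


# All numbers below except recorded_hours are invented for illustration.
estimate = VoiceBuildEstimate(
    recorded_hours=4.0,
    session_hours_per_recorded_hour=2.0,
    actor_rate_per_hour=100.0,
    studio_rate_per_hour=50.0,
    transcription_rate_per_hour=60.0,
    provider_fee=1000.0,
)
print(f"Booked session time: {estimate.session_hours()} hours")
print(f"Estimated total: {estimate.total()}")
```

Plug in real quotes and multiply by the number of characters you want, since each voice needs its own recordings.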