Adam Twardoch
There are already ways to push the model to do some things. For example, I can write
He whispered “Hello”
and many voices will whisper. I think a good solution would be not to introduce special markup but simply the ability to mark certain portions of the script as NON-SPOKEN. So with something like {He whispered} “Hello”
the language model would interpret everything as before, but speech would only be generated for the portions not surrounded by the special braces. The surrounded portion would be an "inner voice". This way we could still use natural language to prompt.
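A minimal sketch of how such a pre-processor might work, assuming the {…} brace syntax proposed above (the function and the syntax itself are illustrative, not an existing ElevenLabs feature): the full text, directions included, goes to the language model for context, while only the remainder is synthesized.

```python
import re

# Hypothetical pre-processor for the proposed {non-spoken} braces: the full
# script (directions included) is kept for the language model's context,
# while only the unbraced remainder is sent to speech synthesis.

BRACED = re.compile(r"\{([^{}]*)\}")

def split_script(script: str) -> tuple[str, str]:
    """Return (context_text, spoken_text) for a script using {...} directions."""
    context = BRACED.sub(lambda m: m.group(1), script)  # unwrap directions into plain text
    spoken = BRACED.sub("", script)                     # drop them from the audio
    return context, " ".join(spoken.split())            # tidy leftover whitespace

context, spoken = split_script('{He whispered} "Hello"')
print(context)  # He whispered "Hello"
print(spoken)   # "Hello"
```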
Christopher Perry
I agree that the ability to define emotion, whether by voice tags or some other method, would be very helpful, and sooner rather than later. This is the last remaining piece needed to make this perfect. I have been a content creator for years and have tried many TTS platforms, but I must say, this one is the best I have found. The one and only complaint I have is that I often waste credits regenerating a clip three or four times to get the right emotion; this would solve that. I read somewhere that there were plans to make the voice laugh or cry or express other emotions in a similar fashion. That would be absolutely amazing, as it is something no other TTS platform has been able to master. As someone else stated on this thread, I would be a lifetime subscriber as well. Thanks!
CarcomCars
THIS FEATURE IS A MUST! I would be a lifetime paying subscriber if you had this. Alright, here are some suggestions of emotions and expressions to help you start (a sketch of how these might be encoded follows the list):
With Strength Levels:
- Happiness (Level 1 = Cheerful, Level 2 = Joyful, Level 3 = Ecstatic)
- Sadness (Level 1 = Melancholic, Level 2 = Sorrow, Level 3 = Crying)
- Anger (Level 1 = Irritated, Level 2 = Enraged, Level 3 = Furious)
- Fear (Level 1 = Apprehensive, Level 2 = Anxious, Level 3 = Panicking)
- Calmness (Level 1 = Whisper, Level 2 = Collected, Level 3 = Soft Spoken)
- Aroused (Level 1 = Relaxed, Level 2 = Excited, Level 3 = Intense)
Future Considerations:
- Stutter (Repeating beginning parts of a word)
- Drowsy (Lethargic & sleepy tone)
- Injured (Very heavy breathing in between 1-2 words and loud delivery)
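A rough sketch of how this taxonomy could be encoded; the emotion names and level labels come from the list above, while everything else, including the function name, is illustrative:

```python
# Illustrative encoding of the suggested emotion/strength taxonomy.
# This is not an existing ElevenLabs API; it only shows how a level
# number could resolve to a concrete style label.

EMOTIONS = {
    "happiness": ["cheerful", "joyful", "ecstatic"],
    "sadness":   ["melancholic", "sorrow", "crying"],
    "anger":     ["irritated", "enraged", "furious"],
    "fear":      ["apprehensive", "anxious", "panicking"],
    "calmness":  ["whisper", "collected", "soft spoken"],
    "aroused":   ["relaxed", "excited", "intense"],
}

def style_label(emotion: str, level: int) -> str:
    """Map an emotion and a strength level (1-3) to its style label."""
    return EMOTIONS[emotion.lower()][level - 1]

print(style_label("Anger", 2))  # enraged
```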
Freyja
CarcomCars: I would add laughter and being drunk
Jonathan
Merged in a post:
Projects desperately needs emotion markup
moshe
Projects makes life much easier than before. It is very handy to be able to specify multiple voices. But there is still no way to assign an emotion to specific paragraphs the way a voice can be assigned. I find myself going back to the original voice generator for specific sentences, adding a bunch of generation prompts like "She was outraged" to get the correct emotional coloration. It is very annoying. Right where we specify voice parameters, please add a text box for us to type a natural-language emotion tag, and have the AI generate whatever emotion the tag specifies.
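A sketch of what such per-paragraph metadata might look like; the block structure and the `voice`/`emotion` field names are hypothetical, not part of the actual Projects format:

```python
# Hypothetical per-paragraph script structure for Projects: each block keeps
# the voice assignment it has today and gains a free-form natural-language
# emotion tag, as requested above.

script = [
    {"voice": "Rachel", "emotion": "outraged, barely contained",
     "text": "You promised me this would never happen again."},
    {"voice": "Adam", "emotion": "calm, apologetic",
     "text": "I know. I'm sorry."},
]

for block in script:
    # A real implementation would pass block["emotion"] to the generator;
    # here we just show the per-paragraph assignment.
    print(f'[{block["voice"]} | {block["emotion"]}] {block["text"]}')
```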
Dave M.
Just to add to this, inflection tags could also include attributes ... such as strength.
For example:
[angry strength="5"]I know[angry].
[frustrated strength="3"]I know[frustrated].
Where, "I know" can be made to sound very angry or just moderately frustrated.
Voice Labs
This one is a no-brainer, because ElevenLabs's voices sound "robotic" or "artificial" to 94.8% of our test groups (over 200 people in total, based in North America, the UK, and NZ).
Please see good examples here: https://google-research.github.io/seanet/soundstorm/examples/
Benedict Thienpont
Voice Labs: I didn't have the impression that ElevenLabs's voices sound robotic, though the ability to generate emotions (including whispering) via tags would be very welcome indeed. I listened to the audio on the linked site and like the quality of the dialogue interaction.
Chris S
Benedict Thienpont: Unfortunately, they do sound synthetic or “robotic”. Our auditory sense of emotion is stronger than our vision: people identify context and emotional state via emotion in a voice more than they do via facial expression. Not only is emotion crucial to convey intent or even storyline, it is a requirement to convince human cognition that a voice is natural. An ERP study demonstrated a neural signature of implicit emotion decoding within 200 ms after the onset of an emotional sentence, suggesting that emotional voices can be differentiated from neutral voices within a 200 ms timeframe. Thus our brain also recognizes that something is wrong with the current (emotionless) synthetic voices in less than 200 ms. It sounds “robotic” because it lacks the vocal codex of humanity... emotion. It also sounds “robotic” because certain brain activity that is usually triggered by human emotion remains completely stagnant, subconsciously indicating to the listener that something is wrong or unnatural with what they are hearing. The brain selectively responds to emotional vocal cues embedded within a stream of neutral vocal utterances. This is also true for animals and infants: even they interpret our commands via emotion channels more than through an understanding of language.
treeantsan
I've discovered for myself, after generating thousands of lines of dialogue (usually a sentence or a paragraph at most, given my use case of having multiple characters speaking to each other), the trick of adding descriptors as if I were writing a book.
Such as, if I need an excited sounding voice, I'll write:
"Hey there, what's your name?" he asked excitedly.
Then I take the generated audio and cut out the "he asked excitedly" bit, and it works just fine for me.
This comes with two pain points. First, the descriptive text costs money; it can be especially annoying if you generate something very short (a word or two) but need a very specific take and are willing to generate multiple times, since most of your credits go into audio intended to be cut out.
Second, you have to manually remove the descriptive text from the audio.
It immediately made me think that a quick hack would be to add a secondary text input area that informs, with a natural-language description, the intonation you desire.
I mean, that's a broad brush, and I'm sure the team is already on it, but it's a naive idea I had for ya.
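If the synthesis endpoint can return character-level timestamps (some TTS APIs do; `tts_with_timestamps` below is a hypothetical stand-in, not a confirmed ElevenLabs call), the manual cutting step could be automated by mapping the quoted dialogue back to its audio span:

```python
# Sketch: automate the "cut out the descriptor" step, assuming a hypothetical
# tts_with_timestamps() that returns audio plus per-character start/end times.

def spoken_span(script: str, dialogue: str, char_starts, char_ends):
    """Return the (start, end) audio times covering only the quoted dialogue."""
    i = script.index(dialogue)      # where the quoted line begins in the text
    j = i + len(dialogue)           # and where it ends
    return char_starts[i], char_ends[j - 1]

script = '"Hey there, what\'s your name?" he asked excitedly.'
dialogue = '"Hey there, what\'s your name?"'

# Fake per-character timings for demonstration (0.05 s per character);
# a real run would take these from the synthesis response.
char_starts = [0.05 * k for k in range(len(script))]
char_ends = [0.05 * (k + 1) for k in range(len(script))]

start, end = spoken_span(script, dialogue, char_starts, char_ends)
print(f"keep audio from {start:.2f}s to {end:.2f}s")  # descriptor tail trimmed
```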
Yanis Lukes
Now that 11labs has implemented pauses, it means they could detect when a voice description is coming up and keep it out of the audio sample.
Rayan
planned
Voice Labs
Rayan: Can you tell us when this will be implemented? Thank you.
Nerd Militia Entertainment
I really do want something like this. Maybe take a look at all the different types of content on YouTube, Spotify, and Facebook and add some easy-to-use presets to set the tone and emotion of the voice. I really want something that understands sarcasm.
Josh L
Nerd Militia Entertainment: This would be amazing... you get to a part where it's supposed to be angry, and you wrap it in [angry][/angry] tags.