AI-synthesized speech that actually sounds like a human
**Disclaimer: Joe Rogan was selected as a model for RealTalk for demonstrative purposes only. The use of Rogan's voice in this project is not an endorsement by Dessa of his opinions or content.**
What is RealTalk?
"Hey, Joe Rogan, it's me - Joe Rogan!"
Image Source: Vivian Zink/Syfy/NBCU Photo Bank/NBCUniversal via
In May 2019, we recreated the podcaster Joe Rogan's voice with AI to raise awareness about synthetic media, and in particular, AI-synthesized audio. Synthetic media is an encompassing term for a growing collection of AI-generated and AI-manipulated images, text, audio, and video. Colloquially, synthetic media works are often referred to as deepfakes.
Our replica of Rogan’s voice was produced using a text-to-speech deep learning system we built called RealTalk, which generates lifelike speech using only text as its input. To our knowledge, the AI-synthesized voice we created is the most lifelike example to date.
Because of this technology's significant risks, we've decided not to release the RealTalk model, code, or data publicly.
A hockey team of chimps
Why would we do this?
Source: New York Times
Source: NBC News
A key way to combat the risks inherent in new technologies like synthetic media is to raise public awareness about their existence.
In 2019, the world was just starting to wake up to deepfakes, and in particular, their risks. Deepfake videos were seemingly cropping up everywhere, and every media outlet had a story about their emergence. One format that had received far less attention, though, was audio.
We wanted to alert the public that this variety of deepfake came with a distinct set of risks, especially in terms of scams. What if, for example, a scammer could imitate the voice of a family member and convince someone to deposit their life’s savings into the scammer’s account?
Our goal to raise public awareness about synthetic audio was also a big reason why we chose Joe Rogan as our muse. Because he is one of the most famous podcasters in the world, we knew we had a good chance of people taking notice.
So, how does RealTalk work?
"Being a robot has its benefits..."
Synthetic media examples like this one are still very hard to make, but it won’t be that way forever. Here’s a high-level look at what goes into making it happen.
A recipe for RealTalk:
For deep learning models to perform well, you need a lot of data. This was another reason we picked Joe Rogan. When we set out to create the dataset, Rogan already had over 1,000 episodes of his podcast online.
For our final dataset, we used 10 hours of audio from Joe Rogan's show. In late-stage experiments, however, we found that training the model on as little as 2 hours of audio could still produce a pretty convincing facsimile.
Within the dataset, there are thousands of short audio clips and corresponding textual transcripts. We used this data to train the text-to-audio model we talk more about below.
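To make the shape of such a dataset concrete, here is a minimal sketch of clip and transcript pairs. It is only an illustration, not the RealTalk data format: the `Clip` class, the file paths, the clip lengths, and the `take_hours` helper are all hypothetical, and the example data is made up so that it adds up to the 10 hours mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    audio_path: str    # hypothetical path to a short audio file
    transcript: str    # the text spoken in that clip
    duration_s: float  # clip length in seconds


def total_hours(clips):
    """Sum the clip durations and convert seconds to hours."""
    return sum(c.duration_s for c in clips) / 3600.0


def take_hours(clips, max_hours):
    """Return the leading clips whose combined length fits within max_hours."""
    subset, budget = [], max_hours * 3600.0
    for clip in clips:
        if clip.duration_s > budget:
            break
        subset.append(clip)
        budget -= clip.duration_s
    return subset


# Made-up data: 3,600 clips of 10 seconds each add up to 10 hours.
clips = [
    Clip(f"clips/{i:05d}.wav", f"placeholder transcript {i}", 10.0)
    for i in range(3600)
]

# A smaller subset, mirroring the 2-hour experiment described above.
small = take_hours(clips, 2.0)
print(total_hours(clips), total_hours(small), len(small))
```

Keeping the duration next to each transcript makes it easy to cut a smaller training set by total length rather than by clip count.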
To synthesize speech that sounds close to the real thing, we actually built a system of several deep learning models.
Model 1: Predicting pronunciation
The first model transforms text into audio. It does the heavy lifting when it comes to predicting the way a particular person speaks. The model does this by learning the individual letters that make up words as well as the patterns between them.
One of the most striking things about this model is that it generalizes after training. This means that it can correctly pronounce words from outside the original dataset (how else do you think we got our faux Rogan to say ‘otolaryngology’?).
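One way to see why working at the level of letters helps with unseen words is a character-level encoding sketch. This is an assumption for illustration only: the article does not describe RealTalk's actual text front end, and the symbol set, the `encode` function, and the tiny `training_words` set below are hypothetical.

```python
import string

# Hypothetical symbol set: lowercase letters, space, and basic punctuation.
SYMBOLS = list(string.ascii_lowercase + " .,!?'-")
# ID 0 is reserved for characters outside the symbol set.
SYMBOL_TO_ID = {s: i + 1 for i, s in enumerate(SYMBOLS)}


def encode(text):
    """Map each character to an integer ID; unknown characters become 0."""
    return [SYMBOL_TO_ID.get(ch, 0) for ch in text.lower()]


# A made-up stand-in for words seen during training.
training_words = {"hey", "joe", "rogan"}

new_word = "otolaryngology"
ids = encode(new_word)

print(new_word in training_words)  # False: the word was never seen whole
print(0 in ids)                    # False: every letter still has a known ID
```

Because the input is a sequence of letters rather than whole words, a word that never appeared in the training transcripts is still made up entirely of symbols the model has seen. Whether the resulting pronunciation is correct depends on the patterns the model has learned, which this sketch does not cover.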
Meet the creators
The Dessa engineers behind RealTalk are Hashiam Kadhim, Rayhane Mama and Joe Palermo.