
AI-synthesized speech that actually sounds like a human

RealTalk


**Disclaimer: Joe Rogan was selected as a model for RealTalk for demonstrative purposes only. The use of Rogan's voice in this project is not an endorsement by Dessa of his opinions or content.**

What is RealTalk?

"Hey, Joe Rogan, it's me - Joe Rogan!"


Image Source: Vivian Zink/Syfy/NBCU Photo Bank/NBCUniversal via

In May 2019, we recreated the voice of podcaster Joe Rogan with AI to raise awareness about synthetic media and, in particular, AI-synthesized audio. Synthetic media is an umbrella term for a growing collection of AI-generated and AI-manipulated images, text, audio, and video. Colloquially, works of synthetic media are often called deepfakes.

 

Our replica of Rogan’s voice was produced using a text-to-speech deep learning system we built called RealTalk, which generates lifelike speech using only text as its input. To our knowledge, the AI-synthesized voice we created is the most lifelike example to date.

Because of this technology's significant risks, we've decided not to release the RealTalk model, code, or data publicly.

 
 

A hockey team of chimps


Why would we do this? 

Image sources: Imgur, New York Times, NBC News

A key way to combat the risks inherent in new technologies like synthetic media is to raise public awareness about their existence.

In 2019, the world was just starting to wake up to deepfakes and, in particular, their risks. Deepfake videos were seemingly cropping up everywhere, and every media outlet had a story about their emergence. One format most people hadn't really considered, though, was audio.

 

We wanted to alert the public that this variety of deepfake came with a distinct set of risks, especially in terms of scams. What if, for example, a scammer could convince a person that they were a family member, asking them to deposit their life’s savings into the scammer’s account? 

 

Our goal of raising public awareness about synthetic audio was also a big reason why we chose Joe Rogan as our muse. Because he is one of the most famous podcasters in the world, we knew we had a good chance of people taking notice.


So, how does RealTalk work?

"Being a robot has its benefits..."


Synthetic media examples like this one are still very hard to make, but it won’t be that way forever. Here’s a high-level look at what goes into making it happen.

A recipe for RealTalk:

The data

For deep learning models to perform well, you need a lot of data. This was another reason we picked Joe Rogan. When we set out to create the dataset, Rogan already had over 1,000 episodes of his podcast online.

 

For our final dataset, we used 10 hours of audio from Joe Rogan's show. In late-stage experiments, however, we found that training the model on as little as 2 hours of audio could still produce a pretty convincing facsimile.

Within the dataset, there are thousands of short audio clips and corresponding textual transcripts. We used this data to train the text-to-audio model we talk more about below. 
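To make the shape of that data a little more concrete, here is a minimal Python sketch of how clip and transcript pairs could be organized and sanity-checked before training. RealTalk's code and data were not released, so everything here is an assumption for illustration: the `Clip` record, the file names, and the duration thresholds are hypothetical and not taken from the actual system.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Clip:
    """One training example: a short audio clip and its transcript (hypothetical layout)."""
    audio_path: str    # illustrative file name, not a real RealTalk path
    transcript: str    # the text spoken in the clip
    duration_s: float  # clip length in seconds


def total_hours(clips: List[Clip]) -> float:
    """Total audio duration across all clips, in hours."""
    return sum(c.duration_s for c in clips) / 3600.0


def filter_usable(clips: List[Clip], min_s: float = 1.0, max_s: float = 15.0) -> List[Clip]:
    """Keep clips with a non-empty transcript and a duration inside [min_s, max_s].

    The thresholds are assumed values chosen only for this example.
    """
    return [
        c for c in clips
        if c.transcript.strip() and min_s <= c.duration_s <= max_s
    ]


if __name__ == "__main__":
    manifest = [
        Clip("clip_0001.wav", "Being a robot has its benefits.", 3.2),
        Clip("clip_0002.wav", "", 4.0),                      # dropped: no transcript
        Clip("clip_0003.wav", "A hockey team of chimps.", 42.0),  # dropped: too long
    ]
    usable = filter_usable(manifest)
    print(len(usable), "usable clips,", round(total_hours(usable), 4), "hours")
```

A manifest like this makes it easy to check how much audio a dataset really contains, for instance when comparing a 10-hour training set against a 2-hour one.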

The models

To synthesize speech that sounds close to the real thing, we actually built a system of several deep learning models.

Model 1: Predicting pronunciation

"OTOLARYNGOLOGY"

ow·tow·leh·ruhn·gaa·luh·jee

The first model transforms text into audio. It does the heavy lifting when it comes to predicting the way a particular person speaks. The model does this by learning the individual letters that make up words as well as the patterns between them.

 

One of the most striking things about this model is that it generalizes after training. This means that it can correctly pronounce words from outside the original dataset (how else do you think we got our faux Rogan to say ‘otolaryngology’?).
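One way to see why a letter-based model can handle unseen words is to look at how text might be turned into model input. The Python sketch below is an assumption for illustration, not RealTalk's actual preprocessing: the vocabulary, the `encode` and `decode` helpers, and the special tokens are all hypothetical. It shows that a word missing from the training transcripts can still be encoded, because it is built from the same small set of characters.

```python
import string
from typing import List

# Hypothetical character vocabulary: two special tokens, lowercase letters, basic punctuation.
VOCAB: List[str] = ["<pad>", "<unk>"] + list(string.ascii_lowercase) + list(" '.,?!-")
CHAR_TO_ID = {ch: i for i, ch in enumerate(VOCAB)}
UNK_ID = CHAR_TO_ID["<unk>"]


def encode(text: str) -> List[int]:
    """Map each character of the lowercased text to an integer ID.

    Characters outside the vocabulary map to the <unk> ID.
    """
    return [CHAR_TO_ID.get(ch, UNK_ID) for ch in text.lower()]


def decode(ids: List[int]) -> str:
    """Inverse of encode for in-vocabulary characters."""
    return "".join(VOCAB[i] for i in ids)


if __name__ == "__main__":
    ids = encode("Otolaryngology")
    # Every letter is already in the vocabulary, so no <unk> IDs appear,
    # even if the word itself never occurred in the training transcripts.
    print(ids)
    print(decode(ids))
```

Because the model's input is a sequence of character IDs rather than whole-word IDs, a new word is just a new arrangement of familiar symbols; how well it is pronounced then depends on the letter patterns the model learned during training.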


Real or fake: can you tell the difference?

Nearly 50% of people had trouble distinguishing between real and faux Joe. Put your own listening skills to the test with our "Faux Rogan" game.


Meet the creators

The Dessa engineers behind RealTalk are Hashiam Kadhim, Rayhane Mama and Joe Palermo.