
Towards Deepfake Detection That Actually Works

Updated: Nov 26, 2019

Authors: Rayhane Mama & Sam Shi


Read a story featuring our work on deepfake detection by Cade Metz in The New York Times.



Before we start:

The codebase and data we present in this article can be found in our open source Github repository for anyone to replicate and build on.


Download the source code and data: Github repository


Download the trained model and check out our experiments: Atlas Experiment Dashboard


Introduction


In September 2019, with the objective of improving deepfake detection, Google released a large dataset of visual deepfakes. Since then, this dataset has been used in deep learning research to develop deepfake detection algorithms. The paper and dataset are called FaceForensics++, and they focus on two particular types of deepfake techniques: facial expression and facial identity manipulation.


In the FaceForensics++ paper, the authors augmented Google’s dataset with 1000 real videos from YouTube, from which they extracted 509,914 images by applying Face2Face, FaceSwap, DeepFakes and NeuralTextures deepfake techniques.


Here’s a summary of the four techniques:


Face2Face (facial reenactment): transfers expressions from a source video to a target video, using a model-based approach


FaceSwap (facial identity manipulation): a graphics-based approach that uses facial landmarks in each frame to create a 3D model of the source face, then projects it onto the target by minimizing the distance between landmarks


Deepfakes (facial identity manipulation): first uses face detection to crop the face, then trains two autoencoders that share an encoder, one for the source identity and one for the target. To produce a deepfake, the target face is run through the source autoencoder and blended back into the image using Poisson image editing (see the sketch after this list)


NeuralTextures (facial reenactment): a GAN-based approach
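To make the Deepfakes technique above more concrete, here is a minimal PyTorch sketch of the shared-encoder, per-identity-decoder setup it relies on. The layer sizes, latent dimension and image resolution are illustrative assumptions, not the configuration used to build the dataset.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Shared encoder: compresses a cropped face into a latent code."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),    # 64x64 -> 32x32
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 32x32 -> 16x16
            nn.Flatten(),
            nn.Linear(128 * 16 * 16, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Per-identity decoder: reconstructs a face from the shared latent code."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 128 * 16 * 16)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 16 -> 32
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 32 -> 64
        )

    def forward(self, z):
        return self.net(self.fc(z).view(-1, 128, 16, 16))

encoder = Encoder()
decoder_source, decoder_target = Decoder(), Decoder()

# Training: each identity is reconstructed through the shared encoder plus its own decoder.
# Swapping: run a target face through the encoder, then through the *source* decoder,
# and blend the result back into the frame (e.g. with Poisson image editing).
target_face = torch.rand(1, 3, 64, 64)
swapped_face = decoder_source(encoder(target_face))
```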


The paper's authors then fine-tuned an Xception net, pre-trained on ImageNet, to detect real vs. fake videos. The results mentioned in the paper suggest a state-of-the-art forgery detection mechanism tailored to face manipulation techniques.


Note: The FaceForensics++ inference codebase, datasets, and pre-trained models explained in the paper are open-sourced on Github.

The results shown in the table above report accuracies for different models and data qualities: each row corresponds to a different model, and each column to a different video compression rate. FaceForensics++ introduces the fine-tuned Xception net, which achieves state-of-the-art accuracy on the data created in the paper.


In this article, we make the following contributions:

  1. We show that the model proposed in FaceForensics++ does not yield the same metrics stated in the paper when tested on real-life videos randomly collected from YouTube.

  2. We conduct extensive experiments to demonstrate that the datasets produced by Google and detailed in the FaceForensics++ paper are not sufficient for making neural networks generalize to detect real-life face manipulation techniques.

  3. We show the need for the detector to be constantly updated with real-world data, and propose an initial solution in hopes of solving deepfake video detection.


The Problem


The FaceForensics++ paper’s results seemed very exciting, especially after validating the reported metrics by applying the pre-trained Xception net on the paper's data. But then we noticed a problem.


When we used the same model on real-world deepfake data encountered on YouTube (i.e. data not contained in the paper’s dataset), the accuracy of the model was much, much lower.


To test how well the model performed on real-world data, we applied the pre-trained Xception net to the following two videos:

  • Deepfake Impressionist

  • RealTalk — a recent initiative by Dessa to generate both the physical presence and voice of Joe Rogan, as an example of hyper-realistic synthetic media

Using this model, the best result we achieved in detecting fake videos from YouTube is shown in the video below:


The model predicts that 68% of the frames in the video are real, while in reality the entire video is a deepfake.


To ensure our initial observation was fair, we randomly selected additional non-manipulated videos from YouTube, in addition to the synthetic videos discussed earlier (the Deepfake Impressionist and RealTalk videos), and tested the model on all the videos collected at this stage. The model scores an accuracy of 58%, which is very close to a random guess.
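For context, per-video numbers like the 68% above come from aggregating frame-level predictions into a single verdict. Below is a minimal sketch of such an aggregation; the two 0.5 thresholds are illustrative assumptions, not necessarily the exact rule used by the FaceForensics++ code.

```python
import numpy as np

def video_verdict(frame_fake_probs, frame_threshold=0.5, video_threshold=0.5):
    """Aggregate per-frame fake probabilities into a single video-level call.

    frame_fake_probs: model outputs in [0, 1], one per extracted frame.
    Returns the fraction of frames flagged as fake and the video-level label.
    """
    probs = np.asarray(frame_fake_probs)
    fake_fraction = float((probs > frame_threshold).mean())
    label = "fake" if fake_fraction > video_threshold else "real"
    return fake_fraction, label

# A video where only 32% of frames are flagged as fake (i.e. 68% look "real"
# to the model) would be called real, as with the deepfake discussed above.
print(video_verdict([0.9] * 32 + [0.1] * 68))  # -> (0.32, 'real')
```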


Hypothesis


While trying out and reviewing the FaceForensics++ pre-trained model, we made the following observations:


  • The paper’s key focus is to introduce a large new dataset intended to make models generalize to unseen deepfake videos, like those typically encountered on the internet

  • The creation of the FaceForensics++ model did not involve extensive development: the model was built by fine-tuning a pre-trained Xception net after replacing the classification layer.

  • Most of the FaceForensics++ work went into data creation and preparation. The manipulated videos were created using video manipulation models, and to our knowledge the same models were used to manipulate the entire dataset. This is very important from an adversarial perspective: a detector trained on such data will only learn to detect the manipulation artifacts found in this dataset. Since all the forgery data is created using a handful of models, the detector will ultimately not generalize to recognize unfamiliar manipulation artifacts specific to other kinds of deepfake models.


This led us to make a hypothesis, which we then verified later on: both the training and test data are drawn from the same distribution, since the same deepfake generation models were used to create the fake samples. This distribution doesn’t represent examples of deepfake videos found ‘in the wild’ (for example, new examples of deepfakes uploaded onto YouTube).


Why this matters: the train and test data don’t represent real-world data robustly enough for the model to reliably generalize to unseen data and thus detect deepfakes accurately in the wild. This means that a model developed using such test data will inevitably overfit to data samples specific to the FaceForensics++ dataset.


We believe in the importance of the task of detecting deepfakes. With this in mind, we’ve enlisted ourselves in the effort to improve deepfake detection by experimenting to understand where the FaceForensics++ work falls short, while also providing potential solutions. The experiments that we conduct are shown in the following table.


We use colour codes to express the expected results that will confirm our hypothesis:


Green: Accurate classification. The model has learned the boundary between a real and a fake video. It is able to correctly classify unseen samples. (0.85+ AUC)


Yellow: Average classification. The model has learned how to correctly classify some of the real and fake videos. It is partially able to correctly classify unseen samples. (0.6-0.7 AUC)


Red: Uncertain classification. The model does not have any discriminatory power. It is unable to detect the difference between real and fake videos. It is guessing randomly on unseen samples. (0.5 AUC)


Experimentation


1. Data preparation


For our experiments, our data preparation pipeline was split into three streams:


1. We downloaded 60 random deepfake videos and 60 random real videos, both sourced from YouTube. In our selection process, we made sure to choose as many different faces as possible, in different contexts, different backgrounds and different lighting conditions. We also tried to make our data as diverse as possible, to ensure the dataset was representative enough of real-world videos typically encountered on the internet. All the manipulated videos we use in the YouTube dataset fall under the deepfake category.


Since different people create different deepfake videos, we also assume that these videos are created using a variety of deepfake techniques. We don’t make any further assumptions about the type of models, their architecture or the type of artifacts they generate. It is also worth mentioning that we don’t have real-fake pairs of videos; in other words, the real and fake videos used in our dataset are completely unrelated. We split this data into two sets, keeping 16.67% of the videos as a test set. We also made sure that the test videos don’t contain faces seen in the training data, and that the deepfake videos in the test set cover a wide variety of deepfake techniques.


2. From the FaceForensics++ dataset, we downloaded 11% of the overall data used in the paper. This ratio was picked to ensure a size comparable with the real-world dataset downloaded from YouTube. The FaceForensics++ data consists of original videos on top of which manipulations are applied, producing parallel real and fake samples. In other words, for a real video A, the FaceForensics++ dataset preparation creates four forged versions of the same video (video A), using the different forgery techniques discussed in our article’s introduction.


The data is split using the same ratio as our real-world YouTube dataset, holding out 16.67% of the videos. We make sure that faces included in the validation set don’t have samples in training. In the FaceForensics++ dataset (with the exception of the Google data), each real video is used to create four manipulated (deepfake) videos.


To reflect this in our split, we used one of the following methods for every real video sample used in the test set: a) we kept all the manipulated (fake) videos associated with it in the test set, or b) we completely dropped the manipulated videos from both the train and test data. Doing things this way ensures that during training the model is never exposed to the real test samples, nor to the manipulated videos generated from them.
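As an illustration of option (a), here is a hedged sketch of an identity-grouped split. The metadata fields (`path`, `identity`, `is_fake`) are naming assumptions for illustration, not the exact code in our repository.

```python
import random

def split_by_identity(videos, test_fraction=0.1667, seed=0):
    """Split videos so that no identity (and none of the fakes derived from it)
    appears in both train and test.

    `videos` is a list of dicts with keys "path", "identity" and "is_fake";
    for a manipulated video, "identity" is the identity of its real source video.
    """
    identities = sorted({v["identity"] for v in videos})
    random.Random(seed).shuffle(identities)

    n_test = max(1, int(len(identities) * test_fraction))
    test_identities = set(identities[:n_test])

    # Real videos and all fakes derived from them land on the same side of the split.
    train = [v for v in videos if v["identity"] not in test_identities]
    test = [v for v in videos if v["identity"] in test_identities]
    return train, test
```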


3. Finally, we create a third amalgamated dataset consisting of the real-world YouTube and FaceForensics++ datasets by combining both the training datasets discussed above.


Because the process of collecting and labeling videos from YouTube a) was limited by the number of decent-quality source videos and b) involved an extensive amount of time-consuming manual work, we stopped data collection at this stage of the project. However, we’d almost certainly expect to get better results if we included more examples in our dataset.


2. Data pre-processing


We apply the same data pre-processing strategy for all datasets:


1. Extract frames from each video.


2. Make use of reliable face tracking technology to detect faces in each frame and crop the image around the face.


When detecting videos created with facial manipulation, it is possible either to train the detectors on the entire frames of the videos, or to crop the area surrounding the face and apply the detector exclusively to this cropped area. Since our goal was to closely match the experimental methodology behind the FaceForensics++ paper's results, we focused on the latter technique, cropping around the face.
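The pre-processing above can be sketched roughly as follows. OpenCV's bundled Haar cascade is used here purely as a stand-in for the face tracking technology mentioned earlier, and the frame-sampling rate and crop margin are illustrative choices.

```python
import cv2

# Stand-in face detector; any reliable face tracker can be substituted here.
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_face_crops(video_path, every_n_frames=5, margin=0.3):
    """Yield face crops from a video: grab a frame every `every_n_frames`,
    detect the largest face, and crop around it with some extra margin."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = face_detector.detectMultiScale(gray, 1.3, 5)
            if len(faces) > 0:
                # Keep the largest detection and pad it by `margin` on each side.
                x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
                pad_w, pad_h = int(w * margin), int(h * margin)
                x0, y0 = max(0, x - pad_w), max(0, y - pad_h)
                yield frame[y0:y + h + pad_h, x0:x + w + pad_w]
        idx += 1
    cap.release()
```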


At this stage, we obtained the following data statistics:


We keep all our videos at “raw” quality, and do not experiment with different compression rates, as performed in the FaceForensics++ paper. It would however be straightforward to apply different compression rates to the original videos, and redo the experiments with this methodology. That said, we believe the observations we make in this article would not change drastically if we were to apply compression.
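For anyone who does want to redo the experiments with compression, a re-encoding step along the following lines would approximate the paper's setup. To the best of our knowledge, CRF values of 23 and 40 correspond to FaceForensics++'s high- and low-quality settings; the function name and defaults are our own.

```python
import subprocess

def compress_video(src_path, dst_path, crf=23):
    """Re-encode a video with H.264 at a given constant rate factor (CRF).
    CRF 23 roughly matches the FaceForensics++ "HQ" setting, CRF 40 its "LQ"."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src_path,
         "-c:v", "libx264", "-crf", str(crf), dst_path],
        check=True,
    )
```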


3. Data augmentation


Since the datasets are relatively small (100 faces for the entire training dataset), the model is prone to overfitting by memorizing faces. To combat this, we employed a series of simple data augmentation techniques:

  • Random horizontal flip

  • Random rotation and scaling

  • Random perspective (distortion of the image)

  • Random brightness, contrast and saturation jitter

  • Random region masking (replace portions of input images with empty blocks)

We should note here that data augmentation is only applied to training images. The test data is only normalized, with no further tweaks applied, and the model is evaluated on these otherwise untouched test images.
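A sketch of this split between train-time augmentation and test-time normalization, using torchvision; the exact probabilities, rotation angles and jitter magnitudes shown here are illustrative assumptions rather than our tuned values.

```python
from torchvision import transforms

# ImageNet statistics, since the backbone is pre-trained on ImageNet.
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomAffine(degrees=15, scale=(0.9, 1.1)),   # rotation + scaling
    transforms.RandomPerspective(distortion_scale=0.2),      # perspective distortion
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    normalize,
    transforms.RandomErasing(p=0.5),                          # random region masking
])

# Test images are only resized and normalized; no augmentation.
test_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    normalize,
])
```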


Modeling


On the modelling side, we decided to follow a similar approach to the one used in the FaceForensics++ paper, in the hope of replicating the work done in that context as much as possible.


The biggest change we made in modelling was using ResNet18 instead of Xception net, both to iterate faster and to reduce the chance of overfitting, since the former model is smaller than the latter.


In all of our experiments, we make use of a ResNet18 model pre-trained on ImageNet, and we unfreeze the weights so they can be fine-tuned on the deepfake detection task. The reasoning behind unfreezing the convolutional layers is to move the weights away from detecting what humans would perceive as the typical set of facial features (eyes, ears, noses, etc.) and towards learning features that are more useful for algorithmic deepfake detection (artifacts, skin colour changes, blur, etc.).


We do this in the hope of discouraging the model from merely memorizing the differences between real and fake faces, and of encouraging it to look for more useful features. If we used a fully connected classifier on top of facial features extracted by ResNet18, such as eyes or ears, it would encourage the model to memorize that certain faces are associated with the label ‘real,’ while others are correlated with the label ‘fake.’ To overcome this, we also fine-tuned the ResNet18 layers to start looking for other artifacts useful in deepfake detection, such as blur or two sets of eyebrows appearing on a single face. Such features enable our fully connected classifier to generalize better to never-before-seen faces.


We also replaced the fully connected layer of ResNet18 with a classifier made of one or more dense layers, with ReLU activations on the hidden layers. In all of our experiments, the classifier is either linear (i.e. one fully connected layer) or has a single hidden layer with a ReLU activation. This Boolean decision functions as a hyperparameter that we can switch on and off during our experiments.


For regularization purposes, we employ dropout on the convolutional layers' outputs, as well as dropout and batch normalization on the hidden fully connected layer's output (where applicable). We also apply weight decay using the L2 norm to all trainable weights of the model (including the pre-trained ResNet convolutional weights).
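Putting the modelling choices above together, a minimal PyTorch sketch of the detector might look like the following. The hidden width, dropout rate, learning rate and weight-decay value are placeholders for the hyperparameters we search over below, not the settings of any particular experiment.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_detector(use_hidden_layer=True, hidden_dim=256, dropout=0.5):
    """ResNet18 pre-trained on ImageNet, all weights left unfrozen, with the
    final fully connected layer replaced by a small real-vs-fake classifier."""
    backbone = models.resnet18(pretrained=True)  # newer torchvision uses weights=...
    in_features = backbone.fc.in_features        # 512 for ResNet18

    if use_hidden_layer:
        head = nn.Sequential(
            nn.Dropout(dropout),                 # dropout on the pooled conv features
            nn.Linear(in_features, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 2),            # real vs fake logits
        )
    else:
        head = nn.Sequential(nn.Dropout(dropout), nn.Linear(in_features, 2))

    backbone.fc = head
    return backbone

model = build_detector()

# Weight decay (L2) is applied to every trainable parameter, including the
# pre-trained convolutional weights, which stay unfrozen for fine-tuning.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
```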



Experiment management


To ensure that our results were not the product of one-off experimental luck, we defined a relatively wide range of hyperparameters over which we launched a large number of experiments.


For speed, we chose to do a random search over our hyperparameters’ space:


  • Training data source: which dataset was used to train the model; either the real-world YouTube data, the paper data (subsample), or the combined data

  • Batch size

  • Number of training epochs

  • Dropout rate

  • Weight decay ratio

  • Learning rate scheme

  • Whether to use a hidden FC layer in the classifier


The objective of such an experimentation setup is to train as many models as possible, using as many different configurations as possible, on all three versions of the datasets available. We then analyze the impact of these configurations on the model performance on our different test datasets.
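A sketch of what such a random search driver might look like is shown below. The value grids and the `train_and_evaluate` function are hypothetical stand-ins for illustration, not our exact search space or training code.

```python
import random

# Hypothetical search space; the grids below are illustrative, not our exact values.
SEARCH_SPACE = {
    "train_data": ["faceforensics", "youtube", "combined"],
    "batch_size": [16, 32, 64],
    "n_epochs": [5, 10, 20],
    "dropout": [0.0, 0.25, 0.5],
    "weight_decay": [0.0, 1e-5, 1e-4, 1e-3],
    "learning_rate": [1e-5, 1e-4, 1e-3],
    "use_hidden_layer": [True, False],
}

def sample_config():
    """Draw one random configuration from the search space."""
    return {name: random.choice(choices) for name, choices in SEARCH_SPACE.items()}

# train_and_evaluate would be a single training run returning validation metrics
# on both test sets; log_experiment would record config + metrics for later analysis.
for job_id in range(140):
    config = sample_config()
    # metrics = train_and_evaluate(config)
    # log_experiment(job_id, config, metrics)
```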



Results


For ease of parallel experimentation and analysis, we make use of our Atlas Community Edition software. Most of the following observations and analysis are therefore done in the software's UI, as demonstrated in the video below. If you prefer reading to watching videos, we provide the same observations following this video.



We also provide an interactive UI for people to examine our results and make their own observations, or even download pre-trained models.


To analyse the performance of our models, we use both accuracy and AUC as metrics: accuracy because it is a simple, intuitive metric, and AUC for its measure of discriminatory power, which helps us recognize whether the models are learning a decision boundary between real videos and deepfakes or just guessing randomly.
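Concretely, both metrics can be computed from frame-level predictions with scikit-learn. The helper below is a sketch of that evaluation, assuming labels use 1 for fake and 0 for real.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate(labels, fake_probs, threshold=0.5):
    """Compute accuracy (at a fixed threshold) and AUC (threshold-free
    discriminatory power) for frame-level predictions.

    labels: 1 for fake, 0 for real; fake_probs: model probability of "fake".
    """
    labels = np.asarray(labels)
    fake_probs = np.asarray(fake_probs)
    return {
        "accuracy": accuracy_score(labels, fake_probs > threshold),
        "auc": roc_auc_score(labels, fake_probs),
    }
```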


Our hypothesis is that the paper’s Xception net overfits to the paper’s type of data, and is not capable of generalizing to real-life deepfakes that one might encounter on YouTube. An intuitive first experiment to validate this hypothesis is to train a new model on the paper’s dataset, and validate it on both the paper dataset and the YouTube dataset.


The following figure shows the learning curve of such a model, where base refers to the paper data and augment refers to the YouTube data. We also plot, in dashed lines, the accuracy and AUC scores that a naive approach of always predicting y=0 gives. (Notice that for AUC, both lines overlap.)



Note: For the images shown above, we display three key pieces of information for each picture. The first is the probability of the image being real and fake, respectively. The second is the model's prediction (0 for real, 1 for fake), and the third is the actual label.


It is important to notice that the model does not detect deepfakes of better quality than the ones seen during training. The better-quality 'in the wild' YouTube deepfakes are those outlined in the red rectangles.


While this result validates that a model — even if different in architecture — trained on paper data cannot generalize to unseen YouTube videos, we want to further explore the reason behind such behaviour. We assume that this previous observation is symmetrical, and that both datasets are very different from each other. In other words, we assume that a model trained on YouTube data would also not be able to generalize to paper data.


Running such an experiment, we obtain the following figure:





It is very important to note that the model trained on 'in the wild' YouTube data learns to recognize subtle deepfakes that are not very visible to the human eye, and is additionally capable of detecting the worse-quality deepfakes present in the FaceForensics++ paper’s dataset (highlighted above in red). It is also worth mentioning that this model does not detect other types of forgery, like NeuralTextures, as it never saw such forgery types in training.


This observation, in concert with the first experiment, shows that a model trained on “hard” examples (from the YouTube data) is able to partially detect unforeseen “easy” ones from the paper data, while the opposite is not true. This suggests that, when training deepfake detectors, the machine learning community should aim for the best-quality image manipulation examples available.


Since the previous model, trained on the YouTube dataset, is only capable of detecting deepfake-type forgeries, it is not very interesting to train a model that solves only a portion of the problem. It is not ideal to have a deepfake detection model checking every newly uploaded video on YouTube if it misses 50% of the manipulated videos, right? Thus, we propose to train the model on both datasets combined, and then evaluate its performance on both datasets.




As one would expect, training on both datasets together optimizes for both metrics (validation AUC on the paper dataset and validation AUC on the YouTube dataset) at the same time, and fixes the non-generalization issues found with both approaches in isolation. The red rectangles show that the model learns to detect the paper’s forgery techniques, such as FaceSwap or NeuralTextures, while also learning to detect deepfakes that are more difficult to catch with the naked (human) eye.


To ensure these results are consistent and not just the result of luck, we ran 140 jobs while randomly changing the model's hyperparameters.


The following GIF shows our parallel coordinates plot (instead of providing numbers in tables) across all 140 jobs we ran:

A parallel coordinates plot is an intuitive and very effective way to analyze the correlation between different features, whether hyperparameters, metrics, or any other numerical data. In our case, we visualize the correlation between our experiments' parameters (left-hand columns) and our metrics (right-hand columns).
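For readers who prefer code to screenshots, a rough equivalent of such a plot can be produced with pandas. The `experiments.csv` file and its column names are hypothetical stand-ins for however the experiment logs are exported.

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Hypothetical experiment log: one row per job, hyperparameters plus metrics.
df = pd.read_csv("experiments.csv")

# Colour each line by the training dataset to see its effect on the metrics.
# (In practice one would rescale the columns to a common range first.)
parallel_coordinates(df, class_column="train_data",
                     cols=["dropout", "weight_decay", "learning_rate",
                           "val_auc_paper", "val_auc_youtube"])
plt.show()
```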


As seen in the parallel coordinates plot, while we make sure to randomize our different hyperparameters across the whole space, we notice consistent and major differences in metrics solely from changing the dataset used to train the model. (0: FaceForensics++ dataset, 1: YouTube dataset, 2: both datasets combined)


For every selection of the training dataset, the explored hyperparameters span a large portion of the exploration space, while the metrics we observed stay consistent. For example, a model trained on the FaceForensics++ dataset gives an AUC score of 0.5 no matter which hyperparameters we pick. By the same logic, every model trained on YouTube data yielded an average AUC of 0.63 over all the tried parameters, while a model trained on the combination of both datasets yields good discriminatory power over both validation datasets, reflected in an AUC of 0.8+.


These results thus confirm that, while the FaceForensics++ paper introduces a new dataset that aims to create generalizable deepfake detection models, it falls short of doing so. Looking closely at the paper’s dataset, we believe this is caused by the poor quality of the forged samples, which makes them easy targets. It also suggests that the metrics provided in the paper overestimate the detector’s real-world performance.


We also believe that using the same forgery model for all the fake data samples of its type hurts the performance of the FaceForensics++ deepfake detector: by making it overfit to detecting only that type of model, it becomes impractical for real-life applications. Since anyone can upload content to social media, it is natural to expect that a large number of different forgery models will be used to create manipulated videos. Thus, it is necessary to try to detect forgeries from all sorts of different models instead of the forgeries of one specific model.


Now, try it yourself!


We provide all the tools needed to reproduce these experiments in our publicly available GitHub repository, for anyone curious to run similar deepfake detection experiments. We have also open-sourced the deepfake YouTube data that we collected (a link is inside the repository), in the hope of helping others working to improve this field.


We also provide the experiments that we conducted, in an interactive UI for people to freely explore our work, check out different experiments and make their own observations.


Fork the deepfake detector code on our GitHub: https://github.com/dessa-public/DeepFake-Detection/fork


Check out the experiments we did on our interactive UI of the Atlas dashboard:

http://deepfake-detection.dessa.com/projects


Conclusion


The deepfake detection field is far from solved. In fact, with adversarial techniques such as deepfakes, the quest to detect them reliably tends to be unending because of the 'cat and mouse' nature of the problem: finding ways to identify deepfakes ironically tends to hand those developing generative models techniques to make them more advanced. Built into successive models, those techniques become capable of eluding once-reliable deepfake detectors. While our hope is that a solution to this kind of problem is eventually found, for now we must strive to continuously improve machine-learning-powered deepfake detectors with more, newer, and better data samples.


We think the current way to stay on top of most video manipulation techniques is to make sure detectors have the most diverse datasets possible, and are developed on datasets that resemble real-life test data. To that end, if you think you can contribute useful real or manipulated videos for this cause, please help defend against video manipulation by getting in touch with us at foundations@dessa.com.


Learn more


Read Google's post on the dataset: Contributing Data To Deepfake Detection Research

https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html


Read the FaceForensics++ paper

https://arxiv.org/abs/1901.08971


Find the FaceForensics repository on GitHub:

https://github.com/ondyari/FaceForensics/


Fork the deepfake detector code on our GitHub: https://github.com/dessa-public/DeepFake-Detection/fork


Check out the experiments we did on the Atlas dashboard:

http://deepfake-detection.dessa.com/projects


Learn more about Atlas and download for free:

https://dessa-atlas-community-docs.readthedocs-hosted.com/en/latest/



Stay tuned for more from Dessa.