Mini Deepfake Exploration

Jan 14, 2025

As part of my research internship with the AI for Good Lab, I helped produce a large literature review of deepfake technology, which is far too long to share here. What I can share is this short exploratory study on evaluating the quality of deepfake datasets.

Evaluating the Quality of Deepfake Datasets

We conducted a short experiment to assess dataset quality by using a language model (GPT) to detect errors within randomized frames extracted from these datasets. The premise is that if the frames themselves contain noticeable errors or inconsistencies, then training and benchmarking detectors on these datasets is fundamentally flawed.

Dataset Selection

For this quick analysis, we utilized two well-known and publicly available datasets: Celeb-DF and WildDeepFake. These datasets were chosen due to their popularity and wide usage in the field of deepfake detection research.

Celeb-DF: To work with this dataset, we developed a simple pipeline that involved randomly selecting videos and then extracting random frames from each selected video. This allowed for a diverse and representative sample of frames from various videos within the dataset, ensuring coverage of different content and deepfake techniques.
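The pipeline above can be sketched in two steps: a pure sampling plan (which videos, which frame indices) and an OpenCV decode pass. The directory layout, function names, and default counts below are my own illustration, not the internship code:

```python
import random


def plan_samples(frame_counts, n_videos=10, frames_per_video=1, seed=42):
    """Pick n_videos at random, then frames_per_video random frame
    indices from each, returning (video_name, frame_index) pairs.

    frame_counts maps each video file name to its total frame count
    (readable via cv2.CAP_PROP_FRAME_COUNT).
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    chosen = rng.sample(sorted(frame_counts), min(n_videos, len(frame_counts)))
    plan = []
    for name in chosen:
        total = frame_counts[name]
        for idx in sorted(rng.sample(range(total), min(frames_per_video, total))):
            plan.append((name, idx))
    return plan


def extract(video_dir, plan):
    """Decode the planned frames with OpenCV (requires opencv-python)."""
    import cv2  # imported lazily so the sampler runs without OpenCV installed

    frames = []
    for name, idx in plan:
        cap = cv2.VideoCapture(f"{video_dir}/{name}")
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the chosen frame
        ok, frame = cap.read()
        if ok:
            frames.append((name, idx, frame))
        cap.release()
    return frames
```

Separating the plan from the decode makes the random selection reproducible and easy to audit before committing to the slower video I/O.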

WildDeepFake: In contrast to Celeb-DF, WildDeepFake already provided pre-extracted frames from videos, so we opted to skip the video selection and frame extraction steps. Instead, we randomly selected frames from the dataset’s existing collection, ensuring a balanced distribution across different video sets.
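A minimal sketch of that balanced selection, assuming the pre-extracted frames are grouped by their source video set (the grouping keys and counts here are illustrative):

```python
import random


def balanced_sample(frames_by_set, per_set=2, seed=7):
    """Draw the same number of frames from every video set so that no
    single source dominates the evaluation pool.

    frames_by_set maps a video-set identifier to its list of frame
    file paths, e.g. the per-video folders WildDeepFake ships.
    """
    rng = random.Random(seed)  # reproducible draws
    picks = []
    for set_name in sorted(frames_by_set):
        frames = frames_by_set[set_name]
        picks.extend(rng.sample(frames, min(per_set, len(frames))))
    return picks
```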

By tailoring the sampling approach to each dataset, we kept the analysis efficient while still maintaining variability in the images used for deepfake assessment.

Error Detection using GPT

To assist in determining whether video stills were generated by a deepfake or came from a real video, I employed ChatGPT as a tool for quick assessments. Using a specific prompt, I uploaded a set of stills and asked ChatGPT to analyze the images. The prompt used was: "Hello! I'm going to upload a set of stills from a video. Can you tell me whether or not you think these stills are from a deepfake video? Make your best guess on whether these photos are from a deepfake or a real video."

I instructed the model to respond with either "deepfake" or "real" based on its best judgment. While ChatGPT is not a purpose-built forensic detector, this approach allowed for rapid, preliminary assessments based on visual patterns commonly associated with deepfakes.

For this experiment, we randomly selected 10 frames from each of the two datasets, Celeb-DF and WildDeepFake, to evaluate GPT's ability to distinguish real from deepfake images.

Results and Observations

From the WildDeepFake dataset, GPT predicted 4 out of the 10 frames to be "real," even though all 10 frames were confirmed deepfakes. In contrast, for the Celeb-DF dataset, GPT identified only 1 frame as "real" out of 10, again with all frames being deepfakes.

Upon further inspection of the frames, some noteworthy differences between the two datasets became apparent. The frames from WildDeepFake were tightly cropped and focused primarily on the face, likely because facial features are a primary source of deepfake artifacts. This zoomed-in approach, while useful for identifying facial inconsistencies, may have limited GPT's ability to detect contextual clues outside the face, which could explain why more "real" predictions were made in this set.

On the other hand, the Celeb-DF frames provided a broader view, including not just the face but the entire surrounding setting. This additional context—such as background environment, lighting, or body language—may have offered ChatGPT more cues to inform its decisions, resulting in fewer incorrect "real" predictions. Furthermore, since Celeb-DF contains well-known celebrities, prior knowledge or expectations of how these individuals should appear could have also influenced GPT's assessment, as familiarity with their typical features might make deepfake distortions more apparent.