Deepfakes look so real now because the technology stopped leaving fingerprints. As recently as 2022, most fakes gave themselves away with flickering edges, waxy skin, mismatched lighting, and eyes that never quite blinked right. Those tells are largely gone. In a 2025 study by iProov, only 0.1% of people could reliably tell modern AI-generated media apart from the real thing. This guide explains what changed under the hood, why the old visual checks no longer work, and what that means for anyone who has to trust a face or a voice on a screen.
It is written for security leads, fraud officers, identity verification product managers, and compliance teams, and it assumes no machine learning background. If you have ever watched a clip, felt that something was off, and still could not say what, this explains why that instinct is no longer reliable.
- Only 0.1% of people can reliably distinguish modern AI-generated media from real, per iProov's 2025 study.
- The core cause is a shift from GANs to diffusion models, which produce cleaner, more physically plausible images.
- Video models now maintain temporal consistency, so faces no longer flicker, warp, or drift between frames.
- The old visual checklist (bad blinking, waxy skin, weird teeth) is close to worthless against diffusion-era fakes.
- The number of deepfakes online rose from about 500,000 in 2023 to roughly 8 million in 2025 (DeepStrike).
- Because realism now defeats the human eye, the reliable line of defense has moved to automated detection.
What Is a Deepfake?
A deepfake is synthetic media in which a person's face, voice, or full likeness is generated or swapped using deep learning. The term covers everything from a single doctored selfie to a live video call where the person on the other end does not exist. What unites them is the method: a neural network learns the statistical structure of how a person looks or sounds, historically from thousands of examples of the target and now often from as little as a single photo or a few seconds of audio, and then produces new media that fits the same pattern.
The realism of any deepfake depends almost entirely on the generative model behind it. For most of the past decade that meant one of two approaches: autoencoders or GANs. A third, diffusion, has now taken over, and that single shift is the biggest reason today's fakes are so convincing. For a deeper technical walkthrough of each model family, see our guide to how deepfake technology works.
Why Deepfakes Look So Real Now: The Short Version
Three things happened at roughly the same time. The underlying generation method changed from GANs to diffusion models, which produce cleaner and more physically plausible images. Video models learned temporal consistency, so faces stopped flickering and warping between frames. And the tools became cheap and fast enough that almost anyone with a single photo and a consumer graphics card can produce a passable result. Each of those deserves a closer look.
How Modern Deepfakes Are Made
The shift from GANs to diffusion
The first wave of convincing deepfakes ran on generative adversarial networks, or GANs, introduced by Ian Goodfellow and colleagues in 2014. A GAN pits two networks against each other: a generator that makes fakes and a discriminator that tries to catch them. They improve together until the generator's output fools the discriminator often enough to pass as real. GANs made photorealistic faces possible, but they were unstable to train and prone to subtle repeating artifacts, especially around hair, teeth, and the boundary where a swapped face met the original head.
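For readers who want the formal version, the generator-versus-discriminator contest is usually written as a minimax objective. The notation below follows the original 2014 GAN paper; the symbols D (discriminator), G (generator), p_data (real data distribution), and p_z (random input distribution) are not used elsewhere in this guide.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to push this value up by scoring real samples near 1 and fakes near 0, while the generator tries to push it down by making D(G(z)) approach 1. That tug-of-war is one source of the training instability described above.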
Since 2023, diffusion models have led. A diffusion model works backward from noise. It starts with a field of random static and removes it step by step, guided by a network trained to reverse a gradual noising process, until a coherent image emerges. Latent diffusion, which runs that process inside a compressed representation instead of on raw pixels, made it fast enough for everyday use. The payoff is higher fidelity, far fewer structural glitches, and much more stable output than GANs ever managed.
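The noising and denoising idea can be illustrated with a toy sketch. It is not a real generator: it uses a four-number "image", an assumed linear noise schedule, and hands the denoiser the exact noise that was added, which is the quantity a trained network would have to estimate. All function names and parameter values here are illustrative assumptions.

```python
import math
import random


def make_schedule(steps=50, beta_start=1e-4, beta_end=0.2):
    """Assumed linear noise schedule.

    Returns alpha_bar[t], the fraction of the original signal's variance
    that survives after t + 1 noising steps.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    alpha_bar, running = [], 1.0
    for t in range(steps):
        beta = beta_start + (beta_end - beta_start) * t / (steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def add_noise(x0, alpha_bar_t, rng):
    """Forward process: blend the clean signal with Gaussian noise."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    keep = math.sqrt(alpha_bar_t)
    mix = math.sqrt(1.0 - alpha_bar_t)
    xt = [keep * x + mix * e for x, e in zip(x0, eps)]
    return xt, eps


def remove_noise(xt, eps_hat, alpha_bar_t):
    """Invert the blend, given an estimate of the noise that was added.

    A real diffusion model predicts eps_hat with a neural network and
    walks back over many small steps instead of one jump.
    """
    keep = math.sqrt(alpha_bar_t)
    mix = math.sqrt(1.0 - alpha_bar_t)
    return [(x - mix * e) / keep for x, e in zip(xt, eps_hat)]
```

Calling `add_noise` with the last entry of the schedule buries almost all of the original signal in static, and `remove_noise` recovers it when given the true noise. The quality of a real model comes down to how well its network estimates that noise from the noisy input alone.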
Video learned to hold still
A convincing still image is one problem. A convincing video is a much harder one, because the face has to stay consistent across dozens of frames every second while it moves, turns, and speaks. Early video deepfakes solved this badly. Faces jittered, edges warped, and identity drifted from one frame to the next. Those artifacts were the single most reliable tell a human viewer had.
Modern video generation systems were built specifically to fix this. Models such as OpenAI's Sora 2 and Google's Veo 3 use temporal attention and spatio-temporal latent representations to hold coherence across time. In plain terms, the model treats a clip as one connected object rather than a stack of independent pictures, so motion stays smooth, lighting stays consistent, and a face keeps its identity as it moves. The flicker and warping that used to expose fakes were engineering problems these models were designed to solve, and for the most part they have.
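To make "flicker" concrete, here is a minimal sketch of a frame-to-frame difference measure. It is an illustration of the concept only, not a deepfake detector: genuine motion also changes pixels, and the frames here are tiny hypothetical grayscale grids represented as nested lists.

```python
def frame_difference(frame_a, frame_b):
    """Mean absolute pixel difference between two equally sized grayscale frames."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += abs(pixel_a - pixel_b)
            count += 1
    if count == 0:
        raise ValueError("frames must contain at least one pixel")
    return total / count


def flicker_score(frames):
    """Average difference between consecutive frames; higher means more jitter."""
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    diffs = [frame_difference(a, b) for a, b in zip(frames, frames[1:])]
    return sum(diffs) / len(diffs)
```

A clip whose static regions jump in brightness from frame to frame scores high on this measure, while a temporally coherent clip scores low. Early video deepfakes tended toward the first case, which is roughly what a human viewer perceived as jitter.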
Identity and motion got separated
One reason today's fakes move so naturally is that the best models disentangle identity from motion. The information describing who a person is gets stored separately from the information describing how they move. That means the same set of movements can be mapped onto a different face, or one face can be driven through many different motions, without the two interfering. It is why a single reference photo can now be puppeted into a full, natural-looking performance.
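The separation can be pictured with a deliberately simplified sketch. The `FaceLatent` structure and `reenact` function are hypothetical names invented for illustration; real models learn these representations as high-dimensional vectors rather than storing them in labeled fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaceLatent:
    """Hypothetical split representation: who the person is vs. how they move."""
    identity: tuple  # stays fixed for a given person
    motion: tuple    # changes frame to frame (pose, expression)


def reenact(source, driving_clip):
    """Pair the source identity with each motion code from another clip."""
    return [
        FaceLatent(identity=source.identity, motion=frame.motion)
        for frame in driving_clip
    ]
```

Because the two parts never mix, swapping one leaves the other untouched. That is the property that lets one reference photo supply the identity while a separate performance supplies the movement.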
Sound and picture arrive together
The newest models generate synchronized audio in the same pass as the video, including dialogue, ambient noise, and lip movement that matches the words. Poor lip-sync used to be a giveaway. When the mouth and the audio come out of one system rather than being stitched together afterward, that seam disappears.
Voice on its own has crossed the same threshold. A few seconds of a target's audio is now enough to clone their voice, complete with natural intonation, rhythm, emphasis, pauses, and even breathing noise. The mechanical, flat quality that once marked a synthetic voice has largely gone, and the effect is already visible in volume: some large retailers now report receiving over a thousand AI-generated scam calls a day. The upshot is that neither a familiar face nor a familiar voice can be treated as proof of who you are dealing with.
Scale did the rest
None of this would matter without the scale behind it. These models are trained on enormous annotated datasets using specialized hardware, and researchers have found again and again that adding data and compute yields large jumps in realism. The tools also became radically more accessible. A large language model can draft a script, a video model can render it, and an AI agent can chain the whole thing together. What once took an expert hours of rendering now takes a novice a few minutes.
The Old Tells That No Longer Work
Most advice about spotting deepfakes still circulates a list of visual checks that were genuinely useful three years ago and are close to worthless today. Unnatural blinking, waxy skin, strange teeth, blurry edges, and mismatched earrings were all real artifacts of GAN-era and early video deepfakes. Diffusion-era generation rarely produces any of them. Worse, repeating this advice gives people false confidence: they scan a clip for the old signs, find none, and conclude it must be real.
The uncomfortable truth is captured by that iProov figure. When only 0.1% of people can tell the difference, "trust your eyes" is not a strategy. It is a coin flip weighted against you. For a grounded look at what manual inspection can and cannot still do, see our guide on how to spot a deepfake.
Why This Matters Beyond the Uncanny Valley
Realism is not just an aesthetic milestone. It is a fraud enabler. Deepfake-enabled fraud exceeded 1.1 billion dollars in the United States in 2025, and deepfakes now account for roughly 6.5% of all fraud attempts, up from less than 1% in 2021 (FF News). The volume climbed alongside the quality: one estimate put the number of deepfakes online at about 8 million in 2025, up from roughly 500,000 in 2023 (DeepStrike). You can see how this plays out in the financial sector in our breakdown of deepfake fraud in financial services.
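The growth implied by those figures is easy to check. The sketch below uses only the numbers quoted above (about 500,000 deepfakes in 2023, roughly 8 million in 2025, and a fraud share rising from under 1% to 6.5%); the function names are illustrative, and the annualized rate assumes smooth compounding across the two years, which the sources do not claim.

```python
def growth_factor(start, end):
    """How many times larger the end value is than the start value."""
    return end / start


def annual_growth_rate(start, end, years):
    """Compound annual growth rate, assuming smooth year-over-year compounding."""
    return (end / start) ** (1 / years) - 1


# Deepfakes online: about 500,000 (2023) to roughly 8 million (2025).
volume_factor = growth_factor(500_000, 8_000_000)
volume_rate = annual_growth_rate(500_000, 8_000_000, years=2)

# Share of fraud attempts: under 1% (2021) to 6.5%, so at least this multiple.
share_factor_floor = growth_factor(1.0, 6.5)
```

On those inputs the volume grew 16-fold in two years, which works out to roughly quadrupling each year, and the deepfake share of fraud attempts grew by a factor of at least 6.5.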
The specific danger for organizations is that realism defeats the controls built around human judgment. A finance employee who joins a video call with a convincing fake of the CFO has no visual reason to doubt it. This is not hypothetical. In one widely reported case, a finance worker at engineering firm Arup paid out around 25 million dollars after joining a video call in which every other participant, including the company's chief financial officer, was a deepfake. Every face on the screen looked and moved like a real colleague.
Automated controls built on the same assumption fail the same way. A liveness check that only confirms a moving, responsive face is present cannot tell whether that face was generated frame by frame and pushed in through a virtual camera. That is exactly how real-time deepfakes operate. As synthetic media researcher Siwei Lyu of the University at Buffalo has put it, simply looking harder at the pixels is no longer enough. The reliable line of defense has moved from something a trained eye does to something a purpose-built system does.
Common Mistakes in Judging Whether Something Is Real
Five habits now do more harm than good:
- Relying on the old visual checklist. Blinking, skin texture, and teeth were GAN-era tells. They no longer separate real from fake.
- Assuming quality equals authenticity. The thought that no one could fake something this polished is now precisely backwards. Polish is cheap.
- Trusting a familiar voice on a call. A few seconds of audio is enough to clone intonation, rhythm, and breathing. Voice recognition is not identity verification.
- Treating one free online checker as the final word. Free tools are useful for triage, not for high-value decisions.
- Concluding that detection is hopeless. Realistic to a human eye does not mean invisible to a detector. The signal moved; it did not vanish.
What To Do Instead
The response is not despair; it is process. A few practical moves:
- Verify through a second channel. For any high-value request, confirm identity out of band, through a known phone number or an in-person check, not through the same channel the request arrived on.
- Retire the visual-checklist training. Teaching staff to hunt for waxy skin builds false confidence. Teach them that a clean-looking video proves nothing.
- Put automated detection where identity decisions happen. Onboarding, high-value transactions, and sensitive video calls are the moments that matter. DeepDetector runs an ensemble across GAN, autoencoder, and diffusion artifacts and returns a sub-second verdict with the visual evidence behind it.
- Insist on ensemble coverage. Each architecture leaves a different signature, so a detector tuned for one can be blind to another. No single signal holds up alone.
- Keep detection current. New generators ship every quarter. A model trained on last year's fakes drifts out of date quickly, so continuous retraining is not optional.
To see where this capability belongs, read our guide on where deepfake detection fits in an identity verification stack.
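The case for ensemble coverage in the list above can be illustrated with a minimal sketch. It assumes each architecture-specific detector returns a score between 0 and 1; the function name, the score labels, and the 0.5 threshold are hypothetical, and this is not a description of how DeepDetector or any other product combines its signals.

```python
def ensemble_verdict(scores, threshold=0.5):
    """Flag media as suspect if any architecture-specific detector fires.

    `scores` maps a detector family name to a score in [0, 1], where
    higher means more likely synthetic. Taking the maximum means one
    strong signal is enough, even if the other detectors see nothing.
    """
    if not scores:
        raise ValueError("at least one detector score is required")
    family, top = max(scores.items(), key=lambda item: item[1])
    return {"suspect": top >= threshold, "strongest_signal": family, "score": top}


# Hypothetical scores for a diffusion-generated clip.
example_scores = {"gan": 0.08, "autoencoder": 0.12, "diffusion": 0.91}
example_result = ensemble_verdict(example_scores)
```

In this made-up example the GAN and autoencoder detectors both score the clip as clean, and a simple average of the three scores (0.37) would fall below the threshold. Only the diffusion-specific detector catches it, which is the practical meaning of "a detector tuned for one can be blind to another".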
Frequently Asked Questions
Why do deepfakes look so realistic now?
Because generation shifted from GANs to diffusion models, video models learned to stay consistent across frames, and picture and sound are now generated together. The artifacts that used to expose fakes have been engineered away.
Can you still spot a deepfake just by looking?
Rarely. iProov's 2025 study found that only 0.1% of people could reliably tell modern AI-generated media from real. Manual inspection is now a supplement to automated detection, not a substitute for it.
What actually changed to make deepfakes better?
The move to diffusion generation, temporal consistency in video models, the separation of identity from motion, joint audio and video synthesis, and a large jump in training data and compute. The tools also got cheap enough for anyone to use.
Do the old signs like weird teeth and bad blinking still work?
No. Those were artifacts of GAN-era and early video deepfakes. Faces produced by diffusion-era models blink naturally, show realistic skin and hair, and hold their structure steady, so the old checklist gives false confidence.
Does more realism mean deepfakes cannot be detected?
No. Realistic to a human eye is not the same as invisible to a detector. Generation still leaves statistical traces in the frequency domain and in temporal patterns that automated systems can measure even when the eye cannot.
What is the difference between a GAN and a diffusion deepfake?
A GAN produces an image in a single pass through a generator that was trained by competing against a discriminator. A diffusion model builds an image gradually by removing noise step by step. Diffusion is more stable to train and currently produces higher quality, while GANs are faster at generation time.
Will deepfakes keep getting more realistic?
Almost certainly. The frontier is moving toward real-time, interactive synthesis that reacts to a live conversation. That is why defenses are shifting from human judgment toward automated detection and content provenance.
Last update: Q3 2026.







