The honest answer to "how do I spot a deepfake?" in 2026 is: humans usually can't, and the old visual heuristics no longer work. iProov's 2025 Threat Intelligence Report found that just 0.1% of participants could reliably distinguish real from AI-generated content. The unnatural-blinking and weird-teeth tips that worked on 2019-era deepfakes fail against modern diffusion-based generators. Detection has to sit in the system, not with the user.
That said, security teams, fraud investigators, journalists, and IDV operators still need to know what works in practice — both the manual signs that remain useful in specific contexts, and the automated detection methods that scale. This guide covers both. It is written for security leads, IDV product managers, and compliance officers who need a clear, current playbook rather than a list of dated warning signs.
If you're new to the topic, start with our companion guide on what a deepfake is. For the technical mechanics behind the content you're trying to detect, see how deepfakes are made.
- iProov 2025: 0.1% of humans can reliably distinguish modern synthetic content from real — manual detection alone is no longer sufficient.
- Voice deepfake attempts increased 680% year-over-year through 2024; over 10% of surveyed financial institutions report voice deepfake attacks exceeding $1M in losses.
- Modern liveness checks are no longer deepfake-proof — Microsoft Digital Defense Report 2025 confirmed AI forgeries can defeat selfie and liveness tests.
- The Arup $25.6M case is the canonical live-call failure: a multi-person video conference, all synthetic, no detection in the loop.
- Effective production detection ensembles cover four levels simultaneously: pixel/frequency analysis, temporal consistency, physiological signals (rPPG), and audio-visual coherence.
- Off-script questions remain the most powerful manual technique against live voice deepfakes — Ferrari executives used this successfully in July 2024.
Why Manual Detection Is No Longer Sufficient
Three things have changed since the early-2020s "spot the deepfake" advice was written.
Diffusion models have replaced GANs. The artifacts that diffusion models produce live in different parts of the frequency spectrum and don't show up as the visible ghosting, weird teeth, or unnatural blinking that gave away older GAN-based deepfakes. Modern outputs blink correctly, render teeth correctly, and handle lighting well.
Resolution and framerate have caught up. Open-source video models like LTX-2 generate 4K deepfakes at 50 frames per second on a single consumer GPU. The "low resolution and choppy" tell is gone.
Real-time live deepfakes are now operational. DeepFaceLive and similar tools stream synthesized faces and voices into Zoom and Teams calls with sub-100ms latency. The "post-hoc forensic analysis" model of detection no longer covers the threat surface.
The result is the iProov number: 0.1% of humans can reliably distinguish modern synthetic content from real. For practical detection, this means manual signs are useful as additional evidence in low-stakes contexts, but they cannot be the primary control for any high-value or identity-bound transaction.
Visual Signs That Still Work (Sometimes)
That said, deepfakes still fail at the edges of human behavior and physics — the unconscious micro-movements and physical interactions that are computationally expensive to render correctly. A trained reviewer, slowing the video down and watching with intent, can still catch a meaningful percentage of mid-quality deepfakes.
A few practical notes on each of the categories below.
Edges of the face. Pay attention to the jawline, hairline, and ears. Boundary regions are where blending artifacts are most likely to appear, particularly in autoencoder-based face swaps. Look for slight discoloration, mismatched texture, or ghosting where the synthesized face meets the real body.
Profile and head turns. Most face-swap models are trained primarily on front-facing data. When a synthetic face rotates to a full profile, the model has to extrapolate from less data. Watch for the face appearing to "flatten," the nose distorting, or the ear texture changing abruptly.
Eye-region micro-detail. Real eyes have subtle reflections from the environment that match the lighting in the rest of the scene. Synthetic eyes often have inconsistent or absent reflections. Spontaneous blinking has improved in modern models but micro-saccades — the small, involuntary darting movements real eyes make every few hundred milliseconds — are still hard to render.
Hand-face interaction. Real people touch their face: scratch the nose, brush back hair, rest their chin on their hand. These interactions are notoriously hard for face-swap models to handle because the hand occlusions break the model's assumption of an unobstructed face. A common modern tell is faces that distort or "shimmer" briefly as a hand passes in front.
Lighting consistency. Check that shadows on the face match the apparent light source elsewhere in the frame. Synthesized faces are often lit slightly differently from their environment.
Audio-visual sync. Lip movements should align tightly with audio phonemes. Even small misalignments — a few-frame lag, slight over-articulation — can indicate a lip-sync deepfake. A useful trick: mute the video and watch lips, then close your eyes and listen to the audio. Both should feel natural independently. Then, when combined, they should feel like one performance, not two layered tracks.
These signs catch some deepfakes. They miss most modern, well-produced ones. They should be one input among several, not the system of record.
Audio Signs of a Voice Deepfake
Voice deepfakes have become the highest-volume category of enterprise deepfake fraud. Group-IB's 2026 research found voice deepfake attempts increased 680% year-over-year through 2024, and over 10% of surveyed financial institutions have suffered deepfake voice attacks exceeding $1 million in losses.
What to listen for, in order of reliability (a minimal sketch of the first check follows the list):
- Unusually consistent prosody. Real human speech has irregular emphasis, micro-pauses, and breath sounds. Cloned voices often sound unnaturally smooth — the prosody is "too clean."
- Missing breathing and lip noises. Real recordings include breath intakes between sentences, lip smacks, and tongue clicks. Many synthetic voices omit these or place them at statistically wrong positions.
- Background acoustics that don't match the channel. A cloned voice generated from a podcast clip might retain the studio's reverberation profile when supposedly speaking on a phone call.
- Pitch contour artifacts. Subtle pitch glitches at word boundaries, especially on uncommon words or names the model wasn't trained on.
- Lack of conversational adaptation. Real humans modulate voice in response to interruption, surprise, or emotion. AI-generated voices struggle with truly unscripted real-time response.
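To make the first check concrete, here is a minimal sketch of a prosody-smoothness measurement using librosa's pyin pitch tracker. The file name and the idea of reducing prosody to a single variability statistic are illustrative assumptions; production voice detectors learn far richer features.

```python
# Minimal prosody-smoothness check: cloned voices often show
# unnaturally low pitch variability. One weak signal among many,
# never a verdict on its own.
import numpy as np
import librosa

def f0_variability(path: str) -> float:
    """Coefficient of variation of F0 over voiced frames."""
    y, sr = librosa.load(path, sr=16000)
    f0, voiced_flag, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"),
        sr=sr,
    )
    f0 = f0[voiced_flag & ~np.isnan(f0)]
    return float(np.std(f0) / np.mean(f0))

# Lower values mean flatter, "too clean" prosody.
print(f"F0 variability: {f0_variability('call_recording.wav'):.3f}")
```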
The most powerful manual technique against a live voice deepfake is the off-script question. CloudGuard's research demonstrated this in a 2026 live webinar: an unscripted question like "What's your favourite chocolate?" injected into a call elicits genuine confusion and conversational hesitation from a real person, while an attacker driving a deepfake voice in real time either freezes, evades, or produces an obviously canned response. This is the same principle Ferrari's executive used in July 2024 when challenging the suspected deepfake CEO call with a personal verification question — the attacker disconnected.
For a deeper analysis of voice cloning specifically, see our voice cloning fraud guide.
Behavioral and Contextual Verification
Independent of the media itself, context often gives the deepfake away. This is where most successful real-world detection actually happens.
Channel mismatch. A "CEO" calling on WhatsApp instead of the usual corporate Teams. A "vendor" sending wire instructions through a freshly created email domain. A request that bypasses the normal approval chain. These are red flags before any media analysis.
Urgency and authority pressure. Deepfake attacks rely on time pressure — "wire this in the next hour," "I'm in a meeting and need it now," "do not tell anyone in finance until it closes." Genuine urgent executive requests rarely circumvent established controls.
Off-band verification. The single most reliable defense is to call back through a known, trusted channel. Look up the executive in the corporate directory and call their assistant. Hang up and dial back. Send a Slack message to the same person on the verified work account. Out-of-band confirmation is the closest thing to a silver bullet that exists in 2026.
Ask about something only the real person would know. Recent meetings, internal jokes, the name of their dog. The Ferrari case is the textbook example: when the suspected deepfake voice could not answer a personal question, the attempt collapsed.
For high-value transactions, no amount of media-level scrutiny replaces process. The defenses that actually work in 2026 are layered: detection technology + verification protocols + a culture in which junior staff feel safe pausing a transaction to verify. See our CEO fraud defense guide for the operational playbook.
How Automated Deepfake Detection Works
For any production deployment — IDV onboarding, contact center fraud screening, video conferencing — the question is not "how do I train my staff to spot deepfakes?" but "how do I deploy detection that works at scale?"
Modern automated detection ensembles operate at four levels.
Pixel and frequency-domain analysis. Convolutional neural networks (CNNs) and vision transformers analyze each frame for the statistical fingerprints that generative models leave behind. GAN-generated content has periodic high-frequency artifacts from transposed convolution upsampling. Diffusion-generated content has different signatures distributed across the frequency spectrum. Effective ensembles cover all three architectures (GAN, autoencoder, diffusion) simultaneously.
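As a concrete illustration of the frequency-domain idea, here is a minimal single-frame probe that compares spectral energy in an outer high-frequency band against an inner band. The band radii and file name are assumptions for illustration; real detectors learn these signatures end to end rather than hand-coding them.

```python
# Minimal frequency-domain probe: GAN upsampling tends to leave
# periodic high-frequency residue in the 2D spectrum.
import numpy as np
from PIL import Image

def high_freq_ratio(path: str) -> float:
    img = np.asarray(Image.open(path).convert("L"), dtype=np.float32)
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    h, w = spectrum.shape
    yy, xx = np.ogrid[:h, :w]
    # Radius from the spectrum center, normalized so 1.0 reaches the edge.
    r = np.hypot(yy - h / 2, xx - w / 2) / (min(h, w) / 2)
    low = spectrum[r < 0.25].sum()
    high = spectrum[(r >= 0.75) & (r <= 1.0)].sum()
    return float(high / low)

# Compare against the same statistic on known-genuine frames from the
# same camera and codec; absolute thresholds do not transfer.
print(high_freq_ratio("frame_0001.png"))  # hypothetical frame
```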
Temporal consistency analysis. Across consecutive frames, real video has predictable motion fields, optical flow, and identity stability. Deepfakes — particularly per-frame face swaps — exhibit micro-flickering, identity drift, and unnatural temporal correlations. Models like DPNet learn dynamic prototypes of these temporal inconsistencies.
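A crude but illustrative proxy for this signal is the variance of inter-frame change, since per-frame face swaps tend to flicker. The sketch below uses OpenCV frame differencing; it is a stand-in for learned approaches like DPNet, not a reimplementation of them.

```python
# Crude temporal-consistency proxy: micro-flickering shows up as
# elevated variance in frame-to-frame pixel change.
import cv2
import numpy as np

def flicker_score(path: str, max_frames: int = 300) -> float:
    cap = cv2.VideoCapture(path)
    prev, diffs = None, []
    while len(diffs) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if prev is not None:
            diffs.append(float(np.mean(np.abs(gray - prev))))
        prev = gray
    cap.release()
    return float(np.var(diffs))

print(flicker_score("suspect_clip.mp4"))  # hypothetical input file
```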
Physiological signal analysis. Real human faces exhibit sub-pixel color variations caused by blood flow under the skin (photoplethysmography, or rPPG). These pulses match the heart rate of a real person and are extremely difficult for generative models to replicate consistently across a long video. Intel's FakeCatcher and similar systems rely on rPPG-based detection.
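The core rPPG idea fits in a short sketch: track average green-channel intensity over a detected face, band-pass to plausible heart rates, and measure how pronounced the resulting spectral peak is. The Haar face detector and fixed frame rate are simplifying assumptions; FakeCatcher-class systems use far more robust signal extraction.

```python
# Minimal rPPG sketch: real faces show a dominant pulse peak in the
# 0.7-4.0 Hz band (42-240 bpm); synthetic faces tend to be flat there.
import cv2
import numpy as np
from scipy.signal import butter, filtfilt

def pulse_peak_strength(path: str, fps: float = 30.0) -> float:
    cap = cv2.VideoCapture(path)
    face = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    trace = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        boxes = face.detectMultiScale(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        if len(boxes):
            x, y, w, h = boxes[0]
            trace.append(frame[y:y + h, x:x + w, 1].mean())  # green channel
    cap.release()
    sig = np.asarray(trace) - np.mean(trace)  # needs a few seconds of video
    nyq = fps / 2
    b, a = butter(3, [0.7 / nyq, 4.0 / nyq], btype="band")
    spectrum = np.abs(np.fft.rfft(filtfilt(b, a, sig)))
    return float(spectrum.max() / spectrum.mean())  # peakiness of pulse band
```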
Audio-visual coherence. For multimodal content, models check that lip movements, jaw motion, and facial micro-expressions match the spectral content of the audio. Mismatches at the millisecond level are highly indicative of separately generated audio and video.
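A minimal version of this check: cross-correlate the audio loudness envelope with motion energy in the lower half of the face and report the best-fit lag. This sketch assumes an ffmpeg-backed librosa install (to decode audio straight from the video file) and a face detected in nearly every frame; it illustrates the principle, not a production lip-sync model.

```python
# Minimal audio-visual coherence probe. A coherent recording peaks
# near zero lag; separately generated audio and video correlates
# weakly or at a large offset.
import cv2
import numpy as np
import librosa

def av_lag_frames(video_path: str, fps: float = 30.0) -> int:
    # One audio RMS value per video frame.
    y, sr = librosa.load(video_path, sr=16000)
    rms = librosa.feature.rms(y=y, hop_length=int(sr / fps))[0]

    cap = cv2.VideoCapture(video_path)
    face = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    prev, motion = None, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        boxes = face.detectMultiScale(gray)
        if not len(boxes):
            continue  # alignment degrades if many frames are skipped
        x, yb, w, h = boxes[0]
        mouth = gray[yb + h // 2:yb + h, x:x + w].astype(np.float32)
        if prev is not None and prev.shape == mouth.shape:
            motion.append(float(np.mean(np.abs(mouth - prev))))
        prev = mouth
    cap.release()

    n = min(len(rms), len(motion))
    a = rms[:n] - rms[:n].mean()
    v = np.asarray(motion[:n]) - np.mean(motion[:n])
    corr = np.correlate(a, v, mode="full")
    return int(np.argmax(corr) - (n - 1))  # best-fit lag, in frames
```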
The best 2026 production systems combine all four. A detector that relies on only one signal is brittle: a generator that addresses that one signal can defeat it. Ensembles that correlate across signals are substantially harder to fool.
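The fusion step itself can be as simple as a weighted combination of calibrated per-level scores. The weights below are placeholders, shown only to make the brittleness argument concrete; in production they come from benchmarking against current-generator content.

```python
# Score-fusion sketch: combine four per-level scores, each in [0, 1],
# higher = more likely synthetic. Weights are illustrative.
import numpy as np

LEVELS = ("frequency", "temporal", "rppg", "av_sync")
WEIGHTS = np.array([0.3, 0.3, 0.2, 0.2])  # placeholder, not calibrated

def fuse(scores: dict[str, float]) -> float:
    return float(WEIGHTS @ np.array([scores[k] for k in LEVELS]))

# A generator that suppresses one signal (here av_sync) still trips
# the ensemble because the other three remain elevated.
print(fuse({"frequency": 0.9, "temporal": 0.7, "rppg": 0.8, "av_sync": 0.1}))
```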
Detection Methods Compared
A useful way to organize the detection landscape is by what kind of deepfake each method is designed to catch and where it fits in a security stack.
The choice of method depends on the deployment context. A bank running KYC onboarding needs frame-level image and document analysis with sub-second latency. A contact center handling fraud calls needs voice deepfake detection on streaming audio. A corporate video conferencing platform needs both, in real time, on consumer-grade endpoints.
For accuracy benchmarks across these methods in production conditions — not lab conditions — see our deepfake detection accuracy analysis.
Detection in Live Video Calls vs. Recorded Media
A subtle but important distinction: detecting a deepfake in a recorded video file is a different problem from detecting one in a live call.
Recorded media can be analyzed with full access to all frames, all audio, and unlimited compute. The Microsoft Video Authenticator–style approach: ingest the file, run frame-by-frame and clip-level analysis, return a confidence score with explanations. Latency is not a constraint.
Live calls are a streaming problem. Frames arrive at 30 or 60 fps; audio arrives in 20ms chunks. The detector has to flag synthetic content within a window short enough to alert the human user before damage is done. This means lighter models, smaller temporal windows, and graceful degradation: a live detector that misses 5% of subtle artifacts is acceptable; one that is three seconds late is useless.
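The shape of the constraint is easier to see in code. In this skeleton, `score_window` is a placeholder for whatever lightweight model is deployed, and the window size, latency budget, and threshold are illustrative values, not recommendations.

```python
# Streaming-detection skeleton: score a short sliding window of frames
# on a fixed latency budget, and alert before damage is done.
import collections
import time

WINDOW = 30            # frames, ~1 s at 30 fps
LATENCY_BUDGET = 0.5   # seconds from ingest to verdict
THRESHOLD = 0.8

def score_window(frames) -> float:
    return 0.0  # placeholder: swap in the deployed lightweight detector

buffer = collections.deque(maxlen=WINDOW)

def on_frame(frame, alert) -> None:
    buffer.append(frame)
    if len(buffer) < WINDOW:
        return
    start = time.monotonic()
    score = score_window(list(buffer))
    if time.monotonic() - start > LATENCY_BUDGET:
        # Degrade gracefully: shrink the model or the window rather
        # than silently falling behind the live call.
        return
    if score > THRESHOLD:
        alert(score)  # warn participants while it still matters
```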
The Arup $25.6M case is the canonical live-call failure: a multi-person video conference, all synthetic, no detection in the loop. The lesson industry has taken from it is that live-call deepfake detection has to be a primary product category, not a post-hoc forensic capability.
Common Mistakes Security Teams Make
Common patterns we see in audits and incident reviews:
Treating deepfake detection as a one-time vendor decision. Detection accuracy degrades against new generators within months. A 2024 detector tested against 2024 content is not a 2026 detector. Continuous retraining and benchmarking against current generators is essential.
Assuming liveness checks are deepfake-proof. They aren't. Microsoft's Digital Defense Report 2025 confirmed that AI-driven identity forgeries are now "convincing enough to defeat selfie checks and liveness tests." Liveness is a useful layer; it is not, on its own, a deepfake defense.
Relying on a single signal. A pure frequency-domain detector fails on diffusion content. A pure rPPG detector fails on stills. A pure audio detector misses video-only attacks. Ensemble detectors that cross-correlate signals are substantially more robust.
Confusing lab accuracy with production accuracy. A model that hits 99% on FaceForensics++ may drop to 70% on adversarial real-world traffic. The lab-to-production gap is the single most common source of overconfidence in vendor evaluations.
Ignoring explainability. A "trust us, it's a deepfake" verdict is operationally insufficient. Compliance officers, courts, and insurance carriers increasingly need to see why the system flagged the content. Black-box scores are not defensible. See our explainable AI in deepfake detection guide for what regulators and auditors are now expecting.
Skipping out-of-band verification. Even the best detector should be paired with a process control: any high-value transaction or sensitive request gets verified through a second, trusted channel. Detection technology buys time and signal; process buys insurance.
A Practical Detection Playbook for Enterprises
The defenses that actually hold up in 2026 are layered. A working playbook for enterprise deepfake defense:
- Deploy automated detection at every identity-trust boundary — IDV onboarding, contact center authentication, video conferencing for sensitive meetings, executive communications. Use ensembles that cover GAN, autoencoder, and diffusion content, with audio detection on the voice channel.
- Mandate out-of-band verification for any wire transfer above a defined threshold, any change to vendor banking details, and any executive request that bypasses normal approval. Voice-callback or known-device confirmation should be policy; a minimal policy-gate sketch follows this list.
- Train staff on the threat model, not the visual signs. Teach them to recognize channel mismatch and urgency pressure, and to feel safe escalating. The Arup employee did exactly the right thing — calling a video meeting to verify — and was still defeated. The miss was structural, not individual.
- Use explainable detection. Every high-confidence flag should come with a visualization of what drove the verdict. This is essential for analyst review, regulatory disclosure, and legal proceedings.
- Re-benchmark quarterly. Generation models evolve faster than annual procurement cycles. Ask vendors for accuracy data on content from the last 90 days, not aggregate numbers from a year-old benchmark.
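The out-of-band rule in the second item can be encoded as a simple policy gate. The threshold, trigger list, and request fields below are illustrative knobs, not recommended values.

```python
# Policy-gate sketch for the out-of-band verification rule.
WIRE_THRESHOLD_USD = 50_000  # illustrative threshold

def requires_out_of_band(request: dict) -> bool:
    return (
        (request.get("type") == "wire_transfer"
         and request.get("amount_usd", 0) >= WIRE_THRESHOLD_USD)
        or request.get("type") == "vendor_banking_change"
        or request.get("bypasses_approval_chain", False)
    )

req = {"type": "wire_transfer", "amount_usd": 120_000}
if requires_out_of_band(req):
    print("Hold: verify via a known callback channel before release.")
```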
DuckDuckGoose's DeepDetector is built around this layered model: ensemble detection covering GAN, autoencoder, and diffusion content, with per-frame explainability that shows analysts exactly which spatial regions triggered the flag.
FAQ
Can I spot a deepfake just by looking at it? In most cases, no. iProov's 2025 study found only 0.1% of participants could reliably distinguish modern AI-generated content from real. Older signs like unnatural blinking and weird teeth no longer apply to diffusion-era deepfakes. Manual detection is supplementary to automated tools, not a replacement.
What's the most reliable way to detect a deepfake video? A combination of automated ensemble detection (covering GAN, autoencoder, and diffusion artifacts) plus out-of-band verification of the speaker's identity through a trusted channel. No single signal is reliable on its own.
Are there free tools to check if a video is a deepfake? Yes — examples include the ScreenApp AI Video Detector, Deepware Scanner, and several browser extensions. Their accuracy is best treated as a triage signal: useful for flagging suspicious content for further review, not as a final verdict for high-value transactions. Enterprise-grade tools like DuckDuckGoose's DeepDetector and Microsoft Video Authenticator offer significantly higher accuracy and explainability.
How can I tell if a voice on a call is cloned? Listen for unusually smooth prosody, missing breath sounds, inconsistent background acoustics, and inability to handle truly unscripted off-topic questions. The strongest manual technique is asking something the real person would know — a recent shared experience, a personal detail — and gauging the response. For systematic detection, automated voice deepfake tools running on the audio stream are required.
Do liveness checks stop deepfakes? Not reliably anymore. Modern AI-generated faces can defeat traditional selfie and liveness checks. Microsoft's Digital Defense Report 2025 confirmed this directly. Liveness remains a useful layer in a defense-in-depth strategy but is no longer sufficient on its own.
What should I do if I think I've received a deepfake call? Hang up and call back through a known, trusted channel — the corporate directory number, an established Slack DM, an in-person check. Do not authorize any transaction or share any sensitive information based on the suspicious call. Report the incident to your security team for forensic analysis.
How accurate are automated deepfake detectors? It depends sharply on whether you're measuring lab accuracy or production accuracy. On academic benchmarks, top systems exceed 95%. On adversarial real-world content, accuracy is typically lower — often 70–90% depending on generator coverage. The lab-to-production gap is the central operational challenge for the category.
Can deepfakes be detected in real time during a Zoom or Teams call? Yes — purpose-built live detection systems analyze incoming video and audio streams with sub-second latency and can alert participants or moderators when synthetic content is detected. This is a relatively new product category, driven directly by the Arup-style multi-person video deepfake threat. For more detail see our explainable AI in deepfake detection guide.
Detection that scales requires automation, ensemble coverage, and explainability. DuckDuckGoose's DeepDetector gives security teams a real-time deepfake verdict alongside the visual evidence that justifies it. Explore DeepDetector or talk to our team about deployment in your IDV pipeline.
Last update: Q2 2026.