How Long to Make a Deepfake? 2026 Guide

Table of Content

No items found.

In 2026, the fastest realistic answer to how long does it take to make a deepfake is under 90 seconds for a static face swap from one photo, and zero added time for a real-time face swap on a live video feed. The full range stretches from milliseconds (live webcam swaps) to a week (high-end DeepFaceLab training), but almost every deepfake used in fraud now sits in the seconds-to-minutes band.

The compression of that timeline is the real story. A workflow that needed weeks of training and a hard drive of source footage in 2017 now needs one LinkedIn headshot and a free web app. This guide walks through what is possible in each time tier, what input is required, and why a single photo is now enough for most attack scenarios.

Under 90 seconds is the typical time to make a static face-swap deepfake from one photo on a consumer web app in 2026.
Zero added latency: real-time face swaps on a live video feed run at 30 to 50 milliseconds per frame, imperceptible inside a call.
One photo is now the floor for the majority of consumer deepfake workflows, including face swaps, talking heads, and live webcam swaps.
2 to 10 minutes is the typical time for a talking-head video from one photo plus an audio file or script.
30 to 60 seconds of clean target audio is enough to clone a voice; output generation runs in seconds per clip.
24 hours to 7 days of GPU training is still required for the highest-end DeepFaceLab deepfakes, but almost no fraud uses that workflow.
Approximately 70 percent training-time reduction is achievable with pretrained base models released in late 2025.
The defender response window has to compress in parallel. A 90-second generation cycle invalidates hour-long detection cycles.

Quick answer by time tier

The cleanest way to think about modern deepfake creation time is by tier rather than by tool. Each tier maps to a different category of attack scenario and a different set of defensive assumptions.

‍

By time tier · 2026

How long a deepfake takes, by tier

The full range runs from milliseconds to a week — but almost every deepfake used in fraud sits in the seconds-to-minutes band. Longer bars mean more time and more source material; the fraud-relevant tiers are the shortest ones.

<50ms / frame

Real-time

Live webcam face swap

Video-KYC bypass, live CEO-fraud calls

<1minute

Static face swap, voice clone

Synthetic ID selfies, scam profiles

1–10minutes

Talking-head, short video swap

BEC video attachments, testimonials

10m–1hmulti-image

Higher-fidelity swap (5–50 imgs)

Impersonation for video evidence

1–24hours

Custom-trained on small dataset

Targeted social-engineering

1–7days

Production DeepFaceLab

Influence ops, pro VFX — rarely fraud

The fastest tiers are now good enough for the attacks that matter — so almost no fraud bothers with the week-long workflow.

Time Tier	What Is Possible	Source Needed	Typical Attack Scenario
Zero added time (real-time)	Live webcam face swap, real-time voice clone	1 photo plus 30 to 60 seconds of audio	Video KYC bypass, live CEO-fraud calls
Under 1 minute	Static face swap, voice clone output, mobile swap	1 photo	Synthetic ID document selfies, romance scam profiles
1 to 10 minutes	Talking-head video, short video face swap	1 photo plus audio or script	BEC video attachments, fake testimonial videos
10 minutes to 1 hour	Multi-image swap, longer video clips	5 to 50 photos or short clips	Higher-fidelity impersonation for video evidence
1 to 24 hours	Custom-trained face swap on small dataset	Hundreds of frames of target	Targeted social-engineering campaigns
1 to 7 days	Production-grade DeepFaceLab face swap	Thousands of frames plus high-end GPU	High-value influence operations, professional VFX

Table 1: Deepfake creation time tiers in 2026, the input each one needs, and the attack scenarios each one enables.

The rest of this guide expands each tier and explains what changed to make it possible.

‍

How deepfake creation time collapsed from years to seconds

The shift from weeks-of-training to seconds-on-a-phone happened in roughly five steps over nine years. Understanding that trajectory matters because it explains why detection has to assume the floor will keep dropping.

‍

Nine years · weeks → real-time

Each generation cut creation time by an order of magnitude

And reduced the source material at the same time. Weeks of training and a hard drive of footage in 2017 became one headshot and a free web app — with no obvious reason for the trajectory to reverse.

2017

First GAN face-swap tools

Source: hundreds of photos & videos

Weeksof training

2019

Samsung MegaPortraits one-shot demo

Source: single photo (research-grade)

Daysresearch

2021

DeepFaceLab mainstream adoption

Source: tens to hundreds of frames

Hours–daystraining

2023

Consumer cloud apps go mainstream

Source: 1 photo + 60s audio

~8 minend to end

2025

Free web face-swap apps at scale

Source: one photo

<1 minstatic swap

2026

Open-source live swap on consumer GPUs

Source: one photo

<50 msper frame · real-time

The defensive takeaway isn't any single year — it's the slope. Detection has to assume the floor keeps dropping and design for whatever comes after real-time.

Year	Typical Creation Time	Source Needed	Key Development
2017	Weeks of training	Hundreds of photos and videos	First widespread GAN-based face swap tools
2019	Days	Single photo (research-grade)	Samsung MegaPortraits one-shot demo
2021	Hours to days	Tens to hundreds of frames	DeepFaceLab mainstream adoption
2023	Approximately 8 minutes	One photo plus 60 seconds of audio	Consumer cloud apps go mainstream
2025	Under 1 minute	One photo	Free web face-swap apps at scale
2026	Real-time (sub-50ms per frame)	One photo	Open-source live face swap on consumer GPUs

Table 2: How deepfake creation time has compressed from weeks to real-time over nine years.

The pattern is consistent: each generation of tooling cut creation time by an order of magnitude and reduced the source-material requirement at the same time. There is no obvious reason to expect that trajectory to reverse.

‍

Can you make a deepfake from one photo?

Yes. As of 2026, a single high-quality photo is enough input for the majority of consumer-grade deepfake workflows, including static face swaps, talking-head videos, and real-time webcam face swaps. This is the single most important capability shift of the last three years and deserves its own section.

The research goes back further than the productisation. Samsung Labs demonstrated single-image talking-head generation in 2019 with its MegaPortraits work, showing that an avatar could be animated from a single still frame or even a painting. Academic papers on one-shot 3D avatar generation continued through 2024. What changed between research and 2026 reality is that these capabilities have been packaged into free or low-cost web apps, open-source repositories, and mobile apps that anyone can run.

The trade-off for single-image input is a small quality penalty. A one-photo deepfake retains slightly more of the underlying actor's facial geometry, lighting can be inconsistent with the target environment, and small unique features (a mole, an asymmetric smile, a specific tooth gap) are sometimes lost or hallucinated. At forensic resolution these flaws are detectable. At video-call resolution, on a mobile screen, in a two-second KYC selfie check, they often are not.

The practical consequence: any photo of a target person that exists on the public internet is now a viable input for a deepfake of that person. The defensive assumption has to be that target faces are public and that the attacker has them.

‍

Zero added time: real-time deepfakes on live video

The category that has shifted most aggressively since 2024 is real-time face swap on a live video feed. Open-source projects let a user load a single source image and apply face swap to a webcam stream with no batch generation step at all. On a current consumer NVIDIA GPU, the swap adds 30 to 50 milliseconds of latency per frame, which is imperceptible inside a video call.

There is no meaningful generation time for this category. The time-to-first-fake is whatever it takes to launch the application and select a source photo. From a clean install on a fresh machine, a technical user can have a live face swap running in 15 to 30 minutes. From an existing setup, time-to-fake is seconds.

This is the category most relevant to live identity verification, video KYC, and CEO-fraud video calls. Any defensive control that treats a face on a live video stream as proof that a real human is on the other end is now invalidated by default.

‍

Under one minute: single-image face swaps and voice clones

The bulk of consumer-tier deepfake creation sits in the sub-minute band.

Static image face swap from one photo runs in 30 seconds to 2 minutes on most web platforms. The user uploads a source photo and a target image; the model performs landmark detection, alignment, swap, and blending in a single pass. Free-tier queues can push wait times to 3 to 5 minutes, but the underlying compute is sub-minute. Quality is good enough to fool a casual viewer, particularly when source and target have similar lighting and angle.

Voice clone output runs in seconds per clip once the model has been trained on a sample. Voice cloning typically requires 30 to 60 seconds of clean audio of the target. Commercial platforms now offer real-time streaming voice clones with sub-second latency, which is the technology enabling synthetic-voice phishing calls.

Mobile face swap apps run swaps in seconds to a minute on a phone, either locally or via cloud APIs. Quality is lower than dedicated web tools but is sufficient for social media, meme content, and many fraud-relevant scenarios.

‍

1 to 10 minutes: talking heads and short video face swaps

The next tier up covers most AI avatar and short video swap workflows.

Talking-head video from one photo combines a still image with an audio file or script and produces a lip-synced video of the photo subject speaking. End-to-end times are 2 to 5 minutes for clips up to one minute and 5 to 10 minutes for longer outputs. This is the workflow most relevant to business-email-compromise attacks that include a video element: one LinkedIn-grade photo, one earnings-call snippet for voice cloning, and a script are enough.

Short video face swap from one image runs in 1 to 10 minutes for 5 to 30-second clips. The model applies the swap frame by frame and maintains temporal consistency, which is why it takes longer than a single-image swap. Time scales roughly linearly with clip duration; a one-minute video might take 10 to 30 minutes end to end.

‍

10 minutes to a few hours: higher-fidelity face swaps with more source material

When an attacker has more source material (a folder of selfies, a few short videos), tools like DeepSwap and Deepfakes Web can deliver higher-fidelity output by selecting better frames and averaging across them. End-to-end time is in the minutes-to-hours range, dominated by upload, queue, and rendering rather than training.

Output quality is meaningfully better than the single-image case, especially for difficult angles, expressions, and lighting changes. For high-value targets where the attacker can afford the extra effort, this tier is the practical sweet spot.

‍

Hours to a week: high-end DeepFaceLab training

The original Hollywood-grade workflow is still in use for the highest-quality face swap videos, the ones designed to survive close inspection on a 4K screen. DeepFaceLab and similar frameworks require four stages:

Data collection of hundreds to thousands of frames of the target, which can take hours to days depending on how much footage already exists.
Face extraction and alignment, typically 1 to 4 hours of automated processing.
Model training on a high-end consumer GPU, which is the dominant cost. Practitioner guides recommend at least 24 hours of training for clean previews and 7 days for production-quality results.
Merging and post-processing, another 1 to 4 hours.

Pretrained base models released in late 2025 cut total training time by roughly 70 percent by giving the system a head start on facial geometry. Even so, this is the upper bound of deepfake creation time, and the point worth noting is that almost no fraud actually uses this workflow. The faster tiers are now good enough for the attack scenarios that matter.

‍

What changes when generation is this fast?

The compression of deepfake creation time has three direct consequences for the IDV, security, and fraud-detection stack.

‍

The clock both sides race

A 90-second attack breaks an hour-long defense

The real story of faster generation isn't quality — it's timing. When a usable deepfake takes 90 seconds and runs inside a 5-minute fraud call, any detection that flags it an hour later has already missed.

Attacker

90sec

to generate a usable deepfake from one photo — or zero, live

Defender

<1sec

scoring required at the point of upload or live stream

Detection latency had to change

~1 hour cycle→sub-second

The defender's response time has to compress in parallel with the attacker's generation time. That means inline detection on every transaction — not periodic sampling — scoring frames in milliseconds, not minutes.

Defender Challenge	What Changed	What Is Now Required
Liveness detection	Real-time face swap runs on live video feeds with no latency	Presentation-attack detection plus injection-attack detection on every stream
Attack volume	90-second generation enables high-volume opportunistic attacks	Inline detection at every transaction, not periodic sampling
Source intelligence	One photo is enough; any public LinkedIn image qualifies	Assume target faces are public and attacker already has them
Detection latency	Hour-long detection cycles miss the attack window entirely	Sub-second scoring at point of upload or live stream

Table 3: Four defender challenges created by the compression of deepfake creation time, and what is now required to address them.

The common thread across all three is that the defender's response time has to compress in parallel with the attacker's generation time. If a deepfake can be made in 90 seconds and used inside a 5-minute fraud call, a detection pipeline that flags it an hour later is no longer part of the defence. DuckDuckGoose's DeepDetector is built around this constraint, scoring images and video frames at the point of upload or live stream in milliseconds rather than minutes.

For the broader context on attack volume, financial losses, and detection-accuracy benchmarks, see our Deepfake Statistics for 2026 breakdown.

‍

Frequently asked questions

How long does it take to make a deepfake from one photo?

Under 90 seconds for a static image face swap on a consumer web app, 1 to 10 minutes for a short video face swap, and 2 to 10 minutes for a talking-head video from one photo plus an audio file. Real-time face swaps from a single image run at live video frame rates with no batch generation step.

How many photos do you need to make a deepfake?

One photo is now the floor for most consumer deepfake tools. More photos improve quality, especially for tricky angles and expressions, but they are not required. The single-photo workflow is what turns deepfakes from a targeted-attack problem into a mass-scale fraud problem.

What is the fastest deepfake method in 2026?

Real-time face swap on a live video feed is the fastest, with sub-50ms latency per frame and no rendering step at all. For static outputs, single-image face-swap web apps finish in 30 seconds to 2 minutes.

How fast can a deepfake video be made?

A 5 to 30-second face-swap video from one source photo takes 1 to 10 minutes on a consumer cloud app. A 30 to 60-second talking-head video from one photo plus audio takes 2 to 10 minutes. A live video face swap has no batch generation step and produces output in real time.

How long does it take to train a deepfake with DeepFaceLab?

Several hours at minimum for low-resolution test outputs, and typically 24 hours to 7 days of continuous GPU time for production-quality results. Pretrained base models can cut total training time by roughly 70 percent. This is the upper bound of deepfake creation time, not the typical case.

How long does it take to clone someone's voice?

Voice cloning typically requires 30 to 60 seconds of clean audio of the target. Once the model is trained, generation per output clip is seconds. Some commercial platforms support real-time streaming voice clones with sub-second latency.

Can you make a deepfake on a phone?

Yes. Mobile apps run face swaps either locally on the device or via cloud APIs and produce finished outputs in seconds to a minute. Quality is lower than desktop web-app workflows but is sufficient for social media, memes, and many fraud-relevant scenarios.

Do you need a powerful computer to make a deepfake?

Not for the consumer tier. Web-based face-swap tools run the model on the platform's servers, so any modern phone or laptop can drive the workflow. A powerful GPU is only needed for local training-based workflows like DeepFaceLab or for running real-time face-swap projects locally.

How realistic are sub-minute deepfakes?

Realistic enough to fool a casual human reviewer in a short interaction, particularly on a small screen or in a brief video call. They are not realistic enough to survive forensic analysis or trained-reviewer scrutiny. The relevant question for defenders is not whether a deepfake would fool an expert with unlimited time; it is whether it would fool a customer-service agent in a 30-second call or an automated liveness check in two seconds.

How Long Does It Take to Make a Deepfake? A 2026 Breakdown