How Are Deepfakes Made? A Technical Guide for 2026

Deepfakes are generated by neural networks — autoencoders, GANs, and diffusion models — trained on a target's appearance or voice. In 2026, a single consumer GPU can produce 4K deepfake video in minutes. This guide explains how each architecture works and what it means for detection.

Deepfakes are generated by neural networks that learn the appearance, voice, or behavior of a target from training data, then produce new outputs that match those patterns. In 2026, the dominant architectures are autoencoders, generative adversarial networks (GANs), and diffusion models — with diffusion models now leading on quality for image and video synthesis. This guide explains how each architecture works, where they differ, what hardware and data they require, and which one is producing the deepfakes your security team is most likely to encounter today.

We wrote this for security architects, IDV product managers, ML engineers, and compliance officers who need a working technical understanding without wading through 50-page survey papers. If you're newer to the topic, start with our companion guide on what a deepfake is.

  • Modern voice cloning needs as little as three seconds of clean reference audio to produce a convincing clone.
  • A single RTX 4090 (~$1,800) can generate 4K-resolution face-swap video at near-real-time framerates in 2026.
  • Diffusion models — Stable Diffusion, Sora, Runway Gen-3, LTX-2 — have replaced GANs as the dominant deepfake generation architecture since 2023.
  • Deepfake files online grew from approximately 500,000 in 2023 to a projected 8 million in 2025 (Europol IOCTA 2025).
  • Voice deepfakes increased 680% year-over-year in 2024 (Group-IB) — the highest-volume enterprise deepfake category.
  • Wall-clock time for a 30-second targeted deepfake video with cloned voice is 10–30 minutes on a single consumer GPU.

What It Takes to Make a Deepfake

Every deepfake, regardless of architecture, follows the same three-stage pattern.

Stage 1: Data collection. A dataset of the target is gathered — public videos, social media images, podcast audio, earnings call recordings, anything that captures the target's appearance, voice, or movement. Modern voice cloning needs as little as three seconds of clean audio. Modern image-conditioned face-swap needs a single reference photo. Higher-fidelity outputs still benefit from larger datasets, but the floor has dropped dramatically.

Stage 2: Model training (or conditioning). The neural network is either trained from scratch on the target data, or — increasingly — a pre-trained foundation model is conditioned at inference time on a few reference samples. This second approach is what collapsed the barrier to entry: you no longer need a GPU cluster and a week of training to produce a convincing deepfake.

Stage 3: Generation and post-processing. The model produces synthetic frames or audio samples, which are then blended with the source media (face boundary blending, audio loudness matching, lighting correction) and exported.

The three stages have not changed since 2017. What has changed is the architecture inside Stage 2 — and what that architecture demands in data, hardware, and time is, in 2026, radically different from what it was three years ago.

The Three Core Architectures

Deepfake research has converged on three primary architectures. Each operates on a different mathematical principle, but all three solve the same underlying problem: learn a probability distribution over images or audio, and sample from it.

| Architecture | Core Idea | Strengths | Era of Dominance |
|---|---|---|---|
| Autoencoder (paired) | Shared encoder + two decoders learn common face structure and per-identity reconstruction. | Simple, fast inference, good for static frontal faces | 2017–2020 (still in hobbyist tools) |
| Generative Adversarial Network (GAN) | A generator and a discriminator train against each other; generator outputs become indistinguishable from real samples. | Sharp outputs, fast inference, photorealistic faces | 2018–2022 |
| Diffusion model | Train a denoiser to reverse a noise-addition process; sample by iteratively denoising pure noise. | Highest quality, stable training, clean text/image conditioning | 2023–present (current state of the art) |
| Autoregressive transformer (audio) | Tokenize audio with a neural codec; transformer predicts next token conditioned on speaker reference. | Few-shot voice cloning from 3 seconds of audio | 2022–present (voice cloning) |

Table 1: The dominant deepfake generation architectures, their core mechanisms, and historical context.

The trajectory across 2017–2026 is clear: autoencoders dominated the early "FaceSwap"-style era, GANs took over for high-fidelity face synthesis through 2022, and diffusion models now lead on virtually every quality benchmark for image and video. Voice cloning is its own track, dominated by autoregressive transformers and flow-matching models.

Autoencoders: The Original Face-Swap Method

The original 2017 deepfakes were built on a paired autoencoder architecture. The intuition is elegant: if two faces share a common encoder but have separate decoders, the encoder learns a compressed representation of "face features in general" — pose, expression, lighting — while each decoder learns to reconstruct one specific person's face from that shared representation.

In practice:

  1. Encoder E takes a face image and produces a latent vector — a compressed numeric representation of pose, expression, and structure.
  2. Decoder D_A is trained to reconstruct Person A's face from that latent.
  3. Decoder D_B is trained to reconstruct Person B's face from that latent.
  4. To swap A's face onto B's body: pass B's video frame through E to get the latent (capturing B's pose and expression), then pass that latent through D_A to render A's face in B's pose.
  5. Blend the synthesized face into B's video using Poisson image editing or similar boundary-blending techniques.

Autoencoder face-swap remains the dominant technique behind hobbyist tools like FaceSwap and DeepFaceLab. It works well for static faces in controlled lighting, struggles with profile views (training data is mostly frontal), and produces visible boundary artifacts at the chin and hairline that detectors have learned to exploit. For a discussion of these telltale signs, see our guide on how to spot a deepfake.
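
To make the shared-encoder, dual-decoder idea concrete, here is a minimal PyTorch sketch. The layer sizes, the 64×64 crop resolution, and the toy input are illustrative assumptions, not any specific tool's implementation:

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Compresses any aligned face crop into a latent encoding pose,
    expression, and lighting -- but not identity-specific texture."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64x64 -> 32x32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32x32 -> 16x16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16x16 -> 8x8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

class IdentityDecoder(nn.Module):
    """One decoder per identity: reconstructs that person's face from the latent."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 128 * 8 * 8)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(self.fc(z).view(-1, 128, 8, 8))

encoder, decoder_a, decoder_b = SharedEncoder(), IdentityDecoder(), IdentityDecoder()

# Training: each decoder learns to reconstruct its own identity from the shared latent.
# Swapping: encode a frame of person B, then decode with A's decoder to render A in B's pose.
frame_of_b = torch.rand(1, 3, 64, 64)       # placeholder 64x64 face crop of person B
swapped = decoder_a(encoder(frame_of_b))    # A's face, B's pose and expression
```

Boundary blending (step 5 above) happens outside the network, in classical image-processing code.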

Generative Adversarial Networks (GANs)

GANs were introduced by Ian Goodfellow in 2014 and powered the second wave of deepfakes from roughly 2018 through 2022. The architecture pits two networks against each other:

  • A generator G takes random noise (and optionally an identity vector) and produces a candidate image.
  • A discriminator D receives both real images from the training set and generated images from G, and must classify each as real or fake.
  • The two networks train simultaneously: D tries to get better at spotting fakes, G tries to get better at fooling D. They reach a competitive equilibrium where G's outputs are statistically indistinguishable from real samples.
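
That adversarial loop can be sketched in a few lines of PyTorch. This is a toy, fully-connected version with placeholder data; production face generators such as StyleGAN are convolutional and far larger, but the alternating objective is the same:

```python
import torch
import torch.nn as nn

latent_dim, img_dim = 64, 3 * 64 * 64    # toy sizes for illustration only

G = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()
real_batch = torch.rand(16, img_dim) * 2 - 1   # placeholder for a batch of real face images

for step in range(1_000):
    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    fake_batch = G(torch.randn(16, latent_dim)).detach()
    d_loss = bce(D(real_batch), torch.ones(16, 1)) + bce(D(fake_batch), torch.zeros(16, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: push D(G(noise)) toward 1, i.e. try to fool the discriminator.
    g_loss = bce(D(G(torch.randn(16, latent_dim))), torch.ones(16, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```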

GANs produced the first photorealistic synthetic faces (StyleGAN, StyleGAN2, StyleGAN3 from NVIDIA) and powered face-swap pipelines like FaceShifter, SimSwap, and InsightFace's swap models. GANs are fast at inference (one forward pass through G produces an image), train in a single end-to-end loop, and produce extremely sharp outputs.

Their weaknesses: training is unstable (mode collapse, vanishing gradients), they struggle to cover the full diversity of real-world data, and the generator's upsampling layers leave detectable statistical fingerprints in the frequency domain. Most commercial deepfake detectors trained before 2023 are essentially GAN-fingerprint detectors. Those detectors proved brittle in production and are largely obsolete against diffusion-based content — which is part of the lab-to-production accuracy gap the industry has spent the last 18 months closing.

Diffusion Models: The 2026 State of the Art

Diffusion models — the architecture behind Stable Diffusion, DALL·E, Midjourney, Sora, Runway Gen-3, and LTX-2 — have become the dominant deepfake generation technique since 2023. The mathematical intuition runs in reverse compared to GANs:

  1. Forward process: take a real image and incrementally add Gaussian noise over many steps until it is pure noise. This trajectory is fixed and known.
  2. Reverse process: train a neural network — a denoising U-Net or transformer — to reverse one step of noising. Given a slightly-noisy image, predict what the slightly-less-noisy version looks like.
  3. Generation: start with pure noise. Apply the trained denoiser repeatedly (typically 20–50 steps for modern samplers). The output is a clean image drawn from the same distribution as the training data.
  4. Conditioning: text prompts, reference images, or identity embeddings are injected at each denoising step via cross-attention. This is what allows "make a video of [target] saying [text]" to work.
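
The reverse (sampling) loop can be sketched as follows — a simplified DDPM-style sampler where the `denoiser` is a stand-in for a trained U-Net or diffusion transformer and the linear noise schedule is a toy assumption:

```python
import torch

def sample(denoiser, shape, num_steps=50):
    """DDPM-style ancestral sampling: start from pure noise and repeatedly
    remove the noise the trained denoiser predicts at each step."""
    # Simplified linear beta schedule (assumption; real models tune this carefully).
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                        # step 0: pure Gaussian noise
    for t in reversed(range(num_steps)):
        eps_pred = denoiser(x, t)                 # network predicts the noise present in x_t
        # Remove the predicted noise and rescale toward x_{t-1}.
        x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps_pred) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn(shape)   # re-inject a little noise
    return x

# Placeholder "denoiser": a real system uses a trained U-Net or transformer, with
# text / reference-image conditioning injected via cross-attention at every step.
fake_denoiser = lambda x, t: torch.zeros_like(x)
image = sample(fake_denoiser, shape=(1, 3, 64, 64))
```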

Diffusion has three properties that explain its dominance:

  • Quality. Diffusion models produce sharper, more diverse, and more prompt-faithful outputs than GANs at the same parameter count.
  • Stable training. There is no adversarial loop, just a single denoising objective. Training is reliable.
  • Compositionality. Text, image, and identity conditioning compose cleanly. You can specify the source person, the action, the lighting, and the camera angle independently.

The trade-off is inference cost. A single image takes 20–50 forward passes through the denoiser; a 5-second video can take minutes on a consumer GPU. Recent research (consistency models, rectified flow, distillation) has compressed this to 1–4 steps in some cases, which is what makes real-time deepfake video on a single RTX 4090 possible in 2026.

For deepfake detection, the shift from GANs to diffusion has been disruptive. Diffusion artifacts live in different parts of the frequency spectrum than GAN artifacts, with different statistical signatures. Detectors trained exclusively on GAN-generated content fail on diffusion content. This is the central technical reason that detection accuracy degraded sharply in 2023–2024 before retraining on diffusion datasets brought it back. Defenders now need ensembles that cover all three architectures plus voice — see our analysis of explainable AI in deepfake detection for how modern detectors handle this.

How Voice Cloning Works

Voice cloning is technically a separate track from image and video deepfakes, though the principles overlap. Modern voice synthesis has converged on three approaches:

  • Autoregressive token models. Audio is tokenized (using a neural codec like EnCodec or SoundStream), and a transformer predicts the next token conditioned on a reference voice and a target text. Tortoise-TTS, OpenAI's voice models, and ElevenLabs' core engine work this way.
  • Flow matching / diffusion-based vocoders. Continuous-time generative models map noise to audio waveforms. These are typically faster and produce more natural prosody.
  • Few-shot speaker adaptation. A pre-trained "voice foundation model" is conditioned at inference time on a 3–10 second reference clip. No per-speaker training is required. This is the technique that pushed the barrier from "needs a 30-minute interview" down to "needs a TikTok clip."

The pipeline looks like this:

  1. Acquire reference audio (3 seconds is the modern floor for usable clones; 30+ seconds gives near-indistinguishable quality).
  2. Extract a speaker embedding — a fixed-length vector that encodes the unique characteristics of the voice (pitch contour, formant structure, breathing patterns).
  3. Generate target speech by conditioning a generative model on the speaker embedding plus the desired text.
  4. Optionally, post-process for room acoustics (reverb matching) to make the clone sound consistent with the channel — a phone call versus a Zoom recording.
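
Schematically, and with every component stubbed out as a hypothetical placeholder (none of these classes are real library APIs), the pipeline maps onto the list above like this:

```python
import numpy as np

class SpeakerEncoder:
    """Hypothetical stand-in for a speaker-embedding model (step 2)."""
    def embed(self, wav: np.ndarray, sr: int) -> np.ndarray:
        # A real encoder maps 3-10 s of audio to a fixed-length vector capturing
        # pitch range, formant structure, and timbre. Stubbed here.
        return np.zeros(256, dtype=np.float32)

class TokenTTS:
    """Hypothetical stand-in for an autoregressive token TTS + neural vocoder (step 3)."""
    def generate(self, text: str, speaker: np.ndarray, sr: int = 16000) -> np.ndarray:
        # A real model predicts audio-codec tokens conditioned on the speaker
        # embedding, then decodes them to a waveform. Stubbed: 2 s of silence.
        return np.zeros(sr * 2, dtype=np.float32)

def clone_and_speak(reference_wav: np.ndarray, text: str, sr: int = 16000) -> np.ndarray:
    speaker_embedding = SpeakerEncoder().embed(reference_wav, sr)   # step 2
    waveform = TokenTTS().generate(text, speaker_embedding, sr)     # step 3
    # Step 4 (reverb / channel matching) omitted in this sketch.
    return waveform

reference = np.random.randn(16000 * 5).astype(np.float32)           # ~5 s reference clip
audio = clone_and_speak(reference, "Please process the wire transfer today.")
```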

Group-IB's research found voice deepfakes increased 680% year-over-year in 2024, and the Arup $25.6M fraud, the Ferrari attempted fraud, and the WPP CEO impersonation case all relied on voice cloning at the front of the attack. For an audio-specific deep dive, see our voice cloning threat analysis.

The Modern Deepfake Pipeline, Step by Step

Putting the architectures together, here is what an end-to-end deepfake production looks like in 2026, whether built by an entertainment studio or a fraud actor.

  1. Target reconnaissance. Scrape source content from LinkedIn, YouTube, podcast appearances, earnings calls, Instagram. For executives, 5–10 minutes of clean public footage is typically more than enough.
  2. Reference encoding. Extract a face embedding (for face-swap), a voice embedding (for voice clone), and optionally a motion or style embedding.
  3. Generation. Pass the reference embedding plus the target script (text, or a driving video for reenactment) through a foundation model — diffusion-based for video, autoregressive transformer for audio.
  4. Synchronization. For multimodal output, lip movements are aligned to generated audio using a model like Wav2Lip or its diffusion-era successors.
  5. Post-processing. Color match, frame interpolation to target frame rate, encoding to the delivery format.
  6. Delivery channel. The output is dropped into the chosen attack vector — uploaded to social media, played on a phone call, streamed live into a Zoom or Teams meeting using virtual camera and virtual microphone software.
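
To make the hand-offs between these stages concrete, here is an orchestration skeleton in Python. Every function body is a no-op placeholder (hypothetical, not any real tool's API); only the data flow between steps 2–6 is illustrative:

```python
# Orchestration skeleton for steps 2-6 of the pipeline above.
def embed_identity(photos, audio_clips):           # step 2: reference encoding
    return {"face": "face-embedding", "voice": "voice-embedding"}

def generate_video(identity, script):              # step 3: diffusion video + cloned audio
    return ["frame"] * 900, "waveform"             # ~30 s at 30 fps (placeholder values)

def lip_sync(frames, audio):                       # step 4: align mouth motion to audio
    return frames

def post_process(frames, target_fps=30):           # step 5: color match, interpolate, encode
    return frames

def produce(photos, audio_clips, script):
    identity = embed_identity(photos, audio_clips)
    frames, audio = generate_video(identity, script)
    frames = lip_sync(frames, audio)
    frames = post_process(frames)
    return frames, audio                           # step 6: hand off to the delivery channel

frames, audio = produce(photos=["exec.jpg"], audio_clips=["earnings_call.wav"],
                        script="We are moving the account to a new bank.")
```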

The total wall-clock time for a competent operator producing a 30-second targeted deepfake video with cloned voice is, in 2026, on the order of 10–30 minutes on a single consumer GPU. Voice-only clones take seconds.

Hardware, Data, and Time Requirements

A common question from security teams is: "How big does an attacker have to be to do this?" The answer in 2026 is uncomfortable.

| Output Type | Minimum Hardware (2026) | Reference Data Needed | Wall-Clock Time |
|---|---|---|---|
| Voice clone (30 s of speech) | Laptop CPU (no GPU required) | 3–10 seconds of clean audio | Seconds |
| Static face swap (single image) | Mid-range GPU (RTX 3060+) | 1 reference photo | Under 1 minute |
| 30-second face-swap video | RTX 4090 / 5090 or cloud GPU | 1 reference photo + target source video | 10–30 minutes |
| Real-time live deepfake (Zoom) | Single RTX 4090 | 1 reference photo + voice reference | Live (sub-100 ms latency) |
| Text-to-video (Sora-class) | Cloud API or RTX 5090 | Text prompt, optional reference images | Minutes per 5–10 s clip |

Table 2: Hardware, data, and time requirements for common deepfake outputs in 2026. The barrier to high-quality production has effectively collapsed to a single consumer GPU.

A single RTX 4090 (~$1,800 retail in 2026) can produce 4K-resolution face-swap video at near-real-time framerates. An RTX 5090 raises the bar further. Cloud GPU rental on services like RunPod or Vast.ai costs roughly $0.50–$2.00 per GPU-hour for equivalent capability. Voice cloning runs comfortably on a laptop's CPU.

The implication for threat modeling: assume any individually motivated attacker has access to enterprise-grade deepfake capability. The "nation-state-only" framing of 2020 no longer applies. The volume metrics back this up: deepfake files online grew from approximately 500,000 in 2023 to a projected 8 million in 2025, per Europol's IOCTA 2025 report.

Common Tools and Where They Fit

It is useful to know the names attackers and researchers actually use. The following are the most widely deployed open-source and commercial deepfake creation tools as of mid-2026.

| Tool / Platform | Capability | Architecture | Distribution |
|---|---|---|---|
| DeepFaceLab / FaceSwap | Offline face-swap video production | Autoencoder + GAN refinement | Open source |
| FaceFusion | One-shot face swap with image input | GAN-based (InsightFace lineage) | Open source |
| DeepFaceLive | Live virtual-camera face swap into Zoom/Teams | GAN, optimized for streaming | Open source |
| Stable Diffusion / ComfyUI | Image generation, identity injection via LoRA/IP-Adapter | Diffusion model | Open source |
| LTX-2 / Sora 2 / Runway Gen-3 | Text-to-video and image-to-video synthesis | Diffusion / flow-matching transformer | Open source / commercial API |
| ElevenLabs / Tortoise-TTS | Voice cloning from short reference audio | Autoregressive transformer + neural vocoder | Commercial / open source |
| Wav2Lip and successors | Lip-sync generation aligned to target audio | GAN / diffusion-based | Open source |

Table 3: Representative deepfake creation tools widely used in 2026. The capability surface — not the specific tool list — is what matters; specific tools rotate.

The list rotates rapidly. The takeaway is not the names — those will change — but the capability surface. End-to-end production now requires assembling three to five tools at most, all freely available, all runnable on a single consumer GPU.

Why This Matters for Detection

Each generative architecture leaves different artifacts. The detection landscape mirrors this:

  • Autoencoder fingerprints — visible in face-boundary blending and frequency-domain residuals at low and mid frequencies.
  • GAN fingerprints — periodic high-frequency artifacts caused by transposed convolutional upsampling, often called "checkerboard" or "PRNU"-style signatures.
  • Diffusion fingerprints — different again, distributed across the frequency spectrum, often best detected via spectral and texture-statistics analysis.
  • Voice deepfake artifacts — temporal inconsistencies in formant trajectories, unnaturally smooth pitch contours, and missing microsignals like breathing and lip-smack noises.
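
As an illustration of what "frequency-domain residuals" means in practice, here is a minimal numpy sketch that measures high-frequency spectral energy in an image. The cutoff value and the synthetic checkerboard used as a stand-in for upsampling artifacts are assumptions for teaching purposes; this is nowhere near a production detector:

```python
import numpy as np

def high_frequency_energy_ratio(gray_image: np.ndarray, cutoff: float = 0.25) -> float:
    """Fraction of spectral energy above a radial frequency cutoff.
    GAN upsampling artifacts ("checkerboard" patterns) tend to concentrate
    periodic energy in the high-frequency band; real photos and diffusion
    outputs distribute energy differently."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(gray_image))) ** 2
    h, w = spectrum.shape
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.sqrt(((yy - h / 2) / (h / 2)) ** 2 + ((xx - w / 2) / (w / 2)) ** 2)
    return float(spectrum[radius > cutoff].sum() / spectrum.sum())

# Toy comparison: a smooth gradient vs. the same image with a pixel-level
# checkerboard added (a crude stand-in for transposed-convolution artifacts).
base = np.tile(np.linspace(0, 1, 256), (256, 1))
checker = base + 0.05 * (np.indices((256, 256)).sum(axis=0) % 2)
print(high_frequency_energy_ratio(base), high_frequency_energy_ratio(checker))
```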

A detector trained exclusively on one architecture's artifacts generalizes poorly to others. This is the central technical reason for the industry's lab-to-production accuracy gap, and it is what enterprise-grade detection ensembles are built to address. DuckDuckGoose's DeepDetector uses a multi-model ensemble specifically designed to cover GAN, autoencoder, and diffusion artifacts simultaneously, with per-prediction explainability so a security analyst can see which model fired and which spatial regions drove the decision.

For practitioners building or buying detection, the rule of thumb is: a detector that does not test against diffusion-generated content from 2024 or later is not a 2026-grade detector.

FAQ

What is the easiest way to make a deepfake? The lowest-effort path in 2026 is a hosted face-swap or voice-clone service that requires only a reference image or a short audio clip. End-to-end open-source tools (FaceFusion, DeepFaceLab) require slightly more technical skill but are still accessible to a moderately technical hobbyist. Real-time live deepfake software like DeepFaceLive runs as a virtual camera into Zoom or Teams.

How long does it take to train a deepfake model? With pre-trained foundation models conditioned at inference time, "training" time is effectively zero — generation takes seconds to minutes. Training a high-quality custom model from scratch still takes hours to days on a single GPU, but few attackers need to do this in 2026 because the foundation models are good enough out of the box.

Are deepfakes made with the same technology as ChatGPT? Partially. Voice deepfakes share the autoregressive transformer architecture with large language models. Image and video deepfakes use diffusion models, which are a different (though related) generative architecture. ChatGPT itself is not a deepfake tool; the underlying transformer architecture is a general-purpose generative technology.

What's the difference between a deepfake and AI-generated content (AIGC)? "Deepfake" specifically refers to synthetic media that mimics a real person or scene — face, voice, identity, or recognizable event. "AIGC" is the broader category of any AI-generated content, including landscapes, fictional characters, abstract art, and synthetic data. All deepfakes are AIGC; not all AIGC is a deepfake.

Why do diffusion models produce better deepfakes than GANs? Three reasons: stable training (no adversarial mode collapse), better diversity in outputs (no mode coverage failure), and cleaner conditioning on text and reference images. Diffusion also tends to produce fewer high-frequency artifacts because the iterative denoising process averages over many trajectories.

Do deepfakes leave artifacts that detectors can find? Yes — but the artifacts are architecture-specific, often subtle, and increasingly invisible to humans. iProov's 2025 study found only 0.1% of people could reliably distinguish real from synthetic content, but well-trained ensemble detectors achieve significantly higher accuracy. The challenge is generalizing across new generators, which is why production detection systems retrain continuously. See our detection accuracy analysis for benchmark data.

Can a deepfake be detected just by looking at metadata? Sometimes — original recordings often have camera-specific EXIF data, codec signatures, and PRNU sensor noise that synthesized content lacks. But metadata is trivially stripped or spoofed, so metadata analysis alone is not reliable. Pixel-level and frequency-domain analysis is required for high-confidence detection.
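
For example, a first-pass metadata check with Pillow might look like the sketch below. The file path is hypothetical, field names vary by device, and — as noted above — a missing or odd EXIF block is only a weak hint, never proof:

```python
from PIL import Image
from PIL.ExifTags import TAGS

def exif_summary(path: str) -> dict:
    """Return human-readable EXIF tags, if any. Absence of camera metadata
    suggests synthesis or re-encoding but is trivially spoofed or stripped."""
    exif = Image.open(path).getexif()
    return {TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}

tags = exif_summary("suspect_image.jpg")   # hypothetical file path
print(tags.get("Make"), tags.get("Model"), tags.get("DateTime"))
```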

Is it possible to make a deepfake without any training data of the target? Yes for some categories — text-conditioned diffusion video models can generate synthetic people who do not exist (one-shot prompts like "an elderly man in a dark suit speaking at a podium"). For deepfakes that impersonate a specific real person, at least one reference image or a few seconds of audio is generally required.

Generation has become accessible. Detection has had to evolve to keep up. DuckDuckGoose's DeepDetector covers all three major architectures — GAN, autoencoder, and diffusion — with explainable, per-prediction reasoning. Talk to our team about adding deepfake detection to your IDV pipeline or video conferencing platform.

Last update: Q2 2026.
