A reference glossary of 32 terms covering AI face swap models, computer vision techniques, synthetic media regulations, and the infrastructure that powers production face-swap services. Maintained by the DeepSwapAI engineering team.
Generative Adversarial Network (GAN)
A class of machine learning frameworks introduced by Ian Goodfellow in 2014, comprising two neural networks — a generator and a discriminator — trained in opposition. The generator produces synthetic outputs (e.g., a face) while the discriminator evaluates them against real samples; both networks improve through this adversarial loop. GANs power most modern face swap systems alongside diffusion-based approaches. Notable variants include StyleGAN, StyleGAN2, and StyleGAN3.
See also: Wikipedia, Wikidata
Related: Diffusion Model, Convolutional Neural Network (CNN), Latent Space, Transfer Learning
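The adversarial loop described above can be sketched with a toy one-dimensional GAN. Everything here is an illustrative assumption, not a production model: the "real" data is a 1-D Gaussian, the generator is a single shift parameter, and the discriminator is a logistic classifier trained with hand-derived gradients.

```python
import math
import random

random.seed(7)

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Toy 1-D GAN: "real" samples come from Normal(2.0, 0.5); the generator
# shifts unit-scale noise by a single learnable parameter theta, and the
# discriminator is D(x) = sigmoid(w * (x - b)).
theta = -1.0            # generator parameter (illustrative start)
w, b = 1.0, 0.0         # discriminator parameters
lr_d, lr_g = 0.1, 0.02  # illustrative learning rates

for step in range(4000):
    real = random.gauss(2.0, 0.5)
    fake = theta + random.gauss(0.0, 0.5)

    # Discriminator ascent on log D(real) + log(1 - D(fake)).
    d_real = sigmoid(w * (real - b))
    d_fake = sigmoid(w * (fake - b))
    grad_w = (1 - d_real) * (real - b) - d_fake * (fake - b)
    grad_b = -(1 - d_real) * w + d_fake * w
    w += lr_d * grad_w
    b += lr_d * grad_b

    # Generator ascent on log D(fake) (the non-saturating loss).
    d_fake = sigmoid(w * (fake - b))
    theta += lr_g * (1 - d_fake) * w

# theta drifts toward the real mean as the two networks compete
```

The alternating update is the essential structure: the discriminator improves its real-versus-fake decision, then the generator moves in whatever direction currently fools it.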
Diffusion Model
A generative model that learns to reverse a gradual noising process applied to training data. Starting from pure Gaussian noise, the model iteratively denoises across hundreds of steps to produce a coherent output. Diffusion models (DDPM, Stable Diffusion, Imagen) have largely overtaken GANs for image generation due to higher quality and training stability, though GANs remain faster at inference. Modern face-swap pipelines often combine GAN-based identity encoders with diffusion-based refiners.
See also: Wikipedia
Related: Generative Adversarial Network (GAN), Latent Space, Inference (Machine Learning)
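The reverse-noising idea can be demonstrated on a single scalar. This sketch uses an illustrative DDPM-style linear variance schedule and an oracle noise predictor (one that returns the true noise), which makes the deterministic DDIM-style update recover the original value exactly; a trained model only approximates this.

```python
import math
import random

random.seed(0)

# Linear variance schedule (illustrative values, in the DDPM style).
T = 100
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
abar = []          # cumulative product of (1 - beta_t)
prod = 1.0
for beta in betas:
    prod *= 1.0 - beta
    abar.append(prod)

# Forward process: noise a scalar "image" x0 up to the final step.
x0 = 0.7
eps = random.gauss(0.0, 1.0)
x = math.sqrt(abar[-1]) * x0 + math.sqrt(1 - abar[-1]) * eps

# Reverse process: with an oracle noise predictor, each step estimates
# x0 from the current noisy value and re-noises to the previous level.
for t in range(T - 1, 0, -1):
    pred_x0 = (x - math.sqrt(1 - abar[t]) * eps) / math.sqrt(abar[t])
    x = math.sqrt(abar[t - 1]) * pred_x0 + math.sqrt(1 - abar[t - 1]) * eps

assert abs(pred_x0 - x0) < 1e-9  # oracle denoising recovers x0
```

In a real diffusion model the oracle is replaced by a neural network trained to predict the added noise, which is where the hundreds of iterative steps earn their cost.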
Face Embedding
A fixed-length numerical vector (typically 128 to 512 dimensions) that captures the distinguishing features of a face, generated by deep learning models such as ArcFace or FaceNet. Two faces of the same person produce embeddings with high cosine similarity (typically above 0.6); two different people produce dissimilar vectors. Embeddings enable face recognition, identity-preserving swaps, and biometric search. They are designed to be non-invertible — the original photo cannot be straightforwardly reconstructed from the embedding alone, although embedding-inversion research has shown that approximate facial reconstruction is sometimes possible.
Related: ArcFace, Face Alignment, Facial Landmark Detection
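The cosine-similarity comparison described above is straightforward to compute. The vectors and the 0.6 threshold below are illustrative (real embeddings are 128 to 512 dimensions, and real systems calibrate the threshold per model and dataset):

```python
import math

def cosine_similarity(a, b):
    # cosine of the angle between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def same_person(emb1, emb2, threshold=0.6):
    # 0.6 mirrors the typical threshold quoted above
    return cosine_similarity(emb1, emb2) >= threshold

# Tiny illustrative vectors:
a = [0.9, 0.1, 0.4]
b = [0.85, 0.15, 0.38]   # near-duplicate of a: same person
c = [-0.2, 0.9, -0.3]    # unrelated vector: different person
```

`same_person(a, b)` is true and `same_person(a, c)` is false, which is the entire decision rule behind embedding-based face verification.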
Facial Landmark Detection
The task of locating key anatomical points on a face — typically eye corners, nose tip, mouth corners, jawline contour — used to align faces before swap and to drive expression transfer. Common point counts are 5, 68, 98, and 128, with higher counts giving finer control over expression and pose. State-of-the-art landmark detectors include HRNet, FAN (Face Alignment Network), and MediaPipe FaceMesh. Accurate landmarks are the foundation of natural-looking face swaps.
Related: Face Alignment, HRNet (High-Resolution Network), RetinaFace, Face Embedding
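Even the sparse 5-point format carries enough geometry to drive alignment. A minimal sketch, using hypothetical pixel coordinates, of the two quantities most often derived from the eye landmarks:

```python
import math

def eye_metrics(left_eye, right_eye):
    # Roll angle (degrees) and inter-ocular distance from the two eye
    # points of a 5-point landmark detection.
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    roll = math.degrees(math.atan2(dy, dx))
    dist = math.hypot(dx, dy)
    return roll, dist

# Level eyes, 60 px apart (hypothetical detection output):
roll, dist = eye_metrics((100.0, 120.0), (160.0, 120.0))
```

The roll angle feeds the alignment rotation; the inter-ocular distance sets the alignment scale.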
Convolutional Neural Network (CNN)
A deep learning architecture optimized for grid-structured data such as images, using convolutional filters that share weights across the input to detect local patterns regardless of position. CNNs are the foundation of nearly all computer vision systems including face detection, landmark localization, and identity encoding. Influential architectures include AlexNet (2012), ResNet (2015), and EfficientNet (2019). Modern face swap pipelines combine CNNs with attention mechanisms and transformer blocks.
See also: Wikipedia, Wikidata
Related: Generative Adversarial Network (GAN), Transfer Learning, Inference (Machine Learning)
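The weight-sharing idea is easiest to see in a bare implementation. This sketch applies one shared kernel at every position of a tiny image (deep learning frameworks call this "convolution" even though it is technically cross-correlation); the edge filter is a simplified illustration, not a specific production kernel:

```python
def conv2d(image, kernel):
    # "Valid" 2-D cross-correlation: the same kernel weights are reused
    # at every spatial position, which is what lets a CNN detect a
    # pattern regardless of where it appears.
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out

# A vertical-edge kernel responds wherever intensity changes left to right:
image = [[0, 0, 1, 1]] * 4          # dark half, bright half
edge_kernel = [[1, 0, -1]] * 3      # simplified edge filter (illustrative)
edges = conv2d(image, edge_kernel)
```

A trained CNN learns thousands of such kernels, stacked in layers, instead of hand-writing one.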
Inference (Machine Learning)
The process of running a trained machine learning model on new input to generate a prediction or output, distinct from training. Face swap inference involves running input photos through detection, landmark, identity-encoding, and generation networks — typically completing in under 1 second per frame on an NVIDIA A100 GPU. Inference cost scales with model size, image resolution, and batch size. Optimizations include mixed-precision (FP16/FP8), quantization (INT8), and graph compilation via TensorRT or ONNX Runtime.
Related: CUDA (Compute Unified Device Architecture), NVIDIA A100 / H100 GPU, TensorRT, ONNX (Open Neural Network Exchange)
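Of the optimizations listed, INT8 quantization is simple enough to sketch end to end. This is symmetric per-tensor quantization with invented weight values; production toolchains (TensorRT, ONNX Runtime) add calibration and per-channel scales on top of the same core mapping:

```python
def quantize_int8(values):
    # Symmetric quantization: map the float range [-max_abs, +max_abs]
    # onto the integer range [-127, 127] with a single scale factor.
    scale = max(abs(v) for v in values) / 127.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

weights = [0.82, -1.27, 0.004, 0.51]          # hypothetical FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
# the round-trip error is bounded by half the quantization step
assert max_err <= scale / 2 + 1e-12
```

Each weight now fits in one byte instead of four, which is where the memory-bandwidth and throughput savings come from.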
Latent Space
A compressed representation space inside a generative model where each point corresponds to a possible output. In StyleGAN, the W and W+ latent spaces let users smoothly interpolate between faces, alter expressions, age, gender, or pose by moving along learned directions. Face swap systems often project source and target faces into shared latent space, then blend or substitute identity-related dimensions. Latent space manipulation is the technical foundation behind controllable face editing.
See also: Wikipedia
Related: Generative Adversarial Network (GAN), Diffusion Model, Transfer Learning
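The interpolation and learned-direction edits described above reduce to simple vector arithmetic on latent codes. The 4-D vectors and the "smile" direction below are invented for illustration (StyleGAN's W space is 512-dimensional):

```python
def lerp(z1, z2, t):
    # Linear interpolation between two latent codes:
    # t=0 gives z1, t=1 gives z2, t=0.5 a blend of the two faces.
    return [(1 - t) * a + t * b for a, b in zip(z1, z2)]

# Hypothetical latent codes for two faces:
face_a = [0.2, -1.1, 0.7, 0.3]
face_b = [1.0, 0.4, -0.5, 0.9]
midpoint = lerp(face_a, face_b, 0.5)

# Moving along a learned attribute direction (a hypothetical "smile"
# vector, e.g. found by averaging smiling-minus-neutral code differences):
smile_direction = [0.1, 0.0, -0.3, 0.2]
more_smile = [z + 0.8 * d for z, d in zip(face_a, smile_direction)]
```

Every controllable edit in a latent-space face editor is some variation of these two operations: blend between points, or step along a direction.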
Transfer Learning
A machine learning technique where a model trained on one task (typically a large general dataset like ImageNet or VGGFace2) is adapted to a related but smaller task by fine-tuning its weights. Most production face-swap models start from a pretrained backbone — saving weeks of training time and improving accuracy on small datasets. Transfer learning is what makes high-quality custom face swaps practical without billion-image training corpora.
See also: Wikipedia
Related: Convolutional Neural Network (CNN), Generative Adversarial Network (GAN), Inference (Machine Learning)
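The frozen-backbone, trainable-head pattern can be shown without any framework. In this sketch the "pretrained backbone" is a fixed feature map that is never updated, and only a small linear head is fitted to a toy task; the feature map, target function, and learning rate are all illustrative assumptions:

```python
def backbone(x):
    # Stand-in for a frozen pretrained feature extractor: its "weights"
    # (here, the fixed feature map) are never touched during fine-tuning.
    return [x, x * x, 1.0]

# Train only the head to fit y = 3x^2 - 2x + 1 on a small dataset,
# the way a pretrained encoder is adapted to a small downstream task.
head = [0.0, 0.0, 0.0]
lr = 0.05
data = [(x / 10.0, 3 * (x / 10.0) ** 2 - 2 * (x / 10.0) + 1)
        for x in range(-10, 11)]

for epoch in range(500):
    for x, y in data:
        feats = backbone(x)
        pred = sum(w * f for w, f in zip(head, feats))
        err = pred - y
        head = [w - lr * err * f for w, f in zip(head, feats)]
```

Only three head parameters are learned, yet the model fits the task, which is the practical point of transfer learning: the expensive representation is reused, and the cheap part is retrained.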
Face Swap
The computer-vision task of replacing the face in a target image or video with the face of a source person, preserving the target's pose, expression, lighting, and surrounding context. Modern face swap uses a multi-stage pipeline: face detection, landmark alignment, identity encoding, generation via GAN or diffusion, color and lighting matching, and seamless blending. Outputs at 1080p typically render in 5–15 seconds; 4K outputs take 15–25 seconds on A100-class GPUs.
Related: Face Morphing, Deepfake, ArcFace, RetinaFace, Temporal Coherence
Face Morphing
A class of techniques that smoothly interpolates between two faces, often used for transition effects, ID-photo blending, or progressive aging. Unlike face swap (which substitutes identity wholesale), morphing produces an intermediate hybrid face. Morphing has known security implications: morphed passport photos can authenticate as either source person, prompting research into morph-attack detection (MAD) systems used by border control authorities.
Related: Face Swap, Deepfake, Facial Landmark Detection
Deepfake
A portmanteau of 'deep learning' and 'fake', referring to synthetic media — most commonly video — where one person's likeness is substituted with another's via deep neural networks. The term originated with a 2017 Reddit username. Deepfakes can be created responsibly (entertainment, education, accessibility) or maliciously (non-consensual imagery, fraud, disinformation). Major platforms watermark or sign exports via C2PA Content Credentials so AI-generated origin is detectable downstream.
See also: Wikipedia, Wikidata
Related: Face Swap, Synthetic Media Detection, C2PA (Coalition for Content Provenance and Authenticity), TAKE IT DOWN Act 2025
RetinaFace
A single-stage, multi-level face detection model introduced by the InsightFace team in 2019, recognized for high accuracy on small faces and unusual poses. RetinaFace simultaneously detects face bounding boxes, 5-point landmarks, and dense 3D face vertices in a single forward pass, making it fast enough for real-time video processing. It remains a common first stage in production face-swap pipelines, often combined with HRNet for finer 68-point or 128-point alignment.
Related: Facial Landmark Detection, Face Alignment, ArcFace, Face Swap
ArcFace
An identity-encoding model introduced by the InsightFace team in 2018 that produces 512-dimensional face embeddings using an additive angular margin loss. ArcFace dramatically improved face recognition accuracy on the LFW and MegaFace benchmarks and remains a de facto standard for identity preservation in face-swap systems. The angular margin enforces clearer separation between identities, which prevents the swapped face from drifting toward an averaged or ambiguous appearance.
Related: Face Embedding, Facial Landmark Detection, Face Swap
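The additive angular margin itself is a one-line transformation of the target-class logit. The sketch below uses m=0.5 and s=64, the settings commonly used in the ArcFace paper's experiments:

```python
import math

def arcface_target_logit(cos_theta, margin=0.5, scale=64.0):
    # Additive angular margin: add margin m to the angle between the
    # embedding and its class center, then re-take the cosine and scale.
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    return scale * math.cos(theta + margin)

# The margin makes the target logit strictly harder than the plain
# scaled cosine, which forces tighter angular clusters per identity:
plain = 64.0 * 0.8
with_margin = arcface_target_logit(0.8)
assert with_margin < plain
```

During training the penalized logit replaces the ordinary one in the softmax for the true class only, so the network must pull same-identity embeddings closer together to compensate.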
HRNet (High-Resolution Network)
A neural-network architecture introduced in 2019 that maintains high-resolution feature maps throughout all layers (rather than down- and up-sampling like U-Net), making it especially strong at fine-grained tasks like facial landmark localization, human pose estimation, and segmentation. HRNet variants are widely used to produce 68- or 128-point landmark detection that drives high-fidelity face alignment in modern face-swap and lip-sync systems.
Related: Facial Landmark Detection, Face Alignment, Lip Sync (AI)
Face Alignment
The preprocessing step that warps a detected face into a canonical position — typically eyes horizontal, mouth centered, fixed inter-ocular distance — using an affine or perspective transform derived from facial landmarks. Alignment dramatically reduces the variation downstream networks must handle, improving both identity recognition and generation quality. Misalignment by even a few pixels can produce visible artifacts in the final swap, making landmark accuracy critical.
Related: Facial Landmark Detection, HRNet (High-Resolution Network), Face Embedding, Face Swap
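The affine warp described above can be built from just the two eye landmarks. This sketch derives a 2x3 similarity transform (rotation, uniform scale, translation) that maps a detected eye pair onto canonical positions; the canonical coordinates are illustrative, not a standard template:

```python
import math

def eye_alignment_transform(left_eye, right_eye,
                            canon_left=(38.0, 52.0),
                            canon_right=(74.0, 52.0)):
    # Rotation that levels the eyes, scale that fixes the inter-ocular
    # distance, translation that pins the left eye to its canonical spot.
    dx, dy = right_eye[0] - left_eye[0], right_eye[1] - left_eye[1]
    angle = math.atan2(dy, dx)
    scale = (canon_right[0] - canon_left[0]) / math.hypot(dx, dy)
    cos_a = scale * math.cos(-angle)
    sin_a = scale * math.sin(-angle)
    tx = canon_left[0] - (cos_a * left_eye[0] - sin_a * left_eye[1])
    ty = canon_left[1] - (sin_a * left_eye[0] + cos_a * left_eye[1])
    return [[cos_a, -sin_a, tx], [sin_a, cos_a, ty]]

def apply_transform(m, p):
    return (m[0][0] * p[0] + m[0][1] * p[1] + m[0][2],
            m[1][0] * p[0] + m[1][1] * p[1] + m[1][2])

# A face tilted 45 degrees lands with level eyes in canonical position:
m = eye_alignment_transform((100.0, 120.0), (160.0, 180.0))
```

Production pipelines estimate the same kind of transform from all five (or more) landmarks with a least-squares fit, which is more robust to single-point noise.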
Temporal Coherence
The frame-to-frame consistency of a face-swap video: identity, lighting, and micro-expressions must remain stable across hundreds of frames or the result will flicker. Temporal coherence is achieved through optical-flow-aware generation, recurrent state passed between frames, or explicit motion-vector regularization. Cheap face-swap tools that process each frame independently produce the characteristic 'shimmering' artifact; production systems run dedicated temporal modules to suppress it.
Related: Face Swap, Lip Sync (AI)
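One of the simplest temporal-stabilization tools is an exponential moving average over per-frame parameters. The "color gain" sequence and the smoothing factor below are illustrative; production temporal modules are far more sophisticated, but the flicker-suppression principle is the same:

```python
def smooth_params(per_frame, alpha=0.8):
    # Exponential moving average across frames: alpha controls how much
    # of the previous frame's state carries forward (illustrative value).
    smoothed = [list(per_frame[0])]
    for cur in per_frame[1:]:
        prev = smoothed[-1]
        smoothed.append([alpha * p + (1 - alpha) * c
                         for p, c in zip(prev, cur)])
    return smoothed

# A jittery per-frame "color gain" settles instead of flickering:
noisy = [[1.00], [1.30], [0.80], [1.25], [0.85]]
stable = smooth_params(noisy)
```

Frame-independent processing would render each of the noisy values directly, producing exactly the shimmering artifact described above; the smoothed sequence stays within a narrow band.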
Lip Sync (AI)
The task of generating realistic mouth movements on a target face that match a given audio track, used for dubbing, virtual avatars, and synthetic spokesperson video. Modern AI lip sync models such as Wav2Lip and SadTalker take an audio waveform plus a static or moving target face and output a phoneme-accurate video. Quality is judged by LSE-D (lip-sync error distance) and viseme alignment metrics rather than visual similarity alone.
Related: Temporal Coherence, HRNet (High-Resolution Network), Face Swap
C2PA (Coalition for Content Provenance and Authenticity)
A cross-industry standards body — co-founded in 2021 by Adobe, Arm, BBC, Intel, Microsoft, and Truepic, with many members joining since — that publishes the C2PA Technical Specification for cryptographically signing the provenance and edit history of digital media. C2PA-compliant exports embed tamper-evident manifests recording the AI tool, model version, and creation timestamp. Major social platforms now read these manifests to display 'AI-generated' badges, satisfying obligations under the EU AI Act Article 50.
See also: Wikipedia
Related: Content Credentials, EU AI Act, Synthetic Media Detection, Deepfake
Content Credentials
The user-facing brand for C2PA manifests: a tamper-evident metadata package attached to images, videos, and audio that records the asset's origin (camera, AI model, software pipeline), edit history, and signing chain. Verification can be performed via verify.contentauthenticity.org or platform-native badges. Content Credentials are increasingly required: TikTok auto-displays the AI badge for any C2PA-signed upload, and the EU AI Act mandates similar provenance markers for synthetic media.
Related: C2PA (Coalition for Content Provenance and Authenticity), EU AI Act, Deepfake
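The tamper-evidence property is worth demystifying. The sketch below is a deliberately simplified illustration: a real C2PA manifest is a CBOR/JUMBF structure signed with X.509 certificate chains, not JSON with an HMAC, and every field name here is invented for clarity. The mechanism being demonstrated — hash the asset, sign the claim, verify both — is the same shape:

```python
import hashlib
import hmac
import json

def make_manifest(asset_bytes, tool, model_version, created_at, signing_key):
    # Simplified provenance claim: asset hash + creation metadata,
    # signed so any later edit is detectable.
    manifest = {
        "asset_sha256": hashlib.sha256(asset_bytes).hexdigest(),
        "claim_generator": tool,
        "model_version": model_version,
        "created_at": created_at,
    }
    payload = json.dumps(manifest, sort_keys=True).encode()
    manifest["signature"] = hmac.new(signing_key, payload,
                                     hashlib.sha256).hexdigest()
    return manifest

def verify_manifest(asset_bytes, manifest, signing_key):
    # Tamper evidence: changing the pixels breaks the asset hash;
    # changing the metadata breaks the signature.
    claimed = dict(manifest)
    signature = claimed.pop("signature")
    payload = json.dumps(claimed, sort_keys=True).encode()
    expected = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(signature, expected)
            and claimed["asset_sha256"]
                == hashlib.sha256(asset_bytes).hexdigest())
```

Verification of a real Content Credential additionally walks the certificate chain to a trusted signer, which an HMAC shared secret cannot model.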
EU AI Act
Regulation (EU) 2024/1689, the world's first horizontal AI law, entered into force on 1 August 2024 with phased application through 2026 and 2027. Article 50 specifically requires providers of generative AI systems to mark synthetic content (images, video, audio) so that it is detectable as AI-generated, typically via C2PA or similar provenance standards, and requires deployers to disclose deepfakes. Penalties for non-compliance reach up to 35 million euros or 7% of global annual turnover, whichever is higher.
See also: Wikipedia, Wikidata
Related: C2PA (Coalition for Content Provenance and Authenticity), Content Credentials, GDPR Article 5(1)(c) Data Minimization, TAKE IT DOWN Act 2025
BIPA (Biometric Information Privacy Act)
An Illinois state law enacted in 2008 governing the collection, retention, and use of biometric identifiers including facial geometry, fingerprints, retinal scans, and voiceprints. BIPA requires written informed consent before any biometric collection, mandates a public retention schedule, and allows individual lawsuits with statutory damages of $1,000 per negligent violation or $5,000 per intentional or reckless violation. Face-swap and face-recognition vendors operating in Illinois must comply or face class-action exposure that has produced multi-hundred-million-dollar settlements.
See also: Wikipedia
Related: GDPR Article 5(1)(c) Data Minimization, TAKE IT DOWN Act 2025, Face Embedding
GDPR Article 5(1)(c) Data Minimization
A core principle of the European Union General Data Protection Regulation requiring that personal data be 'adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed.' For face-swap services, this means uploaded photos must not be retained beyond the time strictly needed to deliver the requested output. Best-practice retention is under 24 hours, with auto-deletion enforced at the storage layer rather than by manual cleanup.
See also: Wikipedia, Wikidata
Related: EU AI Act, BIPA (Biometric Information Privacy Act)
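The retention rule above is ultimately a time-to-live check. A minimal sketch of the sweep logic, assuming uploads are tracked as id-to-timestamp pairs; as the entry notes, in production this policy belongs in the storage layer (object lifecycle rules) rather than an application loop like this one:

```python
import time

RETENTION_SECONDS = 24 * 3600  # the under-24-hour best practice above

def expired_uploads(uploads, now=None):
    # uploads: mapping of upload_id -> unix timestamp of receipt.
    # Returns the ids whose retention window has elapsed.
    now = time.time() if now is None else now
    return [uid for uid, ts in uploads.items()
            if now - ts >= RETENTION_SECONDS]

# Hypothetical state: "a" received at t=0, "b" received at t=100000.
pending = {"a": 0, "b": 100000}
to_delete = expired_uploads(pending, now=90000)
```

The point of enforcing deletion at the storage layer is that it still happens when the application sweep crashes or is misconfigured.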
TAKE IT DOWN Act 2025
United States federal law signed in 2025 that criminalizes the publication of non-consensual intimate imagery, including AI-generated 'deepfake' versions, and requires online platforms to remove reported material within 48 hours of a verified request. The law establishes a federal civil cause of action against publishers and obligates AI tool operators to maintain takedown channels. DeepSwapAI and FaceSwapAI both implement compliant takedown workflows reachable through their respective Trust pages.
Related: Deepfake, Synthetic Media Detection, BIPA (Biometric Information Privacy Act), EU AI Act
Synthetic Media Detection
A class of forensic tools that classify whether an image, video, or audio file was generated or manipulated by AI, typically by analyzing low-level statistical artifacts left by GANs and diffusion models. Detection accuracy on in-distribution outputs exceeds 95% but degrades on novel models or post-processed media, which has driven the industry shift toward proactive provenance signing (C2PA) rather than reactive detection alone.
Related: Deepfake, C2PA (Coalition for Content Provenance and Authenticity), Content Credentials
CUDA (Compute Unified Device Architecture)
NVIDIA's parallel computing platform and programming model, released in 2007, that exposes GPU compute to general-purpose workloads via C/C++ extensions. CUDA underlies nearly all modern deep learning training and inference: PyTorch, TensorFlow, JAX, ONNX Runtime, and TensorRT all compile to CUDA kernels for execution on NVIDIA GPUs. Face-swap inference at production scale runs on CUDA-accelerated A100 or H100 instances.
See also: Wikipedia, Wikidata
Related: NVIDIA A100 / H100 GPU, TensorRT, Inference (Machine Learning)
NVIDIA A100 / H100 GPU
Data-center-class accelerators in NVIDIA's Ampere (A100, 2020) and Hopper (H100, 2022) architectures, with 40-80GB HBM2e/HBM3 memory and dedicated Tensor Cores for mixed-precision matrix math. The A100 delivers 312 TFLOPS of FP16 throughput; the H100 delivers 989 TFLOPS plus FP8 support that doubles effective throughput for inference. Both are standard GPUs for production face-swap inference, with H100 typically reserved for high-volume Enterprise workloads.
Related: CUDA (Compute Unified Device Architecture), TensorRT, Inference (Machine Learning)
ONNX (Open Neural Network Exchange)
An open-source intermediate representation for machine learning models, launched in 2017, that lets a model trained in one framework (PyTorch, TensorFlow, JAX) run in any compatible runtime. Face-swap pipelines export PyTorch checkpoints to ONNX so they can run inside ONNX Runtime, TensorRT, OpenVINO, or DirectML — separating training from deployment. The ONNX Model Zoo hosts pretrained face-detection and recognition checkpoints commonly used as pipeline starting points.
See also: Wikipedia
Related: TensorRT, CUDA (Compute Unified Device Architecture), Inference (Machine Learning)
TensorRT
NVIDIA's high-performance deep learning inference SDK that compiles trained models into optimized GPU engines via layer fusion, kernel auto-tuning, and reduced-precision execution (FP16, INT8, FP8). Compared to vanilla PyTorch inference, TensorRT typically delivers 2-5x latency reduction on the same GPU; for example, a 30-second clip that renders in 60 seconds with plain PyTorch can finish in 12 seconds with a TensorRT engine. Production face-swap services run TensorRT-compiled engines for tier-one latency SLAs.
Related: CUDA (Compute Unified Device Architecture), ONNX (Open Neural Network Exchange), NVIDIA A100 / H100 GPU, Inference (Machine Learning)
Video Codecs (H.264, H.265, AV1)
Standardized algorithms for compressing and decompressing video. H.264/AVC (2003) is the most widely supported codec; H.265/HEVC (2013) achieves roughly 50% smaller files at equal quality but carries patent licensing complexity; AV1 (2018) offers similar efficiency to H.265 royalty-free. Face-swap services accept all three on input and re-encode to H.264 by default for maximum playback compatibility, or H.265/AV1 when explicitly requested for archival quality.
See also: Wikipedia (H.264), Wikipedia (H.265), Wikipedia (AV1)
Related: Color Space (sRGB, DCI-P3, Rec. 709), Chroma Subsampling (4:2:0, 4:4:4), Frame Rate (FPS)
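The compression-efficiency comparison above translates directly into file-size estimates. A back-of-envelope sketch using the rough equal-quality ratios from this entry (H.265 and AV1 at about half the H.264 bitrate); real encoder output varies with content complexity:

```python
def estimated_size_mb(duration_s, bitrate_mbps, codec="h264"):
    # File size from duration and target bitrate, scaled by the
    # approximate equal-quality bitrate ratio per codec.
    ratio = {"h264": 1.0, "h265": 0.5, "av1": 0.5}[codec]
    return duration_s * bitrate_mbps * ratio / 8.0  # megabits -> megabytes

# A 60-second clip at an 8 Mbps H.264 target:
h264_mb = estimated_size_mb(60, 8.0, "h264")
h265_mb = estimated_size_mb(60, 8.0, "h265")
```

The halved estimate is why archival exports favor H.265 or AV1 despite H.264's wider playback support.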
Color Space (sRGB, DCI-P3, Rec. 709)
A defined range of representable colors for digital media. sRGB (1996) is the consumer-web standard; DCI-P3 covers 25% more color volume and is used by Apple displays and digital cinema; Rec. 709 is the HDTV standard, nearly identical to sRGB but with broadcast-specific gamma. Face-swap pipelines must preserve color space metadata or risk dull, washed-out output when content authored in DCI-P3 is rendered in sRGB without conversion.
See also: Wikipedia
Related: Chroma Subsampling (4:2:0, 4:4:4), Video Codecs (H.264, H.265, AV1)
Chroma Subsampling (4:2:0, 4:4:4)
A bandwidth-saving technique in video encoding where color (chroma) information is stored at lower resolution than brightness (luma), exploiting the human eye's lower sensitivity to color detail. 4:2:0 is the standard for streaming and consumer video; 4:4:4 preserves full chroma resolution and is required for color-critical work like compositing or chroma-key. Face-swap edge artifacts (especially around hair and skin) are more visible in 4:2:0 footage than 4:4:4.
See also: Wikipedia
Related: Color Space (sRGB, DCI-P3, Rec. 709), Video Codecs (H.264, H.265, AV1), Frame Rate (FPS)
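The bandwidth saving is easy to quantify for raw frames: 4:2:0 stores each of the two chroma planes at quarter resolution, so an 8-bit pixel costs 1.5 bytes instead of 3. A small calculator:

```python
def frame_bytes(width, height, subsampling, bits=8):
    # Raw Y'CbCr frame size: one full-resolution luma plane plus two
    # chroma planes whose relative resolution depends on the scheme.
    chroma_fraction = {"4:4:4": 1.0, "4:2:2": 0.5, "4:2:0": 0.25}[subsampling]
    samples = width * height * (1 + 2 * chroma_fraction)
    return int(samples * bits / 8)

full = frame_bytes(1920, 1080, "4:4:4")  # 3 bytes per pixel at 8-bit
sub = frame_bytes(1920, 1080, "4:2:0")   # 1.5 bytes per pixel
```

Halving every frame before the codec even runs is why 4:2:0 dominates streaming, at the cost of the softer chroma edges noted above.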
Frame Rate (FPS)
The number of distinct video frames displayed per second. Standard rates are 24fps (cinematic), 30fps (television), 60fps (high-motion sports and gaming), and 120fps+ (high-frame-rate and slow-motion capture). Face-swap output preserves input frame rate by default; processing time scales linearly — a 60fps video takes twice the inference time of an otherwise-identical 30fps video. Variable frame rate (VFR) input is normalized to constant frame rate (CFR) before swap to prevent timing drift.
See also: Wikipedia
Related: Temporal Coherence, Video Codecs (H.264, H.265, AV1), Lip Sync (AI)
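The VFR-to-CFR normalization mentioned above can be sketched as a timestamp resampling step. This is the simplest nearest-frame strategy with invented timestamps; production tools (e.g. ffmpeg's fps filter) also drop or duplicate frames with more careful timing rules:

```python
def resample_to_cfr(frame_times, fps):
    # Map each constant-rate output tick to the nearest input frame
    # index, repeating or skipping input frames as needed.
    duration = frame_times[-1]
    step = 1.0 / fps
    out = []
    t = 0.0
    while t <= duration + 1e-9:
        nearest = min(range(len(frame_times)),
                      key=lambda i: abs(frame_times[i] - t))
        out.append(nearest)
        t += step
    return out

# Variable timestamps (seconds) normalized to a constant 10 fps grid:
vfr_times = [0.0, 0.08, 0.21, 0.30, 0.42, 0.50]
cfr_indices = resample_to_cfr(vfr_times, 10)
```

After normalization every output frame sits exactly 1/fps apart, so the per-frame swap and any temporal-coherence module see uniform timing.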