Day 1/5: SkyReels-A3: The Art of Natural Speech for Digital Humans

PR Newswire

SINGAPORE, Aug. 11, 2025 /PRNewswire/ -- The Skywork AI Technology Release Week officially kicked off on August 11. From August 11 to August 15, a new model will be unveiled each day, covering cutting-edge models for multimodal AI scenarios.

On August 11, Skywork officially launched the SkyReels-A3 model. Combining a Diffusion Transformer (DiT) model, frame interpolation for extended video generation, reinforcement learning-based motion refinement, and controllable camera techniques, SkyReels-A3 supports full-modality, audio-driven digital human synthesis with unrestricted duration.

The SkyReels-A3 model is now live! Visit the SkyReels official website to try it out:

Links
SkyReels-A3 homepage:

https://skyworkai.github.io/skyreels-a3.github.io/

SkyReels official website (After logging in, select the "Talking Avatar" tool from the left navigation bar):

https://www.skyreels.ai/home

SkyReels open-source model repository:

https://huggingface.co/Skywork

SkyReels-A3 is an audio-driven portrait video generation model that acts like an "AI vocal cord" for any photo or video:

Bring photos to life: Upload a portrait image and a voice clip – the person in the photo will lip-sync and speak or sing naturally;
Generate custom videos: Upload a portrait, add a voice clip, and provide a text prompt – the character will perform with directed expressions and motions;
Re-dub existing videos: Replace the original audio, and the model will automatically adjust lip movements, facial expressions, and gestures while preserving visual continuity.

The SkyReels-A3 model delivers innovative experiences across four key dimensions:

Text Prompt Input - Enables dynamic scene modification;
Enhanced Natural Movements - More lifelike interactions, including object handling and natural hand gestures during speech;
Advanced Cinematic Control - Sophisticated camera work for artistic scenes (music/MVs) with elevated aesthetic quality;
Extended Video Generation - Single-shot videos up to 60 seconds; multi-shot sequences with unlimited duration potential.

Through analysis of real-world applications (e.g., advertising, live-stream commerce), we identified two key requirements: longer-duration videos with consistent quality, and more natural and precise interactive motions. To address these, we developed specialized training datasets for live-stream scenarios and implemented targeted optimizations in video generation.

Moreover, in scenarios requiring high artistic fidelity, such as music videos, film clips, or professional presentations, traditional digital humans are limited to "static shots," which produce rigid and visually flat results.

To enable dynamic cinematography, we developed a ControlNet-based camera control module. By processing precise camera parameters, the system achieves frame-accurate camera motion control. Specifically, the module extracts depth data from reference images and integrates user-defined camera parameters to render trajectory-guided reference videos. It then uses these videos as explicit motion priors to reconstruct professional-grade camera movements frame by frame. The result is digital human video with cinematic-quality camera work.

Currently, we offer eight preset camera movements: static shot, push in, push out, pan left, pan right, crane up, crane down, and handheld swing shot. Each movement type supports continuous intensity adjustment from 0 to 100%, allowing users to achieve precisely tailored cinematographic effects for diverse needs.
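
As a rough illustration of how a preset plus an intensity value could be turned into per-frame camera parameters, consider the minimal Python sketch below. It is not the SkyReels-A3 implementation: the preset keys, the (yaw, height, dolly) pose format, the unit magnitudes, and the camera_trajectory function are all hypothetical and chosen only to make the idea concrete.

```python
import math

# Hypothetical unit motions, expressed as the (yaw, height, dolly) offset
# reached on the last frame at 100% intensity. Units are illustrative only.
LINEAR_PRESETS = {
    "static": (0.0, 0.0, 0.0),
    "push_in": (0.0, 0.0, 1.0),
    "push_out": (0.0, 0.0, -1.0),
    "pan_left": (-1.0, 0.0, 0.0),
    "pan_right": (1.0, 0.0, 0.0),
    "crane_up": (0.0, 1.0, 0.0),
    "crane_down": (0.0, -1.0, 0.0),
}
# The eighth preset is oscillatory rather than linear, so it is handled apart.
PRESETS = list(LINEAR_PRESETS) + ["handheld"]


def camera_trajectory(preset, intensity, num_frames):
    """Return one (yaw, height, dolly) pose per frame for a preset.

    intensity is a fraction in [0, 1], i.e. the 0 to 100% slider.
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must be between 0.0 and 1.0")
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset}")

    poses = []
    for i in range(num_frames):
        # Normalized time in [0, 1] across the clip.
        t = i / (num_frames - 1) if num_frames > 1 else 0.0
        if preset == "handheld":
            # Assumed model of a handheld swing: a small sinusoidal sway.
            yaw = intensity * math.sin(2 * math.pi * t)
            height = 0.5 * intensity * math.sin(4 * math.pi * t)
            poses.append((yaw, height, 0.0))
        else:
            # Linear ramp from rest to the scaled unit motion.
            unit = LINEAR_PRESETS[preset]
            poses.append(tuple(intensity * t * c for c in unit))
    return poses


if __name__ == "__main__":
    for pose in camera_trajectory("push_in", 0.5, 3):
        print(pose)
```

In a pipeline like the one described above, a trajectory of this kind would be combined with depth estimated from the reference image to render the trajectory-guided reference video; that rendering step is outside the scope of this sketch.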

SkyReels-A3 is built on a Diffusion Transformer (DiT) video diffusion framework.

The DiT model has garnered significant attention for its exceptional performance in image and video generation. By replacing the traditional U-Net architecture with a Transformer, it captures long-range dependencies more effectively. In SkyReels-A3, we employ a 3D Variational Autoencoder (3D-VAE) to process video in latent space. The 3D-VAE compresses video along both the spatial and temporal dimensions, transforming high-dimensional raw video into compact latent representations. This latent-space approach substantially reduces the computational load on the diffusion model while preserving critical visual information.
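
To make the effect of spatio-temporal compression concrete, the short Python sketch below estimates the latent tensor shape and the resulting reduction in element count. The compression factors (4x temporal, 8x spatial) and the 16 latent channels are assumptions picked for illustration; the release does not state the actual values used by SkyReels-A3, and the function names are hypothetical.

```python
import math


def latent_shape(frames, height, width, t_factor=4, s_factor=8, channels=16):
    """Latent (channels, frames, height, width) under assumed compression factors."""
    return (
        channels,
        math.ceil(frames / t_factor),
        math.ceil(height / s_factor),
        math.ceil(width / s_factor),
    )


def compression_ratio(frames, height, width, **kwargs):
    """Ratio of raw RGB element count to latent element count."""
    raw_elements = 3 * frames * height * width
    c, t, h, w = latent_shape(frames, height, width, **kwargs)
    return raw_elements / (c * t * h * w)


if __name__ == "__main__":
    # An 80-frame, 480x832 clip is used purely as an example input.
    print(latent_shape(80, 480, 832))
    print(compression_ratio(80, 480, 832))
```

Under these assumed factors the diffusion model operates on roughly 48 times fewer elements than the raw clip contains, which is the kind of saving the latent-space design is meant to provide.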

SkyReels-A3's performance has been rigorously validated through extensive experimentation, including both quantitative and qualitative comparisons against state-of-the-art models (both open-source and proprietary). The results comprehensively demonstrate its capabilities in audio-driven video generation.

In addition, through step distillation techniques, we reduced the required inference steps from 40 to just 4 while maintaining comparable output quality.
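
A back-of-the-envelope Python calculation shows what this step reduction means for inference cost. It counts only denoiser forward passes and ignores the VAE decode, audio encoding, and other fixed overheads, so the end-to-end speedup will be smaller. The guidance_passes parameter is a hypothetical knob for samplers that evaluate the model more than once per step.

```python
def denoiser_calls(steps, guidance_passes=1):
    """Number of denoiser forward passes for a sampling run."""
    return steps * guidance_passes


baseline_calls = denoiser_calls(40)   # step count before distillation
distilled_calls = denoiser_calls(4)   # step count after distillation
speedup = baseline_calls / distilled_calls

if __name__ == "__main__":
    print(f"{baseline_calls} -> {distilled_calls} calls, {speedup:.0f}x fewer")
```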

From celluloid to digital, 2D to 3D – each imaging revolution has redrawn the boundaries of content creation.

SkyReels-A3 pioneers democratized voice-to-video synthesis, delivering studio-quality animation from just a single image and audio clip – no specialized hardware or production expertise required.

SkyReels-A3 animates static photos into lifelike talking portraits, overdubs speech in existing videos without face replacement, and delivers flawlessly smooth digital human livestreams. By offering an accessible, cost-effective, and high-fidelity AI solution, it serves diverse fields—from film production and virtual streaming to game development and educational content creation. With SkyReels-A3, personalized and interactive content has never been easier to produce.

SkyReels-A3 brings the "voice as vision" paradigm to life—where your inspiration could spark the next viral sensation.

View original content: https://www.prnewswire.com/news-releases/day15-skyreels-a3-the-art-of-natural-speech-for-digital-humans-302526394.html

SOURCE Skywork AI pte ltd