ShengShu Technology Unveils Vidu S1, Bringing Real-Time Interactive Generation to AI Video

PR Newswire

SINGAPORE, July 3, 2026 /PRNewswire/ -- At the 2026 Global Digital Economy Conference, ShengShu Technology today unveiled Vidu S1, its next-generation video foundation model, delivering real-time interactive video generation that transforms AI video from creating single clips to enabling continuous, live interaction.

Vidu S1 supports real-time video conversations with voice-guided character control, allowing users to direct AI avatars naturally through voice input during unlimited, continuous interactions. The model delivers 540P (960x540) resolution at 25 FPS (up to 42 FPS) and lets users instantly create a personalized interactive character from a single image of a real person, an anime character, or even a pet, paired with a customizable voice. Together, these capabilities create a more natural, fluid, and immersive real-time interactive experience. Notably, the entire system runs on consumer-grade GPUs, significantly reducing the hardware requirements for real-time interactive video generation.

From Offline Video Generation to Real-Time Interaction

Most existing video generation models operate in an offline workflow: users submit a prompt, wait for the video to be generated, and then view the completed result. Once generated, the content remains fixed. Making changes to an AI avatar's actions or the storyline typically requires generating a new video, limiting interaction to a one-way creation-and-viewing experience.

Vidu S1 introduces a real-time interactive video generation framework that enables users to provide voice input continuously throughout a real-time video conversation. The model processes voice input alongside conversational context and the current visual context, allowing subsequent video content to be generated and updated in real time.

Beyond real-time generation, Vidu S1 also advances voice interaction from simple lip synchronization to full AI avatar control. Rather than relying on audio-driven lip movements and predefined animation libraries, the model interprets the semantic meaning, intent, and emotional context of spoken input to generate synchronized lip movements, facial expressions, eye movements, gestures, body posture, and full-body actions in real time.

Together, these capabilities enable AI avatars to understand user instructions, respond naturally during conversations, and support continuous, real-time interaction.

Unlimited Real-Time Video Generation

Most video generation models today produce clips of fixed duration, typically from a few seconds to a few tens of seconds. Once generation begins, users have limited ability to influence how the video evolves.

Vidu S1 adopts an autoregressive diffusion (AR + Diffusion) architecture. Rather than generating an entire video upfront, it continuously predicts and generates subsequent video content based on previously generated frames, current voice instructions, and conversational context. As users provide new instructions, the model updates the character's expressions, movements, and subsequent actions in real time, enabling the interaction to evolve continuously throughout the conversation.
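
The general shape of such an autoregressive loop can be illustrated with a minimal Python sketch. This is a toy illustration, not ShengShu Technology's implementation: the names generate_next_frame and stream, the context_len parameter, and the list-of-floats stand-in for a frame are all hypothetical, and the placeholder arithmetic merely takes the place of a diffusion denoising step.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Optional

Frame = List[float]  # hypothetical stand-in for an image tensor


def generate_next_frame(context: List[Frame], instruction: Optional[str]) -> Frame:
    """Hypothetical placeholder for a diffusion step conditioned on context.

    A real model would denoise a new frame given recent frames and the
    current instruction; here we only mix the previous frame with a
    number derived from the instruction so the loop is runnable.
    """
    prev = context[-1] if context else [0.0]
    shift = float(len(instruction)) if instruction else 0.0
    return [0.5 * x + shift for x in prev]


def stream(instructions: Iterable[Optional[str]], context_len: int = 4) -> Iterator[Frame]:
    """Yield one frame per step, conditioning on a bounded window of past frames."""
    context: Deque[Frame] = deque(maxlen=context_len)  # oldest frames fall out
    for instruction in instructions:  # None means no new user input this step
        frame = generate_next_frame(list(context), instruction)
        context.append(frame)  # the new frame becomes context for the next step
        yield frame


if __name__ == "__main__":
    for frame in stream([None, "wave", None]):
        print(frame)
```

Because stream is a generator over a bounded context window, it can in principle run for as long as instructions keep arriving, which is the property the autoregressive design is meant to provide. How Vidu S1 actually selects and compresses its context is not described here.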

In addition to real-time interaction, Vidu S1 is a leading model for unlimited-duration real-time video generation. This requires more than continuous generation alone. The model must simultaneously preserve character identity, maintain natural and coherent motion, continuously process user input, and respond in real time throughout extended conversations.

By combining these capabilities, Vidu S1 enables persistent generative video interaction, allowing characters to remain responsive, visually consistent, and continuously interactive over extended periods.

540P at 25 FPS for Video-Call-Quality Interaction

Delivering real-time interactive video requires not only streaming generation, but also the resolution and frame rate needed to support natural, responsive conversations.

To meet these requirements, ShengShu Technology optimized Vidu S1 across model acceleration, inference, and system deployment, enabling real-time interactive video generation at 540P (960x540) resolution and 25 FPS, with support for up to 42 FPS.
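
As a rough illustration of what these figures imply, the short Python sketch below computes the average time available to produce each frame and the resulting pixel throughput. The helper names frame_budget_ms and pixels_per_second are illustrative, and the calculation assumes frames are produced at a steady rate; a system that generates frames in chunks may have different latency characteristics.

```python
WIDTH, HEIGHT = 960, 540  # 540P resolution, as stated above


def frame_budget_ms(fps: float) -> float:
    """Average time available to produce one frame, in milliseconds."""
    return 1000.0 / fps


def pixels_per_second(fps: float, width: int = WIDTH, height: int = HEIGHT) -> int:
    """Number of pixels that must be generated each second at the given rate."""
    return round(width * height * fps)


if __name__ == "__main__":
    for fps in (25, 42):
        print(f"{fps} FPS: {frame_budget_ms(fps):.1f} ms per frame, "
              f"{pixels_per_second(fps):,} pixels/s")
```

At 25 FPS the average budget is 40.0 ms per frame (12,960,000 pixels per second); at 42 FPS it shrinks to about 23.8 ms per frame (21,772,800 pixels per second).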

At the model level, Vidu S1 is powered by ShengShu Technology's inference acceleration techniques, including TurboDiffusion [1], low-bit SageAttention [2], and sparse attention methods such as SLA [3] and SpargeAttention [4]. Through few-step generation, model quantization, and optimized inference kernels, Vidu S1 significantly reduces the computational cost of generating each frame while supporting high-frame-rate output. This efficiency allows Vidu S1 to run real-time interactive generation on consumer-grade GPUs, rather than the large server clusters such workloads typically require.

At the system level, TurboServe [5], ShengShu Technology's inference serving engine, efficiently schedules inference workloads while maintaining user inputs, character states, and visual context throughout an interaction. Compute resources are dynamically allocated based on the interaction state to support stable, low-latency real-time interactive video generation.

Together, these model- and system-level optimizations enable Vidu S1 to deliver continuous, stable, and responsive real-time interactive video generation throughout extended interactions.

These capabilities provide the technical foundation for applications such as real-time video conversations, interactive livestreaming, AI companionship, interactive gaming, and XR experiences.

Create Interactive Characters from a Single Image

Creating traditional AI avatars typically requires multiple images or video assets, followed by character modeling, rigging, lip-sync configuration, and dedicated training before the character can be used for interaction.

Vidu S1 introduces a fully generative workflow that eliminates the need for character-specific modeling and training. Users simply upload a single image, and the model captures the character's identity, appearance, and visual style to generate synchronized lip movements, facial expressions, gestures, and full-body motion in real time.

Whether based on a real person, an anime character, or a pet, a single image can be turned into a real-time interactive character. Vidu S1 also supports customizable voices, enabling a consistent visual and vocal identity for each character.

By reducing character creation from a multi-step production pipeline to a single-image workflow, Vidu S1 makes personalized real-time interactive characters significantly easier to create.

A New Chapter for Interactive AI Video

As video foundation models continue to evolve, industry competition is expanding beyond image quality, generation speed, and video duration toward broader capabilities in real-time responsiveness, continuity, controllability, and interaction.

With Vidu S1, AI video moves beyond pre-generated content toward dynamic, responsive experiences in which the AI understands user input, responds in real time, and evolves continuously throughout an interaction.

Looking ahead, Vidu S1 has the potential to support a wide range of applications, including AI companions, AI virtual influencers, interactive livestreaming, game NPCs, branded AI avatars, intelligent customer service, online education, and XR experiences. These capabilities enable AI avatars to evolve from one-time content assets into persistent, always-on interactive agents.

From generating individual video clips to enabling continuous interaction, and from one-way content creation to real-time two-way engagement, Vidu S1 expands the capabilities of video foundation models and lays the foundation for the next generation of interactive AI experiences.

Availability

Vidu S1 is now publicly available, enabling users to create and interact with AI avatars from their own custom images in real time. An API platform is also available for developers and enterprise partners to build real-time interactive applications.

Global Experience: https://www.vidu.com/vidu-stream

API: https://platform.vidu.com/live/landing

References
[1] TurboDiffusion: Accelerating Video Diffusion Models by 100–200 Times.
[2] SageAttention: Accurate 8-bit Attention for Plug-and-Play Inference Acceleration.
[3] SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention.
[4] SpargeAttention: Accurate and Training-Free Sparse Attention Accelerating Any Model Inference.
[5] TurboServe: Serving Streaming Video Generation Efficiently and Economically.

SOURCE ShengShu Technology