May 6th, 2024

China’s Sora competitor, Midjourney CEO’s Prediction, InstantFamily, StoryDiffusion, Stylus

For those of you who are new, this is the SinkIn Newsletter, a 5-minute read made at sinkin.ai to cover the most interesting stuff in the Image AI world.

We scroll, so you don’t have to.

Chinese tech firm ShengShu-AI and Tsinghua University on Saturday unveiled the text-to-video AI model Vidu, said to be the first in China that's on par with Sora. Launched at the ongoing Zhongguancun Forum in Beijing, Vidu can generate a 16-second 1080p video clip with one click. It is built on a self-developed architecture called Universal Vision Transformer (U-ViT), which allows it to simulate the real physical world with multi-camera view generation.

Showcase Video of Vidu

Are you buckled up for it?
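Vidu's exact architecture isn't public, but the U-ViT idea mentioned above (which traces back to Tsinghua's earlier diffusion-transformer research) treats everything, the diffusion timestep, the text condition, and the image patches, as plain tokens in one transformer, with U-Net-style long skip connections between shallow and deep layers. Here's a loose, assumption-heavy PyTorch sketch of that general design; all names and sizes are invented for illustration.

```python
# Illustrative U-ViT-style backbone: all inputs become tokens, and long
# skip connections link shallow and deep transformer blocks, U-Net style.
# Vidu's real model is not public; everything here is a simplified guess.
import torch
import torch.nn as nn

class UViTSketch(nn.Module):
    def __init__(self, dim=512, depth=12, heads=8):
        super().__init__()
        def block():
            return nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        half = depth // 2
        self.in_blocks = nn.ModuleList(block() for _ in range(half))
        self.mid_block = block()
        self.out_blocks = nn.ModuleList(block() for _ in range(half))
        # One linear per long skip, to fuse shallow features into deep layers.
        self.skip_fuse = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(half))

    def forward(self, time_tok, text_tok, patch_tok):
        # Concatenate all modalities into one token sequence: [time | text | patches].
        x = torch.cat([time_tok, text_tok, patch_tok], dim=1)  # (B, 1+T+N, dim)
        skips = []
        for blk in self.in_blocks:
            x = blk(x)
            skips.append(x)
        x = self.mid_block(x)
        for blk, fuse in zip(self.out_blocks, self.skip_fuse):
            x = fuse(torch.cat([x, skips.pop()], dim=-1))  # long skip connection
            x = blk(x)
        return x  # per-token output; a real model would unpatchify to predict noise
```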

InstantFamily is an approach to multi-ID image generation introduced by researchers from SK Telecom. It leverages a masked cross-attention mechanism and a multimodal embedding stack to preserve and precisely control multiple identities within a single image. In experiments, InstantFamily shows strong identity preservation, achieving state-of-the-art results in both single-ID and multi-ID scenarios.

A photo of seven men on Mars
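The core trick is the masked cross-attention: each identity embedding is only allowed to influence the image region where that person should appear. Below is a rough single-head sketch of that mechanism; the tensor names, shapes, and mask handling are our assumptions, not the authors' code.

```python
# Sketch of masked cross-attention for multi-ID generation: image positions
# query identity embeddings, but identity k is visible only inside its mask.
import torch
import torch.nn.functional as F

def masked_cross_attention(img_tokens, id_embeds, id_masks):
    # img_tokens: (B, N, C) latent image tokens (N = H*W spatial positions)
    # id_embeds:  (B, K, C) one face/identity embedding per person
    # id_masks:   (B, K, N) 1 where identity k may appear, 0 elsewhere
    scale = img_tokens.size(-1) ** 0.5
    scores = img_tokens @ id_embeds.transpose(-2, -1) / scale      # (B, N, K)
    # Positions outside identity k's mask must not attend to identity k.
    scores = scores.masked_fill(id_masks.transpose(-2, -1) == 0, float("-inf"))
    attn = torch.nan_to_num(F.softmax(scores, dim=-1))  # all-masked rows -> 0
    return attn @ id_embeds                              # (B, N, C)
```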

StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

StoryDiffusion is a new framework that improves the consistency of content across a series of images generated by diffusion models. It introduces Consistent Self-Attention, a method that improves the uniformity of generated images and integrates seamlessly with existing pretrained text-to-image models. It also introduces the Semantic Motion Predictor, which creates smooth video transitions by predicting motion in semantic space between images. This enables the generation of stable long-range videos, a notable improvement over methods that predict motion only in latent space.
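In spirit, Consistent Self-Attention is training-free: while generating a batch of story frames, each frame's self-attention also attends to tokens sampled from the other frames, which nudges characters toward one shared appearance. Here's a simplified single-head sketch under those assumptions; the real method hooks into the attention layers of a pretrained text-to-image UNet rather than running standalone.

```python
# Simplified Consistent Self-Attention: each frame attends to its own
# tokens plus a random sample of tokens from the other frames in the batch.
import torch
import torch.nn.functional as F

def consistent_self_attention(x, sample_ratio=0.5):
    # x: (B, N, C) tokens for B frames that should share characters; B >= 2
    B, N, C = x.shape
    assert B >= 2, "needs multiple frames to share tokens across"
    n_ref = int(N * sample_ratio)
    outs = []
    for i in range(B):
        # Pool tokens from every *other* frame and sample a reference subset.
        others = torch.cat([x[j] for j in range(B) if j != i], dim=0)
        idx = torch.randperm(others.size(0))[:n_ref]
        kv = torch.cat([x[i], others[idx]], dim=0)   # own tokens + shared refs
        scores = x[i] @ kv.T / C ** 0.5              # (N, N + n_ref)
        outs.append(F.softmax(scores, dim=-1) @ kv)
    return torch.stack(outs)                          # (B, N, C), cross-frame consistent
```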

Stylus is a method designed to enhance the generation of high-fidelity, custom images by efficiently selecting and automatically composing task-specific adapters (aka LoRAs). Stylus operates in three stages: it first refines adapter descriptions and embeddings, then retrieves relevant adapters based on a prompt's keywords, and finally composes them to best match the prompt's requirements. In evaluations, Stylus is preferred roughly twice as often as the base model by both human and multimodal-model judges.
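A toy end-to-end version of that three-stage flow might look like the sketch below. The embedding function, adapter catalog, and score-proportional weights are all placeholders we made up; per the paper, Stylus actually uses a vision-language model to rewrite adapter descriptions (stage one) and more careful composition logic than this.

```python
# Toy Stylus-style pipeline: refine (assumed done) -> retrieve -> compose.
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder encoder (hash-seeded random vector); a real system
    # would use a sentence-embedding model here.
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % 2**32
    v = np.random.default_rng(seed).normal(size=384)
    return v / np.linalg.norm(v)

# Stage 1 (assumed already done): adapters with refined descriptions.
ADAPTERS = {
    "pixel-art-lora": "crisp retro pixel art, 8-bit game sprites",
    "film-grain-lora": "analog film grain, soft 35mm photo texture",
    "anime-style-lora": "clean anime line art, flat cel shading",
}
ADAPTER_VECS = {name: embed(desc) for name, desc in ADAPTERS.items()}

def stylus_select(prompt: str, top_k: int = 2) -> dict:
    # Stage 2: retrieve the adapters most similar to the prompt.
    q = embed(prompt)
    sims = {name: float(q @ v) for name, v in ADAPTER_VECS.items()}
    picked = sorted(sims, key=sims.get, reverse=True)[:top_k]
    # Stage 3: compose -- naive score-proportional LoRA weights.
    total = sum(sims[n] for n in picked) or 1.0
    return {n: round(sims[n] / total, 3) for n in picked}

print(stylus_select("a knight sprite in retro pixel art"))
```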

Meme of the Day

What'd you think of today's edition?
