RunwayML has had something of a runaway success of late, as the trend of AI-generated videos has grown exponentially. From pizza commercials, to mockups of early 2000s home video, to short films, text-to-video is quickly becoming the new paradigm of generative AI.
Jumping on this trend, Stability AI has released a new SDK for Stable Diffusion that allows for the creation of animations. With the SDK, users can prompt with text alone, an image plus text, or a video plus text to create output animations. What first began with Meta’s Make-A-Video has now become the new frontier of generative AI algorithms. However, there are a few key players who are suspiciously missing from the lineup.
Too little, too late
The new release by Stability AI is a software development kit that works with Stable Diffusion 2.0 and Stable Diffusion XL. The SDK can influence the output through a variety of parameters, from general-purpose settings like style presets, cadence, and FPS (frames per second) to more in-depth parameters that control characteristics like colours, 3D depth, and post-processing.
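As a rough illustration, the snippet below sketches how these parameters might be set through the SDK’s Python animation module. The class and field names (api.Context, AnimationArgs, Animator, diffusion_cadence_curve, and so on) follow Stability’s published examples, but exact signatures can differ between SDK releases and should be treated as assumptions rather than a definitive recipe.

```python
# Hedged sketch of driving the Stable Animation SDK from Python.
# Class and field names follow Stability's published examples (stability_sdk.animation);
# exact signatures may differ between SDK versions.
import os
from stability_sdk import api
from stability_sdk.animation import AnimationArgs, Animator

context = api.Context(
    os.environ["STABILITY_HOST"],     # e.g. the gRPC endpoint for the API
    os.environ["STABILITY_API_KEY"],
)

args = AnimationArgs()
args.max_frames = 72                    # total number of frames to render
args.preset = "anime"                   # style preset (assumed field name)
args.diffusion_cadence_curve = "0:(2)"  # cadence: diffuse every 2nd frame, interpolate the rest
args.strength_curve = "0:(0.65)"        # how strongly each frame follows the previous one

# Prompts keyed by frame number; the SDK interpolates between them.
animation_prompts = {
    0: "a watercolour painting of a forest in spring",
    36: "a watercolour painting of the same forest in autumn",
}

animator = Animator(
    api_context=context,
    animation_prompts=animation_prompts,
    negative_prompt="blurry, low quality",
    args=args,
)

for i, frame in enumerate(animator.render()):
    frame.save(f"frame_{i:05d}.png")
```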
While this SDK is a good step forward for Stability AI, it seems that the company is late to the party. Similar solutions, built on Stability’s own models, have already existed in the market for a while now. Deforum, an online community of AI image creators and artists, has already created a Colab notebook for text-to-animation. However, Deforum is fairly basic: it blends successive Stable Diffusion images into one another, creating the illusion of animation.
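The general idea can be illustrated with an img2img feedback loop built on Hugging Face’s diffusers library. This is not Deforum’s actual code, just a minimal sketch of the technique: each new frame is generated from the previous one at low strength, so consecutive images stay similar and play back as motion.

```python
# Illustrative img2img feedback loop in the spirit of Deforum (not its actual code):
# every frame is a low-strength img2img pass over the previous frame, so the
# images drift gradually and read as an animation when played back.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a lighthouse on a cliff at sunset, oil painting"
frame = Image.new("RGB", (512, 512))  # placeholder seed; the first pass below effectively regenerates it

for i in range(24):
    # near-1.0 strength on the first frame behaves like plain text-to-image;
    # low strength afterwards keeps each frame close to the last
    strength = 0.95 if i == 0 else 0.45
    frame = pipe(prompt=prompt, image=frame, strength=strength, guidance_scale=7.5).images[0]
    frame.save(f"frame_{i:03d}.png")
```

Deforum layers colour controls, depth warping, and camera motion on top of this basic loop, which is why its output often feels more like morphing between stills than true motion.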
The true competitor to the Stable Animation SDK is RunwayML’s Gen-2 text-to-video service. This new model, whose paper is yet to be released, builds upon Gen-1’s capabilities of style transfer and video modification to generate video entirely from a text prompt. Similar to the Stable Animation SDK, users can supply text, images, or videos as prompts to generate videos from scratch.
While RunwayML’s Gen-2 can only be accessed through a waitlist, it is a complete product which can be used without any technical knowledge. The Stable Animation SDK, on the other hand, is targeted at developers who wish to multiply the capabilities of Stable Diffusion’s models.
Even as video generation emerges as the next big generative AI technology, it seems that many of the companies that capitalised on text-to-image are nowhere to be found.
RunwayML: the new DALL-E?
Early last year, OpenAI released DALL-E 2, an image generation algorithm, which kickstarted a wave of innovation. This resulted in the creation of Midjourney, Stable Diffusion, Imagen, and more, catapulting generative AI into the mainstream. However, with the innovations surrounding text-to-video, a lot of these companies have stayed silent, especially OpenAI.
With the release of ChatGPT, and subsequently GPT-4, it seems that OpenAI is content with tending to its golden goose. As such, we have not seen any improvements to DALL-E, apart from its integration into Bing Chat. There is also no talk of any text-to-video model from the AI giant, counting it out of the newest wave of innovation.
Midjourney has also not provided any information on possible text-to-video algorithms, instead choosing to focus on increasing its market lead by adding new features to its image generator. However, it seems that research is leading innovation, as it did just before the explosion of text-to-image models.
Meta’s AI research wing released a paper in September last year that detailed an approach to generating video without the need for text-video data pairs. Similarly, ByteDance, the company behind TikTok, released a research paper harnessing the power of diffusion models to generate videos. While neither of these models has been released to the public, the research shows that the ideas behind these approaches are sound, backed up by the variety of generated videos shown on their websites.
Google, in collaboration with the Korea Advanced Institute of Science and Technology, followed suit with the publication of a paper on projected latent video diffusion models. Unlike the others, this paper was published with code, allowing the approach to be replicated.
Building on the concept of text-to-video diffusion models, a team from Alibaba released ModelScope on HuggingFace. Apart from Deforum, it is the only such service that is openly available for anyone to use.
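For reference, ModelScope’s text-to-video weights can be run through the diffusers library. The sketch below assumes the damo-vilab/text-to-video-ms-1.7b checkpoint published on the Hub; return types and defaults vary slightly across diffusers versions.

```python
# Minimal sketch of running ModelScope's text-to-video model via diffusers.
# Assumes the "damo-vilab/text-to-video-ms-1.7b" checkpoint on HuggingFace;
# details may need adjusting for your diffusers version.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

result = pipe(
    "a corgi surfing a wave at sunset",
    num_inference_steps=25,
    num_frames=16,
)

# On some diffusers versions the frames are nested one level deeper (result.frames[0]).
video_path = export_to_video(result.frames)
print(f"Saved video to {video_path}")
```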
While the text-to-video market is still in its infancy, the AI-generated commercials show but an inkling of what is possible with video-generating algorithms. Meta has also released a set of generative AI tools targeted at advertisers on its platforms, so it is not a reach to think that Make-A-Video could be integrated into this in the future. Just as with any generative AI solution, the potential for innovation is boundless.