
The next frontier for artificial intelligence is text-to-video

The chapter is already unfolding: Google and Meta are working on it

Runway’s technology generates videos from any short description. Sourced by the correspondent

Mathures Paul
Published 16.04.23, 11:54 AM

Christopher Nolan and Marlon Brando are having a conversation about why Vito Corleone is seen petting a cat at the start of The Godfather. The only problem is that the conversation never happened. But imagine a future where AI-generated videos catch us as unprepared as OpenAI’s ChatGPT has. In fact, text-to-video AI services are already available, albeit at different stages of development.

The global generative AI market is expected to become a $42.6-billion industry this year, according to PitchBook, a capital market company. An AI like ChatGPT can learn, or appear to learn, based on the material you feed into it. How far are we into this new technology?


Runway, the generative AI startup that co-created the text-to-image model Stable Diffusion, has an AI model that can convert existing videos into new ones by applying any style specified by a text prompt or reference image. Called Gen-1, the model is showcased on Runway’s website with a few examples of this new kind of video. Even though the examples are very short, the output is realistic, like that of an “afternoon sun peeking through the window of a New York City loft” or “a low angle shot of a man walking down a street, illuminated by the neon signs of the bars around him”.

Runway has years of experience developing AI-driven video-editing software. The team behind The Late Show with Stephen Colbert has used Runway software to edit the show’s graphics, while the visual effects team behind Everything Everywhere All at Once used the company’s tech to help create certain scenes. Runway has also been used for Finneas’ Naked music video (VFX artist Evan Halleck has said: “There’s some hand-stretch stuff. Mainly to cut out so I could create a background.”)

Gen-1 is a video-to-video model that allows you to use words and images to generate new videos out of existing ones. The model has improved in fidelity and in the consistency of its results, and its success has prompted the company to come up with Gen-2, which unlocks text-to-video: you can generate a video with a simple text prompt, like “a surfer catching a wave”. It is expected to help create animations and stories. There are caveats, though. The clips are short and access is limited.

AI video technology can already reproduce common images, like waterfalls, different views of mountains or close-up shots of the human eye

Google and Meta are working on it

Meta Platforms and Google both made a start in this area with research papers on text-to-video AI models last year.

Last September, Meta unveiled a system called Make-A-Video. One look and you will know that the videos are machine-generated, but the system represented a step forward in AI content generation. The clips were kept to a maximum of five seconds and didn’t contain audio. Meta CEO Mark Zuckerberg said in a post: “It’s much harder to generate video than photos because beyond correctly generating each pixel, the system also has to predict how they’ll change over time.”

Google too has some impressive systems in this area: one emphasises image quality, while the other prioritises the creation of longer clips.

The high-quality model is called Imagen Video. Imagen is what’s called a “diffusion” model, generating new data by learning how to “destroy” and “recover” many existing samples of data. Imagen Video has been kept as a research project to avoid harmful outcomes. Last year, another team of Google researchers published details about a second text-to-video model, called Phenaki, which can create longer videos that follow the instructions of a detailed prompt.
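To make the “destroy” and “recover” idea concrete, here is a minimal toy sketch in Python. It is not Google’s actual implementation; the noise schedule and array sizes are illustrative assumptions, and only the forward “destroy” step is computed:

import numpy as np

# Toy forward ("destroy") process of a diffusion model:
# data is gradually mixed with Gaussian noise over T steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)     # assumed noise schedule
alpha_bar = np.cumprod(1.0 - betas)    # cumulative signal retention

def destroy(x0, t, rng=np.random.default_rng(0)):
    # q(x_t | x_0) = N(sqrt(a_bar_t) * x_0, (1 - a_bar_t) * I)
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

x0 = np.ones((4, 4))           # stand-in for an image (or a video frame)
xt, eps = destroy(x0, t=500)   # halfway through, the signal is mostly noise
# "Recover": a neural network is trained to predict eps from (xt, t);
# generation then runs this denoising in reverse, starting from pure noise.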

Recently, the show 60 Minutes showcased the progress experimental text-to-video systems are making. As with Google’s chatbot Bard, there are safety features; for instance, the system doesn’t create images of people.

The system from Runway could, in time, reduce the dependence on human labour in the editing department

New companies, old problems

Interestingly, the new AI race is bringing forward plenty of new AI companies. Take the example of Hugging Face, which hosts ModelScope, a video generator: key in a few words and you will get a video in return. Unlike the artwork being created by DALL-E, the videos still appear wonky. Yet all these platforms can help model scenes before they are shot. It’s not that writers, directors or actors will be replaced anytime soon, but things are scaling up. The biggest challenge at the moment is that AI video content is limited to a few seconds.
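For the curious, a generator like ModelScope can be tried with a few lines of Python. This is a minimal sketch, assuming the open-source diffusers library, the ModelScope checkpoint publicly hosted on Hugging Face at the time of writing, and a machine with a GPU:

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Load the ModelScope text-to-video checkpoint hosted on Hugging Face.
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
).to("cuda")

# A short prompt in, a clip of a couple of seconds out.
frames = pipe("a surfer catching a wave", num_inference_steps=25).frames
video_path = export_to_video(frames)  # writes an .mp4 and returns its path
print(video_path)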

ModelScope has been trained on text and image data, and is then fed videos to show how movements should appear. The company Wonder Dynamics is helping film-makers use computer-generated characters in videos, showing how visual effects can be done more cheaply and easily. It is focussed more on imaginative characters than on adding generated humans.

So far, a good example of AI on film has been Nothing, Forever, a never-ending streaming AI parody of Seinfeld, but even this was temporarily banned from Twitch in February after its main character, Larry Feinberg, made transphobic jokes. There is another problem with this entire AI chapter: copyright issues, as has already happened with AI-generated images.

Where are we on the AI video curve? Most text-to-video clips are short, and many of the services restrict videos to a few seconds. There’s another big problem: the technology will make compelling deepfakes ever simpler to produce. Deepfakes have so far used deep-learning AI techniques to edit existing videos, but diffusion models generate new content by following patterns found across millions of images. Now think of using a technique like “inpainting”, where faces of real people get superimposed onto the bodies of AI-generated fakes.
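As an illustration of how accessible such editing already is, here is a minimal inpainting sketch, assuming the open-source diffusers library and the inpainting checkpoint Runway has publicly released. The file names and the prompt are placeholders; the mask marks the region the model is asked to regenerate:

import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Runway's publicly released Stable Diffusion inpainting checkpoint.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.png").convert("RGB").resize((512, 512))
# White pixels in the mask mark the region the model will repaint.
mask = Image.open("mask.png").convert("RGB").resize((512, 512))

result = pipe(prompt="a man in a grey suit",
              image=image, mask_image=mask).images[0]
result.save("inpainted.png")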

With text-to-video progressing at a rapid pace, it’s only a matter of time before an open-source model emerges, bringing about a new era of opportunities and misinformation.
