Some may think it’s a gimmick, but the truth is that videos will no longer be the same once OpenAI’s Sora becomes widely available. The tool isn’t open to the public yet: OpenAI is granting access to “red teamers” — basically testers — who are assessing potential risks associated with the model’s release. A limited number of “visual artists, designers, and filmmakers” also have access, “to gain feedback on how to advance the model to be most helpful for creative professionals”.
What is Sora?
Sora, named after the Japanese word for sky, can generate videos up to one minute long from text prompts. That may sound simple, but the results are eye-grabbing. After text-to-image, it is time for text-to-video tools to shine, as companies like OpenAI and Google look beyond still images in a sector projected to reach $1.3 trillion in revenue by 2032. Before Sora’s arrival, the AI tools we had for making videos were little more than a year old, and their problem was obvious: output so poor it was plainly unrealistic.
All that changes with Sora. Just as DALL-E could understand our text input and turn it into a stylised image, Sora does the same for video. And since it generates videos, the model needs to understand how things like reflections, textures, materials and physics interact over time to produce a reasonably good-looking result.
OpenAI has put out one-minute videos generated from text prompts on Sora. This is a moment from a video generated with the prompt “several giant woolly mammoths approach treading through a snowy meadow”.
It’s not perfect. But not everybody who sees AI-generated content on the Internet is looking for flaws. Most people browsing the Internet don’t intentionally look for signs of AI generation. These are videos that could appear on your X or Facebook feed, and you might watch them for a few seconds without a second thought.
What sets Sora apart?
It has the ability to interpret long prompts really well. OpenAI has shared quite a few videos to showcase the power of the tool, and OpenAI CEO Sam Altman continues to post such videos on X. One of them comes with a 135-word description, executed fantastically. The sample videos OpenAI shared feature a variety of characters and scenes, from people and animals to landscapes, and even a New York City submerged underwater.
It’s insane how fast AI models are improving. Even a few months ago, with DALL-E 3, you could still find something off about its output, especially when you asked for something like a photorealistic image of a human: something about the hands or the ears would always be a little bit off. The new videos are great.
Since these could pass as real videos to people who are not looking for AI-generated content, they could be a problem during an election year in both the US and India.
Take a look at the following two videos OpenAI has posted. The first has been generated by a prompt asking for “a beautiful homemade video showing the people of Lagos, Nigeria in the year 2056”. It’s something a group of friends sitting at a table at an outdoor restaurant would capture; the camera pans from an open-air market to a cityscape. The second video shows “reflections in the window of a train travelling through the Tokyo suburbs.” It looks like footage any of us might capture using the iPhone on a train. The reflections on the glass appear real. There is one more video that caught my eye — that of a grandma celebrating her birthday.
If you pixel-peep, there are flaws in the videos. Some of them are too perfect, some have the quality of video games, but all of them appear to capture the texture of real life. OpenAI claims that Sora “understands not only what the user has asked for in the prompt, but also how those things exist in the physical world”. The flaws that remain will mostly be ironed out within a few years.
The implications
Such videos have immense implications for those who shoot stock footage. Some of the videos that have been shown may do away with the need for a drone pilot, and much of this footage could replace stock videos shot by photographers and videographers. Say there is a video of a wall of TVs, which in real life would be expensive to recreate: a one-minute video of something like that usually needs to be shot on an expensive camera with expensive props. It can now be generated with a simple text prompt.
Consider this: Imagine it is 30 years ago, and the early word processors and spreadsheets are about to hit the market. The economic world is bracing for the next big productivity revolution. The promise at the time was that we’d all spend less time writing, drawing slides and computing numbers on a calculator. But here we are, 30 years later, and the reality is that we don’t work less. We just write much longer Word documents, and our PowerPoint decks have gone from six slides to 50. We engage in much more complex decision-making because the amount of data we have to process has exploded. Something similar will happen as generative AI videos become popular: the workflow will change, and the videos we make with cameras will change too.
Ultimately, the meaning of videos will change, and we will begin to suspect all videos of being synthesised. In 2018, Peter Jackson released They Shall Not Grow Old, a documentary about the First World War that featured colourised archival footage. His team tried to keep the colours true to life. AI, in its own way, is also trying to keep it real.
OpenAI hasn’t disclosed how many videos the system learned from or where they came from, saying only that training included both publicly available videos and videos licensed from copyright holders.
“A movie trailer featuring the adventures of the 30-year-old space man” generated using Sora
Most of the videos look like rushes from movies because of the depth of field, dolly moves, dynamic lighting and so on. But when we watch a movie, we know it’s not real. So how is this different? It’s different because most of us have become used to watching movie-quality clips in our social media feeds. So, at a glance, it all looks too real.
Sora’s competition
Companies like Runway and Pika have already shown impressive text-to-video models, and Google’s Lumiere may well be one of OpenAI’s primary competitors in this space. Similar to Sora, Lumiere gives users text-to-video tools and also lets them create videos from a still image. Google’s Lumiere paper has noted that “there is a risk of misuse for creating fake or harmful content with our technology, and we believe that it is crucial to develop and apply tools for detecting biases and malicious use cases to ensure a safe and fair use”.
Are there safeguards?
Last week, 20 tech companies, including Adobe, Amazon, Anthropic, Google, Meta, Microsoft, OpenAI, TikTok and X, signed a voluntary pledge to help prevent deceptive AI content from disrupting voting in 2024. The accord, however, did not call for a ban on election-related AI content. Beyond the pledge, Anthropic has said it would prohibit its technology from being applied to political campaigning or lobbying. In recent weeks, people in Taiwan, Pakistan and Indonesia have voted, with India scheduled to go to the polls soon. Google said in December that it would require video creators on YouTube and all election advertisers to disclose digitally altered or generated content.