There are moments in AI where you are utterly amazed by how fast the industry is moving, and today is one of them. OpenAI just introduced their text-to-video model, Sora, and the results are mind-blowing.
While the general public still doesn't have access to Sora at the time of writing, Sam Altman has been sharing unedited videos on X, and it's already hard to imagine where we'll be in 12 months...
Sora vs. Will Smith eating spaghetti
Just to provide some context about where text-to-video was less than a year ago, here's the classic (and somewhat disturbing) AI-generated Will Smith eating spaghetti video...
Now, comparing that with OpenAI's new Sora model, you can see how far the industry has come, and how this model is truly a breakthrough...
What is Sora?
Sora is a text-to-video model that can produce videos up to a minute long with an incredibly high level of visual quality and adherence to the user’s prompt.
Sora is able to generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background.
Currently, the model is only available to select users and red teamers, both to prepare the film and design industries for what's coming and to assess the model for potential risks.
What stands out most about the Sora model is its ability to understand and simulate the physical world in motion, marking a significant step towards AI models that can interact with real-world scenarios and solve problems requiring an understanding of physical dynamics and aesthetics.
What can Sora do?
So far, the capabilities of Sora seem to be vast and varied. From creating scenes of stylish individuals navigating the bustling, neon-lit streets of Tokyo to generating footage of prehistoric woolly mammoths treading through snowy landscapes, Sora's range is impressive.
It can produce content across genres, including historical reenactments, futuristic cyberpunk narratives, and photorealistic nature documentaries. This versatility will undoubtedly make Sora a valuable tool for filmmakers, visual artists, designers, and marketers looking to bring their imaginative visions to life...
How was Sora built?
Sora is a state-of-the-art diffusion model that's designed to transform videos from an initial state of static-like noise into clear, coherent visual narratives through a series of refining steps.
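To make the "refining steps" idea concrete, here is a minimal sketch of a diffusion-style sampling loop. The `denoise_step` function is a hypothetical stand-in for a trained denoiser, and all shapes and step counts are illustrative assumptions, not Sora's actual implementation:

```python
import numpy as np

def denoise_step(x, t):
    """Hypothetical stand-in for a trained denoiser that returns a slightly
    cleaner version of x at timestep t (not OpenAI's actual model)."""
    return x * 0.95  # toy update: shrink the noise a little each step

def generate_clip(shape=(16, 64, 64, 3), num_steps=50, seed=0):
    """Start from Gaussian noise shaped like a short clip
    (frames, height, width, channels) and refine it step by step."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)        # "static-like" noise
    for t in reversed(range(num_steps)):  # the series of refining steps
        x = denoise_step(x, t)
    return x

clip = generate_clip()
print(clip.shape)  # (16, 64, 64, 3)
```

The real model replaces the toy update with a learned network that gradually removes noise conditioned on the text prompt.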
Here are a few of the details around the research techniques OpenAI used to build Sora:
- It uses a transformer architecture akin to that used in GPT models, although this model treats videos and images as collections of patches, analogous to tokens in language models (see the sketch after this list).
- This enables a unified approach to training on diverse visual data with different durations, resolutions, and aspect ratios.
- By incorporating techniques such as recaptioning from DALL·E 3, Sora gains the ability to generate videos that closely follow textual instructions, showcasing its versatility in creating content from text prompts or enhancing existing images and videos.
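To illustrate the "videos as patches" idea, here is a hedged sketch that cuts a clip into fixed-size spacetime patches and flattens each one into a vector, analogous to tokens in a language model. The patch sizes, array shapes, and function name are illustrative assumptions, not Sora's actual configuration:

```python
import numpy as np

def video_to_patches(video, pt=4, ph=16, pw=16):
    """video: array of shape (frames, height, width, channels).
    Returns a (num_patches, patch_dim) array of flattened spacetime patches."""
    f, h, w, c = video.shape
    assert f % pt == 0 and h % ph == 0 and w % pw == 0
    patches = (
        video.reshape(f // pt, pt, h // ph, ph, w // pw, pw, c)
             .transpose(0, 2, 4, 1, 3, 5, 6)   # group the patch dimensions together
             .reshape(-1, pt * ph * pw * c)     # one row per patch "token"
    )
    return patches

clip = np.random.rand(16, 64, 64, 3)  # toy 16-frame RGB clip
tokens = video_to_patches(clip)
print(tokens.shape)  # (64, 3072): 64 patch tokens, each a 3072-dim vector
```

Per the technical report, the actual pipeline first compresses videos into a lower-dimensional latent space and extracts spacetime patches from that latent representation rather than from raw pixels, before feeding them to the transformer.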
This patch-based approach not only broadens the scope of visual content generation but also lays the groundwork for models capable of simulating real-world phenomena, marking a significant stride towards the development of Artificial General Intelligence (AGI).
As OpenAI puts it: "Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world."
You can check out the full technical report of Sora here.
Sora vs. Runway
Now, the real question is...how many AI startups did OpenAI just kill with Sora?
A few of the main competitors in the text-to-video space include Runway, Pika Labs, Stable Video, and others. Arguably the most notable is Runway's Gen 2, although as you can see from the comparison below, Sora is significantly more impressive:
You can find more text-to-video tools in the MLQ app here.👇
Sora Video Examples
Alright, enough writing. Let's look at a few more examples of what Sora can do.
Summary: Sora Text to Video
If you've been on Twitter in the last day, you know that Sora has taken the AI world by storm. It makes you wonder whether, in a few short months, it will be very hard to tell which videos on your feed are real and which are AI-generated.
Sora is clearly a major breakthrough in AI and offers unprecedented capabilities in text-to-video generation. As OpenAI continues to refine Sora and roll out access to more users, the possibilities for what's coming are truly boundless...