Alright, it’s over. We’ve figured out video AI. Those were essentially my first thoughts when I saw OpenAI’s newest development, Sora: their own impressive generative video AI. The diffusion transformer model is capable of generating up to a minute of full HD video from text prompts and/or input images and videos, and can even merge or transition between two input videos, though it does seem to alter the given videos quite drastically.
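For anyone wondering what a “diffusion transformer” actually operates on: the technical report linked at the bottom describes videos being compressed and cut into “spacetime patches”, which the transformer then denoises like tokens. Below is a rough Python sketch of what such patchification could look like – purely my own illustration of the idea, not OpenAI’s code, and the patch sizes are made up.

```python
# My own conceptual sketch of "spacetime patches" – not OpenAI's implementation.
import torch

def to_spacetime_patches(video: torch.Tensor, pt: int = 2, ph: int = 16, pw: int = 16) -> torch.Tensor:
    """Split a video tensor of shape (T, C, H, W) into flattened spacetime patches."""
    T, C, H, W = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "dims must divide evenly"
    patches = (
        video.reshape(T // pt, pt, C, H // ph, ph, W // pw, pw)
             .permute(0, 3, 5, 1, 2, 4, 6)   # group by (time block, row block, column block)
             .reshape(-1, pt * C * ph * pw)  # one flat token per spacetime patch
    )
    return patches

# A 16-frame 256x256 RGB clip becomes 8 * 16 * 16 = 2048 tokens of length 1536,
# which a diffusion transformer could then denoise.
tokens = to_spacetime_patches(torch.randn(16, 3, 256, 256))
print(tokens.shape)  # torch.Size([2048, 1536])
```

The appeal of this kind of representation, according to the report, is that clips of different lengths, resolutions and aspect ratios all simply become token sequences of different lengths.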
Strengths
Most examples show realistic styles, but the model also seems capable of stylised 3D and 2D video and still generations. The focus, however, clearly lies on realistic generations, many of which are essentially flawless.
Text to Video
Prompt: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.
As seen here, the model has an exceptional understanding of 3D space, geometry and human anatomy, as well as lighting and, perhaps most impressively, reflections. It’s not perfect, of course, but considering this is the worst the model will ever perform, I’d say we aren’t far from it.
In this example, the model shows an excellent understanding of the chameleon’s anatomy and motion, as well as of the camera’s optics and overall geometric consistency. Again, it’s not perfect, but it is still incredibly impressive.
Prompt: This close-up shot of a chameleon showcases its striking color changing capabilities. The background is blurred, drawing attention to the animal’s striking appearance.
Image to Video
Here we see Sora’s first stylised performance, using an image generated by DALL-E 2 or DALL-E 3; which of the two was used for this particular image was not disclosed.
The model shows an appropriate understanding of how 2D characters like these are often animated, but the result is rougher than the more realistic examples, showing odd motion and morphing of body parts. It is also the only 2D example they gave, which leaves me a bit worried about my anime application.
Prompt: Monster Illustration in flat design style of a diverse family of monsters. The group includes a furry brown monster, a sleek black monster with antennas, a spotted green monster, and a tiny polka-dotted monster, all interacting in a playful environment.
Furthermore, OpenAI did not disclose whether the prompt was given to DALL-E 2, DALL-E 3 or Sora, so it is a bit difficult to judge the model’s performance.
Video to Video
Sora is capable of ingesting any video, be it “real” or AI-generated, and changing certain aspects of it. At the moment it looks like the AI can only affect the entire frame, altering aspects of the video the user did not ask to be changed. This behaviour reminds me a bit of ChatGPT failing to generate, say, a “room without an elephant in it”, but as I mentioned before – this version of Sora is the worst we will ever have.
The base video; its prompt was not disclosed.
As we can see, the AI affects the entire frame, completely changing the car and even altering the road slightly.
Prompt: change the setting to be a lush jungle
Here, even after specifically asking the AI to “keep the video the same”, it still makes what are, in my opinion, drastic changes.
Prompt: keep the video the same but make it be winter
An intriguing feature is Sora’s ability to blend between two videos, creating creative transitions that show the model’s exceptional understanding of 3D space and motion, but also clearly show it struggling with scale.
Input video 1
Prompt undisclosed
Input video 2
Prompt undisclosed
Connected video.
Prompt undisclosed
As previously mentioned, the model finds creative ways to combine the videos – there are many more examples on OpenAI’s website, which I have linked below – but it does get the scale pretty wrong. What I find impressive is that even though the input videos are changed very drastically, the first frame of video 1 and the last frame of video 2 match perfectly with Sora’s stitched generation, meaning one could use shorter transitions and have original footage before and after the transition with no hiccups.
Simulation
I’m not sure whether to call this a feature, as OpenAI seems to use the term ‘Simulation’ to show off the model’s understanding of 3D space, object permanence and object interactions. But they also point out that Sora has a good understanding of virtual worlds and their rulesets, complete with their contents, objects and subjects, as can be seen here:
Prompt semi-disclosed: ‘captions mentioning “Minecraft.”’ as per OpenAI
OpenAI say that this development is promising for possible actual simulation applications of AI, not just Minecraft. Apart from the pig spontaneously fading out of existence, it is very impressive; what surprises me the most is the consistency of the player HUD, the sword and the character’s movement through the virtual world. OpenAI claim that Sora is capable of “simultaneously controlling the player (…) while also rendering the world”, but don’t go into too much detail. I wonder how good Sora’s understanding of the virtual world actually is and how well it understands the user’s prompts.
Weaknesses
Apart from the familiar struggles with human anatomy, especially hands and fingers, the model does not seem to like physics very much, generating illogical motion when asked to produce things like explosions or destruction.
Prompt undisclosed
Other errors are more familiar, like objects or subjects popping in and out of existence and sometimes even merging with each other.
Prompt: Five gray wolf pups frolicking and chasing each other around a remote gravel road, surrounded by grass. The pups run and leap, chasing each other, and nipping at each other, playing.
And some errors are just pretty funny.
Prompt: Step-printing scene of a person running, cinematic film shot in 35mm.
Safety
In the name of safety, Sora is not yet available to the public, being entrusted only to a number of “red teamers” – a selection of experts in misinformation, hateful content and bias – who will be testing the model before it is released. OpenAI will naturally apply the same text classification processes it already uses for DALL-E and ChatGPT to reject text input that features likenesses, intellectual property or sexual content. Additionally, before a video is presented to the user, Sora checks every frame of generated content for potentially violating results. OpenAI is also focussing on developing methods to detect whether content is AI-generated or not, and on directly embedding robust metadata into any generations by Sora.
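To make the prompt-checking step a bit more concrete: OpenAI already exposes a public Moderation endpoint, and a pre-generation text screen could look roughly like the sketch below. This is purely my own illustration – Sora’s actual internal classifiers aren’t public, and the public endpoint doesn’t cover things like likenesses or intellectual property.

```python
# Hypothetical pre-generation prompt screen using OpenAI's public Moderation
# endpoint – my own sketch, not Sora's actual (non-public) safety stack.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def prompt_allowed(prompt: str) -> bool:
    """Return False if the moderation model flags the prompt, so no video is generated."""
    result = client.moderations.create(input=prompt).results[0]
    return not result.flagged

if prompt_allowed("A stylish woman walks down a Tokyo street filled with warm glowing neon."):
    print("Prompt passes the text check – generation could proceed.")
else:
    print("Prompt rejected before any frames are generated.")
```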
Thoughts
WELL. Given that text-to-video AI is only about a year old as a technology, this obviously shows just how fast things are moving at the moment, and therefore also underlines the potentially short-lived significance of my master’s thesis. Anime AI generation seems to be a very, very niche application, so I have that going for me, which is nice, but the pace of development is crazy: we have moved from horrifying AI-generated videos that are obviously unusable and only a technical showcase of what could be possible, to nearly flawless results, within one year.
Prompt: Will Smith eating Spaghetti (March 2023, Stable Diffusion)
I still think that my thesis will have its own value, especially if I focus on comparing traditional with AI-assisted methods and also talk about the creative aspect of the whole media design process. And new developments are seldom bad – let’s hope that these developments are also beneficial for me.
Links:
https://openai.com/research/video-generation-models-as-world-simulators