Blog post 3 – Video AI #1

Given that I aim to create a short film using AI, I will need to come up with a workflow for moving imagery. I initially wanted to approach this blog entry practically, as I have done with many in the past: exploring new AI tools that I could possibly use for my project by trying them out directly, reporting on my experiences and providing humble criticism. With video AI it turned out not to be as simple as initially anticipated. There is a huge number of different workflows, so researching what currently exists already took a very long time. Additionally, most of the workflows are, in my opinion, still in their infancy, require a great deal of manual labor and are ultimately too time-consuming.

Method #1: Stable Diffusion with EBsynth and/or ControlNet

The foundation for many AI video workflows is Stable Diffusion, which I personally find reassuring, since users are able to run it client-side, it supports a huge variety of add-ons and it grants access to community-created models. For this method, short videos with a maximum resolution of 512×512 and simple movement seem to yield the best results. The method requires a “real” (meaning not AI-generated; filmed or animated both work) base video, which the AI then generates on top of. The video is rendered out as individual frames, out of which a minimum of four keyframes are chosen and arranged in a grid, which is then fed into Stable Diffusion. From there, a relatively normal Stable Diffusion workflow follows, requiring the correct use of prompts, generation settings and negative prompts. Once happy with the result, the keyframes need to be split up again and fed into EBsynth alongside the rest of the unedited frames. EBsynth then interpolates between the AI-modified keyframes, using the original frames as motion information. After some cleanup of faulty frames, which definitely still occur, the results are realistic, virtually flicker-free and aesthetically pleasing.
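For illustration, the frame-splitting and keyframe-grid preparation could be scripted roughly like the sketch below. This is only an assumption about how one might automate that step, not part of the tools themselves: it presumes ffmpeg is installed, the file names and the 2×2 layout are placeholders, and the keyframe numbers would be picked by hand from the footage.

```python
import os
import subprocess
from PIL import Image

# 1. Dump the base video to individual frames (ffmpeg must be on the PATH).
os.makedirs("frames", exist_ok=True)
subprocess.run(
    ["ffmpeg", "-i", "base_video.mp4", "frames/frame_%04d.png"],
    check=True,
)

# 2. Pick at least four keyframes by hand and tile them into one grid image,
#    which then goes into Stable Diffusion as a single canvas.
keyframe_ids = [1, 24, 48, 72]   # chosen manually from the footage
tile = 512                       # per-frame resolution used in this method
grid = Image.new("RGB", (tile * 2, tile * 2))
for i, fid in enumerate(keyframe_ids):
    frame = Image.open(f"frames/frame_{fid:04d}.png").resize((tile, tile))
    grid.paste(frame, ((i % 2) * tile, (i // 2) * tile))
grid.save("keyframe_grid.png")

# After generation, the edited grid is cut back apart into single keyframes
# for EBsynth (the reverse of the paste loop above).
```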

This process can be elaborated upon with ControlNet, a Stable Diffusion add-on that allows for further control over the generated results and is frequently used in video AI productions. Using a depth map extension, ControlNet can render out a highly accurate depth pass to help in the cleanup process.
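To make the ControlNet depth idea a bit more concrete, here is a minimal sketch using the diffusers and transformers libraries rather than the web-UI extension described above. The checkpoint names are common public models and the file paths are hypothetical; the settings would need tuning for real footage.

```python
import torch
from PIL import Image
from transformers import pipeline
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# A single keyframe from the base footage (hypothetical path).
keyframe = Image.open("frames/frame_0001.png").convert("RGB")

# 1. Depth pass: a monocular depth estimator produces the conditioning image.
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
depth_map = depth_estimator(keyframe)["depth"].convert("RGB")

# 2. Generation constrained by the depth map, so the new frame keeps the
#    geometry of the original footage.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

styled = pipe(
    prompt="oil painting of a dancer, dramatic lighting",
    negative_prompt="blurry, deformed, low quality",
    image=depth_map,
    num_inference_steps=30,
).images[0]
styled.save("frame_0001_styled.png")
```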

This workflow may seem very complicated and time-consuming, and that’s because it is, but compared to other methods it is still relatively simple. What’s more concerning is that the video length is quite short and the resolution very limited. Using video-to-video as opposed to text-to-video also yields much more usable results in my opinion. That is admittedly limiting and requires more human input than simply prompting the AI and having it create the entire shot for you, but there could be a workflow in there.

Imagine blocking out a scene in Blender, animating it with careful camerawork and fine-tuned timing and motion, and then having Stable Diffusion “render” it. I’m excited to see what the state of this approach will be, and what I think of the idea, when I actually have time to start work on the project.
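Purely as a thought experiment, a single frame of such a Blender blockout could be pushed through an img2img pass along these lines. The file names, prompt and strength value are guesses for the sake of illustration, not tested settings; in practice this would run per frame or per EBsynth keyframe.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A viewport or EEVEE render of the blocked-out scene (hypothetical path).
blockout = Image.open("blockout_0001.png").convert("RGB").resize((512, 512))

styled = pipe(
    prompt="moody sci-fi corridor, cinematic lighting, film grain",
    image=blockout,
    strength=0.55,        # lower keeps more of the blockout's composition
    guidance_scale=7.5,
).images[0]
styled.save("blockout_0001_styled.png")
```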

Method #2: Midjourney (?) and PikaLabs

At the moment, using PikaLabs through Discord seems to yield better results relative to the amount of manual labor involved, but human input and creativity are obviously still needed; prompting in particular plays a pivotal role here. I was wrongfully under the impression that PikaLabs needs Midjourney as a base to work well, yet any text-to-image model, or any image for that matter, can be used as a base, which is quite obvious now that I think about it.

PikaLabs offers many great tools and interactions specifically tailored towards video creators and feels more purpose-built for AI video creation in general. The user can easily add camera movements and tell the AI that subjects should be in specific parts of the frame or perform specific actions. Again, however, the AI seems to work best if it is fed great-looking base material, so this is also a workflow I could see working in conjunction with animatics and blockouts from Blender, or even After Effects for that matter.

Method #?

As I mentioned before, the number of methods currently out there is very high, and it is a daunting task just to keep an overview of it all. This blog post was hard to summarise, and I didn’t even mention Deforum or go into depth about ControlNet. With all of these quickly evolving technologies, and Adobe’s text-to-video AI right around the corner, the field is not about to stop changing any time soon. But if I had to start work on my project right away, I could and likely would comfortably use PikaLabs and Stable Diffusion in combination with a familiar tool like Blender or After Effects. Let’s see what the future holds.

Links:

Stable Diffusion, EBsynth & ControlNet

Stable Diffusion & Deforum

PikaLabs
