I’ve found Stable Video Diffusion to be pretty good at rotating some objects, like nature scenes, but less accurate with artificial objects, and terrible with printed items, like labels. I was kind of expecting that.
I rendered rotating jars with product labels, surrounded by a variety of medicinal plants with penetrating sun rays. The plants and sun rays rotated and looked great; the jars deformed a little; the labels immediately became unrecognizable. It would take time-consuming tracking and rotobrushing of the labels in After Effects for me to make that clip look realistic.
But, I see the potential with Stable Video. I think it’s just a matter of time until it nails nature, man-made, and print.
Of course, one would expect it to have better comprehension of the sorts of objects it’s generating in the first place. Does it perform as well with real or human-generated images?
And is there any ability to prompt it for what kind of motion you want?
I haven’t seen any Video Diffusion examples/clients that suggested I could say “Make him wave” or “She should move her hands to steer the car”. It’s more like small rotations, jerky movements* and wind/water effects. Amusing, if unconvincing, with the results usually looking like a student’s first CGI project video. I think I’m currently more excited about the One Step diffusion generation than this, though I’m sure it’ll get there.
*Jerky movements could possibly be helped by increasing framerates, which means longer generation times. But a lot of it is just in the rendering itself and the AI not knowing the steps in creating a convincing eyebrow lift or smile.
I don’t see any reason why it wouldn’t work on a photo. You’re just uploading a jpg or png file, and the AI doesn’t know what the origin of that file was. I run it locally just because I enjoy having the option for local control of my stuff, and because you never know when an online provider will lock stuff down. It takes a long time (40+ min) on my 12GB 3080Ti though, and that’s a heck of a wait to say “Heh” at the result, so I don’t see myself using it very often. Of course, queues for free online generators also take a long while. 4090s are going for $2,500 (!!) now, which makes me reflect on when I scoffed at buying one at the inflated price of $1,800.
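For anyone curious what the local setup can look like, here’s a minimal sketch using the Hugging Face diffusers pipeline. The model id and defaults are the publicly documented ones, but the input filename is just a placeholder, and the low-VRAM tricks (fp16, CPU offload, a small decode_chunk_size) are what let it squeeze onto a 12GB card.

    import torch
    from diffusers import StableVideoDiffusionPipeline
    from diffusers.utils import load_image, export_to_video

    # Load SVD in fp16 so it fits in consumer VRAM.
    pipe = StableVideoDiffusionPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        torch_dtype=torch.float16, variant="fp16",
    )
    pipe.enable_model_cpu_offload()  # swaps layers to system RAM; slower but fits 12GB

    # "jar_render.png" is a placeholder for whatever still you're animating.
    image = load_image("jar_render.png").resize((1024, 576))
    frames = pipe(image, decode_chunk_size=2).frames[0]  # smaller chunks = less VRAM
    export_to_video(frames, "jar_render.mp4", fps=7)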
Generation itself takes around a minute or so on the A10G at Hugging Face, but the queue wait varies, from maybe 5 minutes up to around 15. When I’m using it I usually have one copy open in Firefox and one in Chrome, so the wait is effectively halved.
I was getting 15-20 min queues, but I’m sure that depends on time of day, phase of the moon, and luck. It’s certainly better than running it locally, though I like to keep my options open for when some organization suddenly decides that “Joe Biden beatboxing” or “woman with bare ankles” is verboten.
I haven’t used it a ton though (by either method), since the current results typically range from mildly amusing to “I waited 20 min for this?”. Midjourney is supposedly integrating more robust video creation in their upcoming model(s), so maybe that’ll be more worthwhile.
One thing I’ve been thinking about with image generators: we’ve gone through the stage of raw experimentation, being impressed to see anything at all. We’ve gone through the stage of laughing at how bad the mistakes are. We’ve gone through the stage of showing off the language tricks you have to use to get the type of image you want. But now we are entering the stage where you have an idea for a concept that you want to illustrate, and you simply type in a prompt and get it. Especially with Dall-E 3. It isn’t nearly perfect yet, but it is getting there.
For example, trending since yesterday on the Facebook AI groups are posts that start with “You’ve heard of Elf on the Shelf, but how about…” followed by an image, and the point isn’t the process of getting it, nor laughing at the flaws in the image, but the image itself.
I tried an experiment of taking the final frame from a clip I liked (the swaying ghosts from the beginning of my compilation video) and using that as an input for a new clip. Unfortunately, neither attempt continued the same type of action. One did a pan; the other had the ghosts hopping and dissolving into a mess. I don’t know if I would have gotten more consistent results if I had preserved the original seed, but probably not.
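If anyone wants to try the same continuation trick, grabbing the last frame is only a couple of lines with imageio (a sketch assuming the pyav plugin is installed; the clip filename is just a placeholder):

    import imageio.v3 as iio

    # Read every frame of the generated clip (array shape: frames x H x W x 3).
    frames = iio.imread("ghosts_clip.mp4", plugin="pyav")
    # Save the final frame as the input image for the next generation.
    iio.imwrite("ghosts_last_frame.png", frames[-1])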
Okay, I just did a video of a photo of my cat (this time the wait was more than 20 minutes). It only did a shaky-cam thing instead of giving the cat any movement, but it did a reasonable job of creating parallax changes across the various depths of objects.
SVD does have a slider for the amount of motion, which I haven’t messed with before. I tried the cat photo again (another 20-minute wait) with the slider moved from the default of 127 to the max of 255. This time…it did not go well.
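For what it’s worth, if you run the pipeline yourself instead of the Space, that slider appears to map to the motion_bucket_id argument. A rough sketch, continuing the local example above and using diffusers’ documented defaults:

    # 127 is the default; 255 is the ceiling and tends to exaggerate motion.
    frames = pipe(
        image,
        motion_bucket_id=255,
        noise_aug_strength=0.02,  # adding more input noise also increases motion
        decode_chunk_size=2,
    ).frames[0]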
As an aside to this fascinating thread (and it is amazing how it has evolved in a short time), which platforms would you now recommend to play around with:
A) which is free?
B) which is low cost?
C) which is most professional?