Nice. But you can download the image without the watermark in the corner.
Stable Diffusion video model released.
I downloaded the tensor weights for Stable Video Diffusion the other day, but I use ComfyUI as it's the only front end I've found that totally sucks, and they don't seem to have added support yet. Hopefully soon.
Image-generation adjacent: a lawsuit against LLMs has been largely tossed out.
Bold choice to use the only front end that totally sucks
I’m a glutton for punishment, apparently. But seriously, ComfyUI is great for running local models. And it looks like support for Stable Video Diffusion just came out:
I’m visiting the parents, though, so I won’t be able to try it for a few days.
Weird. The popcorn, pretzels, and jellybeans look great, the bird’s only problem is being implausibly still, and the dog is pretty good aside from being out of focus, but what the heck is going on with that toast? I’d think that any model that can do the other things well would also be able to handle toast, or at least do a better job of it than that.
I got better toast in other runs, but many images had a dog/table merger going on.
Hugging Face has a usable version of Stable Diffusion image to video now.
I made one video last night, but spent more than 12 minutes in a queue waiting for it to run. Here’s the result:
For comparison, here's what I got with RunwayML Gen2 when it first released:
(Here’s the source image, made with Bing/DE3):
Stability AI released an early test of SDXL Turbo, a "one step" model. "Steps" here means how many denoising passes the model makes when it renders an image, not how many tasks you need to complete to use it. A photorealistic image in Stable Diffusion usually takes around 30-50 steps.
As a result, it was completing two images per second on my RTX 3080Ti when I tested it. That's hecka fast. It does have some major limitations, though: it doesn't really do photorealism, photoreal human faces are a disaster, it's meant to run at 512x512, and adding more steps immediately blows the image out. It really is meant for exactly one step, with CFG set to 1 as well. Still, it's a really cool hint at how the tech is progressing. I tried a bunch of stuff and had some "great" results when you remember: two images per second.
(Also, results are from a single run using A1111. I believe Comfy might do it better, since I wasn't doing the second Refiner pass, and for various other boring technical reasons.)
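If anyone wants to poke at it outside a front end, here's a minimal sketch using the Hugging Face diffusers library rather than the A1111 run described above. The checkpoint name is the public stabilityai/sdxl-turbo release; the prompt is just an example, and guidance_scale=0.0 is my stand-in for the CFG 1 setting, so treat the settings as a starting point rather than gospel.

```python
# Minimal sketch, not the A1111 setup from the post: SDXL Turbo via the
# Hugging Face diffusers library. The prompt is just an example.
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo",       # public checkpoint
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# One denoising step at 512x512, guidance effectively off
# (roughly the same role CFG 1 plays in A1111).
image = pipe(
    prompt="a photo of a raccoon astronaut, studio lighting",
    num_inference_steps=1,
    guidance_scale=0.0,
    width=512,
    height=512,
).images[0]
image.save("turbo_test.png")
```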
On Clipdrop:
OK, those chesscrapers are just plain cool. A human artist probably wouldn’t have had too much difficulty making that, either… given the concept. But it’s a great concept.
@Darren_Garrison, I think the possum is probably happier in the RunwayML version. It's still getting a creepy hand stuck to its nape, but its head isn't dematerializing.
Since I've set up a new YouTube account, here are the more coherent clips that I generated with my limited free Runway Gen2 seconds a while back.
Being able to run 800 images in just under 5 minutes (4:58) is addicting.
Was worried for my SSD, but those 800 images only take up 280 MB of space at 512x512.
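For what it's worth, those figures work out like this (just restating the numbers above):

```python
# Quick restatement of the batch numbers, nothing more.
images = 800
seconds = 4 * 60 + 58        # 4:58
total_mb = 280

print(f"{images / seconds:.2f} images per second")             # ~2.68/s
print(f"{total_mb * 1024 / images:.0f} KB per 512x512 image")  # ~358 KB each
```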
I’ve set my output to a RAM drive, and then only move the good ones to the permanent disk.
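If it helps, something like this little helper is all I mean by "move the good ones"; the paths are placeholders for wherever your RAM drive and permanent disk actually live.

```python
# Hypothetical helper for the RAM-drive workflow above; paths are assumptions.
import shutil
from pathlib import Path

RAM_OUTPUT = Path("R:/sd-output")   # where the UI writes its images (assumed)
ARCHIVE = Path("D:/sd-keepers")     # permanent disk for the good ones (assumed)

def keep(filename: str) -> None:
    """Move one keeper off the RAM drive before it gets wiped."""
    ARCHIVE.mkdir(parents=True, exist_ok=True)
    shutil.move(str(RAM_OUTPUT / filename), str(ARCHIVE / filename))

keep("00042-1234567890.png")  # example filename, purely illustrative
```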
Another day, another new toy.
Wanted a Mandalorian on a DeLorean. For some reason Bing/DE3 keeps thinking that concept should include a cat or small dog and some brown leather bags.
I’ve been playing around a lot with Stable Video Diffusion. It is a long way from perfected, but what is already there is impressive.
It creates video clips from images without a prompt, so it has to recognize not only the objects in the image but also what type of motion makes sense. A few videos are failures, with little or even no motion. A few are very simple pans across a static image (essentially the "Ken Burns effect"), but with new data created for the edges as the "camera" moves.
The zooms in or out are more sophisticated, moving the elements of the image around relative to each other with an awareness of parallax and of the fact that the image contains distinct objects layered in a three-dimensional environment, with the software very accurately determining the edges of individual objects. Some videos involve rotating "inside" the image, understanding the three-dimensional nature of the objects and creating new "structure" on them as they rotate.
With some images the software recognizes specific types of objects and tries to give them the appropriate type of movement. It can be environmental like billowing clouds and crashing waves or mechanical like turning wheels. It also can often recognize humans and animals and attempt to move their limbs and faces in appropriate ways.
Stable Video Diffusion does not do any of these things perfectly, and it makes major mistakes. But this is an experimental early release of the first version of the software, and it could (like still-image generation) improve rapidly.
SVD creates a set of 24 images, allocated as 6 frames per second for 4 seconds. It seems to me like they could make it an indefinite duration and not RAM-limited, basing each new frame off a number of previous frames, but currently it seems to keep all frames in memory at once and reference all of them: some of the video clips form near-seamless loops, and some lose detail mid-video only to regain it before the end.
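For anyone who wants to try it outside ComfyUI, here's a minimal sketch of the same image-to-video step using the Hugging Face diffusers pipeline. The checkpoint name and settings are my assumptions based on the public release and won't exactly match a ComfyUI workflow; the frame count and fps just mirror the 24-frame, roughly 6 fps clips described above.

```python
# Minimal Stable Video Diffusion sketch via Hugging Face diffusers,
# not the ComfyUI workflow described in the thread.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Public image-to-video checkpoint; assumed here, not taken from the thread.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()  # all frames are denoised together, so VRAM is the limit

# Any still made with Dall-E / Stable Diffusion; the model expects roughly 1024x576.
image = load_image("source_still.png").resize((1024, 576))

# ~4 seconds of motion: the post describes 24 frames at 6 fps; this checkpoint
# defaults to 25 frames, so the numbers here land in the same ballpark.
frames = pipe(image, num_frames=25, decode_chunk_size=8).frames[0]
export_to_video(frames, "clip.mp4", fps=6)
```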
This is a compilation of 75 of the 4-second clips, all generated from images I made using Dall-E or Stable Diffusion.
I think one of those was a small Wookiee. Also, it really likes making a Back to the Future DeLorean with the extra wires and stuff on the outside.