The older tools (Stable Diffusion and earlier) were discussed extensively in this thread, which started in March 2023 and now feels like it was started in 1023.
A “multimodal LLM” is one that combines multiple modalities, e.g. video, images, and text.
ChatGPT (and probably Copilot) can’t do that anymore. Since they ditched DALL-E and moved the generator into the LLM itself during the 4o period, the translation to an image prompt happens in a form that isn’t English. I’ve tried asking it, and it tells me it’s unable to write out the prompt it used. I’ve also asked it to write a text prompt for a given image, but more often than not the result is unusable for getting anything even close to what the original image looked like.
I’ve been using Google/Gemini Flow for a couple of days and prefer it over the “talk to a chatbot” method for just going in and playing with image-making for a while: Flow
It’s the same underlying tech; I just find the UI and experience closer to Midjourney or NightCafe or other clients actually made for image generation, versus asking the Happy LLM Chatbot to make a picture. Since it’s underpinned by the same LLM tech, it does really well at following prompts for concept, but I find it frustratingly stuck in the same obviously-AI image output you get from Grok, ChatGPT, and Gemini. You can specify art styles, mediums, “style of [artist]”, etc., and it still gives you stuff that feels like the generic cartoonish faces everyone associates with AI.

Back in Midjourney’s earlier days, it had a way of rendering women that people called Midge-Face (MJ, Midge, get it?), but it eventually got trained past it. This feels like a throwback to those days in overall visuals and vibe. However, it DOES definitely beat Midjourney at following a prompt in terms of actual objects, placement, scenes, and text. So it’s fun to dork around with, and I could see myself using it to make images to feed into img2img locally and work at changing the styling.
As noted, LLMs compress and rework your prompts to match what the image model expects in terms of tokens. Writing a 1000-word prompt is kind of pointless, since the LLM will strip out anything it deems unnecessary and reformat your poetry into prompt-speak.
Civitai recently (and finally!) split off its NSFW models and LoRAs into their own sister site, so I can actually tell people to check out Civitai without appending a warning that it’s not ALL large-breasted anime furry porn models, even if that’s what you’ll be seeing when you start looking. I really need to convince myself to take the time to install some newer models (Flux, Wan, etc.) with a new client program instead of messing with SDXL and A1111 like a caveman, but each time I try, I hit a learning curve, think “I could just be making stuff instead of figuring out how to make stuff,” and go back. I have a number of self-trained LoRAs and many times that in downloaded ones, so it’s a worn, comfortable blanket at this point.
I was running into that with Flow today. I’d get some image in Generic AI Style, upload a sample image to Gemini, ask “How do I convince you to make this?”, get a couple of sample prompts, and all of them would fail to get especially close. I’m sure it’s possible, and part of it is the learning curve, but it sure does feel like swimming upstream.
Ah yes, I looked at both Civitai and Civitai Red, and when you get to Red, the first six prompts that fill your feed look like art, and you think: huh, this is the NSFW version? Then you scroll down and it’s like 689 NSFW images and one guy animating a sailboat. I think they deliberately stuck those six non-NSFW images up top just to make the site look more respectable, like it’s not all about anime boobs and furries.
There’s not that much you can do with a self-run model that you can’t do with one of the frontier models, except for content they moderate away, and unless you want to really capture a specific look by deep-diving into tweaks and LoRAs and customizing the hell out of it. I may do the latter, but it’s intimidating. There’s a lot to learn.
I was surprised I was able to generate 11-second videos on my 4080 Super 16GB. 11 seconds doesn’t sound that impressive in absolute terms, but it is: processing and memory requirements go up non-linearly with total frames because of the way the model has to coordinate the frames with each other (timings below; there’s a quick fit sketch after the list).
- 24 frames = 110s = 4.58 sec/frame
- 48 frames = 215s = 4.48 sec/frame
- 72 frames = 361s = 5.01 sec/frame
- 103 frames = 612s = 5.94 sec/frame
- 137 frames = 916s = 6.68 sec/frame
- 161 frames = 1194s = 7.42 sec/frame
- 201 frames = 1687s = 8.39 sec/frame
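To put a rough shape on that non-linearity, here’s a quick back-of-the-envelope fit of the timings above. Assuming seconds-per-frame grows roughly linearly with frame count (so total time grows roughly quadratically) is my own guess at the functional form, not anything stated in this thread, but it lines up surprisingly well with the 250-frame run mentioned below:

```python
# Illustrative fit only: assumes total time ~ a*n^2 + b*n, i.e.
# seconds-per-frame grows linearly with frame count n.
import numpy as np

frames = np.array([24, 48, 72, 103, 137, 161, 201])
total_s = np.array([110, 215, 361, 612, 916, 1194, 1687])

# Least-squares fit of sec/frame as a linear function of frame count.
a, b = np.polyfit(frames, total_s / frames, 1)
print(f"sec/frame ~= {a:.4f} * frames + {b:.2f}")

# Extrapolate to a 250-frame (~11 second) clip.
n = 250
est_s = (a * n + b) * n
print(f"estimated total for {n} frames: {est_s / 60:.0f} min")
```

The fit predicts roughly 9.4 sec/frame and about 39 minutes for 250 frames, which squares with the ~10 sec/frame and ~41 minutes quoted below.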
I forgot to record the numbers for the 11-second clip, but I think it was about 10 seconds per frame, so at 250 frames it took roughly 41 minutes to generate the video. Stability Matrix, ComfyUI, and Wan are actually very good at maximizing your video card and system memory without thrashing. I’ve had locally run text LLMs drop performance by about 90% the second GPU memory fills, but somehow Stability Matrix / Wan / whatever was using both system RAM and VRAM without stalling my GPU. I think that’s partially down to how memory is used in autoregressive vs. diffusion generation, but I wouldn’t be surprised if there’s some clever engineering in that software too.
I still had a little memory headroom, so I could maybe try a 12.5-13.5 second video, but that’s starting to get prohibitive in terms of processing time. I guess I could just set it running when I go to bed.
I’ve just been experimenting with simple videos mostly bringing still images I took to life. The results are sometimes nice, sometimes AI horror. It’s fun to experiment with.
Edit: I found the DMD2 LoRA, which changes the whole way images are generated; suddenly you can output high-quality images in about 1.5 seconds each, like magic. I was cranking out batches of 35 images at 1024x1024 in under a minute. It has something to do with the CMT denoiser; it uses a different kind of technology. It’s kind of weird that it’s a LoRA rather than a separate model entirely, but it seems to be a flat-out “make your images generate in 1/5 the time AND better (depending on the model)” upgrade, which seems like one of those win/wins that shouldn’t be possible.
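For anyone who wants to try this outside A1111, here’s a minimal sketch using Hugging Face diffusers. The repo id, weight filename, and the 4-step / guidance-free settings are assumptions based on the public DMD2 release, not anything from this thread, so double-check them against the model card:

```python
# Hedged sketch: few-step SDXL with a DMD2 distillation LoRA via diffusers.
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# Distilled few-step sampling pairs with an LCM-style scheduler.
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights(
    "tianweiy/DMD2",  # assumed repo id for the DMD2 weights
    weight_name="dmd2_sdxl_4step_lora_fp16.safetensors",  # assumed filename
)

# 4 steps with guidance_scale=0: the distilled model bakes the guidance in.
image = pipe(
    "a watercolor fox in a snowy forest",
    num_inference_steps=4,
    guidance_scale=0.0,
).images[0]
image.save("fox.png")
```

That guidance_scale=0 bit is the key difference from normal SDXL sampling; with a distilled model, cranking CFG back up tends to blow out the image.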
Flow is a very cool video generation tool. You know what’s obnoxious, though? It could be a Midjourney competitor, or at least offer a Midjourney-like workflow. It’ll let you generate 4 images or videos at once, but there’s almost no variation among those images. It’s not 4 takes on the same idea like Midjourney gives you; it’s more like running “variation subtle” in Midjourney: some details may change, like the positions of people in a street scene or the clothing in a portrait, but you end up with almost the exact same thing with minor tweaks. Seems like a wasted opportunity. If they left some room for the model to produce a different take on each generation (let the first one be the canonical one that tries to be exact, and let the other 3 be different takes on the idea), it would actually be useful. As it is, I hardly ever use the x4 workflow because it hardly gets you anything. Total wasted opportunity by Google.
No argument there. I also usually just do 1x or 2x despite it not costing me anything to do 4x – it just feels like a waste of pixels.
Missed this last time, but this is generally true. There’s something to be said for no content moderation, and not even the boobs & blood kind: dumb stuff like Flow regularly throwing “might have a prominent figure” errors when the prompt doesn’t name anyone. There are also sometimes interesting tweaks you can make to the settings. Mostly, though, I just think it’s neat that I can do it without relying on some bajillion-dollar mega techcorp.
Anyway, random nonsense:
Playing fast & loose with “true” there:
Scene from a recent Pathfinder game:
I once made a custom LoRA for 'Zine style stuff and wanted to test Flow. It did pretty well!
Harlow’s monkey experiment pinball table:
Not sure what’s going on in this book but I’m here for it:
Somehow I got curious today about what would happen if I asked the various AIs to generate a small band of characters who would be plausible Pokémon but aren’t actual characters from the franchise.
I’m partly curious about the art, but also about the “legalities” as perceived by the AIs. A specific Pokémon character, e.g. Pikachu, is surely copyrighted, legitimately protected, and off-limits even for trivial variations.
But IMO the “Pokémon style” is not copyrightable, much less copyrighted, so it ought to be fair game.
What eldritch horrors might flow from such a prompt?
Create several new creatures in the style of Pokemon. Create realistic photographs of them as if they were real animals, possibly more disturbing-looking in photorealism than in cartoons. Iphone 15 photo.
Gemini
ChatGPT
Copilot
(No descriptions given.)
Thank you!!
That last batch looks a bit too similar to the dino in Jurassic Park that ate the traitorous IT guy after he crashed his jeep. Not sure I’d have wanted to play with those as a kid, especially not shortly before bedtime.