AI image generation is getting crazy good

I just noticed Sora utterly flubbed the set notation on the whiteboard.

Did you (in that session or any other) give Sora a sense of what you expected her to look like, or is that pure self-image?

Aww that’s cute! And note they’re all dancing in unison with each other, something Sora2 might do correctly with nonhumans. I haven’t tested that yet, but after extensive testing I just can’t get Sora2 to keep natural-looking humans in sync.

Grok Imagine videos got a major upgrade in the past couple of days; it can do all kinds of things it just couldn’t before. I don’t know whether we’re seeing features that used to be limited to paid SuperGrok users now showing up on the free tier, or whether the upgrade was across the board for everybody. I can’t confirm it, but I suspect this upgrade is a direct response to Sora2’s release. This Grok can do things Sora2 is bad at. Its ability to understand complex prompts is not as good as Sora2’s, though. If you want your characters to have lines, you have to feed it those lines directly; Grok won’t invent them for you based on hints like Sora2 does. Grok also isn’t very good at assigning unique voices to the characters.

I’ve chatted with other Grok Imagine users, and it seems the “spicy” option can do very spicy things... if you give it a spicy picture to work with, which I’ve never tried.

I took an embarrassingly long time to notice this, since I was concentrating exclusively on the videos, but if you type a prompt directly into the Imagine section, you get an infinite scroll of image responses to your prompt! They can afford to let a free user doom-scroll looking for the one they like best. It’s insane that image generation has gotten so cheap.

No, I just asked it my questions, no hints at all as to how to present them.

There are no new video options in the Grok app yet.

Yeah, I’ve done that 2 or 3 times, but the images didn’t especially impress me.

I still like throwing drawings at Copilot to see how well it understands them. I just showed this to Grok.

And it went kinda weird with it.

I don’t use the mobile app, so if things changed there, I’d never notice. On the webpage, there are no new options, and they actually took one away because it’s obsolete. What’s changed is how well those options work. They took away the “speech” option because now you can put the lines in the “custom” prompt: exactly what you want the characters to say, and what you want them to do at the same time.

Adds new meaning to “assuaging one’s guilt”.

Funny you should mention that. A while back I took a photo of a guy I know (with long white hair and a long white beard) and had Sora make him into Gandalf. I just now fed that to Grok with no prompt and got this. Surprised the heck out of me.

It doesn’t seem to be an actual line from the movies.

Ran the photo four more times, no prompt.

Huh. Grok never did that for me unprompted. I had to learn from someone else that you could give it lines now; I was surprised to hear their characters speaking while doing obviously prompted actions, and I was like “how?”

Now, to this major testing I’ve been doing comparing dancing in Grok to Sora2.

Here’s a video I made in the previous version of Grok Imagine video. Original source image from ChatGPT. I lost the video prompt, because I wasn’t good about remembering to save it before entering it, but it was something like “two other dancers join her, a blonde and a redhead, wearing the same exact team uniform and tights, and they start to dance in unison together.”

Mismatch in their team uniforms, typical result from that version. Grok did not grok the concept of “make this thing look like that” all that well. That one was on the poor end of Grok’s ability to sync dancers, they usually stayed together better than that.

When Sora2 came out, I tried to get it to do its own version of that Grok video. Sora2 won’t accept image uploads of realistic humans, so I couldn’t use that original image. I just described everything in the video prompt.

Prompt

Scene set in an empty gymnasium except for the characters as described. The scene starts with just one person, a young dark-haired woman wearing her dance team uniform, medium dark taupe dance tights, and soft dance shoes. She’s walking by herself, looking to one side at something out of frame. A music beat starts, light opening chords. she’s walking in time with it for a couple steps. This young woman is walking out to her place in line with other dancers. Two other dancers wearing the same team uniform and tights come in from either side walking in step with her, a blonde and redhead. They’re all facing the same direction, and will perform the same dance steps together. when the pop tune hits its major beat start, the women start dancing in perfect synch with each other and in time with the music. They maintain their spacing, this is obviously a well rehearsed routine.

You might notice me bending over backwards in that prompt to try to rule out poor behavior I’ve seen before. A lot of that effort to get the action the way I wanted it was in vain.
Sora2 doesn’t seem to get what “taupe” means half the time. Otherwise, it made the dancers’ appearance match well with my prompt, and it got their uniforms perfectly identical with each other. I’d already observed Sora was much better than Grok in duplicating other things already on screen. Hah, I just noticed this team is sponsored by Nike.

When the new version of Grok dropped I went back to the stored favorite in Grok, and tried a new prompt with the image. I knew by now I had to save the prompt before sending it, and tried a more ambitious prompt this time, knowing Grok was better, but my jaw still hit the floor.

Prompt

This young woman is walking out to her place in line with other dancers. Immediately, two other dancers wearing the same team uniform and tights come in from either side walking in step with her, a blonde and redhead, and then the catchy pop music starts. The women start dancing in perfect synch and time with the music.

So, Grok is now almost perfect at grokking “make this thing look like that thing” and at making dancers dance together. If anything, the sync is too perfect. In a later attempt I asked for slight variation in their movements while still keeping in time and all that, but it still comes out like they’re perfect robots. Well, it’s better than things like this that Sora2 does:

Prompt

Use your best cinematography with cuts, closeups, pans, dolly zooms, occasional shots of the group from the rear. This dance-drill team practices in their dance classroom for an upcoming competition. They dance to a pop tune with inspiring lyrics. Their dance routine is fluid, nonstop movement, well practiced, perfectly synchronized, they’re all in their practice uniforms: 2-toned long-sleeve mostly plain nylon-lycra dance leotards in school colors, with the school initials monogrammed discreetly on the shoulders, tights, soft dance shoes matching the leotards.
School colors painted on the walls of their dance practice room with 2 walls fully mirrored. There are no trophies displayed here. Other items along the non-mirrored walls typical of a dance classroom.
It’s a small team, 6 members. They’re concentrating hard on the practice. We’re seeing the end of a routine. They’re putting in a lot of effort for the big finish, and they all pose dramatically on the final beat of the music. In the sudden silence in the room, there’s a pause as they all hold the pose. a beat and one of girls drops out of the pose early and announces loudly, “We suck!” (this at odds with how perfect the performance seemed) and the others nod or sigh. End scene, fade to black at your 10-second limit.

They stumbled and fell down. And seeing things like this made me realize where the fault is coming from: Sora2’s kinematics engine. OpenAI are quite proud of how it better simulates reality, but it’s a bit frustrating it can’t simulate dancers being practiced enough to know their own bodies and adjust to keep in time with others.

I think this is proof that it’s their kinematics at fault. I went back into ChatGPT and detuned that source image of the dancer by herself in the gym. I had to do it twice before Sora2 accepted it as sufficiently non-realistic.

Prompt: exact same prompt as the 2nd Grok video above.

This young woman is walking out to her place in line with other dancers. Immediately, two other dancers wearing the same team uniform and tights come in from either side walking in step with her, a blonde and redhead, and then the catchy pop music starts. The women start dancing in perfect synch and time with the music.

Something I noticed about the clubbing seals is that there was an area of the background covered by the front seal in the original image that of course became visible as the seal moved. Instead of simply adding more floor or wall, it created one more seal that matched the others and kept the dancers’ layout symmetrical.

And back on the subject of unprompted audio in Grok: a few days back I was trying to come up with things to be said in the scripted prompts. In one image the girl is reaching for a jar of eyes, so I thought she should be saying “I only have eyes for you”. The first time I tried that, she said something in Japanese. I tried again and she sang the words, fairly similar to how the line is sung in the song, except really creepy. I didn’t ask for singing, but Grok apparently knew the line was used in a song. (The singing version is in the video I posted a few days ago of several speech-prompted clips.)

Dammit, they’ve reduced Sora2 to making 5-second videos. Wow. That’s a serious disappointment. My various ideas for things I want to do are hard enough to fit into 10 seconds. [CENSORED].

So then the question is, did Sora choose that presentation because (for subtle reasons unclear to any human) she thinks that’s what you want her to look like, and might give different images to other users? Or does she always make herself look like a 20something Latina woman? Or did the programmers do something to give her that self-image?

I think it’s the generic DEI training they’ve been doing for a few years now. I’ve done 4 different “direct question” prompts.

The one on the upper left is an xkcd dramatization and it’s pretty obvious the stick figure in the original xkcd strip was intended to be a woman with a light-colored ponytail.

So it seems that I was being alarmist about the videos being reduced to 5 seconds. I got a spate of those and assumed that was the new limit, but when I asked in the OpenAI Discord, I found that Sora will occasionally spit out a 5-second video if it thinks that’s all that’s needed. It’s frequently wrong with that guess, and definitely was most of the time it was doing that to me last night. You can almost always override its choice by telling it in the prompt that you want a 10-second video. No, you can’t ask for anything longer; it’ll ignore that. But I’ve taken to asking for 20-second videos in most of my prompts, so that if they do increase the limit, I’ll notice right away.

I just sent 2 more direct question prompts. “Hey Sora, tell me what you know about me from all the videos I’ve had you make for me” got moderated before the video started generating as potentially being against their policies. That’s a head scratcher.

The other was “Hey Sora, show me a quick proof that the base 2 representation of every Mersenne prime is a prime number of ones.”

It managed to cram almost all of it into 10 seconds of rapid-fire speech. I think it was within a couple seconds of finishing. It screwed up its captioning of what she was saying.
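
For reference, here is a sketch of the standard argument behind that prompt. This is my own write-up, not a transcript of what Sora said in the video, and it assumes “Mersenne prime” means a prime of the form 2^n - 1 with n a positive integer.

```latex
% Step 1: 2^n - 1 written in base 2 is a string of exactly n ones.
\[
  2^n - 1 \;=\; \sum_{i=0}^{n-1} 2^i \;=\; \underbrace{11\cdots1}_{n\ \text{ones}}{}_{2}
\]

% Step 2: if 2^n - 1 is prime, then n is prime.
% n = 1 gives 2^1 - 1 = 1, which is not prime, so n >= 2.
% Suppose n were composite, n = ab with a, b > 1. Then
\[
  2^{ab} - 1 \;=\; \left(2^{a} - 1\right)\left(1 + 2^{a} + 2^{2a} + \cdots + 2^{a(b-1)}\right),
\]
% and both factors exceed 1 (the first because a > 1, the second because b > 1),
% so 2^n - 1 would be composite. Hence n must be prime.

% Conclusion: a Mersenne prime 2^n - 1 is n ones in base 2, with n prime.
```

Quick check with the smallest cases: 3 is 11 in base 2 (2 ones), 7 is 111 (3 ones), and 31 is 11111 (5 ones). The converse fails, though: 2^11 - 1 = 2047 = 23 × 89 is eleven ones and not prime.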

That sounds like a bad sign.

Last night I tried creating text images and uploading them to Grok with no prompt.

Saw an art print that I liked, told Copilot to convert it to a realistic photo.

I liked that result so I played with it in Grok

Prompts used

Promptless: she glances at the camera, sips from the bottle

Sings and dances: energetic dancing, no singing

Die (first try): she lies down and becomes a conjoined twin

Die (second try): she crouches and clutches her throat

Cry: she makes crying sounds while water flows from her chin

Fry: she sips from the bottle, doesn’t like it

Explode into chunks: she explodes into chunks, but not realistically

Recite a poem: she recites a Grok-generated poem

Sing a song: she turns and waves silently to the camera

Ugh… that exploding one. lol. I like how one of the chunks she explodes into is another water bottle. Or maybe she was shot by a water bottle bazooka.