AI image generation is getting crazy good

I’ve been playing around more with Grok’s “speech” option on image-to-video. The only other AI I’ve used text-to-speech lip syncing in is Kling. While Kling is more sophisticated and flexible in video than Grok, its text-to-speech only chooses from a number of trained voices. Each has some things you can adjust, but the result always sounds flat and artificial, very similar to the text-to-speech applications that have been available for decades now. Grok, by contrast, seems to be creating new voices on the fly, trying to make them appropriate to the character and to add variation and emotion. Sometimes it does it very poorly, sometimes surprisingly well.

Here’s a sampling of outputs. I didn’t include any of the very worst, but some are better than others:

A few months ago I saw a prompt involving a girl who was a big fan of Stitch (from Lilo & Stitch), and I tweaked the prompt, replacing every “Stitch” with “blobfish”. I plugged one of those images into Grok, telling it only that she was saying “blobfish, blobfish, blobfish, blobfish, blobfish, blobfish”. The result was, I think, the best-quality audio I’ve gotten so far, putting humor and emotion in the voice and doing good lip sync:

(Here’s the image prompt. Except for the word “blobfish” replacing “Stitch” (and “pink” replacing “blue”) the prompt was written entirely by someone else. Insert your favorite fandom.)

A cinematic portrait of a 17-year-old girl, lying in bed surrounded by her beloved blobfish memorabilia. She is dressed in cozy pink pajamas adorned with a blobfish hoodie. Her long blonde hair cascades over her shoulders, and her bright light blue eyes sparkle as she smiles softly at the camera. The room is a testament to her devotion to blobfish, with framed posters, string lights, stickers, and various plushies scattered throughout. The walls are painted in soft pink hues, and the warm, gentle lighting creates a cozy, dreamy atmosphere. The highly realistic style captures the essence of a bedroom dedicated to a cherished passion, exuding calm and joyfulness.

Inspired by military fitness discussions elsewhere.

Can you hear the message in this one?

The overt one, or the subliminal one where she’s telling us all that the squids are our friends? Because I can’t hear that one.

It’s a commercial!

Yeah, can’t escape the ads. I’m seriously impressed. Beats Grok’s prompt comprehension by several miles.

You say Chro-noss, I say…

Hah. “Crow-noss” is how Sora decided to pronounce it. I don’t know what the caps/limits are yet, and generation is slow, so as long as I get a halfway decent result, I’m not redoing them to try to fix little things. Too many other things I want to try for now.

How a purple gender unicorn summoned the ghost of Joe McCarthy.

Yesterday I tried the new Sora2 extensively. I was amazed at the generous cap it gives me as a ChatGPT Plus subscriber: 100 videos per 24 hours.

Its prompt adherence is amazing, to the point where if you give it too much action for a 10-second clip, it tries to jam it all in there, so actions that should be sequential end up happening simultaneously.
A few issues it shares with Grok: It’s very hard to have a character move behind any obstruction in the scene, no matter how natural that might be. Even if you limit how much action you want in the video, it won’t keep things sequential; illogical simultaneous actions come out of it all the time.
I was attempting to do some fantasy magical transformations, and just like Grok, it just does not want to change the height or build of the character when transforming them, no matter how little sense those features make on the transformed character. Seriously. I tried transforming an adult man into a 6-year-old kid and wound up with a giant 6-year-old. Even more frustrating when the unprompted dialogue had the kid being amazed at how short he is now.
Oh, and sometimes even though I tell it they should have a different voice after transformation, it keeps the same voice. Sometimes it will start with the post-transformation voice coming from the character before the transformation.
Sora2 will frequently add unprompted dialogue to the scene; sometimes it’s very good choices, sometimes not so much. Its most frequent mistake is having the wrong voice and/or the wrong lines coming from different characters.

I thought I’d just dump the unicorn/McCarthy pic into Sora2 with a minimal prompt and boom, done, but then things didn’t work so well; I spent way longer working on this than I planned.

Sora2 has a limitation currently that you’re not allowed to upload realistic people images into it. If you write a prompt from scratch, it will do realistic people fine that way, but no-go on giving it a source image. I remember there were similar drastic limitations on Sora1 when it first came out which were later relaxed. Sora2 decided the ghostly McCarthy was too realistic, so I redid the image in digital art style in ChatGPT. And 3 times, no matter how I prompted it to preserve the symbol, it altered the symbol on the unicorn’s chest.

In the first video attempt, I got swapped dialogue. Right voice out of each, but they said the other character’s lines.

Told it no dialogue for the next two; the fight animation was not as good either time.

Interestingly, it fixed the DNA symbol on the first one.

FWIW, these are the source images I used.

The McCarthy is from a Time Magazine cover, used instead of a real photo because it is in color. Took a few tries with increasingly elaborate prompts to get something close enough to what I wanted.

Prompt:

A realistic photo of a purple gender unicorn using mystical arts to summon the ghost of Joseph McCarthy. McCarthy is rising bodily in a swirl of ghostly mist. The unicorn looks realistic, not like a drawing or cartoon. McMarthy’s entire body is visible (billowing and confused) not just his head. The unicorn has spellbooks and magic spell accessories and it waving its arms around. Iphone 15 photo with shallow dof and forced perspective.

I of course tried video, but nothing from Grok came out very good. The one from Kling was okay. (Kling gives you just 166 points per month; this clip used 20 points.)

Ooops. Bad timing.

“Hey Sora, do you have a full LLM AI in there that can create a video of yourself answering my questions I give you in the prompt?”

I know that’s a unicorn, but it reminded me of Johnny’s thread in which ‘Honky Donkey’ came up.

Oh, I just noticed ‘share’ links. Let’s see if I can skip having to upload the videos elsewhere.

The one I just posted. That link will expire in a month; this one will probably last longer:
https://sora.chatgpt.com/p/s_68dea90fa1388191a2da7c4000bcf475

“Hey Sora, give me a brief proof that the square root of any prime number is irrational.”
https://sora.chatgpt.com/p/s_68deb33e8ba4819199d8e1769872d7f9

“Hey Sora, give a quick proof that there is no largest prime number.”
https://sora.chatgpt.com/p/s_68deb3481e388191b89fc7332c2ec66c

“Hey Sora, give a quick outline of Wiles’ proof of Fermat’s Last Theorem. Can you cover it in 10 seconds? GLWT.”
https://sora.chatgpt.com/p/s_68decfb41d4881919d70de1f7cef2dd5

ETA: It seems like the first two were starting correctly; they just couldn’t get there in 10 seconds. I’m not sure if the last one was correct. I did spot one mistake in an expression on the chalkboard.
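
For reference, the standard textbook arguments behind the first two prompts are short enough to write out. This is only a sketch of the usual proofs, not a transcription of what Sora put on the chalkboard; the symbols $p$, $a$, $b$, $k$, $N$ and $q$ are labels chosen here for illustration.

```latex
\textbf{Claim 1.} If $p$ is prime, then $\sqrt{p}$ is irrational.

Suppose $\sqrt{p} = a/b$ with $a, b$ positive integers and $\gcd(a,b) = 1$.
Then $a^2 = p\,b^2$, so $p \mid a^2$, and since $p$ is prime, $p \mid a$ (Euclid's lemma).
Write $a = pk$. Then $p^2 k^2 = p\,b^2$, so $b^2 = p\,k^2$ and hence $p \mid b$.
Thus $p$ divides both $a$ and $b$, contradicting $\gcd(a,b) = 1$.

\textbf{Claim 2.} There is no largest prime.

Let $p_1, \dots, p_n$ be any finite list of primes and set $N = p_1 p_2 \cdots p_n + 1$.
Since $N > 1$, it has a prime factor $q$.
If $q = p_i$ for some $i$, then $q$ divides $N - p_1 p_2 \cdots p_n = 1$, which is impossible.
So $q$ is a prime not on the list, and no finite list contains every prime.
```

Each one takes a few lines on paper, so running out of room in a 10-second clip is not surprising.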

Sora2 prompt: make a live action vignette dramatizing that comic.

https://sora.chatgpt.com/p/s_68df5d4ba1cc8191bd026fc6c73c560c

The students don’t sound like they are getting it.

I think Grok did a good job with these clubbing baby seals.