Another AI images question

And in a lot of those images, humans would interpret the elephant’s head as being at the front, but only because that’s where the elephant’s head is: the truck part doesn’t have any particularly recognizable front or back.

And I’m not convinced that it entirely understands the “frontness” of an elephant, given that one of those has tusks coming out of both ends, and several are missing the most quintessential feature of an elephant’s front: the trunk.

Also, I can’t help but remember one of the experiments from the long NightCafe thread, where someone asked it for a painting of the prompt “Facing the Charging Elephant”, and the AI dutifully created a painting of an elephant plugged into a charging station.

That would be a failure of ‘word in context’ determination. Word in Context is an emergent capability that just appeared at a certain point in training.

But it’s not a “mistake” to say that lungs have a purpose in a sense that geologic formations do not.

Who says it’s a failure?

It’s a better “charging elephant” joke than ChatGPT is giving me.

But the majority of the current explanations for why a trait evolved boil down to “because it made the organism more fit”, which isn’t much of an explanation at all.

A more direct analogy to explaining how an AI recognizes specific things is if you can explain how a trait evolved by saying that a switch from adenine to thymine in the 46th codon in the gene for the 3rd step in a five-step process resulted in an enzyme that was 17 percent more efficient. We have some explanations like that, but they aren’t the norm and they aren’t cheap or easy answers to get.

Explaining how a trained AI model “understands” a specific concept would have a similar level of fine detail, difficulty to tease out, and gibberish-soundingness to laymen. The real answer for how an AI recognizes the front of the truck would be something like “because values b724 and b725 in node 17 of layer 8 are set to ‘1’”.
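To make that concrete, here is roughly what “pointing at the values” looks like in practice: a minimal sketch using PyTorch and an off-the-shelf ResNet as a stand-in (Midjourney’s actual architecture, layers, and channel numbers are not public, so everything named here is illustrative).

```python
# Minimal sketch of reading raw activations from one layer of a network.
# The model (a torchvision ResNet) and the layer/channel picked are purely
# illustrative stand-ins; they are not Midjourney's architecture.
import torch
import torchvision.models as models

model = models.resnet18(weights=None)  # untrained weights, just for illustration
model.eval()

captured = {}

def save_activation(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

# Register a forward hook on one arbitrary intermediate layer.
model.layer3.register_forward_hook(save_activation("layer3"))

image = torch.randn(1, 3, 224, 224)  # stand-in for a real input image
with torch.no_grad():
    model(image)

# An "explanation" at this level of detail is just pointing at numbers like these.
acts = captured["layer3"]
print(acts.shape)           # e.g. torch.Size([1, 256, 14, 14])
print(acts[0, 17, 0, :5])   # a few raw activation values from channel 17
```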

But who cares about that, for either evolution or for AI? Aren’t the general underlying principles by which it is operating the important explanation?

No? Not when the question is “how does Midjourney recognize the front of an elephant”, or “why is a St. Bernard so big”. General principles are not answers to specific questions.

Looking at the Midjourney pictures, it also has both the truck and the elephant rotated the same way in 3D space. It does look like two 3D models merged together.

The examples using other AI services were not as good as the ones from Midjourney. I’m not sure why, but it seems like Midjourney has its own style.

Which is weird, because Midjourney is Stable Diffusion with some custom tweaks.

I believe Midjourney has a much larger model than what you get from Stable Diffusion, at least compared to locally run Stable Diffusion, where the models are maybe 2-3 GB. So Midjourney probably has a lot more data to work with when extrapolating what an elephant-shaped truck might look like.

For that matter, I suppose you could custom-train an SD model on elephants and garbage trucks if you really wanted.
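For reference, running one of those local checkpoints is only a few lines with the Hugging Face diffusers library; this is a minimal sketch (the checkpoint name, prompt, and availability of a CUDA GPU are illustrative assumptions), and custom training on your own elephant and garbage-truck images would build on the same pipeline.

```python
# Minimal sketch of running a locally downloaded Stable Diffusion checkpoint
# with Hugging Face diffusers. Checkpoint name and prompt are illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # a few GB of weights, cached locally
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # assumes a CUDA GPU is available

image = pipe("a garbage truck shaped like an elephant, photo").images[0]
image.save("elephant_truck.png")
```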

Okay, apparently the latest Midjourney is not SD-based.

From that link:

“Yeah, all of my Midjourney results seem to be a pastiche high quality 3D render of the prompt, instead of mimicking the style asked for.”

Given that learning systems ultimately involve human feedback, isn’t the answer “because humans keep picking the versions that have the right front bits”?

This is not true.

As to the OP, the best answer you’re likely to get is that the images in the training data tend to show more of the front of both trucks and elephants since those are the interesting parts.

This video is five years old now but still entirely relevant to why we don’t really know how AI works.

OpenAI’s own discussions of their research seem to mention using human feedback quite a bit.

Having the ability to incorporate human feedback is not the same as human feedback being necessary. A large part of the recent success of deep learning is that there are techniques (masked language and masked image modeling are the most common) that allow algorithms to learn relevant features without annotations or feedback.
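As a concrete illustration of the masked-modeling idea, the model learns by predicting hidden pieces of its own input, with no human annotation in the training objective. A minimal sketch, assuming the Hugging Face transformers library and a stock BERT checkpoint (the example sentence is mine):

```python
# Minimal sketch of masked language modeling: hide a token, have the model
# predict it. No human labels or feedback are involved in this objective.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "The trunk is at the [MASK] of an elephant."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Find the masked position and read off the model's best guess for it.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = logits[0, mask_pos].argmax().item()
print(tokenizer.decode([predicted_id]))
```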

I didn’t say it was necessary. I just said these systems involve it, because human feedback is all over the existing datasets they are trained on.

You said they “ultimately involve human feedback” (which suggests “necessary” to me…) and suggested human feedback is the reason why these systems can reason about relevant parts of images. Both are straight-up incorrect.