No, I made a statement about what is, not what ought to be. That you interpreted it that way is, however, understandable.
That you continue to harp on it, after I’ve corrected you, is not. I’m outright telling you I was not talking about “necessary”.
I’m talking about the pre-existing training datasets and the initial pre-training as discussed by OpenAI themselves, not the subsequent algorithms used.
No, I was talking about what is. I don’t really have any opinion on what ought to be in generative modeling.
Leaving that aside, you’re factually incorrect about human feedback being the cause of this phenomenon. CLIP and Dall-E don’t use human feedback. They don’t even use labels.
CLIP stands for “contrastive language-image pretraining.” Basically they scraped a bunch of image/caption pairs from the internet and then trained two models, one for images and one for captions, to produce embeddings that are close together when a given caption belongs to that image and far apart when it comes from a different image (this is the contrastive part). This creates a shared embedding space for both visual and language information. So when you pass a prompt to Dall-E, it converts that prompt into an embedding that makes sense as an image, and the diffusion model uses it as a guide.
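To make the contrastive part concrete, here’s a minimal sketch in PyTorch. The tiny encoders, dimensions, and names are my own stand-ins, not OpenAI’s code; the point is just that matching image/caption pairs get pulled together in the shared space and mismatched pairs get pushed apart.

```python
# Toy sketch of CLIP-style contrastive pretraining (illustrative only;
# encoder architectures, sizes, and names here are assumptions, not OpenAI's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in for an image or text encoder that outputs a fixed-size embedding."""
    def __init__(self, in_dim, embed_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

image_encoder = TinyEncoder(in_dim=3 * 32 * 32)  # pretend images are flattened 32x32 RGB
text_encoder = TinyEncoder(in_dim=300)           # pretend captions are 300-d text features

def clip_loss(images, captions, temperature=0.07):
    img_emb = image_encoder(images)               # (B, D)
    txt_emb = text_encoder(captions)              # (B, D)
    logits = img_emb @ txt_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(len(images))           # matching pairs lie on the diagonal
    # Pull matching image/caption pairs together, push mismatched pairs apart,
    # symmetrically over rows (image -> caption) and columns (caption -> image).
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# One toy step on random "data" just to show the shapes line up.
images = torch.randn(8, 3 * 32 * 32)
captions = torch.randn(8, 300)
loss = clip_loss(images, captions)
loss.backward()
print(loss.item())
```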
You might think the caption is just a label. Not really. There’s an incredibly large number of possible captions for any image. Pandas at a zoo might be “pandas at a zoo” or “people looking at animals in cages” or “a sunny day outside” or whatever. It’s also very noisy: you might scrape text from elsewhere on a webpage instead of a direct caption. Does the text directly comment on the image? At internet scale it doesn’t really matter, since it will tend to correlate with the image.
My main point is that knowing how to create an image of “red ball on top of a green cube” does not require a human to look at the output of a model and give it feedback (this is very common now in NLP, although AI feedback is becoming a thing instead: https://www.anthropic.com/constitutional.pdf), or even to label images of red balls and green cubes.
That still just sounds like labels, to me. The word “label” doesn’t seem to imply that every image has one and only one. Indeed, it’s hard to imagine any meaningful way of associating text with an image that would only connect one text string to each image.
And yeah, with the current generation of AIs, all of the human intervention (in attaching the text strings to the images) happened before the AI was programmed.
Labels would be something like a description of what’s in the image, which pixels go with which thing, etc. With scraped captions you’re going to get images where the only associated text is something like the date the photo was taken.
You can do pure image generation without labels for sure (thispersondoesnotexist.com looks like it’s been taken down, but you may recall face generation), but I’m not aware of text-to-image research done without paired examples.
DALL-E was revealed by OpenAI in a blog post in January 2021, and uses a version of GPT-3 modified to generate images.
Do the GPT-3 authors know they weren’t using labels? Because they certainly seem to think they were:
Our training procedure consists of two stages. The first stage is learning a high-capacity language model on a large corpus of text. This is followed by a fine-tuning stage, where we adapt the model to a discriminative task with labeled data.
My emphases.
That’s my understanding - but that doesn’t mean no human intervention at all.
Finetuning is not necessary for functionality; it improves performance.
Ok, so when you said image generation models “ultimately use human feedback” you didn’t mean a human was giving feedback to that model and when you said “humans keep picking the right front bits” you meant that a human taught GPT3 (a different model) how to speak better English?
Again with the “necessary” - that nobody but yourself has introduced into this.
No, which is why I said “ultimately”, not “constantly” or “immediately” or any other word that might be misunderstood that way.
It’s more than “speak better English”.
You clearly have no interest in responding to what I actually wrote, and have repeatedly elaborated on, but only the strawman version of it you initially jumped on.
When you then show no understanding that GPT is not, in fact “a different model”, but underlies DALL-E, I don’t see the point of responding to you any more.
Constantly or immediately wouldn’t be necessary. MW gives “eventually” as a definition of “ultimately,” which is how I understood you: that feedback happens at some point. Which is false. You’re
The current version of Dall-E isn’t even the autoregressive model anymore. It’s diffusion based.
It’s not really more than speaking better English. Base GPT-3, trained without human labeling at all, does next-token prediction. The finetuning you’re referring to is to get it to perform better at tasks like question answering, or “speaking better English.”
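To illustrate what I mean by the base objective, here’s a toy sketch (a stand-in embedding plus linear head, not GPT-3’s actual transformer): the training “labels” are just the next tokens of the raw text itself, so no human annotation enters at this stage.

```python
# Rough sketch of the base (pre-finetuning) objective: next-token prediction on raw text.
# The whitespace tokenizer and tiny "model" here are stand-ins, not GPT-3 itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

text = "a red ball on top of a green cube"
vocab = sorted(set(text.split()))
stoi = {w: i for i, w in enumerate(vocab)}
tokens = torch.tensor([stoi[w] for w in text.split()])

embed = nn.Embedding(len(vocab), 32)
lm_head = nn.Linear(32, len(vocab))

# The target for each position is simply the next token in the same raw text --
# no human ever annotated anything.
inputs, targets = tokens[:-1], tokens[1:]
logits = lm_head(embed(inputs))
loss = F.cross_entropy(logits, targets)
loss.backward()
print(loss.item())
```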
The entire point of my responses has been that your assertion that compositionality is due to humans labeling images to show “front” is false. That’s directly responding to what you actually wrote.