No, I made a statement about what is, not what ought to be. That you interpreted it that way is, however, understandable.
That you continue to harp on it, after I’ve corrected you, is not. I’m outright telling you I was not talking about “necessary”.
I’m talking about the pre-existing training datasets and the initial pre-training as discussed by OpenAI themselves, not the subsequent algorithms used.
No, I was talking about what is. I don’t really have any opinion on what ought to be in generative modeling.
Leaving that aside, you’re factually incorrect about human feedback being the cause of this phenomenon. CLIP and Dall-E don’t use human feedback. They don’t even use labels.
CLIP stands for “contrastive language-image pretraining.” Basically they scraped a bunch of image/caption pairs from the internet and then trained two models, one for images and one for captions, to produce embeddings that are close together when a given caption belongs to that image and far apart when it comes from a different image (this is the contrastive part). This creates a shared embedding space for both visual and language information. So when you pass a prompt to Dall-E, it converts that prompt into an embedding that makes sense as an image, and the diffusion model uses it as a guide.
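To make the contrastive part concrete, here’s a minimal sketch in PyTorch. The tiny encoders, dimensions, and names are my own stand-ins, not OpenAI’s code; the point is just that matching image/caption pairs get pulled together in the shared space and mismatched pairs get pushed apart.

```python
# Toy sketch of CLIP-style contrastive pretraining (illustrative only;
# encoder architectures, sizes, and names here are assumptions, not OpenAI's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in for an image or text encoder that outputs a fixed-size embedding."""
    def __init__(self, in_dim, embed_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

image_encoder = TinyEncoder(in_dim=3 * 32 * 32)  # pretend images are flattened 32x32 RGB
text_encoder = TinyEncoder(in_dim=300)           # pretend captions are 300-d text features

def clip_loss(images, captions, temperature=0.07):
    img_emb = image_encoder(images)               # (B, D)
    txt_emb = text_encoder(captions)              # (B, D)
    logits = img_emb @ txt_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(len(images))           # matching pairs lie on the diagonal
    # Pull matching image/caption pairs together, push mismatched pairs apart,
    # symmetrically over rows (image -> caption) and columns (caption -> image).
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# One toy step on random "data" just to show the shapes line up.
images = torch.randn(8, 3 * 32 * 32)
captions = torch.randn(8, 300)
loss = clip_loss(images, captions)
loss.backward()
print(loss.item())
```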
You might think the caption is just a label. Not really. There’s an incredibly large number of possible captions for any image. Pandas at a zoo might be “pandas at a zoo” or “people looking at animals in cages” or “a sunny day outside” or whatever. It’s also very noisy: you might scrape text from elsewhere on a webpage instead of a direct caption. Does the text directly comment on the image? At internet scale it doesn’t really matter, since it will tend to correlate with the image.
My main point is that knowing how to create an image of “red ball on top of a green cube” does not require a human to look at the output of a model and give it feedback (this is very common now in NLP, although AI feedback is becoming a thing instead: https://www.anthropic.com/constitutional.pdf), or even to label images of red balls and green cubes.
That still just sounds like labels, to me. The word “label” doesn’t seem to imply that every image has one and only one. Indeed, it’s hard to imagine any meaningful way of associating text with an image that would only connect one text string to each image.
And yeah, with the current generation of AIs, all of the human intervention (in attaching the text strings to the images) happened before the AI was programmed.
Labels would be something like a description of what’s in the image, which pixels go with which thing, etc. With scraped captions you’re going to get images where the only associated text is something like the date the photo was taken.
You can do pure image generation without labels for sure (thispersondoesnotexist.com looks like it’s been taken down, but you may recall face generation), but I’m not aware of text-to-image research done without paired examples.
DALL-E was revealed by OpenAI in a blog post in January 2021, and uses a version of GPT-3 modified to generate images.
Do the GPT-3 authors know they weren’t using labels? Because they certainly seem to think they were:
Our training procedure consists of two stages. The first stage is learning a high-capacity language model on a large corpus of text. This is followed by a fine-tuning stage, where we adapt the model to a discriminative task with labeled data.
My emphases.
That’s my understanding - but that doesn’t mean no human intervention at all.
Finetuning is not necessary for functionality; it improves performance.
Ok, so when you said image generation models “ultimately use human feedback” you didn’t mean a human was giving feedback to that model and when you said “humans keep picking the right front bits” you meant that a human taught GPT3 (a different model) how to speak better English?
Again with the “necessary” - that nobody but yourself has introduced into this.
No, which is why I said “ultimately”, not “constantly” or “immediately” or any other word that might be misunderstood that way.
It’s more than “speak better English”.
You clearly have no interest in responding to what I actually wrote, and have repeatedly elaborated on, but only the strawman version of it you initially jumped on.
When you then show no understanding that GPT is not, in fact “a different model”, but underlies DALL-E, I don’t see the point of responding to you any more.
Constantly or immediately wouldn’t be necessary. MW gives “eventually” as a definition of “ultimately,” which is how I understood you: that feedback happens at some point. Which is false. You’re
The current version of Dall-E isn’t even the autoregressive model anymore. It’s diffusion based.
It’s not really more than speaking better English. Base GPT-3, trained without human labeling at all, does next-token prediction. The finetuning you’re referring to is to get it to perform better at tasks like question answering, or “speaking better English.”
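To illustrate what I mean by the base objective, here’s a toy sketch (a stand-in embedding plus linear head, not GPT-3’s actual transformer): the training “labels” are just the next tokens of the raw text itself, so no human annotation enters at this stage.

```python
# Rough sketch of the base (pre-finetuning) objective: next-token prediction on raw text.
# The whitespace tokenizer and tiny "model" here are stand-ins, not GPT-3 itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

text = "a red ball on top of a green cube"
vocab = sorted(set(text.split()))
stoi = {w: i for i, w in enumerate(vocab)}
tokens = torch.tensor([stoi[w] for w in text.split()])

embed = nn.Embedding(len(vocab), 32)
lm_head = nn.Linear(32, len(vocab))

# The target for each position is simply the next token in the same raw text --
# no human ever annotated anything.
inputs, targets = tokens[:-1], tokens[1:]
logits = lm_head(embed(inputs))
loss = F.cross_entropy(logits, targets)
loss.backward()
print(loss.item())
```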
The entire point of my responses has been that your assertion that compositionality is due to humans labeling images to show “front” is false. That’s directly responding to what you actually wrote.