Well, that seems rather crude and crappy. It’s a specific function, so it seems to me that creating a “switch” to control it shouldn’t be that hard. On the other hand, you could hide my knowledge of this on the head of a pin, so I could be way, way off for reasons I don’t even understand. LOL
Yes, you have a very, very inaccurate idea of how these AIs work. It isn’t switches and settings; it is more like standing behind an artist’s shoulder and saying “draw me a car being driven by a Tyrannosaurus rex wearing a top hat”, and then having him try a dozen versions and edit the best one of those a dozen times based on your descriptions. Except the artist is an alien who is only familiar with Earth from seeing pictures. And you can only use words that were printed in the captions of the photos. Like, if you want to see a group of people with large eyes, you get that by yelling “Dan Witz! Margaret Keane!” because the alien has seen the works of Dan Witz (who paints crowds of people) and Margaret Keane (people with large eyes) and can somewhat copy their style. And you have to discover on your own what names get what result, because there is no guide.
Now that I think about it, the alien is a Tamarian.
Ouch! I was hoping for, “Gee, you’re not so bad, Jasmine! :)” LOL
My God, if it can do all of that, why can’t it do, “Draw me a car being driven by a Tyrannosaurus rex wearing a top hat with no written text”?!
Because people don’t label images based on what they don’t have in them. You will find images labeled “car”, images labeled “Tyrannosaurus rex”, images labeled “top hat”, images labeled “driving”. But you probably won’t find many images described as “no text”.
Because it doesn’t actually understand language. It’s an AI. When you type a sentence it creates a bunch of tokens about the words, their order, their relationships with one another, etc., and it relates this information to matching tokens it has that define images, relating the two sets of tokens using a complex and very opaque model.
The phrase “no text” simply doesn’t have a powerful enough association with a specific set of tokens to eliminate all text and anything that looks like text.
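To see why negation is so weak, here is a deliberately crude toy sketch (not the real model, which uses learned embeddings): a prompt gets matched against captioned training images by token association, and “no” is just another token with almost no pull of its own.

```python
# Toy illustration only: matching a prompt to captions by bag-of-words
# overlap. Real systems use learned embeddings, but the key point
# survives: "no" is just another weak token, not a logical operator.
from collections import Counter

def tokenize(text):
    return text.lower().replace(",", "").split()

def overlap_score(prompt, caption):
    # Count tokens the prompt and caption share.
    p, c = Counter(tokenize(prompt)), Counter(tokenize(caption))
    return sum((p & c).values())

prompt = "tyrannosaurus rex, top hat, no text"

# The caption containing "text" still scores a point from the word
# "text" in the prompt -- the "no" in front of it cancels nothing.
score_sign = overlap_score(prompt, "a sign with fancy text")          # 1
score_rex = overlap_score(prompt, "a tyrannosaurus rex in a top hat") # 4
```

So asking for “no text” can actually pull *toward* text-bearing images, because the word “text” is now in the prompt.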
If @Jasmine would like to aid the state of the art, they can download a few training datasets and label a few million images as having text or no text. Then we can train the next generation of AI on this enhanced dataset.
Isn’t this already pretty much automatable to a reasonable degree of accuracy? My phone has simple text detection on it.
I’ll (not “we”) get right on it. LOL
Sure, something could be added onto the process to cull and retry if it detected text in the output, but what if you want no bananas in the image, or no gerbils, or no sand, or no water, etc.?
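For what it’s worth, the cull-and-retry idea is a trivial wrapper to write; the hard part is exactly what’s described above, namely needing a reliable detector for each unwanted thing. A minimal sketch, where both `generate` and `contains_text` are hypothetical stand-ins for a real image generator and a real OCR/text detector:

```python
# Sketch of a post-hoc cull-and-retry wrapper. generate() and
# contains_text() are hypothetical stand-ins, not a real API.
import random

def generate(prompt, seed):
    # Stand-in for a real image generator; returns a fake "image"
    # that randomly does or doesn't contain text.
    rng = random.Random(seed)
    return {"prompt": prompt, "has_text": rng.random() < 0.5}

def contains_text(image):
    # Stand-in for running a text detector on the output image.
    return image["has_text"]

def generate_without(prompt, unwanted_detector, max_tries=10):
    for seed in range(max_tries):
        image = generate(prompt, seed)
        if not unwanted_detector(image):
            return image
    return None  # gave up: every attempt contained the unwanted feature

clean = generate_without("a t. rex in a top hat", contains_text)
```

The same loop works for “no bananas” or “no gerbils” only if you can supply `unwanted_detector` for bananas or gerbils, which is the catch.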
It would, I suppose, be possible to train one of these algorithms to be sensitive to instructions about what should not be in the output, possibly by using adversarial methods; it’s just that this doesn’t seem to have been a priority for anyone.
Actually, Stable Diffusion does successfully do that, at least in some cases. I use Mark Ryden in a lot of prompts, and Mark Ryden uses frames as integral parts of much of his art. A large percentage of the time, Stable Diffusion puts frames around images when his name is in the prompt. Adding a second, negative prompt with “frames” at weight -1 works for removing the frames.
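As I understand it, in common Stable Diffusion implementations a negative prompt works by replacing the empty “unconditional” prompt in the classifier-free guidance step, so each denoising step is pushed away from whatever the negative prompt describes. A toy sketch of just that arithmetic, with made-up numbers standing in for the network’s noise predictions:

```python
# Conceptual sketch of classifier-free guidance with a negative prompt.
# The "noise predictions" below are invented toy numbers; real ones
# come from the denoising network at every sampling step.
import numpy as np

def guided_noise(eps_positive, eps_negative, guidance_scale=7.5):
    # Standard CFG update: start from the negative-prompt prediction
    # and push toward the positive-prompt prediction, so the sampler
    # moves away from the negative concept ("frames").
    return eps_negative + guidance_scale * (eps_positive - eps_negative)

eps_pos = np.array([0.2, 0.8])  # toy prediction for the main prompt
eps_neg = np.array([0.5, 0.5])  # toy prediction for "frames"
print(guided_noise(eps_pos, eps_neg))  # -> [-1.75  2.75]
```

That’s why a negative prompt only works when the model already has a concept for the token (like ‘frame’); you can’t push away from something it never learned.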
But that’s probably because the token ‘frame’ was a tag in enough of the training data for the algorithm to learn what a frame is.
Obviously the token ‘text’ or ‘writing’ must explicitly appear in a lot of the training data when the theme of the image was text (e.g., ‘a sign with fancy text’), but there will have been a much larger set of training images that contained writing without being tagged as such: ‘a can of beans’, ‘a stop sign’, ‘a movie poster for the film Citizen Kane’, ‘a tourist information kiosk’, etc. Most images containing text probably don’t have ‘text’ in the metadata, or indeed any consistent indicator like that.
[Moderating]
Nobody suggested that you were a “we”, and taking offense to an imagined slight is not productive for FQ. Knock it off.