DALL-E 2 random text on graphics

I have been experimenting with using DALL-E 2 to design a logo. I gave it a description, and it gave me an image with random-looking text superimposed on it. For example, I said, “Design an impressionistic logo for a jazz guitarist” and it gave me a nice image with “Juszt Gister” written at the bottom. Others had “Jigin Jun” and “Pasti Pistor Pation”. Any idea what is going on here? It would be reasonable to use “Lorem Ipsum” style placeholder text if you could just replace it, but in some cases it’s overlaid on the image so you can’t just Photoshop it out.

DALL-E “recognizes” that logos have graphical elements humans would recognize as text. But it doesn’t know what text really is, so it does its best to create random blobs that sometimes look like letters.

Our own @Mangetout just put out a video yesterday on this subject:

Endorsed.

DALL-E doesn’t generate text. It generates pixels such that the result looks like it could be part of the data distribution of the training set. Training data has text in it, but the model isn’t in any way generating a logo and then superimposing relevant text on top.

Why is the text gibberish? The model isn’t optimized for that. You often see other artifacts in generative AI images. It’s just that text artifacts immediately stand out. Interestingly, with big enough models, generative AI starts generating good text: https://parti.research.google/.
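For anyone who wants to poke at this themselves, here’s a rough sketch of the same idea using the open-source diffusers library (Stable Diffusion rather than DALL-E, since DALL-E’s internals aren’t public; the model ID and prompt are just examples). The whole image, pseudo-text included, falls out of a single denoising loop conditioned on the prompt; there is no separate step that decides what the lettering should say.

```python
# Minimal text-to-image sketch with Hugging Face diffusers.
# Stable Diffusion is used as a public stand-in for DALL-E; model ID and
# prompt are just examples.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The "logo" is produced by iteratively denoising random noise toward the
# prompt embedding; any letter-like shapes come out of that same process.
image = pipe("an impressionistic logo for a jazz guitarist").images[0]
image.save("jazz_logo.png")
```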

You can upload the image into the editor, erase the words, and give it a new prompt to have it fill the area back in. Just avoid terms like “logo” or anything else that might imply words or writing.
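If you’d rather script that than click around in the web editor, the same erase-and-refill operation is exposed through the DALL-E 2 image-edit API. A rough sketch (file names and the prompt are placeholders, and you’d need your own API key):

```python
# Rough sketch of the erase-and-refill workflow via the OpenAI image-edit API.
# File names and the prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

result = client.images.edit(
    model="dall-e-2",
    image=open("logo.png", "rb"),   # the original image
    mask=open("mask.png", "rb"),    # transparent where the text should be replaced
    prompt="impressionistic jazz artwork, abstract brushwork, no lettering",
    n=1,
    size="1024x1024",
)
print(result.data[0].url)
```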

This applies to basically everything made by these AI-engine “artists.” They’re notorious, for example, for making hands that are recognizably “handlike” but have random numbers of fingers and other bizarre problems. That’s because they don’t know what “a hand” is, in the abstract. They just know, after digesting hundreds of thousands of images, that there is a collection of shapes and colors in this location in the overall image which conforms to certain parameters, so they manufacture a blob conforming to those parameters. Those parameters do not include “a hand has five fingers” because the vagaries of perspective and composition do not allow that to be generalized as a rule. Hence, these nightmares:

Interesting. I would think if I asked for a logo for a band named “Jazz Band” that it would be smart enough to understand that “Jazz Band” is text and that’s what should be in the image. I am not very familiar with the internals of how these things work. A lot has changed since I studied AI in 1978.

It doesn’t ‘know’ anything. It’s just that images in its training set are much more likely to have text-shaped objects in them if their metadata says they’re logos. So when you ask for a logo, it’s going to give you something with text-shaped objects because that’s what logos look like.

There’s no logic involved; it just turns your prompt into a vector representation and then uses that to guide the diffusion process toward an image. Modern AI is extremely data-driven; things like expert systems are out of fashion.
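To make the “vector representation” part concrete, here’s roughly what that step looks like for Stable Diffusion, whose CLIP text encoder is public (used here as a stand-in for whatever DALL-E does internally); the model name and prompt are just examples:

```python
# Sketch: how a prompt becomes the vector representation that guides diffusion.
# Stable Diffusion's CLIP text encoder stands in for DALL-E's private one.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(
    "an impressionistic logo for a jazz guitarist",
    padding="max_length",
    max_length=tokenizer.model_max_length,
    return_tensors="pt",
)
with torch.no_grad():
    embedding = encoder(tokens.input_ids).last_hidden_state

# One vector per token; the denoiser is conditioned on these numbers at every
# step. Nothing here marks "jazz guitarist" as text to be rendered legibly.
print(embedding.shape)  # torch.Size([1, 77, 768])
```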

One does occasionally get actual legible and sensible text. Though that depends on a fair bit of luck.

John Oliver did a bit on AI in his most recent episode. The video below (cued to about 30 seconds before you see the image) shows the AI blurring “Getty Images” in content it creates. Interestingly, the AI is not really “blurring” anything; rather, it was trained on images scraped off the internet, and so many of them carried the “Getty Images” watermark that the AI just figured it was part of the art. Getty is suing over it.

The AIs sometimes manage to integrate something resembling the words in your prompt: “juszt gister”, for example, clearly comes from “jazz guitar”. (Here is one of mine that comes close to writing the word “cotton”.)

You should try using Stable Diffusion at this site. If you have an area you don’t like on an otherwise interesting image, you can mark that area and have the AI redo just that area (this is called “inpainting”).
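For anyone who wants to do the same thing locally rather than on that site, the diffusers library ships an inpainting pipeline that takes the original image plus a mask of the area to redo. A rough sketch (model ID, file names, and prompt are just examples):

```python
# Rough sketch of inpainting with diffusers: regenerate only the masked region.
# Model ID, file names, and the prompt are just examples.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

original = Image.open("logo.png").convert("RGB").resize((512, 512))
mask = Image.open("mask.png").convert("RGB").resize((512, 512))  # white = redo this area

result = pipe(
    prompt="abstract impressionistic brushwork",
    image=original,
    mask_image=mask,
).images[0]
result.save("logo_inpainted.png")
```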

I got a popup from your link: “This creation has been flagged by the automoderator and has not yet been reviewed by a human moderator.”

Huh. That set isn’t even made public.

Me too, and I just clicked “Show anyway”. I am guessing it got flagged for “shit.”

I recently got an interesting section of an image while asking Stable Diffusion for funny cat memes. I trimmed out that portion and tried expanding the sides with more of the same content (excluding the cat with a towel on its head). When I ask for rusty cans surrounded by fruit, Stable Diffusion keeps trying to write the words “rusty” and “cans” on the cans. (And cover them with fruit.)

Sometimes the interpretation of the prompt seems to be weighted toward more plausible scenes.

I asked for ‘A giant number three, shouting at kids’ and got pictures of a kid shouting in front of a big but otherwise ordinary numeral 3. I think that’s because, in a normal English sentence, it’s more likely that the kid would be the one doing the shouting.

In the same way, with ‘rusty cans’ the interpretation might have been skewed by oil cans, baked bean cans, coke cans, soup cans, etc., where the description of the can is also about its contents (and by extension, its labelling); the algorithm perhaps interpreted ‘rusty can’ as a can of something called ‘rusty’.

Can you just “shut off” the text part of the logo creation process?

No. AI image generation is more “want what you get” than “get what you want”. The best that you can do is try to aim it in the right direction. (You could try putting “text” as a negative prompt, but I doubt that would work.)
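For what it’s worth, with Stable Diffusion the negative prompt is an actual argument rather than something you type into the prompt itself; it only biases the sampling, so there’s no guarantee it kills the pseudo-text. A rough sketch (model ID and prompts are just examples):

```python
# Sketch: steering Stable Diffusion away from lettering with a negative prompt.
# This only biases the sampling; it may not fully suppress the pseudo-text.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "an impressionistic logo for a jazz guitarist",
    negative_prompt="text, letters, words, watermark, signature",
).images[0]
image.save("jazz_logo_no_text.png")
```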