Generating AI images has become quite popular recently, and there are some beautiful and creative images out there, made by, for example, Midjourney.
What I wonder is if there are some implementations that show exactly which images are used as a source for a particular image. I wonder how many images are used, or if there are so many that it is impossible to determine.
Some of the images I have seen look so familiar that I wonder if it is essentially plagiarizing another image, just with some minor modifications.
AI art is generated in such a way that it is not clear what steps the program has taken to solve one particular task, but the whole underlying training data set can be known. Perhaps this is what you are looking for.
The AI doesn’t access specific images, it accesses the set of defining characteristics that it decides describe a specific concept. It only sees the originals during training. For instance, Stable Diffusion was trained with around half a billion images, but if you download a local copy of the program, its database is only around 5 GB. Around 10 bytes per image. Images aren’t in there.
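That 10-bytes figure is just the checkpoint size divided by the image count. A quick back-of-envelope check, using the approximate numbers above:

```python
# Back-of-envelope: how much model capacity exists per training image?
# Approximate figures from the post: ~5 GB checkpoint, ~half a billion images.
model_size_bytes = 5 * 10**9
num_training_images = 5 * 10**8

bytes_per_image = model_size_bytes / num_training_images
print(bytes_per_image)  # 10.0 -- nowhere near enough to store even a thumbnail
```

For comparison, even a heavily compressed JPEG thumbnail runs to thousands of bytes, so whatever the model retains per image, it isn't the image.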
If I ask it to create an image of Super Mario, I get a pretty accurate depiction. It might be that there are some pictures of other people called Mario in there as well, but that seems to have a minimal impact.
I get that it is hard to reverse engineer because all this training is done in advance, but there must be some pictures that have a lot more impact than others.
And I think it is probably more a matter of the number of images that the AI can study. If there are 10,000 images with Mario the game character, the system is going to have a much better idea of what defines an image of him than it does from the 10 images of Mario the real-estate agent in Tacoma, Washington.
For example, try making some images involving Yoda in Stable Diffusion. They will be pretty well done, so well that you might wonder if they are partially copying real images. Then try making some images involving Jabba the Hutt. You will find them to be…less well done. You will not wonder if any real images were copied. Some generated images are just a fat man with no shirt. My conclusion from that is that Yoda images are much more popular in the pool of training images, so the AI has a much better idea of Yodaness than Jabbaness.
My suspicion there is that the AIs have some sort of bias, either deliberate or emergent, that causes them to favor humans. This mostly improves results, because a lot of people asking for the images want humans in them. Yoda isn’t human, but he’s at least human-ish, so the AIs can handle him, too. But Jabba the Hutt is much less human-ish, and so when the AIs try to make him more human, you end up with things like fat men.
Stable Diffusion was trained off three massive datasets collected by LAION, a nonprofit whose compute time was largely funded by Stable Diffusion’s owner, Stability AI.
All of LAION’s image datasets are built off of Common Crawl, a nonprofit that scrapes billions of webpages monthly and releases them as massive datasets. LAION collected all HTML image tags that had alt-text attributes, classified the resulting 5 billion image-pairs based on their language, and then filtered the results into separate datasets using their resolution, a predicted likelihood of having a watermark, and their predicted “aesthetic” score (i.e. subjective visual quality).
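A minimal sketch of that filtering pipeline. The field names and thresholds here are hypothetical stand-ins; LAION's actual cutoffs vary by dataset:

```python
# Sketch of the LAION-style filtering described above.
# Field names and thresholds are hypothetical; real cutoffs differ by dataset.
def keep_image_pair(pair):
    """Decide whether a scraped (image, alt-text) pair makes it into a dataset."""
    if not pair["alt_text"]:                         # only <img> tags with alt-text
        return False
    if pair["width"] < 256 or pair["height"] < 256:  # resolution filter
        return False
    if pair["p_watermark"] > 0.5:                    # predicted watermark likelihood
        return False
    if pair["aesthetic_score"] < 5.0:                # predicted "aesthetic" quality
        return False
    return True

good = {"alt_text": "Super Mario jumping and hitting a block",
        "width": 512, "height": 512, "p_watermark": 0.1, "aesthetic_score": 6.2}
print(keep_image_pair(good))  # True
```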
Later on, the article breaks down where the images came from, and basically anywhere with images closely associated with text will get hoovered up: Pinterest, Wikipedia, Flickr, Deviantart, Tumblr, stock photo websites, and online shopping sites. Obviously, a huge percentage of the images scraped in this manner are protected by copyright, but it’s still an open question if what is happening here actually violates copyright. Presumably Common Crawl adheres to the DMCA and will remove any of your images from its massive datasets, as long as you can find them and fill out the paperwork.
So if you post a picture online of Super Mario jumping in the air hitting a block, with a description attached to the image, that will get cataloged and fed into the AI model and now it “knows” a little more about what Super Mario and jumping and blocks look like. Repeat that a few billion times and now the model knows a lot about what a lot of things look like, but there’s probably more pictures of Yoda than Jabba the Hutt, so one will get rendered a lot more readily than another.
And also, a lot of these newest models have additional models built in that make them better at creating realistic faces where faces belong, or realistic perspective where perspective belongs.
What all these models are doing is trying to learn a probability distribution that resembles the distribution of their training data. Drawing an image with a prompt, input image, or both is drawing a sample from the conditional distribution. The models do not take in a prompt, query some database, and mush the results together. There is no real source for a generated image.
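As a toy illustration of "sampling from a conditional distribution" rather than looking anything up: imagine the "model" is nothing but a mean and spread learned per concept, and every generation is a fresh draw from it. All the numbers here are invented:

```python
import random

# Toy "model": one conditional distribution per concept (numbers invented).
# A well-represented concept ends up with a tighter distribution (lower stddev).
learned = {
    "yoda":  (0.8, 0.05),   # many training images -> confident
    "jabba": (0.3, 0.30),   # few training images  -> fuzzy
}

def generate(prompt):
    mean, std = learned[prompt]
    return random.gauss(mean, std)   # a fresh sample, not a stored image

random.seed(42)
print(generate("yoda"), generate("jabba"))
```

Every call produces a new image-stand-in; no training example is ever retrieved, yet the outputs cluster around what the training data looked like.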
Now, you could embed your query and find the closest data point in the dataset, or even the k images whose mean is closest to your embedding. However, I don’t think that means they’re the source of the generated image. Even a very different image will change the weights of the neural networks and is, in some sense, a source of all the generated images.
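That nearest-neighbor lookup is easy to sketch. The embeddings below are random stand-ins; a real system would embed images and queries with something like CLIP:

```python
import numpy as np

# Sketch: retrieve the k dataset images whose embeddings best match a query.
# Embeddings are random stand-ins for real (e.g. CLIP) vectors.
rng = np.random.default_rng(0)
dataset = rng.normal(size=(1000, 64))   # 1000 hypothetical image embeddings
query = rng.normal(size=64)             # embedding of the generated image

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sims = np.array([cosine(v, query) for v in dataset])
k = 5
nearest = np.argsort(sims)[-k:][::-1]   # indices of the k most similar images
print(nearest)
```

This finds the training images most *similar* to an output, which is not the same as finding its causes: every image nudged the weights a little.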
So the image of Mario that I generate often is a 3D Mario from one specific angle. It is sharp and looks almost exactly like official art. How is the training affected by what must be lots of images in different styles, angles, positions, sizes etc.?
I wonder what will happen when the number of AI generated images used for training surpasses the number of “real” images and how this may affect the quality of the subsequent images. Some interesting loops could emerge. If those images are ever used to train computers to be sentient (something I doubt is possible at all, but I also did not believe 10 years ago computers could translate at an acceptable level and now they do, so I may be wrong) how will they affect the computer’s worldview?
No one really understands how it works. It’s too complicated, and the training has too subtle an influence.
It must, though, in some fashion come up with “vectors” for the source imagery. Consider a simpler example: a human face. We can recognize various parameters at play, like the spacing of the eyes, the color of the eyes/hair/skin, the width of the nose, the roughness of the skin, and so on. In principle, with enough work, we could parameterize every face into a handful of numbers.
Some of these parameters might be difficult to tease apart. As a made-up example, maybe the ratio between forehead width and ear length is important. A computer being trained with sufficient data would learn the relationship, but it might be just too subtle an effect for a person to notice.
We can generalize the idea to more than just faces. At a very high level, we can recognize the difference between photos and oil paintings and pencil drawings. Each of these have low-level details unique to them, like brush strokes or pencil marks. With enough training, though, the style just becomes another parameter.
Mimicking an artist is also likely to be just another vector in its database. Maybe we think of some artists as having a unique style, but in practice probably everyone can be decomposed into “20% of parameter 424, 30% of parameter 6432, etc.”.
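In toy form, "decomposing" a style is just expressing it as a weighted mix of basis vectors. The dimensions and weights below are made up to echo the "20% of parameter X" idea:

```python
import numpy as np

# Four hypothetical style "parameters" as basis vectors.
basis = np.eye(4)

# An artist's style as a weighted mix of those parameters (weights invented):
artist_style = 0.2 * basis[0] + 0.3 * basis[1] + 0.5 * basis[3]
print(artist_style)  # [0.2 0.3 0.  0.5]
```

A real model's "parameters" number in the billions and don't line up with anything nameable, but the principle of mixing shared components is the same.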
These vectors may be very alien to us; that is, there’s no direct analogy to “this artist paints in watercolor” or “this artist likes snowy landscapes”. They’re collections of numbers that when activated combine to produce the kinds of things that artist produces.
And the same goes for specific objects, like the 3D Mario you mentioned. A human would decompose it to “round head, bulbous nose, wide moustache, red cap with M on front, etc.” That’s probably not how the computer is breaking it down, but the net result is the same in that you can describe something with far less data than a raw image would take.
I’d like to repeat Darren_Garrison’s point that each image only contributes about 10 bytes. Of course, there is no specific 10 bytes that an image corresponds to; it’s really just a subtle influence that’s spread out over the entire neural net. Still, it demonstrates that it can’t possibly be storing the source imagery; there’s just not nearly enough storage for that. It must be decomposing the elements in a way that they can be reconstructed later in a sensible fashion. No one knows precisely how, though.
It brings to mind an old story about a researcher working with evolving circuit designs, where the result worked, but not only did it not make sense, it also didn’t work on other, supposedly identical, chips, and apparently depended on subtle physical variations in that one specific test chip.
Something did occur to me, though. It probably doesn’t heavily weight one copy of one specific image, but (as anyone who has used Google Images knows) some images are found on multiple websites, but often with some alterations, such as changing the resolution, cropping it differently, adding text, tweaking the contrast, etc. Google Images knows that they are similar enough that they are probably “the same”, but I don’t know how the image scrapers and AI training deal with it. There could be some specific press release image that has more influence because it is in the dataset multiple times and each one is treated as a different image.
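De-duplication tools typically handle that with perceptual hashes: tiny fingerprints that stay (nearly) identical across resizing, recompression, and small contrast or brightness tweaks, so near-copies collide. A minimal "average hash" sketch, on a made-up 8x8 toy image:

```python
# Minimal "average hash": a 64-bit fingerprint that survives small edits,
# so near-duplicate copies of the same press image hash to (nearly) the same value.
def average_hash(pixels):
    """pixels: an 8x8 grid of grayscale values (0-255), i.e. a downscaled image."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)

def hamming(h1, h2):
    return sum(a != b for a, b in zip(h1, h2))

img = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]   # toy gradient image
tweaked = [[min(255, v + 10) for v in row] for row in img]      # brightness tweak

print(hamming(average_hash(img), average_hash(tweaked)))  # 0 -- fingerprints match
```

Whether the big scraping pipelines actually collapse such duplicates before training, or count each copy separately, is exactly the open question here; if they don’t, a widely mirrored press image really would carry extra weight.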