Improving Machine Elf's Monitor of Very Large but Finite Possibility… for humans

Quick version: how can we tweak the presentation of random images to maximize the probability that a human viewer will ‘see something’ in the randomness?

Machine Elf asked how many images a monitor of given resolution and color depth could theoretically display - the answer is a cosmically large but finite number. The vast majority of these images look like noise, but of course the full set also includes every still frame from every possible movie of arbitrary length, the text of every possible text in every Earth language, etc. It's a version of the Library of Babel, but with exponentially more 'unreadable' junk in it.
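To put a number on 'cosmically large', here's a quick back-of-the-envelope sketch (assuming, for illustration, a 1920x1080 monitor at 24 bits per pixel):

```python
# Back-of-the-envelope: number of distinct images on a 1920x1080, 24-bit display.
# Each pixel independently takes one of 2**24 colors, so the count is
# (2**24) ** (1920 * 1080) -- too big to print, so report its digit count instead.
import math

pixels = 1920 * 1080
digits = int(pixels * 24 * math.log10(2)) + 1
print(f"about 10^{digits:,} possible images")  # roughly 10^14,981,179
```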

Humans are pattern-seeking creatures: we see bunnies in clouds, Jesus on toast, tigers in the motion of leaves, that sort of thing - pareidolia is the usual name for it. We pick out edges and movement better than subtle differences in color. We seem to be optimized to respond to images that match the visual statistics of natural scenes. And we aren't great at differentiating one image of visual snow from another.

So, if I wanted to mess with Machine Elf’s monitor in order to maximize the chances that a given human observer would ‘see something’ in a given image, what would be the best way to do that, and what would the resulting ‘somethings’ be?

For example, I suspect that making the images greyscale rather than full 24-bit color would ease the load on the human visual system. Maybe showing two, three, or four images in rapid succession would trick the eye into detecting a shape or motion in the aggregate persistence-of-vision result (a sketch of that idea follows below). Or maybe the images should be strobed at the limit of perception, forcing the viewer to make a judgement from memory rather than being able to examine the image in detail.
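Here's a minimal sketch of the rapid-succession idea, assuming NumPy and Pillow (the frame count and size are arbitrary choices, not tuned values): average a few greyscale noise frames as a crude stand-in for what persistence of vision would blend together.

```python
# Blend a few greyscale noise frames, as the eye might when they are
# flashed in rapid succession (frame count and size are arbitrary).
import numpy as np
from PIL import Image

rng = np.random.default_rng()
frames = rng.integers(0, 256, size=(3, 256, 256))   # three 8-bit noise frames
blended = frames.mean(axis=0).astype(np.uint8)      # crude persistence-of-vision proxy
Image.fromarray(blended, mode="L").save("blended_noise.png")
```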

We could tweak the statistics of the images by starting with a lower-resolution random ‘seed’ and then expanding that seed to fill the entire test image using an algorithm that matches the visual behavior of the natural world. The natural world has a lot of gradients in it rather than black pixels right next to white pixels, for example. Images made up of short overlapping line segments of random length and orientation - like toothpicks thrown across a floor - might also be more likely to evoke a perceived meaningful image. This would sort of mess with the notion of random pixel images, though, and might not be acceptable.
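A minimal sketch of the low-resolution seed idea, again assuming NumPy and Pillow (the 32x32 seed and bicubic upscale are arbitrary choices): the smooth interpolation supplies the gradients that natural scenes have and pixel-level snow lacks.

```python
# Start from a coarse random "seed" and upscale with smooth interpolation,
# so the result has natural-scene-like gradients instead of pixel-level snow.
import numpy as np
from PIL import Image

rng = np.random.default_rng()
seed = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)   # low-res random seed
img = Image.fromarray(seed, mode="L").resize((1024, 1024), Image.BICUBIC)
img.save("blobby_noise.png")
```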

As for results, given the way I think human visual systems are wired, I would expect people to report mostly human faces and/or human figures, then animals or nature scenes, and only rarely structures, technology, or writing. What do others think?

Define “see something.” What constitutes “something” for this? People see things in ink blots.

To see meaningful images, you can't just sample uniformly from the space of all bitmaps - that will almost always give you snow. You could, for example, sample from a space defined by a trained neural network that models some elements of human perception, so that lines, textures, faces, etc. all show up. That way you produce "random images" in a sense that is meaningful for the human visual system.
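One concrete way to do that, as a sketch - assuming PyTorch plus the pretrained progressive GAN that facebookresearch/pytorch_GAN_zoo publishes through torch.hub (a face model, so the "random" samples land in face-space; the first call downloads the weights):

```python
# Sample "random" images from a generative model's latent space instead of
# raw pixel space. Assumes PyTorch and the pretrained PGAN face model that
# facebookresearch/pytorch_GAN_zoo exposes through torch.hub.
import torch

model = torch.hub.load('facebookresearch/pytorch_GAN_zoo:hub', 'PGAN',
                       model_name='celebAHQ-256', pretrained=True, useGPU=False)
noise, _ = model.buildNoiseData(4)   # four random latent vectors
with torch.no_grad():
    images = model.test(noise)       # four face-like "random" images
```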

There's a rough "theorem" of compression theory here: the space of compressed files covers the perceptually distinct images far more efficiently than the raw format does, because compression throws away exactly the bits a viewer wouldn't notice, so far less of the compressed space is wasted on indistinguishable noise.

For example, replace the space of 12-megabyte raw images with the space of valid 1-megabyte JPEG files of the same image dimensions. The latter admits only a tiny subset of the full space, but the images that have "gone missing" are ones a human would have trouble distinguishing from some image in the smaller (JPEG) set. You can study the relationship between the two sets to see which image properties JPEG "thinks" matter to a human viewer: very high spatial frequencies can be blurred, and chroma can be stored at much lower resolution than luminance.

A hypothetical “super” image compression system would know that the shapes of human faces have to be carefully preserved, while the shapes of other animal faces are less important, and the shapes of clouds or hills less important still.

This is roughly what the "DeepDream" AI projects have done: train a network on a library of real pictures, feed it (initially) random images, and let it progressively "enhance" the patterns it inevitably sees in them. The results come out surreal in much the same way real human dreams do, suggesting that this, or something like it, might be part of the mechanism behind real human dreams.

That’s kind of the point of the exercise. People see meaning in ambiguous visual data all the time. If we wanted to present random visual noise to people in such a way as to maximize the percentage of the time that they said they saw ‘something’ (a face, a family pet, a clown on a unicycle, whatever), how would we do it? And what images would people most often report seeing?

Maybe some kinds of noise are better than others. Maybe some quality of the presentation is better than others. Both would depend on the characteristics of the human eye and brain in finding patterns in noise.

My guess is that, under optimal conditions, with noise ‘tuned’ however we want to optimize “aha!” moments, people would predominantly report seeing creepy faces. But that’s just a hunch.

This seems like a promising avenue. JPEG compression is already tailored to accommodate human vision quirks, eliminating data that is lost on the human eye. But overall does compressing an image of noise with JPEG tend to accentuate things that a human eye would snag on, or smooth them out? Maybe there’s a happy medium.
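One way to check, as a sketch assuming NumPy and Pillow (size and quality setting are arbitrary): round-trip a noise image through aggressive JPEG compression and measure how much the codec changed it.

```python
# Round-trip a noise image through low-quality JPEG to see whether the codec
# smooths the snow or adds block artifacts the eye can snag on.
import io

import numpy as np
from PIL import Image

rng = np.random.default_rng()
noise = Image.fromarray(rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8))

buf = io.BytesIO()
noise.save(buf, format="JPEG", quality=10)   # aggressive compression
buf.seek(0)
roundtrip = Image.open(buf)

diff = np.abs(np.asarray(noise, dtype=int) - np.asarray(roundtrip, dtype=int))
print("mean per-channel change:", diff.mean())
```

Saving the round-tripped image and eyeballing it next to the original would answer the accentuate-vs-smooth question directly.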

Instead of rolling randomly on the gigantic roulette wheel, you could generate “random” images according to rules. Say a “drunkard’s walk” algorithm. This will (often but not always) produce patterns that the human mind will likely perceive. Add more “rules” and you’ll get better-defined patterns.
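A minimal drunkard's-walk sketch, assuming NumPy and Pillow (step count and canvas size are arbitrary): a single walker flips pixels white as it wanders, producing a connected squiggle instead of uniform snow.

```python
# A single random walker flips pixels white as it wanders, producing a
# connected squiggle instead of uniform snow.
import numpy as np
from PIL import Image

rng = np.random.default_rng()
size, steps = 256, 20_000
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
canvas = np.zeros((size, size), dtype=np.uint8)
y, x = size // 2, size // 2
for _ in range(steps):
    dy, dx = moves[rng.integers(4)]
    y, x = (y + dy) % size, (x + dx) % size   # wrap at the edges
    canvas[y, x] = 255
Image.fromarray(canvas, mode="L").save("walk.png")
```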

Yes, but JPEG compression works on blocks of limited size. I thought the blocks were by definition 8x8 pixels - and in baseline JPEG the DCT blocks are indeed 8x8, though with chroma subsampling the minimum coded units can be somewhat larger (up to 16x16). They can't be arbitrarily large.

If you want to catch the attention of a hooman on a 3840x2160 monitor, you want features of attention-catching scale. Lines (of arbitrary shape), blocks of texture with relatively definite edges, that sort of thing. A field composed entirely of uniformly random fine noise is, by definition, seen as noise. Visual snow.

So you might develop a parameter for an image that quantifies the presence of lines and edges, or fields of roughly constant color/brightness. Run-length encoding is a simple model: the file effectively says "the next 37 pixels are white, the next 240 pixels are black…" and so on (GIF's LZW compression exploits the same kind of repetition). An image that has many long runs of one pixel color might be of more interest to a human visual system than an image of fine visual snow.
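As a sketch of such a parameter, assuming NumPy (the metric is purely illustrative, not a vision model): score an image by its mean horizontal run length, so pure snow scores near 1 and flat fields score high.

```python
# Score an image by its mean horizontal run length: pure snow scores near 1,
# flat fields score high. Illustrative only, not a model of human vision.
import numpy as np

def mean_run_length(gray: np.ndarray) -> float:
    """Average length of horizontal runs of identical pixel values."""
    changes = np.count_nonzero(np.diff(gray.astype(int), axis=1))
    runs = changes + gray.shape[0]   # each row contributes (its changes + 1) runs
    return gray.size / runs

rng = np.random.default_rng()
snow = rng.integers(0, 256, size=(100, 100))
flat = np.zeros((100, 100), dtype=np.uint8)
print(mean_run_length(snow), mean_run_length(flat))   # ~1.0 vs 100.0
```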

ISTM this would filter out the vast majority of possible images.