Ideal computer graphics format for encoding desktop computing screenshots?

Was/is/shall there soon be a graphics format that’s optimized for sharing general “office work”-type screenshots? By that, I mean your typical web browser screen, MS Excel, or other common desktop apps, which are usually a mix of graphics, line art, and text.

As far as I can tell, none of the existing common graphics formats is particularly well suited to this use case:

  • JPEG compresses the image in 8×8-pixel blocks and smears sharp edges such as text; it’s great for photos but mediocre at screenshots
  • PNG can compress losslessly, but results in big files
  • WEBP & AVIF are improvements in efficiency, but still not really optimized for “text on top of simple shapes, with sometimes graphics too” that most desktop app screenshots are

The closest current format I can think of that approximates what I’m imagining is actually a properly composited PDF. By “properly composited”, I don’t mean the simple rasterized output you get when you simply embed a screenshot into PDF, but the native output of something like Acrobat or InDesign, which will composite a PDF out of many layers: simple graphics will be saved as vectors, text will have the font subsetted and then laid out by position (as actual text, not just vector shapes), photos can be individually encoded as JPEG and embedded on top of the other layers, and then the final container overall can be internally zipped.

In that way, each part of the app could be optimally encoded with a scheme that makes the most sense for its given data type: text is zipped, pictures are JPEG-encoded, shapes are vectorized into line art, etc. It’s not just a “dumb” raster grid of pixels but an intelligent composition of different data types overlaid on one another, but still in a printer- and screen-ready format.

Of course, it’s a pain to both create and serve such a PDF, especially compared to a simple screenshot. To be able to create a graphics format that works similarly, the screenshot-taking tool would have to be aware of the app’s UI structure and be able to break it down into its constituent shape/graphics/text parts.

My understanding is that some versions of Microsoft’s RDP (Remote Desktop Protocol) are actually able to stream Windows apps this way, breaking down each app’s UI into “system widgets/frames/API calls” (that the remote system can receive algorithmically and recreate locally), text (sent as text), and graphics/pictures (sent as embedded bitmaps). Or something like that; that explanation probably isn’t 100% accurate, but gets the idea across. When done that way, RDP usually resulted in vastly superior remote desktop experiences (less lag, more responsive graphics) compared to things like VNC or Zoom, which just streamed a video of the pixels without any sort of per-data-type optimization.

Has such a thing ever been attempted with screenshots or screen recordings? I understand it’s quite a technical challenge compared to a simple screenshot, and today’s disk space and bandwidth resources make it not a very high-priority problem to solve… but it’s an interesting problem to me, still. As someone who grew up on 14.4k modems, even though we now live in the time of gigabit fiber, it still breaks my heart to see a megabyte-large screenshot that just shows a simple application window. Surely there’s got to be some better way?

The old DjVu file format was an attempt to do this. I don’t think it ever really caught on. The idea was that a scanned document could be split into layers: the text (stored as a compact bitonal mask, optionally with OCRed text alongside it) and compressed images of the non-text parts. The same thing could work for a screenshot.

I don’t think I ever saw it used in the wild.

For screenshots PNG is probably the best format. If the files produced are too large, then cropping and scaling are probably the right solution. A 4k screen shot is going to be a big file, but the 1024×768 final version is going to be pretty small.

Screenshots are likely to have things like large regions of solid color, which are non-ideal for a jpeg compressor, which is designed to work on photographs. In practice, as long as the image is legible, it probably doesn’t make much difference.

This sounds awfully complicated just to avoid suspected PNG file size.

PNG has always seemed pretty efficient for combined image types to me.

I just did a printscreen capture of my desktop with background image and a Chrome window of a tab from my work database. Saved it as a 2560x1440 PNG at medium compression. 2.42 MB, quality looks good to me.

For 2025 that’s a very manageable file size.

Ditto what @FluffyBob said. Just do PNG, it should get you to similar or better numbers than JPG with better visuals.

If you’re not seeing that, I’d suspect that you’re taking screenshots that include a fancy desktop background and/or that you’ve got a bunch of fancy UI settings that make everything a semi-transparent gradient fill.

If you give PNG a solid color that repeats, line after line, it should compress down to nothing. Where it inflates is when you give it a photographic image and expect it to handle that better than JPG. That’s not how it works. Its DEFLATE compressor finds repeated byte sequences (like a solid color or a pattern) and replaces each repeat with a short back-reference to the earlier copy.
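A rough way to check this yourself, as a sketch only: PNG's compression stage is DEFLATE, the same algorithm Python's built-in zlib module uses, so zlib can stand in for it. The image size and the use of random bytes as a stand-in for noisy photographic content are assumptions for illustration, and real PNG also applies per-row filters that this skips.

```python
import random
import zlib

WIDTH, HEIGHT = 640, 480  # hypothetical image size, 3 bytes (RGB) per pixel

def deflate_ratio(raw: bytes) -> float:
    """Return compressed size divided by raw size, using zlib's DEFLATE."""
    return len(zlib.compress(raw, 9)) / len(raw)

# A flat, single-color image: the same RGB triple repeated for every pixel.
flat = bytes([240, 240, 240]) * (WIDTH * HEIGHT)

# Seeded random bytes as a crude stand-in for noisy photographic content.
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(WIDTH * HEIGHT * 3))

flat_ratio = deflate_ratio(flat)
noisy_ratio = deflate_ratio(noisy)

print(f"flat color: {flat_ratio:.4%} of raw size")
print(f"noise:      {noisy_ratio:.4%} of raw size")
```

The flat image should shrink to a tiny fraction of its raw size, while the noise should barely shrink at all. Real screenshots land somewhere between those two extremes depending on how much of the frame is flat UI.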

Simplify your UI settings and crop your images down to the part that looks more like a Word document and less like a photograph, and your PNGs should go down to negligible sizes.

Yes, I understand the basics of this, but 2.4 MB is huge even by today’s standards. That’s a second or more, per picture, over a slower connection (like under 20 Mbps, which is still common where I live, especially over mobile).
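For anyone who wants to check the arithmetic, here is a back-of-the-envelope sketch. It assumes 1 MB = 1,000,000 bytes and ignores latency and protocol overhead, so real transfers will be somewhat slower. The function name is just something made up for this example.

```python
def transfer_seconds(size_megabytes: float, link_mbps: float) -> float:
    """Seconds to move a file, ignoring latency and protocol overhead."""
    megabits = size_megabytes * 8  # 1 byte = 8 bits
    return megabits / link_mbps

# The 2.4 MB screenshot over a few slower link speeds (speeds are examples).
for mbps in (20, 10, 5):
    print(f"2.4 MB at {mbps} Mbps: {transfer_seconds(2.4, mbps):.2f} s")
```

At 20 Mbps that works out to just under a second per image, and it scales linearly from there as the link gets slower.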

A comparable PDF would be maybe tens or at most a couple hundred kilobytes, a huge difference. JPEG would also be much smaller, but lose text fidelity.

Besides, this is more a question of theoretical interest anyway. I don’t really have any choice except to “just use PNG” (or webp/avif) due to OS and browser support in the real world.

But it’s an interesting question to me, theoretically, and evidently to the DjVu designers too. Too bad that never caught on!

A PDF containing photographic imagery is not going to compress better than a PNG or JPG.

I meant a composited PDF with embedded text and vector paths representing an application UI, vs rasterizing the same into pixels.

And even if you add a photograph under the text, I’m pretty certain it’s still going to be smaller… because the picture in the PDF would still be JPEG-encoded anyway, but the text on top of it would just be an ASCII overlay instead of rasterized pixels that would fuck with the JPEG encoder were it all flattened into one layer. The PDF container overhead is likely to be quite a bit smaller than the inefficiencies caused by encoding text with the JPEG encoder. (But again, this assumes a properly composited PDF, not just a raster screenshot embedded as-is into a PDF… which would only add pointless overhead.)

I can test this on some demo files later, if it’s really in doubt or controversial.

One thing to consider is: how will the resulting media/document be consumed? Will people look at it on a full size desktop screen? (maybe a PDF type document or a video will work), on a mobile phone or tablet? (maybe it needs to be a web page that can respond to the screen size), will they print it and look at the hard copy? (maybe the screenshots should be reduced down to greyscale, which will incidentally save some space).

I just took a screenshot of my laptop screen (1920 x 1032). Microsoft Paint → PNG

164 KB

You didn’t write any indication that you’d read what I actually wrote or took any of the advice in it so you might want to verify that you’re replying because you’re stuck, and not because you’re jumping the gun.

If you do have a GUI and cropping habit that’s amenable to PNG, and are still getting 2.8 MB, then I’d suggest finding a better art program.

For the record, I’ll note that a 1920 x 1032 screen would have nearly 2 million pixels. If we were storing it as a BMP with a 256-color palette (one byte per pixel), you’d already be two-thirds of the way to your 2.8 MB. A compressed format shouldn’t be anywhere close in size to a BMP and certainly not larger than one (except maybe a black and white image that’s storing pixels at the bit level). It’s almost guaranteed that either you or your art program is doing something wrong.
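The numbers above can be checked with a quick sketch. It counts only the pixel data (no file header or palette table) and pads each row to a multiple of 4 bytes the way BMP does. The helper name is made up for this example.

```python
def raw_pixel_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Uncompressed pixel data size, with rows padded to 4 bytes as BMP does."""
    row_bytes = (width * bits_per_pixel + 31) // 32 * 4
    return row_bytes * height

pixels = 1920 * 1032
palette_bytes = raw_pixel_bytes(1920, 1032, 8)     # 256-color palette
truecolor_bytes = raw_pixel_bytes(1920, 1032, 24)  # 24-bit RGB

print(pixels)           # 1981440
print(palette_bytes)    # 1981440
print(truecolor_bytes)  # 5944320
```

So the palettized version is about 1.98 MB of pixel data, roughly 70% of a 2.8 MB file, and the uncompressed 24-bit version is about 5.9 MB.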

You look at something like Twitch.tv where people are sharing game streams (dynamic fast-changing images) at 6 Mbps and the image quality looks great. Sure, it isn’t nearly as sharp as your own screen at 26 Gbps, but it’s pretty good for less than 1% of the bandwidth. A desktop that’s barely changing would need a fraction of that bitrate for comparable quality. Yeah, Zoom sucks but purposeful VDI solutions don’t.

The apparent solution doesn’t exist because the problem doesn’t exist.

Visual aid for what I mean by adjusting your GUI settings:

Modern with semi-transparent windows, Gaussian blur, and glow effects:

Windows 2000:

Modernity looks cool but you’ll generally find that your computer is happier and more responsive when you go full 2000. Your screenshots will also be a lot smaller.

Lossless PNGs will be pretty big, but you probably don’t need lossless. You can get quite tolerable quality with very small file sizes.

I have not had these problems. I just used Snipping Tool (which is free with Windows 10 and, I think, Windows 11) to capture screenshots of the New York Times front page in JPG and PNG formats. The resulting files were 1690x812 pixels, 1.4MB in size for a PNG file and 1747x804 pixels, 256KB in size for a JPG file. Neither image looks blurry or blocky. Windows also includes Snip & Sketch (which I think is what they push for use instead of Snipping Tool), though I didn’t try that. And some at work use Snagit for additional functionality, though that’s not free.

I have. For long documents, though, not for single screenshots.

I took a random 1440x1440 desktop screenshot off the Internet that was in PNG. Original size: 203k. After re-encoding it efficiently, the PNG was 161k. You really want to squeeze more out of it? Fine; losslessly converting to JPEG XL resulted in a 107kB file.

I do not see the problem with using PNG, as many have commented. Nor should the files be huge. It’s true that JPEG XL is “better” (smaller) and also has a “visually lossless” mode to squeeze it down even further. Don’t use (old) JPG for screenshots.

Just to be clear, this is a purely academic question arising out of curiosity, not a real-world need to make smaller screenshots. I am curious about whether there is a theoretical, more space-efficient way to encode this sort of information. And there surely is, as seen by DjVu, etc.

For context, I take screenshots & recordings all day long for work, and I’ve been using PNGs since they were first invented three decades ago… I understand what you all are saying, and surely that makes sense for everyday usage. Even if a new highly-efficient screenshot-specific format were to be made, there would be exactly 0% chance it’d see widespread adoption. There’s just not enough need for it.

But still, it’s an interesting question to me academically. Just for personal satisfaction.

Let me provide some examples, here, so I can better illustrate the nuances I’m talking about…

Given a screenshot like this:

That’s a pretty standard, already well-trimmed screenshot, right? That’s a 220 KB PNG.

The JPEG is 332 KB:

And you can see the compression artifacts if you zoom in a lot, as in this cropped version resaved as a lossless PNG:

If I remove the background grey color, it’s more obvious:

There are black and dark grey squares all around the word, in the background. I don’t mean the anti-aliasing applied to the font itself, but the nearly-black squares scattered around the image. That’s unavoidable when encoding text as JPEG; it can be improved with quality settings, but not avoided altogether. Those artifacts will not be there with PNG or other lossless encoders.
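To show where those stray squares come from, here is a heavily simplified sketch. Real JPEG uses a 2D DCT over 8x8 blocks with per-frequency quantization tables; this uses a 1D DCT over one 8-pixel row and a single uniform quantization step of 40, both of which are assumptions chosen only to keep the example short. Values are left as floats and are not clamped to 0..255.

```python
import math

N = 8  # JPEG works on 8x8 blocks; this sketch uses one 8-sample row

def dct(samples):
    """Orthonormal 1D DCT-II."""
    out = []
    for k in range(N):
        scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
        total = sum(s * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                    for n, s in enumerate(samples))
        out.append(scale * total)
    return out

def idct(coeffs):
    """Inverse of dct() (orthonormal DCT-III)."""
    out = []
    for n in range(N):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
            total += scale * c * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
        out.append(total)
    return out

def lossy_roundtrip(samples, step=40):
    """Quantize every coefficient to a multiple of `step`, then reconstruct."""
    quantized = [round(c / step) * step for c in dct(samples)]
    return idct(quantized)

def max_error(samples, step=40):
    """Largest per-pixel difference between the input and its lossy round trip."""
    restored = lossy_roundtrip(samples, step)
    return max(abs(a - b) for a, b in zip(samples, restored))

flat_row = [255] * 8                               # plain white background
text_row = [255, 255, 255, 0, 255, 255, 255, 255]  # one-pixel black stroke

print("flat row error:", round(max_error(flat_row), 2))
print("text row error:", round(max_error(text_row), 2))
```

The flat row comes back almost untouched, while the row with a single dark stroke picks up errors spread across all eight pixels. That spread is the 1D version of the dark smudges around the text.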

We can do a little bit better than PNG with a lossless WEBP at 186 KB:

Or a good compromise with a lossy 98 KB HEIC (that you may or may not be able to see here, sorry):
https://fightingignorance.org/1f1fb412-4b8f-47d2-a9c4-d6a4423641ae-ui.heic

That has higher quality and a smaller size compared to JPEG. But it’s not as widely supported.

So already we have many options.

But something like this janky example (just my quick and dirty example, it’s not an actual screenshot) can dramatically decrease the filesize down to 23 KB:

https://fightingignorance.org/e311d05c-faea-4636-83a0-ee894246efc0-example.svg

If you can’t load the SVG, it looks something like this:

That’s 10% of the original PNG size for a similar (but not exactly the same) output. An actual pixel-perfect copy would require font subsetting, different avatar images (as small JPEGs embedded inside the overall container), etc.

A PDF instead of SVG would be similarly sized, but maybe a little bit bigger. Probably still smaller than the PNG. I don’t know how big it would be as a DjVu.

But I hope that illustrates the nuances I was trying to get at. The SVG is a combination of vector paths and shapes (lines and rectangles and icons), text (as ASCII with font and position and color metadata), and embedded raster avatars (the JPEG avatar icons). It allows each type of UI component to be encoded in the way that is most efficient for its data type, as opposed to trying to force one compressor on all of them.

A composited PDF could use a similar technique, and if I had a copy of Illustrator handy, I’d make one. It might even be smaller because PDF can zip the non-raster parts (text, font faces, vectors, etc.) stream by stream, whereas SVG can only be gzipped as a whole file (as .svgz, or in transit by the web server).
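As a rough sketch of how small a shapes-and-text description can be, the snippet below builds a made-up mock UI as SVG markup and then gzips the whole file, which is what the .svgz variant of SVG does. The window size, colors, and labels are all invented for illustration. The raw RGB figure is only an upper bound for the raster side, not a fair stand-in for a PNG.

```python
import gzip

WIDTH, HEIGHT = 800, 600  # hypothetical window size

def mock_ui_svg(rows: int = 20) -> str:
    """Build a tiny SVG 'screenshot': a title bar plus rows of list text."""
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{WIDTH}" height="{HEIGHT}">',
        f'<rect width="{WIDTH}" height="{HEIGHT}" fill="#f0f0f0"/>',
        f'<rect width="{WIDTH}" height="32" fill="#2b579a"/>',
        '<text x="12" y="22" fill="#fff" font-family="sans-serif">Inbox</text>',
    ]
    for i in range(rows):
        y = 60 + i * 24
        parts.append(
            f'<text x="12" y="{y}" font-family="sans-serif">Message {i + 1}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)

svg_bytes = mock_ui_svg().encode("utf-8")
svgz_bytes = gzip.compress(svg_bytes)  # same idea as an .svgz file
raw_rgb_bytes = WIDTH * HEIGHT * 3     # uncompressed 24-bit raster

print(f"SVG text:     {len(svg_bytes)} bytes")
print(f"gzipped SVG:  {len(svgz_bytes)} bytes")
print(f"raw RGB grid: {raw_rgb_bytes} bytes")
```

The markup is repetitive, so gzip shrinks it considerably. A real capture would also need subsetted fonts and embedded images, which would add to the total.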

Anyhow, I understand that different graphics codecs have different compression algorithms, strengths, weaknesses, etc. This isn’t about that, but about how to better encode the idea of an “application UI” as something other than raw pixels (again, just out of academic curiosity). I hope that makes some sense…

The difficulty, of course, is that such an encoder would be much harder to write, because it would have to be able to tie into either the graphics driver and/or the app UI APIs in order to deconstruct the original app into its original constituent parts, translate each type to the format, and then draw them. That’s like 1000x more work and CPU time than a simple PNG or JPEG encoder.

So it’s purely of academic interest. Such a thing would only be useful where CPU (and programmer) time is nearly infinite but bandwidth is extremely constrained. i.e., it solves no real-world need, as many of you already pointed out.

Yeah, exactly. It’s not a real-world problem. We live in an age when you can stream 4k AAA 3D games in real-time to your browser window. Streaming a desktop isn’t a technical challenge anymore. But still, despite that, there exist desktop-app-specific optimization techniques like I described earlier with Microsoft’s Remote Desktop Protocol: Top 10 RDP Protocol Misconceptions – Part 1 | Microsoft Community Hub

RDP uses presentation virtualization to enable a much better end-user experience, scalability and bandwidth utilization. RDP plugs into the Windows graphics system the same way a real display driver does, except that, instead of being a driver for a physical video card, RDP is a virtual display driver. Instead of sending drawing operations to a physical hardware GPU, RDP makes intelligent decisions about how to encode those commands into the RDP wire format. This can range from encoding bitmaps to, in many cases, encoding much smaller display commands such as “Draw line from point 1 to point 2” or “Render this text at this location.”

To illustrate some of the benefits on CPU load, terminal servers today can scale to many hundreds of users. In some of our scalability tests we see that even with hundreds of users connecting to one server running knowledge worker apps (e.g. Word, Outlook, Excel) the total CPU load consumed by RDP to encode and transmit the graphics is only a few percent of the whole system CPU load!

With this approach RDP avoids the costs of screen scraping and has a lot more flexibility in encoding the display as either bitmaps or a stream of commands to get the best possible performance.

In other words, it encodes UI as UI, not as a rasterized bitmap. When you’re at the office with gigabit fiber, it doesn’t matter — you can send it as uncompressed video and probably still be fine — but those sorts of hyper-optimizations can still be handy for some instances, like if you’re upload-constrained (trying to share a screen from a low-bandwidth home DSL connection with limited upload) or trying to remote into something from a developing country or just somewhere with slow mobile internet, etc.
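To put a rough number on “UI as UI”, here is a toy sketch. The opcode and the binary layout are invented for this example and are not RDP's actual wire format. It only compares the size of a single “draw line” command with the raw pixels of the rectangle that line passes through.

```python
import struct

# Hypothetical opcode and layout, invented for this sketch (not RDP's format):
# opcode (1 byte), x1, y1, x2, y2 (2 bytes each), RGBA color (4 bytes).
DRAW_LINE = 0x01
LINE_FORMAT = "<BHHHHI"

def encode_line(x1: int, y1: int, x2: int, y2: int, rgba: int) -> bytes:
    """Pack a 'draw line' command into a fixed-size binary record."""
    return struct.pack(LINE_FORMAT, DRAW_LINE, x1, y1, x2, y2, rgba)

def decode_line(record: bytes):
    """Unpack the record back into (x1, y1, x2, y2, rgba)."""
    opcode, x1, y1, x2, y2, rgba = struct.unpack(LINE_FORMAT, record)
    if opcode != DRAW_LINE:
        raise ValueError("not a draw-line record")
    return x1, y1, x2, y2, rgba

def bounding_box_bitmap_bytes(x1, y1, x2, y2, bytes_per_pixel=3) -> int:
    """Raw size of the rectangle of pixels the line passes through."""
    width = abs(x2 - x1) + 1
    height = abs(y2 - y1) + 1
    return width * height * bytes_per_pixel

command = encode_line(10, 10, 500, 300, 0x000000FF)
print(len(command), "bytes as a command")                                  # 13
print(bounding_box_bitmap_bytes(10, 10, 500, 300), "bytes as raw pixels")  # 428643
```

A real encoder would compress that bitmap, so the true gap is smaller than these raw numbers suggest, but the command stays the same 13 bytes however long the line is.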

I just think it’s interesting to think about, is all. Don’t worry, there are no giant screenshots or recordings I’m desperately trying to shrink down :) It’s just curiosity.

From a purely theoretical standpoint, your best option would (likely) consist of a few things layered on top of one another:

  1. Some sort of shared, global library of elements. E.g. all of the written characters, in all fonts, in all styles and sizes; windowing elements, buttons, dialogs, etc.; 3D representations of cars, cans, bottles, and other mass-manufactured objects that are consistent in appearance; and so on. The image file would simply need to reference into this external library, with some extra data about location and rotation.
  2. A system to generate simple 2D and 3D shapes to fill and layer upon one another as semi-transparent elements, to recreate complex images.
  3. A JPEG-like system.
  4. A GIF/PNG-like system.
  5. A system that breaks out subportions of an image for the above systems to handle, based on the ideal compression strategy for that portion of the image.

That is to say, the ideal way to accomplish the task is the impossible way.

But a strong system would probably need to work in that direction.
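The last item in that list (breaking out subportions for different compressors) is the most tractable one, and a crude version can be sketched in a few lines. Everything here is an assumption made for illustration: the 16-pixel tile size, the “few distinct colors means UI-like” heuristic, the threshold of 8, and the tiny synthetic test image. A serious segmenter would need to be much smarter.

```python
TILE = 16            # tile edge in pixels (arbitrary choice for this sketch)
MAX_FLAT_COLORS = 8  # hypothetical threshold: few colors means UI/text-like

def tiles(pixels, width, height):
    """Yield (x, y, tile_pixels) for each TILE x TILE block of a row-major image."""
    for ty in range(0, height, TILE):
        for tx in range(0, width, TILE):
            block = [pixels[y * width + x]
                     for y in range(ty, min(ty + TILE, height))
                     for x in range(tx, min(tx + TILE, width))]
            yield tx, ty, block

def choose_codec(block):
    """Pick 'lossless' for tiles with few distinct colors, else 'lossy'."""
    return "lossless" if len(set(block)) <= MAX_FLAT_COLORS else "lossy"

def plan(pixels, width, height):
    """Map every tile origin to the codec family that would handle it."""
    return {(x, y): choose_codec(block)
            for x, y, block in tiles(pixels, width, height)}

# Demo image, 32x16: left half is a flat grey panel, right half a gradient.
W, H = 32, 16
image = [(200, 200, 200) if x < 16
         else (x * 8 % 256, y * 16 % 256, (x + y) * 4 % 256)
         for y in range(H) for x in range(W)]

print(plan(image, W, H))
```

The flat panel gets routed to the PNG-like path and the gradient to the JPEG-like path. The hard part in practice is doing this well at the boundaries, where text sits on top of a photo.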