Why is 'decoding a data file format w/o succumbing to an exploit' apparently an intractable problem?

This is something that has long puzzled me. Apparently there are always new exploits in which malformed data in a data file (such as JPEG, PDF, or ZIP) causes the software/operating system on the target system to execute malicious code.

The countermeasure seems to be to fix the software’s vulnerability *to that particular exploit*.

Why is this? From my naive understanding, a file format specification specifies a number of mandatory and optional data structures that together make up a file, their way of encoding data, and their legal ranges.

Why is it apparently impossible to write, e.g., software decoding JPEG or another format that has no provision for scriptable content, that makes absolutely no assumptions about anything in the file being correct, and that can be mathematically proven to return, for all possible inputs, either data for display (without any side effects) or no data plus an error status?

Because software developers are lazy and sloppy.
Not checking for buffer overruns is the big one.

Imagine you run a receiving warehouse. You have packages that come in, you hold them in a secure area, and then you ship them out. You train your employees not to open the boxes, so there’s no risk of them opening a package and finding a fake employee training manual that instructs them to do something malicious. I think this is what you’re picturing. “Why is it so hard to train your employees not to open the boxes?” In this case, imagine our employees are robots, so it’s not very hard.

However, your warehouse is also a business that needs its own supplies – employee uniforms, coffee filters, employee training manuals, etc. Those supplies come in their own boxes, probably through a different entrance, and then are stored in a different area, probably one that’s much less secure than your typical holding area (because people need to actually get in those boxes and use them, you need to allow greater access).

So rather than try to trick your robot employees into opening boxes in the secure area that they know they’re not allowed to open, an attacker will trick the robots into thinking that the secure area is full, or confuse them about the location of the secure area, so that a box full of malicious employee training manuals makes it into the general supply room. Once there, any other robot employee can open it up and start following the new rules; it doesn’t need to be the original robot that was tricked into thinking the secure area was full.

Why can’t we program robots that don’t get confused about where the secure storage area is? It’s just kinda tough. A lot tougher than training them not to open the boxes.

Douglas Hofstadter, in Gödel Escher Bach, has an exchange between the Crab and the Tortoise. The Crab has just bought an audiophile system that is claimed to be “capable of reproducing any and all sounds”. The Tortoise brings him a home recording titled “I Cannot Be Played on Your Record Player” and when the Crab tries to play it, it produces sounds whose vibrations destroy the record player.

This was to illustrate the concept that any sufficiently advanced system designed to follow instructions is subject to being able to follow instructions whose implementation destroys it. Really simple instruction-followers may not be vulnerable, but as you increase the abilities of the system it becomes inevitable that such a vulnerability will intrinsically exist. Not that such a vulnerability may be possible, but that it flat-out will be a characteristic of any sufficiently complex system.

What is true for a set of instructions that directly attacks the instruction-interpreter is certainly true for doing other destructive things that we’d prefer the instruction-interpreter never do, or never be able to do.

To go back to the warehouse analogy: your employees are told to bring the box into the warehouse and shut the door. A box arrives, but nobody checks the size. Either the truck backs through the far wall while bringing it in, opening up the warehouse, or the employees stand around for the next week wondering what to do because one end of the box is stopping the doors from closing. Why? Because the manual didn’t say “check size of box first, refuse if too big”.

This is the problem. “Buffer overruns” are common vulnerabilities. The program says “begin reading this data until the stop point” but does not pre-check that the stop point is reasonable, or else does not cancel if the stop point goes beyond the capacity of the memory area set aside to handle it. This sort of instruction occurs in any number of places in a program. If the program does not stop the “move to working space” process, the copy keeps going past the space and can overwrite an area where program code is stored - so if the attacker overwrites that area with the right bytes, the “bad data” is now executing as the program, and can go on to do more deliberate stuff like load a bigger program. There are so many instances where a buffer overrun could happen, and often programmers are careless and do not allow for such malicious possibilities. (“Assumptions”)
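Here is a minimal sketch in C of the pattern being described. The record layout and function names are invented for illustration; only the bug is the real-world one.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Invented record layout: [1-byte length][<length> bytes of name]. */

void read_name_unsafe(const uint8_t *record) {
    char name[16];
    uint8_t len = record[0];        /* attacker controls this byte */
    memcpy(name, record + 1, len);  /* BUG: no pre-check that len fits;
                                       len > 16 runs past 'name' and
                                       clobbers adjacent memory */
    name[len] = '\0';               /* also out of bounds when len >= 16 */
    printf("name: %s\n", name);
}

int read_name_safe(const uint8_t *record, size_t record_size) {
    char name[16];
    if (record_size < 1) return -1;               /* truncated input */
    uint8_t len = record[0];
    if (len >= sizeof name) return -1;            /* stop point unreasonable */
    if ((size_t)len + 1 > record_size) return -1; /* claims more than exists */
    memcpy(name, record + 1, len);
    name[len] = '\0';
    printf("name: %s\n", name);
    return 0;
}
```

The unsafe version trusts a length that came straight out of the file; the safe version refuses anything that doesn’t fit and reports an error instead.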

Here’s XKCD’s explanation of one particularly bad buffer-overrun attack.

Mathematically proving an algorithm correct is complicated. And even after you prove the algorithm correct, you need to prove the algorithm that turns your algorithm into machine language is itself correct. And then you need to prove the algorithm that turns the algorithm that turns your algorithm into machine code into machine code is itself correct. And then…

That doesn’t even take into account the possibility that malicious code can *also* be algorithmically correct data.

I think there are three reasons:

  1. The first is just the aforementioned sloppy/lazy/not-foresighted programmers who write decoding programs and never put in things like “check to make sure the package fits in the warehouse before bringing it in”. It’s hard to really blame them for doing that 20 years ago, before the modern internet security age. And an amazing amount of code is just re-used and re-used. So there are lots of programs still being used that were never designed with good security.

  2. Very simple formats are easy to decode securely, but programmers like to make things interesting and cool, so a lot of formats have structures with lots of detailed instructions for the decoding program, almost making the file a mini-program (see the sketch after this list). This kind of thing is much, much harder to make secure and much, much easier to do evil things with.

  3. Windows makes it very easy for a malicious file to pretend to be some other kind (a program pretending to be a picture), and trick the user into running it (because in Windows, the simplest action to run a program is exactly the same as the simplest action to look at an image file). So in some cases, the decoding program isn’t actually being run.
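Here is the promised sketch of reason 2, as a hedged illustration only - the opcodes and format are made up, not any real standard. The file is effectively a little program the decoder executes, and every operand has to be distrusted:

```c
#include <stdint.h>
#include <string.h>

/* Made-up "mini-program" format: a stream of 3-byte commands. */
enum { OP_FILL = 0, OP_COPY = 1, OP_END = 2 };

/* Execute the command stream into out[out_size].
   Returns 0 on success, -1 for any malformed input. */
int run_decoder(const uint8_t *cmds, size_t n,
                uint8_t *out, size_t out_size) {
    size_t pos = 0;                              /* write cursor */
    for (size_t i = 0; i + 3 <= n; i += 3) {
        uint8_t op = cmds[i], a = cmds[i + 1], b = cmds[i + 2];
        switch (op) {
        case OP_FILL:                 /* write b copies of byte a */
            if (b > out_size - pos) return -1;      /* would overrun output */
            memset(out + pos, a, b);
            pos += b;
            break;
        case OP_COPY:                 /* copy b earlier bytes from offset a */
            if (a >= pos || b > pos - a) return -1; /* source out of range */
            if (b > out_size - pos) return -1;      /* dest would overrun  */
            memmove(out + pos, out + a, b);
            pos += b;
            break;
        case OP_END:
            return 0;
        default:
            return -1;                /* unknown command: reject the file */
        }
    }
    return -1;                        /* stream ended without OP_END */
}
```

Drop any one of those checks and a crafted file walks the decoder out of bounds; that is the whole attack surface of a “clever” format, in miniature.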

The chain also extends in the other direction: you have to prove your proof is correct, then prove that the proof of the proof is correct, etc.

(When I reviewed papers, I always found at least one error in every proof of correctness for some algorithm. Most often, because the algorithm had an error the proof didn’t catch.)

Proving any non-trivial program correct is insanely hard. Witness the unsolvability of the Halting Problem: you can’t even automate proving that a program will stop, let alone that it produces the right output for a given input.
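For the flavor of why it’s unsolvable, here is the classic diagonal argument sketched in C. The `halts` function is hypothetical by construction - the stub below only stands in for the decider that cannot exist:

```c
/* Hypothetical decider: returns 1 if program(input) eventually halts,
   0 if it runs forever. No correct implementation can exist; this stub
   is only a stand-in for the sake of the argument. */
int halts(const char *program_source, const char *input) {
    (void)program_source; (void)input;
    return 0;   /* any fixed answer is wrong for some program */
}

/* The diagonal trick: do the opposite of whatever halts() predicts
   about a program run on its own source. */
void paradox(const char *program_source) {
    if (halts(program_source, program_source))
        for (;;) ;   /* halts() said "halts", so loop forever */
    /* halts() said "loops", so return (i.e., halt) immediately */
}

/* Feed paradox() its own source: whichever answer halts() gives is
   wrong, so a correct halts() cannot be written. */
```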

So you rely on programmers to do their best. Some are quite a bit better at it than most. But most code isn’t written by the best. And even the best aren’t perfect.

And it doesn’t matter what the OS is. E.g., every once in a while someone finds an error in a key Linux program that causes you to wonder: What was the idiot who wrote this thinking?

Note that MS uses many “standard” packages from the open-source world. E.g., the LZW compress/decompress code used in many places, including encoding/decoding .gif images, was duplicated all over, and when an exploitable error was found in the original, a lot of fixes had to be pushed out.

And this was some fairly simple code that had a lot of “eyes” looking at it over the years.

I’ve asked myself the same question as the OP. It certainly is possible to design decoders that are immune, to a reasonable degree of certainty, to exploits. (Some of the answers in this thread are too trite: as the OP points out, many decoding problems are very well-defined.) It may take great cleverness and caution to conceive of possible exploits, but if “black hats” can find an exploit, “white hats” should be able to find it, and circumvent it.

So why isn’t more impervious decoding software developed? Some of the obvious reasons are scheduling pressures, and priority given to fancy features rather than integrity. Another reason is dilution of resources due to competition: What if the resources spent developing five functionally equivalent codecs were spent developing a single robust codec?

But one of the biggest reasons, I think, is the huge gulf in skill among programmers. Imagine a restaurant where one cook creates gourmet dishes in minutes, while another cook spends all day trying to fry an egg and ends up burning it. The “black hat” who finds an exploit is probably much smarter than the guy whose sloppy code permitted it.

Another point, probably obvious to many but not all, is that to write exploit-free decoding software it is not necessary to anticipate any particular malicious cleverness. It is sufficient merely to cope (most simply, via an error exit) with all faulty inputs. (If such a fault occurred by bad luck rather than intelligent malice and went undetected, it might cause a program crash or other misbehavior.)
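In code, that discipline looks roughly like the following sketch - a header parser for an invented format whose field names and limits are made up here. Every field gets range-checked, and anything out of spec gets an error status rather than the benefit of the doubt:

```c
#include <stddef.h>
#include <stdint.h>

/* Invented 8-byte header: magic "IM", then little-endian
   u16 width, u16 height, u16 flags. */
#define MAX_DIM 32768u

typedef struct { uint16_t width, height, flags; } header_t;

/* Returns 0 and fills *h on success, a negative code for any fault. */
int parse_header(const uint8_t *buf, size_t len, header_t *h) {
    if (len < 8)                        return -1;  /* truncated file */
    if (buf[0] != 'I' || buf[1] != 'M') return -2;  /* wrong magic    */
    h->width  = (uint16_t)(buf[2] | (buf[3] << 8));
    h->height = (uint16_t)(buf[4] | (buf[5] << 8));
    h->flags  = (uint16_t)(buf[6] | (buf[7] << 8));
    if (h->width == 0  || h->width  > MAX_DIM) return -3;
    if (h->height == 0 || h->height > MAX_DIM) return -4;
    /* make sure a later width*height*4 pixel allocation can't overflow */
    if ((uint64_t)h->width * h->height > SIZE_MAX / 4) return -5;
    return 0;
}
```

Nothing here anticipates any particular attack; it just refuses every input that fails the spec, which covers the malicious ones for free.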

Something to understand about the OP’s file format examples (JPEG, PDF, ZIP): none of these are simple file formats that just contain descriptive fields of basic types (integer, string, etc.). Each is a complex description used to drive a further execution engine that creates new data. PDF is a programming language; it is essentially a way of encapsulating things written in the PostScript language. You can write arbitrarily complex code in PDF. Some PDF is intended to actively execute whilst you are viewing it - i.e., to fill in a form. Viewing a PDF document is, for all intents and purposes, running someone else’s program on your machine. Hardening a PDF viewer is a highly non-trivial task. Sandboxing is the usual start. But there are PDF engines everywhere; pwning your printer isn’t impossible.

But both JPEG and ZIP have lesser but similar aspects. They are compression formats: they describe to the decompression system how to execute in order to re-create the original document. By definition they create new, larger data. It can be very hard to know ahead of time exactly how big the final data should be, especially for sub-components being decompressed. It isn’t hard to see how a carefully crafted malicious file could describe an action to be taken by a component of the decompression that leads to an output datum bigger than expected, one that overwrites some other data structure. From there it is just hard slog on the part of the black hat to craft an exploit that manages to deliver a payload.
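A hedged sketch of the guard against this, using a toy run-length scheme invented purely for illustration: the decoder tracks how much output it has produced and refuses to let file-supplied counts push past the caller’s buffer - which also defuses the runaway “decompresses to infinite size” case described below.

```c
#include <stdint.h>
#include <string.h>

/* Toy run-length compression: pairs of [count][value]. */

/* Expand src into dst, never writing more than dst_cap bytes.
   Returns bytes written, or (size_t)-1 if the file would overrun. */
size_t rle_decode(const uint8_t *src, size_t src_len,
                  uint8_t *dst, size_t dst_cap) {
    size_t out = 0;
    for (size_t i = 0; i + 2 <= src_len; i += 2) {
        uint8_t count = src[i];
        uint8_t value = src[i + 1];
        if (count > dst_cap - out)   /* crafted counts can't push past cap */
            return (size_t)-1;
        memset(dst + out, value, count);
        out += count;
    }
    return out;
}
```

The point is that the decoder never believes the file’s own claims about sizes; every write is measured against space the caller actually provided.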

Many years ago I had a corrupt .z file that decompressed to an infinite size. Just a bad bit in the wrong place. But try to decompress it and it filled the file system with a file full of zeros.

Mostly because C is a terrible language for processing untrusted data, even for competent developers. All it takes is one tiny mistake (as small as a single missing, incorrect, or extra character) among tens or hundreds of thousands of lines of code to make a huge bug.
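For a concrete taste of “a single extra character”, consider this sketch (a classic off-by-one pattern, not from any particular codebase): one extra `=` turns a correct loop into an out-of-bounds write, and it compiles without complaint.

```c
#include <stddef.h>

/* Copy n pixels into a row buffer that holds exactly n bytes. */
void copy_row(unsigned char *row, const unsigned char *src, size_t n) {
    for (size_t i = 0; i <= n; i++)   /* BUG: '<=' should be '<'; one */
        row[i] = src[i];              /* byte is written past the end */
}                                     /* of 'row' on every call       */
```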

Then you have to consider the interactions between all the functions in a library or program. There may be thousands, each calling a dozen others. Good luck reasoning about how they’ll all interact under all possible inputs. It’s a massive problem space.

The lack of bounds checking isn’t the only problem with C. If user input, program code, control flow records, etc. were all stored in isolated areas, an overflow of user data would at worst mess up user data. But that’s not how it is with most systems. Rather, everything is commingled. A bug in a program can easily allow malicious user input to replace the program code in memory and take over. Improvements have been made recently to mitigate this (non-writable code segments, non-executable stacks, ASLR, etc.), but there’s a lot left to fix.

It’s also a social problem. People who write benevolent code tend to do it for a living or for fun and are mainly incentivized by their paycheck or the joy of creating new features for the world. You build something, test it reasonably for bugs, and move on to the next interesting project. Most coders of this sort don’t enjoy poring over the same few lines of code for a thousand hours in a row, and then doing it again for each and every library and compiler used to build their code, just to hunt for that once-in-a-lifetime security bug. And getting exhaustive third-party audits is both expensive and still no guarantee, in the mathematical-proof sense, that the code will be vulnerability-free.

On the other hand, malicious hackers and government entities ARE incentivized to find these rare bugs, because they get a payoff only if they DO find one. So they can dedicate all their time to the one target (or a small selection of targets), looking for even one error that the original programmers made.

It’s the difference between authoring a 1000 page novel and being that OCD guy who found the one punctuation error the editor missed.

Even JPEG is not some trivial thing. It’s not just “hey, display a block of red pixels and then a block of blue”… it has to deal with progressive decoding, interlacing, error correction, color spaces, EXIF extensions, and god knows what else. Each one of those features is probably written by a different team and integrated over decades, and a bug in any one of those components is a possible vector for steganographic malware. Hell, it might even actually display as a valid image with error-corrected noise - may or may not look like much to the human eye, but the JPEG decoder doesn’t know any better. It’s technically a “valid” image, except that one bug in that one library from 15 years ago means the next 80 bytes are read in too and accidentally executed via a buffer overflow…

TLDR: Shit is hard. Security is boring (to most). It’s easier to fix problems as they’re found rather than to exhaustively audit every single line of every single component of every single program.

That’s true for any language, and it isn’t the cause of the majority of vulnerabilities.
These are logic and implementation problems, not typos.

Yeah, it’s very easy to make a typo in C, but the usual result of that is a program that won’t run at all, or which runs but which gives absurd results even for ordinary inputs. The sort of mistake that opens up exploits is at the logic or algorithm level, not at the typography level.

Add to this that demand for programming skills is so high that even poor programmers get jobs doing important things. And basically no liability for bad software falls on the original authors, so there is very little incentive to put effective quality control processes in place.

A lovely place to find C code with, well, interestingly bad properties is the Underhanded C Contest.

Here the goal is to craft a C program that performs a task but is typically built with a very hard-to-find flaw that allows it to be exploited in a manner that undermines its base purpose. The site is well written, the contests all too believable, and the winning entries a geek’s total delight.

I don’t think this is completely true in the real world. There are plenty of exploits that work by deliberately creating conditions that could never occur without malicious action.

There’s no good reason (other than security) to spend the effort to have a ZIP-expanding program be able to deal with a .ZIP file that’s larger than the biggest file the OS’s filesystem can support. If the (non-maliciously-influenced) FAT32 filesystem is really reporting a .ZIP file is multi-terabytes in size, having the decoding program crash is probably the least of your worries.

Again, any program written today should be able to deal with even those kinds of inputs, because security is an issue. But I wouldn’t say that a programmer in 1988 was incompetent because they didn’t provide a graceful way to deal with a pathological file system.

And the point is, a lot of code written in 1988 is still being re-used, and so we’re still finding security issues that a modern competent programmer wouldn’t allow, but a 1988 programmer had little reason to avoid.