Why is 'decoding a data file format w/o succumbing to an exploit' apparently an intractable problem?

And of course it’s usually not up to the programmer how much time they can spend on such a pursuit. Once software basically works and the known issues are considered minor, it’s time to “put it in a box” and give the programmer something else to do (or fire her).

And IME even projects that have very thorough testing do not allocate time to trying to find exploits; they just look for issues during ordinary or extreme use, but not malicious use.

In a way it’s a surprise that more software doesn’t get tripped up by this kind of thing more often. And I think it’s because of pragmatism: if I wanted to decode a PDF, I’d use the latest version of an off-the-shelf library that has been tried and tested on millions of systems. In the unlikely event I “roll my own” decoder I’d either not support scary-sounding, open-ended features that the format specification no doubt allows, or at that point I really would think very carefully about security.

Many people are focusing on buffer overflows, but that’s just one type of bug or exploit. There are plenty of other more sophisticated ones. And eradicating them is a very hard problem.

One reason that bugs are so difficult to eradicate is that you can only fix a bug you know exists, and finding bugs is harder than you think. Testing software systems is not like testing physical systems because the behavior of physical systems is (generally) continuous with a few specific discontinuities, and the behavior of software is discontinuous.

If you have, say, an analog electrical circuit that expects 1V input, you could test it at 0.9V, 1V, and 1.1V of input, and if it works within spec for all of them, you can be fairly confident that it will work at 1.03V, and at 0.97V, and that there will be a relatively smooth response between those.

If you have a building model, and you test it with 10mph winds and 80mph winds and it behaves as expected, you can be reasonably sure that it’s also going to be safe at 11mph and 43mph winds.

Those aren’t hard and fast rules. You can certainly design very complicated electronic circuits that don’t respond continuously (we call them computers!), and material things can have surprising behavior in some circumstances. But most systems are not chaotic in the chaos theory sense of the word, where a minute change in inputs results in a dramatic difference in output.

That all goes out the window with software.

If you have a software system and you test it with inputs 0x0001 and 0xF000, you have determined only what it does with those inputs. You have zero information about what it might do if you give it 0x0002, or 0xF3A2. Now, for incredibly simple systems, you can simply test all possible inputs. But for any system complicated enough to be useful in the real world, you can’t do that.
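
To make that concrete, here’s a contrived C sketch (mine, not taken from any real decoder) where two “nearby” inputs take completely different paths; testing a couple of values tells you nothing about the one value hiding the bug:

#include <stdio.h>
#include <string.h>

/* Hypothetical parser: fine for almost every input value, but one
 * specific 16-bit value takes a forgotten branch that smashes a
 * fixed-size buffer. Testing 0x0001 and 0xF000 proves nothing
 * about 0xF3A2. */
void handle(unsigned short code)
{
    char buf[8];
    if (code == 0xF3A2) {
        /* the "discontinuity": an overlooked special case */
        memset(buf, 'A', 4096);   /* writes far past buf: the bug */
    } else {
        snprintf(buf, sizeof buf, "%04x", (unsigned)code);
    }
    printf("handled %s\n", buf);
}

int main(void)
{
    handle(0x0001);   /* fine */
    handle(0xF000);   /* fine */
    handle(0xF3A2);   /* stack smashed; behavior is undefined */
    return 0;
}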

There are all sorts of sophisticated methods of testing software to try to get around this. I don’t want to imply that it’s not possible to test software. But it’s never going to be easy as testing something that must comport itself according to the laws of physics, because its range is the entirety of computable mathematics.

Hardly trite at all. Look again at the LZW bug that affected many compression packages, including .gif decompressors. This is a remarkably simple (and beautiful) algorithm. I’ve taught it many times, hardly taking up 20 minutes of class time. (Less time than it takes to explain the older variable-length Huffman coding.) The code isn’t that complex or long, had been around for several years, and had many people looking at it, but it took someone with a particular way of looking at things to realize it has a bug.

In particular: “It certainly is possible to design decoders that are immune, to a reasonable degree of certainty, to exploits.” is so naive I can’t even imagine your mindset towards real computer programs.

Me: “It certainly is possible to design decoders that are immune, to a reasonable degree of certainty, to exploits.”
Someone: " … naive …"

Please reread mine. I don’t claim it is easy or simple, merely that it is possible. I’ll assume we’ve all written codecs. If you spent X hours building a codec you were proud of (but not willing to certify as unexploitable), don’t you think you could feel a high level of certainty if you spent 2X hours total, investing the extra effort in scrupulous checking of pointers and other suspect tokens? (And/or, refactoring to support that purpose.) If not, would 3X still not be enough? :confused:

Walrus: “Most exploits operate by creating a wrong pointer … There are plenty of other more sophisticated ones.”

I hope you start an IMHO thread about the more sophisticated exploits. It’s obvious we have many very top-notch computer programmers here. I’d like to learn.

AFAICT the recent nefarious LZW trojan relies on typical buffer mismanagement, in code that could have been but wasn’t written as
ptr += offset;
if (Debug)
    assert(ptr >= LOW_Buff && ptr < HIGH_Buff - Fudge);

Note that simply turning on Debug would eliminate the trojan threat, i.e. replace it with an “ordinary” Abend. (In practice, you’d invoke a buffer_needed(…) and it would decide whether to abend.)

My point here is that you needn’t treat every line of code as Schrödinger’s cat – simple caution when manipulating buffer pointers would fix a very large share of bugs.
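
A minimal C sketch of that kind of caution (the names are invented for illustration): check an untrusted length field against what actually remains before copying anything.

#include <stddef.h>
#include <string.h>

/* Illustrative only. Copy a chunk whose length comes from untrusted
 * input, refusing anything that would run past either the input or
 * the output buffer. */
int copy_chunk(unsigned char *dst, size_t dst_left,
               const unsigned char *src, size_t src_left,
               size_t claimed_len)
{
    if (claimed_len > src_left)   /* input lies about its own size */
        return -1;
    if (claimed_len > dst_left)   /* would overflow our buffer */
        return -1;
    memcpy(dst, src, claimed_len);
    return 0;                     /* caller advances both pointers */
}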

I don’t think there is any inconsistency between the idea that it is very difficult to make perfectly bug-free software, and the idea that most of the software bugs we encounter are the kind that could have been avoided with a modicum of extra attention.

One issue is that many of these file formats were created well before security was an issue. Who would have thought that PDF files could be exploited when they first came out?

Seems to me the failure there is including anything about debug. It isn’t enough to test this stuff on captive data in the lab and then assume it’s bulletproof against malicious data in the wild.

You really need to write code with more of a taint-aware attitude (see Taint checking - Wikipedia). When you’re doing something like a codec, 100% of the input is user controlled and 100% of the internal state of your program is subject to taint. You must have every check run every time in production builds running in production mode. Debug builds have no place in this.

IOW, every pointer is checked for over/underflow after every manipulation of the pointer value and before the pointer is dereffed. Yes, that means about 80% of your CPU cycles are pointer bounds checking. Tough. That’s a necessary, but by no means sufficient, condition to avoid being lured by malicious data.
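
A minimal sketch, in C, of what an always-on check might look like; the helper name and exact shape are invented for illustration, not prescriptive:

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Always compiled in: no #ifdef DEBUG, no assert() that vanishes
 * under NDEBUG. Assumes ptr already lies in [low, high). Validates
 * the offset against the bounds BEFORE doing the arithmetic, then
 * returns the advanced pointer, or aborts. The result stays strictly
 * inside [low, high) so it is safe to dereference. */
static unsigned char *advance_checked(unsigned char *ptr, ptrdiff_t offset,
                                      unsigned char *low, unsigned char *high)
{
    if (offset >= 0) {
        if (offset >= high - ptr) {          /* would run past the end */
            fprintf(stderr, "buffer overrun attempt\n");
            abort();
        }
    } else {
        if (-offset > ptr - low) {           /* would run before the start */
            fprintf(stderr, "buffer underrun attempt\n");
            abort();
        }
    }
    return ptr + offset;
}

Unlike the Debug/assert version quoted earlier, nothing here compiles out of a release build.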

OK, maybe it is possible to eliminate 90% of bugs (counting security holes as a kind of bug) with just a little extra effort. And in fact, many programmers and companies do expend that little bit of extra effort, and do in fact eliminate 90% of bugs. But that still leaves 10% that get through. And maybe with some amount more effort, you can eliminate 99% of bugs. But that still leaves 1%. And to eliminate 100% of bugs, well now, that’s a lot of work.

Right, it’s the same reason we’ve had two space shuttle disasters. When you’ve got a million things that can go wrong, and each one has a one in a million chance of failing, you’ve got a pretty good overall chance of some failure or another. Yeah, once you do have a failure, you look closer at the cause, and fix up the O-rings so they’re much better and now will only fail one time in a hundred million… but meanwhile, you’ve still lost a shuttle, and you haven’t done anything about falling ice damaging the heat tiles.

The space programme is a rich source of examples of failures. I used to delight in using a number of them when I taught software engineering. However, both Shuttle losses were not engineering failures in the sense of unknown, unexpected failures of a complex system. Both were known, well documented, and continuing failures that were understood and known to potentially lead to a loss of the craft. Both involved failures for which the original mission rules required grounding the fleet until they were fixed. In both cases, management and institutional culture failures allowed flights to continue, with the inevitable result.

The best example of a spacecraft software failure is the first Ariane 5 flight. That is a complex interaction of a number of software systems and designs that exemplifies the Swiss cheese model of failures. In the end it was arguably a failure in the requirements specification of the system that caused the loss. But $400M of rocket was destroyed a minute into flight due to a flaw that had evaded review and testing, and came from a code base that had successfully flown Ariane 4 rockets for years. Whilst there were a lot of contributing issues, what killed the launcher was missing a critical rule of embedded real-time software: never throw an exception that you can’t catch. But the path to this failure was long and complex.

It is easy to pick on simple parts of security failures in software, such as buffer overruns, but the opportunities for failure are huge. Programming in more secure languages helps; for instance, languages where it is impossible to create or manipulate raw pointers make such exploitable code very hard to write in the first place. Systems that segregate pointers, code, and data help greatly. It isn’t as if these are new ideas. But flaws exist in the most arcane places. The work of the black hat is partly finding the flaw, and then working out how to escalate the flaw. The latter is as much an art as the former. Java is intrinsically resistant to pointer-smashing exploits as a language. But that doesn’t mean there haven’t been many flaws in the underlying implementations that have let malware in.

To stretch our warehouse package analogy a bit further -

The incoming parcel is so large, it extends all the way to the back of the warehouse, knocks a hole in the back wall into the office, and the back end lands on the boss’s desk, knocking the instruction manual to the floor. Employees come in to consult the manual to see what to do next; they reach under the package and pull out a manual, but it’s the bogus manual that came in attached to the underside of the package. Now they are following the package sender’s malicious instructions.


There are so many holes because originally PCs (and even big computers) were designed without a thought to a world of malicious and anonymous attackers. Originally (and still) PCs read what’s on the boot track of the disk to load; the first interesting virus was a program that insinuated itself onto the boot sector, carried by floppy from PC to PC. Email is so spam-friendly because nobody imagined people would have private machines connected to the email network that would lie about who they were or where the email came from.

IBM went a lot farther, because their mainframes were used by businesses, by hundreds of users per system, and were more likely to be targets of financially motivated shenanigans. But compare the complexities of token ring networking to Ethernet:
- Make this a ring, pass a token, and whoever has the peace pipe (I mean, token) can send.
- What if the ring is broken? We’ll make it a double ring so failures cause loopbacks.
- What if the token is lost? Then we’ll… and so on.
Just yelling on the wire and waiting for a response seems a lot simpler; but then there’s MAC address spoofing, ARP tricks, etc.

Regarding JPEG exploits: here’s one exploitable hole in XP and such from back in 2004.

Our old friend the buffer overflow fairy makes another appearance.

One problem with this one was the affected dll might appear in several places on a computer as it was often included on its own with several software packages. You had to search for and replace/update each one.

Note that seemingly simple things like buffer overflows are sometimes incredibly hard to spot. It’s not like you can scan thru and look for a handful of memory allocation/filling types and see if each one is safe. Some are well disguised because a basic examination might not realize that any memory allocation/filling is being done at all! Stuff can be hidden behind layer after layer after layer of function calls.
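
A contrived C illustration of that layering problem (nothing here is from real code): the top-level function a reviewer reads never appears to touch memory at all.

#include <string.h>

/* Layer 3: the actual copy, with a fixed-size buffer and no length check. */
static void store_name(const char *name)
{
    static char current_name[32];
    strcpy(current_name, name);        /* the real bug lives here */
}

/* Layer 2: looks like harmless bookkeeping. */
static void register_record(const char *name)
{
    store_name(name);
}

/* Layer 1: what a reviewer actually reads. No allocation, no copying
 * in sight; the overflow is two calls down. */
void handle_record(const char *untrusted_name)
{
    register_record(untrusted_name);
}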

Code can be astonishingly complicated. Some serious errors have been discovered in mainstream OSes due to their effects, but years go by without patching because no one can figure out what piece of code is causing it. And even then, fixing the code might be difficult to impossible without causing new errors.

I disagree with this as well. Suppose that I have a program dealing with a chunk of data arriving over an incoming link. For the sake of argument, we’ll assume that the data is the most dangerous possible kind–executable code (that we plan on executing). We’ll also assume that the original source of the data is absolutely trusted.

Like any reasonable person would, I assume that the data might have been corrupted in transit. So I add a high-quality hash to verify that there’s been no corruption. I’ve been very careful in my implementation of the hash and there are no exploits.

However, I did not anticipate actual malice. There is a subtle but crucial difference between a “normal” hash function vs. a cryptographically secure function. There are probably no more than a hundred people on the planet that can reliably tell the difference, and I’m not one of them. Although my hash function is highly resistant to all kinds of random errors, it is not resistant to a targeted attack.

Of course, the industry as a whole gains best practices over time, and one of these practices is to use a known-secure hash function like SHA-2. But this is a moving target; MD5 was once considered secure and now it is completely broken. MD5 is still a very effective hash as long as attack resistance is not relevant. The only justification for more advanced hashes is the anticipation of malice.
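
To make the contrast concrete, here’s a C sketch, assuming OpenSSL’s libcrypto is available (that dependency is my choice for illustration, not something from the posts above): a simple additive checksum catches line noise but is trivial for an attacker to satisfy, so the malice-aware version uses SHA-256.

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <openssl/sha.h>   /* assumes OpenSSL is installed */

/* Fine against random corruption, useless against an attacker:
 * anyone can tweak a payload until the sum matches. */
uint32_t additive_checksum(const unsigned char *data, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}

/* The malice-aware version: compare against a SHA-256 digest obtained
 * over a trusted channel. Returns 1 if the data matches. */
int verify_sha256(const unsigned char *data, size_t len,
                  const unsigned char expected[SHA256_DIGEST_LENGTH])
{
    unsigned char digest[SHA256_DIGEST_LENGTH];
    SHA256(data, len, digest);
    return memcmp(digest, expected, SHA256_DIGEST_LENGTH) == 0;
}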

Buffer overruns are not characteristic of many languages, and are the cause of the majority of vulnerabilities.

But ZIP, like PDF, contained a specific vulnerability by design: it was designed to run an installation script if the zip contained one. That was, at one point, a vector for problems with zip files.

Maybe true, but buffer overruns are not caused by typos!

I’m only familiar with OS X development, but Apple’s sandboxing of apps for iOS and the desktop app store seems to be fairly robust. An app that reads image files would only have entitlements to open files directly chosen by the user through a standard dialog box. Try to read any other files on the system through malicious code in the image and you’ll just get a “permission denied” error, or the sandboxed app will be terminated. It’s not perfect and there have been a few high-profile exploits, but Apple is pretty good at staying on top of them.

As of now iOS 9.2 still has no jailbreak available and iOS 9.3 is now available. I’m not aware of any “jailbreaks” being available for OS X desktop apps.

The best feature of Apple systems or Linux is that they have been developing and refining the same systems for decades, so with each iteration old holes get corrected. Microsoft has suffered from the flaw that every few versions, someone decides they need to rewrite the core from scratch; as a result, instead of building on old code and making it better, they introduce whole new areas where failures can happen.

I love this. Windows sucks because it’s full of crufty old code, and it sucks because Microsoft keeps throwing things out and starting over? Which is it? It can’t be both!

oh, and as far as Linux goes, being open source didn’t let anyone catch the Heartbleed bug in OpenSSL. Nor did “many eyes” catch this.

Well, that would be if it were all up to the application programmer…
If buffer overrun exploit prevention were done by the system programmer, then the cost would be far, far less than 80%… more like only a few percent of overhead.

That requires the compiler author to come to the party and keep buffers out of the stack, putting them into a suitably safe location, e.g. simply moving them to sit after the stack. Then even without bounds checking, the buffer is nowhere near the stack, and a buffer overflow attempt would have to generate gigabytes of extra “data” before anything got written into code or stack. It would still be possible, say by sending zip data that expands into gigabytes; the exploit might even avoid a segfault (an access to an unmapped address) by filling the whole address range until its data lands on code beyond the overrun.
And then, if the operating system programmer came to the party and made sure that the program’s data is surrounded by unmapped pages, the buffer exploit will always hit the honey pot, the unallocated page, and segfault: hey, your program is exceeding its data space. That could be a buffer exploit or a bug in the program; either way, segfault!
What I am saying is that Intel has done enough to prevent these exploits; the OSes don’t, because the compilers don’t…
There’s inertia to changing all the OSes (Windows, Linux, etc.) and compilers, because of the way this would crash existing programs.
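
For the curious, the surround-the-data-with-unmapped-pages idea can be sketched today on a POSIX system with mmap and mprotect; this is only an illustration of the mechanism, not a design for how compilers and OSes would actually adopt it:

#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocate one usable page with an unmapped (PROT_NONE) guard page
 * directly after it. Any overrun past the end faults immediately
 * instead of silently trampling other data. */
static unsigned char *alloc_guarded_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *base = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    /* Revoke all access to the second page: the guard. */
    if (mprotect(base + page, page, PROT_NONE) != 0) {
        munmap(base, 2 * page);
        return NULL;
    }
    return base;                 /* caller gets 'page' usable bytes */
}

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *buf = alloc_guarded_page();
    if (!buf)
        return 1;
    buf[page - 1] = 0;           /* last legal byte: fine */
    buf[page] = 0;               /* one past the end: SIGSEGV, not an exploit */
    return 0;
}

Debug allocators like Electric Fence use essentially this trick.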