Why can't you decompile binaries?

So you write a program and compile it into a binary. The binary can be run on a computer and it will do exactly what you told it to do while writing the program. If you compile the program fourteen thousand times, you’ll have fourteen thousand identical binaries.

Given this, why can’t you take a binary and find out the source code? If you could, the whole open/closed source thingy would be moot, so apparently you can’t, but no-one has been able to explain to me why.

The mapping of source code to binaries is not one-to-one. Many different sources can result in the same binary file. Also, the main benefit of having source code is access to the comments and the variable names, and both of these are lost when you compile something.
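To make the many-to-one point concrete, here is a minimal sketch. It uses CPython's built-in compile() as a stand-in for a native compiler, which is an assumption made purely for illustration, and the names price, tax and total are made up. Two differently written sources produce the same instruction bytes, so nothing in the compiled output can say which one the author actually typed.

```python
# Two sources that differ only in comments, spacing and parentheses.
source_a = "total = price + tax  # add sales tax\n"
source_b = "total=(price)+(tax)\n"

# compile() turns source text into a CPython code object (bytecode).
code_a = compile(source_a, "<a>", "exec")
code_b = compile(source_b, "<b>", "exec")

# co_code holds the raw instruction bytes; the comment and layout are gone.
same_instructions = code_a.co_code == code_b.co_code
print(same_instructions)
```

Python bytecode happens to keep variable names, so this only shows the loss of comments and layout; a native compiler typically discards the names as well.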

So what is reverse engineering?

You can. This is called (surprise) decompilation.

It’s not very useful because high-level programming languages do a lot to make programming easy for humans, and these conveniences are lost when the program is compiled to machine code. Decompiling a binary will produce syntactically valid source code, but all your variable and function names will essentially be numbers rather than words, and there will be no organization of code into separate, reasonably-sized files. Indeed, there’s no guarantee that the decompiled code will be in any particularly useful order.

Consider that all your favorite high-level languages have various looping constructs that all really do the same thing, but work slightly differently to make it easier on your brain. In machine code, these all just get rendered into a series of steps with some conditionals and GOTOs, and that’s what they’ll be when it’s decompiled. The information about what type of human-readable loop that was has been lost.
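You can watch a loop turn into conditionals and jumps with a small sketch like the one below. It assumes CPython and its standard dis module, and uses Python bytecode as a stand-in for machine code; the function count_down is a made-up example. The exact opcode names vary between Python versions, so the code only looks for the word JUMP rather than for specific instructions.

```python
import dis

def count_down(n):
    # A perfectly ordinary, human-readable while loop.
    while n > 0:
        n -= 1
    return n

# List the names of the bytecode instructions the loop was compiled to.
opnames = [ins.opname for ins in dis.get_instructions(count_down)]

# The loop survives only as conditional and unconditional jumps.
jumps = [name for name in opnames if "JUMP" in name]
print(jumps)
```

There is no "while" instruction in the output. A decompiler has to guess the loop shape from the jump pattern, and a for loop written to do the same job could be guessed just as plausibly.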

Then there are compiler optimizations. Constants get folded, and there’s no way for the decompiler to know that 434.577893928 is the Frobnating Hoozit Coefficient as opposed to any old float value.
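Constant folding is easy to see in miniature. The sketch below again assumes CPython, whose compiler folds simple constant arithmetic; the name seconds_per_day is made up for the example. The source spells out how the number was derived, but the compiled code only holds the result.

```python
# The source documents where the number comes from: 60 s * 60 min * 24 h.
source = "seconds_per_day = 60 * 60 * 24\n"
code = compile(source, "<example>", "exec")

# co_consts lists the constants stored in the compiled code.
print(code.co_consts)

# The folded product is stored directly; the arithmetic is not performed at run time.
folded = 86400 in code.co_consts
print(folded)
```

Anyone reading the compiled form sees 86400 and has to work out for themselves that it was ever 60 * 60 * 24.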

And of course there won’t be any comments.

So, you can imagine why decompilation doesn’t yield particularly useful results.

That said, decompilers exist, and have for decades. You get “a” program that might have produced the binary, rather than “the” program. You generally won’t get comments, and depending on the binary, might not even get variable names (I’m including things like .NET CLR bytecode in my definition of “binary” here).

Their usefulness depends on what you’re trying to do. Remember that many people can read assembly language, and turning a binary into assembly is trivial (most development tools have a disassembler built in). For deconstructing an algorithm or something, that’s enough.

For whole programs, what would be the point? The program’s protected by copyright, and these days maybe patent, trademark, and DMCA laws, so it’s not like you could use the resulting code for much. In an emergency, you could use it to recover code for some application you had only binaries for, but in nearly every case you’d be better off just rewriting it.

Reverse engineering is the process of examining what a device (or program) outputs in response to various inputs, so that the internal functioning of the device can be determined. That’s a whole separate issue and doesn’t really have anything to do with decompilation.

Reverse engineering is when you study the outward actions of a mechanism and design another mechanism that duplicates them exactly, without necessarily knowing how the original mechanism achieves the actions.

You can decompile a binary, although many license agreements prohibit it. You can find out “a” source code, but not “the” source code.

A programmer might have written:



void GetName(integer ID)
{
   boolean success;

   // do some stuff here
   success = retrieveString (database, ID);

   if (success)
     doSomeMore();
}


If you compile it and decompile it, you might get:



void p1(integer i1){boolean b1;b1=p2(s1,i1);if(b1)p3();}


It’s like that old joke about translation from English to Russian and back. The original phrase was, “The spirit is willing but the flesh is weak,” which came back as “The wine is good but the meat is spoiled.”

Yes. Reverse engineering (RE) treats the product of interest as a ‘black box’, whose contents are unknown, and seeks to create another product that behaves the same way externally (exhibits identical reactions to external stimuli, for example). It is irrelevant whether it works the same way internally.

I believe that, for legal reasons, RE teams may need to be able to demonstrate that they could not have known anything about how the product of interest works internally.

If the black box is a program, decompilation would seek to take it apart and figure out how it works internally–precisely what these RE teams are forbidden from doing.

The comments above address most of the issue. Code that is easy to read and maintain is largely about structure, and decompiling loses all of the structure.

I’ve never used a decompiler, but I have had occasion to use a disassembler, which produces an assembly language output instead of a high-level language output.

One stumbling block for such processes is that embedded data, such as a lookup table, can be really difficult for the diswhatever to distinguish from actual code. In many cases a human must assist the program in working this out. Many machines use variable-length instructions, with the first word of the instruction being the key to determining how many successive words belong to that instruction. If the disassembler or decompiler gets “out of phase” with the instructions for any reason (like the embedded data issue mentioned above), then most, if not all, of the rest of the output is garbage.
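Here is a minimal sketch of that “out of phase” problem. The four-opcode instruction set is entirely hypothetical (it is not any real machine), and the disassembler is the simplest possible kind, a linear sweep that decodes one instruction after another. The program jumps over a one-byte lookup table at offset 5; the sweep doesn’t know that and decodes the table byte as an opcode.

```python
# Hypothetical toy instruction set: opcode -> (mnemonic, total length in bytes).
OPCODES = {
    0x00: ("HALT", 1),
    0x01: ("NOP", 1),
    0x02: ("LOAD", 2),  # LOAD imm8
    0x03: ("JMP", 3),   # JMP addr16, little-endian
}

def linear_sweep(image, start=0):
    """Decode instructions back to back from `start`, ignoring control flow."""
    listing = []
    pc = start
    while pc < len(image):
        opcode = image[pc]
        # Unknown bytes are shown as one-byte "???" entries.
        name, length = OPCODES.get(opcode, ("???", 1))
        operands = bytes(image[pc + 1 : pc + length])
        listing.append((pc, name, operands))
        pc += length
    return listing

image = bytes([
    0x02, 0x2A,        # 0: LOAD 0x2A
    0x03, 0x06, 0x00,  # 2: JMP 0x0006 (skips the table)
    0x02,              # 5: DATA, a one-byte lookup table, not code
    0x02, 0x03,        # 6: LOAD 0x03
    0x02, 0x00,        # 8: LOAD 0x00
    0x00,              # 10: HALT
])

naive = [name for _, name, _ in linear_sweep(image)]
guided = [name for _, name, _ in linear_sweep(image, start=6)]
print(naive)   # decoding straight through the data byte
print(guided)  # decoding from the real code at offset 6
```

The naive pass treats the table byte as a LOAD, swallows the first byte of the real instruction at offset 6, and then misreads what follows as a JMP. Starting at offset 6, which is what a human who has spotted the table would tell the tool to do, recovers the real LOAD, LOAD, HALT sequence.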

In my case I was lucky: I had a paper listing from the original developer, although they had lost any and all electronic copies. I thus had access to all the comments, which was helpful even though German is not my mother tongue.

Well, I’ve read a lot of complaints about how closed source programs are distributed only as binaries, making it impossible to make alterations according to your own needs and tastes, so I imagined that would be the point. Having read this thread I understand why it would be impractical.

Thanks, guys.

You can take a binary and, from it, produce source code that, when compiled, would produce essentially the same binary.

However, a big part of software engineering is making sure that your source code is clear, readable, organized, documented, and so forth. Programming languages, especially the so-called ‘object-oriented’ ones, contain a lot of high-level organizational syntax (beyond simple comments) that is used only at compile time for error-checking, and then discarded when producing actual binaries. The output you get from a decompiler has none of this.

Think of it this way: let’s say you have a blueprint for a new airliner your company is producing - this would be the source code. This is organized in terms of systems: structural components, wing framework, engine fuel supply, air conditioning, air conditioning backup, hydraulic master pump drive cylinder B, etc. It’s well-organized, cross-referenced, indexed, documented, you name it. There are comments from engineers describing what each piece does, why it was placed where it was, and so forth.

You have a team of robots build a prototype of this plane. They ignore all your documentation: all they care about are the physical connections between components - wires, welds, and so forth. This is essentially the compilation. For rhetorical purposes, we’ll just say there are no labels on anything.

Your competitor steals the prototype, and has a team of his robots disassemble it - decompiling it. However, without access to your original plans and documentation, it will be incredibly difficult for him to figure out anything about the plane, even if his robots produce an exhaustive list of what is connected/welded to what. If he finds a tank and a pump, he won’t know what it’s for without tracing every single connection going into and out of it - is it a fuel tank? Engine oil tank? Drinking water tank? Hydraulic fluid tank? Backup hydraulic fluid tank? Air tank? Sewage tank?

With enough resources, he could eventually reverse-engineer the entire plane, by tracing every single wire and figuring out what it’s supposed to do. Even then, when making repairs or improvements, he won’t have the experience of the engineers who designed the thing (the comments in source code), who might have discovered that you can’t route wire A through joint B because the vibration will fray it.

That said: it is trivial to ‘decompile’ some languages. Obviously, some are not distributed in compiled form at all: Python, Perl, JavaScript. You can just open up the file and look at the source. Others are compiled, but into a pseudo-assembly language that preserves a lot of the structure of the source code. Java is one of these, and there actually exist programs (called obfuscators) that dick around with compiled Java files in order to make it more difficult to get anything meaningful when decompiling them.

One of them is called Zelix Klassmaster.

To put it more simply than the previous explanations, you can’t decompile fully because information has been lost.

An example.

I once worked on a program that needed a small change to it. The original source had been lost, but was approximately 30 KB. The decompiled version which I was given to work on had all the loops, macros and high-level commands expanded out to their lowest-level equivalents, the variables had names like VI0001 (variable integer 1), etc. The file was about 1 MB. The change took me about a minute to actually make, but it took me more than 2 weeks to locate the place to make it. Even then it was only this quick because the program had been previously worked on and sorted out to some extent.

This experience finally pushed the powers that be into approving a full rewrite of the entire 50+ program suite, as they could see that maintenance costs were going to become outrageous.

I worked at a shop where they were using the proto version of the ILBABND0 subroutine. (The shop had moved from the ancient pre-VSE edition of DOS to the ancient pre-MVS version of OS at around the time that IBM was just getting their COBOL to work on OS and there was no equivalent ILB fuzzy wuzzy at that time.) In the midst of an audit, we discovered that the 28-year-old source code had been lost. I browsed the link-edited code with my yellow card in one hand and re-created the source code. (Of course, the original was in BAL, not COBOL or FORTRAN, which made it easier.) The shop had used the program (they called it ABEND) for so long that no one there even knew that IBM had ever created the ILBABND0 subroutine. When one of the tech guys from my home company heard what I had done, he told me about ILBABND0, we dragged up its code and discovered that my source only differed from the original by a single SVC, and the one I had used worked just as well as the official one.

At about the same time, we were installing a new version of IMS and needed to confirm the validity of the MFS library. He created a disassembler to read the modules and produce source code, which we then ran through a compare program to discover whether there were differences between the executed code and the source code. (Using bad source code to re-assemble all the MFSs would have been a BAD THING.)

Also illegal, in the case of most such programs.

Cite for “illegal”? I think you may mean “in violation of contract”, instead.

All of what has already been written is true but I’d like to add one more note…

For development purposes, programmers usually produce and work on a “debuggable executable”, which contains the necessary information (called a symbol table) to provide more context to the machine code being executed, including descriptive function and variable names, source code file names and whatnot.

When compiled software components (.EXE files, or .DLLs, or what have you) are released, the symbol information is nearly always “stripped out”, not only for security reasons, but also because release builds are normally produced by optimizing compilers, which reorganize machine code to be maximally efficient for the target hardware platform, and are usually built or shipped without the symbolic information, which the running program does not need. Most commercially released software wants to be as small and as quick for the user as possible.

However I have personally seen commercial software components in the past that were (probably accidentally) distributed with the symbol table included for some reason – in which case using a standard Windows IDE tool allowed me to see an awful lot of technical details.
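As a rough illustration of what a symbol table buys you, here is a minimal sketch. It assumes CPython, whose code objects happen to keep local variable names in an attribute called co_varnames, and treats that as a stand-in for the symbol table of a native debug build; the function get_name and the v0, v1 naming scheme are made up for the example.

```python
def get_name(record_id):
    # Descriptive names, as the programmer wrote them.
    success = record_id > 0
    return success

# With symbols: the compiled function still knows its variable names.
symbols = get_name.__code__.co_varnames
print(symbols)

# Without symbols: all a tool could report is "variable number 0, 1, ...".
stripped = tuple(f"v{i}" for i in range(len(symbols)))
print(stripped)
```

The instructions are the same either way; only the side table of names differs, which is why a component shipped with its symbols by accident is so much easier to read.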

OK, that’s more accurate. It’d be a breach (violation of contract law) and clearly a civil legal matter, not criminal.

I’d add that modern optimizing compilers often perform major transformations on the code to make it run faster. The resulting code, while more efficient, can be very difficult to understand or to reverse engineer.