Software engineers: what does reverse engineered code look like?

One of my early assignments in my first job was to disassemble a small program's binary image into something halfway readable to the programming staff. The company had a disassembly program that turned the binary into machine-language instructions. My role was to try to figure out the logic and write comments. The instruction set (DEC PDP-8) was small, but it was tedious work. And, I always thought, of marginal value.

How is this done in 2021? Or is it?

P.S. I was conversant with PDP-8 assembly.

If you disassemble or decompile machine code using a tool like Hex-Rays, it will look like assembly language or higher-level (e.g. C) code, so in 2021 you don’t have to decode binary machine language by hand.

But! For legitimate reverse engineering you may need to employ the clean-room technique (https://en.wikipedia.org/wiki/Clean_room_design), where the person implementing your reverse-engineered code never touches the original code and therefore does not actually know what it looks like.

Hex-rays page with sample output:

Today’s optimizing compilers take nice readable code and turn it into spaghetti gibberish. Disassembling spaghetti gibberish results in code that looks like spaghetti gibberish.

Back in the PDP days, you could predict what the compiler was going to do. Global variables were in the program’s data segment. Local variables were pushed onto the stack. C functions resulted in assembly-code calls to subroutines. It was all neat and predictable.

The few times that I have gone digging through a modern compiler’s output, I have been surprised to find things like variables that were never written out to memory at all. Instead, they were just floated along in registers as the program went down through the function. I have also seen function calls completely eliminated and replaced with all inline code. If you disassemble code like that, you aren’t going to end up with anything even remotely resembling the original source code.
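For instance (a made-up sketch, not output from any particular compiler or decompiler, and the exact result depends heavily on flags), source like this:

```c
/* Original source: a named local and a small helper function. */
static int scale(int x) { return x * 3; }

int total(const int *a, int n) {
    int sum = 0;                /* at -O2 this typically lives only in a register */
    for (int i = 0; i < n; i++)
        sum += scale(a[i]);     /* the call is usually inlined away entirely */
    return sum;
}
```

might come back from a decompiler looking more like this, with the helper gone, the local never touching memory, and only generic names left:

```c
/* Plausible decompiler reconstruction of the optimized binary (illustrative only). */
int FUN_00101149(int *param_1, int param_2)
{
    int iVar1 = 0;
    int iVar2;
    for (iVar2 = 0; iVar2 < param_2; iVar2++)
        iVar1 += param_1[iVar2] * 3;
    return iVar1;
}
```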

If you want to reverse engineer how something works, you need to take a step back from the code and look at it more from a data flow point of view instead of focusing on the details of how exactly the data gets manipulated.

If you are really lucky, the code was compiled with debugging symbols included. Knowing things like function and variable names is a huge benefit when you are trying to figure out the purpose of a section of code.

In general, reverse engineering and decompiling code is just something people expend a fair bit of energy to avoid having to do nowadays: code repositories, version control, and the use of standard libraries and components mean it’s easier to work forward from a known point in the development history than to work backwards from the currently deployed product.

When I hear reverse engineering, I don’t think of decompiling at all. I think of looking at what outputs a program produces given particular inputs, and figuring out what it’s doing based on that, and then re-implementing something that does the same thing.
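A minimal sketch of that approach (all names and numbers are made up; mystery_checksum() just stands in for whatever opaque routine you’re studying, and is defined locally only so the example compiles):

```c
#include <stdio.h>

/* Pretend this is a black box you can call but can't read. */
static unsigned mystery_checksum(unsigned x)
{
    return (x * 31u + 7u) & 0xFFu;
}

int main(void)
{
    /* Feed it simple, structured inputs and tabulate the outputs. */
    for (unsigned x = 0; x < 8; x++)
        printf("f(%u) = %u\n", x, mystery_checksum(x));
    /* The outputs go 7, 38, 69, 100, ... stepping by 31 each time
     * (and wrapping at 256 for larger inputs), which is enough to
     * guess f(x) = (31*x + 7) mod 256 and reimplement it. */
    return 0;
}
```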

Thanks for the responses. Let’s try a different question. engineer_comp_geek says his experience indicates decompiled code is radically different from the original source. If true, would anyone these days really want to decompile anything? Or is the better approach the one suggested by Chronos, i.e., determine what it does and write new code?

Even though I was working on a program originally written in PDP-8 assembly code, I’d go with Chronos.

You might look at the experiences of some dude named Linus Torvalds. Just For Fun (the name of his book) has a light-hearted look at him creating Linux based on the PDP-11 (?) Unix documents.

What @Chronos described pretty much is reverse engineering. If you have the manual, that’s great. If not, or if there are undocumented features, then disassembly, decompilation, data-flow analysis, etc. are simply tools that can help you determine what the code does.

A good disassembler will recognize library calls and library code.

I concur that the need and opportunity for disassembly are less than they once were. It used to be common, but now the only people I know of doing disassembly are doing virus or security research.

The results you get from decompiling code vary wildly, but in general you will get the assembly code (easy) and a reconstruction of the code in a higher-level language, often C. This screenshot shows an example from Ghidra, a free and very popular reverse engineering suite:

https://en.wikipedia.org/wiki/Ghidra#/media/File:Ghidra-disassembly,March_2019.png

You can see the series of disassembled instructions with automated annotations, as well as variables. In the right window, which shows the code reconstruction, note that while you can see equivalent code, it lacks all of the original variable names, structure, and other metadata. This is typical of decompiled code and a big impediment to understanding it. It is a much more significant task to annotate and understand the decompiled code than it is to generate it.
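To give a feel for what that loss looks like (a hand-written illustration, not taken from the screenshot above), compare a struct access in source with how a decompiler typically presents it back to you:

```c
#include <stdlib.h>

/* Original source: meaningful names and a real struct type. */
struct point { int x, y; };

int manhattan(const struct point *p)
{
    return abs(p->x) + abs(p->y);
}

/* Roughly how a decompiler shows the same function: the struct type,
 * the field names, and the function name are all gone, leaving generic
 * identifiers and raw offsets for you to annotate by hand. */
int FUN_00401020(int *param_1)
{
    int iVar1 = *param_1;       /* was p->x */
    int iVar2 = param_1[1];     /* was p->y */
    if (iVar1 < 0) iVar1 = -iVar1;
    if (iVar2 < 0) iVar2 = -iVar2;
    return iVar1 + iVar2;
}
```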

If I remember correctly, professionals who want to reverse engineer some proprietary software will have a “dirty” team that does the disassembly and packet sniffing, or even looks at the ROM chip under a microscope. They produce a white paper outlining the interface or protocol. A separate “clean” team does the development based only on the white paper, and this way there is no copyright infringement. This reverse engineering produces, well, regular code. More like what Chronos was saying.

For binaries that run in JIT environments like the JVM or CLR, you just run the code through a decompiler program and, most of the time, you get compilable Java/C# source files. Things get tricky if the developer obfuscates their code (with, e.g., Chinese characters as identifiers, and/or transformations that are legal in the IL but illegal in the source language), and obviously it won’t be as pretty or readable as the original sources.

And then of course, for scripting languages the only thing between you and the original source code is the possibility that the developer used a minifier/obfuscator. You can probably use your browser’s built-in devtools to view and beautify (un-minify) the JavaScript running on this website, for example.

~Max

And sometimes you don’t even have direct access to the code at all. One of the more common forms of software reverse engineering, for instance, is theorycrafting in games. You know that the game uses some formula to determine, for instance, the amount of damage done by some attack, but the official documentation might not give the exact formula (it might say something like “damage dealt depends on your level, the enemy’s level, and any resistances the enemy might have” without saying what the dependence is, or it might not even say that). So the nerdier members of the playerbase will do experiments within the game, with players and enemies of different levels, using whatever numbers or other indicators the game does give, to try to figure out those formulas.
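A toy version of that kind of experiment, with entirely made-up numbers (the “game” here is just a table of observations gathered from test fights):

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical in-game measurements: (attacker level, damage dealt)
     * against the same enemy.  All values invented for illustration. */
    int level[]  = {  5,  10,  15,  20 };
    int damage[] = { 62, 112, 162, 212 };
    int n = 4;

    /* Check whether damage grows linearly with level: a constant per-level
     * step suggests something like damage = base + k * level. */
    for (int i = 1; i < n; i++) {
        double step = (double)(damage[i] - damage[i - 1])
                      / (level[i] - level[i - 1]);
        printf("levels %d->%d: %.1f damage per level\n",
               level[i - 1], level[i], step);
    }
    /* Every step here is 10.0, and damage - 10*level is 12 in every row,
     * so the guessed formula is damage = 12 + 10 * level. */
    return 0;
}
```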