Software engineers: what does reverse engineered code look like?

One of my early assignments in my first job was to disassemble a small program's binary image into something halfway readable to the programming staff. The company had a disassembly program that turned the binary into machine-language instructions. My role was to try to figure out the logic and write comments. The instruction set (DEC PDP-8) was small, but it was tedious work. And, I always thought, of marginal value.

How is this done in 2021? Or is it?

P.S. I was conversant with PDP-8 assembly.

If you disassemble or decompile machine code using a tool like Hex-Rays, it will look like assembly language or higher-level (e.g. C) code, so in 2021 you don’t have to decode binary machine language by hand.

But! For legitimate reverse engineering you may need to employ the clean-room technique (https://en.wikipedia.org/wiki/Clean_room_design), where the person implementing your reverse-engineered code never touches the original code and therefore does not actually know what it looks like.

Hex-rays page with sample output:

Today’s optimizing compilers take nice readable code and turn it into spaghetti gibberish. Disassembling spaghetti gibberish results in code that looks like spaghetti gibberish.

Back in the PDP days, you could predict what the compiler was going to do. Global variables were in the program’s data segment. Local variables were pushed onto the stack. C functions resulted in assembly-code calls to subroutines. It was all neat and predictable.

The few times that I have gone digging through a modern compiler’s output, I have been surprised to find things like variables that were never written out to memory at all. Instead, they were just floated along in registers as the program went down through the function. I have also seen function calls completely eliminated and replaced with all inline code. If you disassemble code like that, you aren’t going to end up with anything even remotely resembling the original source code.
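For instance (a made-up sketch, not output from any particular compiler or decompiler, and the exact result depends heavily on flags), source like this:

```c
/* Original source: a named local and a small helper function. */
static int scale(int x) { return x * 3; }

int total(const int *a, int n) {
    int sum = 0;                /* at -O2 this typically lives only in a register */
    for (int i = 0; i < n; i++)
        sum += scale(a[i]);     /* the call is usually inlined away entirely */
    return sum;
}
```

might come back from a decompiler looking more like this, with the helper gone, the local never touching memory, and only generic names left:

```c
/* Plausible decompiler reconstruction of the optimized binary (illustrative only). */
int FUN_00101149(int *param_1, int param_2)
{
    int iVar1 = 0;
    int iVar2;
    for (iVar2 = 0; iVar2 < param_2; iVar2++)
        iVar1 += param_1[iVar2] * 3;
    return iVar1;
}
```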

If you want to reverse engineer how something works, you need to take a step back from the code and look at it more from a data flow point of view instead of focusing on the details of how exactly the data gets manipulated.

If you are really lucky, the code was compiled with debugging symbols included. Knowing things like function and variable names is a huge benefit when you are trying to figure out the purpose of a section of code.

In general, reverse engineering and decompiling code is just something people expend a fair bit of energy to avoid having to do nowadays: code repositories, version control, and the use of standard libraries and components mean it’s easier to work forward from a known point in the development history than to work backwards from the currently deployed product.

When I hear reverse engineering, I don’t think of decompiling at all. I think of looking at what outputs a program produces given particular inputs, and figuring out what it’s doing based on that, and then re-implementing something that does the same thing.
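A minimal sketch of that approach (all names and numbers are made up; mystery_checksum() just stands in for whatever opaque routine you’re studying, and is defined locally only so the example compiles):

```c
#include <stdio.h>

/* Pretend this is a black box you can call but can't read. */
static unsigned mystery_checksum(unsigned x)
{
    return (x * 31u + 7u) & 0xFFu;
}

int main(void)
{
    /* Feed it simple, structured inputs and tabulate the outputs. */
    for (unsigned x = 0; x < 8; x++)
        printf("f(%u) = %u\n", x, mystery_checksum(x));
    /* The outputs go 7, 38, 69, 100, ... stepping by 31 each time
     * (and wrapping at 256 for larger inputs), which is enough to
     * guess f(x) = (31*x + 7) mod 256 and reimplement it. */
    return 0;
}
```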

Thanks for the responses. Let’s try a different question. engineer_comp_geek says his experience indicates decompiled code is radically different from the original source. If true, would anyone these days really want to decompile anything? Or is the better approach the one suggested by Chronos, i.e., determine what it does and write new code?

Even though I was working on a program originally written in PDP-8 assembly code, I’d go with Chronos.

You might look at the experiences of some dude named Linus Torvalds. Just For Fun (the name of his book) has a light-hearted look at him creating Linux based on the PDP-11 (?) Unix documents.

What @Chronos described pretty much is reverse engineering. If you have the manual, that’s great. If not, or if there are undocumented features, then disassembly, decompilation, data-flow analysis, etc. are simply tools that can help you determine what the code does.

A good disassembler will recognize library calls and library code.

I concur that the need and opportunity for disassembly are less than they once were. It used to be common, but now the only people I know of doing disassembly are doing virus or security research.

The results you get from decompiling code vary wildly, but in general you will get the assembly code (easy) and a reconstruction of the code in a higher-level language, often C. This screenshot shows an example from Ghidra, a free and very popular reverse engineering suite:

https://en.wikipedia.org/wiki/Ghidra#/media/File:Ghidra-disassembly,March_2019.png

You can see the series of disassembled instructions with automated annotations, as well as variables. In the right window, which shows the code reconstruction, note that while you can see equivalent code, it lacks all of the original variable names, structure, and other metadata. This is typical of decompiled code and a big impediment to understanding it. It is a much more significant task to annotate and understand the decompiled code than it is to generate it.
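To give a feel for what that loss looks like (a hand-written illustration, not taken from the screenshot above), compare a struct access in source with how a decompiler typically presents it back to you:

```c
#include <stdlib.h>

/* Original source: meaningful names and a real struct type. */
struct point { int x, y; };

int manhattan(const struct point *p)
{
    return abs(p->x) + abs(p->y);
}

/* Roughly how a decompiler shows the same function: the struct type,
 * the field names, and the function name are all gone, leaving generic
 * identifiers and raw offsets for you to annotate by hand. */
int FUN_00401020(int *param_1)
{
    int iVar1 = *param_1;       /* was p->x */
    int iVar2 = param_1[1];     /* was p->y */
    if (iVar1 < 0) iVar1 = -iVar1;
    if (iVar2 < 0) iVar2 = -iVar2;
    return iVar1 + iVar2;
}
```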

If I remember correctly, professionals who want to reverse engineer some proprietary software will have a “dirty” team that does the disassembly and packet sniffing, or even looks at the ROM chip under a microscope. They produce a white paper outlining the interface or protocol. A separate “clean” team does the development based only on the white paper, and this way there is no copyright infringement. This reverse engineering produces, well, regular code. More like what Chronos was saying.

For binaries that run in JIT environments like the JVM or CLR, you just run the code through a decompiler program and, most of the time, you get compilable Java/C# source files. Things get tricky if the developer obfuscates their code (with, e.g., Chinese characters as identifiers, and/or transformations that are legal in the IL but illegal in the source language), and obviously it won’t be as pretty or readable as the original sources.

And then of course, for scripting languages the only thing between you and the original source code is the possibility that the developer used a minifier/obfuscator. You can probably use your browser’s built-in devtools to view and beautify (un-minify) the JavaScript running on this website, for example.

~Max

And sometimes you don’t even have direct access to the code at all. One of the more common forms of software reverse engineering, for instance, is theorycrafting in games. You know that the game uses some formula to determine, for instance, the amount of damage done by some attack, but the official documentation might not give the exact formula (it might say something like “damage dealt depends on your level, the enemy’s level, and any resistances the enemy might have” without saying what the dependence is, or it might not even say that). So the nerdier members of the playerbase will do experiments within the game, with players and enemies of different levels, using whatever numbers or other indicators the game does give, to try to figure out those formulas.
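A toy version of that kind of experiment, with entirely made-up numbers (the “game” here is just a table of observations gathered from test fights):

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical in-game measurements: (attacker level, damage dealt)
     * against the same enemy.  All values invented for illustration. */
    int level[]  = {  5,  10,  15,  20 };
    int damage[] = { 62, 112, 162, 212 };
    int n = 4;

    /* Check whether damage grows linearly with level: a constant per-level
     * step suggests something like damage = base + k * level. */
    for (int i = 1; i < n; i++) {
        double step = (double)(damage[i] - damage[i - 1])
                      / (level[i] - level[i - 1]);
        printf("levels %d->%d: %.1f damage per level\n",
               level[i - 1], level[i], step);
    }
    /* Every step here is 10.0, and damage - 10*level is 12 in every row,
     * so the guessed formula is damage = 12 + 10 * level. */
    return 0;
}
```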