Could partial compilation with insertion into the executable be used if a C++ codebase doesn't build?

I suspect that what I am talking about has standard names in the field of reverse engineering, but I am just not familiar with them.

Well, so sometimes an open source codebase in C++ just doesn’t want to build easily. In fact, sometimes projects with such codebases have entire subforums dedicated to the topic of “building the codebase”. So presumably achieving the build can take days of a programmer’s time in some cases.

On the other hand, while developers behind the project don’t necessarily bother to make the build easy for other people, at least they build it themselves and publish a working executable. So I am wondering, is it possible to take advantage of this situation by using an automated tool that would compute the diff between the original codebase and my modified one and modify the executable by deleting/inserting machine code instructions to create a new executable that would be functionally equivalent to the one that would have been created via a successful build?

Is something like that not feasible due to how real life executables are structured? Or is it feasible but the necessary tools don’t exist because nobody thought this approach to be useful? Or can this in fact be accomplished relatively easily using existing tools intended for reverse engineering / hacking?

Isn’t this more or less what a package manager does? I mean, a package manager doesn’t directly manipulate the executable, but it checks your libraries against those of the person who wrote the code and finds/downloads ones that are compatible before compiling. That seems a lot easier than trying to alter the executable piecemeal, which would require architecture-dependent knowledge.

For Java development, I know there’s JRebel which can deal with class structure changes (and other things) on the fly. I don’t use it (yet) so I have to stop and restart apps during development anytime I add new instance variables or class methods. Now, I’m guessing this is only possible because of the JVM which abstracts away the architecture dependencies Simplicio mentioned. Also, this all works because things are compiled in debug mode (extra symbols included in the executable). For release, you want the optimized version of the app (for one thing, it’s smaller).

How many “architectures” of this sort are there? Would it be one per compiler, so e.g. gcc and VC6 covering the bulk of them? Or one per processor? (Given that all 32-bit Windows executables work on all 32-bit Windows machines, they all have the same processor type, right?)

Would making a compiler-agnostic source code to machine code alignment tool be a complicated undertaking? (presumably such a tool would be the first step in trying to accomplish this sort of surgery on the executable)

Most compilers have flags to target the most common architectures. But the architecture is the chip type. A few months ago, I would’ve said that there were only two major types in most PCs (64- or 32-bit x86 chips), but a while ago I was compiling code for a cluster and couldn’t get it to work. It turned out that the machine the compiling was done on was an Intel machine while the nodes of the cluster were AMD, and the vector instruction sets are different for the two brands.

I suspect more modern processors have varying extensions to their instruction sets to support things like hyper-threading and such, so even within the x86 family there are multiple different architectures.

The answers to these questions are going to be different depending on compiler and language. For some languages, like interpreted languages, the solution is trivial; for others (I would place C and C++ in this latter group) I think this is a pretty difficult approach to the problem. For C/C++, at least, there are alternate approaches which would probably work better in the most common situations.

Why is this so hard? Well, you don’t specify what sort of code changes you’re thinking about, but I imagine you’re thinking of some sort of small local change, maybe just one line of source code or something. But even a pretty small local change can propagate through the entire code. For example, the change in a variable type might require change in every opcode accessing that variable; if you make a function more complicated, you will probably have to relink the executable to reconnect all of the jumps which no longer go to the right places and to find all of the new library functions you called which weren’t used before; and so on. You can probably restrict your changes to be small and trivial enough that these aren’t problems, but this is still IMO the wrong approach.
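To make the "small change propagates everywhere" point concrete, here is a minimal sketch. The `RecordV1`/`RecordV2` structs and the helper functions are hypothetical, not taken from any real project. Widening one field moves the field after it, and that offset is baked into every instruction that reads or writes the moved field.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical record layout as the published executable was compiled.
struct RecordV1 {
    std::int32_t id;
    std::int32_t score;
    std::int32_t flags;
};

// The same record after a "one-line" source edit: id widened to 64 bits.
struct RecordV2 {
    std::int64_t id;
    std::int32_t score;
    std::int32_t flags;
};

// Compiled code that touches `score` encodes this offset directly,
// so every such instruction in the binary would need patching.
constexpr std::size_t score_offset_v1() { return offsetof(RecordV1, score); }
constexpr std::size_t score_offset_v2() { return offsetof(RecordV2, score); }

// The object size changes too, which affects allocation and array indexing.
constexpr bool size_changed() { return sizeof(RecordV1) != sizeof(RecordV2); }
```

On typical platforms `score` moves from byte offset 4 to byte offset 8, so a binary patcher would have to find and rewrite every access to it, not just the code generated from the edited line.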

The traditional way to solve your problem, if the developers cooperate, is by adding some sort of hooks in reasonable places (e.g., event handlers, data pre/post-processing) in the original executable. For example, the developers could distribute the executable with a DLL containing stub functions defining all of the hooks; other programmers could then fill out whatever stubs they wanted and just compile their code to a new version of the DLL. Or, going in the other direction, the developers could turn their code into a library with an API so that you can call their functions from your code. A variation on these methods is to build a parser for these hook functions (for some, possibly simpler, language) into the main executable, which is nice because it can just use a text file rather than requiring an extra compilation step. There are lightweight languages like Lua which are easy to embed this way.
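As a rough illustration of the stub-hook idea, here is a sketch that keeps everything in one process instead of a real DLL. The `Hooks` table, `run_pipeline`, and `shout` are made-up names for illustration; an actual project would export the stubs from a shared library and load it at startup.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical hook table. In a real project these entries would be
// functions exported by a DLL shipped alongside the executable.
struct Hooks {
    std::string (*pre_process)(const std::string&);
    std::string (*post_process)(const std::string&);
};

// Default stub: does nothing, so the program behaves as shipped.
std::string stub_identity(const std::string& s) { return s; }

Hooks default_hooks() { return Hooks{stub_identity, stub_identity}; }

// Stand-in for the main program's core logic, which only ever calls
// through the hook table and never needs to be rebuilt by the user.
std::string run_pipeline(const Hooks& hooks, const std::string& input) {
    std::string data = hooks.pre_process(input);
    data += "!";  // placeholder for the real work
    return hooks.post_process(data);
}

// A user-supplied replacement for one of the stubs.
std::string shout(const std::string& s) {
    std::string out = s;
    for (char& c : out) {
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return out;
}
```

Someone customizing the program would only compile their own version of the small hook module (here, just `shout`) and assign it into the table, e.g. `hooks.post_process = shout;`, leaving the hard-to-build main codebase alone.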

These are all pretty easy to do if the developers are willing. But of course supporting and maintaining any interface is additional work for the developers, so I imagine they are pretty low-priority features for a project still undergoing major development. For that stage in a project the developers are probably mostly interested in working with people who are willing to spend the time getting the thing to compile.

If you mean noting correspondences with source and executable code at compile time, then… this is already done by compilers, and this is the way debuggers are able to find the source lines while stepping through the executable. If you mean developing a tool that produces these correspondences given source code and an executable created with an unknown compiler, then… this seems like an absurdly difficult problem with no advantages over just using your local equivalent of gcc -g. But maybe I don’t understand what you’re asking.

Yes, this particular question is sort of like “show me which machine code instructions got generated for method foo or for code line bar of that method”. I understand that this is trivial enough for Java/C# and obviously much harder for C++, but is the C++ case really an “absurdly difficult problem”?

Let’s examine just the method-level segmentation problem: is it really complicated to find out which machine code instructions represent which method in the source code?

ETA: and yes, of course the implication of the above question is that we cannot just fully compile the entire code base; we can only “compile” / parse / otherwise process the segment of code we are trying to locate in the executable. And maybe also do some low key processing on the rest of the code base which does not amount to true full compilation.

Yes, I think in general this is very difficult. Just for a start:
- Optimizing compilers can do all sorts of code shuffling, so you can’t just look for an opcode sequence; loop endpoints can change to exploit vectorization or other fast instructions, expressions may be partially or completely evaluated at compile time, etc.
- Functions can get inlined, so they may appear in multiple places in the executable, with no hints that you’ve entered a “function”.
- Multiple functions may compile to the same opcode sequence, so you can’t even be certain you have the right one.
- Polymorphic method calls are probably done via vftables, so you can’t even trace your way to the function from main().

If the compiler is not very sophisticated, or if the function has some nice properties (e.g., accessing some static data or library function at a known address) then this may not be impossible. But it’s almost certainly the Wrong Way To Do It anyway.
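Two of the points above (identical opcode sequences and virtual dispatch) can be seen in a tiny sketch. All the names here are invented for illustration, and whether a given compiler actually merges or inlines these functions depends on the compiler and its flags.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Two distinct source-level functions that would very likely compile to
// the same instructions; some linkers may even fold them into one copy.
int add_apples(int a, int b) { return a + b; }
int add_pears(int a, int b) { return a + b; }

struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const = 0;
};
struct Triangle : Shape { int sides() const override { return 3; } };
struct Square : Shape { int sides() const override { return 4; } };

// The call below typically becomes an indirect jump through the object's
// virtual table, so the binary at this call site names no specific callee.
int count_sides(const Shape& s) { return s.sides(); }

// Which override runs depends on run-time data, not on anything that
// can be read off the call site.
std::unique_ptr<Shape> make_shape(const std::string& name) {
    if (name == "triangle") return std::unique_ptr<Shape>(new Triangle());
    return std::unique_ptr<Shape>(new Square());
}
```

A tool that walks the executable starting from main() would see `count_sides` calling "whatever the table points to", and might not be able to tell `add_apples` from `add_pears` at all.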

I think you have that backward.

Java is known to be able to employ more aggressive optimizations than C due to the lack of pointers in the language, which in turn means the resulting machine code has a lower chance of being a straight mapping from high level instruction to low level instruction.