Explain "source code" to me

friedo · September 2, 2010, 3:09pm

marshmallow:

Ruminator:

Manipulating assembly language is difficult and tedious but it’s workable for small changes such as bypassing a serial # check. You can’t work at the assembly language level to make major changes such as adding entire new scenes, characters, and weapons to a game. In other words, you can’t disassemble Madden NFL 2009 and change the assembly code to turn it into Madden NFL 2010.

What about video game console emulators and ROM hacks? Isn’t that pretty similar to this discussion since they have to rip the compiled code off a cartridge and work backwards? And by ROM hack I don’t mean just having the ROM, which is apparently super easy, I mean when they make their own games, like Super Demo World or Super Metroid Redesign or Zelda: Parallel Worlds or tons of others. Judging by some videos on YouTube it seems like they’re on the brink of making their own N64 games. It blows my mind.

With hardware emulators, you’re creating software that simulates a piece of hardware and runs the program – you don’t actually have to understand how the program works, as long as you can be reasonably sure that your emulator behaves the exact same way as the real hardware does. And the original Nintendo, for example, was not particularly complex: a variant of the well-known MOS 6502 processor, some RAM, and a few commodity chips for this and that.
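To make the "simulate the hardware, run the program" idea concrete, here is a minimal sketch of an emulator's fetch-decode-execute loop. The four-instruction machine is invented for this example (it is not the NES or any real chip), and the opcode numbers and the `run` function are assumptions of the sketch.

```python
# Hypothetical toy machine, invented for illustration only.
# Every instruction except HALT is two bytes: opcode, operand.
LOAD, ADD, STORE, HALT = 0x01, 0x02, 0x03, 0xFF


def run(program, max_steps=1000):
    """Emulate the toy machine: one accumulator, 256 bytes of memory."""
    memory = bytearray(256)
    memory[:len(program)] = program  # the "cartridge" is copied into memory
    acc = 0   # accumulator register
    pc = 0    # program counter
    for _ in range(max_steps):
        opcode = memory[pc]          # fetch
        if opcode == HALT:           # decode and execute
            break
        operand = memory[pc + 1]
        if opcode == LOAD:
            acc = operand
        elif opcode == ADD:
            acc = (acc + operand) & 0xFF  # 8-bit wraparound
        elif opcode == STORE:
            memory[operand] = acc
        else:
            raise ValueError(f"unknown opcode {opcode:#04x} at {pc}")
        pc += 2
    return acc, memory


program = bytes([LOAD, 40, ADD, 2, STORE, 0x80, HALT])
final_acc, final_memory = run(program)
```

Note that `run` never needs to know what the program is "for"; it only has to treat each opcode the way the (imaginary) hardware would.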

In the case of custom games like the Mario or Zelda items you mentioned, usually this is a result of reverse-engineering the format in which the original game stored its level maps. (Remember that original Mario and Zelda levels are just 2D grids of standardized pieces, so this isn’t hard. It gets a lot more difficult with more modern incarnations, of course.) Then you can write a program for editing levels. Again, this doesn’t require knowing how the actual game works under-the-hood.
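As a rough sketch of what "reverse-engineer the map format, then write an editor" can look like, assume a made-up format: one byte for width, one for height, then one tile ID per cell in row-major order. Real games use their own (often compressed) layouts, so everything below is hypothetical.

```python
def decode_level(blob):
    """Turn raw bytes in the assumed format into a 2D grid of tile IDs."""
    width, height = blob[0], blob[1]
    tiles = blob[2:2 + width * height]
    if len(tiles) != width * height:
        raise ValueError("level data is shorter than its header claims")
    return [list(tiles[r * width:(r + 1) * width]) for r in range(height)]


def encode_level(grid):
    """Inverse of decode_level: pack the grid back into bytes."""
    height = len(grid)
    width = len(grid[0])
    return bytes([width, height]) + bytes(t for row in grid for t in row)


# A 4x2 level: tile 0 = sky, 1 = block, 2 = ground (invented IDs).
raw = bytes([4, 2,
             0, 0, 1, 0,
             2, 2, 2, 2])
grid = decode_level(raw)
grid[0][3] = 1               # the "edit": place another block
patched = encode_level(grid)
```

The editor only touches data; none of the game's code is read or changed.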

Thudlow_Boink · September 2, 2010, 3:23pm

(Along the lines of Ruminator’s casserole example) This is a little like saying “If I buy a bucket of Kentucky Fried Chicken, isn’t the Colonel’s secret recipe there?”

FuzzyOgre · September 2, 2010, 3:34pm

Source code is a recipe.

Say you went for coffee with a friend, and ordered some apple pie.

It’s remarkably difficult to unbake a pie, but there are people who have a knack for recreating the end effect. Microsoft wants you to buy pie only from them. They went through a lot of work to devise a recipe and they don’t want you recreating it. The apple pie is a compiled recipe; it’s closed source.

Open source would be the baker providing copies of the pie recipe (and the coffee too, most likely). And the instructions for chairs, the table, even the building. You get the pie ready to eat, but you can also recreate it later. This is Linux.

So you can see that the idea of freely shared recipes vastly predates closed source: it’s the natural way of things for humans. In fact, in computers it was the norm up until about 1980.

Incidentally I am enjoying Apple pie and posting from Ubuntu Linux. Thanks mom, and thanks Linus Torvalds.

Ruminator · September 2, 2010, 3:50pm

Yours is a more famous and much more succinct example than mine. Thank you.

This is true but it’s a little one-sided.

Microsoft’s pie is closed-source but the pie itself is cited in more downstream recipes as a component food item. For example, a chef on TV may demonstrate an ice cream sundae topped with chunks of Microsoft pie for a certain taste. They could conceivably use Linux pie but they don’t because they realize that 95% of the audience is familiar with the Microsoft pie. Of course, the savvy home cooks will realize they can just substitute Linux pie but it takes some expertise to do this.

The whole ecosystem is tilted to the Microsoft pies even though they are closed. For example, the vast majority of Pie Giftcards happen to be redeemable for Microsoft pies and not Linux ones.

However, there are also negatives for Microsoft’s pie being so prevalent. If a terrorist wanted to scare the country by poisoning pies, he would target the Microsoft pies and not the Linux ones.

Chronos · September 2, 2010, 4:05pm

I suspect, in many cases, they actually have the same original level editors that the Nintendo folks used in the first place. The original programmers (actually, level designers, which might be completely different folks than the programmers themselves) would have wanted to make things easier on themselves, too, so such editors would exist somewhere. And it’s not too hard to imagine a retired employee who figures it’s not such a big deal any more, and shares it.

Baron_Greenback · September 2, 2010, 4:58pm

Microsoft source code is available to some institutions and organisations. I imagine the NDA is pretty stringent.

marshmallow · September 2, 2010, 5:55pm

I guess, although some of these ROM hacks are pretty impressive. They mess with the physics and introduce new enemies, items, and background sprites. It seems like you’d have to know what’s going on to do that sort of stuff, but I don’t know anything about programming, so never mind.

You’re right that you make the levels in an editor. But I find it hard to believe a programmer kept a level editor on his own for 15-25 years and then released it. I’m pretty sure it just involves a lot of clever people attacking whatever game they like the most. I’ve seen sites for lots of different editors for different games.

Bytegeist · September 2, 2010, 8:17pm

Bully for you, but I have to ask: any particular reason you didn’t overwrite with the BRA (branch always) instruction? Then it wouldn’t have mattered whether the disk was present or not.

Some of the source code to MS Windows NT4 and 2000 was left on a public server several years ago. I assume it’s still around somewhere, archived, although it might not be very relevant anymore.

Voyager · September 2, 2010, 8:28pm

In addition, compilers optimize code in non-obvious ways. To be more explicit about loops, small loops may be unrolled to save a branch, so you will see two copies of the code, not the loop. But, worse, variables will be allocated to registers depending on their liveness, and it will not be at all obvious which variable is so allocated at any given time, unless you are good at mapping memory locations to register usage.
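To picture the unrolling point, here is a hand-written sketch of the same loop before and after being unrolled by two. A real compiler does this at the machine-code level; Python is used here only to show the shape, and both function names are made up.

```python
def sum_rolled(values):
    """The loop as the programmer wrote it: one add per trip."""
    total = 0
    for v in values:
        total += v
    return total


def sum_unrolled_by_two(values):
    """Roughly what an unrolled version looks like: two copies of the
    loop body per trip, plus cleanup for an odd element at the end."""
    total = 0
    i = 0
    n = len(values)
    while i + 1 < n:
        total += values[i]       # first copy of the body
        total += values[i + 1]   # second copy of the body
        i += 2
    if i < n:                    # leftover element when n is odd
        total += values[i]
    return total
```

Someone reading only the second form has to work out for themselves that it started life as a plain loop.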

There are many tools to assist in assembly-level debugging of high-level languages, like the old sdb and adb. But these only work when you use a compiler switch to add information the debugger can use; they won’t work for production code.

Stealth_Potato · September 2, 2010, 8:29pm

ROM hacker checking in!

It varies quite a lot depending on the game, but it is true that many things can be altered without understanding the game’s code. With experimentation and a whole lot of patience you can usually reverse engineer the storage locations and formats for level maps, enemy descriptions, character sprites, and so on. You can search for known or expected patterns in a hex editor (e.g., known quantities of hitpoints, text strings, etc.), or open up the ROM in a graphics viewer and just scan through the noise looking for anything that looks familiar.
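The hex-editor search described above can be sketched in a few lines. The "ROM" here is fabricated for the example, and it assumes plain ASCII text and a little-endian 16-bit hitpoint value; real games often use custom text encodings and other layouts.

```python
def find_all(rom, pattern):
    """Return every offset at which pattern occurs in rom."""
    hits = []
    start = rom.find(pattern)
    while start != -1:
        hits.append(start)
        start = rom.find(pattern, start + 1)
    return hits


# Fabricated ROM image: padding, an item name, padding, then 300 HP.
rom = bytes(16) + b"POTION" + bytes(5) + (300).to_bytes(2, "little") + bytes(8)

text_hits = find_all(rom, b"POTION")
hp_hits = find_all(rom, (300).to_bytes(2, "little"))  # bytes 2C 01
```

In practice a search like the second one tends to return many false positives, which is where the experimentation and patience come in.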

Any stored data whose format you can reverse engineer you can write a convenient editor for – all without even looking at the underlying game code that reads or processes that data. Other things, though, do require understanding the code. Usually this comes up when you want to change some aspect of the game’s behavior, and not just its appearance. For example, changing the way damage is calculated, or modifying the menus or other interfaces.

This is where you need an emulator with debugging capabilities (i.e., the ability to view memory, set breakpoints, and disassemble code on the fly, meaning convert it from raw machine code to a more human-readable series of named opcodes). This allows you to run the game while examining its code, so you can isolate the areas relating to the thing you want to change. It’s a tedious task, and it requires a good understanding of the assembly language of the platform. Once you’ve found what you’re looking for – for example, the subroutine that computes damage for a certain attack – you can write a new one that does what you want instead.
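For a feel of what "disassemble" means, here is a minimal sketch for an invented instruction set. The opcode table is hypothetical and far smaller than any real CPU's, but the job is the same: walk the bytes and print a name for each opcode.

```python
# Hypothetical opcode table: byte value -> (mnemonic, operand byte count).
OPCODES = {
    0x01: ("LOAD", 1),
    0x02: ("ADD", 1),
    0x03: ("STORE", 1),
    0xFF: ("HALT", 0),
}


def disassemble(code):
    """Convert raw bytes into 'address  MNEMONIC operands' lines."""
    lines = []
    pc = 0
    while pc < len(code):
        name, n_operands = OPCODES.get(code[pc], (None, 0))
        if name is None:
            # Unknown byte: show it as data rather than guessing.
            lines.append(f"{pc:04X}  .byte ${code[pc]:02X}")
            pc += 1
            continue
        operands = code[pc + 1:pc + 1 + n_operands]
        text = " ".join([name] + [f"${b:02X}" for b in operands])
        lines.append(f"{pc:04X}  {text}")
        pc += 1 + n_operands
    return lines


listing = disassemble(bytes([0x01, 0x28, 0x02, 0x02, 0x03, 0x80, 0xFF]))
```

The listing has opcodes and addresses but no variable names or comments; figuring out that address `$80` holds, say, a damage value is still the human's job.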

Back on the main subject of the thread, it is very often a case of trying to read, understand, and then modify the compiled executable form of a program – naturally we never get to see the source code. OTOH, many games for the older systems (like the NES and the SNES) were not written in compiled languages like C, but instead directly in assembly language. In those cases, the executable machine code is a fairly straightforward translation of the code the original programmers wrote, so our job isn’t too hard. A programmer writing a large program in assembly will take pains to ensure that its structure is amenable to human understanding.

But in cases where the original programmers did write in a compiled language, the results can be nasty.

(They can also be extremely inefficient, owing to the older days when compilers were not so good at optimization. I’ve slogged my way through long subroutines consisting of many dozens of seemingly aimless instructions, when it turned out at the end that all it was doing was copying four bytes from one area of memory to another. It turns out it was just a hideously expanded 32-bit assignment performed on a 16-bit machine, evidently using a lot of C macros to read and shift the individual bytes. Of course, the compiler is not entirely to blame; some of these people wrote shameful C code.)
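As a guess at the shape of that kind of code (not the actual routine described above), compare a 32-bit copy done by reading, shifting, and masking individual bytes with the direct equivalent. Both function names are invented.

```python
def copy32_bytewise(memory, src, dst):
    """Copy four bytes the long way: assemble a 32-bit value with
    shifts, then take it apart again with shifts and masks."""
    value = 0
    for i in range(4):
        value |= memory[src + i] << (8 * i)
    for i in range(4):
        memory[dst + i] = (value >> (8 * i)) & 0xFF


def copy32_direct(memory, src, dst):
    """What all of the above amounts to."""
    memory[dst:dst + 4] = memory[src:src + 4]


mem_a = bytearray([0xDE, 0xAD, 0xBE, 0xEF, 0, 0, 0, 0])
mem_b = bytearray(mem_a)
copy32_bytewise(mem_a, 0, 4)
copy32_direct(mem_b, 0, 4)
```

Expanded into machine instructions on a processor with narrow registers, the first version can easily become dozens of steps for what is conceptually one assignment.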

FuzzyOgre · September 2, 2010, 8:41pm

Ruminator:

This is true but it’s a little one-sided.

Microsoft’s pie is closed-source but the pie itself is cited in more downstream recipes as a component food item. For example, a chef on TV may demonstrate an ice cream sundae topped with chunks of Microsoft pie for a certain taste. They could conceivably use Linux pie but they don’t because they realize that 95% of the audience is familiar with the Microsoft pie. Of course, the savvy home cooks will realize they can just substitute Linux pie but it takes some expertise to do this.

The whole ecosystem is tilted to the Microsoft pies even though they are closed. For example, the vast majority of Pie Giftcards happen to be redeemable for Microsoft pies and not Linux ones.

However, there are also negatives for Microsoft’s pie being so prevalent. If a terrorist wanted to scare the country by poisoning pies, he would target the Microsoft pies and not the Linux ones.

Thanks for continuing the fun. I liked the way you sliced it.

Superfluous_Parentheses · September 2, 2010, 9:33pm

That’s one advantage, but more importantly, it makes it much easier to extend and fix the operating system itself. The core of Linux (the kernel) is actually not that different from any other closed-source Unix-type kernel from the point of view of an application writer, and most app writers never look at the source code for the kernel - additional libraries and other programs are usually more interesting to an app programmer (and most Linux distributions come with source code for all of those too).

Other people here have explained a lot of the details already, but in essence, source code in C or a more “advanced” compiled language, as used to write these systems, provides a much higher level of abstraction, and the programmer then uses a compiler to translate that high-level description down to the most basic instructions.

For instance, if you write something in a Lisp:



(dolist (a '(1 2 3 4)) ; (comment) print every element in the list 1 2 3 4
  (print a))

this would get translated into something like (but certainly more complex than):



load whatever code we need to print anything  and to reserve memory space on this OS
[that line above is probably a significant amount of code in its own right]
reserve memory space for 5 bytes and store the address of the segment in X
initialize memory segment X with bytes 1 2 3 4 and 0 (or some other marker)
initialize pointer P with the address of segment X
get the value of the address pointed to by P and store it in register Y
marker: LOOP POINT
call the OS function to print with the value of Y
increment P by one
load the value of the address pointed to by P and store it in register Y
if Y is not 0, jump to LOOP POINT
free memory segment X

And that’s just for a simple loop. And actually all the instructions in the lowest level are just numbers, but that part of the translation is more or less bi-directional; assembly language is just an immediate way for humans to write those numbers down in a relatively meaningful way and provide comments and names to things (but the comments you will lose, as you will most of the names when you translate “upwards”).

Stuff like this adds up quickly, and you generally can’t automatically translate back from the low-level stuff your computer actually runs to the high-level description you’ve written the code in.
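The loss of comments and names is easy to see in miniature with Python's built-in compiler. This is only an analogy, since CPython produces bytecode for its own virtual machine rather than native machine code, and the snippets and names below are made up for the demonstration.

```python
def code_bytes(source):
    """Compile source text and return just the raw instruction bytes."""
    return compile(source, "<demo>", "exec").co_code


with_comment = "total = 1 + 2  # add the two numbers\n"
without_comment = "total = 1 + 2\n"
# The comment never reaches the compiled form.
same_after_compile = code_bytes(with_comment) == code_bytes(without_comment)


def add_tax(price):
    return price + 1


def f(x):
    return x + 1


# Local variables are referred to by slot number in the instructions,
# so the descriptive name leaves no trace in the instruction bytes.
same_instructions = add_tax.__code__.co_code == f.__code__.co_code
```

Python does keep variable names as separate metadata alongside the instructions; a stripped native executable usually does not, which makes the trip back up even harder.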

Chronos · September 2, 2010, 9:56pm

On the other hand, code that is efficient can be even worse.

Superfluous_Parentheses · September 2, 2010, 10:06pm

Or, a slightly less dramatic example. Note to C programmers: you should very probably not use this for copying regions of data to other regions.

Una_Persson · September 2, 2010, 11:22pm

An excellent comparison of the three forms of the same instructions, thank you.

Voyager · September 3, 2010, 12:12am

To quibble, and to confuse the OP more, this isn’t strictly true nowadays.
Say you have assembly code that looks like this:

Load RegA, RegB
Load RegC, Mem[XYZ]
Add RegD, RegA, RegC

Now in the old days this would be done step by step - which is inefficient, since a memory read takes a relatively long time. A lot of the transistors in a microprocessor are taken up figuring out how to rearrange instructions for better efficiency, so it would almost certainly be rearranged to
Load RegC, Mem[XYZ]
Load RegA, RegB
Add RegD, RegA, RegC

since there are no data dependencies between these instructions. Other code before this segment might be inserted after the memory read also.
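Here is a small sketch of why that swap is safe, using a made-up three-instruction register machine. It reads `Load RegA, RegB` as "copy RegB into RegA", which is an assumption about the notation, and the dependency check is a simplified version of what hardware does.

```python
def reads_writes(instr):
    """Return (set of things read, set of things written) for one instruction."""
    op = instr[0]
    if op == "move":                      # ("move", dst, src)
        return {instr[2]}, {instr[1]}
    if op == "load":                      # ("load", dst, address)
        return {"mem:" + instr[2]}, {instr[1]}
    if op == "add":                       # ("add", dst, a, b)
        return {instr[2], instr[3]}, {instr[1]}
    raise ValueError(f"unknown op {op}")


def independent(a, b):
    """True if a and b can be swapped without changing the result."""
    reads_a, writes_a = reads_writes(a)
    reads_b, writes_b = reads_writes(b)
    return not (writes_a & reads_b or writes_b & reads_a or writes_a & writes_b)


def execute(program, regs, mem):
    regs = dict(regs)                     # do not modify the caller's copy
    for instr in program:
        op = instr[0]
        if op == "move":
            regs[instr[1]] = regs[instr[2]]
        elif op == "load":
            regs[instr[1]] = mem[instr[2]]
        elif op == "add":
            regs[instr[1]] = regs[instr[2]] + regs[instr[3]]
    return regs


move = ("move", "RegA", "RegB")
load = ("load", "RegC", "XYZ")
add = ("add", "RegD", "RegA", "RegC")

start = {"RegA": 0, "RegB": 5, "RegC": 0, "RegD": 0}
memory = {"XYZ": 7}
in_order = execute([move, load, add], start, memory)
reordered = execute([load, move, add], start, memory)
```

The first two instructions touch different registers, so either order gives the same final state; the add reads both results and has to stay last.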

Even more fun, since you fetch instructions from memory (cache if you are lucky) this can be an expensive operation also, and you almost always prefetch the next few instructions so they will be ready to be decoded when the control flow comes to them. But what if there is a branch? (If statement, basically). The CPU tries to guess which way the branch is going to go, and fetches instructions assuming that the branch goes that way. If it is wrong, it throws away the instructions that were fetched incorrectly. It is even more complicated, but that is enough.
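The post does not say how the CPU makes its guess. One classic textbook scheme is a two-bit saturating counter per branch, sketched below; real processors use considerably more elaborate predictors, so treat this as an illustration only.

```python
def simulate_predictor(outcomes):
    """Count correct guesses from a two-bit saturating counter.

    Counter values 0 and 1 predict 'not taken'; 2 and 3 predict 'taken'.
    Each actual outcome nudges the counter one step toward itself.
    """
    counter = 2          # start at 'weakly taken' (an arbitrary choice)
    correct = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken == taken:
            correct += 1
        if taken:
            counter = min(3, counter + 1)
        else:
            counter = max(0, counter - 1)
    return correct


# The branch at the bottom of a 10-iteration loop: taken 9 times, then not.
loop_branch = [True] * 9 + [False]
hits = simulate_predictor(loop_branch)
```

The single miss at loop exit is the case where the instructions fetched on the wrong guess get thrown away.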

If you have a very long instruction word processor, like Intel’s Itanic, the compiler is charged with figuring out microinstructions which can execute at the same time. That makes decompiling even more fun.

jasg · September 3, 2010, 12:34am

Ah… geek disease:

Q. What time is it?

A. Well, to know what time it is, first you build a clock…

TriPolar · September 3, 2010, 12:40am

Voyager:

The_Hamster_King:

A computer program consists of a series of instructions in “machine code” – a pattern of 1’s and 0’s that tells the computer exactly what to do step-by-step.

To quibble, and to confuse the OP more, this isn’t strictly true nowadays.
Say you have assembly code that looks like this:

Load RegA, RegB
Load RegC, Mem[XYZ]
Add RegD, RegA, RegC

Now in the old days this would be done step by step - which is inefficient, since a memory read takes a relatively long time. A lot of the transistors in a microprocessor are taken up figuring out how to rearrange instructions for better efficiency, so it would almost certainly be rearranged to
Load RegC, Mem[XYZ]
Load RegA, RegB
Add RegD, RegA, RegC

since there are no data dependencies between these instructions. Other code before this segment might be inserted after the memory read also.

Even more fun, since you fetch instructions from memory (cache if you are lucky) this can be an expensive operation also, and you almost always prefetch the next few instructions so they will be ready to be decoded when the control flow comes to them. But what if there is a branch? (If statement, basically). The CPU tries to guess which way the branch is going to go, and fetches instructions assuming that the branch goes that way. If it is wrong, it throws away the instructions that were fetched incorrectly. It is even more complicated, but that is enough.

If you have a very long instruction word processor, like Intel’s Itanic, the compiler is charged with figuring out microinstructions which can execute at the same time. That makes decompiling even more fun.

If we are going to quibble, computers don’t use assembly instructions, or machine code, or 1’s and 0’s. They are machines which transfer electrical charge from one place to another, sometimes converting it to light, and sometimes converting light back to charge, and provide electromotive forces and magnetic fields to manipulate groups of molecules. They don’t execute code; they are just doing those transfers in the only way they are able to, based on the arrangement of molecules which compose the computer. If you look close enough, it’s all pretty simple.

TriPolar · September 3, 2010, 12:46am

Back to common terminology.
Lest we forget, some programming languages are interpreted, and execution is based on the source code. Batch files and JavaScript* work on that basis. Most complex or intensively used languages based on interpretation still compile source code to an object code which is unrelated to the computer’s native machine language. A common example of that is Java.
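Python works along the same lines as the Java case: CPython compiles source to bytecode for its own virtual machine, and the standard `dis` module will show it. The `greet` function is just a throwaway example, and the exact instruction names vary between Python versions, so no particular listing is shown here.

```python
import dis


def greet(name):
    return "hello " + name


# Names of the virtual-machine instructions the function was compiled to.
opnames = [ins.opname for ins in dis.get_instructions(greet)]

# dis.dis(greet) prints the full human-readable listing.
```

None of these instructions mean anything to the actual processor; the interpreter reads them and does the work itself.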

Pasta · September 3, 2010, 1:17am

Down the rabbit hole…
