Explain "source code" to me

I think I sorta get the concept. I’ve been using Ubuntu and I know that the guts of the OS is freely available and can see how that would make writing apps easier.

But, in what way is Windows source code not available? If I install Windows, isn’t that code there? It may be encrypted or something, but can’t a computer geek get access somehow? Isn’t it all there on the disk? Or not?

What you have on a Windows install disc is the compiled form of the source code, not the source code itself. Take Internet Explorer: the main application is a file called iexplore.exe, which you can’t do anything with except execute. That file is the result of running the source code through a compiler. So you have the end result of the source code without having the code itself.

It is sometimes possible to ‘reverse engineer’ a file to try to get back to the original source code, but this isn’t something I know much about. I imagine it is incredibly difficult.

A computer program consists of a series of instructions in “machine code” – a pattern of 1’s and 0’s that tells the computer exactly what to do step-by-step.

It’s possible to write a program directly in machine code, but it’s really hard. All the 1’s and 0’s start to look the same after a while and it’s very easy to get lost and confused.

So most programs are written in a high-level language that’s easy for humans to read and understand. Then a special program called a compiler translates the high-level “source code” into “machine code” that will actually run on the computer.

When you get a program, you typically get only the final compiled version. You don’t get the original source code that was used to create it. It’s possible to “disassemble” a piece of machine code and figure out what it’s doing, but it’s not easy. If you really want to understand how a program works, you need to have access to the human-friendly source code that was used to create it.

Analogy.

Imagine a 10-minute casserole dish that a deli might create for you. You just add some liquid premix to some pasta and microwave it.

You don’t actually have the full “source code” for the casserole recipe: you don’t know all the individual ingredients of the liquid premix. You don’t know the ingredients or preparation steps to make the pasta from eggs and flour. You don’t know any of these things even though both final products are in your physical possession.

With this analogy, the deli has the “source code” and you do not.

Do you want a more computerish analogy than this?

Here’s a simple C example. All this code does is declare and initialize an integer, add another integer to it, and exit with the resulting sum as the return value. It doesn’t even print anything to the screen, as that is more complex when you look under the hood.

The source code looks like this:



int main(void) {
  int my_num = 2;
  my_num += 2;
  return my_num;
}


The compiler first translates this fairly readable code into “assembly” language:



        .file   "example.c"
        .text
.globl main
        .type   main, @function
main:
        leal    4(%esp), %ecx
        andl    $-16, %esp
        pushl   -4(%ecx)
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ecx
        subl    $16, %esp
        movl    $2, -8(%ebp)
        addl    $2, -8(%ebp)
        movl    -8(%ebp), %eax
        addl    $16, %esp
        popl    %ecx
        popl    %ebp
        leal    -4(%ecx), %esp
        ret
        .size   main, .-main
        .ident  "GCC: (GNU) 4.1.2 20080704 (Red Hat 4.1.2-46)"
        .section        .note.GNU-stack,"",@progbits


You can already see that much of the human-language nature is lost forever. For instance, the variable name “my_num” is not present anymore.

The next stage is “object” code, which is binary, so all you get here is a hexdump: :slight_smile:



0000000 457f 464c 0101 0001 0000 0000 0000 0000
0000010 0001 0003 0001 0000 0000 0000 0000 0000
0000020 00d0 0000 0000 0000 0034 0000 0000 0028
0000030 0009 0006 4c8d 0424 e483 fff0 fc71 8955
0000040 51e5 ec83 c710 f845 0002 0000 4583 02f8
0000050 458b 83f8 10c4 5d59 618d c3fc 4700 4343
0000060 203a 4728 554e 2029 2e34 2e31 2032 3032
0000070 3830 3730 3430 2820 6552 2064 6148 2074
0000080 2e34 2e31 2d32 3634 0029 2e00 7973 746d
0000090 6261 2e00 7473 7472 6261 2e00 6873 7473
00000a0 7472 6261 2e00 6574 7478 2e00 6164 6174
00000b0 2e00 7362 0073 632e 6d6f 656d 746e 2e00
00000c0 6f6e 6574 472e 554e 732d 6174 6b63 0000
00000d0 0000 0000 0000 0000 0000 0000 0000 0000
*
00000f0 0000 0000 0000 0000 001b 0000 0001 0000
0000100 0006 0000 0000 0000 0034 0000 0028 0000
0000110 0000 0000 0000 0000 0004 0000 0000 0000
0000120 0021 0000 0001 0000 0003 0000 0000 0000
0000130 005c 0000 0000 0000 0000 0000 0000 0000
0000140 0004 0000 0000 0000 0027 0000 0008 0000
0000150 0003 0000 0000 0000 005c 0000 0000 0000
0000160 0000 0000 0000 0000 0004 0000 0000 0000
0000170 002c 0000 0001 0000 0000 0000 0000 0000
0000180 005c 0000 002e 0000 0000 0000 0000 0000
0000190 0001 0000 0000 0000 0035 0000 0001 0000
00001a0 0000 0000 0000 0000 008a 0000 0000 0000
00001b0 0000 0000 0000 0000 0001 0000 0000 0000
00001c0 0011 0000 0003 0000 0000 0000 0000 0000
00001d0 008a 0000 0045 0000 0000 0000 0000 0000
00001e0 0001 0000 0000 0000 0001 0000 0002 0000
00001f0 0000 0000 0000 0000 0238 0000 0080 0000
0000200 0008 0000 0007 0000 0004 0000 0010 0000
0000210 0009 0000 0003 0000 0000 0000 0000 0000
0000220 02b8 0000 0010 0000 0000 0000 0000 0000
0000230 0001 0000 0000 0000 0000 0000 0000 0000
0000240 0000 0000 0000 0000 0001 0000 0000 0000
0000250 0000 0000 0004 fff1 0000 0000 0000 0000
0000260 0000 0000 0003 0001 0000 0000 0000 0000
0000270 0000 0000 0003 0002 0000 0000 0000 0000
0000280 0000 0000 0003 0003 0000 0000 0000 0000
0000290 0000 0000 0003 0005 0000 0000 0000 0000
00002a0 0000 0000 0003 0004 000b 0000 0000 0000
00002b0 0028 0000 0012 0001 6500 6178 706d 656c
00002c0 632e 6d00 6961 006e
00002c8


This is what you’ve got on your Windows box.

I’ve experimented with de-compiling programs in the past, and it’s pretty much a futile effort, as so much useful abstraction and language is irrevocably lost during compilation.

I could be wrong but I think that when software pirates crack a game they frequently do so by removing the checking code from the original executable file. For example, the original executable could run a check to see if the correct CD is in the drive and also check for a valid serial number. The replacement executable is identical but has these portions of the code removed and is sometimes ‘padded’ in order to maintain correct filesize. I have no idea how they go about doing this.

Say that I wrote down a list of instructions for how to build a machine – I list off the size of the gears, give spatial coordinates for how they are to be assembled, detail where and how much to lubricate the machine, etc.

Now someone actually builds that machine.

My instructions are “source code”. If I go and sell thousands of my machines to other people, they’ll have the machine but not the source code. They could take the machine apart and work back from that to recreate the source code, but that would be a hassle and a half. It’s far easier to just have the source code itself if you want to modify it.

Source code is also broken up into different sections in a logical manner – like chapters. If you’re taking apart a machine you don’t understand, you don’t know where to break it up into subsections that you can look at independently. If the machine is very large and very complex, if you can’t break it up into subsections, understanding it will be almost impossible.

Using Pasta’s posting as illustration, what the hackers do is take the object code hexdump and disassemble (the reverse of assemble) back to the assembly language. (It’s not a perfect reversal process but it’s good enough in most cases.) However, you can’t further decompile the assembly language back to “source code” so the assembly code is the level they make their changes.

Manipulating assembly language is difficult and tedious but it’s workable for small changes such as bypassing a serial # check. You can’t work at the assembly language level to make major changes such as adding entire new scenes, characters, and weapons to a game. In other words, you can’t disassemble Madden NFL 2009 and change the assembly code to turn it into Madden NFL 2010.

Back in the Classic Mac days, I would have great fun using Macsbug to do just this. I would set a breakpoint whenever a dialog box appeared, and when the program put up a dialog saying to insert the “key disk” I would backtrace the program execution until I found the branch, and edit the hex so that the sense of the branch was reversed. Then, the program would only run without the key disk…
I also used to do things like change the number of lives you got in games, which made it much easier to play the game.

My very first hack: Opening up a Sim City city file in FEdit (a MacOS hex editor), finding where it stored the starting budget of $20,000 (hex 4E20) and changing it to FFFF. Free money!

Heh. I remember the first time I decided to install Macsbug and make some use of it. I had the vague notion that I had already acquired a copy and just hadn’t ever installed it, so I looked around and sure enough! So I dropped it into the System Folder and rebooted and hit the programmer’s key to invoke it and…nada. Reread the entry in the Macintosh Bible and tried again. Still nothing. What am I doing wrong? Could my copy of Macsbug be out of date or corrupted? I go to take a closer look at it and discover that what I’ve actually got is MacBugs, a bloody arcade game :smack:

Way, way back I owned a Commodore 64. I taught myself enough Commodore Assembly code to remove copy protection checks from programs. In principle, if I’d wanted to take the trouble I could have figured out to the last bit exactly how my C-64 and any program that would run on it worked. Those days are gone forever. Now even the teams that develop programs have to presume that the modular chunks of code they use probably work and that not too many bugs appear when they’re riveted together.

In principle, decompiling (going from machine code all the way back to a high-level language like C) is possible, but you end up with bloody terrible code: All of the comments and variable names are lost, many complicated structures like loops have probably been broken up into more fundamental pieces, and so on. Working with decompiled code isn’t really any easier than working with the assembly code, and the assembly code is much easier to get back to.

Amateur. I went through every line of code in Football Manager on the BBC Micro in order to find “£20,000”- the amount of money you start the game with- and change it to £1,000,000. :cool:

Made the game hella buggy after a while, though.

I didn’t hack again until Fallout - hex edited all the character attribute values to 0A, which made them 10s, for reasons I didn’t quite understand.

A is 10 in base-16. (A is the first digit after 9.)

An error anyone could make…

To expand on this a bit, a “hex” editor is editing in “hexadecimal”, which is base 16. Hexadecimal uses the digits 0 through 9, but then you’re out of digits and you have to get all the way to 15. So, it just uses letters.

Decimal Hex
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 A
11 B
12 C
13 D
14 E
15 F
16 10
17 11
18 12
19 13
20 14

etc.

The reason people use hex editors and not decimal editors when editing binary files is that it is much easier to convert hexadecimal to binary than it is to convert decimal to binary. In hex, every character is 4 bits, and it is always the same 4 bits. 1 hex = 0001 binary, 2 hex = 0010 binary, etc.
So, if you have a hex number like 0F37, you can very easily convert it to binary.
0 = 0000
F = 1111
3 = 0011
7 = 0111
so 0F37 = 0000 1111 0011 0111.

This makes it very quick to go between hex and binary, and vice versa.
00101111 = 2F hex, which I was able to easily do in my head since 0010 = 2 and 1111 = F.

There is no simple correlation like this between decimal and binary.

A lot of times when you are editing you are setting or clearing individual bits, so it makes more sense to do it in hex.

Note also that Pasta’s C example can be compiled and assembled on many different types of computers–a Windows 7 machine, a new Mac, an old Mac, an iPhone, a mainframe, etc, etc. Any machine for which the compilers and assemblers are available. You start with the same source code, but the output is a different set of binary digits, because each ‘target’ machine is different. Different hardware (processor, screen, etc), different operating system. Each piece of software ends up custom-compiled for the machine it will run on. This is why ‘porting’ software to a new machine is a big deal: often even the source code has to be altered to take into account different capabilities of each target machine.

Of course, the original idea behind high-level languages was that they would make porting not a big deal: If you have a compiler for your computer, then it should do all the work of translating your source code to whatever your computer’s machine language is. Before the advent of compiled languages, though, if you wanted your program to run on a different computer, you pretty much had to rewrite it completely.

What about video game console emulators and ROM hacks? Isn’t that pretty similar to this discussion, since they have to rip the compiled code off a cartridge and work backwards? And by ROM hack I don’t mean just having the ROM, which is apparently super easy, I mean when they make their own games, like Super Demo World or Super Metroid Redesign or Zelda: Parallel Worlds or tons of others. Judging by some videos on YouTube it seems like they’re on the brink of making their own N64 games. It blows my mind.