I’m generally not a nostalgic guy, but this thread is making me nostalgic!
I remember writing neat little hacks in assembler and BASIC for the 64 and 128. I had two or three “Magic Tricks” published in RUN magazine. I wrote a BBS for the 128 that used a bit of relocatable assembler code for things like PETSCII to ASCII conversion, and an iTunes-like SID player. Those were the days.
Aside from games, there was the “demo scene,” and those guys really, really knew how to push the hardware. I think those guys still exist, and if they do, I would bet they have reference materials (even if not catalogued and organized) covering lots and lots of hardware features left undocumented by CBM. The OP might find himself going down a rabbit hole, but it might be a good start.
Brief summary: A floating-point number is stored in something like scientific notation, something times two-to-the-something. The exponent is stored in the high-order bits (right after the sign bit). So reinterpreting a floating-point number’s bits as an integer is (approximately) like taking its base-2 logarithm.
Shifting that integer logarithm right by one bit divides it by two, and halving a logarithm is equivalent to taking the square root of the original number. Subtracting the logarithm from a constant, meanwhile, negates it, which is equivalent to taking the reciprocal of the original number. Then we reinterpret the bits as a float again, so the whole sequence is (approximately) equivalent to 1/sqrt(x).
Now, that approximation isn’t quite good enough to use on its own, so we use it as the starting point for a better one, via Newton’s method (a tried-and-true technique for numerical calculation). The programmer originally wrote two steps of Newton’s method, but it turns out one is good enough, so the second step was edited out.
As for that magic number 0x5F3759DF, it depends on the precise details of the floating-point representation (how many bits are used for each part). Naively, the magic number “should” be the bit pattern of sqrt(2^127). The constant actually used is close to that, but not exactly equal. It turns out that while sqrt(2^127) gives a better first approximation, the constant actually used comes out slightly ahead after one Newton step.
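For reference, the routine being described is the famous fast inverse square root from the Quake III Arena source. Here it is lightly cleaned up; the comments are mine, and strict modern C would use memcpy or a union rather than the pointer casts, but this is the trick as it shipped:

#include <stdint.h>

float Q_rsqrt(float number)
{
    float x2 = number * 0.5f;
    float y  = number;

    int32_t i = *(int32_t *)&y;   /* reinterpret the float's bits as an integer */
    i = 0x5F3759DF - (i >> 1);    /* halve the "logarithm" and subtract it from the magic number */
    y = *(float *)&i;             /* reinterpret back: a rough approximation of 1/sqrt(x) */

    y = y * (1.5f - x2 * y * y);  /* one step of Newton's method */
    /* a second, identical Newton step originally followed, edited out as unnecessary */

    return y;
}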
Ain’t no floating point on a 6502 processor, though. What you did have, if the user was rich enough to spring for a Commodore floppy drive, was a perfectly good second processor…
For something as simple as 8-bit addition with an 8-bit result on a 6502, to give an example for the OP, the code would be something like this. You only need to use the accumulator; the X- and Y-registers aren’t needed.
CLC       ; Clear the carry flag
LDA b_var ; Load the accumulator with the contents of memory address b_var
ADC c_var ; Add with carry (to the accumulator) the contents of memory address c_var
STA a_var ; Store the accumulator to memory address a_var
RTS       ; Return from subroutine
With a_var, b_var, and c_var being the addresses where the variables a, b, and c are stored.
So, when disassembled, you might see something like:
CLC
LDA $C102
ADC $C101
STA $C100
RTS
You won’t have variable names upon disassembly. All you will see is that the program clears the carry bit, loads the accumulator with the contents of memory $C102, adds the contents of memory $C101 to the accumulator, and stores the result in memory location $C100. Those memory locations need not be those particular ones; the variables can be stored anywhere in available memory, and they don’t have to be contiguous. I just pretended we were storing A at $C100, B at $C101, and C at $C102.
So that’s part of the tricky part: with the disassembly you see what’s going on, but you don’t have variable names, labels, comments, etc., so you won’t necessarily know what memory locations $C100, $C101, and $C102 are being used for. You’ll just understand that this routine adds what’s in the latter two memory locations and stores the result in the first. By looking at the rest of the disassembly, you’ll have to infer what those memory locations are for. Is it keeping track of which “room” of the game you are in? Is it the number of enemies on a screen? Is it the x- or y-coordinate of a sprite? Etc.
(Note that this will only let you add two 8-bit numbers and return an 8-bit result. So if you add 255 + 1, a 0 will be stored in the memory location for a_var. If you want to allow for a 16-bit result, you have to add a couple more lines and use an a_variable_low_byte and an a_variable_high_byte to store the low and high bytes of the result; the high byte gets a 0 or a 1 depending on the status of the carry flag after the addition. This is not meant to be an exhaustive example, but rather a simple illustration of what something like A = B + C might look like in 6502 assembly when disassembled, so the OP can get an idea of what they would be getting themselves into were they to undertake the project in the OP.)
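To make that concrete, here is a small C sketch of what the 16-bit-result version computes; the function and pointer names are mine, invented for illustration, and the comments note the corresponding 6502 steps:

#include <stdint.h>

/* Add two 8-bit values, producing a 16-bit result. On the 6502 the
   high byte is just the carry flag left over from the ADC. */
void add8_with_16bit_result(uint8_t b, uint8_t c,
                            uint8_t *a_variable_low_byte,
                            uint8_t *a_variable_high_byte)
{
    uint16_t sum = (uint16_t)b + (uint16_t)c;       /* CLC / LDA b_var / ADC c_var */
    *a_variable_low_byte  = (uint8_t)(sum & 0xFF);  /* STA the low byte */
    *a_variable_high_byte = (uint8_t)(sum >> 8);    /* the carry flag: 0 or 1 */
}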
Definitely. I was deliberately leaving my assembly example vague; my experience was in college with IBM 360 assembly language, not x86 or anything like that.
But yours is better: you take a simple thing, A = B + C, and show what that translates into in assembly.
(BTW, assembly and architecture were the two computer science courses that really hammered home how computers actually work. Prior to those, there was a big element of “below this level, magic happens” for me. But taking those two, I learned how things actually work down low.)
I once knew someone who did Mel’s trick on old drum-based calculators. Monroe’s, IIRC. Even in Knuth there are a number of instances of what would today be considered “bad programming practice.” So remember, kids: what you do today will be judged by the politics of tomorrow. You can’t win, you can’t break even, you can’t get out of the game.
As for self-modifying code, in my earlier days device drivers were often self-modifying in order to be fast enough to keep up and to minimize overhead during interrupt cycles. Debugging typically required a logic analyzer to know what was actually going on on the bus. I really hated this at the time, but now I’m pretty nostalgic.
Now we have namby-pamby processors that separate data and instructions (unlike real von Neumann machines) for “security,” so we have to fake it by putting an interpreter in the instruction memory and the program in data memory. Or do incremental compilation with a higher-level language (like Lisp) that deals with this nonsense for you. (In Lisp, data is executable by calling ‘eval’ on it; programs and data are not distinguished, as they are both ‘lists’; and depending on your implementation, you can redefine even core parts of the language on the fly and have it take effect while the program is running.)
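To illustrate the workaround, here is a minimal sketch in C; the opcodes and layout are invented for the example. Because the “program” is just bytes in a data array, you can patch it at run time even though the host CPU won’t let you touch its own instruction stream:

#include <stdio.h>
#include <stdint.h>

enum { OP_PUSH, OP_ADD, OP_SUB, OP_PRINT, OP_HALT };

int main(void)
{
    /* the "program" lives in data memory */
    uint8_t program[] = { OP_PUSH, 7, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
    int stack[16], sp = 0, pc = 0;

    program[4] = OP_SUB;  /* "self-modifying": patch the ADD into a SUB before running */

    for (;;) {            /* the interpreter lives in instruction memory */
        switch (program[pc++]) {
        case OP_PUSH:  stack[sp++] = program[pc++];    break;
        case OP_ADD:   sp--; stack[sp-1] += stack[sp]; break;
        case OP_SUB:   sp--; stack[sp-1] -= stack[sp]; break;
        case OP_PRINT: printf("%d\n", stack[sp-1]);    break;
        case OP_HALT:  return 0;
        }
    }
}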
Yep. Though these days it’s not much use to anyone who’s not working in hardware, and even then probably only to a small percentage.
Even when I took architecture back in 1997 (the class centered on the MIPS pipeline), the professor said on the first day that only about 3% of the students taking the class would ever need to apply what they were about to learn, since at the time the pipeline-processing field was saturated as far as job availability went. And this was Computer Engineering, not Computer Science, so I imagine even fewer CS majors would ever have used it.
Our assembler course, which I took back in 1996, was based on the Intel 8086. Our professor was cheap and gave us Borland Turbo Assembler 1.0 (the version without the debugger) as our development environment. It was free for him, and awful for us. Running a DOS assembler with no sandbox on Windows 95 leads to all sorts of fun. If you were lucky, you just got weird ASCII text somewhere on your screen. If you were unlucky, you got a BSOD because you accessed something you weren’t supposed to. Most of the time you were unlucky.

The application we were building was to have multiple (ASCII character-based) windows that performed various tasks, such as a calculator. (The calculator is the only one I can remember.) At one point, I don’t know what I did, but I drew a blue border around my desktop wallpaper that didn’t go away until the next time my system rebooted. (Which no doubt happened shortly thereafter, given the inherent instability of the whole process of making this application.)
The border drew itself starting at the top center of the screen, then mirrored itself as it traced the perimeter downward, finishing at the bottom center. I was struck with the odd thought that my computer was telling me I was a “square” with its invisible fingers.
Stepping back from self-modifying code for a minute (not completely, though): one of the things the 6502 processor couldn’t do in hardware was multiply numbers. You had to program it to do that. The right kind of magic could save hundreds of cycles. That’s the type of thing one might encounter disassembling an old, clever program.
There is still a niche for learning this kind of stuff, though not much for today’s practical computer programmer using the aforementioned namby-pamby processors and high-level languages.
I did Mel’s trick in high school on an LGP-21, the later, transistorized, and slower version of the LGP-30 that Mel used. (PDF of the user manual here.)
With 4K of memory, you almost had to modify your code. I wrote a tic-tac-toe program, and the best way to figure out what the opponent would do in the space available was to change a bunch of add instructions to subtracts. I had an initialization sequence to make sure they were all adds when you started the program. This was in 1969, so don’t expect me to remember all the details.
No assembler for that thing. I wrote one myself. No ASCII. No interrupts. It was a great way to learn how to program.
Yeah, math was pretty nuts on the 6502. You could also multiply or divide by two in hardware thanks to the bit-shifting instructions (ASL, LSR, ROL, ROR), but addition and subtraction were the only dedicated arithmetic instructions.
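For example, the shift and rotate instructions pair up for multi-byte arithmetic: ASL shifts the low byte’s top bit out into the carry flag, and ROL then rotates that carry into the bottom of the high byte. A rough C model of that pairing (the function name is mine):

#include <stdint.h>

/* Double a 16-bit value held as two 8-bit bytes, the way an
   ASL/ROL pair does it on the 6502. */
void asl16(uint8_t *lo, uint8_t *hi)
{
    uint8_t carry = (*lo >> 7) & 1;       /* the bit ASL shifts out into the carry flag */
    *lo = (uint8_t)(*lo << 1);            /* ASL low byte */
    *hi = (uint8_t)((*hi << 1) | carry);  /* ROL high byte: carry enters bit 0 */
}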
I worked on the 6809, which did have multiply. I was pretty surprised when I talked to friends using the 6502 and they told me about some of the things they had to do.
Of course, once any one programmer figured out a clever and efficient multiplication routine, all other programmers could use it without needing to fully understand it, either by calling it from a library or by copying and pasting it. You might still need to understand the routine’s performance, though: one can imagine a routine that’s much faster but of limited precision, or one that’s only good for a certain range of values, or the like.
Back in the day, if you needed a general-purpose multiply or divide, most people used some variant of Booth’s algorithm, which is basically just breaking multiplication or division down into a series of add/subtract and shift instructions. It’s a bit CPU-intensive, since an 8-bit by 8-bit multiply or divide requires 8 passes through the add/subtract-and-shift loop, but a lot of early CPUs that had multiply and divide instructions did basically the same thing internally.
While Booth’s algorithm requires a lot of looping, it has the advantages of being both compact and easy to understand.
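Here is a sketch in C of the simple unsigned shift-and-add variant (full Booth recoding also handles signed operands by examining bit pairs, but the loop structure is the same). Note the 8 passes, one per multiplier bit:

#include <stdint.h>

/* 8-bit x 8-bit -> 16-bit multiply by shifting and adding. */
uint16_t mul8x8(uint8_t a, uint8_t b)
{
    uint16_t product = 0;
    uint16_t addend  = a;        /* the multiplicand, shifted left each pass */
    for (int bit = 0; bit < 8; bit++) {
        if (b & 1)               /* low multiplier bit set? add the shifted multiplicand */
            product += addend;
        b >>= 1;                 /* move on to the next multiplier bit */
        addend <<= 1;            /* multiplicand times two */
    }
    return product;
}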
The software routines linked to above, which trade space for time by using a pre-computed table of squares, seem to run slightly faster. I guess it depends on what your constraints are. Similar issues arise with the Z80.
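The linked routines aren’t reproduced here, so the details are an assumption on my part, but the classic table-of-squares trick rests on the quarter-square identity a*b = f(a+b) - f(|a-b|), where f(x) = floor(x*x/4). A sketch in C, assuming that’s the method:

#include <stdint.h>

static uint16_t quarter_sq[511];  /* f(x) = floor(x*x/4) for x = 0..510 */

void init_quarter_squares(void)
{
    for (uint32_t x = 0; x < 511; x++)
        quarter_sq[x] = (uint16_t)(x * x / 4);
}

uint16_t mul8x8_by_table(uint8_t a, uint8_t b)
{
    uint16_t sum  = (uint16_t)a + b;
    uint8_t  diff = (a > b) ? (a - b) : (b - a);
    /* a+b and a-b always have the same parity, so the floors cancel exactly */
    return quarter_sq[sum] - quarter_sq[diff];
}

For 8-bit operands the table costs about 1 KB (511 two-byte entries), which is exactly the space-for-time trade described above.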