Basic long-division pseudocode (useful for division and divisibility testing):
# given n >= 0 and d > 0, compute q and r with n = dq+r, q >= 0, 0 <= r < d
q = 0
i = 1
r = n # or operate destructively on n
while d <= r # shift divisor left until it exceeds the dividend; i is the place value
    d <<= 1
    i <<= 1
end
i >>= 1
while i != 0 # shift divisor right and subtract it off, accumulating quotient as we go
    d >>= 1
    if d <= r
        r -= d
        q += i
    end
    i >>= 1
end
The operations used here are assignment, inequality testing, addition, subtraction, and bitshifting (all relatively close to 6502 primitives, for register-sized numbers). To represent numbers ~10[sup]6[/sup] you need at least three-byte variables, so I’m imagining each of these operations as acting on (probably) arrays stored in memory, and themselves implemented as subroutines (or inserted directly if you want to avoid the JSR costs). There are a number of places where loops and conditions could be unrolled a bit for improved speed, and you could also convert this into two different routines for integer DIV (q only) and MOD (r only).
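For concreteness, here is the pseudocode above transcribed into Python (a sketch only; on the 6502 each of these operations would be a multi-byte subroutine as described, and the function name is mine):

```python
def divmod_shift(n, d):
    """Shift-and-subtract long division: returns (q, r) with n = d*q + r.

    Uses only comparison, subtraction, addition, and bit shifts,
    mirroring the pseudocode above."""
    assert n >= 0 and d > 0
    q, r, i = 0, n, 1
    while d <= r:          # shift divisor left until it exceeds the dividend
        d <<= 1
        i <<= 1
    i >>= 1
    while i != 0:          # shift back down, subtracting off where possible
        d >>= 1
        if d <= r:
            r -= d
            q += i         # i is the place value of this quotient bit
        i >>= 1
    return q, r
```

For example, `divmod_shift(1000000, 7)` agrees with Python's built-in `divmod`.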
I’ve implemented similar long-division code in 8088/80386 assembler (and after seeing a little bit of 6502 assembly code, having SO MANY REGISTERS felt like cheating!).
I’ve extracted the text of the aforementioned BXY document below using the 30-day shareware program “CiderPress”. Because it was a minor hassle, and because the contents of the article were interesting, and most of all because the contents of that discussion group are not available in plaintext anywhere else, I’ve posted the entirety of the article of interest here. Of particular interest are the run-time benchmarks below, listed in seconds and achieved by geeks who had to work with only the equipment on hand. The article is from October of 1981, and represents a pretty good look at what our primitive ancestors did for fun besides hunting mammoths.
I propose that instead of trying to re-invent the abacus using hardware with which we’re not really that familiar, we accept .69 seconds as a good-faith benchmark of the abilities of the platform. From there, I propose that we try to rewrite their (optimized?) algorithm for a modern platform, and see what kind of run time we get, and compare (ha-ha) Apples to Apples.
Ah, and on post-view, the text of the program is in a different article. Here’s the .74-second version of the program.
1000 *---------------------------------
1010 * SIEVE PROGRAM:
1020 * CALCULATES FIRST 1899 PRIMES IN .74 SECONDS!
1030 *
1040 * INSPIRED BY JIM GILBREATH
1050 * (SEE BYTE MAGAZINE, 9/81, PAGES 180-198.)
1060 * AND BY WILLIAM ROBERT SAVOIE
1070 * 4405 DELASHMITT RD. APT 15
1080 * HIXSON, TENN 37343
1090 *---------------------------------
1100 ARRAY .EQ $3500 FLAG BYTE ARRAY
1110 SIZE .EQ 8192 SIZE OF FLAG ARRAY
1120 *---------------------------------
1130 * PAGE-ZERO VARIABLES
1140 *---------------------------------
1150 A.PNTR .EQ $06,07 POINTER TO FLAG ARRAY FOR OUTER LOOP
1160 B.PNTR .EQ $08,09 POINTER TO FLAG ARRAY FOR INNER LOOP
1170 PRIME .EQ $1B,1C LATEST PRIME NUMBER
1180 COUNT .EQ $1D,1E # OF PRIMES SO FAR
1190 TIMES .EQ $1F COUNT LOOP
1200 *---------------------------------
1210 * APPLE ROM ROUTINES USED
1220 *---------------------------------
1230 PRINTN .EQ $F940 PRINT 2 BYTE NUMBER FROM MONITOR
1240 HOME .EQ $FC58 CLEAR VIDEO
1250 CR .EQ $FD8E CARRIAGE RETURN
1260 LINE .EQ $FD9E PRINT "-"
1270 BELL .EQ $FBE2 SOUND BELL WHEN DONE
1280 *---------------------------------
1290 * RUN PROGRAM 100 TIMES FOR ACCURATE TIME MEASUREMENTS!
1300 *---------------------------------
1310 START JSR HOME CLEAR SCREEN
1320 LDA #100 LOOP 100 TIMES
1330 STA TIMES SET COUNTER
1340 .1 JSR GENERATE.PRIMES
1350 LDA $400 TOGGLE SCREEN FOR VISIBLE INDICATOR
1360 EOR #$80 OF ACTION
1370 STA $400
1380 DEC TIMES
1390 BNE .1 LOOP
1400 JSR BELL READ WATCH!
1410 LDY COUNT+1 GET HI BYTE OF COUNT
1420 LDX COUNT
1430 JSR PRINTN PRINT PRIMES FOUND
1440 RTS
1450 *---------------------------------
1460 * GENERATE THE PRIMES
1470 *---------------------------------
1480 GENERATE.PRIMES
1490 LDY #0 CLEAR INDEX
1500 STY COUNT CLEAR COUNT VARIABLE
1510 STY COUNT+1
1520 STY A.PNTR SET UP POINTER FOR OUTER LOOP
1530 LDA /ARRAY
1540 STA A.PNTR+1
1550 LDA #1 LOAD WITH ONE
1560 LDX /SIZE NUMBER OF PAGES TO STORE IN
1570 *---------------------------------
1580 * SET EACH ELEMENT IN ARRAY TO ONE
1590 *---------------------------------
1600 .1 STA (A.PNTR),Y SET FLAG TO 1
1610 INY NEXT LOCATION
1620 BNE .1 GO 256 TIMES
1630 INC A.PNTR+1 POINT AT NEXT PAGE
1640 DEX NEXT PAGE
1650 BNE .1 MORE PAGES
1660 *---------------------------------
1670 * SCAN ENTIRE ARRAY, LOOKING FOR A PRIME
1680 *---------------------------------
1690 LDA /ARRAY SET A.PNTR TO BEGINNING AGAIN
1700 STA A.PNTR+1
1710 .2 LDY #0 CLEAR INDEX
1720 LDA (A.PNTR),Y LOOK AT NEXT FLAG
1730 BEQ .6 NOT PRIME, ADVANCE POINTER
1740 *---------------------------------
1750 * CALCULATE CURRENT INDEX INTO FLAG ARRAY
1760 *---------------------------------
1770 SEC
1780 LDA A.PNTR+1
1790 SBC /ARRAY
1800 TAX SAVE HI-BYTE OF INDEX
1810 LDA A.PNTR LO-BYTE OF INDEX
1820 *---------------------------------
1830 * CALCULATE NEXT PRIME NUMBER WITH P=I+I+3
1840 *---------------------------------
1850 ASL DOUBLE THE INDEX
1860 TAY
1870 TXA HI-BYTE OF INDEX
1880 ROL
1890 TAX
1900 TYA NOW ADD 3
1910 ADC #3
1920 STA PRIME
1930 BCC .3
1940 INX
1950 .3 STX PRIME+1
1960 *---------------------------------
1970 * FOLLOWING 4 LINES CHANGE ALGORITHM SLIGHTLY
1980 * TO SPEED IT UP FROM .93 TO .74 SECONDS
1990 *---------------------------------
2000 TXA TEST HIGH BYTE
2010 BNE .5 PRIME > SQRT(16384)
2020 CPY #127
2030 BCS .5 PRIME > SQRT(16384)
2040 *---------------------------------
2050 * NOW CLEAR EVERY P-TH ENTRY AFTER P
2060 *---------------------------------
2070 LDY #0
2080 LDA A.PNTR USE CURRENT OUTER POINTER FOR INNER POINTER
2090 STA B.PNTR
2100 LDA A.PNTR+1
2110 STA B.PNTR+1
2120 CLC BUMP ARRAY POINTER BY P
2130 .4 LDA B.PNTR BUMP TO NEXT SLOT
2140 ADC PRIME
2150 STA B.PNTR
2160 LDA B.PNTR+1
2170 ADC PRIME+1
2180 STA B.PNTR+1
2190 CMP /ARRAY+SIZE SEE IF BEYOND END OF ARRAY
2200 BCS .5 YES, FINISHED CLEARING
2210 TYA NO, CLEAR ENTRY IN ARRAY
2220 STA (B.PNTR),Y
2230 BEQ .4 ...ALWAYS
2240 *---------------------------------
2250 * NOW COUNT PRIMES FOUND (C=C+1)
2260 *---------------------------------
2270 .5
2280 * JSR PRINTP PRINT PRIME
2290 INC COUNT
2300 BNE .6
2310 INC COUNT+1
2320 *---------------------------------
2330 * ADVANCE OUTER POINTER AND TEST IF FINISHED
2340 *---------------------------------
2350 .6 INC A.PNTR
2360 BNE .7
2370 INC A.PNTR+1
2380 .7 LDA A.PNTR+1
2390 CMP /ARRAY+SIZE
2400 BCC .2
2410 RTS
2420 *---------------------------------
2430 * OPTIONAL PRINT PRIME SUBROUTINE
2440 *---------------------------------
2450 PRINTP LDY PRIME+1 HI BYTE
2460 LDX PRIME
2470 JSR PRINTN PRINT DECIMAL VAL
2480 JSR LINE VIDEO "-" OUT
2490 RTS
It’s interesting to think that they could run it once and have meaningful timing data…
If you scan the program, he actually ran it 100 times in a loop and stood there with a stopwatch. I presume that he then did several ~70 second trials, and divided the mean by 100, but he doesn’t come out and say so.
I think that ultrafilter has the best line on a fast implementation. Eratosthenes’s method removes all divisions, which should speed things up immensely.
How much memory did an Apple ][ have, anyways? 16k? 64k?
The sieve of Eratosthenes makes heavy use of division. Do you have something else in mind?
The very first Apple II model came with 4KB, and could be expanded in 4K or 16K increments up to 12 or 48K, respectively. I don’t know how many people really settled for a mere 4K. Quite limiting, even by 1977 expectations.
The next model, the Apple II+, came with 48K, and could be expanded to 64K with an add-on card. Later third-party cards allowed adding much more.
The enhanced IIe and the IIc generally had 128K standard. (The first, unenhanced IIe had only 64K, but there was an upgrade kit for those sold soon afterward.) Again, cards from Apple and other companies permitted expanding the RAM, by as much as 3 MB if I remember right.
Note that the 6502 only has a 64KB (16-bit) address space. Any amount of memory over 64KB, including the ROM, had to be accessed through bank-switching. Generally speaking that was a big pain in the ass.
Maybe I misunderstood which sieve you were talking about. I was thinking of the sieve where you start at 2 and count up by 2, marking off every number you land on, then move to the next unmarked number, 3, and count up by 3, marking off those, and so on.
I wrote an implementation back in college that saved space by doing a breadth-first search of the primes instead of a depth-first search, using a heap to hold the primes up to sqrt(N) and a counter for each prime, with the heap ordered by each node’s counter. For searching up to 10[sup]8[/sup], you only need a heap for primes up to 10[sup]4[/sup], so it would require less than 10[sup]4[/sup]/(ln(10[sup]4[/sup]) - 1.08366) (about 1231) entries. Each entry would need 2 bytes to hold the prime and 4 bytes to hold the counter, so it would need less than 8k to hold the heap.
You start off with the main counter set to 3 and one entry in the heap, for 3. Each time you create an entry in the heap for prime p, you initialize its counter to p[sup]2[/sup] and place it at the end of the heap. You increment the main counter by 2 until you hit or pass the counter for the head of the heap, inserting a new entry for each number you pass that is less than the head’s counter, as those are primes. Also, you increment the prime-counter by 1. Finally, you increment the counter of the head of the heap by twice its prime number and reheapify, looping over this final step until the head element’s counter is greater than the main counter. Then you loop back to incrementing the main counter by 2, until you’ve inserted entries for all primes up to sqrt(N).
Once you have entries for all primes up to sqrt(N), you do pretty much the same thing but without inserting new entries into the heap. After you test N, you have a count of all the prime numbers up to N.
For those who don’t know what a heap is, it’s a balanced tree where every node is not greater than any of its descendants. Since it’s always perfectly balanced, it can be implemented as an array, and nodes do not need any pointers to children or parents, as those are already implied by a node’s index in the array. Calling the root of the heap index 1, the children of any node at index i are the nodes at index 2i and 2i + 1. So, for instance: [ 1 5 2 6 8 4 3 7 ] is a heap. 1 is less than 5 and 2, 5 is less than 6 and 8, 2 is less than 4 and 3, and 6 is less than 7.
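A sketch of the heap-based sieve described above, using Python's heapq module. This folds the prime and its counter into one tuple of (next odd multiple, step of 2p) rather than the separate fields described, but the idea is the same; the function name is mine:

```python
import heapq

def count_primes(n):
    """Count primes <= n using a min-heap of (next odd composite, step)
    entries, in the spirit of the breadth-first sieve described above."""
    if n < 2:
        return 0
    count = 1          # account for 2; from here on consider only odd numbers
    heap = []          # entries: (next odd composite for a prime p, step 2p)
    x = 3              # the "main counter"
    while x <= n:
        if heap and heap[0][0] == x:
            # x is composite: advance every heap counter that sits at x
            while heap[0][0] == x:
                comp, step = heap[0]
                heapq.heapreplace(heap, (comp + step, step))
        else:
            count += 1                  # x is a new prime
            if x * x <= n:              # only primes up to sqrt(n) need entries
                heapq.heappush(heap, (x * x, 2 * x))
        x += 2
    return count
```

For a search up to 10[sup]8[/sup] the heap only ever holds primes up to 10[sup]4[/sup], matching the space estimate in the post above.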
I’ve never seen that implementation before, but it makes a lot of sense. Usually you see it as something like “Divide every number by the smallest number not crossed off and cross off the ones that leave no remainder and go back to the beginning of this sentence”.
Um,
ultrafilter,
You can implement the sieve with no divisions at all. It just takes a humongous blob of RAM.
Here’s the (super-crude) outline from long-ago memory:
Allocate 1 million words of RAM.
Initialize with integers 1, 2, 3, 4 etc to 1 million
Compute square root of last array element value, save as StopPoint
Initialize variable PrimeIncrement to 2
OuterLoopTop:
InnerLoop: Starting with the PrimeIncrement-th entry in the array, set every PrimeIncrement-th value to zero (i.e. set elements 2, 4, 6, … to zero on the first pass). Stop at end of array.
Starting with Array(PrimeIncrement), scan upwards in array for next non-zero entry. Set PrimeIncrement = found value.
If PrimeIncrement > StopPoint then exit, else go back to OuterLoopTop:
after exit: Scan through array, outputting non-zero entries as found primes.
That’s the crudest code outline I’ve done in years, but you can probably follow it. Dinner’s on & I gotta go.
It’ll work using nothing but integer addition & addressing math. You can even do the single square root calc by simply taking the number with 1/2 the bits of your max. i.e. if you allocate 2^24 array entries, then 2^12 is a good square root.
Obvious improvements are to simply initialize with 1’s rather than 1, 2, 3, 4 etc, to use bits rather than bytes or words for each entry to save 8x-32x on storage, and to be smarter about the special case of 2, so you can cut out the first pass and all the even values, which saves another 2x on storage.
Other than the table itself, this can be implemented with just a couple of scalars and only a few dozen machine instructions. I did it once on a 370 with all the speedups I mentioned and then some, and all work vars and addressing calcs were in CPU registers. The only core (RAM hadn’t been invented yet) reference was for the table itself. Even the code fit in the tiny CPU cache of that monster.
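The outline above, transcribed into Python as a sketch (the function and variable names are mine; it keeps the outline's division-free structure, where the array holds the integers themselves and composites are zeroed by repeated addition):

```python
import math

def sieve_outline(limit):
    """Division-free sieve following the outline above."""
    a = list(range(limit + 1))   # initialize with integers 0, 1, 2, ... limit
    a[0] = a[1] = 0              # 0 and 1 are not primes
    stop = math.isqrt(limit)     # the StopPoint of the outline
    p = 2                        # PrimeIncrement
    while p <= stop:
        # inner loop: zero every p-th entry after p itself, stopping at the end
        for j in range(2 * p, limit + 1, p):
            a[j] = 0
        # scan upward for the next non-zero entry: that's the next prime
        p += 1
        while a[p] == 0:
            p += 1
    # after exit: the non-zero entries are the primes
    return [v for v in a if v != 0]
```

The only arithmetic is addition and indexing, as claimed in the post.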
Sorry, should have previewed.
Well, there you go. The reference I have (an introductory discrete math book) doesn’t explicitly mention division, but it seems like that’s what the author had in mind. And you can implement it that way, although it’s probably better not to. I am corrected.
So how many simultaneously-executing Apple II emulators can a modern PC run before their average execution speed is about that of the original Apple II?
(I know, I know, the context switching will eat you alive, not to mention the RAM requirements)
sorry, I should have previewed too.
You don’t really need that much memory, because you don’t really need the array to hold the values. If a(i) = i for all i, then you don’t need to store the value - the subscript tells you what the value should be.
All you need is a logical array to tell you whether a given number has been crossed off or not. If the memory is byte-addressable, then the simplest way is with a byte array. If you want it even more compact, you’d use a packed bit array.
To find where bit n is stored, the 3 low order bits of n tell you the bit position, and n rightshifted 3 times tells you which byte it’s in.
That gets the storage requirement in your example down from 4 MB to 125 KB.
Agreed. As I said in the last paragraph. I was trying to make the description as simple as possible, not the execution.
If you look at the end of LSLguy’s post, you’ll see he mentioned just that. In fact, he does one better, noting that you only need to store the bits for odd numbers, reducing the memory needed to a bit more than 61k.