How Much Faster Is My Computer Than An Apple ][

Here you go.

Basic long-division pseudocode (useful for division and divisibility testing):


# given n >= 0 and d > 0, compute q and r with n = dq+r, q >= 0, 0 <= r < d

i = 1
r = n     # or operate destructively on n
while d <= r     # shift divisor left until it beats dividend; i is the placeholder
  d <<= 1
  i <<= 1
end
i >>= 1
while i != 0     # shift divisor left and subtract it off, accumulating quotient as we go
  d >>= 1
  if d <= r
    r -= d
    q += i
  end
  i >>= 1
end

The operations used here are assignment, inequality testing, addition, subtraction, and bitshifting (all relatively close to 6502 primitives, for register-sized numbers). To represent numbers ~10[sup]6[/sup] you need at least three-byte variables, so I’m imagining each of these operations as acting on (probably) arrays stored in memory, and themselves implemented as subroutines (or inserted directly if you want to avoid the JSR costs). There are a number of places where loops and conditions could be unrolled a bit for improved speed, and you could also convert this into two different routines for integer DIV (q only) and MOD (r only).

I’ve implemented similar long-division code in 8088/80386 assembler (and after seeing a little bit of 6502 assembly code, having SO MANY REGISTERS felt like cheating!).

I’ve extracted the text of the aforementioned BXY document below using the 30-day shareware program “CiderPress”. Because it was a minor hassle, and because the contents of the article were interesting, and most of all because the contents of that discussion group are not available plaintext anywhere else, I’ve posted the entirety of the article of interest here. Of particular interest are the run-time benchmarks below, listed in seconds, and optimized by geeks who had to work with only the equipment on-hand. The article is from October of 1981, and represents a pretty good look at what our primitive ancestors did for fun besides hunting mammoths. :wink:

I propose that instead of trying to re-invent the abacus using hardware with which we’re not really that familiar, we accept .69 seconds as a good-faith benchmark of the abilities of the platform. From there, I propose that we try to rewrite their (optimized?) algorithm for a modern platform, and see what kind of run time we get, and compare (ha-ha) Apples to Apples.

Ah, and on post-view, the text of the program is in a different article. Here’s the .74-second version of the program.



1000  *---------------------------------
 1010  * SIEVE PROGRAM:
 1020  * CALCULATES FIRST 1899 PRIMES IN .74 SECONDS!
 1030  *
 1040  * INSPIRED BY JIM GILBREATH
 1050  *   (SEE BYTE MAGAZINE, 9/81, PAGES 180-198.)
 1060  * AND BY WILLIAM ROBERT SAVOIE
 1070  *   4405 DELASHMITT RD. APT 15
 1080  *   HIXSON, TENN  37343
 1090  *---------------------------------
 1100  ARRAY  .EQ $3500    FLAG BYTE ARRAY
 1110  SIZE   .EQ 8192     SIZE OF FLAG ARRAY
 1120  *---------------------------------
 1130  * PAGE-ZERO VARIABLES
 1140  *---------------------------------
 1150  A.PNTR .EQ $06,07   POINTER TO FLAG ARRAY FOR OUTER LOOP
 1160  B.PNTR .EQ $08,09   POINTER TO FLAG ARRAY FOR INNER LOOP
 1170  PRIME  .EQ $1B,1C   LATEST PRIME NUMBER
 1180  COUNT  .EQ $1D,1E   # OF PRIMES SO FAR
 1190  TIMES  .EQ $1F      COUNT LOOP
 1200  *---------------------------------
 1210  * APPLE ROM ROUTINES USED
 1220  *---------------------------------
 1230  PRINTN .EQ $F940    PRINT 2 BYTE NUMBER FROM MONITOR
 1240  HOME   .EQ $FC58    CLEAR VIDEO
 1250  CR     .EQ $FD8E    CARRIAGE RETURN
 1260  LINE   .EQ $FD9E    PRINT "-"
 1270  BELL   .EQ $FBE2    SOUND BELL WHEN DONE
 1280  *---------------------------------
 1290  * RUN PROGRAM 100 TIMES FOR ACCURATE TIME MEASUREMENTS!
 1300  *---------------------------------
 1310  START  JSR HOME     CLEAR SCREEN
 1320         LDA #100     LOOP 100 TIMES
 1330         STA TIMES    SET COUNTER
 1340  .1     JSR GENERATE.PRIMES
 1350         LDA $400     TOGGLE SCREEN FOR VISIBLE INDICATOR
 1360         EOR #$80     OF ACTION
 1370         STA $400
 1380         DEC TIMES
 1390         BNE .1       LOOP
 1400         JSR BELL     READ WATCH!
 1410         LDY COUNT+1  GET HI BYTE OF COUNT
 1420         LDX COUNT
 1430         JSR PRINTN   PRINT PRIMES FOUND
 1440         RTS
 1450  *---------------------------------
 1460  *      GENERATE THE PRIMES
 1470  *---------------------------------
 1480  GENERATE.PRIMES
 1490         LDY #0       CLEAR INDEX
 1500         STY COUNT    CLEAR COUNT VARIABLE
 1510         STY COUNT+1
 1520         STY A.PNTR   SET UP POINTER FOR OUTER LOOP
 1530         LDA /ARRAY
 1540         STA A.PNTR+1
 1550         LDA #1       LOAD WITH ONE
 1560         LDX /SIZE      NUMBER OF PAGES TO STORE IN
 1570  *---------------------------------
 1580  * SET EACH ELEMENT IN ARRAY TO ONE
 1590  *---------------------------------
 1600  .1     STA (A.PNTR),Y  SET FLAG TO 1
 1610         INY          NEXT LOCATION
 1620         BNE .1       GO 256 TIMES
 1630         INC A.PNTR+1 POINT AT NEXT PAGE
 1640         DEX          NEXT PAGE
 1650         BNE .1       MORE PAGES
 1660  *---------------------------------
 1670  * SCAN ENTIRE ARRAY, LOOKING FOR A PRIME
 1680  *---------------------------------
 1690         LDA /ARRAY   SET A.PNTR TO BEGINNING AGAIN
 1700         STA A.PNTR+1
 1710  .2     LDY #0       CLEAR INDEX
 1720         LDA (A.PNTR),Y  LOOK AT NEXT FLAG
 1730         BEQ .6       NOT PRIME, ADVANCE POINTER
 1740  *---------------------------------
 1750  * CALCULATE CURRENT INDEX INTO FLAG ARRAY
 1760  *---------------------------------
 1770         SEC
 1780         LDA A.PNTR+1
 1790         SBC /ARRAY
 1800         TAX          SAVE HI-BYTE OF INDEX
 1810         LDA A.PNTR   LO-BYTE OF INDEX
 1820  *---------------------------------
 1830  * CALCULATE NEXT PRIME NUMBER WITH P=I+I+3
 1840  *---------------------------------
 1850         ASL          DOUBLE THE INDEX
 1860         TAY
 1870         TXA          HI-BYTE OF INDEX
 1880         ROL
 1890         TAX
 1900         TYA          NOW ADD 3
 1910         ADC #3
 1920         STA PRIME
 1930         BCC .3
 1940         INX
 1950  .3     STX PRIME+1
 1960  *---------------------------------
 1970  * FOLLOWING 4 LINES CHANGE ALGORITHM SLIGHTLY
 1980  * TO SPEED IT UP FROM .93 TO .74 SECONDS
 1990  *---------------------------------
 2000         TXA          TEST HIGH BYTE
 2010         BNE .5       PRIME > SQRT(16384)
 2020         CPY #127
 2030         BCS .5       PRIME > SQRT(16384)
 2040  *---------------------------------
 2050  * NOW CLEAR EVERY P-TH ENTRY AFTER P
 2060  *---------------------------------
 2070         LDY #0
 2080         LDA A.PNTR   USE CURRENT OUTER POINTER FOR INNER POINTER
 2090         STA B.PNTR
 2100         LDA A.PNTR+1
 2110         STA B.PNTR+1
 2120         CLC          BUMP ARRAY POINTER BY P
 2130  .4     LDA B.PNTR   BUMP TO NEXT SLOT
 2140         ADC PRIME
 2150         STA B.PNTR
 2160         LDA B.PNTR+1
 2170         ADC PRIME+1
 2180         STA B.PNTR+1
 2190         CMP /ARRAY+SIZE     SEE IF BEYOND END OF ARRAY
 2200         BCS .5       YES, FINISHED CLEARING
 2210         TYA          NO, CLEAR ENTRY IN ARRAY
 2220         STA (B.PNTR),Y
 2230         BEQ .4       ...ALWAYS
 2240  *---------------------------------
 2250  * NOW COUNT PRIMES FOUND  (C=C+1)
 2260  *---------------------------------
 2270  .5
 2280  *      JSR PRINTP   PRINT PRIME
 2290         INC COUNT
 2300         BNE .6 
 2310         INC COUNT+1
 2320  *---------------------------------
 2330  * ADVANCE OUTER POINTER AND TEST IF FINISHED
 2340  *---------------------------------
 2350  .6     INC A.PNTR
 2360         BNE .7
 2370         INC A.PNTR+1
 2380  .7     LDA A.PNTR+1
 2390         CMP /ARRAY+SIZE
 2400         BCC .2
 2410         RTS
 2420  *---------------------------------
 2430  * OPTIONAL PRINT PRIME SUBROUTINE
 2440  *---------------------------------
 2450  PRINTP LDY PRIME+1  HI BYTE 
 2460         LDX PRIME
 2470         JSR PRINTN   PRINT DECIMAL VAL
 2480         JSR LINE     VIDEO "-" OUT
 2490         RTS



It’s interesting to think that they could run it once and have meaningful timing data…

If you scan the program, he actually ran it 100 times in a loop and stood there with a stopwatch. I presume that he then did several ~70 second trials, and divided the mean by 100, but he doesn’t come out and say so.

I think that ultrafilter has the best line on a fast implementation. Eratosthenes’s method removes all divisions, which should speed things up immensely.

How much memory did an Apple ][ have, anyways? 16k? 64k?

The sieve of Eratosthenes makes heavy use of division. Do you have something else in mind?

The very first Apple II model came with 4KB, and could be expanded in 4K or 16K increments up to 12 or 48K, respectively. (Reference.) I don’t know how many people really settled for a mere 4K. Quite limiting, even by 1977 expectations.

The next model, the Apple II+, came with 48K, and could be expanded to 64K with an add-on card. Later third-party cards allowed adding much more.

The enhanced IIe and the IIc generally had 128K standard. (The first, unenhanced IIe had only 64K, but there was an upgrade kit for those sold soon afterward.) Again, cards from Apple and other companies permitted expanding the RAM, by as much as 3 MB if I remember right.

Note that the 6502 only has a 64KB (16-bit) address space. Any amount of memory over 64KB, including the ROM, had to be accessed through bank-switching. Generally speaking that was a big pain in the ass.

Maybe I misunderstood which sieve you were talking about. I was thinking of the sieve where you start with 2, and start adding 2, marking out numbers that are passed, then going to the next number, 3, and start adding 3, marking out the numbers passed.

I wrote an implementation back in college that saved space by doing a breadth-first search of the primes instead of a depth-first search, using a heap to hold the primes up to sqrt(N) and a counter for each prime, with the heap ordered by each node’s counter. For searching up to 10[sup]8[/sup], you only need a heap for primes up to 10[sup]4[/sup], so it would require less than 10[sup]4[/sup]/(ln(10[sup]4[/sup]) - 1.08366) (about 1231) entries. Each entry would need 2 bytes to hold the prime and 4 bytes to hold the counter, so it would need less than 8k to hold the heap.

You start off with the main counter set to 3 and one entry in the heap, for 3. Each time you create an entry in the heap for prime p, you initialize its counter as p[sup]2[/sup] and place it at the end of the heap. You increment the main counter by 2 until you hit or pass the counter for the head of the heap, inserting a new entry for those numbers you pass that are less than the head’s counter, as they are primes. Also, you increment the prime-counter by 1. Finally,you increment the counter of the head of the heap by twice its prime number, and reheapify, looping over this final step until the head element’s counter is greater than the main counter. Then you loop back to incrementing the main counter by 2 until you’ve inserted entries for all primes up to sqrt(N).

Once you have entries for all primes up to sqrt(N), you do pretty much the same thing but without inserting new entries into the heap. After you test N, you have a count of all the prime numbers up to N.

For those who don’t know what a heap is, it’s a balanced tree where every node is not greater than any of its descendants. Since it’s always perfectly balanced, it can be implemented as an array, and nodes do not need any pointers to children or parents, as those are already implied by a node’s index in the array. Calling the root of the heap index 1, the children of any node at index i are the nodes at index 2i and 2i + 1. So, for instance: [ 1 5 2 6 8 4 3 7 ] is a heap. 1 is less than 5 and 2, 5 is less than 6 and 8, 2 is less than 4 and 3, and 6 is less than 7.

I’ve never seen that implementation before, but it makes a lot of sense. Usually you see it as something like “Divide every number by the smallest number not crossed off and cross off the ones that leave no remainder and go back to the beginning of this sentence”.

Um,

ultrafilter,

You can implement the sieve with no divisions at all. It just takes a humongous blob of RAM.

Here’s the (super-crude) outline from long-ago memory:
Allocate 1 million words of RAM.
Initialize with integers 1, 2, 3, 4 etc to 1 million
Compute sqaure root of last array element value, save as StopPoint
Initialize variable PrimeIncrement to 2
OuterLoopTop:
InnerLoop: Starting with PrimeIncrement-th entry in array, set every PrimeIncrement-th value to zero (ie set element 2,4, 6, … to zero on first pass) top at end of array.
Starting with Array(PrimeIncrement), scan upwards in array for next non-zero entry. Set PrimeIncrement = found value.
If PrimeIncrement > StopPoint then exit, else go back to OuterLoopTop:

after exit: Scan through array, outputting non-zero entries as found primes.

That’s the crudest code outline I’ve done in years, but you can probably follow it. Dinner’s on & I gotta go.

It’ll work using nothing but integer addition & addressing math. You can even do the single square root calc by simply taking the number with 1/2 the bits of your max. i.e. if you allocate 2^24 array entries, then 2^12 is a good square root.

Obvious improvements are to simply initialize with 1’s rather than 1, 2, 3, 4 etc, to use bits rather than bytes or words for each entry to save 8x-32x on storage, and to be smarter about the special case of 2, so you can cut out the first pass and all the even values, which saves another 2x on storage.

Other than the table itself, this can be implemented with just a couple of scalars and only a few dozen machine instructions. I did it once on a 370 with all the speedups I mentioned and then some and all work vars and addressing calcs were in CPU registers. The only core (RAM hadn’t neen invented yet) reference was for the table itself. Even the code fit in the tiny CPU cache of that monster.

Sorry, should have previewed.

Well, there you go. The reference I have (an introductory discrete math book) doesn’t explicitly mention division, but it seems like that’s what the author had in mind. And you can implement it that way, although it’s probably better not to. I am corrected.

So how many simultaneously-executing Apple II emulators can a modern PC run before their average execution speed is about that of the original Apple II?

:smiley:

(I know, I know, the context switching will eat you alive, not to mention the RAM requirements)

sorry, I should have previewed too.

You don’t really need that much memory, because you don’t really need the array to hold the values. If a(i) = i for all i, then you don’t need to store the value - the subscript tells you what the value should be.

All you need is a logical array to tell you whether a given number has been crossed off or not. If the memory is byte-addressable, then the simplest way is with a byte array. If you want it even more compact, you’d use a packed bit array.

To find where bit n is stored, the 3 low order bits of n tell you the bit position, and n rightshifted 3 times tells you which byte it’s in.

That gets the storage requirement in your example down from 4 MB to 125 KB.

Agreed. As I said in the last paragraph. I was trying to make the description as simple as possible, not the execution.

If you look at the end of LSLguy’s post, you’ll see he mentioned just that. In fact, he does one better, noting that you only need to store the bits for odd numbers, reducing the memory needed to a bit more than 61k.