Basic long-division pseudocode (useful for division and divisibility testing):
# given n >= 0 and d > 0, compute q and r with n = dq+r, q >= 0, 0 <= r < d
q = 0
i = 1
r = n # or operate destructively on n
while d <= r # shift divisor left until it exceeds the dividend; i is the place value
    d <<= 1
    i <<= 1
end
i >>= 1
while i != 0 # shift divisor right and subtract it off, accumulating quotient as we go
    d >>= 1
    if d <= r
        r -= d
        q += i
    end
    i >>= 1
end
The operations used here are assignment, inequality testing, addition, subtraction, and bitshifting (all relatively close to 6502 primitives, for register-sized numbers). To represent numbers ~10[sup]6[/sup] you need at least three-byte variables, so I’m imagining each of these operations as acting on (probably) arrays stored in memory, and themselves implemented as subroutines (or inserted directly if you want to avoid the JSR costs). There are a number of places where loops and conditions could be unrolled a bit for improved speed, and you could also convert this into two different routines for integer DIV (q only) and MOD (r only).
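For concreteness, here is the pseudocode above transcribed into Python (a sketch only; on the 6502 each of these operations would be a multi-byte subroutine as described, and the function name is mine):

```python
def divmod_shift(n, d):
    """Shift-and-subtract long division: returns (q, r) with n = d*q + r.

    Uses only comparison, subtraction, addition, and bit shifts,
    mirroring the pseudocode above."""
    assert n >= 0 and d > 0
    q, r, i = 0, n, 1
    while d <= r:          # shift divisor left until it exceeds the dividend
        d <<= 1
        i <<= 1
    i >>= 1
    while i != 0:          # shift back down, subtracting off where possible
        d >>= 1
        if d <= r:
            r -= d
            q += i         # i is the place value of this quotient bit
        i >>= 1
    return q, r
```

For example, `divmod_shift(1000000, 7)` agrees with Python's built-in `divmod`.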
I’ve implemented similar long-division code in 8088/80386 assembler (and after seeing a little bit of 6502 assembly code, having SO MANY REGISTERS felt like cheating!).
I’ve extracted the text of the aforementioned BXY document below using the 30-day shareware program “CiderPress”. Because it was a minor hassle, and because the contents of the article were interesting, and most of all because the contents of that discussion group are not available in plaintext anywhere else, I’ve posted the entirety of the article of interest here. Of particular interest are the run-time benchmarks below, listed in seconds and achieved by geeks who had to work with only the equipment on hand. The article is from October of 1981, and represents a pretty good look at what our primitive ancestors did for fun besides hunting mammoths.
I propose that instead of trying to re-invent the abacus using hardware with which we’re not really that familiar, we accept .69 seconds as a good-faith benchmark of the abilities of the platform. From there, I propose that we try to rewrite their (optimized?) algorithm for a modern platform, and see what kind of run time we get, and compare (ha-ha) Apples to Apples.
Ah, and on post-view, the text of the program is in a different article. Here’s the .74-second version of the program.
1000 *---------------------------------
1010 * SIEVE PROGRAM:
1020 * CALCULATES FIRST 1899 PRIMES IN .74 SECONDS!
1030 *
1040 * INSPIRED BY JIM GILBREATH
1050 * (SEE BYTE MAGAZINE, 9/81, PAGES 180-198.)
1060 * AND BY WILLIAM ROBERT SAVOIE
1070 * 4405 DELASHMITT RD. APT 15
1080 * HIXSON, TENN 37343
1090 *---------------------------------
1100 ARRAY .EQ $3500 FLAG BYTE ARRAY
1110 SIZE .EQ 8192 SIZE OF FLAG ARRAY
1120 *---------------------------------
1130 * PAGE-ZERO VARIABLES
1140 *---------------------------------
1150 A.PNTR .EQ $06,07 POINTER TO FLAG ARRAY FOR OUTER LOOP
1160 B.PNTR .EQ $08,09 POINTER TO FLAG ARRAY FOR INNER LOOP
1170 PRIME .EQ $1B,1C LATEST PRIME NUMBER
1180 COUNT .EQ $1D,1E # OF PRIMES SO FAR
1190 TIMES .EQ $1F COUNT LOOP
1200 *---------------------------------
1210 * APPLE ROM ROUTINES USED
1220 *---------------------------------
1230 PRINTN .EQ $F940 PRINT 2 BYTE NUMBER FROM MONITOR
1240 HOME .EQ $FC58 CLEAR VIDEO
1250 CR .EQ $FD8E CARRIAGE RETURN
1260 LINE .EQ $FD9E PRINT "-"
1270 BELL .EQ $FBE2 SOUND BELL WHEN DONE
1280 *---------------------------------
1290 * RUN PROGRAM 100 TIMES FOR ACCURATE TIME MEASUREMENTS!
1300 *---------------------------------
1310 START JSR HOME CLEAR SCREEN
1320 LDA #100 LOOP 100 TIMES
1330 STA TIMES SET COUNTER
1340 .1 JSR GENERATE.PRIMES
1350 LDA $400 TOGGLE SCREEN FOR VISIBLE INDICATOR
1360 EOR #$80 OF ACTION
1370 STA $400
1380 DEC TIMES
1390 BNE .1 LOOP
1400 JSR BELL READ WATCH!
1410 LDY COUNT+1 GET HI BYTE OF COUNT
1420 LDX COUNT
1430 JSR PRINTN PRINT PRIMES FOUND
1440 RTS
1450 *---------------------------------
1460 * GENERATE THE PRIMES
1470 *---------------------------------
1480 GENERATE.PRIMES
1490 LDY #0 CLEAR INDEX
1500 STY COUNT CLEAR COUNT VARIABLE
1510 STY COUNT+1
1520 STY A.PNTR SET UP POINTER FOR OUTER LOOP
1530 LDA /ARRAY
1540 STA A.PNTR+1
1550 LDA #1 LOAD WITH ONE
1560 LDX /SIZE NUMBER OF PAGES TO STORE IN
1570 *---------------------------------
1580 * SET EACH ELEMENT IN ARRAY TO ONE
1590 *---------------------------------
1600 .1 STA (A.PNTR),Y SET FLAG TO 1
1610 INY NEXT LOCATION
1620 BNE .1 GO 256 TIMES
1630 INC A.PNTR+1 POINT AT NEXT PAGE
1640 DEX NEXT PAGE
1650 BNE .1 MORE PAGES
1660 *---------------------------------
1670 * SCAN ENTIRE ARRAY, LOOKING FOR A PRIME
1680 *---------------------------------
1690 LDA /ARRAY SET A.PNTR TO BEGINNING AGAIN
1700 STA A.PNTR+1
1710 .2 LDY #0 CLEAR INDEX
1720 LDA (A.PNTR),Y LOOK AT NEXT FLAG
1730 BEQ .6 NOT PRIME, ADVANCE POINTER
1740 *---------------------------------
1750 * CALCULATE CURRENT INDEX INTO FLAG ARRAY
1760 *---------------------------------
1770 SEC
1780 LDA A.PNTR+1
1790 SBC /ARRAY
1800 TAX SAVE HI-BYTE OF INDEX
1810 LDA A.PNTR LO-BYTE OF INDEX
1820 *---------------------------------
1830 * CALCULATE NEXT PRIME NUMBER WITH P=I+I+3
1840 *---------------------------------
1850 ASL DOUBLE THE INDEX
1860 TAY
1870 TXA HI-BYTE OF INDEX
1880 ROL
1890 TAX
1900 TYA NOW ADD 3
1910 ADC #3
1920 STA PRIME
1930 BCC .3
1940 INX
1950 .3 STX PRIME+1
1960 *---------------------------------
1970 * FOLLOWING 4 LINES CHANGE ALGORITHM SLIGHTLY
1980 * TO SPEED IT UP FROM .93 TO .74 SECONDS
1990 *---------------------------------
2000 TXA TEST HIGH BYTE
2010 BNE .5 PRIME > SQRT(16384)
2020 CPY #127
2030 BCS .5 PRIME > SQRT(16384)
2040 *---------------------------------
2050 * NOW CLEAR EVERY P-TH ENTRY AFTER P
2060 *---------------------------------
2070 LDY #0
2080 LDA A.PNTR USE CURRENT OUTER POINTER FOR INNER POINTER
2090 STA B.PNTR
2100 LDA A.PNTR+1
2110 STA B.PNTR+1
2120 CLC BUMP ARRAY POINTER BY P
2130 .4 LDA B.PNTR BUMP TO NEXT SLOT
2140 ADC PRIME
2150 STA B.PNTR
2160 LDA B.PNTR+1
2170 ADC PRIME+1
2180 STA B.PNTR+1
2190 CMP /ARRAY+SIZE SEE IF BEYOND END OF ARRAY
2200 BCS .5 YES, FINISHED CLEARING
2210 TYA NO, CLEAR ENTRY IN ARRAY
2220 STA (B.PNTR),Y
2230 BEQ .4 ...ALWAYS
2240 *---------------------------------
2250 * NOW COUNT PRIMES FOUND (C=C+1)
2260 *---------------------------------
2270 .5
2280 * JSR PRINTP PRINT PRIME
2290 INC COUNT
2300 BNE .6
2310 INC COUNT+1
2320 *---------------------------------
2330 * ADVANCE OUTER POINTER AND TEST IF FINISHED
2340 *---------------------------------
2350 .6 INC A.PNTR
2360 BNE .7
2370 INC A.PNTR+1
2380 .7 LDA A.PNTR+1
2390 CMP /ARRAY+SIZE
2400 BCC .2
2410 RTS
2420 *---------------------------------
2430 * OPTIONAL PRINT PRIME SUBROUTINE
2440 *---------------------------------
2450 PRINTP LDY PRIME+1 HI BYTE
2460 LDX PRIME
2470 JSR PRINTN PRINT DECIMAL VAL
2480 JSR LINE VIDEO "-" OUT
2490 RTS
It’s interesting to think that they could run it once and have meaningful timing data…
If you scan the program, he actually ran it 100 times in a loop and stood there with a stopwatch. I presume that he then did several ~70 second trials, and divided the mean by 100, but he doesn’t come out and say so.
I think that ultrafilter has the best line on a fast implementation. Eratosthenes’s method removes all divisions, which should speed things up immensely.
How much memory did an Apple ][ have, anyways? 16k? 64k?
The sieve of Eratosthenes makes heavy use of division. Do you have something else in mind?
The very first Apple II model came with 4KB, and could be expanded in 4K or 16K increments up to 12 or 48K, respectively. I don’t know how many people really settled for a mere 4K. Quite limiting, even by 1977 expectations.
The next model, the Apple II+, came with 48K, and could be expanded to 64K with an add-on card. Later third-party cards allowed adding much more.
The enhanced IIe and the IIc generally had 128K standard. (The first, unenhanced IIe had only 64K, but there was an upgrade kit for those sold soon afterward.) Again, cards from Apple and other companies permitted expanding the RAM, by as much as 3 MB if I remember right.
Note that the 6502 only has a 64KB (16-bit) address space. Any amount of memory over 64KB, including the ROM, had to be accessed through bank-switching. Generally speaking that was a big pain in the ass.
Maybe I misunderstood which sieve you were talking about. I was thinking of the sieve where you start at 2 and count up by 2, marking off every number you land on, then move to the next unmarked number, 3, and count up by 3, marking off those, and so on.
I wrote an implementation back in college that saved space by doing a breadth-first search of the primes instead of a depth-first search, using a heap to hold the primes up to sqrt(N) and a counter for each prime, with the heap ordered by each node’s counter. For searching up to 10[sup]8[/sup], you only need a heap for primes up to 10[sup]4[/sup], so it would require less than 10[sup]4[/sup]/(ln(10[sup]4[/sup]) - 1.08366) (about 1231) entries. Each entry would need 2 bytes to hold the prime and 4 bytes to hold the counter, so it would need less than 8k to hold the heap.
You start off with the main counter set to 3 and one entry in the heap, for 3. Each time you create an entry in the heap for prime p, you initialize its counter to p[sup]2[/sup] and place it at the end of the heap. You increment the main counter by 2 until you hit or pass the counter for the head of the heap, inserting a new entry for each number you pass that is less than the head’s counter, as those are primes. Also, you increment the prime-counter by 1. Finally, you increment the counter of the head of the heap by twice its prime number and reheapify, looping over this final step until the head element’s counter is greater than the main counter. Then you loop back to incrementing the main counter by 2, until you’ve inserted entries for all primes up to sqrt(N).
Once you have entries for all primes up to sqrt(N), you do pretty much the same thing but without inserting new entries into the heap. After you test N, you have a count of all the prime numbers up to N.
For those who don’t know what a heap is, it’s a balanced tree where every node is not greater than any of its descendants. Since it’s always perfectly balanced, it can be implemented as an array, and nodes do not need any pointers to children or parents, as those are already implied by a node’s index in the array. Calling the root of the heap index 1, the children of any node at index i are the nodes at index 2i and 2i + 1. So, for instance: [ 1 5 2 6 8 4 3 7 ] is a heap. 1 is less than 5 and 2, 5 is less than 6 and 8, 2 is less than 4 and 3, and 6 is less than 7.
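A sketch of the heap-based sieve described above, using Python's heapq module. This folds the prime and its counter into one tuple of (next odd multiple, step of 2p) rather than the separate fields described, but the idea is the same; the function name is mine:

```python
import heapq

def count_primes(n):
    """Count primes <= n using a min-heap of (next odd composite, step)
    entries, in the spirit of the breadth-first sieve described above."""
    if n < 2:
        return 0
    count = 1          # account for 2; from here on consider only odd numbers
    heap = []          # entries: (next odd composite for a prime p, step 2p)
    x = 3              # the "main counter"
    while x <= n:
        if heap and heap[0][0] == x:
            # x is composite: advance every heap counter that sits at x
            while heap[0][0] == x:
                comp, step = heap[0]
                heapq.heapreplace(heap, (comp + step, step))
        else:
            count += 1                  # x is a new prime
            if x * x <= n:              # only primes up to sqrt(n) need entries
                heapq.heappush(heap, (x * x, 2 * x))
        x += 2
    return count
```

For a search up to 10[sup]8[/sup] the heap only ever holds primes up to 10[sup]4[/sup], matching the space estimate in the post above.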
I’ve never seen that implementation before, but it makes a lot of sense. Usually you see it as something like “Divide every number by the smallest number not crossed off and cross off the ones that leave no remainder and go back to the beginning of this sentence”.
Um,
ultrafilter,
You can implement the sieve with no divisions at all. It just takes a humongous blob of RAM.
Here’s the (super-crude) outline from long-ago memory:
Allocate 1 million words of RAM.
Initialize with integers 1, 2, 3, 4 etc to 1 million
Compute square root of last array element value, save as StopPoint
Initialize variable PrimeIncrement to 2
OuterLoopTop:
InnerLoop: Starting with the PrimeIncrement-th entry in the array, set every PrimeIncrement-th value to zero (i.e. set elements 2, 4, 6, … to zero on the first pass). Stop at end of array.
Starting with Array(PrimeIncrement), scan upwards in array for next non-zero entry. Set PrimeIncrement = found value.
If PrimeIncrement > StopPoint then exit, else go back to OuterLoopTop:
after exit: Scan through array, outputting non-zero entries as found primes.
That’s the crudest code outline I’ve done in years, but you can probably follow it. Dinner’s on & I gotta go.
It’ll work using nothing but integer addition & addressing math. You can even do the single square root calc by simply taking the number with 1/2 the bits of your max. i.e. if you allocate 2^24 array entries, then 2^12 is a good square root.
Obvious improvements are to simply initialize with 1’s rather than 1, 2, 3, 4 etc, to use bits rather than bytes or words for each entry to save 8x-32x on storage, and to be smarter about the special case of 2, so you can cut out the first pass and all the even values, which saves another 2x on storage.
Other than the table itself, this can be implemented with just a couple of scalars and only a few dozen machine instructions. I did it once on a 370 with all the speedups I mentioned and then some, and all work vars and addressing calcs were in CPU registers. The only core (RAM hadn’t been invented yet) reference was for the table itself. Even the code fit in the tiny CPU cache of that monster.
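The outline above, transcribed into Python as a sketch (the function and variable names are mine; it keeps the outline's division-free structure, where the array holds the integers themselves and composites are zeroed by repeated addition):

```python
import math

def sieve_outline(limit):
    """Division-free sieve following the outline above."""
    a = list(range(limit + 1))   # initialize with integers 0, 1, 2, ... limit
    a[0] = a[1] = 0              # 0 and 1 are not primes
    stop = math.isqrt(limit)     # the StopPoint of the outline
    p = 2                        # PrimeIncrement
    while p <= stop:
        # inner loop: zero every p-th entry after p itself, stopping at the end
        for j in range(2 * p, limit + 1, p):
            a[j] = 0
        # scan upward for the next non-zero entry: that's the next prime
        p += 1
        while a[p] == 0:
            p += 1
    # after exit: the non-zero entries are the primes
    return [v for v in a if v != 0]
```

The only arithmetic is addition and indexing, as claimed in the post.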
Sorry, should have previewed.
Well, there you go. The reference I have (an introductory discrete math book) doesn’t explicitly mention division, but it seems like that’s what the author had in mind. And you can implement it that way, although it’s probably better not to. I am corrected.
So how many simultaneously-executing Apple II emulators can a modern PC run before their average execution speed is about that of the original Apple II?
(I know, I know, the context switching will eat you alive, not to mention the RAM requirements)
sorry, I should have previewed too.
You don’t really need that much memory, because you don’t really need the array to hold the values. If a(i) = i for all i, then you don’t need to store the value - the subscript tells you what the value should be.
All you need is a logical array to tell you whether a given number has been crossed off or not. If the memory is byte-addressable, then the simplest way is with a byte array. If you want it even more compact, you’d use a packed bit array.
To find where bit n is stored, the 3 low order bits of n tell you the bit position, and n rightshifted 3 times tells you which byte it’s in.
That gets the storage requirement in your example down from 4 MB to 125 KB.
Agreed. As I said in the last paragraph. I was trying to make the description as simple as possible, not the execution.
If you look at the end of LSLguy’s post, you’ll see he mentioned just that. In fact, he does one better, noting that you only need to store the bits for odd numbers, reducing the memory needed to a bit more than 61k.