So, what exactly is the problem with MS Excel for complex mathematics and statistics?

This was a sub-topic in this thread, but I would like to see it discussed more specifically, as apparently MS Excel, which many professionals rely on every day as an accurate calculation application, is not considered an adequate tool for complex mathematics.

What is the issue with Excel when used for complex calculations? Is it coded wrong or what?

Complex calculations often require what is called symbolic computation; that is, working with variables, equations, functions, and so on as you would on pencil and paper, by solving equations, doing derivatives and integrals, etc. You can make computers do these kinds of things much faster than your caffeinated research assistant can. Symbolic computation packages include stuff like Mathematica, Matlab (via its Symbolic Math Toolbox), and various code libraries for programming languages.
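To make that concrete, here is a small sketch of symbolic computation using Python’s sympy library (just one illustrative choice; a Mathematica or Maple session would look similar, and the expression here is made up):

import sympy as sp

x = sp.symbols("x")
expr = sp.sin(x) * sp.exp(x)

print(sp.diff(expr, x))       # exp(x)*sin(x) + exp(x)*cos(x)
print(sp.integrate(expr, x))  # exp(x)*sin(x)/2 - exp(x)*cos(x)/2
print(sp.solve(x**2 - 2, x))  # [-sqrt(2), sqrt(2)]

The answers come back as formulas, not numbers. There is no Excel cell you could type that into.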

Excel doesn’t have any symbolic computation capabilities, though. (And it shouldn’t; that’s not what it’s for.) It is set up only for doing numeric computation (and some other stuff). So it is a wholly inappropriate tool for whole classes of mathematical work.

Of course, there is plenty of fancy mathematics to be done in strictly numeric areas, as well. But that type of stuff tends to be done by writing your own program rather than relying on a spreadsheet model. You use it when it’s the right tool for the job, and use something else when it’s not.

I’d say that the biggest problem is that it’s proprietary and closed. Maybe it really does work correctly, but you have to take Microsoft’s word for that. Take the example of the 77.1*850 bug from the other thread: OK, so you’re not going to actually multiply those two particular numbers together by accident very often. But there was something funny in the code that made it do that, and you have no way of knowing what that something funny was, or whether it might manifest in some other, completely different way. With most tools used by real researchers, though, if something like that cropped up, you (or anyone) could just take a look at the source code and see what was happening.

Another problem is that Excel, like most modern software, is a resource hog. That’s fine if you’re using something that used to be called a supercomputer, just to balance your checkbook. But when you’ve got a project that will take weeks to run, you want it to run as leanly and cleanly as possible.

To expand upon what friedo stated, the use of Excel (or any other spreadsheet-type application) as an environment for creating and operating algorithms such as complex transforms, iterative solvers, and other complicated operations is tasking it to do something that it does very poorly. It is analogous to using Paintbrush to touch up photographs or AutoCAD to perform cartography functions. It has neither the documentation features nor the debugging capability to be used as a programming tool; all of this is aside from some of its most notable bugs and memory limitations. Personally, the data I deal with is often millions of samples on multiple channels, which would completely overwhelm Excel if I tried to import an entire unfiltered dataset.

Excel, like other spreadsheet applications, is intended for bookkeeping-type tasks and a limited set of visualization functions (of which most of the graph types are unnecessary, unclear, and provide poor formatting in their default configuration). Although the learning curve to start using it is shallower than, say, Matlab or R, and it is therefore preferred by professors and lab instructors for getting students to process small sets of data, its functions are far more limited.

And yes, I’ve had many occasions in which statistical and DSP calculations performed in Excel did not match the (correct) results from R, Matlab, or Mathematica. Whether this was due to some flaw in how the data was entered and formatted in Excel, or to some inherent bug in a standard function, I do not know, though in one particular case I discovered that the Excel CHIINV function does not always provide correct or even consistent results.

The bottom line is if you are going to be performing complex numerical or statistical analysis, you should be using a dedicated tool or programming system that is both verified to function correctly in its internal operations and allows for documentation and debugging of your code. Using Excel as a jacked up algorithm solver is like trying to change your tire with a pastry spreader and a bendy straw.

Stranger

From my perspective, as a biologist, Excel’s built-in stats functions are very limited, and the graphing is plain lousy. Sure, you can do a T-test, or an ANOVA if you install some of the extra modules. And you can throw together a vaguely acceptable bar graph or scatter plot. Sometimes that’s enough, if you have simple experiments and small data sets. And I’ll do that if I just want to throw some preliminary numbers into a graph.

But it didn’t take me very long to outgrow Excel’s built-in functions. Computationally speaking, my needs are trivial. But I do have to use some specialized (though not particularly complicated) statistical methods. If you understand the math, you can kludge your way through, but Excel makes it ugly and difficult.

Say you wanted to dice an onion, and you only had a hacksaw. Sure, you could do it, but you’ll make a huge mess, and everything will end in tears. It’s the wrong tool for the job.

According to the linked blog post in that other thread, Excel converts numbers into binary, and the binary representation of the number 0.1 would need to recur infinitely to be exact. The consequences are supposedly infinitesimal except with one particular type of number, which you are very unlikely to come across.

I do find it ironic that binary can’t do the number 0.1 - I’d like to have seen Alanis Morissette fit that one into her song.

That’s not Excel, though. That’s just how arithmetic has to be done with the hardware we have.

In base b, the only fractions that have a terminating representation are those whose denominator is not divisible by any prime which doesn’t divide b. 0.1 = 1/(2 * 5), and 5 does not divide 2, so 0.1 doesn’t have a terminating representation. It’s exactly the same as representing 1/3 in decimal.
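You can see the chopped-off repeating representation directly. A quick Python illustration (any language with IEEE-754 doubles shows the same thing):

from decimal import Decimal

# The double nearest to 0.1, written out exactly:
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625

# The same value in hex floating point: the repeating pattern
# (binary 1001, hex 9) is truncated after 52 bits of fraction.
print((0.1).hex())  # 0x1.999999999999ap-4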

The visualization features are just shitty. There is no way to have two overlaid X-Y line plots, for instance.

Also–and this is a problem with any program that saves files as binary blobs–you cannot have proper source control. Any moderately complex project must save as human-readable text, because that is the only way for diff and other tools to work.

Here’s an essay on issues with spreadsheets in general and Excel in particular for statistics. Some of the info is a little old, but there’s more recent stuff as well.

If you want a more general discussion of this (like, if you actually want to learn what you need to know for doing extensive numerical computations), then this needs to be discussed on several levels.

The fundamental problem, discussed in this thread and several previous threads, and in that Joel Spolsky essay, always takes novice programmers by surprise when they first learn it (myself included – I was mortified). That is, the bedrock fractional units of our decimal number system, to-wit: 0.1, 0.01, 0.001, 0.0001, etc., CANNOT be represented precisely as binary numbers, which computers use. These, and nearly all multiples and combinations thereof, turn out to be infinitely repeating numbers in binary, just as 1/3 or 3/7 are when written as decimal fractions.

So it doesn’t matter if your computer stores such numbers with 16 bits or 32 bits or 64 or 128 bits or 65536 bits or more. The number is always a finite number of bits long, and infinitely repeating numbers have to be chopped off there, and thus are always approximations.

There are just too many programmers who don’t know or understand anything about this.

There are several implications:

  1. Your calculations will almost always be just approximations. Even if your starting data looks very precise (written in decimal), like 25.36743, its representation in the computer is an approximation.

  2. Now, if you’re trying to compute how many gallons of paint you need to cover a tank with a circular floor, it might suffice to give all your data (including pi) with just 2 or 3 decimal places. If you’re trying to compute a rocket trajectory to the moon, it might take a few more places. If you’re trying to compute a rocket trajectory to Neptune, maybe even more places to get enough accuracy to actually hit the planet.

  3. SO! You can always deal with this problem by just doing your calculations to more precision, as needed! (Not necessarily an option in a lot of languages or applications, like Excel. But possible in theory.) Right?

  4. NO! Not necessarily. There are certain kinds of problems where any amount of additional precision is impractical, or sometimes simply doesn’t solve the problem.

  5. First, there is the case – common in heavy-duty scientific, engineering, and mathematical work – where you have really intensive, massive computations to be done. Things like computing functions and integrals using infinite series, or simulations of massive events (like nuclear explosions) at the atomic level where you are trying to track every atom. Stuff like that. The problem is (to over-simplify [del]somewhat[/del] a whole lot) that, upon doing a WHOLE LOT of arithmetic, the round-off errors accumulate. Possibly to the point that your results have very little accuracy, or even none at all.

This can be improved somewhat, in some cases, by using Double Precision instead of Single Precision, or even more precision – assuming your language or application offers the feature. Or you can write your own modules to do it, or use existing program libraries with such features.
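For illustration, here is roughly what that accumulation looks like in Python on a typical IEEE-754 machine (the exact digits may vary by platform):

import math

# Add 0.1 ten million times; every addition rounds, and the
# rounding errors drift steadily in one direction.
total = 0.0
for _ in range(10_000_000):
    total += 0.1

print(total)  # something like 999999.9998389754: off by roughly 1.6e-4

# math.fsum tracks the lost low-order bits and returns the correctly
# rounded sum; only the error of representing 0.1 itself remains.
print(math.fsum([0.1] * 10_000_000))  # 1000000.0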

But the MAJOR solution is to study and learn Numerical Analysis – typically a FULL SEMESTER college class, offered in either the Math or Computer Science department, in which you study mathematical techniques for doing extensive computations accurately. This is not a class for mathematical n00bs. It gets fairly heavy. You have to learn how to design computational algorithms for stuff like this.

  6. Second, there is the case, which I have seen often, of trying to compare your computed values with some desired target. For example, if 0.1 is not really exactly 0.1, then what is:
    0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1
    If you compare that with 1.0, as in:

if ( 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 == 1.0 ) {
    print ("There is sanity in the world\n") ;
}
else {
    print ("The gods must be crazy!\n") ;
}

what outcome do you expect? Lots of programmers are surprised.
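For the record, here is the same experiment sketched in Python; C, Java, and Excel, all using IEEE-754 doubles, behave the same way:

import math

total = 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1

print(total)         # 0.9999999999999999
print(total == 1.0)  # False: the gods must be crazy

# The standard workaround: never test floats for exact equality;
# compare within a tolerance instead.
print(math.isclose(total, 1.0))  # True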

Here are some earlier threads where these problems were discussed:
Date puzzle - Factual Questions - Straight Dope Message Board (Starting at post #4, then #6, 13, 14, 16)
and intermittently scattered through this thread: Extremeely basic programming question. - Factual Questions - Straight Dope Message Board

Especially in software that deals with money, where you expect dollars-and-cents computations to give you exact dollars-and-cents results, this bites often, and causes wrong results. You need to know techniques to do this “right” – First, you need to think about it and define exactly what the “right” results ought to be.
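One standard technique is to do money arithmetic in exact decimal (or in whole cents) and round explicitly wherever the business rule says to round. A sketch using Python’s decimal module; the price and tax rate here are made up:

from decimal import Decimal, ROUND_HALF_UP

price = Decimal("77.10")  # construct from strings, never from floats
qty = 850
subtotal = price * qty    # exactly 65535.00, no binary fuzz

# Round to the cent, halves up, because we decided that is "right":
tax = (subtotal * Decimal("0.0825")).quantize(
    Decimal("0.01"), rounding=ROUND_HALF_UP)

print(subtotal, tax)  # 65535.00 5406.64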

Note that no amount of precision solves this problem. You can’t fix it just by using Double Precision or 100-Fold Precision. Your floating point numbers will always have some small inaccuracy, and tests like the above will always be troublesome. It is possible that an application like Excel will attempt to handle this by applying some sort of “fudge factor” in doing comparisons like that. Maybe. But somebody somewhere will argue that THAT isn’t even the right thing to do!

  7. Above and beyond all the above, Excel apparently had a plain outright bug. A certain few numbers (which could result from inaccuracies like the above), when converted to a human-readable character string and displayed or printed, simply got converted wildly inaccurately. Thus a number that should have computed as (approximately) 65535 (correct to at least a dozen decimal fraction places), and displayed as 65535.0000 (or similar), got displayed as 100000 or something like that – just plain outright WAY wrong.

Senegoid gives a good overview of the issues. To expand upon a couple of the basic numerical analysis issues that we used to teach (typically in the first week of a one-semester second-year course):

Machine epsilon - the number which estimates the rounding error in floating point arithmetic on a particular machine. It can vary depending upon your processor. All algorithm design for computational mathematics needs to understand this.
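You can measure it yourself. A quick Python probe for the double-precision epsilon (most languages expose the same constant somewhere):

import sys

# Halve eps until adding half of it to 1.0 no longer changes 1.0.
eps = 1.0
while 1.0 + eps / 2.0 > 1.0:
    eps /= 2.0

print(eps)                     # 2.220446049250313e-16 for IEEE-754 doubles
print(sys.float_info.epsilon)  # the platform's own report: the same value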

Ill-conditioned systems. Some systems are intrinsically very, very sensitive to even the smallest error in calculation. The favourite example is Wilkinson’s polynomial, a degree-20 polynomial whose roots are the integers 1 to 20. An error in the least significant bit of one of the coefficients results in a polynomial where half the roots become complex.
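This is easy to reproduce. A rough numpy sketch of Wilkinson’s experiment; note that merely storing the coefficients as doubles already perturbs them, so treat the exact count as illustrative:

import numpy as np

# Wilkinson's polynomial: roots at the integers 1..20.
coeffs = np.poly(np.arange(1, 21))  # expand into coefficient form

# Nudge the x^19 coefficient (-210) by Wilkinson's classic 2**-23.
perturbed = coeffs.copy()
perturbed[1] -= 2.0 ** -23

roots = np.roots(perturbed)
print(np.sum(np.abs(roots.imag) > 1e-3))  # around 10 of the 20 roots now
                                          # have sizeable imaginary parts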

Expressions where there is a term that is the difference of two components that are often very, very close to the same value. This leads to serious loss of precision, and often to simply wildly wrong results. Spreadsheets, as a paradigm, make these sorts of errors easy to commit, because they isolate sub-components of the calculation in ways that are difficult to track down.
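The textbook example is computing 1 - cos(x) for small x. A Python sketch, assuming ordinary doubles:

import math

x = 1e-8
naive = 1.0 - math.cos(x)  # cos(1e-8) rounds to exactly 1.0, so this is 0.0

# The half-angle identity avoids the subtraction entirely:
stable = 2.0 * math.sin(x / 2.0) ** 2  # about 5e-17, essentially exact

print(naive, stable)  # 0.0 5.0000000000000005e-17 (or similar)

In a spreadsheet, the naive version and the stable version sit in cells that look equally plausible, which is exactly the tracking-down problem.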

Managing these issues is critical.

If you are a professional engineer or scientist doing computational mathematics, I would say that nearly half of the work is still numerical solving of PDEs. (Signal processing and statistics probably make up a large slab of the remainder, although numerical sims are gaining ground.) This is real mathematics. Excel doesn’t do any of it. Well, not for real. It is a tool for doing arithmetic and quick and dirty back-of-the-envelope calculations (I will use it to calculate things that are a pain to do on a calculator, but no more complex.)

As has been pointed out above, one of the critical issues with Excel is that there is no way of expressing your spreadsheet in a source code form. So it is almost impossible to use Excel in any sort of real world software engineering process. No source code control, no easy way of reviewing code, and worst of all, essentially no way of testing the system in a systematic manner that would allow the system to be relied upon. Indeed Excel encourages some of the worst elements of poor engineering and science practice.

You are correct about floating point numbers, but in principle, you could store all numbers as algebraic expressions. This could require an arbitrarily high amount of memory, but it could be done.

I suspect that comparisons are in the general case a Very Hard Problem since they could require sophisticated transformations, though stuff like adding rationals together would be trivial.
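Exact rational arithmetic of that sort is directly available in some languages. A Python illustration using the fractions module:

from fractions import Fraction

tenth = Fraction(1, 10)   # exactly one tenth, stored as a ratio of integers
total = sum([tenth] * 10)
print(total, total == 1)  # 1 True: the addition is exact

# Fraction also reveals what the float 0.1 really is:
print(Fraction(0.1))      # 3602879701896397/36028797018963968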

Question – I know there was some experimental work in doing base-4 computing for a while (aside from DNA computing, I mean), due to greater data density (a base-4 digit stores twice as much as a base-2 digit). Though it didn’t really pan out (I recall the words “too much work for too little practical benefit”). Has any supercomputing company ever tried building a hardware subcomponent for base 10 specifically for the finite-representation problem with very common numbers like .1, or is it just not worth it compared to simpler workarounds even in incredibly mathematically sensitive scenarios?

This is a tangential point, but still worth some nitpicks. :smiley:

(1) Once upon a time, most IBM business software used decimal arithmetic with a moveable decimal point. Decimal computations were exact.

(2) One has the option – though it is directly supported in few languages – of representing rational numbers exactly and doing rational number arithmetic. (This is what I do when I want to dazzle my users with exact answers. :smiley: )

(3) Unless one is foolish enough to expect code like
  x = 0.1 * 10;
  if (x != 1) abort("1 isn't 1");
to work as “expected”, the inability to represent exact fractions is only a minor part of more general imprecision problems. More significant are numeric issues like the summing problem addressed by Kahan’s summation algorithm.
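For anyone curious, Kahan’s algorithm is short enough to show in full. A Python sketch:

def kahan_sum(values):
    # Compensated summation: carry along the low-order bits that a
    # plain running total discards at each addition.
    total = 0.0
    compensation = 0.0
    for x in values:
        y = x - compensation            # re-inject the bits lost last time
        t = total + y                   # big + small: low bits of y fall off
        compensation = (t - total) - y  # recover exactly what fell off
        total = t
    return total

data = [0.1] * 10_000_000

plain = 0.0
for x in data:
    plain += x

print(plain)            # drifts noticeably away from 1,000,000
print(kahan_sum(data))  # 1000000.0: the correctly rounded sum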

Numerical computation is its own specialized sub-field of algorithms. Floating point can’t represent all numbers exactly. Normally this is fine, since the errors are small, but certain arithmetic techniques can cause errors to compound in unpleasant ways.

For example, (1+1/x)^x tends toward e as x gets larger. The problem is, as x gets larger, 1/x gets tiny, making it hard to represent using floating point, and tiny errors in the representation get compounded through the exponentiation.

Using Wolfram Alpha, which is a tool specially designed for this kind of computation, I get:

1.001^1000 = 2.71692
1.000001^1000000 = 2.71828
1.000000001^1000000000 = 2.71828
1.000000000001^1000000000000 = 2.71828
1.000000000000001^1000000000000000 = 2.71828
1.000000000000000001^1000000000000000000 = 2.71828

Using Excel, I get:

1.001^1000 = 2.71692
1.000001^1000000 = 2.71828
1.000000001^1000000000 = 2.71828
1.000000000001^1000000000000 = 2.71852
1.000000000000001^1000000000000000 = 3.035
1.000000000000000001^1000000000000000000 = 1

Properly designed numeric computation tools like Wolfram Alpha carefully perform calculations in a way such that errors don’t compound exponentially like the example above. Certain common operations, like those involving matrix inversion, are especially prone to such errors, so great care needs to be taken in how they are coded. Usually this is packaged into a library, such as numpy, so you don’t have to worry about the details.
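To make the mechanism visible: the Excel column above is essentially what naive double-precision arithmetic produces, and a better-conditioned formulation fixes it. A Python sketch (exact digits may vary a little by platform):

import math

def e_naive(x):
    # Forming 1 + 1/x discards most of 1/x's digits, and the
    # exponentiation then magnifies that tiny loss enormously.
    return (1.0 + 1.0 / x) ** x

def e_stable(x):
    # log1p computes log(1 + t) without ever forming 1 + t.
    return math.exp(x * math.log1p(1.0 / x))

for x in (1e3, 1e6, 1e9, 1e12, 1e15, 1e18):
    print(f"{x:.0e}  naive={e_naive(x):.5f}  stable={e_stable(x):.5f}")

# naive reproduces the Excel failures (about 2.71852 at 1e12, 3.035 at
# 1e15, and exactly 1 at 1e18); stable stays at 2.71828 throughout.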

This is why Excel is unsuited to this sort of calculation.

Decimal arithmetic is alive and still with us. The obvious language is COBOL. But there is also support in the instruction sets of the machines themselves: in addition to the IBM business series, which converged into the single general-purpose architecture of the 360 and 370, the DEC VAX had binary-coded decimal instructions, and guess what? The x86 instruction set still has support for BCD. It can pack 18 decimal digits into an 80-bit FPU word. (You can use either four or eight bits per decimal digit; there are instructions for each.) These are integers only - no fractional part or fixed-point arithmetic, although there is no reason it could not be trivially augmented to do this. One does however note that the majority of these instructions are not valid in the x64 instruction set. So they are finally dying. They are probably really slow.
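For the curious, packed BCD is simple enough to sketch. A toy Python version of the idea behind the x87 FBLD/FBSTP format (one decimal digit per 4-bit nibble; non-negative integers only here, while the real instruction format also carries a sign byte):

def bcd_pack(n: int) -> bytes:
    digits = [int(d) for d in str(n)]
    if len(digits) % 2:
        digits.insert(0, 0)  # pad to a whole number of bytes
    return bytes((digits[i] << 4) | digits[i + 1]
                 for i in range(0, len(digits), 2))

def bcd_unpack(b: bytes) -> int:
    return int("".join(f"{byte >> 4}{byte & 0x0F}" for byte in b))

assert bcd_unpack(bcd_pack(65535)) == 65535
print(bcd_pack(65535).hex())  # 065535: each decimal digit is one nibble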

The advent of RISC machines killed a lot of this nonsense off in new architectures. Even though x86 has won on the desktop.

I must be missing something here - but what exactly is this 77.1*850 bug supposed to be? I typed =850*77.1 into one cell, and =77.1*850 into another, and got 65535.0000000000 in each.

I then used my trusty old HP 32S RPN calculator to do the same thing, and got the same answer.

Then, just to be sure, I took a pencil to paper and did it by hand! Same answer.

What’s the bug?

Searching for “850*77.1” gets me a link to a radio station. Searching on “77.1*850” finds nothing.

I’m surprised you call it “nonsense”.

There are obviously a massive number of financial/business calcs being performed constantly - why wouldn’t the architecture be aligned with the types of calcs being performed.

IBM added one of these units to Power6 and moved this processing from firmware in the Z9 to hardware in the Z10.

When you google, include “excel”

When that calc was performed before the bug was fixed, Excel would display “100,000”

And the most entertaining thing about this is from a blog by some guy about this problem, here is his Q+A:
"Q: Isn’t this really, really bad?

A: IMHO, no, the chance that you would see this in real life calculations is microscopic."
I couldn’t believe that answer.

Any calculation with the same characteristics as that one provided (850*77.1) that ended up with a result near 65535 is a problem - and he says the chance of seeing it is “microscopic” so you shouldn’t worry about it?

I would say the chances aren’t “microscopic” because that number is well within the range of numbers you could see in business every day.

Salaries, inventory levels, units ordered, extended amount on a purchase, etc. etc. etc.

Thanks for the info, RaftPeople.

The bug must have been fixed for quite a while - I was using Excel 2003 for the calculation.

I’ll Google using ‘Excel’.