Importance of RAM vs. processor for data analysis

Hi.

I’m still dreaming of a new laptop, and I was happy the last time I got a used/refurbished one from a mainstream seller. I noticed that I can probably save a lot by getting a less powerful processor, and I’m wondering if that is smart for the work I do. I essentially use my laptop for analyzing large datasets, often with fairly complex models (I would do it on a desktop, which is harder to lose or get stolen, but I need to travel). It occurred to me that the computer relies mostly on the RAM to “turn over” the dataset in its head when doing the analysis. Is this right? Is there a trade-off between spending $ on RAM vs. on the processor? In other words, is 4-6 GB of RAM on an older, cheaper Athlon better than working with 3 GB of RAM on the latest Intel (assuming Windows Vista 64)?

I do data analysis too, not on laptops but on thin clients. I moved to a server pool with a faster processor, and it made a huge difference; I suspect the memory was about equal between the old server and the new one. I’d guess you are moving through your dataset in a fairly regular way, so the caching the processor does (bringing in not just the item you request but also the items near it) would work very well for you. Given this, and unless your datasets are truly huge, I’d go with the faster machine. BTW, all the 64-bit Vista machines I saw when I was looking had 4 GB RAM standard. That’s for Vista, not data.
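If you’re curious, here’s a quick way to see that caching effect for yourself. This is only a sketch in Python with NumPy (my choice for illustration, not anything the OP’s software uses), comparing the per-value cost of walking memory sequentially versus skipping around:

```python
# Sequential access lets the prefetcher pull in neighbouring values;
# strided access pays a cache miss for nearly every value touched.
import time
import numpy as np

x = np.random.rand(40_000_000)    # ~320 MB of doubles

start = time.perf_counter()
x.sum()                           # walks memory sequentially
seq = (time.perf_counter() - start) / x.size

stride = x[::16]                  # jumps 128 bytes at a time through memory
start = time.perf_counter()
stride.sum()                      # far fewer values, but each lands in a new cache line
scattered = (time.perf_counter() - start) / stride.size

print(f"sequential: {seq * 1e9:.2f} ns per value")
print(f"strided:    {scattered * 1e9:.2f} ns per value")
```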

The datasets tend to be about 100,000 to 200,000 records with about 200 variables each. I think it’s really the complex modelling that eats up the time.

Get enough RAM to fit your dataset, and then the fastest processor you can find. Getting extra RAM beyond what you need won’t improve performance, but getting too little RAM will really, really hurt.

Definitely get the fastest processor. Datasets that size are not really all that large - around 300 MB on the high end. 3 GB of RAM should be plenty, assuming you are not doing other memory-intensive stuff (like visualizing this data).

This is absolutely the correct advice.
There’s an outside chance that RAM speed and CPU cache size might matter here too.

Use Windows Task Manager … look at the “memory” and “virtual memory” columns to get a sense of the memory footprint your software has when you open up large datasets. That will give you some idea.
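If you’d rather script that check than eyeball Task Manager, something along these lines works too. This is just a sketch in Python using the third-party psutil package (my assumption, not anything you already have installed), and the “sas”/“stata” name matching is a guess - check what Task Manager actually calls your programs:

```python
import psutil  # third-party: pip install psutil

# Print the resident memory of any running SAS or Stata process.
for proc in psutil.process_iter(["name", "memory_info"]):
    name = (proc.info["name"] or "").lower()
    mem = proc.info["memory_info"]
    if mem and ("sas" in name or "stata" in name):
        print(f"{proc.info['name']}: {mem.rss / 1024**2:.0f} MB resident")
```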

Correct. RAM is vastly faster than a HDD, even if you use an SSD. It depends upon the application, but you should aim for 2x the memory footprint plus OS overhead, in case the application tries to copy the dataset.

If he’s doing complex modeling, the computation per record should be plenty to hide the impact of memory latency. Most data caches are big enough to handle something like that, unless he is walking through the data in really strange, worst-case ways.

Keep in mind that unless you get an OS that can actually address the memory, buying 4-6 GB of RAM is likely a waste of money. Windows XP can only address a little over 3 GB, and 32-bit versions of Vista are the same way. 64-bit Vista can handle, um, a lot of memory - probably more than you could afford.
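If you want to double-check what you actually have before paying for the extra RAM, here is a tiny Python sketch using only the standard library (treat it as a rough check; “AMD64” is simply what 64-bit Windows typically reports):

```python
import platform
import sys

# 64-bit Windows typically reports "AMD64" here; 32-bit Windows reports "x86".
print("Machine type:", platform.machine())
# A 64-bit interpreter has a pointer-sized maxsize well above 2**32.
print("64-bit interpreter:", sys.maxsize > 2**32)
```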

Looks to me like it is possible here for RAM speed to be a limiting factor. The processor will likely fly through its instructions and will be waiting on RAM to deliver more data.

In this case faster RAM (low latency) combined with a fast front side bus (FSB) speed would be useful. Not sure how much you can work with that in a laptop but for a desktop it is easy to finesse those numbers.

I like Ruminator’s suggestion of watching what the PC says it is using.

As far as what to expect, a 200 by 200,000 by 8-byte dataset (assuming your values are 64-bit doubles) is only about 0.3 GB if I did that right. You probably need multiple copies in memory, but I am not sure whether that would be 2 or 10 copies.
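For what it’s worth, here is the same back-of-envelope arithmetic written out (plain Python, no assumptions beyond 8-byte double-precision values):

```python
records = 200_000
variables = 200
bytes_per_value = 8                                       # one 64-bit double

one_copy_gb = records * variables * bytes_per_value / 1024**3
print(f"one in-memory copy: ~{one_copy_gb:.2f} GB")       # ~0.30 GB
print(f"ten copies:         ~{one_copy_gb * 10:.1f} GB")  # still only ~3 GB
```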

I would guess that today’s laptop choices probably influence your speed more through their processor speed than their RAM size, provided the machine doesn’t grind to a nearly complete halt from thrashing (swapping to disk because it ran out of RAM).

It’d be a good idea to ask experts in whatever software you are using. When I switched from PCs to SPARCstations about 15 years ago, I found that a statistical analysis package (SAS) ran several times faster (I think this was for iterative nonlinear model fitting), but AutoCAD ran at about the same speed (I think this was for 3D renderings). They clearly lean on different parts of the computer’s design.

Another thought on this - it may matter more what file access time is like, and how your software uses files.

In SAS, a dataset (a table of values, variables × observations, with a little header info) is a file, whether or not it is named with a path and intentionally saved permanently by the user. Some operations, like having one of the canned “procs” (procedures) operate on data, require datasets as input and output, so these things require saving and accessing a file, even if it’s only 1 kB.

Now, writing and then reading a file does not necessarily mean that the hard disk drive has to spin up and seek and all that. OSes nominally read things out of files and write them back by manipulating the permanent hard disk file, but actually they read files into buffers and modify what is in the buffer. This may be an internal process that is transparent to the user (which is the case for all the Windows programs I use, though when programming for Windows I get to decide when to “flush the buffers”). Or it may be externally accessible to the user in some way (in the last Unixes I used, SVR4 and BSD, there was a “sync” command that flushes them all).
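To make that concrete, here is a tiny generic sketch in Python (not how SAS manages its own files - just the general buffer-then-flush idea):

```python
import os

with open("scratch.dat", "w") as f:
    f.write("x" * 1_000_000)   # lands in an in-memory buffer, not straight on disk
    f.flush()                  # push Python's own buffer out to the OS
    os.fsync(f.fileno())       # ask the OS to actually commit it to the disk
# Without the flush/fsync, the data may sit in buffers until the file is
# closed or the OS decides to write it out on its own schedule.
```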
I think it is common programming practice to use modular components that may need to be loaded at run time, during processing. So (I am guessing at examples) if your data analysis includes a step where you do signal processing such as Fourier analysis, a step where you use common image transforms like blurring, and a step where you access various hardware and software ports through communications packages, each of these steps may load executable library functions when they are first called by an instance of your software.
This doesn’t take much time if you are using a single instance of your software. Where it gets expensive is when you are iteratively calling one application from another. I did one analysis where a statistics and general-purpose package iteratively opened sessions of a mesher and then a differential equation solver and closed them again, with all of them reading and writing text files to pass information back and forth. Each iteration needs a few file accesses.
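Roughly, the pattern looked like the sketch below (Python, with made-up tool and file names, just to show where the per-iteration process-startup and file I/O costs come in):

```python
import subprocess

for i in range(100):
    # hand the current state to the external tools via a text file
    with open("mesh_input.txt", "w") as f:
        f.write(f"iteration {i}\n")

    # each call pays for process startup plus reading/writing its files
    subprocess.run(["mesher.exe", "mesh_input.txt", "mesh_output.txt"], check=True)
    subprocess.run(["solver.exe", "mesh_output.txt", "solution.txt"], check=True)

    with open("solution.txt") as f:
        result = f.read()      # pull the answer back in for the next iteration
```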

Long story short, there are storage and interprocess communication elements that could be the big limits in what you are doing, too - so see if you can test that idea.

Thanks for the responses, although some of this went right over my head. I’m using SAS and Stata. I noticed that some complex regression models ran much quicker in SAS than in Stata, but I assume that’s because of the likelihood modelling going on in the background. I think calling or emailing them is a good idea.

OP, you could spend a lot of time looking at this statistic or that counter in Windows, but it really boils down to a few things.

  1. Memory is cheap
  2. Memory can be upgraded in a laptop
  3. The CPU can’t be upgraded in a laptop*

If it were me, I’d get the fastest/best processor I could afford. You’re stuck with whatever processor you decide on, and if it doesn’t have the kind of horsepower you need for your work, you’re out of luck - so getting the fastest one you can is a hedge against that. And if you later find out that your workload needs more memory, that’s inexpensive and easy to upgrade.

  * Not sure if this is universally true, but I can’t think of a laptop where the processor can be upgraded.