Affymetrix Calculations, 64 Bit Processing, and R Linux Compiling

Any Affymetrix genechip biomedical experts on the SDMBs?

The company I work for is embarking on a new project in the exciting field of oligonucleotide array/genechip analysis. In particular, we are looking at gene expression during specific disease states in an attempt to develop diagnostic tests*.

Already, before we’ve even started, we are running into problems. One of the 1st steps in the process is to turn the chip’s raw probe intensities into gene expression measures. With the chip we have in mind, each raw probe intensity file is a little over 30 megs. We would like to be able to analyze 200 to 400 files at a time. For these calculations, we are attempting to use the open source R Project in combination with code developed by the fine folks at Bioconductor. On a Win32 terminal server with two gigs of RAM, I managed to process 38 files before I hit serious memory limitations and the machine locked up. So here are my questions:

1.) We would like to get a 64 bit Linux workstation to avoid the 4 gig RAM limit of a 32-bit environment. R, though, is written and compiled for a 32 bit target. Will I be able to compile it from source for a 64-bit platform? If so, what will its limitations be? It’s not optimized for 64 bit, so I won’t get a speed boost, but will I at least get the increased memory?

2.) R requires both a C and a Fortran compiler. Getting a 64 bit version of GCC is easy, but what about a Fortran compiler? I believe I’ll need F2C or F77.

3.) With 20+ gigs of RAM, does anybody know how many files we’ll be able to analyze? The relationship between memory usage and the number of files is not quite linear.
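For what it’s worth, here is the back-of-the-envelope arithmetic I’ve been doing in R. Both the probe-cell count and the number of working copies are guesses on my part (the real probe count comes from the chip’s CDF), so treat it as a sanity check rather than a benchmark:

  ## Rough memory estimate, not a benchmark. The probe-cell count and the
  ## "copies" factor are assumptions, not measured values.
  probes.per.array <- 1.4e6   # assumed probe cells for an HG-U133-class chip
  n.arrays         <- 400
  bytes.per.double <- 8
  copies           <- 4       # guess at how many working copies R holds
                              # during background correction and normalization
  gb <- probes.per.array * n.arrays * bytes.per.double * copies / 2^30
  gb                          # roughly 17 GB under these assumptions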

I’ve asked other bizarre questions here, so what the hell, I’m giving it a shot.
*My apologies for that crappy summary explanation; I’m a programmer, not a microbiologist. I further apologize for the long, long question.

My geekiness is very small time, and a lot of the code and compiling stuff you’re talking about is way out of my range of expertise, but it seems to me you’re saying, essentially, that a fairly high powered Intel machine (under Win32) is only giving you about 10% of the file crunching horsepower you need for your processing.

Even if you go to Linux and get a larger RAM space, I think you’re still going to be significantly constrained by the inherent CPU speed and throughput bottlenecks of the Intel hardware platform. It really sounds as if you need to look at some symmetric multi-processing or cluster-type system (if the software can run on a multi-processor cluster), or a real mini-computer, to get the horsepower, and more specifically the throughput, you need.

I am using Affy chips right now, but I am not doing the analysis. I am just a geneticist with a passing knowledge of the bioinformatics aspect.

I know we do all of our primary processing using the dChip suite on a regular old P4 desktop PC running Windows 2000. There is a consensus around here that this is far better than the Affy software suite. From there, we move our data to R, which is running on a 64 bit Sun server. It runs faster, though I don’t know how much faster than a relatively equivalent 32 bit machine would be. I don’t know the exact hardware configuration of the server, nor do I think it will be easy to work out how many chips can be run at once, since it is a shared server. I do know that for my project we have had no issue doing around 5-6 at a time. Then again, we are just comparing three genotypes using the bare minimum of replicates needed for relatively good statistical resolving power.

Getting the Fortran compiler on Linux shouldn’t be an issue. I have used g77 without problems, but I don’t know if it is recommended for compiling R. This information should be available pretty openly, though.

Thank you both for your replies.

This is the exact problem I am having, because solid benchmarking is difficult to come by. Even a post to the Bioconductor mailing list couldn’t get me solid numbers. The biggest difference is the chip people are using. For example, the HG-U95 takes a lot less processing power than the HG-U133V2 (the chip we want to use). I’m certainly going to check out the dChip suite.

Cool, I know R uses F2C since I had to install it the last time I compiled R from source on a 32 bit machine.

The system would have multiple processors. I’m hoping, though, that processor power won’t be an issue. This stuff could run for a week as long as the machine is capable of spitting out some final results.

Thanks guys. I think my real reason for posting (even after researching this extensively) is that I’m worried that if we buy this machine (spending some large $) I won’t be able to get it to work. I might just have to take the plunge!

Yeah, there is no comparison between the chip that I use (DrosGenome 2.0, with 18,000 probe sets) and the Human Genome chips (usually 60,000+). So my experience with processor load won’t apply to you at all.

The only thing I can suggest is to find someone at your institution with a similar box and see if you can give it a trial run before you commit to buying it. I’m pretty sure that you can probably find something close to equivalent, especially if there are other microarray or bioinformatics groups around. And if you are buying from Dell or another big commercial computer company, maybe they have a try-before-you-buy program or a return policy if it doesn’t suit your needs.

I’ve spoken to another group here (I saw my buddy this morning) and he could lend me no further insight (he is running R on a P3 linux box).

It depends on how well R was written. It may just compile and run; 64 bit systems have been around for a very long time. Programmers should know better than to use pointers and integers interchangeably (the most common portability problem), but in some quarters the “All the world’s a VAX” mentality still prevails.

I’m not sure what you mean by not being optimized for 64 bit. The extra address space is the optimization. A given 64 bit machine may also have other attributes, such as the additional registers the AMD64 architecture added to IA32, more efficient argument passing conventions, etc., which are taken advantage of without any code changes. There may be other attributes, such as vector arithmetic units, that you’d have to specially optimize for, but those aren’t specific to a 64 bit architecture.

The important thing is virtual address space, not physical memory. If 20 GB of virtual memory isn’t enough, use 40 GB; if that’s not enough, use 60 GB; etc. Depending on the resident set size of the application, more physical memory may or may not improve performance.
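Once you do get a build, checking which flavour you ended up with is easy from inside R itself (assuming I’m remembering the right variables; they should be there in any reasonably recent version):

  ## A 64 bit build of R reports 8-byte pointers; a 32 bit build reports 4.
  .Machine$sizeof.pointer      # 8 on a 64 bit build, 4 on 32 bit
  R.version$arch               # e.g. "x86_64" rather than "i686"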

OK.

After a quick look at the code I’ll say this:

  1. It looks like the programmers took care to make their code portable. They use standard variable types (long, double, etc.) and refer to limits.h to find out how big the representable values are.

  2. GCC has no trouble compiling for 64 bit. You knew this already. G77 (the GNU Fortran compiler) is part of GCC, so if R will compile under g77 on a 32 bit system then you are home free for 64 bit.

  3. The install guide says you need F2C. F2C makes C code out of Fortran code, so the result will compile with the 64 bit GCC. No sweat.

  4. They don’t (on a quick glance) appear to be using pointers in the Fortran code, so there shouldn’t be any trouble converting pointers (usually just a 32 bit int under Fortran) to C’s 64 bit pointers.
    If the code is as clean as it looks, then compiling it under 64 bit shouldn’t be hard.

Given that 64 bit doesn’t even rate a comment on the homepage, I’d hazard a guess that 64 bit is a non-issue for R - else there’d be lots of discussion about getting it onto the 64 bit machines that are getting ever more popular.

Stumbling around the mailing lists shows people having the usual, everyday problems of compiling from someone else’s source - file permissions, missing libraries, etc. This from people compiling on 64 bit AIX systems and similar. I don’t find anybody wailing about a lack of support for 64 bit.

Since you are processing individual files to individual results, why aren’t you considering a farm? You could probably get and install 10 Opteron workstations with 4 GB RAM for around $3000 each. $30000 doesn’t sound bad, and would probably get you your 400 simultaneously processed files.

I would try it on a single, beefy 64 bit workstation, and if it works then just add more workstations. You’ll probably want to have two HDs in the workstation, though. If you overload Linux and it runs out of memory it might start killing processes on you - since processes get killed indiscriminately, the OS can hose itself. Use a small (20 or 40 GB) drive for the swap partition. You’ll want the fastest drives you can get for that job. You shouldn’t really need 40 GB, but that’s about the smallest you can readily get these days. The rule of thumb for the swap partition is RAM*3, and a 12 GB drive would be kind of hard to find, I expect.

Never mind the biochemistry. I’d love a chance to just help set up the computer system. That’s fascinating enough for me.

Wow. I love the SDMBs.

Not quite. All the files are processed together. They are read into a single object, then background corrected and normalized together. The final calculations produce a single set of expression values based on all the files.
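To make that concrete, the whole step boils down to something like this in R with the Bioconductor affy package (just a sketch; the directory is made up, and rma() is only one of the expression measures we’re evaluating):

  ## Sketch of the probe-intensity -> expression-measure step. The CEL file
  ## directory is a placeholder; rma() is one of several possible measures.
  library(affy)
  cel.files <- list.celfiles("/data/celfiles", full.names = TRUE)
  abatch    <- ReadAffy(filenames = cel.files)  # every array goes into one object
  eset      <- rma(abatch)                      # background correct, normalize, summarize
  write.exprs(eset, file = "expression_values.txt")

I’ve also seen mention of the affy package’s justRMA(), which is supposed to go from CEL files to expression values without holding the full probe-level object in memory the whole time, so that’s on my list to try if memory stays tight even on the bigger box.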

That might be a problem since my company runs an entirely Microsoft operation. One thing is for sure, I’m going to pass this thread along to my boss and see what he thinks.

I got you, I don’t know why I wasn’t thinking along those lines. So with 64 bits the ceiling is 2^64 bytes, on the order of 16 exabytes, far more than we could ever fill…

Thank you everyone. Mort Furd, you went way above and beyond to answer my question. I really appreciate it.

Is there any way that the processing can be split up? Say, you preprocess the files to get your background info, then pass that from all the individual machines back to one that figures the corrections (assuming the correction factors are something that you apply equally to all files) and the normalization target. That one then passes the corrections and the normalization target back to all the other machines, which then correct all of their files. Each machine then calculates its own aggregate result for all of its files and passes this on to the main machine for the final aggregation.

If I read you correctly, you are doing a statistical analysis of the measurements made by the chips. Given that, the mathematics for combining the aggregate results should be manageable.

If I am imagining this correctly, what you are doing is aggregating all of the files in one go on one machine and then doing your statistical analysis on the final aggregate. So really, all you need is for the client machines to generate their aggregates and stop before they do the analysis. Instead, they pass the aggregate on to the main machine which uses them as files from which to build its own aggregate - and it then goes on to do the analysis.

Maybe I’m way the hell off. Probably I am. But it seems like there ought to be some way to do this in a distributed manner.
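In case it helps, here is roughly what I’m picturing, sketched in R. It only covers building a shared quantile-normalization target (the textbook recipe, not the package’s own code); I’m hand-waving past background correction and the final probe-set summarization, and I have no idea how you’d want to shuttle the data between boxes:

  ## Step 1, on each worker: read its share of CEL files and return the
  ## column-sorted intensities -- all a shared quantile target needs per array.
  worker.summary <- function(cel.files) {
    library(affy)
    batch <- ReadAffy(filenames = cel.files)
    apply(intensity(batch), 2, sort)        # one sorted column per array
  }

  ## Step 2, on the main machine: average the sorted columns from every
  ## worker to get the common normalization target, then send it back out.
  build.target <- function(sorted.chunks) {
    rowMeans(do.call(cbind, sorted.chunks))
  }

  ## Step 3, back on each worker: map its own arrays onto the shared target
  ## (ties are handled crudely here; the real routines are smarter about it).
  normalize.to.target <- function(batch, target) {
    ranks <- apply(intensity(batch), 2, rank, ties.method = "first")
    apply(ranks, 2, function(r) target[r])
  }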

It sounds like what you need is to hire a render farm on a semi-regular basis. You probably don’t want the expense of buying, running and maintaining lots of kit. Alternatively, you might want to speak to Microsoft about their 64-bit Windows beta / customer preview: Windows Terminal Server is a major beneficiary.

No, you are not far off at all, and this is something I proposed to my boss. There are actually a couple of methods for doing these calculations: one of them is straight C code which is called from R, the other(s) are written completely in R. As you can imagine, we have to use the C code since it’s much faster (by a factor of 10) and slightly more memory efficient. I am not the most proficient C programmer, but since the code is open source, I told my boss I’d take a whack at it. He’s a PhD statistician, so I figure if he can explain the math to me, I can code it. For some reason, he didn’t like this idea. Maybe it’s the 8 other pressing things I’m supposed to be working on.
:smack: It would be fun.

Ah, yes. These are the times when you wish you could crawl up on the photocopier and xerox yourself.