A bit of background: I’m a technician in a small biomedical research lab. Currently, I’m collecting and processing data to produce survival curves sort of like this. I’ve also got to do some medium-weight statistics on these data sets. Right now, I’m in the midst of a battle with a half-broken Excel spreadsheet, grafted onto a kludgy old VB script, that’s been handed down for years, with various modifications made along the way. I need to modify it yet again to make it work for my current round of experiments, but I’m sick and tired of Excel and want to find a proper tool for the job.
Now here’s my question. I know there are more than a few science and engineering types around here… What sort of tools do you use, and where are they most useful? How easy are they to use? I’ve had a bit of experience with Matlab, but I doubt I can convince my boss to spring for the several-thousand-dollar license. I’ve also heard that a lot of people head straight for Python.
Partly, I also want to learn something that I can carry with me to grad school, where I vaguely intend to get to the more quantitative side of biology. My programming experience at this point is rather meager.
I’ve heard good things about the R Project, but it apparently also has a steep learning curve. OTOH, it seems to be free. (I have no experience with it myself.)
Matlab was my first thought - it’s reasonably efficient (at least until I start doing my own programming), and a pretty decent data handler. I’m running it on a kind of old machine, so some of the processing scripts take a while - I’ve started doing pre-processing from scratch in Fortran and then importing the output files into Matlab. There are two main reasons that my research group (I’m a grad student) uses it, though - it’s got good, fairly intuitive visualization capabilities, and a few of our collaborators at other universities are using it as well. If you’re looking for something to take to grad school, Matlab seems to be a popular choice in the earth sciences, despite the price tag.
I’m learning Statistica right now for some data analysis, but the less said about that the better. That’s probably more a reflection on me than the software, though.
R and Python are the standard programming languages of academic statisticians. Python is very easy to pick up, a pleasant language to work in, and very useful for doing data transformations that might be a bit more painful in other languages. R has some nice features, and as long as what you’re trying to do isn’t incredibly obscure, there’s a good chance that somebody has already written a package to do it.
If you do get into R and make some progress with it, you might be able to convince your boss to buy an S-Plus license for you. R and S-Plus are both implementations of the S programming language, but R doesn’t really have any fancy graphical tools built in, whereas S-Plus can be made to look a lot more like Excel if you don’t want to screw around with the command line all the time. On the other hand, there’s a really nice Excel plug-in called RExcel that allows you to invoke R code from Excel. There’s also a Python module for interacting with R, but I can’t remember its name right now.
SAS is pretty well-known, but I’ve never run into anyone (myself included) who would choose to program in it if there were any alternatives available. Since there are, we don’t.
How much does a license of Matlab cost anyways? They don’t give any price information until I spill my entire life story… If it’s only in the $hundreds range, I might be able to beg…
Of course, every freaking biologist seems content with Excel. I’ve seen papers in my field where all of the figures are lifted straight out of Excel, horrific default settings intact.
If you’re going to explore the Python route, make sure to check out SciPy (with its companion NumPy) for the number crunching and matplotlib for visualization. Also, IPython is a big help in interacting with Python in general and matplotlib in particular.
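For the survival curves mentioned at the top of the thread, here is a rough sketch of what the Python route might look like. It computes a Kaplan-Meier estimate in plain Python and, optionally, draws it as a step plot with matplotlib. I’m assuming the curves in question are Kaplan-Meier curves, which the original post doesn’t actually say. The function names (kaplan_meier, plot_curve) and the sample numbers are made up for illustration and are not real data.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier survival estimate.

    durations: time until death or until the subject left the study
    observed:  True if the death was observed, False if censored
    Returns a list of (time, survival_probability) pairs, one per
    distinct time at which at least one death was observed.
    """
    if len(durations) != len(observed):
        raise ValueError("durations and observed must have the same length")

    # Only times with an observed death change the curve.
    event_times = sorted({t for t, e in zip(durations, observed) if e})

    curve = []
    survival = 1.0
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Observed deaths exactly at time t.
        deaths = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


def plot_curve(curve):
    """Draw the estimate as a step function (needs matplotlib installed)."""
    import matplotlib.pyplot as plt  # imported here so the estimate works without it

    times = [0] + [t for t, _ in curve]
    probs = [1.0] + [s for _, s in curve]
    plt.step(times, probs, where="post")
    plt.xlabel("Time")
    plt.ylabel("Fraction surviving")
    plt.ylim(0, 1.05)
    plt.show()


if __name__ == "__main__":
    # Hypothetical data: five subjects, two of them censored.
    example = kaplan_meier([5, 8, 8, 12, 15], [True, True, False, True, False])
    for time, prob in example:
        print(f"t={time}: {prob:.2f}")
```

This skips confidence intervals and comparisons between groups, which is where a proper statistics package earns its keep, but it gives a feel for how little code the basic curve takes.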
You can get the student version for just a few hundred ($250?). I seem to recall the professional version costing something like $5k. It’s kind of a pain in the butt the way they package the software, because depending on what you want to use it for, you’ve got to get several of the toolboxes for it to even be functional. Signal processing, controls toolbox, and symbolic math toolbox, in my case. The student prices for those are around $60 each, IIRC.
R’s difficulty can be overestimated. On the other hand, it is free, it comes with very clear manuals (as PDFs along with the download) and it is extremely versatile. Give it a try - it will most likely handle what you need it to do.
SAS is also a good choice, but there is some learning there as well, and the price will make you wince.
SPSS is, in my opinion, the easiest programme to use. Stata is also OK (with its Windows-style drop-down menus). R needs a lot more direct coding, I’ve been told.
Keep in mind that different departments often have preferences in terms of what program they use. You could buy and learn Stata only to find out they only use SPSS. The good news is that if you become proficient in one program, you’ll have little problem transferring your skills.
SAS is probably the most complete program, but I agree that R is probably the way to go for your immediate needs.
SPSS is what I used in grad school and now at work. I have also used SAS, and while it is probably a better program than SPSS, it’s also considerably more expensive, or was, the last time I looked.