Yeah, something like Pandas is good here. Excel is just really, really not equipped to handle large swaths of data. Trust me, I have a giant beast of a gaming computer that also does things like batch rendering and massively parallel AI and planning work. Excel still chokes and cries if you accidentally copy something too big into it. Computer specs honestly don’t matter much at that level; the software just plain isn’t equipped to handle it.
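For a sense of scale, here’s a minimal pandas sketch (the file name and column names are invented for illustration) of the kind of chunked aggregation that makes Excel fall over:

```python
# A minimal sketch (file and column names are invented) of chunked
# aggregation in pandas: the kind of job that makes Excel fall over.
import pandas as pd

totals = pd.Series(dtype="float64")
for chunk in pd.read_csv("big_dataset.csv", chunksize=1_000_000):
    # Aggregate each million-row chunk, then combine the partial sums,
    # so only one chunk is ever held in memory at a time.
    totals = totals.add(chunk.groupby("category")["value"].sum(), fill_value=0)

print(totals.sort_values(ascending=False).head())
```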
My brother has this problem. He does oil-reservoir modelling; one of his workbooks was 100 GB. What he has done is split his workbooks into multiple sheets to take advantage of Excel’s multi-threading, and he’s seen significant performance benefits scaling up to 24 cores. And you absolutely need the 64-bit version of Excel.
Something to note is that VBA is single-threaded, so you should avoid it if possible.
He, too, has been advised to learn MATLAB but can’t find the time.
Except that multi-threading appears to be a $499 add-on.
Amazon has servers for hire (EC2), and I don’t think it’s that expensive. So if you really want to narrow down your computation requirements, configure servers of increasing memory size until your spreadsheet runs properly or you run out of affordable configurations.
Once you figure out your computation requirements, you can either keep using that virtual machine or build your own.
What might kill you is if those nested IFs are contained in nested loops. If your computation is O(N), then even 36 million operations should probably be fine, even on machines of moderate power (by 2015 standards).
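A toy sanity check of that claim (the numbers are just illustrative):

```python
# A toy sanity check: 36 million simple operations in one flat pass are
# quick even in plain Python; nesting that pass inside another loop
# multiplies the cost by the outer count.
import time

n = 36_000_000
start = time.perf_counter()
total = 0
for i in range(n):  # one flat pass: O(N)
    total += i
print(f"O(N) pass over {n:,} items: {time.perf_counter() - start:.1f} s")
# Wrapping this loop in an outer loop over 100 scenarios would take
# roughly 100x as long; that is where nested IFs inside nested loops hurt.
```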
100 GB spreadsheets for geologic data modeling? Don’t they teach FORTRAN anymore for God’s sake?
What are you talking about? Threading is built into the core of the Python language and has been there since before release 2.2. There is the built-in low-level thread API as well as the higher-level threading module and the Queue module, which implements FIFO communication between threads, although the o.p. really shouldn’t have to worry about explicit threading, as many of the NumPy and SciPy modules use threading to optimize the performance of their operations automagically, without any direction from the user.

And while threading may be of some benefit in dealing with a dataset of this computational scale, the real problem here is Excel, which in no way optimizes code or runs efficiently. Although you may have cut and pasted the same cell formula down the length of a column, Excel treats each and every cell (contents, format, formula, links, et cetera) as an individual, non-inherited object, which is what is sucking down most of the o.p.’s memory. Just running the problem in a language designed to handle numerical data, such as Python, PDL, or C, will be inherently faster by orders of magnitude, simply by virtue of not having to carry around a bunch of useless metadata.
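To make the “orders of magnitude” point concrete, here’s a throwaway comparison (array size invented) between a per-cell-style Python loop and the same formula as a single NumPy array operation:

```python
# A throwaway comparison (array size invented): the same "column formula"
# as a per-cell Python loop versus one contiguous NumPy array operation.
import time
import numpy as np

values = np.random.rand(5_000_000)

start = time.perf_counter()
slow = [v * 1.08 + 2.5 for v in values]  # per-cell objects, like Excel
loop_time = time.perf_counter() - start

start = time.perf_counter()
fast = values * 1.08 + 2.5               # one vectorized pass, no metadata
vector_time = time.perf_counter() - start

print(f"loop: {loop_time:.2f} s, vectorized: {vector_time:.4f} s")
```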
The dataset the o.p. is dealing with should be manageable on a reasonably powerful desktop or a small server without any problems. The only reason to go to an external server such as Amazon EC2 is if you have a lot of simultaneous users who need to share access to the data across the internet.
I certainly hope not; Fortran hasn’t been in use for new software development for at least twenty years (since Fortran 95), although an attempt was made to modernize it by glomming on a bunch of object-oriented functionality and interoperability with other languages in Fortran 2003. There is still a lot of legacy code in use that was written in Fortran, so it is useful to know, but there is no good argument for doing new development in the language. New applications should be written in a language with modern features and libraries such as C, Scala, Python, Perl Data Language (PDL), et cetera, depending on the type of application and the skill/experience of the programmer. (I generally recommend avoiding programming in C++ unless the programmer already has a strong background, as OOP is often just a complication for the mostly linear operations in scientific programming.)
For the problem the o.p. is trying to tackle, a “scripting” language with C-based libraries optimized for data munging and manipulation, such as Python or PDL, is pretty much ideal. And despite fears of the learning curve, anyone who can write in the mess that is VBA, or follow a tangle of inter-referential Excel formulas, will find sitting down with any of the vast number of online tutorials for these languages a relief.
Stranger
If you’re doing the same calcs but on numerous records, just create an input sheet, a calc sheet that evaluates a single record, and a macro that pastes-values the output to an output sheet and queues up the next record to evaluate.
I’ve done this several times for 100 MB Excel files evaluating a couple hundred thousand records. Macro run times can vary from 15 minutes to several hours.
It’s doable if you’re married to Excel. If you need all the formulas dynamic at once, Excel is not going to be your go-to tool. I’d look into MATLAB or something else, and you’re going to need an experienced user to help you build it out.
This:
Again, threading is built into the core of the Python language and is used by functions in NumPy and SciPy to parallelize and optimize operations in a manner transparent to the user. The “Anaconda Accelerate” add-on builds some additional functionality on the Intel Math Kernel Library (MKL) and their NumbaPro package, which supports parallelization on the graphics processor via the CUDA framework. Most (in fact, virtually all) of the MKL functionality can be found in OpenBLAS, which is not included with Anaconda but can be installed using pip or any number of other package managers. Since Continuum Analytics (the company that packages and distributes the Anaconda distribution and many of the additional modules) doesn’t charge for the basic distribution, their way of making a profit is to provide advanced functionality and support services to sophisticated users who want a turnkey distribution.
However, everything the o.p. needs to do can be done with basic Python and a base distribution of NumPy/SciPy. I use NumPy/SciPy for filtering, transformation, and visualization of large volumes of raw signal data and spectral distributions without using any of the for-pay packages. I only recommended the free Anaconda distribution because it is a simple, self-maintaining installation package with comprehensive documentation and a GUI installed by default.
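As a flavor of that kind of work, here’s a small self-contained sketch using only the base NumPy/SciPy stack; the signal is synthetic and the parameter values are purely illustrative:

```python
# A small sketch of signal filtering and spectral analysis using only
# base NumPy/SciPy; the signal is synthetic, parameters illustrative.
import numpy as np
from scipy import signal

fs = 1_000.0                       # sample rate in Hz
t = np.arange(0, 1.0, 1.0 / fs)
raw = np.sin(2 * np.pi * 50 * t) + 0.5 * np.random.randn(t.size)

# Fourth-order Butterworth low-pass at 100 Hz, applied with zero phase shift.
b, a = signal.butter(4, 100, btype="low", fs=fs)
filtered = signal.filtfilt(b, a, raw)

# Power spectral density via Welch's method; the peak should sit near 50 Hz.
freqs, psd = signal.welch(filtered, fs=fs)
print(f"dominant frequency: {freqs[np.argmax(psd)]:.1f} Hz")
```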
Stranger
I was a super-big fan of Excel. It’s got a lot of great features for doing calculations, from simple math to statistical, financial, and engineering functions, plus special ways of handling text. It was easy enough to merge certain parts of the Excel tables into Word documents to make nice reports and output.
Then I had to learn Access for my job in publishing and I became a big fan of Access and rewrote a bunch of my old Excel hobby projects into Access format.
Fortunately, Access has a nice wizard for importing Excel tables right into it, and the formulas that work in Excel will generally work in Access with little to no modification. In a nutshell, Access can be considered a form of SQL with developer-coddling wizards for making user-friendly graphical interfaces and reports: all that stuff I was merging into Word documents could be viewed on the screen or in reports that I made within Access itself.
Then I decided to learn SQL because my new employer used SQL for a lot of critical data. I grabbed Learn SQL in 21 Days (I don’t recall the publisher or authors’ names) and learned what I needed for my job in a week. Again, it’s quite fortunate that most of the formulas that work in Excel cells can be built into your SQL statements.
Believe me, once you’ve got the hang of SQL (if you’re working a real job at the same time, give yourself 3 months to learn it instead of 3 weeks), you’ll wonder why you didn’t make the transition earlier. Plus, way back then (it was several years ago) I stumbled across a website that had a mock SQL server that would let you learn and test SQL code on it, all for free. I don’t know if it’s still out there, but there must be some imitators around by now.
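If you want to try the transition without setting up a server at all, here’s a tiny self-contained example using Python’s built-in sqlite3 (table and column names invented), showing SQL doing what SUMIF/AVERAGEIF/COUNTIF would do in Excel:

```python
# A tiny self-contained example (table and column names invented) using
# Python's built-in sqlite3: SQL doing the work of SUMIF/AVERAGEIF/COUNTIF.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("north", 120.0), ("north", 80.0), ("south", 200.0)])

# One statement replaces a whole column of conditional spreadsheet formulas.
for region, total, average, count in con.execute("""
        SELECT region, SUM(amount), AVG(amount), COUNT(*)
        FROM sales
        GROUP BY region
        ORDER BY SUM(amount) DESC"""):
    print(region, total, average, count)
```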
Good luck! Come back and thank us next Spring!
–G!
I use this method quite often with SQL acting as the source.
Using a VBA ADODB.Connection from the spreadsheet:
- Query a group of SQL records (for example, one location out of 2,000 available locations)
- Dump the query results to an Excel table (input sheet)
- Run calculations on that table (calc sheet)
- Copy the results to the next line of a table (output sheet)
- Rinse and repeat
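For comparison, here’s a rough sketch of that same query/calc/output loop outside Excel, using sqlite3 and pandas; the database, table, and column names are placeholders, not from the actual setup:

```python
# A rough sketch of the query/calc/output loop outside Excel, using
# sqlite3 and pandas; database, table, and column names are placeholders.
import sqlite3
import pandas as pd

con = sqlite3.connect("measurements.db")
locations = [row[0] for row in
             con.execute("SELECT DISTINCT location FROM readings")]

results = []
for loc in locations:  # one location at a time, like the ADODB query step
    df = pd.read_sql_query(
        "SELECT * FROM readings WHERE location = ?", con, params=(loc,))
    results.append({"location": loc,            # the "calc sheet" step
                    "mean": df["value"].mean(),
                    "peak": df["value"].max()})

pd.DataFrame(results).to_csv("output.csv", index=False)  # the "output sheet"
```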
MySQL is the database. It has records organized into columns and rows, just like a spreadsheet. You would use the MySQL query language to select and extract data. Computation that couldn’t easily be performed in the query language could be done in a programming language such as PHP; MySQL and PHP go together well. You would also use MySQL for report generation, and then PHP to further format the output for your purposes.
This. Really, this. Or some breed of SQL if FORTRAN scares you. MySQL Workbench sounds like a nice interface. But you haven’t lived until you have timed a language designed sixty years ago to do what you want to do, operating on modern hardware. There’s a reason it’s still the GO TO language for some things. rimshot