Really *BIG* databases

I work in IT and have for almost 15 years (!). I have worked with unix and Windows, usually on software for slow connection remote computers. These machines used text files and Access tables for storage. That means that, while I use and understand SQL (Sybase, Oracle, Access), I haven’t worked with really huge databases. At my current job, several of my co-workers have had jobs at very large companies that track lots of stuff (e.g., AOL, Home Depot, UPS).

In my experience, a “long” query might run for 30 minutes to an hour. I was compiling some historical statistics once with some summation functions and such and it ran for quite a while. Well, these guys said that at these big companies, it was not unusual for a query to run for days. And I say WTF? I can’t imagine that anything useful can be done with a database where queries run for that length of time.

What is your experience with databases?

I work at a technology-oriented service company that I won’t identify but whose name everyone will recognize and whose jingle everybody can sing.

Our product and finance reporting involves queries whose performance is measured in dozens of hours, running against datasets totalling hundreds of terabytes. We recently celebrated a minor triumph by optimizing one of these reports (or to be more specifically accurate, the tables touched by the report) so that it would run to completion in just under a full day.

So yeah, that sort of thing isn’t really surprising.

Database programming always bored me, so I have done very little of it. I usually specialize in man-to-machine interfaces or graphics development where the demand for database work is rather small. Any database development that I have done has not had long query times - the times I am used to would be measured in minutes, not hours or days.

I work with fairly large databases but they general aren’t measured in multiple-terbytes. I am pretty strong in SQL though. The hard part ot my current gig is that the necessary SQL is extremely long. I once wrote a plain SQL statement that was over 38 pages long and there was other SQL in the job as well. We have almost no debugging tools here so watch those commas and parentheses because if even one is wrong, the whole thing is and you have to find it yourself.

I was given a query that ran for an hour to look at yesterday. That is considered way too long for our purposes. It was a series of design flaws that caused it and I brought down the run time to 15 seconds.

Luxury! :wink:

This is why they have data warehouses nowadays; a good warehouse will reformat transactional information into a more query friendly format and turn queries that take days into queries that take minutes or seconds.

I’ve done a little bit of this, and have a friend who does it for a living. It’s amazing how he can fine tune huge datasets into lean mean warehouses. Very cool.

Hmm… what counts as a really big database??

The main database I’m working with lately is about 5-7 gigs for the MS-SQL database file (and around another 5 gigs for transaction log :eek:) and has several tables up into the 2-5 million records range.

And if a query won’t run in 15 minutes or less… then I get yelled at. :smiley:

Much of my work depends on extracts from one of our larger systems that take about 30 hours to run. There’s a lot of manipulation of individual financial records of around a million people, aggregated so that people without access to individual data can still have some useful information to use in examining trends – groupings by client company, age group, etc. It’s really not uncommon with large systems if you’re working with either the production database or just a simple mirror of it. As Athena points out, a good data warehouse can help a lot of that, but not every system warrants the expense of being linked into a warehouse.

Another large database designer/analyst/tuner chiming in.

Queries that run for hours and days are generally poor queries or running against poorely designed databases.

Properly designed and indexed databases should return results in seconds or perhaps minutes. Certainly not hours and days.

Sometimes, re-designing a database structure is not in the cards and long running queries are a necessary evil. If such a request is an exception then perhaps that’s acceptable. If, however, that kind of request is part of daily or weekly routine, then someone needs to analyze the business requirement for this sort of request and then figure out how to more efficiently structure either the query or the data.

chrisk, I have no idea what would qualify as a big database. I think one that would have queries running for days would count. Or, maybe, a very poorly designed database.

Athena, I have been curious about data warehousing, that kind of work appeals to me. Although I have never done it. Reworking data has always seemed like interesting work. I like the idea of taking big files and doing something with it.

Shagnasty, please elaborate! 38 pages! At my old job we were complaining about the queries that run to 5 pages. Is that Oracle? Lots of nested queries?

I was looking for work last year and, before I accepted my current job, was expecting a job offer from a company that ran many reports compiling survey results using SAS. Which means, I think, they don’t really have a “database”, SAS has its own repository. They had thousands of reports that had to be run quarterly on each new survey. I was interested in helping them move to a better technology. But it didn’t happen.

I find this stuff interesting and amusing (well, 38 page queries are also scary!).

I asked my DB expert buddy this same question a few months ago. His answer:

<500K rows, 1Gig = tiny. Usually an Access DB.

~50 million rows/20Gig: Medium

“Big” to him starts at 1 billion rows and 500 Gig.

I think it’s a lot of fun, too. Like I said, I’ve done a bit of it, and it’s fascinating to play with all that data. Not sure if I’d want to do it full time, but it’s fun on occasion.

Maybe not every system, but I’ve seen dramatic speed improvements in even small databases if it’s the right problem.

I personally added a mini-data-mart type thing to a very small project last year that dramatically improved performance. There was a situation where the users wanted a screen to display past-due dates from maybe 10 or 12 tables, all in different formats. The past due dates had to be calculated based on the last time something was paid/inspected/whatever and the interval between the times that things had to be paid/inspected/whatever. There could be thousands of these dates at any given time in the DB.

Sure, I could do it using brute force and an unholy stored procedure. Much easier was to maintain a lil’ data mart table that had the next due date of everything, updated by triggers on all the various tables. Voila, my ungodly huge stored procedure was now a one-line query.

Now, that’s not a true data warehouse by any means, but it demonstrates using data warehousing/data mart concepts on a very small database and getting a decent return on it.

That’s kind of interesting - I’d be surprised if the day-long query couldn’t be made much better though.

Our databases are (so far) mostly under a gig, but you can get some very poor performance in much smaller databases. In a poorly tuned DB (ie, doesn’t contain proper, or any, indices) of even 10 Megs, your query involving table joins could take over a minute, when it should be near-instantaneous.

Stuff like this is what a DBA is for, but sometimes they’re overworked, or not very good, or don’t care about you, or (in our case) don’t exist. :slight_smile:

Good grief, to do what? I’ve written queries with a dozen table joins, or fetching 100+ columns but I’ve never run over about two pages.

I don’t get to play with humongous databases, I guess some of our customers are over 100 Gig. Even then a fetch query that takes over a second suggests that the indexing is screwed.

I can’t imagine what sort of query would run for over a day unless it was like QuickSilver said, a naff query against honking un-indexed tables.

What I hate, as the system admin, is how frequently hardware & OS is blamed for the poor performance. If you want 5-10% improvement, tune the OS. If you want 1000% performance improvements, tune the query.

Amazing how many DBA’s & developers don’t understand this. You see these 8-way, cartesion joins on non-indexed tables spaces and they wonder why the performance sucks.

Actually, to a certain extent, I’ve had the opposite experience (though I admit this might be a special case.) We were fighting performance lockups and retuning our queries to get 5-10% improvements.

When we finally got some new hardware approved for purchase, that gave 200-300% improvements. :slight_smile: :slight_smile:

Snerk. We have lookup tables that big. (ok, that’s a slight exaggeration…but we do have a dimension table with 100 million rows).

My company (you’d recognize the name in a minute, etc) has hundreds of terabytes in its warehouse, in dozens of tables, and several of them are billions of rows. We’d like to have all queries run in minutes (as opposed to hours) but occasionally an ad-hoc request will come in that will take several hours to complete. This indicates that the request and the database design are not well-suited for each other; if this becomes a common occurrence then we need to redesign the database, to include a new level of aggregation or something like that.

I don’t personally deal with big databases, but my dad has one that’s about four terabytes. The sort of queries that anyone will want to execute on it are well known, so the database is optimized for them, and most of them run in seconds.

I’m essentially a database editor for the US Patent & Trademark Office. The database my division is responsible for goes back to the 1960s but one I routinely query goes back to October 25, 1870 (trademark #0000001).

Access and SQL Server here.

I am the DBA for a firm that has the most incredibly poor design on their database that you can imagine. Over the last 3 years, I’ve managed to sneak in some redesign and a lot of optimization, so that queries that used to run 30 seconds or more now run in 1-2 seconds. They think I am god.

It’s good to be god… :smiley: