I should probably post this to one of the sites dedicated to SQL Server, but frankly I’ve been a member of the SDMB for so long (#157, baby!), it’s just easier to ask in here.
So I want to delete massive amounts of data from a table in SQL Server 2000.
The table currently has about 450 million rows in it, 333 GB of data in this one table. I want to delete rows based on a column that currently has no index. FTR, an index on another field is almost 14 GB by itself. I need to be able to delete in small chunks so as not to bring the server down.
It seems like one of my options is to drop the current index that’s not being used and add a new one on the field I want to key the delete off of. (I’d have to drop the first index because there’s not enough space to just add another one.)
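In rough T-SQL, the swap would be something like this (the index and column names here are invented):

[code]
-- Sketch of the index swap; big_table, ix_unused, and delete_key_col
-- are made-up names for illustration.
DROP INDEX big_table.ix_unused                          -- free the ~14 GB first
CREATE INDEX ix_delete_key ON big_table (delete_key_col)
[/code]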
Another option seems to be to just delete small pieces at a time until such time as I can delete bigger chunks.
While I’m asking for opinions, I’d like to keep this in General Questions, as I’m asking for INFORMED opinions and factual statements.
So what other information can I give you to help me solve this dilemma?
I am an Oracle person but have done some work with SQL Server and lots with Access.
The first part I am unsure of is whether you are deleting whole records or running some kind of update query. Second, how many records will be affected?
I am also unclear on why you keep insisting on indexes. Sure, it would be nice, but you can run deletes (and updates) just fine without them, even if it is slower. I am sure other people run queries on non-indexed fields on that database, right?
The way I imagined you would do it is to run a select query that mimics your final query and see how long that takes. Your final query will probably take longer than that, but it should give you an idea whether you are in the ballpark of hours or days.
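For example, something like this (the table and column names are invented):

[code]
-- Dry run: count the rows the eventual DELETE would touch.
SELECT COUNT(*)
FROM   big_table
WHERE  keep_flag = 'N'
[/code]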
If it takes a really long time, you can just build your query to stop after a certain number of records. There are several ways to do that, but I am not sure if that is what you mean. You could run it many times until it comes back with nothing, or even schedule it to run at a certain interval until it is done.
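In SQL Server 2000 I believe SET ROWCOUNT is the usual way to cap a DELETE, so a sketch of that loop might look like this (same invented names as above):

[code]
-- Batched delete sketch; each pass removes at most 10,000 rows.
SET ROWCOUNT 10000

DECLARE @deleted INT
SET @deleted = 1

WHILE @deleted > 0
BEGIN
    DELETE FROM big_table WHERE keep_flag = 'N'
    SET @deleted = @@ROWCOUNT    -- 0 once nothing is left to delete
END

SET ROWCOUNT 0                   -- reset so later statements are unaffected
[/code]

Each pass commits on its own, so the transaction log gets a chance to clear between batches (assuming regular log backups or simple recovery mode) instead of having to hold one giant delete.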
I’m not insisting on indexes, but the table is so massive (and I estimate I’ll be deleting 80-90% of it) that I currently can’t even run a select query on it that will finish in less than an hour or two. To answer your other question: I’m not talking about updating rows, I’m talking about row deletes. I did mimic a select query to select a tiny bit of the data and it took forever to run; that’s why I thought it might be good to utilize an index, to find the data to delete faster. Am I wrong on this?
Regarding other people running other queries against it: no, that doesn’t really happen. This database is mostly for DBA use in support of another database. It has key records from the actual production db that need to be kept for historical reasons. We recently changed a business rule so we no longer have to keep all of those records.
Another Oracle bod here, but I have a SQL Server chap just across the desk. What I would do is index the column that’s going to be selected on, select the records you want to keep into a new table, and then TRUNCATE the existing table. I know this is pretty much what RaftPeople just said; I just thought I’d make it more specific.
TRUNCATE is much faster than DELETE because it isn’t treated as a transaction; you can’t roll back a truncate.
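In rough T-SQL, that would look something like this (all the names are invented):

[code]
-- Keep-and-truncate sketch; big_table, keep_flag, and keeper_rows are
-- made-up names.
SELECT *
INTO   keeper_rows            -- new table holding the 10-20% to keep
FROM   big_table
WHERE  keep_flag = 'Y'

TRUNCATE TABLE big_table      -- deallocates whole extents; very fast

INSERT INTO big_table         -- put the keepers back
SELECT * FROM keeper_rows

DROP TABLE keeper_rows
[/code]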
I would have thought dropping the existing index first would probably speed up the delete process, as otherwise you will have the overhead of continually updating that index as rows are removed. Having an index on the column you are using in the delete will not really help either: it too would incur overhead to maintain as rows are removed, and since you are removing most rows anyway, scanning the whole table will not add much.
I hadn’t thought of that. I like RaftPeople’s idea too, except I also have to check to see if there are any permissions that I need to watch out for. Small Clanger’s idea would take care of that.
As another possible solution: I realize a SELECT TOP n with an ORDER BY clause has to examine the entire result set before it can actually select the top n. However, if one doesn’t use the ORDER BY clause, does it behave the same way? I’ll have to check into that.
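If it doesn’t, the chunking could key off an unordered TOP, something like this (pk_col and keep_flag are invented names):

[code]
-- Chunked delete via unordered TOP; repeat until it affects 0 rows.
DELETE FROM big_table
WHERE  pk_col IN (SELECT TOP 10000 pk_col
                  FROM   big_table
                  WHERE  keep_flag = 'N')
[/code]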
If you’re worried about losing permissions, script out the old table using Enterprise Manager, and check the box labelled “Script object-level permissions”.
Then do a global search and replace in the script to use the new table name.
Alternatively, you could insert the records you want to keep into a temp table, truncate the original table, and then re-insert the saved rows from the temp table. It uses a little more overhead, but your main table is never compromised by being dropped or anything.
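Same sketch as above, but with a temp table (names still invented):

[code]
-- #keepers lives in tempdb and is dropped automatically when the session
-- ends; the original table, and its permissions, are never touched.
SELECT * INTO #keepers FROM big_table WHERE keep_flag = 'Y'
TRUNCATE TABLE big_table
INSERT INTO big_table SELECT * FROM #keepers
[/code]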
[hijack]
Also, I just want to point out that with the proper tools, even a truncate can be reversed. TRUNCATE is called non-logged because it doesn’t log the individual row removals. What it actually does is log the extent deallocations, and these can be repaired IF and only IF there is no other activity on the DB after your truncate, and you have the proper third-party software tools.
[/hijack]
Also, damn, that’s a huge table. I suspect you could do with some normalization.
Another vote for the RaftPeople Truncate Method. However, if that won’t work:
I gather from this that disk space is tight, which might make the copy-then-truncate option unavailable. If you have a primary key on the table (ideally an integer), you could follow your “delete small pieces at a time” plan pretty easily by limiting the DELETE statement to primary key value ranges. Absent that, you could use other ranges, such as by state or ZIP code or whatever you find convenient.
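A sketch of that, assuming a hypothetical integer key named id (and the same invented filter column):

[code]
-- Walk the key space in fixed-size ranges so each DELETE stays small.
DECLARE @lo INT, @hi INT
SET @lo = 0

WHILE @lo < 500000000            -- rough ceiling above the ~450M rows
BEGIN
    SET @hi = @lo + 100000

    DELETE FROM big_table
    WHERE  id >= @lo AND id < @hi
      AND  keep_flag = 'N'

    SET @lo = @hi
END
[/code]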
Note that the size of the database file may not decrease with each delete, depending on certain SQL Server settings, so you may have to shrink the db afterwards.
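That part is straightforward (substitute your own database and file names):

[code]
-- Reclaim the freed space once the deletes are done.
DBCC SHRINKDATABASE (YourDbName)
-- or target a single file, with the size given in MB:
-- DBCC SHRINKFILE (YourDataFileLogicalName, 102400)
[/code]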