SQLU VLDB Week – Archiving and Purging Data
Welcome back for the final day of VLDB week at SQL University. Today, we will talk about one of the big challenges that all administrators must eventually tackle when working with a VLDB: archiving and purging data. This particular problem has a very simple solution; however, if you can’t take the easy way out, there are other things you can do. Before we go any further, here is a recap of VLDB week up to this point:
- SQLU VLDB Week – Intro to VLDBs
- SQLU VLDB Week – Intro to VLDBs, part 2
- SQLU VLDB Week – Index Maintenance
- SQLU VLDB Week – Backups
- SQLU VLDB Week – Integrity Checks
There is an Easy Way
The challenge with archiving and purging data is that it can take a long time, cause excessive blocking due to lock escalation, and leave behind space that is difficult to recover. As I said above, there is a simple solution that addresses all of these concerns … partitioning. Partitioning allows us to archive or purge data by performing a fast switch of a whole partition out of the table. The intent is to switch the partition you want to archive or purge out to an empty table, and then you are free to do with it as you please.
A partition switch is a metadata-only operation. This means there is no actual movement of data. The allocation mappings for the pages are simply updated to show the other table as the owning object. The operation is very quick and doesn’t take excessive locks. Additionally, you can recover the space by dropping and recreating the partition scheme and rebuilding the indexes, which then allows you to drop the filegroup the partition was in.
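As a rough sketch of what that might look like (the table and staging table names here are made up; the staging table must be empty, match the source table’s structure and indexes, and sit on the same filegroup as the partition being switched):

-- Hypothetical example: move the oldest partition of dbo.BigTable into an
-- empty staging table so it can be archived or purged independently.
ALTER TABLE dbo.BigTable
    SWITCH PARTITION 1 TO dbo.BigTable_Archive;

-- The switch is metadata-only; the rows now belong to dbo.BigTable_Archive.
-- Archive them (bcp, insert into an archive database, etc.) or simply purge them:
-- TRUNCATE TABLE dbo.BigTable_Archive;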
There is a Hard Way
It would be nice if everything was nicely partitioned for us, but back here in the real world, that’s not always the case. We all have to deal with purging or archiving data from a massive stand-alone table at some point. There are some strategies you can use to minimize the impact on the live system.
- Delete in small batches to avoid lock escalation or long term locks.
- Manage the log file by running frequent log backups, if in full or bulk-logged recovery model, or frequent checkpoints, if in simple recovery model, in between batches to ensure that you are able to reuse the existing log. Log file growth will block all activity on the database during an expansion, and should be avoided.
- Use targeted deletes to minimize the number of locks that will need to be taken. A delete command with criteria will take shared locks on the data it searches to find matches, then convert the shared locks to exclusive locks on the rows it intends to delete and release the shared locks on the other rows. To ensure that it can take the exclusive locks, it takes intent-exclusive locks. I’ve seen a number of instances where delete statements that needed to delete 0 records were causing deadlocks because the delete command was taking locks on the entire table to perform a table scan to find the rows.
When I set up a batch delete process, I have it perform a select using the nolock hint to identify the rows to delete, insert the primary key value(s) into a temporary table or table variable, and then target the delete at the exact rows using only the PK column(s) in the criteria. This minimizes the locks taken initially to only the rows that will be deleted.
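Here is a rough sketch of how such a loop might be put together. The table, column names, purge criteria, and batch size are all made up for illustration; tune them for your own system.

-- Illustrative batched, targeted delete; dbo.BigTable, BigTableID, CreateDate,
-- and the batch size of 1000 are hypothetical placeholders.
Declare @ToDelete Table (BigTableID int not null primary key);
Declare @PurgeDate datetime;

Set @PurgeDate = DateAdd(year, -2, GetDate());

While 1 = 1
Begin
    Delete From @ToDelete;

    -- Identify the next batch of rows with nolock so the search itself takes no row locks.
    Insert Into @ToDelete (BigTableID)
    Select Top(1000) BigTableID
    From dbo.BigTable with(nolock)
    Where CreateDate < @PurgeDate;

    If @@RowCount = 0
        Break;

    -- Targeted delete using only the PK values in the criteria.
    Delete BT
    From dbo.BigTable BT
    Inner Join @ToDelete TD On TD.BigTableID = BT.BigTableID;

    -- Simple recovery model: issue a checkpoint so the existing log can be reused.
    -- In full or bulk-logged recovery, run a transaction log backup here instead.
    Checkpoint;
End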
One more side note: simply adding TOP to the delete statement will NOT minimize the amount of locks taken initially. Please don’t do this!!
I set up a little code demo to demonstrate the difference targeted deletes make. I’ll be using the AdventureWorks2008R2 database for this example. The Production.TransactionHistory table has only 113,443 rows, but it will do just fine for this purpose.
The first run of this will be a typical delete statement doing an If Exists … Delete. I’m going to put it into a transaction so that I can check the locks before they are released and then roll the transaction back. I’ll be querying sys.dm_tran_locks to get the locks information.
Use AdventureWorks2008R2;

Begin Tran

If Exists(Select *
          From Production.TransactionHistory with(nolock)
          Where Quantity > 4)
    Delete Top(100)
    From Production.TransactionHistory
    Where Quantity > 4

SELECT resource_type,
    request_mode,
    LockCount = COUNT(*)
FROM sys.dm_tran_locks
WHERE request_session_id = @@SPID
Group By resource_type, request_mode

Rollback
resource_type  request_mode  LockCount
OBJECT         IX            1
PAGE           IX            84
DATABASE       S             1
KEY            X             300

So we see that it takes a shared lock on the database (everything does), an intent-exclusive lock on the table in case it needs to escalate, 84 intent-exclusive locks on index pages, and 300 exclusive key locks on index rows. What this doesn’t show is the locks that are taken initially for identifying rows to delete. We can preserve those locks by adding a holdlock hint to the query so it won’t release any locks. I’m going to add the with(holdlock) hint to the Production.TransactionHistory table in the delete command. Here are the new results:
resource_type  request_mode  LockCount
OBJECT         IX            1
PAGE           IX            84
KEY            RangeS-U      237
KEY            RangeX-X      100
DATABASE       S             1
KEY            X             200

As you can see, there are a number of locks taken by the command that either get converted into other locks for the delete or get released, when the delete executes, on the rows it’s not going to delete. Now let’s take a look at the locks taken using the method I describe. I’m going to select the PK values that I want to delete and insert them into a table variable. I will then join that table variable to the table to perform a targeted delete using just the PK values.
Use AdventureWorks2008R2;

Declare @ToDelete Table (TransactionID int not null primary key)

Begin Tran

Insert Into @ToDelete
Select Top(100) TransactionID
From Production.TransactionHistory with(nolock)
Where Quantity > 4

Delete TH
From Production.TransactionHistory TH
Inner Join @ToDelete TD On TD.TransactionID = TH.TransactionID

SELECT resource_type,
    request_mode,
    LockCount = COUNT(*)
FROM sys.dm_tran_locks
WHERE request_session_id = @@SPID
Group By resource_type, request_mode

Rollback
resource_type  request_mode  LockCount
OBJECT         IX            1
PAGE           IX            33
DATABASE       S             1
KEY            X             300

The first thing I notice is that there is very little difference in the number of locks. The number of IX locks on pages is lower (33 vs. 84), but otherwise it is almost the same. That makes sense since we are deleting 100 records in each case. Now, let’s add the with(holdlock) hint to the Production.TransactionHistory table in the delete statement again so we can see the initial locks. Here is the locks output from the second run of the query:
resource_type  request_mode  LockCount
OBJECT         IX            1
PAGE           IX            33
DATABASE       S             1
KEY            X             300

No, I didn’t make a copy/paste error here. Are you wondering why no extra locks are showing up this time? The answer is simple. By doing a targeted delete, all locks taken initially for identifying the rows to delete were converted into exclusive locks for the delete. No extra locks were generated.
One final word about recovering disk space after the deletion of data has completed. First of all, my recommendation is not to do it unless you absolutely have to. There is no good way to recover a large amount of space after a large deletion. If you must do it, use these tips to minimize the impact.
- Use DBCC SHRINKFILE with the TRUNCATEONLY option first. TRUNCATEONLY will drop any free space at the end of the file and release it to the file system. In some cases, this might free up enough space so that you don’t have to do a harmful shrink operation.
- Never use DBCC SHRINKDATABASE as it will try to shrink all files in the database. If you must shrink, use DBCC SHRINKFILE to target only the file with the extra space.
- Include rebuilding indexes as part of your plan. Shrinking a file will fragment your database horribly, and if you are going to shrink it, you should rebuild all of the indexes afterwards (a rough sketch of these steps follows this list).
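As a rough sketch of those steps, the logical file name, target size, and table name below are hypothetical:

-- Try releasing only the free space at the end of the data file first; no pages are moved.
DBCC SHRINKFILE (N'MyDatabase_Data', TRUNCATEONLY);

-- If that is not enough, shrink just the one file that holds the freed space
-- down to a specific target size (in MB).
DBCC SHRINKFILE (N'MyDatabase_Data', 50000);

-- Shrinking moves pages and fragments the indexes, so rebuild them afterwards
-- (repeat for each large table, or run your normal index maintenance).
ALTER INDEX ALL ON dbo.BigTable REBUILD;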
Stephen Horne
SQLSoldier,
Thanks for this useful series. Some of your readers may want code examples that show/expand on what you suggest. Following is a code snippet I often use in my current job. We mainly consolidate data from lots of different data sources into a central data/reporting store (not a true Data Warehouse), so the recovery model for all our databases is simple (no need for point-in-time recovery). Most of our larger tables are partitioned, but we still sometimes need to perform DELETEs, which, as you indicate, can cause many issues with VLDBs.
The following script causes frequent small commits leading to frequent checkpoints, thus as you say minimizing log file growth.
Obvious note: the WHERE clauses in the two commands must match.
WHILE EXISTS (SELECT * FROM MyTable WITH (NOLOCK) WHERE MyColumn = 12345)
DELETE TOP (1000) MyTable WHERE MyColumn = 12345 ;
If the WHERE column is your clustered index key, or the first column of it, and you can find some way to evenly “chunk” or group your data into even, non-overlapping sets, then you can parallelize this DELETE operation, possibly leading to better performance. This might look something like the following, where you would run the commands simultaneously and possibly see a performance boost.
Command #1:
WHILE EXISTS (SELECT * FROM MyTable WITH (NOLOCK) WHERE MyColumn >= 12345 AND MyColumn < 23456)
DELETE TOP (1000) MyTable WHERE MyColumn >= 12345 AND MyColumn < 23456 ;
Command #2:
WHILE EXISTS (SELECT * FROM MyTable WITH (NOLOCK) WHERE MyColumn >= 23456 AND MyColumn < 34567)
DELETE TOP (1000) MyTable WHERE MyColumn >= 23456 AND MyColumn < 34567 ;
If I see problems, I can easily cancel these WHILE EXISTS DELETE TOP (N) commands. Because these looping structures issue multiple small, frequent commits, any rollback of the currently running DELETE TOP (N) command is also small. If you were midstream in a single huge DELETE command with no TOP (N) and had to cancel, the necessary rollback could take a very long time.
I would love to see scripts that you might have or scripts that your other readers have related to this discussion.
Again thanks for sharing these very useful ideas related to VLDBs.
Stephen Horne
SQLSoldier. Sorry for the follow up. It looks like WordPress mangled my post somewhat. Not sure what happened. Following is the chunk that is mangled. If it mangles it again and you are interested I can send via email? Thanks, Stephen
Command #1:
WHILE EXISTS (SELECT * FROM MyTable WITH (NOLOCK) WHERE MyColumn >= 12345 AND MyColumn < 23456)
DELETE TOP (1000) MyTable WHERE MyColumn >= 12345 AND MyColumn < 23456 ;
Command #2:
WHILE EXISTS (SELECT * FROM MyTable WITH (NOLOCK) WHERE MyColumn >= 23456 AND MyColumn < 34567)
DELETE TOP (1000) MyTable WHERE MyColumn >= 23456 AND MyColumn < 34567 ;
If I see problems… etc
SQLSoldier
Thanks for the feedback, Stephen!! I’m seeing your posts just fine. Nothing is mangled. What does it look like to you?
I’m going to add in a little more info on this. Thanks for the suggestion!!
Stephen Horne
Hi Robert,
The mangling is in the Command #1 clause above. I think it is the less thans and greater thans that are messing things up a bit. WordPress is possibly seeing an html element somehow. Should read like the following (will use GT and LT instead of the characters).
—-
Command #1:
WHILE EXISTS (SELECT * FROM MyTable WITH (NOLOCK) WHERE MyColumn GT= 12345 AND MyColumn LT 23456)
DELETE TOP (1000) MyTable WHERE MyColumn GT= 12345 AND MyColumn LT 23456 ;
Command #2:
WHILE EXISTS (SELECT * FROM MyTable WITH (NOLOCK) WHERE MyColumn GT= 23456 AND MyColumn LT 34567)
DELETE TOP (1000) MyTable WHERE MyColumn GT= 23456 AND MyColumn LT 34567 ;
When I am doing something like this, I frequently use Adam Machanic’s sp_whoisactive to monitor and help ensure that I am not causing problems. I also use sp_spaceused on MyTable and a query like the following to monitor progress.
SELECT COUNT(*)
FROM MyTable WITH (NOLOCK)
WHERE MyColumn GT= 23456 AND MyColumn LT 34567 ;
If I see problems… etc
Shaun
I’ve been waiting all week for this article! I’m in a situation where I need to delete data from two huge tables – 60 million+ rows each and growing by 80K rows per day. Trying to find the best way. Been using your version of the “hard method”, but it’s still pretty slow. I’d love to see an example of using partitions to delete data.
SQLSoldier
Is the table partitioned? If so, you switch out the whole partition that you want to delete into an empty table and then just drop the table or truncate it.
Shaun
It’s not currently partitioned 🙁 But that’s what I was thinking I would do. Does adding partitions to an unpartitioned table cause a big disk IO hit if the partitions are in the same filegroup? (running SQL 2005, btw).
SQLSoldier
Yes. Unfortunately, it is a huge IO hit. It has to move the data to different files. Obviously, it’s best to plan for partitioning in the design phase, but that is so often not the cards we are dealt. Unfortunately, people often let it go because it will be such a big pain to implement. Even if they set up the existing data as a single partition and then start adding partitions at set time intervals, the problem will at least not keep getting worse.
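For what it’s worth, a rough sketch of that “single partition now, new partitions going forward” approach might look like the following. The partition function, scheme, boundary dates, and table name are all made up, and on SQL 2005 you would still need to rebuild the clustered index on the partition scheme to move the existing data onto it.

-- Hypothetical sketch: one boundary point, so all existing data stays in a single partition.
CREATE PARTITION FUNCTION pfByMonth (datetime)
    AS RANGE RIGHT FOR VALUES ('2011-08-01');

-- Keep every partition on the same filegroup to avoid moving data between files.
CREATE PARTITION SCHEME psByMonth
    AS PARTITION pfByMonth ALL TO ([PRIMARY]);

-- Rebuilding the clustered index on the scheme is what moves the table onto it
-- (this is the big IO hit you want to schedule carefully).
-- CREATE CLUSTERED INDEX IX_BigTable_TransactionDate
--     ON dbo.BigTable (TransactionDate) WITH (DROP_EXISTING = ON)
--     ON psByMonth (TransactionDate);

-- Going forward, split off a new partition each month so new data lands in its own partition.
ALTER PARTITION SCHEME psByMonth NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfByMonth() SPLIT RANGE ('2011-09-01');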