If you missed any of the earlier posts in my DR series, you can check them out here:
- 31 Days of disaster Recovery
- Does DBCC Automatically Use Existing Snapshot?
- Protection From Restoring a Backup of a Contained Database
- Determining Files to Restore Database
- Back That Thang Up
- Dealing With Corruption in a Nonclustered Index
- Dealing With Corruption in Allocation Pages
- Writing SLAs for Disaster Recover
- Resolutions for All DBAs
- Use All the Checksums
- Monitoring for Corruption Errors
- Converting LSN Formats
- Extreme Disaster Recovery Training
- Standard Backup Scripts
- Fixing a Corrupt Tempdb
- Running DBCC CheckTable in Parallel Jobs
- Disaster Recovery Gems From Around The Net
- When are Checksums Written to a Page
- How to CHECKDB Like a Boss
- How Much Log Can a Backup Log
- The Case of the Backups That Wouldn’t Restore
- Who Deleted That Data?
- Which DBCC CHECK Commands Update Last Known Good DBCC
- Restoring Differential Backups With New Files
- Handling Corruption in a Clustered Index
- Improving Performance of Backups and Restores
Backup Performance
We had 2 databases on the server in question, the small one was 500 GB and the large one was 1.75 TB. The smaller database was basically used for authentication and only had a few hundred updates per day. As such, we rarely focused on this database very much. The main transactional database, the big one was extremely busy 24 hours a day, 7 days a week. There was no maintenance period as it handled transactions from users everywhere. Our busiest times were …. weekends. Followed next by business hours in the United States and business hours in Japan. The system was used by 30,000+ support agents around the globe. Okay, you get the picture. There was no time when it wasn’t busy.
We had highly tuned our backups. We would back up the smaller database at midnight (took less than half an hour) and the large database at 1 AM (US Pacific Time). The large database took 2 hours. We published performance metrics reports daily, and you could see that there was a small performance drop in the main database while the backup was running. From 1 AM to 3 AM, it wasn’t an issue as we had performance to spare. Over time, the backup times kept taking longer and longer. This became an issue when the backup time started taking longer than 4 hours. This put the backup completion time after 7 AM Eastern US Time which meant we were getting close to when business starting picking up again. We were still fine performance-wise in the application, but we were approaching the time that it would become a problem. Furthermore, the smaller database was now taking more than an hour to run and so it was still running when the bigger one started.
In order to maintain the size of this database, we aggressively purged data from it 4 times a day deleting support cases that were closed and had no action on them for at least 90 days. We were deleting millions of rows daily. We tracked and plotted the amount of data that we purged as well as the size of the database in our daily performance reports. There was no significant changes in either of those metrics,
We investigated the amount of activity, also in our performance reports, during the backup time frame, and no big changes there either. All performance metrics throughout the day looked completely normal. No slowness during the day, only during the backup window. We escalated it to the SAN team, and they confirmed that none of our settings on the SAN had changed and that everything looked healthy on the SAN. All SAN metrics looked good. We were on a shared SAN with many other applications, and he said that none of the others were complaining, only us.
Digging deeper, we discovered that while the backups were running, our throughput to the SAN went way down and then sometime in the 3 AM hour, throughput would return to normal. The bulk of the backup was being performed after this time. We had a theory and we needed to confirm it. We asked the SAN admin to validate the same findings on his side of the SAN.
Sure enough, the SAN was being flooded between midnight and 3 AM and bottlenecking on throughput because everyone else on the SAN was running their backups at midnight as well. We changed out backup schedule to work around this. We would back up the smaller database at 11 PM and then start the large database at 3 AM. Backup times returned to normal, and we were good again.
Summary
When you run into performance problems with your backups, it is important to look at the usual suspects first such as disk performance, activity on the server, etc. Our investigation was made much easier by having baselines of the activity that we could compare to the current levels to determine if anything had truly changed. Ultimately though, we had to trust our findings and look outside the box. We had to look outside our system at external factors that were affecting us.
Day 26 of 31 Days of Disaster Recovery: The Mysterious Case of the … « Quick Disaster Recovery.com
[…] Read the original here: Day 26 of 31 Days of Disaster Recovery: The Mysterious Case of the … […]
Day 29 of 31 Days of Disaster Recovery: Using Database Snapshots to Restore Replicated Databases in Test | SQLSoldier
[…] The Mysterious Case of the Long Backup […]
alzdba
Nice case, we acutally had the same issue a couple of years ago. Spreading loads solved it too. 🙂
Another big issue we had ( a couple of times ) was when a Windows admin changed the backup folder to be compressed. Very, very bad idea. Turning it off solved the backup issues for those cases.
Thanks again for this great series on Disaster Recovery.
SQLSoldier
Thanks for sharing your issues as well.