If you would like to participate in this month’s blog party, go to Kennie’s invitational blog post: Announcing T-SQL Tuesday #88 – The daily (database-related) WTF.
You did what to your database mirroring?
I was working on an operations team at Microsoft when I got a call at 5 AM from one of the service engineers. I was in the shower when he called, but I checked the message as soon as I got out. The message stated, “We want to fail over our mirrored database to do some planned maintenance. I already dropped mirroring and recovered the mirror database. What is the next step?”
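For the record, a planned failover of a synchronized mirroring session is a single command run on the principal. Dropping the session and recovering the mirror, which is what he had done, breaks mirroring entirely and forces a full re-initialization. A minimal sketch, using a hypothetical database name:

-- Planned failover of a synchronized (high-safety) mirroring session.
-- Run on the current principal; the roles swap and the session stays intact.
-- "SalesDB" is a hypothetical name used only for illustration.
ALTER DATABASE SalesDB SET PARTNER FAILOVER;

-- Roughly what was done instead: break the session and recover the mirror.
-- After this, mirroring has to be rebuilt from fresh full and log backup restores.
-- ALTER DATABASE SalesDB SET PARTNER OFF;      -- run on the principal
-- RESTORE DATABASE SalesDB WITH RECOVERY;      -- run on the mirror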
Now, bear in mind that when they set up mirroring initially, I gave them written documentation on how to do failovers and other routine processes. Needless to say, they had to set up mirroring again from scratch before they could fail over, and since it was a fairly large database, it took them about 3 hours. I talked one of the contractors on his team through the process. His only participation in it was causing the original outage.

Here’s the real WTF moment in all this. A few months later, he got an award for his “quick and decisive action” in fixing the outage. Of course, he proudly accepted the award. That was when I knew it was time to leave that team. When the biggest screw-ups on the team are happily accepting credit for others fixing the problems they create, it’s no longer an effective team.
In what way is that not a big deal?
At a former employer, the company had years earlier bought another large company, and the 2 sides of the business were still operating separately. Each line of business (LOB) had its own operations team and DBA team, and the 2 sides never mixed. A little less than a year after I started working there, financial troubles caused the company to try some drastic measures (in their mind) to save money. One of the measures was to merge the 2 DBA teams into 1 team.
The 2 DBA teams managed approximately the same number of databases, but one team ran very smoothly with very few issues and only 3 DBAs. This was the team of which I was a member. The other team had 8 DBAs and was plagued by frequent code rollbacks. If you guessed that the small, well-run team was merged into the large, hectic team, you are correct.

From the day we started the merge until the day I left, there was a WTF moment almost every day. One of the biggest issues I had with my new manager (the whole list is too long for a single blog post) was that he treated his team like unskilled laborers. He did not support them or back them up. They were expected to do anything the dev teams asked them to do, any time of day or night. Holidays, weekends, whenever. If the dev teams asked to have some code deployed, the DBAs were expected to do it within half an hour during the day or within an hour any other time. They would often get 2 or 3 deployment requests overnight. 1 AM, 2 AM, 4 AM, whatever. And the person on call not only did it within an hour every time, but was also expected to be in the office for their normal 40 hours that week.
I spoke out quite loudly to anyone who would listen about this. In a team meeting that the manager did not show up for, I told the guys that they should not be doing that. Here’s the real WTF moment. One of the guys said, “When you’re on call, you just set your alarm to go off every hour to check for emails. It’s no big deal.”
Frankly, I was dumbstruck that the team members had not only been conditioned to put up with these outrageous expectations, but had come to believe that waking up every hour to check for emails was no big deal. I flat-out refused to do it. I told the manager, the manager’s manager, and the VP from the very beginning that there was no way I was doing that. I practically dared them to fire me, and I told everyone else they should do the same.
It’s not the SAN!
Here’s another story from my Microsoft days. We were experiencing major performance problems with our storage. We had migrated from an old dedicated SAN to a new shared SAN several months earlier because we were promised better performance on the new SAN. We were the first to migrate to it, and when we initially migrated, performance was great. After a few months, performance started slipping, and it eventually reached a point where it no longer met our minimum requirements.
We complained to the on-site team, who said that the storage team insisted our performance metrics looked great on the SAN. This went back and forth repeatedly, with us providing performance metrics from our side showing severe latency and the storage engineer always coming back with “it’s not a SAN problem because everything looks great on the back end.” Meanwhile, performance just continued to degrade.
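To give a sense of the kind of evidence we kept sending, a query along these lines (a minimal sketch; the exact columns and thresholds we used may have differed) reports average read and write latency per database file from SQL Server’s own I/O statistics:

-- Average I/O latency per database file since the last SQL Server restart.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.num_of_writes,
       CASE WHEN vfs.num_of_reads = 0 THEN 0
            ELSE vfs.io_stall_read_ms / vfs.num_of_reads END AS avg_read_latency_ms,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id = vfs.file_id
ORDER BY avg_read_latency_ms DESC;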
It got to the point that we were insisting they let us bring in someone from EMC to look at the SAN, because there was no way that things could look good on the SAN side. They resisted for obvious reasons, and we had to get a VP to sign off on allowing the EMC person to come in and check the SAN. They were going to be forced to allow it.

Clearly, this raised some hackles on the storage team, and one of the other storage admins decided to look at it himself. Sure enough, he was able to spot the discrepancy. The SAN admin had been looking at the old SAN that we were no longer using, not the one we were actually on. How he wasn’t able to see that he was looking at a completely unused SAN (remember, I said it was dedicated to us) is beyond me, but of course the metrics looked great. It was unused. Once they looked at the correct SAN, they could see that it was now overloaded. They moved some things around and made some configuration changes, such as increasing our queue depth and the number of paths to the SAN.
He said what?
This last WTF story is the biggest one of all. This one resulted in someone losing their job, and rightly so. At the company where I grew into my first DBA role, our systems admin left the company, and the guy they hired to replace him seemed to really know his stuff. Hint: if a story starts out with someone “really knowing their stuff”, it’s not going to end up that way.
The systems admin said from the beginning that he knew nothing about SQL Server replication, yet he quickly decided that he knew more about it than anyone. He came to the conclusion that it was the root of all of our problems, and that if we did away with replication, everything would run perfectly.
This was in 2005, and SQL Server 2005 had just been released. I had signed up for a 3-day training class on the new features of SQL Server 2005. I was going to be out of the office Monday through Wednesday. The training class was right down the road from the office, so at the end of the day on Tuesday, I dropped by the office. I don’t remember why. The systems admin seemed happy to see me, and he wanted to run an idea he had by me.
He showed me his great plan to stop using SQL Server replication and replace it with our own home-built data replication. He wanted me to sign off on his plan and give him an estimate of how many days it would take to completely write our own replication. He was thinking 2 or 3 days of spare time. I told him that replication was not the root of our problems, that it would be a massive project to try to write our own merge replication, and that it would take a lot of time. I had no idea how much exactly, but lots and lots of dedicated time just to attempt it.
On Thursday, when I was back in the office, he came to see me a few minutes before 2 PM. He said that he was about to start a meeting with the ownership committee (from the company that owned our company) and thought I might want to be there in case the database came up. So I joined the meeting with no preparation or any idea of what the agenda was. He got up in front of the ownership committee and told them that our problems were all caused by SQL Server replication because “it locks the database while it is running”. He wanted to get their buy-off on his plan to replace replication. He never got that chance.

I interrupted him and explained that what he was saying was 100% false, because replication runs nonstop 24 hours a day, and if it were locking the database, we wouldn’t be able to get any data into it at all. He tried to say that it queues up and gets in between locks, but I told him and everyone there that that’s not how replication works. Not even close. In actuality, it was much less cordial than I’m making it sound.
As I was coming back from the meeting, the CTO called me into his office as I walked by. He asked me if the systems admin had told me about his plan to replace replication. I told him he had, and that I had told the systems admin it wasn’t feasible nor would it address any of our problems. While I was away at the training class, the systems admin had told the CTO and CEO that I had signed off on his plan and told him that it would take at most 40 hours to rewrite replication. I told the CTO exactly what I had actually told the systems admin, and then told him about the meeting I had just left. The CTO and CEO were not aware that this meeting was occurring.
That was the systems admin’s last day there. When I left the CTO’s office, they fired him for flat-out lying to them about the plan and for going directly to the ownership committee after they had told him not to do anything until they could talk to me directly about what he wanted to do.