The error they were receiving was:
Msg 0, Level 11, State 0, Line 8
A severe error occurred on the current command. The results, if any, should be discarded.
This is a generic error that simply means the connection was terminated or lost and if any results were received back, they may be incomplete and you should not use them. It doesn’t tell you what the underlying error was.
I suspected that there was some difference between the queries that failed and the ones that succeeded. I ran the query they gave me in SSMS and got the same error: the connection was terminated and no further error information was returned. I also verified that the same query ran successfully against the other two tables mentioned.
I wanted to know what error was causing the connection to be terminated, so I checked the SQL Server error log and discovered that every time the query failed, it was generating a stack dump. Before I was done investigating, it had generated 21 stack dumps. The key human-readable error info in the log was:
* Exception Address = 00007FF93BCF7E08 Module(sqlmin+00000000001E7E08)
* Exception Code = c0000005 EXCEPTION_ACCESS_VIOLATION
* Access Violation occurred reading address 0000000000000000
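A quick way to find those entries is to search the error log for the dump marker. A minimal sketch using the undocumented but widely used sys.xp_readerrorlog procedure (parameters: log file number, log type where 1 = SQL Server error log, and a search string):

```sql
-- Search the current SQL Server error log for stack dump entries
EXEC sys.xp_readerrorlog 0, 1, N'SQLDump';
```

Each matching row points at a .mdmp/.txt pair in the instance's LOG directory with the full dump details.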
I also learned that there were two different queries against this table throwing stack dumps. I tried some other variations of the queries, and some succeeded while others failed.
This would fail:
Select Distinct Column1, Column2, Column3 From dbo.Table;
But this would not:
Select Column1, Column2, Column3 From dbo.Table;
And this would fail:
Select * From dbo.Table Where Column4 = 'a string';
But this would not:
Select * From dbo.Table;
I suspected that the failing queries were using a different index than the successful ones, and that figuring out which one would help me determine the issue. As a sanity check, I ran DBCC CHECKTABLE on the table, and it returned no corruption; if an index were corrupt, it should have reported that. Since there was no detectable corruption to go by, I decided to compare the execution plans to identify the differences. The attempt to generate estimated execution plans also failed with the same errors on the same queries.
Unlike my hairline, the plot thickened. There is no need to read the data to generate an estimated plan, so the problem wasn't with the data; it was something to do with the metadata. My suspicion immediately went to statistics, since statistics are used to generate the plan without needing to touch the indexes. Also, in SQL Server 2014, a readable secondary replica can automatically generate temporary column statistics. If the problem was with these automatically generated, temporary column statistics, that would also explain why the queries all succeeded on the primary node. Column statistics created on the primary are stored in the database itself and flow to the secondary databases; temporary column statistics created on a readable secondary are stored in tempdb and are not transferred back to the primary. You can identify these statistics in sys.stats, where auto_created = 1, is_temporary = 1, and the name ends with _readonly_database_statistics.
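A query along these lines, run on the readable secondary, will list those temporary statistics (dbo.Table is a placeholder for the affected table):

```sql
-- List the auto-created, temporary column statistics on a readable
-- secondary for the problem table (name is a placeholder)
SELECT s.name, s.auto_created, s.is_temporary
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID(N'dbo.Table')
  AND s.auto_created = 1
  AND s.is_temporary = 1;
```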
As I said earlier, DBCC CHECKTABLE came back clean, as did DBCC CHECKDB. Since these statistics are stored in tempdb, I ran DBCC CHECKDB on tempdb as well, and it returned no errors either. Still, I suspected that one or more of the temporary stats were corrupted. I dropped all temporary stats on the table having errors, and all queries and query plan compilations on that table started working. It was definitely something wrong with the temporary stats, but was it corruption? Maybe DBCC CHECKxxx doesn't check the temporary stats.
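A sketch of how you might drop them in bulk: generate one DROP STATISTICS statement per temporary stat, then review and run the output (dbo.Table is again a placeholder).

```sql
-- Generate DROP STATISTICS commands for the temporary, auto-created
-- stats on the problem table; inspect the output before executing it
SELECT N'DROP STATISTICS '
       + QUOTENAME(OBJECT_SCHEMA_NAME(s.object_id)) + N'.'
       + QUOTENAME(OBJECT_NAME(s.object_id)) + N'.'
       + QUOTENAME(s.name) + N';'
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID(N'dbo.Table')
  AND s.auto_created = 1
  AND s.is_temporary = 1;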
I sent an email to the MVP distribution group, and Paul Randal (blog|@PaulRandal) verified that DBCC CHECKxxx does not check any statistics blobs at all. So it isn't just temporary statistics; no statistics are checked at all. Paul indicated that it would be possible to do so, but that the team never implemented that particular check.
I shared one of the dump files with an escalation engineer at Microsoft, and he confirmed that it appeared to be hitting corrupt temporary stats. A bug has been filed, and the SQL team will be working on finding a root cause and a fix for the issue. In the meantime, if you encounter this bug, you can work around it by dropping the statistics with the DROP STATISTICS command.
The Microsoft engineer also said that I could confirm whether a stats blob is corrupt by running DBCC SHOW_STATISTICS against it. Unfortunately, I had already dropped the corrupted temporary stats, so I wasn't able to verify that. If it happens again, I can use it to identify exactly which stats blobs are corrupted and drop only those, rather than dropping all of them on the table (I had dropped 38). I did use the command to check all the other stats in the database and validated that no other corrupted stats existed.
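One way to run that check across every statistic on a table is to build the DBCC SHOW_STATISTICS calls dynamically; a corrupt blob should raise an error when its header is read. A sketch, with dbo.Table as a placeholder:

```sql
-- Run DBCC SHOW_STATISTICS for every statistic on the table;
-- a corrupt stats blob should error out here
DECLARE @sql nvarchar(max) = N'';

SELECT @sql += N'DBCC SHOW_STATISTICS (''dbo.Table'', '
            + QUOTENAME(s.name)
            + N') WITH STAT_HEADER;' + CHAR(13)
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID(N'dbo.Table');

EXEC sys.sp_executesql @sql;
```

To sweep a whole database, you could wrap the same idea in a loop over sys.tables, checking each table's stats in turn.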