Troubleshooting Access Violation caused by Database Backup in SQL Server 2008/R2   2 comments

I recently worked on an interesting case where reportedly the log backup for a database was ‘hung’. The troubleshooting and resolution was quite interesting making me to write this blog.

It was SQL Server 2008 (SP2) – 10.0.4000 where a user had initiated transaction log backup for a database [db1]. The log backup never completed. User found the spid in a ‘hung’ state hence killed the spid. The spid was now stuck in a “killed/rollback” state since last one week.

I’d typically start troubleshooting such issues by looking at the ERRORLOG. However, the issue originally started a week back and relevant ERRORLOG was been overwritten. Fortunate enough, I found a symptom dump (SQLDump0001.txt) and a minidump file (SQLDump0001.mdmp) in \LOG folder generated that day.

Looking at the symptom dump file, I could clearly see it was an AV (Access Violation)

SqlDumpExceptionHandler: Process 3162 generated fatal exception c0000005 EXCEPTION_ACCESS_VIOLATION. SQL Server is terminating this process.
* *******************************************************************************
*   05/17/12 11:43:06 spid 3162
*   Exception Address = 0000000001060136 Module(sqlservr+0000000000060136)
*   Exception Code    = c0000005 EXCEPTION_ACCESS_VIOLATION
*   Access Violation occurred writing address 0000000000000008
* Input Buffer 498 bytes -
*             BACKUP LOG [db1] TO  DISK = N'X:\Backups\db1\db1_backup.trn' WITH NOFORMA
*  T, NOINIT,  NAME = db1_backup', SKIP , REWIND, NOUNLOAD,  STATS = 10

As soon as the tlog backup started for the said database, an AV occurred. Additionally, I could see lots of errors like below. These message indicate the backup spid is waiting to acquire a latch on ‘BACKUP_OPERATION’.

2012-05-24 01:20:56.095 spid365      Timeout occurred while waiting for latch: class 'BACKUP_OPERATION', id 0000000099665208, type 4, Task 0x0000000005413048 : 0, waittime 66300, flags 0x1a, owning task 0x000000000634C988. Continuing to wait.

Notice, I have got the backup spid (spid365) above. Let’s look at sys.sysprocesses to see what this spid is waiting for:

spid   kpid   cmd              status                         waittime             waittype waitresource
------ ------ ---------------- ------------------------------ -------------------- -------- ------------------------------------
365    9496   KILLED/ROLLBACK  suspended                      20682                0x0024   BACKUP_OPERATION (0000000099665208)
365    5716   KILLED/ROLLBACK  spinloop                       0                    0x0000

(2 row(s) affected)

I find two workers (9496 and 5716) for this backup spid 365. worker 9496 is the one waiting for BACKUP_OPERATION. Another worker for the same spid changes its state from spinloop to runnable if I repeatedly query the status. That seems to be an after-effect of AV.

Okay, so now is the time to analyse the minidump file to know what exactly is going on here as I cannot conclude much from the above messages.

If you’re quite new to the “dumps” world, I suggest reading these blogs which will get you started.

I loaded the dump using windbg, changed the context to the thread which raised the exception (using .ecxr)

0:049> .ecxr
rax=0000000000000000 rbx=000000002760b650 rcx=0000000000000000
rdx=00000000069f81a0 rsi=0000000000000001 rdi=0000000027609f40
rip=0000000001060136 rsp=0000000027609f10 rbp=0000000027609f48
r8=0000000000000002  r9=0000000000000004 r10=0000000000000018
r11=0000000000000000 r12=0000000000000002 r13=0000000000000000
r14=0000000001000000 r15=00000000045c0028
iopl=0         nv up ei pl nz na po nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010206
00000000`01060136 48894808        mov     qword ptr [rax+8],rcx ds:00000000`00000008=????????????????

Now, run kc to see the stack of the thread which raised this AV:

0:049> kc
*** Stack trace for last set context - .thread/.cxr resets it
Call Site
sqlservr!EventInternal<Spinlock >::Wait

Looking at the highlighted frames, it appears that it needed access to DEK to decrypt the encrypted pages for the backup operation to proceed. DEK i.e. Database Encryption Key? Means Encryption involved here? I quickly looked at is_encrypted column value in master.sys.databases for the database in question and I find it has TDE enabled on it.

is_encrypted name
------------ -----------------------------
1            db1

So, now I have enough keywords to do a quick search for this issue. I searched internet for “SQL Server 2008 Transparent Database Encryption Backup Access Violation” and the first search result is the solution.

FIX: An access violation may occur when a database is enabled for Transparent Database Encryption in SQL Server 2008 or in SQL Server 2008 R2

The KB article clearly explains that the issue occurs because of an incorrect synchronization between the SQL Server threads when the Database Encryption key (DEK) that is used for Transparent Database Encryption is accessed. It lists the fixes for SQL 2008/R2. Also, the workaround is to manually issue a CHECKPOINT before performing a Backup operation on TDE enabled databases. Another thing to note – The issue may occur with any sort of backup, be it Full, Log or Differential.

There is another KB article describing a somewhat similar issue but that occurs due to non-default IO affinity settings, unrelated to the issue I explained above.

I have attempted to detail the approach here as how I narrowed down the issue and found the related KB article with the fix. The KB article is not very searchable otherwise. I believe Microsoft could have added some more information (e.g. error from the Errorlog, waittype BACKUP_OPERATION etc) so that the KB article becomes easily searchable and not everyone needs to take the lengthy route of looking into dumps etc.

I have tried to include enough information in this blog post so that someone facing the similar issue may easily spot the root-cause and the fix. Hope that helps someone. Few more debugging blogs on the way, stay tuned!

2 responses to “Troubleshooting Access Violation caused by Database Backup in SQL Server 2008/R2

Subscribe to comments with RSS.

  1. Thanks for this post !!!

  2. Pingback: How To Backup Sql Server 2008 R2 Database | Information

Leave a Comment

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: