I recently worked on an interesting case where reportedly the log backup for a database was ‘hung’. The troubleshooting and resolution was quite interesting making me to write this blog.
It was SQL Server 2008 (SP2) – 10.0.4000 where a user had initiated transaction log backup for a database [db1]. The log backup never completed. User found the spid in a ‘hung’ state hence killed the spid. The spid was now stuck in a “killed/rollback” state since last one week.
I’d typically start troubleshooting such issues by looking at the ERRORLOG. However, the issue originally started a week back and relevant ERRORLOG was been overwritten. Fortunate enough, I found a symptom dump (SQLDump0001.txt) and a minidump file (SQLDump0001.mdmp) in \LOG folder generated that day.
Looking at the symptom dump file, I could clearly see it was an AV (Access Violation)
SqlDumpExceptionHandler: Process 3162 generated fatal exception c0000005 EXCEPTION_ACCESS_VIOLATION. SQL Server is terminating this process. * ******************************************************************************* * BEGIN STACK DUMP: * 05/17/12 11:43:06 spid 3162 * Exception Address = 0000000001060136 Module(sqlservr+0000000000060136) * Exception Code = c0000005 EXCEPTION_ACCESS_VIOLATION * Access Violation occurred writing address 0000000000000008 * Input Buffer 498 bytes - * BACKUP LOG [db1] TO DISK = N'X:\Backups\db1\db1_backup.trn' WITH NOFORMA * T, NOINIT, NAME = db1_backup', SKIP , REWIND, NOUNLOAD, STATS = 10
As soon as the tlog backup started for the said database, an AV occurred. Additionally, I could see lots of errors like below. These message indicate the backup spid is waiting to acquire a latch on ‘BACKUP_OPERATION’.
2012-05-24 01:20:56.095 spid365 Timeout occurred while waiting for latch: class 'BACKUP_OPERATION', id 0000000099665208, type 4, Task 0x0000000005413048 : 0, waittime 66300, flags 0x1a, owning task 0x000000000634C988. Continuing to wait.
Notice, I have got the backup spid (spid365) above. Let’s look at sys.sysprocesses to see what this spid is waiting for:
spid kpid cmd status waittime waittype waitresource ------ ------ ---------------- ------------------------------ -------------------- -------- ------------------------------------ 365 9496 KILLED/ROLLBACK suspended 20682 0x0024 BACKUP_OPERATION (0000000099665208) 365 5716 KILLED/ROLLBACK spinloop 0 0x0000 (2 row(s) affected)
I find two workers (9496 and 5716) for this backup spid 365. worker 9496 is the one waiting for BACKUP_OPERATION. Another worker for the same spid changes its state from spinloop to runnable if I repeatedly query the status. That seems to be an after-effect of AV.
Okay, so now is the time to analyse the minidump file to know what exactly is going on here as I cannot conclude much from the above messages.
If you’re quite new to the “dumps” world, I suggest reading these blogs which will get you started.
I loaded the dump using windbg, changed the context to the thread which raised the exception (using .ecxr)
0:049> .ecxr rax=0000000000000000 rbx=000000002760b650 rcx=0000000000000000 rdx=00000000069f81a0 rsi=0000000000000001 rdi=0000000027609f40 rip=0000000001060136 rsp=0000000027609f10 rbp=0000000027609f48 r8=0000000000000002 r9=0000000000000004 r10=0000000000000018 r11=0000000000000000 r12=0000000000000002 r13=0000000000000000 r14=0000000001000000 r15=00000000045c0028 iopl=0 nv up ei pl nz na po nc cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010206 sqlservr!LatchBase::UnpendEligibleWaiters+0x196: 00000000`01060136 48894808 mov qword ptr [rax+8],rcx ds:00000000`00000008=????????????????
Now, run kc to see the stack of the thread which raised this AV:
0:049> kc *** Stack trace for last set context - .thread/.cxr resets it Call Site sqlservr!LatchBase::UnpendEligibleWaiters sqlservr!LatchBase::ReleaseInternal sqlservr!BPool::ReadPageCompletion sqlservr!FCB::IoCompletion sqlservr!IOQueue::CheckForIOCompletion sqlservr!SOS_Scheduler::SwitchContext sqlservr!SOS_Scheduler::SuspendNonPreemptive sqlservr!EventInternal<Spinlock >::Wait sqlservr!LatchBase::Suspend sqlservr!BUF::AcquireLatch sqlservr!BPool::Get sqlservr!BTreeMgr::Seek sqlservr!IndexPageManager::GetPageWithKey sqlservr!GetRowForKeyValue sqlservr!IndexDataSetSession::GetRowByKeyValue sqlservr!IndexDataSetSession::FetchRowByKeyValueInternal sqlservr!RowsetNewSS::FetchRowByKeyValue sqlservr!CMEDScan::StartSearch sqlservr!CMEDCatObfusKey::FLocateCertRow sqlservr!CMEDCatObfusKey::GetCertBytes sqlservr!CMEDProxyCertificate::GetCertBytes sqlservr!CSECCertificate::InitFromMD sqlservr!CSECDEK::DecryptCurrDEK sqlservr!CSECDEK::SetupAfterLoadFromDisk sqlservr!BackupFileDesc::InitializeDEK sqlservr!SQL_FILE_INFO::Write sqlservr!BackupMedium::WriteDescription sqlservr!BackupMedium::WriteLeadingMetadata sqlservr!BackupStream::ThreadMainRoutine sqlservr!BackupThread::ThreadBase sqlservr!SubprocEntrypoint sqlservr!SOS_Task::Param::Execute sqlservr!SOS_Scheduler::RunTask sqlservr!SOS_Scheduler::ProcessTasks sqlservr!SchedulerManager::WorkerEntryPoint sqlservr!SystemThread::RunWorker sqlservr!SystemThreadDispatcher::ProcessWorker sqlservr!SchedulerManager::ThreadEntryPoint msvcr80!endthreadex msvcr80!endthreadex kernel32!BaseThreadStart
Looking at the highlighted frames, it appears that it needed access to DEK to decrypt the encrypted pages for the backup operation to proceed. DEK i.e. Database Encryption Key? Means Encryption involved here? I quickly looked at is_encrypted column value in master.sys.databases for the database in question and I find it has TDE enabled on it.
is_encrypted name ------------ ----------------------------- 1 db1
So, now I have enough keywords to do a quick search for this issue. I searched internet for “SQL Server 2008 Transparent Database Encryption Backup Access Violation” and the first search result is the solution.
FIX: An access violation may occur when a database is enabled for Transparent Database Encryption in SQL Server 2008 or in SQL Server 2008 R2
The KB article clearly explains that the issue occurs because of an incorrect synchronization between the SQL Server threads when the Database Encryption key (DEK) that is used for Transparent Database Encryption is accessed. It lists the fixes for SQL 2008/R2. Also, the workaround is to manually issue a CHECKPOINT before performing a Backup operation on TDE enabled databases. Another thing to note – The issue may occur with any sort of backup, be it Full, Log or Differential.
There is another KB article describing a somewhat similar issue but that occurs due to non-default IO affinity settings, unrelated to the issue I explained above.
I have attempted to detail the approach here as how I narrowed down the issue and found the related KB article with the fix. The KB article is not very searchable otherwise. I believe Microsoft could have added some more information (e.g. error from the Errorlog, waittype BACKUP_OPERATION etc) so that the KB article becomes easily searchable and not everyone needs to take the lengthy route of looking into dumps etc.
I have tried to include enough information in this blog post so that someone facing the similar issue may easily spot the root-cause and the fix. Hope that helps someone. Few more debugging blogs on the way, stay tuned!