SQL Server 2008 backup hung on a strange wait PREEMPTIVE_CLUSAPI_CLUSTERRESOURCECONTROL   Leave a comment


This is another interesting case where SQL Server 2008 backup appears to be hung. The client reported that he has a maintenance plan to take full backup of all the user databases. Apparently backup for some of the databases succeed while for some other the backup just ‘hangs’. At times, SQL Server reports non-yielding scheduler and generates a stack dump as well.

Now that makes it interesting as why the backup hangs only for a hands full of databases while succeeds for the rest. I started narrowing this down by comparing the two jobs and found that the option “Verify backup integrity” was checked for the backup job which appears to hang. To confirm that “Verify backup integrity” is really what is causing the issue, I created a test job for one database backup with “Verify backup integrity” turned on. When I run this job, backup really hung. And so the further troubleshooting begins..

Looking at sys.sysprocesses I see the command being executed was RESTORE HEADERON  and waittype shows as PREEMPTIVE_CLUSAPI_CLUSTERRESOUR

Select * from master..sys.sysprocesses Where spid = 91
spid   kpid   blocked waittype waittime                lastwaittype                            program_name                                    cmd
------ ------ ------- -------- --------------------    ------------------------------------     ---------------------------------------             -------------------
91    6508        0     0x0181 141829660               PREEMPTIVE_CLUSAPI_CLUSTERRESOUR           Microsoft SQL Server Management Studio          RESTORE HEADERON

 
I then queried inputbuffer to see the complete query being run:


dbcc inputbuffer(91)

-- Here is the output

declare @backupSetId as int

select @backupSetId = position from msdb..backupset

where database_name=N'Hung_Backup_Db' and backup_set_id=(select max(backup_set_id)

from msdb..backupset where database_name=N' Hung_Backup_Db' ) if @backupSetId is null

begin raiserror(N'Verify failed. Backup information for database '' Hung_Backup_Db'' not found.', 16, 1)

end

RESTORE VERIFYONLY FROM DISK = N'D:\Backup\Hung_Backup_Db_backup.bak'

WITH FILE = @backupSetId, NOUNLOAD, NOREWIND

So, it is running Restore Verifyonly since “Verify Backup Integrity” is checked. I can reproduce this issue by just running the above mentioned Restore Verifyonly command. The spid again seems to stuck on PREEMPTIVE_CLUSAPI_CLUSTERRESOUR.

A couple of things to note here:

  1.  The session performing the backup (or running restore headeronly) is neither blocked nor blocking any other  session.
  2. The full wait type is PREEMPTIVE_CLUSAPI_CLUSTERRESOURCECONTROL which is one of the pre-emptive waittypes visible in SQL Server 2008 onwards. By name, it suggests some sort of wait on cluster APIs and/or resources.

With this in mind, I took a manual dump of this spid while it was hung using dbcc stackdump (-1,91)
Here is the call stack from the dump:

Child-SP RetAddr Call Site
00000000`1c34a5a8 000007ff`7fd6e146 ntdll!ZwRequestWaitReplyPort+0xa
00000000`1c34a5b0 000007ff`7fd6e37f rpcrt4!LRPC_CCALL::SendReceive+0x447
00000000`1c34a690 000007ff`7fe9d4c8 rpcrt4!NdrSendReceive+0xf6
00000000`1c34a6c0 000007ff`7fe9d5bd rpcrt4!NdrpClientCall2+0x7e2
00000000`1c34acb0 000007ff`7b3b6349 rpcrt4!NdrClientCall2+0x1d
00000000`1c34ace0 000007ff`7b3a9e1b clusapi!ApiResourceControl+0x59
00000000`1c34ad40 000007ff`7b3a8233 clusapi!ClusterpResourceControl+0x360
00000000`1c34ae00 00000000`0292f80e clusapi!ClusterResourceControl+0x33
00000000`1c34ae50 00000000`0293a0d6 sqlservr!FClusMgr::ClusterResourceIsDiskType+0x7e
00000000`1c34bf20 00000000`02947583 sqlservr!FClusMgr::VerifyDriveInClusterGroup+0x186
00000000`1c34cfa0 00000000`02f84a8f sqlservr!FileMgr::ValidateFileName+0x3a3
00000000`1c34d880 00000000`0386f8df sqlservr!BackupFileList::GenerateVolumeUsageList+0x3f
00000000`1c34d8b0 00000000`03870b70 sqlservr!BackupOperation::VerifyBackupSet+0x1af
00000000`1c34edc0 00000000`038744bd sqlservr!BackupEntry::VerifyBackupSet+0xf0
00000000`1c34ef90 00000000`01079616 sqlservr!CStmtLoadVol::XretExecute+0xed
00000000`1c34f0d0 00000000`0107abbb sqlservr!CMsqlExecContext::ExecuteStmts<1,1>+0x55a
00000000`1c34f380 00000000`0107a4d9 sqlservr!CMsqlExecContext::FExecute+0x58b
00000000`1c34f500 00000000`0107d10b sqlservr!CSQLSource::Execute+0x319
00000000`1c34f630 00000000`01077c9b sqlservr!process_request+0x370
00000000`1c34f8f0 00000000`0101163b sqlservr!process_commands+0x2b2
00000000`1c34faf0 00000000`010112fa sqlservr!SOS_Task::Param::Execute+0x11b
00000000`1c34fc10 00000000`01010e35 sqlservr!SOS_Scheduler::RunTask+0xca
00000000`1c34fca0 00000000`01590e50 sqlservr!SOS_Scheduler::ProcessTasks+0x95
00000000`1c34fd10 00000000`015909a0 sqlservr!SchedulerManager::WorkerEntryPoint+0x110
00000000`1c34fdd0 00000000`015a9ab0 sqlservr!SystemThread::RunWorker+0x60
00000000`1c34fe00 00000000`0159013f sqlservr!SystemThreadDispatcher::ProcessWorker+0x12c
00000000`1c34fe90 00000000`781337d7 sqlservr!SchedulerManager::ThreadEntryPoint+0x12f
00000000`1c34ff20 00000000`78133894 msvcr80!_callthreadstartex+0x17
00000000`1c34ff50 00000000`77d6b71a msvcr80!_threadstartex+0x84
00000000`1c34ff80 00000000`00000000 kernel32!BaseThreadStart+0x3a

From the above stack, we see that spid is stuck at ZwRequestWaitReplyPort while trying to validate the disk resource types in clustersqlservr!FClusMgr::ClusterResourceIsDiskType.

Before we discuss this further, let me shed some light on how RESTORE VERIFYONLY actually works so that we know why we landed here.

According to the MSDN article on RESTORE VERIFYONLY checks performed by RESTORE VERIFYONLY include:

  •  That the backup set is complete and all volumes are readable.
  •  Some header fields of database pages, such as the page ID (as if it were about to write the data).
  •  Checksum (if present on the media).
  • Checking for sufficient space on destination devices.

As per above article one of the tasks performed during the integrity checks is to retrieve the files contained in the backup set as well as validate few properties of the disks that contain these files. In our case, it should be one of the SAN drives on which SQL Server resource depends on. Since SAN drive is a clustered disk and managed by the cluster resource, we use the cluster API’s to retrieve the information and then wait for a response back from the cluster Service. And that is why we see the wait type noticed in the sysprocesses for these sessions i.e. PREEMPTIVE_CLUSAPI_CLUSTERRESOURCECONTROL

Now, this leads us to believe that there is some issue with the cluster. When we tried to open Cluster Administrator (cluadmin.msc) which is another application talking to cluster.exe we observed that it was hung as well. The Failover Cluster Management was not responding. Now in some situation it can generate a non-yielding dump as well causing failover to other node.

So to resolve this problem we restarted the cluster service… and yayyyyyyyyyyy!! We have our backup jobs back on track!!

So, the ultimate resolution was to restart the cluster services. However, this still does not explain the root-cause as why cluster services were not responding properly. And this is something your Systems admin would help to get you an answer for.

Nonetheless, from SQL side, we know one or more of these steps are going to resolve the said issue:

  • Restart cluster service and verify if these sessions progress/end
  • Restart SQL Server
  • Reboot the node

Hope this will help you identify how to troubleshoot a hung backup provided the symptoms match as that of discussed here.

Leave a Comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: