Non-yielding scheduler due to issue with Tcp::Close   4 comments


Another day when SQL Server became unresponsive and I was brought in to the RCA. Fortunate enough, I happened to find a KB article which matched with the problem here. However, our SQL server had all the fixes what that KB article offered to us. Now that is really, so here the story unfolds..

This was a SQL Server 2008 R2 SP1 (10.50.2796) running on Windows Server 2008 R2 SP1. From Errorlog, I saw those famous non-yielding scheduler messages which goes like this:

DateTime Server      Using 'dbghelp.dll' version '4.0.5'
DateTime Server      ***Unable to get thread context for spid 0
DateTime Server      * *******************************************************************************
DateTime Server      *
DateTime Server      * BEGIN STACK DUMP:
DateTime Server      *   DateTime spid 1234
DateTime Server      *
DateTime Server      * Non-yielding Scheduler
DateTime Server      *
DateTime Server      * *******************************************************************************
DateTime Server      Stack Signature for the dump is 0x0000000000000166
DateTime Server      External dump process return code 0x20000001.
External dump process returned no errors.

DateTime Server      Process 0:0:0 (0x1a5c) Worker 0x00000000097661A0 appears to be non-yielding on Scheduler 2. Thread creation time: 12994623549587. Approx Thread CPU Used: kernel 0 ms, user 0 ms. Process Utilization 0%. System Idle 99%. Interval: 70215 ms.
DateTime Server      Process 0:0:0 (0x1a5c) Worker 0x00000000097661A0 appears to be non-yielding on Scheduler 2. Thread creation time: 12994623549587. Approx Thread CPU Used: kernel 0 ms, user 0 ms. Process Utilization 0%. System Idle 99%. Interval: 131274 ms.

 
There were no other preceding messages of interest, nothing much in Event logs either. Focus goes straight looking into the dump file SQL Server had generated for above non-yielding scheduler. I opened windbg, set the symbol path and loaded the dump file, reload the symbols. Since I know the non-yielding thread ID 0x1a5c I changed the context to this thread ID and used my favourite k command to see the call stack of the hanging thread.

--Set the symbol path
0:049> .symfix+ D:\publicsymbols
DBGHELP: Symbol Search Path: SRV*D:\publicsymbols*http://msdl.microsoft.com/download/symbols
DBGHELP: Symbol Search Path: SRV*D:\publicsymbols*http://msdl.microsoft.com/download/symbols

--Load the symbols
0:049> .reload /f

--Change the context to the hanging thread
0:049> ~~[0x1a5c]s
ntdll!NtClose+0xa:
00000000`778d140a c3              ret

--Display the call stack of non-yielding thread
0:049> k
Child-SP          RetAddr           Call Site
00000000`2136d5c8 000007fe`fd00246a ntdll!NtClose+0xa
00000000`2136d5d0 000007fe`fd0022c0 mswsock!SockCloseSocket+0x13a
00000000`2136d690 000007fe`ffb419ad mswsock!WSPCloseSocket+0x80
00000000`2136d6c0 00000000`0082147d ws2_32!closesocket+0xcd
00000000`2136d710 00000000`00821360 sqlservr!Tcp::FCloseRefHandle+0xf5
00000000`2136d770 00000000`00ea3e3c sqlservr!Tcp::Close+0x5c
00000000`2136d840 00000000`00ea3c55 sqlservr!CConnectionEndpoint::Close+0x198
00000000`2136db50 00000000`00f63b08 sqlservr!CConnectionEndpoint::FHandleTimerEvent+0x161
00000000`2136ebb0 00000000`00891c6a sqlservr!CSbTransportMgr::PeriodicWorkNotification+0x10c
00000000`2136ec00 00000000`002c91c0 sqlservr!CServiceBrokerMgr::PeriodicWorkNotification+0xa3
00000000`2136ec30 00000000`002c8c33 sqlservr!CBrokerPeriodicWorkNotification::ProcessTskPkt+0x18
00000000`2136ec60 00000000`002c8a9c sqlservr!TaskReqPktTimer::ExecuteTask+0xc3
00000000`2136ed90 00000000`002bab95 sqlservr!OnDemandTaskContext::ProcessTskPkt+0x237
00000000`2136eed0 00000000`002c88dd sqlservr!SystemTaskEntryPoint+0x3f5
00000000`2136f7c0 00000000`002af070 sqlservr!OnDemandTaskContext::FuncEntryPoint+0x21
00000000`2136f7f0 00000000`002aee46 sqlservr!SOS_Task::Param::Execute+0x12a
00000000`2136f900 00000000`002aec8b sqlservr!SOS_Scheduler::RunTask+0x96
00000000`2136f960 00000000`0087923a sqlservr!SOS_Scheduler::ProcessTasks+0x128
00000000`2136f9d0 00000000`00879389 sqlservr!SchedulerManager::WorkerEntryPoint+0x2d2
00000000`2136fab0 00000000`00877201 sqlservr!SystemThread::RunWorker+0xcc
00000000`2136faf0 00000000`00878402 sqlservr!SystemThreadDispatcher::ProcessWorker+0x2db
00000000`2136fba0 00000000`754d37d7 sqlservr!SchedulerManager::ThreadEntryPoint+0x173
00000000`2136fc40 00000000`0422f1e0 msvcr80!_callthreadstartex+0x17 [f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c @ 348]
00000000`2136fc48 000007ff`fff4d4a8 0x422f1e0
00000000`2136fc50 00000000`0422f1e0 0x7ff`fff4d4a8
00000000`2136fc58 00000000`00000000 0x422f1e0

 
I know this is the correct call stack for the hanging thread. In your case, you might want to verify as sometimes the copied call stack is different than the actual non-yielding call stack.

To determine the correct call stack of a non-yielding thread, see this nice blog written by my dear friend Karthick P.K [Blog]

How to analyze Non-Yielding scheduler or Non-yielding IOCP Listener dumps

Okay, so the lines above highlighted in bold are some of those which probably points towards the reason of non-yielding scheduler. Looks like there is some problem with call to Tcp::Close made by SQL Server which semes to be stuck at OS functions. Time for a quick search on “sql 2008 tcp close non-yielding” which takes me to the following KB article:

“Non-yielding Scheduler” error and SQL Server 2008 or SQL Server 2008 R2 stops responding intermittently in Windows Server 2008 or in Windows Server 2008 R2

As the KB article mentions, this issue appears to be somewhat complex and involves three fixes – SQLServer Database Engine, SQL Native client and Windows fix. Apparently I had all these fixes already applied. My SQL server version is 10.50.2796 i.e. CU4 for SQL 2008 R2 SP1 (while the KB article say it is fixed in CU6 of SQL 2008 R2 RTM). SQL Server native client (C:\Windows\system32\sqlncli.dll) was also on higher build and the OS was Windows 2008 R2 SP1. So the question was – why still the issue if I have all these fixes. Ise the issue not completely fixed or re-surfaced in later builds? Or need some other fix? Well, after digging a bit more, I found another KB article which explains all these questions:

FIX: Error is logged when you start SQL Server 2008 or when a client sends a request to the program

As per the this KB article, I applied Cumulative update package 7 for SQL Server 2008 R2 Service Pack 1 and issue seems to have fixed now 🙂

To summarize this whole experience, if you are running into a non-yielding scheduler issue where the call stack matches with that of mentioned in this post, make sure you have all the three fixes for sqlservr.exe, sqlncli.dll and windows patch as per the KB article http://support.microsoft.com/kb/2491214

If the issue still occurs or you already have fixes applied as per http://support.microsoft.com/kb/2491214 then you should consider applying SQL server CUs as per http://support.microsoft.com/kb/2711549

Hope this post helps you if you are facing the same issue.

4 responses to “Non-yielding scheduler due to issue with Tcp::Close

Subscribe to comments with RSS.

  1. Was the issue reproducible ? . And after applying all these fixes, how can we know the issue is resolved..

  2. No it was not reproducible but did occur a couple of times. And resolved after all the fixes were applied.

  3. Good information & well Analyzed Prashant 🙂

Leave a Comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: