SQL Server resource hung on cluster due to MSDTC wrong IP   Leave a comment


Hi All,

Hope you are enjoying our blog posts. I am back with one more interesting issue.

I was at home enjoying weekend suddenly I see few sms and calls on my mobile. Everyone started complaining we are not able to connect to SQL Server. Ok, what’s a big deal I will fix the issue for you…

I started looking into the issue and got confused with the behavior. I see all the resources are up and running fine on cluster, everything looks to be clean No ERRORS in SQL Server error log and Event logs as well, then why no one is a able to connect to SQL Server. I also tried to connect to SQL Server and I was also not able to connect, got time out expire error message from Management Studio. It was failing with command prompt as well!!

Since I don’t see any error message anywhere I thought of restarting the SQL Server from Failover cluster. I brought the SQL Server resource offline and then tried to bring it online but strangely it got hung in Online Pending state for a long time, there was no error in Event log and ERRORLOG. I waited for long time but not much progress and in the end SQL Server Failed.

As a thumb rule I decided to start SQL Server from Command prompt, strange SQL Server came online fine!! So what’s wrong with cluster??

I again tried to start SQL Server from failover cluster manager and again same issue.

I looked into SQL ERRORLOG

2013-01-10 04:02:51.33 Server SQL Server is starting at normal priority base (=7). This is an informational message only. No user action 
is required.
2013-01-10 04:02:51.33 Server Detected 12 CPUs. This is an informational message; no user action is required.
2013-01-10 04:02:51.33 Server Large Page Extensions enabled.
2013-01-10 04:02:51.33 Server Large Page Granularity: 2097152
2013-01-10 04:03:01.92 Server Large Page Allocated: 32MB
2013-01-10 04:03:04.19 Server Large Page Allocated: 32MB
2013-01-10 04:03:07.05 Server Using locked pages for buffer pool.
2013-01-10 04:03:11.28 Server Using dynamic lock allocation. Initial allocation of 2500 Lock blocks and 5000 Lock Owner blocks per node.
 This is an informational message only. No user action is
required.
2013-01-10 04:03:11.30 Server Multinode configuration: node 0: CPU mask: 0x0000000000000fc0 Active CPU mask: 0x0000000000000fc0. This 
message provides a description of the NUMA configuration for this
computer. This is an informational message only. No user action is required.
2013-01-10 04:03:11.30 Server Multinode configuration: node 1: CPU mask: 0x000000000000003f Active CPU mask: 0x000000000000003f. This 
message provides a description of the NUMA configuration for this
computer. This is an informational message only. No user action is required.
2013-01-10 04:03:11.30 Server Attempting to initialize Microsoft Distributed Transaction Coordinator (MS DTC). This is an informational 
message only. No user action is required.

After long time when SQL Server resource used to fail I see below messages in SQL Server ERRORLOG


2013-01-10 04:04:13.06 Server Service control: stop before startup

Event log has below entry

Event Type: Error
Event Source: ClusSvc
Event Category: Resource Monitor
Event ID: 1145
Date: 10/10/2013
Time: 4:05:51 AM
User: N/A
Computer: SQLTEST01
Description:
Cluster resource SQL Server (SQLTEST) timed out. If the pending timeout is too short for this resource, consider increasing the pending timeout value.
For more information, see Help and Support Center at
http://go.microsoft.com/fwlink/events.asp.
Data:0000: b4 05 00 00 ´...

Now since we had no clue I thought of checking Cluster event log to check if I find any help there. Below is the entry in Cluster.log

00000848.00003200::2013/01/09-17:54:21.752 ERR SQL Server <SQL Server (SQLTEST)>: [sqsrvres] OnlineThread: ResUtilsStartResourceService 
failed (status 5b4)
00000848.00003200::2013/01/09-17:54:21.752 ERR SQL Server <SQL Server (SQLTEST)>: [sqsrvres] OnlineThread: Error 5b4 bringing resource 
online.
00000848.00003200::2013/01/09-17:54:21.861 WARN [RM] RmpSetResourceStatus: Resource <SQL Server (SQLTEST)> not setting status since Terminate is called, lock owner=0x2050, resource=SQL Server (SQLTEST),
state=14...
00000eb8.00000f18::2013/01/09-17:54:32.096 INFO [CP] CppResourceNotify for resource SQL Server (SQLTEST)
00000eb8.00000f18::2013/01/09-17:54:32.096 INFO [FM] RmTerminateResource: 4e61fb84-23a2-401a-a9b3-dcfb01a501eb is now offline
00000eb8.00000f18::2013/01/09-17:54:32.096 INFO [GUM] GumSendUpdate: queuing update type 0 context 11
00000eb8.00000f18::2013/01/09-17:54:32.096 INFO [GUM] GumSendUpdate: Dispatching seq 5916 type 0 context 11 to node 2
00000eb8.00000f18::2013/01/09-17:54:32.096 INFO [GUM] GumSendUpdate: completed update seq 5916 type 0 context 11
00000eb8.00000f18::2013/01/09-17:54:32.096 INFO [GUM] GumSendUpdate: queuing update type 0 context 11
00000eb8.000027e8::2013/01/09-17:54:32.096 INFO [FM] FmpEnumerateGroupResources: Entry for group <4a81aba9-7aa8-43c7-b1e7-3e7658797b0d>....
00000eb8.00000f18::2013/01/09-17:54:32.096 INFO [GUM] GumSendUpdate: Dispatching seq 5917 type 0 context 11 to node 2

*Note:- cluster log time is in GMT

I tried to convert this status (status 5b4) to decimal

Hex 5b4 ==> 1460 Decimal

C:\Documents and Settings\manish>net helpmsg 1460
"This operation returned because the timeout period expired."

You can confirm by runnig a tool called ERR

C:\Manish\Desktop\Download\Err>Err.exe 5b4
# for hex 0x5b4 / decimal 1460 :
SQL_1460_severity_16 sql_err
# The partner for database "%.*ls" is not defined or it is
# different than the one attempting to establish log
# shipping.
ERROR_TIMEOUT winerror.h
# This operation returned because the timeout period expired.
# 2 matches found for "5b4"

I was not getting any clue at all!! So I decided to take a close look at event log one more time before I attempt to run any tools. I scrolled the Event log further down and i see one error message.

Event Type: Error
Event Source: DCOM
Event Category: None
Event ID: 10009
Date: 10/01/2013
Time: 2:07:39 AM
User: N/A
Computer: SQLTEST01
Description:
DCOM was unable to communicate with the computer SQLDTCTST01.Domain.com using any of the configured protocols.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

SQLDTCTST01.Domain.com is my MSDTC Network name resource and in my ERRORLOG also the last message says ” Attempting to initialize Microsoft Distributed Transaction Coordinator (MS DTC). This is an informational message only. No user action is required.”

I tried to co-relate both and look like some thing wrong with my MSDTC Resource.

I tried to ping MSDTC Network name and what I see was more confusing than what I was expecting, the IP address was not the IP address for which MSDTC is configured, it was pointing to some other IP address!!

Now one thing is confirmed that SQL Server resource is getting timed out because of MSDTC resource.

As an attempt I tried to bring the MSDTC resource offline and then tried to bring SQL Server resource online this time SQL Server came online fine…. One more reason for confusion.

I tried to bring the MSDTC resource online it also came fine…but after few minutes issue started happening again…..

Question still remains the same Why MSDTC is pointing to other IP and I asked the same question to my team!! One of the mates highlighted that today there was a DR exercise, is there any relation with that….

Resolution

Now situation was clear. I logged into DR server and I found Network name and IP address resources were online there and IP address was same which i was getting in my ping request…

We brought this resource offline at DR site flushed the DNS on the cluster, then tried to bring the resource and every thing was up and running perfect!!!

Other possible reasons

1- Check MSDTC configuration

2- Check if some one has manually changed the IP address Or Network name

All right issue resolved its back to normal now and time to enjoy my weekend …hurray….

Leave a Comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: