Hope you are enjoying our blog.
Today i am going to share one more examples of real troubleshooting where customer comes with an issue but when we actually found the root cause it turns out to be completely different all together from what we expected. To know more, look at how we worked on this issue and most amazing thing was how we resolved this one.
According to the customer when he was starting SQL Server Agent service it started successfully with status “service started successfully” but when we are running any jobs it was throwing below error.
SQLServerAgent is not currently running so it cannot be notified of this action. (Microsoft SQL Server, Error: 22022)
Surprising part was if SQL Server Agent is running successfully then why this error message.
First thing for troubleshooting is to start with SQLAGENTLOG.OUT to see if any error message reported there. I opened SQLAGENTLOG.OUT file and it was completely blank!!! Phew!!
It was difficult to believe this, so I decided to start SQLAgent service from command prompt. I started it from Command prompt and here is the result.
C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Binn>SQLAGENT90.EXE -I Instancename -c
Microsoft (R) SQLServerAgent 10.50.2500.0
Copyright (C) Microsoft Corporation.
Nothing is getting reported in the command prompt as well. This is interesting, no clue at all!! We got stuck here and most of the time we give up in such situations, but in this case we decided to try something else.
I had 2 ideas as a next action plan.
1- Take a Procmon while running the service.
2- Take a Dump of SQLAGENT90.EXE.
We tried first action plan of collecting Procmon while running the service. It did not help much then we moved to next step and took dump of SQLAGENT90.EXE.
Since this is not an error message we can not directly take a crash dump so challenge here is, how should we take dump on the service? Luckily this was Windows Server 2008. Windows Server 2008 we have a facility where we can directly take dumps from the task manager for a process.
Here is the screen shot on how to take dump. Check the below screen shot.
Once dump is generated it will give you the location where it got generated.
Go to the dump location and get the DMP file.
Here we go with our typical Dump analysis
First step is to set the symbol path to Microsoft symbols server. On the Windbg command window type below command.
.sympath srv*c:\ publicsymbols*http://msdl.microsoft.com/download/symbols;
Now let’s load the symbols from Microsoft symbols server:
Again type .reload /f and hit enter.
Let’s confirm whether we have symbols loaded or not, remember this time we took dump on SQLAGENT.EXE.
0:002> lmvm SQLAGENT start end module name 00000000`00b40000 00000000`00bbd000 SQLAGENT (pdb symbols) c:\publicsymbols\SQLAGENT.pdb\D6244891FD4647DBBA866D05ED4825891\SQLAGENT.pdb Loaded symbol image file: SQLAGENT.EXE Image path: C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Binn\SQLAGENT.EXE Image name: SQLAGENT.EXE Timestamp: Fri Jun 17 01:28:00 2011 (4DFB1010) CheckSum: 0007337E ImageSize: 0007D000 File version: 2009.100.2500.0 Product version: 10.50.2500.0 File flags: 8 (Mask 3F) Private File OS: 40000 NT Base File type: 1.0 App File date: 00000000.00000000 Translations: 0409.04b0 CompanyName: Microsoft Corporation ProductName: Microsoft SQL Server InternalName: SQLAGENT OriginalFilename: SQLAGENT.DLL ProductVersion: 10.50.2500.0 FileVersion: 2009.0100.2500.00 ((KJ_PCU_Main).110617-0038 ) FileDescription: SQLAGENT - SQL Server Agent LegalCopyright: Microsoft Corp. All rights reserved. LegalTrademarks: Microsoft SQL Server is a registered trademark of Microsoft Corporation. Comments: SQL
Since this is full dump there can we multiple threads and we have no idea which thread is of our interest. To know that run below command on windbg.
0:000> ~ # 0 Id: 155c.137c Suspend: 0 Teb: 000007ff`fffde000 Unfrozen 1 Id: 155c.15f4 Suspend: 0 Teb: 000007ff`fffda000 Unfrozen . 2 Id: 155c.1a68 Suspend: 0 Teb: 000007ff`fffd4000 Unfrozen 3 Id: 155c.1200 Suspend: 0 Teb: 000007ff`fffae000 Unfrozen 4 Id: 155c.158c Suspend: 0 Teb: 000007ff`fffd8000 Unfrozen 5 Id: 155c.128c Suspend: 0 Teb: 000007ff`fffd6000 Unfrozen
Now if you look at all the threads carefully you will find a “.” dot just before the thread 2. That is how we came to know this is the thread of our interest. Currently we are on thread 0 so we need to switch to thread 2. Below is the command to switch to a thread.
0:000> ~s ntdll!NtWaitForSingleObject+0xa: 00000000`7721135a c3 ret Run below command to dump the stack. 0:002> kc Call Site ntdll!NtWaitForSingleObject KERNELBASE!WaitForSingleObjectEx KERNELBASE!GetOverlappedResult sqlncli10!Np::ReadSync sqlncli10!BATCHCTX::SNIRead sqlncli10!BATCHCTX::ReadPacket sqlncli10!CConnection::PreLogin sqlncli10!CTdsParser::DoConnect sqlncli10!CTdsParser::OpenServerConnection sqlncli10!ConnectIt sqlncli10!DoDlgConnection sqlncli10!SQLDriverConnectW odbc32!SQLInternalDriverConnectW odbc32!SQLDriverConnectW sqlsvc!QSQLLogonExImpl sqlsvc!QSQLLogonExWithErrorHandling SQLAGENT!ConnVerifyConnectionOnStart SQLAGENT!DumpAndCheckServerVersion SQLAGENT!ServiceMain sechost!ScSvcctrlThreadA kernel32!BaseThreadInitThunk ntdll!RtlUserThreadStart
Looking at the highlighted part in the stack, looks like we got stuck while login to SQL Server.
There could be 2 possibilities.
1- We have some permission issues while login to SQL Server
2- We have some login issue with over all SQL Server itself.
We checked and confirmed that Agent account had sysadmin rights on the SQL Server. This signifies there is no permission issue with Agent.
So we decided to go ahead with other option.
We were not confident here that login to SQL Server could be an actual problem because customer is connected to SQL Server from his laptop and showing us the job failure status .This means Login is working fine from the client location but, what about login from the server itself?? Because SQL Agent is connecting to SQL Server using client tools of Server itself.
I asked customer to connect to SQL Server from Management Studio of server. What I see here was quite interesting Management Studio login got hung. This is interesting, isn’t it??
I tried to connect to SQL Server from Command prompt of server that also got hung. Now I was confident we are going in right direction.
It leads us to the conclusion that the actual problem is with client tools of Server. To make it 100% sure I created UDL (to know how to create udl, check this blog).
UDL worked in the first attempt because UDL by default goes with OLEDB Provider for SQL Server, but we are facing issue with native client that’s why I changed the provider of SQL Server to Native Client to check it is getting hung or not, check the blow screen on how to change it.
I changed it to Native Client 10.0 and then tried to connect and this also got hung.
Come on Manish! We are going in right direction and yes yes….We can crack this now.
Now I went back to the dump stack which we took earlier and checked it carefully, looked at below 2 frames.
sqlncli10!BATCHCTX::ReadPacket sqlncli10!CConnection::PreLogin sqlncli10!CTdsParser::DoConnect
As per the above call looks like we got stuck at PreLogin that means we are not even going inside the SQL Server, problem is before that.
I consulted few of my friends who work on the connectivity side to check what checks we do during PreLogin.
As per them for the security reasons during the PreLogin we use SSL Encryption using Schannel.dll and this is registered at below registry location
So one of the reasons for this problem could be the encryption cipher used during the SSL Encrypted Pre-Login to SQL Server is corrupt/disabled. Since the cipher is corrupt/disabled the pre-login cannot complete as the data cannot be encrypted/decrypted correctly.
To resolve this issue follow these steps:
1. Open regedit.
2. Rename the SCHANNEL key to old from below location:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL === > make it to SCHANNEL_Old
3. When you do this you might receive an error. (This can be ignored).
4. Refresh the branch in the registry and the SCHANNEL key should now be recreated with default settings.
NOTE: No Reboot is required on Windows 2003/2008 but a reboot is required on Windows 2000.
After making these changes if we try to login to SQL Server this key will be re-populated.
Now any attempt of login to SQL Server from local server will be successful. We started the SQL Server Agent now and then tried to start a job. Now all the jobs are working fine!!!
NOTE: As you now know this was the issue with Client tools of that local machine, so this can happen with any client machine not necessarily with local server were SQL Server is installed, but resolution steps will remain same for that machine where you face this issue.
As you can see from this blog we started troubleshooting SQL Agent job failing with error and ending up troubleshooting Login issue.
Hope this will help you in troubleshooting not only this issue, infact you can follow the similar approach to troubleshoot other login issues as well.