Latch Timeout: To worry or not to?   1 comment


Hey, I seldom see these messages in ERROLROG, is this something I should be concerned about?

[WARNING] 2013-02-14 05:25:15.97 spid61      Timeout occurred while waiting for latch: class ‘DATABASE_CHECKPOINT’, id 0000000017560830, type 4, Task 0x000000000043EBC8 : 0, waittime 300, flags 0x1a, owning task 0x00000000009F0508. Continuing to wait.

 
The short answer is – No. This ‘particular’ message is not to be concerned about for the period it occurred on this server.

Why do I say that? Want to know more. Then go ahead and read the longer version.

Before reading the further lines, we must have a clear understanding of the difference between a “lock” and a “latch”. Latch is a very short time physical lock on a buffer while an IO operation is performed. For example: During a IO read operation SQL acquires a shared latch on a buffer before it can fetch the data off the disk into the in-memory buffer. If the time taken to complete the IO takes time there can be a Latch Timeout reported. Thus latch timeouts can occur due to sub optimal Disk IO performance.

The message in the log says:

[WARNING] 2013-02-14 05:25:15.97 spid61      Timeout occurred while waiting for latch: class ‘DATABASE_CHECKPOINT’, id 0000000017560830, type 4, Task 0x000000000043EBC8 : 0, waittime 300, flags 0x1a, owning task 0x00000000009F0508. Continuing to wait.

 
Please note that the portion of the message “Timeout occurred while waiting for latch: class” is very generic for non-buf latch time-outs. If a new DBA just searched over internet with this string, there could be numerous KB articles, blogs etc. The rest of the message (class <class_name>… onwards) does contain very specific information. Broadly speaking, latch time-out error messages are of two categories:

1. Timeout occurred while waiting for latch: class ‘latch_class’ <further details>…
2. A time-out occurred while waiting for buffer latch — type ‘type #’ <further details>…

 
The first message is mostly in the cases of non-bpool latches while the second one in bpool latch. It is important to know that sys.dm_os_latch_stats doesn’t differentiate between the page latch waits caused due to I/O operations and read/write operations than sys.dm_os_wait_stats

Sticking to the original message, there are ‘few’ latch timeouts which are of concern including the following (and excluding DATABASE_CHECKPOINT. This latch is used to serialize checkpoints within a database hence we just print a message after 5 mins i.e. waittime 300 secs as you notice above, but we’re not really concerned here for obvious reasons)

ACCESS_METHODS_DATASET_PARENT
APPEND_ONLY_STORAGE_INSERT_POINT
ACCESS_METHODS_HOBT_COUNT
ACCESS_METHODS_SCAN_RANGE_GENERATOR
ACCESS_METHODS_KEY_RANGE_GENERATOR
QUERY_OPTIMIZER_ID_MANAGER

 
If you see any of the above latch waits ‘class’ in the errorlog (these are to name a few), more than likely a minidump would be generated. Also, such latch timeouts are prone to cause SQL unresponsiveness. They must be of a greater concern and must be investigated. Example:

A time-out occurred while waiting for buffer latch — type 4, bp 00000001361F49C0, page 4:64704, stat 0xc0000b, database id: 2, allocation unit Id: 6488064, task 0x00000006CC04D288 : 0, waittime 300, flags 0x11000001038, owning task 0x0000000000000000. Not continuing to wait

 
Possible causes of “Time out occurred while waiting for buffer latch”:

1) An IO bottleneck due to the load on the system in proportion to the available IO resources (slow disks, slow controllers, too many IO reads due to insufficient indexes or poorly designed queries, etc., or just a load larger than the hardware can handle).

2) A hardware issue, such as disks or controllers not responding properly or having corruption at some level. If one disk in the RAID array fails, the IO load on the other disks is heavier while the parity is being rebuilt. Bad sectors on a disk could cause sporadic errors.

Next Steps:

A) If the error is infrequent and/or not causing specific problems it can be ignored.

B) If it needs to be investigated further, the following data should be collected:

i.              A Perfmon log with relevant counters covering the period when the errors occur.

ii.             If we can determine a specific query or situation that causes the error to occur, we can run DBCC StackDump and “select * from master..sysprocesses” immediately before and after causing the problem. We may need to set up a debugger and use it to capture additional information.

iii.            A SQLDiag taken after the problem occurs.

Also, you might want to check ERRORLOG for any symptom of IO bottleneck or any activity which consumes IO e.g. database backup etc,. Here are some of the messages I found during the time latch timeout was reported which possibly explains the latch timeout during DATABASE_CHECKPOINT.

2013-02-14 04:52:14.14 Backup      Database backed up. Database: testdb, creation date(time): 2012/12/20(04:56:29), pages dum
2013-02-14 05:01:10.92 spid5s      SQL Server has encountered 1183 occurrence(s) of I/O requests taking longer than 15 seconds to com

 
Idea of this post was to give an insight of “Latch Timeouts” and how to get started. Hope you find it useful.

One response to “Latch Timeout: To worry or not to?

Subscribe to comments with RSS.

  1. Pingback: We didn’t know that… | SQLactions.com

Leave a Comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: