I recently worked on an issue where SQL Server was experiencing memory pressure. During the troubleshooting, one of my colleagues asked an interesting question: how is Stolen Potential calculated, and why does it go negative? I thought I would share what I researched and learned in a quick blog post.
In SQL Server 2008, when SQL Server experiences memory pressure (usually due to Buffer Pool pressure on x64 instances), the following error is logged in the ERRORLOG:
2012-10-30 15:50:42.21 spid67 Error: 701, Severity: 17, State: 123.
2012-10-30 15:50:42.21 spid67 There is insufficient system memory in resource pool 'default' to run this query.
The way you can tell it is Buffer Pool pressure is by looking for a message similar to this:
2012-12-18 16:37:01.22 spid20s Failed allocate pages: FAIL_PAGE_ALLOCATION 1
As you can see, it failed to allocate 1 page here. Single-page allocations come from the Bpool, so this is Bpool pressure. Additionally, the following messages also indicate Bpool pressure:
BPool::Map: no remappable address found.
BufferPool out of memory condition
LazyWriter: warning, no free buffers found.
Had it been MTL pressure (technically speaking, memory outside the Bpool), SQL Server would have printed messages like these:
Failed Virtual Allocate Bytes: FAIL_VIRTUAL_RESERVE 65536
WARNING: Failed to reserve contiguous memory of Size= 65536
Failed Virtual Allocate Bytes: FAIL_VIRTUAL_COMMIT 65536
Note that in the above messages the size is in bytes. 65536 bytes = 64 KB, which is greater than 8 KB. That means SQL Server tried to reserve/commit a 64 KB allocation, which obviously has to come from outside the Buffer Pool (up until SQL 2012).
Coming back to the subject of this blog: in our case we had an OOM (Out of Memory) condition due to Buffer Pool pressure. SQL Server printed the DBCC MEMORYSTATUS output in the ERRORLOG a few times, and from that output we get the distribution of the various Buffer Pool counters. Here is one such snippet:
Buffer Pool                              Value
---------------------------------------- ----------
Committed                                1048576
Target                                   1048576
Database                                 917683
Dirty                                    7041
In IO                                    26
Latched                                  23
Free                                     96816
Stolen                                   34077
Reserved                                 1302240
Visible                                  1048576
Stolen Potential                         -340169
Limiting Factor                          17
Last OOM Factor                          17
Last OS Error                            0
Page Life Expectancy                     4869
As we can see, Stolen Potential is negative here. We also see another counter, Stolen pages. According to KB907877, stolen memory describes 8-KB buffers that the server uses for miscellaneous purposes; these buffers serve as generic memory store allocations. Similarly, Stolen Potential shows the maximum number of pages that can still be stolen from the Buffer Pool. It follows that once stolen potential goes negative, there are no further pages the engine can steal from the Buffer Pool, and in such cases SQL Server will throw an OOM error.
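On a live SQL 2008/2008 R2 instance, the same Buffer Manager counters can also be watched over time through sys.dm_os_performance_counters, without waiting for a DBCC MEMORYSTATUS dump in the ERRORLOG. A minimal sketch (the object_name prefix differs on named instances, hence the LIKE):

```sql
-- Buffer Manager counters; 'Stolen pages' here corresponds to the
-- 'Stolen' row in the DBCC MEMORYSTATUS snippet above.
SELECT counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%'
  AND counter_name IN ('Stolen pages', 'Target pages', 'Total pages',
                       'Free pages', 'Page life expectancy');
```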
Now, the outstanding question is how this value of stolen potential is calculated. Here is the formula as per my understanding:
Stolen potential = (95% of Visible) – Stolen – Reserved
In our case, Stolen Potential = (1048576 * 0.95) – 34077 – 1302240 = –340169.8 = the value we see above.
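The arithmetic above can be verified with a trivial query, plugging in the values from the MEMORYSTATUS snippet:

```sql
-- Verify the stolen potential calculation using the snippet's values
SELECT (1048576 * 0.95)  -- 95% of Visible
       - 34077           -- Stolen
       - 1302240         -- Reserved
       AS StolenPotentialPages;
-- returns -340169.80, which MEMORYSTATUS reports truncated as -340169
```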
Note: Up until SQL 2008 R2, the stolen potential memory counter was under the "Buffer Pool" object. In SQL 2012, stolen potential is called Page Alloc Potential (KB) and resides under the "Memory Manager" object. Please refer to "Memory Manager surface area changes in SQL Server 2012" for more details.
Okay, now on to troubleshooting the issue. We know it was Bpool pressure, so we look for the clerk with a high "single page allocator" value. In our case, MEMORYCLERK_SQLQERESERVATIONS was the one that stood out:
MEMORYCLERK_SQLQERESERVATIONS (node 0)   KB
---------------------------------------- ----------
VM Reserved                              0
VM Committed                             0
Locked Pages Allocated                   0
SM Reserved                              0
SM Committed                             0
SinglePage Allocator                     10436736
MultiPage Allocator                      0
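The same ranking can be pulled from the sys.dm_os_memory_clerks DMV instead of reading the MEMORYSTATUS dump. A sketch for SQL 2008/2008 R2 (these columns were consolidated into pages_kb in SQL 2012):

```sql
-- Top memory clerks by single-page allocations,
-- i.e. the biggest consumers of pages stolen from the Bpool
SELECT TOP 10
       type,
       SUM(single_pages_kb) AS single_pages_kb,  -- stolen from the Bpool
       SUM(multi_pages_kb)  AS multi_pages_kb    -- allocated outside the Bpool
FROM sys.dm_os_memory_clerks
GROUP BY type
ORDER BY SUM(single_pages_kb) DESC;
```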
MEMORYCLERK_SQLQERESERVATIONS is the workspace memory used for query execution. High values here usually point to memory-intensive queries (due to inappropriate or missing indexes, or bad cardinality estimates). Leveraging some of the useful queries from this blog, we found one poorly performing query that alone was taking 9 GB of memory for its execution. We tuned the query with DTA (basically, a missing index was suggested, and we also updated the statistics with a full scan), and the issue was resolved.
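One way to spot such a query on a live system is the sys.dm_exec_query_memory_grants DMV, which lists sessions holding or waiting for query-execution memory grants. A sketch:

```sql
-- Largest outstanding query memory grants, with the query text
SELECT mg.session_id,
       mg.requested_memory_kb,
       mg.granted_memory_kb,
       mg.query_cost,
       t.text AS query_text
FROM sys.dm_exec_query_memory_grants AS mg
CROSS APPLY sys.dm_exec_sql_text(mg.sql_handle) AS t
ORDER BY mg.granted_memory_kb DESC;
```

Rows where granted_memory_kb is NULL are queries still waiting for a grant, which is itself a symptom of this kind of pressure.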
I hope this post helps someone.