Sometime back while auditing error logs in one of the SQL Server instances I came across following message;
SQL Server has detected an unreported OS/hardware level read or write problem on page (1:74098) of database 10
LSN returned (59991:12571:175), LSN expected (59991:12571:600)
Contact the hardware vendor and consider disabling caching mechanisms to correct the problem
As can be seen on error message itself there is an indication that the LSN value stored in memory isn’t matching with the LSN returned by OS i.e. either data read is not as per what is expected or somehow data written to disk is lost or not written at all. On this note, let’s have a quick look at what Stale Read and Lost Writes are?
A Stale read occurs when SQL Server writes a modified page to storage but storage system returns a different value may be a different version of the page from hardware cache. A Lost write is something when SQL Server modifies a page and writes to disk but this value never stored in storage system so you get to see previous version while read operation. At times you may observe that issue is resolved when system is rebooted as this clears the cache. For us it worked and we involved HW vendor for appropriate checks but should not be considered to be generic solution across all environments.
This may not be detected by CHECKSUM as page is valid based on checksum value but hardware is retuning a different version. To enable additional diagnostics for this type of problems, SQL Server has added TRACE FLAG 818. You can specify TRACE FLAG 818 either as a startup parameter -T818 for the computer that is running SQL Server or by running the following statement:
DBCC TRACEON (818, -1)
TRACE FLAG 818 enables an in-memory ring buffer that is used for tracking the last 2,048 successful write operations that are performed by the computer on which you are running SQL Server not including sort and workfile I/Os. When errors such as Error 605, 823 or 3448 occur the incoming buffer’s LSN value is compared to the recent write list. If the LSN that is retrieved during the read operation is older than the one specified during the write operation a new error message is logged in the SQL Server error log. You can refer http://support.microsoft.com/kb/826433 to know more details on this.
From SQL Server 2000 SP4, SQL Server 2005 this logic enhanced the behavior of this trace flag to perform LSN check on every read of a page and to store the LSN in hash table design. This is not default and you need to enable TRACE FLAG 818 explicitly when SQL starts up.
Leave a comment if you faced similar issue in your environments.