Yesterday, afternoon we had a support executive from a remote development approach our DBA team requesting help on understanding why is their SQL error log being flooded with the error message “Time-out occurred while waiting for Buffer Latch — Type 3, bp 039FE530, page 1:44377, stat 0xc00009, database id: 9, allocation unit Id: 72057594058833920, task 0x00FE86B8 : 0, waittime 300, flags 0x3a, owning task 0x00FE8F28. Not continuing to wait.
Since it was a cross department request, we wanted to know whether this would be a paid support . Well the support executive made it clear, he was looking forward for a buddy help, just needed advice on what is causing is and thats it, no paid support.
One of our DBA members, monitoring the call queue [i.e. the support ticket queue] volunteered to log in and check. On RDP to the box, the first thing he noticed was a previous entry in the error log, around 12 hours back, that tempdb files directed to the system drive [a real bad install decision] had gone full.
But as of now tempdb files had lots of free space with the disk drive also having free space. On cross checking with the support executive, he got confirmation that yes, they had intervened to free up space for tempdb. How they did it was not explained and neither did we probe further.
Going back to the SQL error log, our DBA spotted a pattern on the timing of the error messages being logged. The error pops up every 5 minutes. This pointed to a possibility that a task was running either at a regular frequency and it was a process in the task that was generating this error.
A good possibility, because if true, this meant that the process when trying to read a page from disk to cache, it was not able to obtain EX (exclusive) latch on the BUF associated with the page. This meant either the page in question was corrupted or the page did not exist.
The next point of investigation was to pinpoint the process that was trying to access the page. A quick search on the current requests running on the SQL instance showed 3 SQL Agent job process running continuously for an extensively long period of time. Also noticed that a process from one of the Agent jobs was blocking other process and in turn causing a chain of blocking(s).
Got a pretty good idea as to which Agent job might be the culprit. All the 3 SQL Agent jobs were manually stopped and viola the error message popping up in the error log every 5 minutes also stopped.
But we still needed to identify the page belonged to which object and why was it not being accessible. To verify, ran the suspect Agent job again and yes, error message started popping up again.
This made things much easier, because process now identified, user database already identified from the error message, i.e. dbid = 9. What remained now was just identifying the page belonged to which object.
Advertised downtime, on gaining concurrence, put the database into single user mode and executed DBCC CHECKDB. The results were plain clear and as per expectations, DBCC CHECKDB returns an error identifying the object and the error associated ” B-tree chain linkage mismatch. Page (1:159435) is missing a reference from previous page (1:2560). Possible chain linkage problem.”
A quick search on whether the object was being accessed by the code in the SQL Agent job returned a success. The concerned department was notified on the investigation results and an offer made to help them recover the object.
How would we recover the object? I will cover it another blog.