What happened?

During the ongoing SQL migration on our TESTING environment cluster, things were not progressing as per plan. Initially the EVA SAN LUN(s) movement from source to target cluster node(s) gave some major headaches. Solving our way through it took up a major amount of time.

Actual SQL DB migration was already running late, when we were hit by a replication failure. Now during the closing stages after resolving all issues, in the late night comes the bad news. One of our DBA team mates escalates with a dejected face saying “a db xSxxxxxD has gone into a suspect mode. Please help me as I am tired and am not able to think”.

Analysis:

  1. First port of call was the SQL error logs. Error log entries indicate recovery of db “xSxxxxxD” was not completed successfully and dump(s) were generated. An indicator of the transaction log file for the DB “possibly” going kaput.
  2. Good time for the log to go kaput. Next step was to put the DB into emergency mode.
  3. Why emergency? It is a sort of bypass recovery mode, where the transaction log is bypassed to make the DB available to members of the sysadmin group.
  4. Emergency mode helps sysadmin members [member(s)? yes, in emergency mode multiple user connections are allowed] to gain connection to the DB.
  5. To understand apart from the log going corrupt was there a corruption issue in any objects inside the DB, we needed to run a full CHECKDB. We need to know the reason why the DB has gone suspect.
  6. Executed CHECKDB after altering the DB to single user mode with rollback immediate.
  7. After approx 120 minutes CHECKDB completes with error message(s) showing the index id 4 on table “xxxtxxxxxxxx” has gone bad with page allocation errors. Ah! a simple one over here.
  8. It is the index that has caused the issue. The index was being modified when a cluster disk failover was carried out, during the failover the transaction log got corrupted and when the SQL instance started the normal process of DB recovery was initiated but could not complete because the transaction log file could not be read due to the corruption.
  9. Since the log was corrupted, the log entries related to the index changes could not be read to rollback the index data modifications.
  10. Recovery on DB could not be completed. DB put into suspect mode by the SQL instance.
  11. On the surface a very simple issue and a quick solution available. Just drop the non-clustered index and recreate OR rebuild the index.
  12. But is it as simple as it looks? No, why?

Why is this issue not easily recoverable?

  1. We cannot simply execute the rebuild or “drop and create” command. It will fail. Why?
  2. The DB was in emergency mode, meaning bypass recovery.
  3. For all operations related to index, be it either rebuild, or “drop and create” you need to have recovery log in place.
  4. Now if we take the DB out of the emergency mode, SQL instance attempts to take DB online. To go online a recovery is run but transaction log file looks corrupted and the DB goes back to suspect.
  5. Stalemate. In suspect mode you can’t access the DB and without accessing the DB, you can’t rebuild or “drop and create” the offending non-clustered index.
  6. In emergency mode though you can access the DB, you still can’t rebuild or “drop and create” the offending non-clustered index because it is in bypass recovery mode and index rebuild / creation / drop operations recovery log in place.
  7. What next?

Fix:

  1. Best option out is to carry out a restore from the latest relevant backup. We finally went ahead with the restore and it took its own sweet time.

OR

  1. Put DB in emergency mode.
  2. Execute DBCC CHECKDB (xSxxxxxD, REPAIR_ALLOW_DATA_LOSS). This would take this own sweet time.
  3. The above will rebuild the transaction log and the a subsequent attempt to bring the DB online will complete as the recovery step for the DB will be completed.
  4. Though the non-clustered index “might still” be in a bad state, but atleast with the DB online and the log in place, you can rebuild or “drop and create” the index again.

OR

  1. Put DB in emergency mode.
  2. Directly REBUILD_LOG.
  3. RESTORE DATABASE <db> WITH RECOVERY.
  4. DB comes online.
  5. The non-clustered index will still be corrupted, but atleast with the DB online and the log in place, you can rebuild or “drop and create” the index again.

Moral of the story: Before you start a database migration, it pays well to ensure that the overnight “FULL BACKUP(s)” have been successfully taken.

 
Regards

Vasudev Menon

Like us on FaceBook Follow us on Twitter

Join the fastest growing SQL Server group on FaceBook