SQL Server: Troubleshooting a SQL 2005 database that went into a suspect mode.

What happened?

During the ongoing SQL migration on our TESTING environment cluster, things were not progressing as per plan. Initially the EVA SAN LUN(s) movement from source to target cluster node(s) gave some major headaches. Solving our way through it took up a major amount of time.

Actual SQL DB migration was already running late, when we were hit by a replication failure. Now during the closing stages after resolving all issues, in the late night comes the bad news. One of our DBA team mates escalates with a dejected face saying “a db xSxxxxxD has gone into a suspect mode. Please help me as I am tired and am not able to think”.

Analysis:

  1. First port of call was the SQL error logs. Error log entries indicate recovery of db “xSxxxxxD” was not completed successfully and dump(s) were generated. An indicator of the transaction log file for the DB “possibly” going kaput.
  2. Good time for the log to go kaput. Next step was to put the DB into emergency mode.
  3. Why emergency? It is a sort of bypass recovery mode, where the transaction log is bypassed to make the DB available to members of the sysadmin group.
  4. Emergency mode helps sysadmin members [member(s)? yes, in emergency mode multiple user connections are allowed] to gain connection to the DB.
  5. To understand apart from the log going corrupt was there a corruption issue in any objects inside the DB, we needed to run a full CHECKDB. We need to know the reason why the DB has gone suspect.
  6. Executed CHECKDB after altering the DB to single user mode with rollback immediate.
  7. After approx 120 minutes CHECKDB completes with error message(s) showing the index id 4 on table “xxxtxxxxxxxx” has gone bad with page allocation errors. Ah! a simple one over here.
  8. It is the index that has caused the issue. The index was being modified when a cluster disk failover was carried out, during the failover the transaction log got corrupted and when the SQL instance started the normal process of DB recovery was initiated but could not complete because the transaction log file could not be read due to the corruption.
  9. Since the log was corrupted, the log entries related to the index changes could not be read to rollback the index data modifications.
  10. Recovery on DB could not be completed. DB put into suspect mode by the SQL instance.
  11. On the surface a very simple issue and a quick solution available. Just drop the non-clustered index and recreate OR rebuild the index.
  12. But is it as simple as it looks? No, why?

Why is this issue not easily recoverable?

   
  1. We cannot simply execute the rebuild or “drop and create” command. It will fail. Why?
  2. The DB was in emergency mode, meaning bypass recovery.
  3. For all operations related to index, be it either rebuild, or “drop and create” you need to have recovery log in place.
  4. Now if we take the DB out of the emergency mode, SQL instance attempts to take DB online. To go online a recovery is run but transaction log file looks corrupted and the DB goes back to suspect.
  5. Stalemate. In suspect mode you can’t access the DB and without accessing the DB, you can’t rebuild or “drop and create” the offending non-clustered index.
  6. In emergency mode though you can access the DB, you still can’t rebuild or “drop and create” the offending non-clustered index because it is in bypass recovery mode and index rebuild / creation / drop operations recovery log in place.
  7. What next?

Fix:

  1. Best option out is to carry out a restore from the latest relevant backup. We finally went ahead with the restore and it took its own sweet time.

OR

  1. Put DB in emergency mode.
  2. Execute DBCC CHECKDB (xSxxxxxD, REPAIR_ALLOW_DATA_LOSS). This would take this own sweet time.
  3. The above will rebuild the transaction log and the a subsequent attempt to bring the DB online will complete as the recovery step for the DB will be completed.
  4. Though the non-clustered index “might still” be in a bad state, but atleast with the DB online and the log in place, you can rebuild or “drop and create” the index again.

OR

  1. Put DB in emergency mode.
  2. Directly REBUILD_LOG.
  3. RESTORE DATABASE <db> WITH RECOVERY.
  4. DB comes online.
  5. The non-clustered index will still be corrupted, but atleast with the DB online and the log in place, you can rebuild or “drop and create” the index again.

Moral of the story: Before you start a database migration, it pays well to ensure that the overnight “FULL BACKUP(s)” have been successfully taken.

 
Regards

Vasudev Menon

Like us on FaceBook Follow us on Twitter

Join the fastest growing SQL Server group on FaceBook

   

8 Comments on “SQL Server: Troubleshooting a SQL 2005 database that went into a suspect mode.”

  1. Execute DBCC CHECKDB (xSxxxxxD, REPAIR_ALLOW_DATA_LOSS). This would take this own sweet time.

    Are you sure that the above is advisable because all the rows in a corrupt page will simply be deleted which can lead to data loss.

  2. Sachin,

    I have said 2 important things in the blog

    1. “best option out is a restore”

    2. The moral of the story “Always ensure a full backup is in place before starting migration”.

    When a DBA faces an issue and troubleshoots, there are many options available and specially when he writes a blog to the wider community he needs to put across all options available, so that when others face a similar issue they are aware of the best and the worst options.

    Now lets say:

    a. What if the poor chap had no backup? He just sits back and pray?

    b. Should he not attempt to bring the DB online with the next available options on hand? mark my words, I have said next available option.

    Now to your question of data loss:

    Purely in the context of “this blog” and “the error I was troubleshooting”, I had clearly mentioned it is the non-clustered index that had gone kaput.

    You seriously think carring out DBCC checktable or checkdb with DATA_LOSS for a non-clustered index, will result in a data loss… from the table object itself? Smile

    Thanks,

    VM

  3. Oh sorry I had missed that you had mentioned that error message indicated an issue with the index ,thats why all the confusion.

  4. No problem Sachin, please do feel free to always question. That is the only way, by questioning, can a healthy discussion on the pros and cons of a topic be initiated and knowledge shared among the wider community.

  5. Mr. @@AmitBansal

    Sir, Is work with and use of DBCC CHECKDB (xSxxxxxD, REPAIR_ALLOW_DATA_LOSS) command data will loss or not…and if yes the how much chances to LOSS data and is there any other Step or process for Repaitr Suspect Database in SQL SERVER 2005\2008

    Thanks,

    Manoj Kumar

  6. This post really helped me, thanks Vasu.

    Hey Sachin, it seems that you are very much prompt on every blogs and all, well carry on the nice work and at the same time do read the blogs Carefully.. 😉

Leave a Reply

Your email address will not be published.