SQL Server Always ON and Change Data Capture: Friends or Foes?

I talked about Always ON and Change Data Capture in one of the Open Talks at recently concluded SQLServerGeeks Annual summit – 2016. The topic of discussion was Always ON and Change Data Capture: Friends or Foes? I thought its worth a write as well because this combination is very less talked about.

Let me tell you a short story about how and why I chose the title for my talk.

We have a two node windows cluster in our environment. SQL Server 2012 Enterprise Edition is installed on both the nodes as standalone and both these nodes are a part of Always ON configuration. Always ON is configured in synchronous-commit mode which means that the transactions wait to send the transaction confirmation to the client until the secondary replica has hardened the log to disk. There is a database in this Always On configuration which has CHANGE DATA CAPTURE (abbreviated as CDC) feature enabled on it. Simple and straight till now.

One day we got a requirement to introduce another replica in Always ON configuration (Don’t ask me the reason, that doesn’t matter for this post). We followed basic steps. Another windows server was built and made a part of existing cluster. Then on a Friday afternoon, we started SQL Server related stuff with the weekend already on our minds. SQL Server 2012 was installed as standalone on new node and it was enabled for Always ON (By right clicking on SQL Server service and enabling Always ON, We all know this. Right?). This new node was added as a third replica in current Always ON configuration. While doing this I was continuously keeping a close eye on the clock at the bottom right of my laptop and I was delighted to see that this is nearly the end of my shift and the weekend is about to begin. I left the rest of the things like restoring the database, syncing it with other replicas, moving logins, jobs etc. pending for Monday.

Before the mysterious Monday begins, let me tell you how we use CDC. In our environment the production data gets inserted in to SQL Server’s CDC enabled database. We have a database in another RDBMS (other than SQL Server) which we initially sync’d with SQL server’s CDC enabled database using a manual process. The clients see the data from other RDBMS. This was implemented to separate out Read and write activity. Obviously there could have been other methods to achieve this but that’s a different discussion. In order to make sure that the data in the other RDBMS remains in sync with SQL server, a process is defined which queries SQL Server’s CDC tables using function cdc.fn_cdc_get_net_changes to see what changes those tables have and then it applies those changes to other RDBMS real time. So both the RDBMS’s always remain in sync with each other. A job has been scheduled to send emails if these RDBMS’s become out of sync.

So finally the Monday begins and I had no realization that the storm was brewing. I was still under weekend’s hangover when I logged on to my system. Hangover was torn apart in to pieces by a plethora of emails I saw in my Inbox. What did the emails say? SQL Server is out of sync with other RDBMS and the lag was huge. Clients were screaming and my desk soon looked like a honeycomb surrounded by bees.

What could have happened? I had no idea so I started to investigate. I let my memory rewind to think about Friday and I recalled that the only thing changed was the addition of third replica to Always ON configuration. Did this set the house on fire? Yes, indeed. Below is why:

One thing I did not do after adding the third replica was I did not bring the databases on third replica in sync with other replicas. I did not restore any databases on Friday. So what? Well, the thing to note about Change Data Capture is that it uses Transactional replication’s log reader agent. By default the log reader will not process log records that have not already been hardened at all availability group secondary replicas. So because the database was not restored on the new secondary instance, the logs were not able to harden, log reader agent was not able to process the records and hence Change Data Capture stopped working. This phenomenon is explained in detail here.

What this link also talks about is the failover situation in Synchronous and Asynchronous commit mode. If there is a catastrophic event, primary replica node gets compromised and the databases failover to secondary replica. As the old primary is not available, the logs will not be able to harden on old primary (new secondary) so Change Data Capture will not work. If you have Always ON configured in Asynchronous-Commit mode, Then you can enable trace flag 1448 and log reader will keep processing the log records. This trace flag has no effect if the Always ON is configured in synchronous-commit mode. In order to make sure log reader keeps on processing the log records in synchronous-commit mode, you have to take the database out of availability group till you fix the old primary node or remove the faulty replica from Always ON configuration.

Everything said above to handle failover situation is manual but there needs to be a way to automate this stuff so that it can be handled without waiting for human intervention. I wrote a stored procedure to automate all this. This stored procedure is scheduled to run every 5 mins via a SQL Server Agent job on all the Always ON replicas. This will ensure that in the event of any issue with Always ON, Change Data Capture continues to work. The procedure accepts two parameters which are Availability Group name and Database name (which is CDC enabled database for our purpose). It is assumed that the CDC capture and clean up jobs are already created on all the replicas. This is a very basic stored procedure and this can be improved with more additions based upon the requirement. The definition of this Stored Procedure is as given below:

   
USE [master]
GO

/****** Object:  StoredProcedure [dbo].[SP_TrackCDCStatus]    Script Date: 14/09/2016 14:20:33 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO


Create PROCEDURE [dbo].[SP_TrackCDCStatus]
@AG_Name varchar(50),
@DatabaseName sysname
as

/******************************
** Name: SP_TrackCDCStatus.sql
** Auth: Yogeshwar Phull
** Date: 23/08/2016
**************************
*******************************/

Begin
Declare @InstanceName varchar(20)=@@SERVERNAME
Declare @Role varchar(20)
Declare @AG_GroupId varchar(40)
Declare @Health varchar(40)
Declare @AO_Mode tinyint
DECLARE @Status_P1 tinyint
DECLARE @Status_P2 tinyint
DECLARE @Status_S1 tinyint
DECLARE @Status_S2 tinyint
Declare @count_P1 tinyint
Declare @count_S1 tinyint
Declare @count_S2 tinyint
Declare @CMD nvarchar(200)
Declare @Job_1 nvarchar(50)='cdc.'+@DatabaseName+'_capture'
Declare @Job_2 nvarchar(50)='cdc.'+@DatabaseName+'_cleanup'
Declare @body1 varchar(200)
Declare @body2 varchar(200)
Declare @body3 varchar(200)
Declare @subject1 varchar(100)
Set @body1='Synchrnonization state of some databases is not healthy on availability group '+@AG_Name+'. Please take necessary action'
Set @body2='Synchrnonization state of CDC enabled database '+ @DatabaseName+' is not healthy. Trace flag 1448 has been enabled to ensure continuous working of Change Data Capture. Please investigate'
Set @body3='Synchrnonization state of CDC enabled database '+ @DatabaseName+' is not healthy. It has been removed from availability group '+@AG_Name+'. Please investigate'
Set @subject1='Always ON notification - '+@InstanceName


SELECT distinct @Role=C.role_desc, @AG_GroupId=C.group_id, @AO_Mode=D.availability_mode
FROM
sys.availability_groups_cluster AS A
INNER JOIN sys.dm_hadr_availability_replica_cluster_states AS B
ON
B.group_id = A.group_id
INNER JOIN sys.dm_hadr_availability_replica_states AS C
ON
C.replica_id = B.replica_id
INNER JOIN sys.availability_replicas AS D
ON
A.group_id = D.group_id
WHERE
B.replica_server_name=@InstanceName and A.name=@AG_Name

If @Role='PRIMARY'
	Begin
		Declare @Iter int=1 
			WHILE (@Iter <= 5)
				BEGIN
					Select @Health=synchronization_health from sys.dm_hadr_availability_replica_states where role=2 and group_id=@AG_GroupId
						If @Health=2
							break
						Else
							WAITFOR DELAY '00:03:00'
							set @Iter=@Iter+1
				END

			IF @Health<>2
				BEGIN
					Declare @Count int
					Create table #temp_sync
					(database_id int, synchronization_state int)
						
						SET NOCOUNT ON
						Insert #temp_sync 
						select distinct database_id, synchronization_state from sys.dm_hadr_database_replica_states
						where synchronization_state not in (1,2)


						Select @count=count (1) from #temp_sync 
							
							IF @count>0
							Begin
								EXEC msdb.dbo.sp_send_dbmail  
									 @profile_name = 'DBA',  
									 @recipients = 'DBA@email.com',  
									 @body = @body1,  
									 @subject = @subject1 
							End
							IF exists (select * from #temp_sync where database_id=db_id(@DatabaseName))
							
								BEGIN
									IF @AO_Mode=0
										Begin
										DBCC TRACEON (1448, -1)
												EXEC msdb.dbo.sp_send_dbmail  
												  @profile_name = 'DBA',  
												  @recipients = 'DBA@email.com',  
												  @body = @body2,  
												  @subject = @subject1
										End	

									IF @AO_Mode=1
										Begin
										set @CMD = 'Alter availability group [' +@AG_Name+'] remove database '+'['+@DatabaseName+']'
										exec sp_executesql @CMD
												EXEC msdb.dbo.sp_send_dbmail  
												  @profile_name = 'DBA',  
												  @recipients = 'DBA@email.com',  
												  @body = @body3,
												  @subject = @subject1
										End
								END
					Drop table #temp_sync
				END
	
  
			ELSE
				BEGIN
					select @Status_P1 =enabled from msdb..sysjobs where name like @Job_1
					select @Status_P2 =enabled from msdb..sysjobs where name like @Job_2
						If @Status_P1=1 
							Begin
								SELECT @count_P1=count(1)
								FROM 
								msdb.dbo.sysjobactivity AS a
								INNER JOIN 
								msdb.dbo.sysjobs AS b 
								ON 
								a.job_id = b.job_id
								WHERE a.start_execution_date IS NOT NULL
								AND a.stop_execution_date IS NULL
								AND b.name = @Job_1
		
										If @count_P1<>1
											EXEC msdb.dbo.sp_start_job @Job_1 
							END

						Else
							Begin
								EXEC msdb.dbo.sp_update_job @job_name=@Job_1,@enabled = 1
								EXEC msdb.dbo.sp_start_job @Job_1
							END

						If @Status_P2<>1
								 EXEC msdb.dbo.sp_update_job @job_name=@Job_2,@enabled = 1
				END
  END

  ELSE
  
		Begin
				select @Status_S1 =enabled from msdb..sysjobs where name like @Job_1
				select @Status_S2 =enabled from msdb..sysjobs where name like @Job_2

					If @Status_S1=1 
						Begin
							SELECT @count_S1=count(1)
							FROM msdb.dbo.sysjobactivity AS a
							INNER JOIN msdb.dbo.sysjobs AS b 
							ON 
							a.job_id = b.job_id
							WHERE a.start_execution_date IS NOT NULL
							AND a.stop_execution_date IS NULL
							AND b.name = @Job_1
						End
				
				    If @count_S1=1
							EXEC msdb.dbo.sp_stop_job @Job_1
							EXEC msdb.dbo.sp_update_job @job_name=@Job_1,@enabled = 0
		

   

				If @Status_S2=1 
						Begin
							SELECT @count_S2=count(1)
							FROM msdb.dbo.sysjobactivity AS a
							INNER JOIN msdb.dbo.sysjobs AS b 
							ON
							a.job_id = b.job_id
							WHERE a.start_execution_date IS NOT NULL
							AND a.stop_execution_date IS NULL
							AND b.name = @Job_2 
						END
				If @count_S2=1
						EXEC msdb.dbo.sp_stop_job @Job_2
						EXEC msdb.dbo.sp_update_job @job_name=@Job_2,@enabled = 0
  
		END
END


GO

Below are the details of what this Stored procedure does:
1. The values of current replica role, Availability Group Id and Availability Mode are assigned to defined variables.
2. If the replica role is Primary, It checks the state of secondary replica as seen from primary.
a. If secondary replica is not healthy (it checks the state of secondary replica 5 times in 3 minutes interval)
a1. It checks how many databases are not in sync. If there are any databases out of sync, it sends an email saying there are databases out of sync.
a2. Then it looks for the CDC database that was passed as parameter.
->If CDC enabled database is out of sync and availability mode is asynchronous, it enables Trace Flag 1448 and sends related email.
->If CDC enabled database is out of sync and availability mode is synchronous, it removes the database from availability group and sends related email
b. If secondary replica is healthy, it checks the status of capture and cleanup job on primary.
b1. If the jobs are disabled, it enables those and starts the capture job
b2. If the jobs are enabled, it checks whether capture is job is running, if not running , it starts the capture job.

3. If the replica role is secondary, It checks the status of capture and clean up job. If the jobs are enabled, it disables the jobs and stops the capture job if it is running.

So, the bottom line is that Always ON and Change Data Capture are not Friends inherently but you can be a mediator to help them shake hands and hug each other.

And, I accept that we missed a very basic step in the excitement of weekend which caused all this blunder but lessons learnt from such outages sit in memory for long (longer than book reading). Waiting for the next one to knock the door 🙂

Cheers!!

Regards,
Yogeshwar Phull

Like us on FaceBook | Join the fastest growing SQL Server group on FaceBook
https://www.thejerusalemfund.org/ativan had a mild hypnotic effect on my son. During the day I did not notice any drowsiness, he just became calmer and more balanced…
Follow me on Twitter | Follow me on FaceBook
————————————

   

About Yogeshwar Phull

I have 9 years of experience as a Microsoft SQL Server Database Administrator and presently working with Kurtosys Systems India Private Limited. I like speaking and writing about Microsoft SQL Server.

View all posts by Yogeshwar Phull →

3 Comments on “SQL Server Always ON and Change Data Capture: Friends or Foes?”

  1. HI Yogesh

    thanks for the SP , very handy one.

    may I know here would I be passing the CDC DB name and the server name exactly in the above script pls

    cheers
    AJ

Leave a Reply

Your email address will not be published.