SQL Server Replication Log Reader Why and How latency occurs

We, as a DBA, have many times been in a situation where your log reader goes into latency and no matter how much you troubleshoot it come back when it should and there is nothing you can do about it. Yes, as per the design on SQL Server Replication log reader there very little you can offer when you go into the latency of log reader agent. So in this blog I would like to explain how the log reader actually works to understand why it goes into latency and how you can prevent (yes the best solution) it from that situation.

The log reader agent is created each for one published database for any number of publications on that database. This runs on distributor server and connects to the publisher. It read the transaction log and inserts the commands into the distribution database for transactions happened on the published articles. So the whole architecture of this agent is designed around the efficiency to read transaction log which is sequential.

Before we go ahead with the explanation lets setup the databases, tables and replication.

--Creating the Publisher Database
USE [master]
GO

IF EXISTS(SELECT 1 FROM sys.databases WHERE name = 'Repl_Pub')
DROP DATABASE Repl_Pub
GO

CREATE DATABASE [Repl_Pub]

--Creating the Subscriber Database
USE [master]
GO

IF EXISTS(SELECT 1 FROM sys.databases WHERE name = 'Repl_Sub')
DROP DATABASE Repl_Sub
GO

CREATE DATABASE [Repl_Sub]

--Create two tables for replication
USE [Repl_Pub]
GO

CREATE TABLE pubLogger_tbl (eId INT PRIMARY KEY CLUSTERED,
			    eName NVARCHAR(10))
GO

CREATE TABLE pubTrans_tbl (eId INT PRIMARY KEY CLUSTERED,
			   eName NVARCHAR(10))
GO

Now setup replication with the two tables in Repl_Pub database using SSMS by following the steps in below links under SQL Server Management Studio options.

Distributor and Publisher

Subscriber

Now let us start to understand how the log reader works. When you add an article in a publication it doesn’t get marked for replication unless there is a subscriber which subscriber to this article. Once the article is subscribed by a subscriber, transactions on that article is marked for replication. It is nothing but a flag on the log record specifying if this should be read by the leader agent. Once the whole transaction is read by the log reader it clears this flag so that the log records then be truncated and the log can be reused. These tasks are achieved by two procedures. The first one to scan the log for transaction is done using sp_replcmds. The second task to clear the replication flag is achieved using sp_repldone.

This looks pretty straight forward when you have nice small transactions which complete on time and there is no latency. What happens when there is a long open transaction T1? Will the other transaction T2, which is committed be delivered which started after T1 (still not committed)? What about transactions for other publication on same database? As there is single log reader agent for each database will they also suffer with latency? Think for getting few answers before you read any further.

The following steps will get you to the answer. We will run an open transaction on one of the tables and run few other transactions on the other table while the first transaction is still not committed. Remember that we are having both these articles in a single publication.

Open Huge Transaction

use Repl_Pub
GO

begin tran

DECLARE @i INT = 0
while (@i < 2000000)
BEGIN

	INSERT INTO pubLogger_tbl
		SELECT @i, REPLICATE(CHAR(65 + (@i%26)),10)
	SET @i = @i + 1

END

--commit
--rollback

Batches of small transactions (To be run multiple times as needed)

USE Repl_Pub
GO

DECLARE @i INT
SELECT @i = CASE WHEN MAX(eId) IS NULL THEN 0
			ELSE MAX(eId) + 1 END
	FROM pubTrans_tbl

INSERT INTO pubTrans_tbl
		SELECT @i, REPLICATE(CHAR(65 + (@i%26)),10)

Now run the below script to check the commands read into the distribution database.

use [distribution]
GO

SELECT count(*) #count,art.article  ,publication
FROM MSrepl_commands cmds WITH (NOLOCK)
INNER JOIN MSarticles art WITH (NOLOCK) 
      ON art.article_id = cmds.article_id
INNER JOIN MSpublications pub WITH (NOLOCK)
      ON art.publication_id=pub.publication_id
GROUP BY art.article,pub.publication

You will observe the results have commands for the article which had small transactions. So even though the first transaction started before the second one the commands are read while the first transaction is still running. So no latency while you have the open transaction.

So when do you have the latency? Now commit the huge transaction and run few small transaction immediately. Run the monitor command on distribution database and you will observe the commands from the second small transaction are not read and it will be reading commands of the first huge transaction. And the latency will be still very less while this transaction is being read.

After a while when you see the first huge transaction is all read then you will see the latency increases for the log reader. So to draw our conclusion, the latency is calculated based on past transactions and not the currently read transactions. If you had a huge transaction which lead to log reader latency that means you will see the latency after the commands from that transaction are read and the log reader estimates that future transactions may take this much time based on history. Below are few formulas used for calculating the latency.

delivery_rate = (delivered_commands/delivery_time) * 1000

average_commands = delivered_commands/delivered_transactions

delivery_latency = (current_latency + last_delivery_latency)/2

So according to these formulas, one huge off beat transaction can screw up all your latency figures. So what is the best way to track the latency? We cannot arrive at exact figures but you can always check the pending log to be read from the publication database. This doesn’t give you the time to read but will tell you how much log is holding up and does that lead to space issues on my publication database log file.

So, these are the takeaways from this blog.

When you are stuck with log reader latency there is no fast way out
You can take measures to make the inserts into MSrepl_commands faster by keeping a check on fragmentation and size
Identify the long running transaction by getting the article which had highest number of commands when the latency started
From MSlogreader_history identify the time when the latency increased. The long running transaction would have committed during this time
Scan log from backups or t-log to check the transaction id and with combination of session data if collected will help to get the root.
Tune the long running transaction: Batching, Add Right Filters to Update commands.

Happy Learning,

Manu

Like us on FaceBook | Join the fastest growing SQL Server group on FaceBook |
Follow me on Twitter | Follow me on FaceBook

3 Comments on “SQL Server Replication Log Reader Why and How latency occurs”

Lakshman says:

April 15, 2015 at 5:20 am

Great article Manu!

SANTHOSH says:

April 15, 2015 at 1:31 pm

Well explained. Thanks Manu

Chetan says:

April 16, 2015 at 12:06 pm

It is good to know how replication latency is calculated.

A lot of times Log file fragmentation is the cause for slow log reader performance. It is very important to keep VLF count in check to avoid issues with log reader latency.

In some situations, it also becomes necessary to adjust the readbatchsize of log reader agent.

SQL Server Replication Log Reader Why and How latency occurs

About Manohar Punna

3 Comments on “SQL Server Replication Log Reader Why and How latency occurs”

Leave a Reply Cancel reply

Related Posts

SQL Server Query Cost, Memory Grant & SQLRESERVATIONS Clerk

SQL Server Memory Troubleshooting

Performance Monitor Counters

About Manohar Punna

3 Comments on “SQL Server Replication Log Reader Why and How latency occurs”

Leave a Reply Cancel reply