We, as a DBA, have many times been in a situation where your log reader goes into latency and no matter how much you troubleshoot it come back when it should and there is nothing you can do about it. Yes, as per the design on SQL Server Replication log reader there very little you can offer when you go into the latency of log reader agent. So in this blog I would like to explain how the log reader actually works to understand why it goes into latency and how you can prevent (yes the best solution) it from that situation.
The log reader agent is created each for one published database for any number of publications on that database. This runs on distributor server and connects to the publisher. It read the transaction log and inserts the commands into the distribution database for transactions happened on the published articles. So the whole architecture of this agent is designed around the efficiency to read transaction log which is sequential.
Before we go ahead with the explanation lets setup the databases, tables and replication.
--Creating the Publisher Database USE [master] GO IF EXISTS(SELECT 1 FROM sys.databases WHERE name = 'Repl_Pub') DROP DATABASE Repl_Pub GO CREATE DATABASE [Repl_Pub] --Creating the Subscriber Database USE [master] GO IF EXISTS(SELECT 1 FROM sys.databases WHERE name = 'Repl_Sub') DROP DATABASE Repl_Sub GO CREATE DATABASE [Repl_Sub] --Create two tables for replication USE [Repl_Pub] GO CREATE TABLE pubLogger_tbl (eId INT PRIMARY KEY CLUSTERED, eName NVARCHAR(10)) GO CREATE TABLE pubTrans_tbl (eId INT PRIMARY KEY CLUSTERED, eName NVARCHAR(10)) GO
Now setup replication with the two tables in Repl_Pub database using SSMS by following the steps in below links under SQL Server Management Studio options.
Now let us start to understand how the log reader works. When you add an article in a publication it doesn’t get marked for replication unless there is a subscriber which subscriber to this article. Once the article is subscribed by a subscriber, transactions on that article is marked for replication. It is nothing but a flag on the log record specifying if this should be read by the leader agent. Once the whole transaction is read by the log reader it clears this flag so that the log records then be truncated and the log can be reused. These tasks are achieved by two procedures. The first one to scan the log for transaction is done using sp_replcmds. The second task to clear the replication flag is achieved using sp_repldone.
This looks pretty straight forward when you have nice small transactions which complete on time and there is no latency. What happens when there is a long open transaction T1? Will the other transaction T2, which is committed be delivered which started after T1 (still not committed)? What about transactions for other publication on same database? As there is single log reader agent for each database will they also suffer with latency? Think for getting few answers before you read any further.
The following steps will get you to the answer. We will run an open transaction on one of the tables and run few other transactions on the other table while the first transaction is still not committed. Remember that we are having both these articles in a single publication.
Open Huge Transaction
use Repl_Pub GO begin tran DECLARE @i INT = 0 while (@i < 2000000) BEGIN INSERT INTO pubLogger_tbl SELECT @i, REPLICATE(CHAR(65 + (@i%26)),10) SET @i = @i + 1 END --commit --rollback
Batches of small transactions (To be run multiple times as needed)
USE Repl_Pub GO DECLARE @i INT SELECT @i = CASE WHEN MAX(eId) IS NULL THEN 0 ELSE MAX(eId) + 1 END FROM pubTrans_tbl INSERT INTO pubTrans_tbl SELECT @i, REPLICATE(CHAR(65 + (@i%26)),10)
Now run the below script to check the commands read into the distribution database.
use [distribution] GO SELECT count(*) #count,art.article ,publication FROM MSrepl_commands cmds WITH (NOLOCK) INNER JOIN MSarticles art WITH (NOLOCK) ON art.article_id = cmds.article_id INNER JOIN MSpublications pub WITH (NOLOCK) ON art.publication_id=pub.publication_id GROUP BY art.article,pub.publication
You will observe the results have commands for the article which had small transactions. So even though the first transaction started before the second one the commands are read while the first transaction is still running. So no latency while you have the open transaction.
So when do you have the latency? Now commit the huge transaction and run few small transaction immediately. Run the monitor command on distribution database and you will observe the commands from the second small transaction are not read and it will be reading commands of the first huge transaction. And the latency will be still very less while this transaction is being read.
After a while when you see the first huge transaction is all read then you will see the latency increases for the log reader. So to draw our conclusion, the latency is calculated based on past transactions and not the currently read transactions. If you had a huge transaction which lead to log reader latency that means you will see the latency after the commands from that transaction are read and the log reader estimates that future transactions may take this much time based on history. Below are few formulas used for calculating the latency.
delivery_rate = (delivered_commands/delivery_time) * 1000
average_commands = delivered_commands/delivered_transactions
delivery_latency = (current_latency + last_delivery_latency)/2
So according to these formulas, one huge off beat transaction can screw up all your latency figures. So what is the best way to track the latency? We cannot arrive at exact figures but you can always check the pending log to be read from the publication database. This doesn’t give you the time to read but will tell you how much log is holding up and does that lead to space issues on my publication database log file.
So, these are the takeaways from this blog.
- When you are stuck with log reader latency there is no fast way out
- You can take measures to make the inserts into MSrepl_commands faster by keeping a check on fragmentation and size
- Identify the long running transaction by getting the article which had highest number of commands when the latency started
- From MSlogreader_history identify the time when the latency increased. The long running transaction would have committed during this time
- Scan log from backups or t-log to check the transaction id and with combination of session data if collected will help to get the root.
- Tune the long running transaction: Batching, Add Right Filters to Update commands.