Bullet-proofing Your RTO and RPO by Warner Chaves

This article first appeared in the SQLServerGeeks Magazine.
Author: Warner Chaves.
Subscribe to get your copy.

If you are a DBA then you are surely familiar with the terms Recovery Time Objective (RTO) and Recovery Point Objective (RPO). These terms have been used for decades to define service level agreements for IT systems all around the world.
Relational database systems such as SQL Server provide mechanisms to configure, manipulate and control the behaviour of your database system so that you can meet your RTO and RPO requirements.
This is not a new topic and so you must be thinking to yourself that you already know everything about it and I’m beating on a dead horse. However, I recurringly run into incidents where everyone thought the RTO and RPO would be easily met and it ended up being missed. Sometimes by a very large margin.
To this effect, I decide to compile some of the cases and scenarios to watch out for when working with SQL Server and Azure SQL Db to make sure that when you commit to those RTO and RPO requirements, you will be able to meet them guaranteed!

Let’s start with RTO. This one is all about how fast you can get back in business and that is why it’s defined as a time objective. For production systems this could be hours down to seconds and for non-production systems it could be hours, sometimes even days. To meet the RTO you will implement not only a backup strategy but also a High Availability (HA) and Disaster Recovery (DR) strategy if necessary.

Restore Testing
If you depend on your backups completely for your RTO, then at the very least you need to have a restore testing strategy. There are many ways to go around this:
1. Restore job that refreshes a development copy.
2. Restore job into some other idle infrastructure like a DR server.
3. An automatic process that boots up or creates a VM, runs the restore test then powers down.
Ideally the restore should also include a consistency check to cover all the bases and make sure the data is 100% recoverable.
An important recommendation I often make is to test different restore file sequences to see which one leads to the fastest recovery. For example, Full + Diff + Logs versus Full + Logs.

Even though usually the least files to restore means less time, it might not necessarily be so and different file types with the same amount of files to apply will also give you different times. Restore time depends on the sizes of the backups and the number of operations done during the timespan covered by those files so there is always some level of uncertainty. For this reason, I still recommend you test the different restore file sequences to figure out which one will be the fastest and do it on a regular schedule.
Now you might be thinking, I have an HA configuration and I don’t depend on my backups for my RTO, I have a cluster and multiple nodes, etc. In this case you still need to be mindful of your backup-based RTO because what happens if someone either maliciously or accidentally deletes data or drops a table? Your Availability Group will immediately replicate this destructive change to the rest of the cluster! And back to the backups you will have to go.

Crash Recovery Testing
The other common event that will test your RTO is when something unexpected happens to your SQL Server instance or Availability Group. Then SQL Server runs crash recovery and you might be waiting for a long time to have your database 100% available because recovery is taking longer than expected. This can happen when:
1. The SQL Server instance is restarted.
2. A SQL Server Failover Cluster Instance fails over.
3. An Availability Group fails over.
4. A long transaction is cancelled.
I have seen this happen countless times. For example, a DBA will have a two-node Availability Group and set them to synchronous replication and so they think that any failover will be near instantaneous and have extremely low RTO. Unfortunately, this is not the case at all, as the synchronous process only guarantees remote log hardening, not actually replaying the log. If there are a lot of operations to be replayed, you will still be waiting.

Now in SQL Server 2019 and Azure SQL Db, Microsoft introduced Accelerated Database Recovery. This is a new feature that changes the recovery process of SQL Server databases to dramatically accelerate the time for the database to be 100% available. However, even though it is a dramatic improvement over previous versions, you should still test and monitor your fail-overs to make sure you meet your RTO, even with Accelerated Database Recovery enabled.

Azure SQL Db RTO
Azure SQL Db offers both backup recovery as well as HA/DR options that can assist in meeting RTO. For backup options you can do local backup as well as geo-restore backups in a paired Azure region. Local backups are a size-of-data operation so the time to restore will depend on your database size. If you are using the newer Hyperscale model then it’s not a size of data operation but a few minutes regardless of the size. For either case, I still suggest to do tests regularly to get a good idea of timings.
If you are using the Auto-failover-group capability then you need to be aware that Microsoft will not automatically failover your databases at the first sign of an issue but it can take up to 1 hour of continuous outage before they trigger the failover. If this is not acceptable for your requirements then you need to roll your own monitoring and do your own manual failover if needed.
One final gotcha is that Microsoft offers a 30 second database failover when invoked by the user but this is not backed by a formal SLA except when running the Business Critical tier of Azure SQL Db.
The following table summarizes these options:

Recovery Point Objective is the amount of data in terms of time that your SLA allows to be lost in the event of an outage. Unlike RTO, the RPO is usually measured in smaller units could be hours down to seconds. Most DBAs will implement their RPO with a combination of log backups as well as HA/DR technologies that continuously replicate database operations.
The biggest gotcha in understanding RPO is that it is not only about being able to recover to the last X minutes or seconds but also about recovering to a specific point in time in the backup history that is considered “active” by the business. Going back to the example of someone maliciously or accidentally doing data changes, your last log backups happening at a high frequency will not protect you if the change happened a few days ago and you already deleted those log backups to save on backup storage.

Monitor Backup Time
Controlling the backup RPO is also not about simply setting the log backup schedule. It is also about making sure that the log backups are always able to meet their frequency. For this I recommend setting up monitoring jobs that track and baseline how long the backup jobs are taking, as well as alert when a backup job runs longer than expected.

In the image above, even though the schedule is every 15 minutes, the 3pm run of the backup does not happen because the 2:30 job is still running and it will not get triggered again until 3:30. RPO would not be met in case the server had an irrecoverable problem in the middle of this backup run.

Availability Groups
For some requirements, the RPO of log backups is not sufficient and a technology like Availability Groups is implemented to provide an even smaller RPO. The usual configuration is to have log backups running as well as one or more synchronous replicas of the primary database.

In the configuration above, the system is taking log backups every 15 minutes and keeping one sync copy with the primary. The biggest danger to your RPO here is that any type of issue stops the synchronous replication and leaves you open for data loss. This can happen for multiple reasons but the most common one is that temporary network issues disconnect the nodes and then the application continues to perform changes on the primary while the backlog of changes to apply to the secondary keeps growing. As an added issue, if the log starts to grow due to records accumulating, your log backups will start taking longer and possibly slide over your RPO time window.
A common misconception is that in a synchronous two node cluster, if the secondary replica is disconnected, the primary SQL Server will simply stop accepting changes (sacrificing availability for RPO) but that is not the case. I recommend having robust Availability Group monitoring in place either via 3rd party tools or home-grown scripts to detect these issues.

Azure SQL Db RPO
In terms of RPO, Azure SQL Db takes log backups every 5 to 10 minutes, depending on the compute size and the amount of database activity. Based on the geo-restore RPO we can see that backup files must be shipped between paired regions at least once every hour. And similar to the RTO, the RPO SLA only applies to Business-Critical tier. In order to maintain this SLA Microsoft has throughput limits on the log to guarantee that the replica never falls to far behind
Here is the summary for Azure SQL Db RPO:

When you are in charge of data you must take that responsibility with the level of seriousness that it truly requires. Guaranteeing specific levels of RTO and RPO are part of this responsibility and you need to do everything in your power to make sure you can meet them. These cases and scenarios shared here can help you cover some gaps in your general strategy so that you are never caught off-guard. And remember, even if you are in the managed Azure SQL Db cloud offering, you should still verify your configuration and test your procedures to make sure your RPO and RTOs are being met.

Thanks for reading!

This article first appeared in the SQLServerGeeks Magazine.
Author: Warner Chaves.
Subscribe to get your copy.

Leave a Reply

Your email address will not be published. Required fields are marked *