As I work with Availability Groups (AG), I’m amazed at what all SQL Server does behind the scenes to make them work… and to make them work fast and seamlessly. However, you still need to monitor them, and sometimes take corrective action. Wouldn’t it be great if you could have SQL Server take care of these Availability Group issues? This post will show you a couple of problems that I have seen, and solutions to automatically take care of those issues.
Availability Group issues – Suspended Data Movement
The first of the Availability Group issues to discuss is that, for whatever reason, data is no longer moving between the primary replica and a secondary replica. This puts the Data Movement in a Suspended state.
If the data movement remains suspended for too long, you might have to take some undesired actions to get things back in sync. Things like removing the database from the AG, restoring log files, then reattaching it to the AG. When the data movement becomes suspended, we want to get it flowing again as soon as possible. Let’s have SQL Server try to get the data flowing again.
Resuming Data Movement
It’s normally very simple to resume the data movement – just issue the T-SQL statement:
Resuming data movement manually
Transact-SQL
1
ALTERDATABASE[DB]SETHADRRESUME;
What you need to know is that data movement has been suspended, and for which database. Remember, the time it takes to respond will determine how drastic a measure you need to go through to get the database(s) back in sync. When data movement becomes suspended, SQL Server raises an error (35264), and in sys.messages it reads:
AlwaysOn Availability Groups data movement for database ‘%.*ls’ has been suspended for the following reason: “%S_MSG” (Source ID %d; Source string: ‘%.*ls’). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online.
Automating the Resumption of Data Movement with SQL Server Agent Alerts
To automate resuming data movement, create a SQL Agent Alert for this error that runs a job. To determine which databases have suspended data movement, we hit the system Dynamic Management Views (DMV). The DMV sys.dm_hadr_database_replica_states has a column (is_suspended) that will tell you if that database is in a suspended state. Let’s create the job and alert:
Job script to resume data movement
This job gets all the databases in an AG on this instance and that is in the suspended state. It dynamically creates and runs the T-SQL statement to resume the data movement.
With the job created, you need to set up an alert to trap the error and run the job:
Trapping suspended data movement with an alertConfigure alert to run the job
When suspended data movement on any database, SQL Server will trap the error and run the job. The job should start the data movement flowing again.
AlwaysOn Availability Groups data movement for database '%.*ls' has been suspended for the following reason: "%S_MSG" (Source ID %d; Source string: '%.*ls'). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online.
The second of the Availability Group issues is that when an availability group fails over, other actions may need to be performed. A third-party application might need some data updated in the database to reflect what the primary replica is. Perhaps the cluster witness configuration needs adjusting. Maybe services need turning off on the old primary replica.
SQL Server raises an error (1480) when the AG fails over. This occurs in both the new and old primary replica instances. The text of this error from sys.messages reads:
The %S_MSG database “%.*ls” is changing roles from “%ls” to “%ls” because the mirroring session or availability group failed over due to %S_MSG. This is an informational message only. No user action is required.
Note that this error is raised for each database in the AG.
On each replica, the failover process goes through three roles. The old primary replica will go from “PRIMARY” to “RESOLVING” to “SECONDARY”, while the old secondary will go from “SECONDARY” to “RESOLVING” to “PRIMARY”. These roles will show up in the above error message and used to tune the alert.
There are three conditions that we could potentially alert for. Firstly is that a failover did occur. Secondly is that the failover occurred, and this instance is now the primary replica. Thirdly is that a failover occurred, and this instance is now a secondary replica. The following script will create alerts for the last two conditions:
AG Failover detected
Transact-SQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
/*
Use this event to run a job when the replica becomes primary
*/
USE[msdb]
GO
EXECmsdb.dbo.sp_add_alert@name=N'AG Failover Detected - Now Primary',
@message_id=1480,
@severity=0,
@enabled=1,
@delay_between_responses=0,
@include_event_description_in=0,
@event_description_keyword=N'"RESOLVING" to "PRIMARY"',
@job_id=N'00000000-0000-0000-0000-000000000000'
GO
/*
Use this event to run a job when the replica becomes secondary
*/
USE[msdb]
GO
EXECmsdb.dbo.sp_add_alert@name=N'AG Failover Detected - Now Secondary',
@message_id=1480,
@severity=0,
@enabled=1,
@delay_between_responses=0,
@include_event_description_in=0,
@event_description_keyword=N'"RESOLVING" to "SECONDARY"',
@job_id=N'00000000-0000-0000-0000-000000000000'
GO
With the alert in place, you can modify it to run a job to do the actions that you need to perform. If you just need to know that a failover did occur, you can remove the message text to alert on.
Configuring alert to detect AG failover
Summary
In this post, we have learned how to use the SQL Server Agent Alerting system to capture and fix Availability Group issues and to respond to them by running jobs.
#1 | Written by Stuart Davis about 5 years ago.Reply
Why do you need all this cursor gubbins? you can just concatenate all the suspended databases into a single string!
select @sql += ‘alter database [‘ + Db_Name(database_id) + ‘] set hadr resume; ‘
from sys.dm_hadr_database_replica_states drs
join sys.availability_replicas ar on ar.replica_id=drs.replica_id
where ar.replica_server_name=@@ServerName and drs.is_suspended=1
To provide the best experiences, we use technologies like cookies to store and/or access device information. Consenting to these technologies will allow us to process data such as browsing behavior or unique IDs on this site. Not consenting or withdrawing consent, may adversely affect certain features and functions.
Functional
Always active
The technical storage or access is strictly necessary for the legitimate purpose of enabling the use of a specific service explicitly requested by the subscriber or user, or for the sole purpose of carrying out the transmission of a communication over an electronic communications network.
Preferences
The technical storage or access is necessary for the legitimate purpose of storing preferences that are not requested by the subscriber or user.
Statistics
The technical storage or access that is used exclusively for statistical purposes.The technical storage or access that is used exclusively for anonymous statistical purposes. Without a subpoena, voluntary compliance on the part of your Internet Service Provider, or additional records from a third party, information stored or retrieved for this purpose alone cannot usually be used to identify you.
Marketing
The technical storage or access is required to create user profiles to send advertising, or to track the user on a website or across several websites for similar marketing purposes.