As I work with Availability Groups (AG), I’m amazed at what all SQL Server does behind the scenes to make them work… and to make them work fast and seamlessly. However, you still need to monitor them, and sometimes take corrective action. Wouldn’t it be great if you could have SQL Server take care of these Availability Group issues? This post will show you a couple of problems that I have seen, and solutions to automatically take care of those issues.
Availability Group issues – Suspended Data Movement
The first of the Availability Group issues to discuss is that, for whatever reason, data is no longer moving between the primary replica and a secondary replica. This puts the Data Movement in a Suspended state.
If the data movement remains suspended for too long, you might have to take some undesired actions to get things back in sync. Things like removing the database from the AG, restoring log files, then reattaching it to the AG. When the data movement becomes suspended, we want to get it flowing again as soon as possible. Let’s have SQL Server try to get the data flowing again.
Resuming Data Movement
It’s normally very simple to resume the data movement – just issue the T-SQL statement:
1 |
ALTER DATABASE [DB] SET HADR RESUME; |
What you need to know is that data movement has been suspended, and for which database. Remember, the time it takes to respond will determine how drastic a measure you need to go through to get the database(s) back in sync. When data movement becomes suspended, SQL Server raises an error (35264), and in sys.messages it reads:
AlwaysOn Availability Groups data movement for database ‘%.*ls’ has been suspended for the following reason: “%S_MSG” (Source ID %d; Source string: ‘%.*ls’). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online.
Automating the Resumption of Data Movement with SQL Server Agent Alerts
To automate resuming data movement, create a SQL Agent Alert for this error that runs a job. To determine which databases have suspended data movement, we hit the system Dynamic Management Views (DMV). The DMV sys.dm_hadr_database_replica_states has a column (is_suspended) that will tell you if that database is in a suspended state. Let’s create the job and alert:
This job gets all the databases in an AG on this instance and that is in the suspended state. It dynamically creates and runs the T-SQL statement to resume the data movement.
With the job created, you need to set up an alert to trap the error and run the job:
When suspended data movement on any database, SQL Server will trap the error and run the job. The job should start the data movement flowing again.
Wrapping it all up in a script
The T-SQL script to create the job and alert is:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 |
/****************************************************************************** Create job to resume AG database data movement for suspended databases. Create alert to catch when data movement has been suspended, and run the job. ******************************************************************************* MODIFICATION LOG ******************************************************************************* 2018-11-20 WGS Initial creation. ******************************************************************************/ USE [msdb] GO BEGIN TRANSACTION; DECLARE @ReturnCode INT; SELECT @ReturnCode = 0;DECLARE @jobname sysname = N'AG - Resume Data Movement'; DECLARE @categoryname sysname = 'HADR-Availability Group'; IF EXISTS (SELECT name FROM dbo.sysjobs where name = @jobname) BEGIN EXECUTE msdb.dbo.sp_delete_job @job_name = @jobname; END; IF NOT EXISTS (SELECT name FROM msdb.dbo.syscategories WHERE name=@categoryname AND category_class=1) BEGIN EXECUTE @ReturnCode = msdb.dbo.sp_add_category @class=N'JOB', @type=N'LOCAL', @name=@categoryname; IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback; END; EXECUTE @ReturnCode = msdb.dbo.sp_add_job @job_name=@jobname, @enabled=1, @notify_level_eventlog=0, @notify_level_email=2, @notify_level_netsend=2, @notify_level_page=2, @delete_level=0, @description=N'Resume data movement on suspended Availability Group databases. This job can be run manually, or from an alert', @category_name=N'HADR-Availability Group', @owner_login_name=N'sa'; IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback; EXECUTE @ReturnCode = msdb.dbo.sp_add_jobserver @job_name=@jobname, @server_name = N'(local)' IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback; EXECUTE @ReturnCode = msdb.dbo.sp_add_jobstep @job_name=@jobname, @step_name=N'Resume data movement in AG databases', @step_id=1, @cmdexec_success_code=0, @on_success_action=1, @on_fail_action=2, @retry_attempts=0, @retry_interval=0, @os_run_priority=0, @subsystem=N'TSQL', @command=N'DECLARE @SQLCMD VARCHAR(1000); DECLARE cDBSuspended CURSOR FOR SELECT ''ALTER DATABASE ['' + DB_NAME(database_id) + ''] SET HADR RESUME;'' FROM sys.dm_hadr_database_replica_states drs JOIN sys.availability_replicas ar ON ar.replica_id = drs.replica_id WHERE ar.replica_server_name = @@SERVERNAME AND drs.is_suspended = 1; OPEN cDBSuspended; FETCH NEXT FROM cDBSuspended INTO @SQLCMD; WHILE @@FETCH_STATUS = 0 BEGIN EXECUTE (@SQLCMD); FETCH NEXT FROM cDBSuspended INTO @SQLCMD; END; CLOSE cDBSuspended; DEALLOCATE cDBSuspended; ', @database_name=N'master', @flags=0; IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback; EXECUTE @ReturnCode = msdb.dbo.sp_update_job @job_name=@jobname, @enabled=1, @start_step_id=1, @notify_level_eventlog=0, @notify_level_email=2, @notify_level_netsend=2, @notify_level_page=2, @delete_level=0, @description=N'Resume data movement on suspended Availability Group databases.', @category_name=N'HADR-Availability Group', @owner_login_name=N'sa', @notify_email_operator_name=N'', @notify_netsend_operator_name=N'', @notify_page_operator_name=N''; IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback; DECLARE @alertname sysname = N'AG Data Movement suspended'; IF EXISTS (SELECT * FROM msdb.dbo.sysalerts WHERE name = @alertname) BEGIN EXECUTE @ReturnCode = msdb.dbo.sp_delete_alert @alertname; IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback; END; /* Alert text: AlwaysOn Availability Groups data movement for database '%.*ls' has been suspended for the following reason: "%S_MSG" (Source ID %d; Source string: '%.*ls'). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online. */ EXECUTE @ReturnCode = msdb.dbo.sp_add_alert @name=@alertname, @message_id=35264, @severity=0, @enabled=1, @delay_between_responses=0, @include_event_description_in=0, @job_name = @jobname; IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback; COMMIT TRANSACTION; GOTO EndSave QuitWithRollback: IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION; EndSave: GO |
Availability Group issues – Failover
The second of the Availability Group issues is that when an availability group fails over, other actions may need to be performed. A third-party application might need some data updated in the database to reflect what the primary replica is. Perhaps the cluster witness configuration needs adjusting. Maybe services need turning off on the old primary replica.
SQL Server raises an error (1480) when the AG fails over. This occurs in both the new and old primary replica instances. The text of this error from sys.messages reads:
The %S_MSG database “%.*ls” is changing roles from “%ls” to “%ls” because the mirroring session or availability group failed over due to %S_MSG. This is an informational message only. No user action is required.
Note that this error is raised for each database in the AG.
On each replica, the failover process goes through three roles. The old primary replica will go from “PRIMARY” to “RESOLVING” to “SECONDARY”, while the old secondary will go from “SECONDARY” to “RESOLVING” to “PRIMARY”. These roles will show up in the above error message and used to tune the alert.
There are three conditions that we could potentially alert for. Firstly is that a failover did occur. Secondly is that the failover occurred, and this instance is now the primary replica. Thirdly is that a failover occurred, and this instance is now a secondary replica. The following script will create alerts for the last two conditions:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
/* Use this event to run a job when the replica becomes primary */ USE [msdb] GO EXEC msdb.dbo.sp_add_alert @name=N'AG Failover Detected - Now Primary', @message_id=1480, @severity=0, @enabled=1, @delay_between_responses=0, @include_event_description_in=0, @event_description_keyword=N'"RESOLVING" to "PRIMARY"', @job_id=N'00000000-0000-0000-0000-000000000000' GO /* Use this event to run a job when the replica becomes secondary */ USE [msdb] GO EXEC msdb.dbo.sp_add_alert @name=N'AG Failover Detected - Now Secondary', @message_id=1480, @severity=0, @enabled=1, @delay_between_responses=0, @include_event_description_in=0, @event_description_keyword=N'"RESOLVING" to "SECONDARY"', @job_id=N'00000000-0000-0000-0000-000000000000' GO |
With the alert in place, you can modify it to run a job to do the actions that you need to perform. If you just need to know that a failover did occur, you can remove the message text to alert on.
Summary
In this post, we have learned how to use the SQL Server Agent Alerting system to capture and fix Availability Group issues and to respond to them by running jobs.