Availability Group issues fixed with Alerts

2020-08-06 Wayne Sheffield 3 comments

As I work with Availability Groups (AG), I’m amazed at what all SQL Server does behind the scenes to make them work… and to make them work fast and seamlessly. However, you still need to monitor them, and sometimes take corrective action. Wouldn’t it be great if you could have SQL Server take care of these Availability Group issues? This post will show you a couple of problems that I have seen, and solutions to automatically take care of those issues.

Availability Group issues – Suspended Data Movement

The first of the Availability Group issues to discuss is that, for whatever reason, data is no longer moving between the primary replica and a secondary replica. This puts the Data Movement in a Suspended state.

If the data movement remains suspended for too long, you might have to take some undesired actions to get things back in sync. Things like removing the database from the AG, restoring log files, then reattaching it to the AG. When the data movement becomes suspended, we want to get it flowing again as soon as possible. Let’s have SQL Server try to get the data flowing again.

Resuming Data Movement

It’s normally very simple to resume the data movement – just issue the T-SQL statement:

ALTER DATABASE [DB] SET HADR RESUME;

1	ALTER DATABASE [DB] SET HADR RESUME;

What you need to know is that data movement has been suspended, and for which database. Remember, the time it takes to respond will determine how drastic a measure you need to go through to get the database(s) back in sync. When data movement becomes suspended, SQL Server raises an error (35264), and in sys.messages it reads:

AlwaysOn Availability Groups data movement for database ‘%.*ls’ has been suspended for the following reason: “%S_MSG” (Source ID %d; Source string: ‘%.*ls’). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online.

Automating the Resumption of Data Movement with SQL Server Agent Alerts

To automate resuming data movement, create a SQL Agent Alert for this error that runs a job. To determine which databases have suspended data movement, we hit the system Dynamic Management Views (DMV). The DMV sys.dm_hadr_database_replica_states has a column (is_suspended) that will tell you if that database is in a suspended state. Let’s create the job and alert:

Job to resume data movement — Job script to resume data movement

This job gets all the databases in an AG on this instance and that is in the suspended state. It dynamically creates and runs the T-SQL statement to resume the data movement.

With the job created, you need to set up an alert to trap the error and run the job:

Availability Group issues - Configure Alert to detect suspended data movement — Trapping suspended data movement with an alert

Configure Alert to run a job — Configure alert to run the job

When suspended data movement on any database, SQL Server will trap the error and run the job. The job should start the data movement flowing again.

Wrapping it all up in a script

The T-SQL script to create the job and alert is:

/******************************************************************************
Create job to resume AG database data movement for suspended databases.
Create alert to catch when data movement has been suspended, and run the job.
*******************************************************************************
MODIFICATION LOG
*******************************************************************************
2018-11-20 WGS Initial creation.
******************************************************************************/
USE [msdb]
GO

BEGIN TRANSACTION;
DECLARE @ReturnCode INT;
SELECT @ReturnCode = 0;DECLARE @jobname sysname = N'AG - Resume Data Movement';
DECLARE @categoryname sysname = 'HADR-Availability Group';

IF EXISTS (SELECT name FROM dbo.sysjobs where name = @jobname)
BEGIN
EXECUTE msdb.dbo.sp_delete_job @job_name = @jobname;
END;

IF NOT EXISTS (SELECT name FROM msdb.dbo.syscategories WHERE name=@categoryname AND category_class=1)
BEGIN
EXECUTE @ReturnCode = msdb.dbo.sp_add_category @class=N'JOB', @type=N'LOCAL', @name=@categoryname;
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback;
END;


EXECUTE @ReturnCode = msdb.dbo.sp_add_job
@job_name=@jobname,
@enabled=1,
@notify_level_eventlog=0,
@notify_level_email=2,
@notify_level_netsend=2,
@notify_level_page=2,
@delete_level=0,
@description=N'Resume data movement on suspended Availability Group databases.

This job can be run manually, or from an alert',
@category_name=N'HADR-Availability Group',
@owner_login_name=N'sa';
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback;

EXECUTE @ReturnCode = msdb.dbo.sp_add_jobserver @job_name=@jobname, @server_name = N'(local)'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback;

EXECUTE @ReturnCode = msdb.dbo.sp_add_jobstep
@job_name=@jobname,
@step_name=N'Resume data movement in AG databases',
@step_id=1,
@cmdexec_success_code=0,
@on_success_action=1,
@on_fail_action=2,
@retry_attempts=0,
@retry_interval=0,
@os_run_priority=0,
@subsystem=N'TSQL',
@command=N'DECLARE @SQLCMD VARCHAR(1000);
DECLARE cDBSuspended CURSOR FOR
SELECT ''ALTER DATABASE ['' + DB_NAME(database_id) + ''] SET HADR RESUME;''
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar ON ar.replica_id = drs.replica_id
WHERE ar.replica_server_name = @@SERVERNAME
AND drs.is_suspended = 1;

OPEN cDBSuspended;
FETCH NEXT FROM cDBSuspended INTO @SQLCMD;
WHILE @@FETCH_STATUS = 0
BEGIN
EXECUTE (@SQLCMD);
FETCH NEXT FROM cDBSuspended INTO @SQLCMD;
END;
CLOSE cDBSuspended;
DEALLOCATE cDBSuspended;
',
@database_name=N'master',
@flags=0;
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback;


EXECUTE @ReturnCode = msdb.dbo.sp_update_job
@job_name=@jobname,
@enabled=1,
@start_step_id=1,
@notify_level_eventlog=0,
@notify_level_email=2,
@notify_level_netsend=2,
@notify_level_page=2,
@delete_level=0,
@description=N'Resume data movement on suspended Availability Group databases.',
@category_name=N'HADR-Availability Group',
@owner_login_name=N'sa',
@notify_email_operator_name=N'',
@notify_netsend_operator_name=N'',
@notify_page_operator_name=N'';
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback;

DECLARE @alertname sysname = N'AG Data Movement suspended';
IF EXISTS (SELECT * FROM msdb.dbo.sysalerts WHERE name = @alertname)
BEGIN
EXECUTE @ReturnCode = msdb.dbo.sp_delete_alert @alertname;
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback;
END;

/*
Alert text:
AlwaysOn Availability Groups data movement for database '%.*ls' has been suspended for the following reason: "%S_MSG" (Source ID %d; Source string: '%.*ls'). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online.
*/
EXECUTE @ReturnCode = msdb.dbo.sp_add_alert
@name=@alertname,
@message_id=35264,
@severity=0,
@enabled=1,
@delay_between_responses=0,
@include_event_description_in=0,
@job_name = @jobname;
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback;

COMMIT TRANSACTION;
GOTO EndSave
QuitWithRollback:
IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION;
EndSave:
GO

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

/******************************************************************************

Create job to resume AG database data movement for suspended databases.

Create alert to catch when data movement has been suspended, and run the job.

*******************************************************************************

MODIFICATION LOG

*******************************************************************************

2018-11-20 WGS Initial creation.

******************************************************************************/

USE [msdb]

BEGIN TRANSACTION;

DECLARE @ReturnCode INT;

SELECT @ReturnCode = 0;DECLARE @jobname sysname = N'AG - Resume Data Movement';

DECLARE @categoryname sysname = 'HADR-Availability Group';

IF EXISTS (SELECT name FROM dbo.sysjobs where name = @jobname)

BEGIN

EXECUTE msdb.dbo.sp_delete_job @job_name = @jobname;

END;

IF NOT EXISTS (SELECT name FROM msdb.dbo.syscategories WHERE name=@categoryname AND category_class=1)

BEGIN

EXECUTE @ReturnCode = msdb.dbo.sp_add_category @class=N'JOB', @type=N'LOCAL', @name=@categoryname;

IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback;

END;

EXECUTE @ReturnCode = msdb.dbo.sp_add_job

@job_name=@jobname,

@enabled=1,

@notify_level_eventlog=0,

@notify_level_email=2,

@notify_level_netsend=2,

@notify_level_page=2,

@delete_level=0,

@description=N'Resume data movement on suspended Availability Group databases.

This job can be run manually, or from an alert',

@category_name=N'HADR-Availability Group',

@owner_login_name=N'sa';

IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback;

EXECUTE @ReturnCode = msdb.dbo.sp_add_jobserver @job_name=@jobname, @server_name = N'(local)'

IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback;

EXECUTE @ReturnCode = msdb.dbo.sp_add_jobstep

@job_name=@jobname,

@step_name=N'Resume data movement in AG databases',

@step_id=1,

@cmdexec_success_code=0,

@on_success_action=1,

@on_fail_action=2,

@retry_attempts=0,

@retry_interval=0,

@os_run_priority=0,

@subsystem=N'TSQL',

@command=N'DECLARE @SQLCMD VARCHAR(1000);

DECLARE cDBSuspended CURSOR FOR

SELECT ''ALTER DATABASE ['' + DB_NAME(database_id) + ''] SET HADR RESUME;''

FROM sys.dm_hadr_database_replica_states drs

JOIN sys.availability_replicas ar ON ar.replica_id = drs.replica_id

WHERE ar.replica_server_name = @@SERVERNAME

AND drs.is_suspended = 1;

OPEN cDBSuspended;

FETCH NEXT FROM cDBSuspended INTO @SQLCMD;

WHILE @@FETCH_STATUS = 0

BEGIN

EXECUTE (@SQLCMD);

FETCH NEXT FROM cDBSuspended INTO @SQLCMD;

END;

CLOSE cDBSuspended;

DEALLOCATE cDBSuspended;

@database_name=N'master',

@flags=0;

IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback;

EXECUTE @ReturnCode = msdb.dbo.sp_update_job

@job_name=@jobname,

@enabled=1,

@start_step_id=1,

@notify_level_eventlog=0,

@notify_level_email=2,

@notify_level_netsend=2,

@notify_level_page=2,

@delete_level=0,

@description=N'Resume data movement on suspended Availability Group databases.',

@category_name=N'HADR-Availability Group',

@owner_login_name=N'sa',

@notify_email_operator_name=N'',

@notify_netsend_operator_name=N'',

@notify_page_operator_name=N'';

IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback;

DECLARE @alertname sysname = N'AG Data Movement suspended';

IF EXISTS (SELECT * FROM msdb.dbo.sysalerts WHERE name = @alertname)

BEGIN

EXECUTE @ReturnCode = msdb.dbo.sp_delete_alert @alertname;

IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback;

END;

Alert text:

AlwaysOn Availability Groups data movement for database '%.*ls' has been suspended for the following reason: "%S_MSG" (Source ID %d; Source string: '%.*ls'). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online.

EXECUTE @ReturnCode = msdb.dbo.sp_add_alert

@name=@alertname,

@message_id=35264,

@severity=0,

@enabled=1,

@delay_between_responses=0,

@include_event_description_in=0,

@job_name = @jobname;

IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback;

COMMIT TRANSACTION;

GOTO EndSave

QuitWithRollback:

IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION;

EndSave:

Availability Group issues – Failover

The second of the Availability Group issues is that when an availability group fails over, other actions may need to be performed. A third-party application might need some data updated in the database to reflect what the primary replica is. Perhaps the cluster witness configuration needs adjusting. Maybe services need turning off on the old primary replica.

SQL Server raises an error (1480) when the AG fails over. This occurs in both the new and old primary replica instances. The text of this error from sys.messages reads:

The %S_MSG database “%.*ls” is changing roles from “%ls” to “%ls” because the mirroring session or availability group failed over due to %S_MSG. This is an informational message only. No user action is required.

Note that this error is raised for each database in the AG.

On each replica, the failover process goes through three roles. The old primary replica will go from “PRIMARY” to “RESOLVING” to “SECONDARY”, while the old secondary will go from “SECONDARY” to “RESOLVING” to “PRIMARY”. These roles will show up in the above error message and used to tune the alert.

There are three conditions that we could potentially alert for. Firstly is that a failover did occur. Secondly is that the failover occurred, and this instance is now the primary replica. Thirdly is that a failover occurred, and this instance is now a secondary replica. The following script will create alerts for the last two conditions:

/*
Use this event to run a job when the replica becomes primary
*/
USE [msdb]
GO
EXEC msdb.dbo.sp_add_alert @name=N'AG Failover Detected - Now Primary',
@message_id=1480,
@severity=0,
@enabled=1,
@delay_between_responses=0,
@include_event_description_in=0,
@event_description_keyword=N'"RESOLVING" to "PRIMARY"',
@job_id=N'00000000-0000-0000-0000-000000000000'
GO
/*
Use this event to run a job when the replica becomes secondary
*/
USE [msdb]
GO
EXEC msdb.dbo.sp_add_alert @name=N'AG Failover Detected - Now Secondary',
@message_id=1480,
@severity=0,
@enabled=1,
@delay_between_responses=0,
@include_event_description_in=0,
@event_description_keyword=N'"RESOLVING" to "SECONDARY"',
@job_id=N'00000000-0000-0000-0000-000000000000'
GO

Use this event to run a job when the replica becomes primary

USE [msdb]

EXEC msdb.dbo.sp_add_alert @name=N'AG Failover Detected - Now Primary',

@message_id=1480,

@severity=0,

@enabled=1,

@delay_between_responses=0,

@include_event_description_in=0,

@event_description_keyword=N'"RESOLVING" to "PRIMARY"',

@job_id=N'00000000-0000-0000-0000-000000000000'

Use this event to run a job when the replica becomes secondary

USE [msdb]

EXEC msdb.dbo.sp_add_alert @name=N'AG Failover Detected - Now Secondary',

@message_id=1480,

@severity=0,

@enabled=1,

@delay_between_responses=0,

@include_event_description_in=0,

@event_description_keyword=N'"RESOLVING" to "SECONDARY"',

@job_id=N'00000000-0000-0000-0000-000000000000'

With the alert in place, you can modify it to run a job to do the actions that you need to perform. If you just need to know that a failover did occur, you can remove the message text to alert on.

Availability Group issues - Alert to detect failover — Configuring alert to detect AG failover

Summary

In this post, we have learned how to use the SQL Server Agent Alerting system to capture and fix Availability Group issues and to respond to them by running jobs.

Availability Group, Database Masters, SQL, SQL Server Central automation, Availability Group, Data Movement, Failover

#1 | Written by Stuart Davis about 6 years ago. Reply

Why do you need all this cursor gubbins? you can just concatenate all the suspended databases into a single string!

select @sql += ‘alter database [‘ + Db_Name(database_id) + ‘] set hadr resume; ‘
from sys.dm_hadr_database_replica_states drs
join sys.availability_replicas ar on ar.replica_id=drs.replica_id
where ar.replica_server_name=@@ServerName and drs.is_suspended=1

Loading...
#2 | Written by Trev about 5 years ago. Reply

In response to the above comment, I get that you can build the query dynamically, but then how would you automate the execution?

Loading...
- #3 | Written by Wayne Sheffield about 5 years ago. Reply
  
  You run the dynamically created query with the EXECUTE (@SQLCMD) statement within the job step.
  
  Loading...

#1 | Pinged by Fixing Availability Group Issues with Alerts – Curated SQL about 6 years ago.

[…] Wayne Sheffield automates a few problems away: […]

Loading...

Wayne Sheffield

Recent Posts

Top Posts

Archives

Categories

Blogroll

Availability Group issues fixed with Alerts

Availability Group issues – Suspended Data Movement

Resuming Data Movement

Automating the Resumption of Data Movement with SQL Server Agent Alerts

Wrapping it all up in a script

Availability Group issues – Failover

Summary

Like this:

Related

Leave a Reply to let me know how you liked this postCancel reply

Wayne Sheffield

Recent Posts

Top Posts

Archives

Categories

Tags

Blogroll

Availability Group issues fixed with Alerts

Availability Group issues – Suspended Data Movement

Resuming Data Movement

Automating the Resumption of Data Movement with SQL Server Agent Alerts

Wrapping it all up in a script

Availability Group issues – Failover

Summary

Share this:

Like this:

Related

Leave a Reply to let me know how you liked this postCancel reply