DR Fail Over of Azure App Services Using Automation

In this post, I will explain how you can implement disaster recovery failover for an application that has been built on Azure’s App Services and Azure SQL.

Business Continuity

One of the great things about Azure is how easy it can be to solve some of the old business & technology challenges, especially if you have gone through a digital transformation and moved beyond the limits of virtual machines and infrastructure. Microsoft Azure allows us to deploy in locations around the world, at fairly modest costs, and easily switch users from one deployment to another.

The core feature that I urge people to consider when thinking about installation flexibility and disaster recovery, even outside of Azure, is Traffic Manager. This micro-cost service abstracts DNS records and public IP addresses (Azure refers to these collectively as an endpoint) and enables simple direction, load balancing, geo-redirection, performance enhancement, and prioritization (automating failover) of endpoints.

A simple application can be deployed in one region, along with its database. A duplicate can be created in another region, and with a combination of Azure solutions, replication and failover can be implemented. More complex applications can have single databases feeding into a central data warehouse, or maybe even use a geo-resilient database such as Cosmos DB.

Simple Scenario

In this post, I’m going to stick with a very common and simple scenario. Imagine a deployment that has two load-balanced web servers and a backend machine running SQL Server – that’s not so exotic! Now, replace those web servers with Azure’s App Services, and replace the SQL Server with Azure SQL; this will reduce management costs, possibly reduce runtime costs, and allow you to focus on the service instead of the distractions of infrastructure configuration. It took only a few minutes to deploy the below “production + test” environment into a resource group called Petri in the Azure North Europe region:

  • A production and test web app running on a scalable app service plan
  • Production and test Azure SQL databases on an Azure SQL Server
A production web app running in North Europe [Image Credit: Aidan Finn]

Business Continuity

Let’s assume that the above web app generates revenue for the business and has become mission critical. The production elements are:

  • App Service Plan: appsvc-petri
  • App Service (web app): petriapp1
  • SQL Server: sqlsvr-petri
  • SQL Database: sql-petri1

We need to “replicate” these items to another Azure region just in case North Europe either has extended downtime or is destroyed. The remaining items are test & dev related and do not need to be replicated.

Ideally, any failover will be:

  • Manually started: In my experience, automatic failover of stateful systems is bad. Accidental failovers are more common, and more destructive, than the real disasters that people fear but that rarely occur.
  • Orchestrated: The day you require a failover is a day when things go wrong and humans make mistakes. Automate as much of the process as possible – a human will start the process and Azure will do the rest.

The Disaster Recovery Site

In reality, the app services and SQL Server will not be replicating. Instead, the content will be replicated to the disaster recovery site:

  • App Service: Whatever release system is being used to distribute the app service code to the production site will also be used to release code to an identical app service plan and app service deployment in the secondary site.
  • Azure SQL: An identical Azure SQL Server and database will also be deployed in the secondary site. The production database will replicate to the secondary database. If there were more than one production database, their failover could be aggregated into an atomic failover group.
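
The database replication described above can be sketched with the AzureRM.Sql cmdlets. This is only a minimal sketch, not necessarily how the environment in this post was built: it assumes both Azure SQL Servers already exist, that the secondary server does not yet hold a standalone database with the same name, and it reuses the resource names from the runbooks later in this post. The manual failover policy is my assumption, chosen to match the manually started failover described earlier.

```powershell
# Sketch only: assumes AzureRM.Sql is installed and you are already logged in
# with Add-AzureRmAccount. Names are taken from the runbooks in this post.
$PrimaryResourceGroupName   = "petri"
$PrimaryServerName          = "sqlsvr-petri"
$SecondaryResourceGroupName = "petrifo"
$SecondaryServerName        = "sqlsvr-petrifo"
$FailoverGroupName          = "sqlfog-petri"
$DatabaseName               = "sql-petri1"

# Create the failover group on the primary server, partnered with the secondary.
# -FailoverPolicy Manual (an assumption) means Azure never fails over on its own.
New-AzureRmSqlDatabaseFailoverGroup `
    -ResourceGroupName $PrimaryResourceGroupName `
    -ServerName $PrimaryServerName `
    -PartnerResourceGroupName $SecondaryResourceGroupName `
    -PartnerServerName $SecondaryServerName `
    -FailoverGroupName $FailoverGroupName `
    -FailoverPolicy Manual

# Add the production database to the group so that it replicates to the partner server.
Get-AzureRmSqlDatabase `
    -ResourceGroupName $PrimaryResourceGroupName `
    -ServerName $PrimaryServerName `
    -DatabaseName $DatabaseName |
    Add-AzureRmSqlDatabaseToFailoverGroup `
        -ResourceGroupName $PrimaryResourceGroupName `
        -ServerName $PrimaryServerName `
        -FailoverGroupName $FailoverGroupName
```

Additional production databases could be piped into the same group in the same way, which is what makes the group fail over as a unit.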

A secondary web app running in West Europe [Image Credit: Aidan Finn]
Next, we have to figure out how to redirect clients from the production version of the website to the secondary; this is easily accomplished using a Traffic Manager profile (in priority mode). The DNS name of the site will point to the Traffic Manager profile’s Microsoft-managed fully qualified domain name (FQDN) using a CNAME record. The Traffic Manager profile will have two endpoints that can redirect clients to either the production or the secondary site:

  • PrimaryEndpoint (Enabled): This resolves to the production app service (web app)
  • SecondaryEndpoint (Disabled): This resolves to the secondary app service (web app)

The Traffic Manager endpoints [Image Credit: Aidan Finn]
In theory, one could leave both endpoints enabled and configure PrimaryEndpoint with a higher priority than SecondaryEndpoint. However, this could lead to a situation where the production site could be faulty but failover does not occur, or even a false failover – I want a manual decision to trigger failover!
PrimaryEndpoint is enabled, and all clients will be redirected to the app service running in North Europe unless I change that. SecondaryEndpoint is disabled. To achieve a failover, I will disable PrimaryEndpoint and enable SecondaryEndpoint, thus redirecting clients to the secondary system.
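
A profile like this could be created along the following lines. Treat it as a rough sketch under several assumptions rather than the exact deployment used here: the relative DNS name, the name of the secondary web app, the monitoring settings, and the TTL are all hypothetical, and I am assuming the secondary web app lives in the same resource group as the secondary SQL Server. The resource group and profile names come from the runbooks later in this post.

```powershell
# Sketch only: assumes AzureRM.TrafficManager and AzureRM.Websites are installed,
# you are logged in, and the "petridr" resource group already exists.
$TMResourceGroup = "petridr"
$TMProfileName   = "petri"
$RelativeDnsName = "petri-example"   # hypothetical; must be globally unique
$SecondaryWebApp = "petriapp1fo"     # hypothetical name for the West Europe web app

# Priority routing: clients are sent to the enabled endpoint with the lowest priority number.
New-AzureRmTrafficManagerProfile -Name $TMProfileName -ResourceGroupName $TMResourceGroup `
    -TrafficRoutingMethod Priority -RelativeDnsName $RelativeDnsName -Ttl 30 `
    -MonitorProtocol HTTP -MonitorPort 80 -MonitorPath "/"

# Look up the two web apps so their resource IDs can be used as endpoint targets.
$primaryApp   = Get-AzureRmWebApp -ResourceGroupName "petri"   -Name "petriapp1"
$secondaryApp = Get-AzureRmWebApp -ResourceGroupName "petrifo" -Name $SecondaryWebApp

# The primary endpoint starts enabled; the secondary starts disabled so that
# only a deliberate action can send clients to West Europe.
New-AzureRmTrafficManagerEndpoint -Name "PrimaryEndpoint" -ProfileName $TMProfileName `
    -ResourceGroupName $TMResourceGroup -Type AzureEndpoints `
    -TargetResourceId $primaryApp.Id -EndpointStatus Enabled -Priority 1

New-AzureRmTrafficManagerEndpoint -Name "SecondaryEndpoint" -ProfileName $TMProfileName `
    -ResourceGroupName $TMResourceGroup -Type AzureEndpoints `
    -TargetResourceId $secondaryApp.Id -EndpointStatus Disabled -Priority 2
```
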
Note that Traffic Manager is a global service that is hosted in all regions. Recent global issues in the cloud have made me very careful, so I have placed the Traffic Manager profile into a resource group that is in a third “witness region”: UK South.

Azure Automation

To achieve an orchestrated failover, I will use Azure Automation. Two PowerShell runbooks will be created:

  • PetriFailover: This will fail over the database (in an Azure SQL failover group) from North Europe to West Europe and then change the enabled/disabled states of the Traffic Manager endpoints to redirect clients to the App Service in West Europe.
  • PetriFailback: This runbook will reverse the changes of PetriFailover and redirect clients back to the production system in North Europe.

The Azure Automation account will also be deployed into the “witness region” (UK South), isolating it from anything bad that might happen in the production or secondary sites.
Note that the following PowerShell modules had to be added to the Azure Automation account:

  • AzureRM.Profile
  • AzureRM.SQL
  • AzureRM.TrafficManager

The Runbooks

And now we get to the magic. To be honest, the runbooks below are quite simple. There are 3 steps in each runbook:

  1. Disable the active Traffic Manager endpoint
  2. Pull the Azure SQL failover group from the current region to the desired region
  3. Enable the desired Traffic Manager endpoint

Here is the PetriFailover runbook:

$connectionName = "AzureRunAsConnection"
try
{
    # Get the connection "AzureRunAsConnection "
    $servicePrincipalConnection=Get-AutomationConnection -Name $connectionName
    "Logging in to Azure..."
    Add-AzureRmAccount `
        -ServicePrincipal `
        -TenantId $servicePrincipalConnection.TenantId `
        -ApplicationId $servicePrincipalConnection.ApplicationId `
        -CertificateThumbprint $servicePrincipalConnection.CertificateThumbprint
}
catch {
    if (!$servicePrincipalConnection)
    {
        $ErrorMessage = "Connection $connectionName not found."
        throw $ErrorMessage
    } else{
        Write-Error -Message $_.Exception
        throw $_.Exception
    }
}
# Failover Starts Here
$Start = Get-Date
# SQL Failover Group variables
$PrimaryResourceGroupName = "petri"
$PrimaryServerName = "sqlsvr-petri"
$SecondaryResourceGroupName = "petrifo"
$SecondaryServerName = "sqlsvr-petrifo"
$FailoverGroupName = "sqlfog-petri"
# Traffic Manager variables
$TMResourceGroup = "petridr"
$TMProfileName = "petri"
$PriEndpoint = "PrimaryEndpoint"
$SecEndpoint = "SecondaryEndpoint"
# Check the primary Traffic Manager endpoint
$PrimaryEndpointStatus = (Get-AzureRmTrafficManagerEndpoint -Name $PriEndpoint -Type AzureEndpoints -ProfileName $TMProfileName -ResourceGroupName $TMResourceGroup).EndpointStatus
if ($PrimaryEndpointStatus -eq "Enabled")
{
    # Disable the primary Traffic Manager endpoint if it is enabled
    Write-Output "Disable the primary Traffic Manager endpoint"
    Disable-AzureRmTrafficManagerEndpoint -Name $PriEndpoint -Type AzureEndpoints -ProfileName $TMProfileName -ResourceGroupName $TMResourceGroup -Force
}
# Verify that the Azure SQL database failover group is on the primary
$PreFailoverServer = (Get-AzureRmSqlDatabaseFailoverGroup -ResourceGroupName $PrimaryResourceGroupName -ServerName $PrimaryServerName -FailoverGroupName $FailoverGroupName).ServerName
if ($PreFailoverServer -eq $PrimaryServerName)
{
    # Fail over the SQL failover group by promoting the secondary server
    Write-Output "Failover Azure SQL database failover group"
    Switch-AzureRMSqlDatabaseFailoverGroup -ResourceGroupName $SecondaryResourceGroupName -ServerName $SecondaryServerName -FailoverGroupName $FailoverGroupName
}
# Check the failover
$PostFailoverServer = (Get-AzureRmSqlDatabaseFailoverGroup -ResourceGroupName $SecondaryResourceGroupName -ServerName $SecondaryServerName -FailoverGroupName $FailoverGroupName).ServerName
if ($PostFailoverServer -eq $SecondaryServerName)
{
    # Enable the secondary Traffic Manager endpoint
    Write-Output "Enable the secondary Traffic Manager endpoint"
    Enable-AzureRmTrafficManagerEndpoint -Name $SecEndpoint -Type AzureEndpoints -ProfileName $TMProfileName -ResourceGroupName $TMResourceGroup
}
$Stop = Get-Date
$TimeTaken = ($Stop - $Start).TotalSeconds
Write-Output "The time to run this script was $TimeTaken seconds"

And here is the PetriFailback runbook:

$connectionName = "AzureRunAsConnection"
try
{
    # Get the connection "AzureRunAsConnection "
    $servicePrincipalConnection=Get-AutomationConnection -Name $connectionName
    "Logging in to Azure..."
    Add-AzureRmAccount `
        -ServicePrincipal `
        -TenantId $servicePrincipalConnection.TenantId `
        -ApplicationId $servicePrincipalConnection.ApplicationId `
        -CertificateThumbprint $servicePrincipalConnection.CertificateThumbprint
}
catch {
    if (!$servicePrincipalConnection)
    {
        $ErrorMessage = "Connection $connectionName not found."
        throw $ErrorMessage
    } else{
        Write-Error -Message $_.Exception
        throw $_.Exception
    }
}
# Failback Starts Here
$Start = Get-Date
# SQL Failover Group variables
$PrimaryResourceGroupName = "petri"
$PrimaryServerName = "sqlsvr-petri"
$SecondaryResourceGroupName = "petrifo"
$SecondaryServerName = "sqlsvr-petrifo"
$FailoverGroupName = "sqlfog-petri"
# Traffic Manager variables
$TMResourceGroup = "petridr"
$TMProfileName = "petri"
$PriEndpoint = "PrimaryEndpoint"
$SecEndpoint = "SecondaryEndpoint"
# Check the secondary Traffic Manager endpoint
$SecondaryEndpointStatus = (Get-AzureRmTrafficManagerEndpoint -Name $SecEndpoint -Type AzureEndpoints -ProfileName $TMProfileName -ResourceGroupName $TMResourceGroup).EndpointStatus
if ($SecondaryEndpointStatus -eq "Enabled")
{
    # Disable the secondary Traffic Manager endpoint if it is enabled
    Write-Output "Disable the secondary Traffic Manager endpoint"
    Disable-AzureRmTrafficManagerEndpoint -Name $SecEndpoint -Type AzureEndpoints -ProfileName $TMProfileName -ResourceGroupName $TMResourceGroup -Force
}
# Verify that the Azure SQL database failover group is on the secondary
$PreFailbackServer = (Get-AzureRmSqlDatabaseFailoverGroup -ResourceGroupName $SecondaryResourceGroupName -ServerName $SecondaryServerName -FailoverGroupName $FailoverGroupName).ServerName
if ($PreFailbackServer -eq $SecondaryServerName)
{
    # Fail back the SQL failover group by promoting the original primary server
    Write-Output "Failback Azure SQL database failover group"
    Switch-AzureRMSqlDatabaseFailoverGroup -ResourceGroupName $PrimaryResourceGroupName -ServerName $PrimaryServerName -FailoverGroupName $FailoverGroupName
}
# Check the failback
$PostFailbackServer = (Get-AzureRmSqlDatabaseFailoverGroup -ResourceGroupName $PrimaryResourceGroupName -ServerName $PrimaryServerName -FailoverGroupName $FailoverGroupName).ServerName
if ($PostFailbackServer -eq $PrimaryServerName)
{
    # Enable the primary Traffic Manager endpoint
    Write-Output "Enable the primary Traffic Manager endpoint"
    Enable-AzureRmTrafficManagerEndpoint -Name $PriEndpoint -Type AzureEndpoints -ProfileName $TMProfileName -ResourceGroupName $TMResourceGroup
}
$Stop = Get-Date
$TimeTaken = ($Stop - $Start).TotalSeconds
Write-Output "The time to run this script was $TimeTaken seconds"
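
One possible refinement to the checks in both runbooks: as far as I can tell, the ServerName property returned by Get-AzureRmSqlDatabaseFailoverGroup reflects the server that was queried, so comparing it with the server name may pass regardless of which side currently holds the primary role. A stricter test, sketched below, reads the ReplicationRole property instead. The helper name Test-IsPrimary is hypothetical and is not part of the runbooks above.

```powershell
# Hypothetical helper: returns $true when the server that was queried currently
# holds the Primary role in the failover group.
function Test-IsPrimary {
    param(
        [Parameter(Mandatory = $true)]
        $FailoverGroup   # object returned by Get-AzureRmSqlDatabaseFailoverGroup
    )
    return ($FailoverGroup.ReplicationRole -eq "Primary")
}

# Usage inside PetriFailover (requires a logged-in AzureRM session):
# $fog = Get-AzureRmSqlDatabaseFailoverGroup -ResourceGroupName $PrimaryResourceGroupName `
#            -ServerName $PrimaryServerName -FailoverGroupName $FailoverGroupName
# if (Test-IsPrimary -FailoverGroup $fog) { <# fail over to the secondary #> }
```

The same helper could be pointed at the secondary server after the switch to confirm that it has taken over the primary role before the Traffic Manager endpoint is enabled.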

A failover of the Traffic Manager profile and the SQL failover group (with one database) will normally take no more than 3 minutes to execute.
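
Because the failover is a manual decision, somebody has to start the runbook. That can be done from the Azure Portal, or from PowerShell along the lines of the sketch below. The Automation account name is hypothetical, and I am assuming the account sits in the same resource group as the Traffic Manager profile; adjust both to suit your own deployment.

```powershell
# Sketch only: assumes AzureRM.Automation is installed and you are logged in.
$AutomationAccountName   = "petri-automation"   # hypothetical account name
$AutomationResourceGroup = "petridr"            # assumption: same group as Traffic Manager

# Start the failover runbook; the cmdlet returns a job object straight away.
$job = Start-AzureRmAutomationRunbook -Name "PetriFailover" `
    -AutomationAccountName $AutomationAccountName `
    -ResourceGroupName $AutomationResourceGroup

# Poll until the job finishes (or needs attention), giving up after 10 minutes.
$deadline = (Get-Date).AddMinutes(10)
$finished = "Completed", "Failed", "Stopped", "Suspended"
do {
    Start-Sleep -Seconds 10
    $job = Get-AzureRmAutomationJob -Id $job.JobId `
        -AutomationAccountName $AutomationAccountName `
        -ResourceGroupName $AutomationResourceGroup
    Write-Output "Job status: $($job.Status)"
} while (($finished -notcontains $job.Status) -and ((Get-Date) -lt $deadline))

# Show whatever the runbook wrote with Write-Output.
Get-AzureRmAutomationJobOutput -Id $job.JobId -Stream Output `
    -AutomationAccountName $AutomationAccountName `
    -ResourceGroupName $AutomationResourceGroup
```

Swapping the runbook name for PetriFailback would start the return journey in the same way.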

The output of an Azure Automation runbook failing over the web app [Image Credit: Aidan Finn]
And that’s it. In less than 1 hour, using the power of Azure’s platform, you could deploy:

  • A highly available app service (load balanced instances with “always on SQL databases”) in a production site
  • A duplicate disaster recovery environment
  • Replication of a SQL Server database or databases from the production site to the secondary in a few clicks
  • A means to switch users from the production system to the secondary system
  • An orchestrated solution to fail over the production system to the secondary system

Try doing that in colo-hosting, on-premises, or even with virtual machines in the cloud!