Last Update: Sep 04, 2024 | Published: Dec 21, 2018
In this post, I will explain how you can implement disaster recovery failover for an application that has been built on Azure’s App Services and Azure SQL.
One of the great things about Azure is how easy it can be to solve some of the old business & technology challenges, especially if you have gone through a digital transformation and moved beyond the limits of virtual machines and infrastructure. Microsoft Azure allows us to deploy in locations around the world, at fairly modest costs, and easily switch users from one deployment to another.
The core feature that I stress for people to consider when thinking about installation flexibility and disaster recovery, even outside of Azure, is Traffic Manager. This micro-cost service abstracts DNS records and public IP addresses (together they are referred to as an endpoint by Azure) and enables simple direction, load balancing, geo-redirection, performance enhancement, and prioritization (automating failover) of endpoints.
A simple application can be deployed in one region, along with its database. A duplicate can be created in another region, and with a combination of Azure solutions, replication and failover can be implemented. More complex applications can have single databases feeding into a central data warehouse, or maybe even use a geo-resilient database such as Cosmos DB.
In this post, I’m going to stick with a very common and simple scenario. Imagine a deployment that has two load-balanced web servers and a backend machine running SQL Server – that’s not so exotic! Now, replace those web servers with Azure’s App Services, and replace the SQL Server with Azure SQL; this will reduce management costs, possibly reduce runtime costs, and allow you to focus on the service instead of the distractions of infrastructure configuration. It took only a few minutes to deploy the below “production + test” environment into a resource group called Petri in the Azure North Europe region:
Let’s assume that the above web app generates revenue for the business and has become mission critical. The production elements are:
We need to “replicate” these items to another Azure region just in case North Europe either has extended downtime or is destroyed. The remaining items are test & dev related and do not need to be replicated.
Ideally, any failover will be:
In reality, the app services and the Azure SQL server themselves will not be replicated. Instead, their content will be replicated to the disaster recovery site:
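For the database side, the replication can be pictured as an Azure SQL failover group that pairs the production server with the one in the secondary site. The following is a minimal sketch, not the exact deployment used in this post: it reuses the server, resource group, and failover group names from the runbooks later on, and it assumes that both servers already exist, that you are already signed in with Add-AzureRmAccount, and that the AzureRM.Sql module is available. The database name `petridb` and the choice of a manual failover policy are assumptions made for illustration.

```powershell
# Names taken from the runbooks in this post
$PrimaryResourceGroupName   = "petri"
$PrimaryServerName          = "sqlsvr-petri"
$SecondaryResourceGroupName = "petrifo"
$SecondaryServerName        = "sqlsvr-petrifo"
$FailoverGroupName          = "sqlfog-petri"

# Hypothetical database name (not given in the post)
$DatabaseName = "petridb"

# Create the failover group on the primary server and pair it with the secondary server.
# "Manual" is an assumption: it keeps the failover decision with the operator.
New-AzureRmSqlDatabaseFailoverGroup `
    -ResourceGroupName $PrimaryResourceGroupName `
    -ServerName $PrimaryServerName `
    -PartnerResourceGroupName $SecondaryResourceGroupName `
    -PartnerServerName $SecondaryServerName `
    -FailoverGroupName $FailoverGroupName `
    -FailoverPolicy Manual

# Add the production database to the group so that it is replicated to the secondary server
$Database = Get-AzureRmSqlDatabase `
    -ResourceGroupName $PrimaryResourceGroupName `
    -ServerName $PrimaryServerName `
    -DatabaseName $DatabaseName

Add-AzureRmSqlDatabaseToFailoverGroup `
    -ResourceGroupName $PrimaryResourceGroupName `
    -ServerName $PrimaryServerName `
    -FailoverGroupName $FailoverGroupName `
    -Database $Database
```

Once the database has been added, Get-AzureRmSqlDatabaseFailoverGroup against either server should report which one currently holds the Primary role.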
Next, we have to figure out how to redirect clients from the production version of the website to the secondary; this is easily accomplished using a Traffic Manager profile (in priority mode). The DNS name of the site will point to the Traffic Manager profile’s Microsoft-managed fully qualified domain name (FQDN) using a CNAME record. The Traffic Manager profile will have two endpoints that can redirect clients to either the production or the secondary site:
In theory, one could leave both endpoints enabled and configure PrimaryEndpoint with a higher priority than SecondaryEndpoint. However, this could lead to a situation where the production site is faulty but failover does not occur, or where a false failover is triggered. I want a manual decision to trigger failover!
PrimaryEndpoint is enabled, and all clients will be redirected to the app service running in North Europe unless I change that. SecondaryEndpoint is disabled. To achieve a failover, I will disable PrimaryEndpoint and enable SecondaryEndpoint, thus redirecting clients to the secondary system.
Note that Traffic Manager is a global service that is hosted in all regions. Recent global issues in the cloud have made me very careful, so I have placed the Traffic Manager profile into a resource group that is in a third “witness region”: UK South.
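As a rough illustration of that Traffic Manager setup, the sketch below creates a priority-mode profile with PrimaryEndpoint enabled and SecondaryEndpoint disabled. The profile name, resource group, and endpoint names come from the runbooks later in this post. The web app names (`petri-web`, `petrifo-web`), the relative DNS name, the TTL, and the monitoring settings are hypothetical placeholders. It also assumes that the `petridr` resource group already exists in UK South, that the secondary app lives in the `petrifo` resource group, and that the AzureRM.TrafficManager and AzureRM.Websites modules are available.

```powershell
# Names taken from the runbooks in this post
$TMResourceGroup = "petridr"
$TMProfileName   = "petri"

# Create a profile that uses priority routing.
# RelativeDnsName, Ttl, and the monitor settings are placeholder values.
New-AzureRmTrafficManagerProfile `
    -Name $TMProfileName `
    -ResourceGroupName $TMResourceGroup `
    -TrafficRoutingMethod Priority `
    -RelativeDnsName "petri-demo" `
    -Ttl 60 `
    -MonitorProtocol HTTP `
    -MonitorPort 80 `
    -MonitorPath "/"

# Look up the two app services (hypothetical app names)
$PrimaryApp   = Get-AzureRmWebApp -ResourceGroupName "petri" -Name "petri-web"
$SecondaryApp = Get-AzureRmWebApp -ResourceGroupName "petrifo" -Name "petrifo-web"

# The production endpoint is enabled and receives all clients
New-AzureRmTrafficManagerEndpoint `
    -Name "PrimaryEndpoint" `
    -ProfileName $TMProfileName `
    -ResourceGroupName $TMResourceGroup `
    -Type AzureEndpoints `
    -TargetResourceId $PrimaryApp.Id `
    -EndpointStatus Enabled `
    -Priority 1

# The secondary endpoint exists but stays disabled until a manual failover
New-AzureRmTrafficManagerEndpoint `
    -Name "SecondaryEndpoint" `
    -ProfileName $TMProfileName `
    -ResourceGroupName $TMResourceGroup `
    -Type AzureEndpoints `
    -TargetResourceId $SecondaryApp.Id `
    -EndpointStatus Disabled `
    -Priority 2
```

Because SecondaryEndpoint is created as Disabled, Traffic Manager will never direct clients to it on its own, which keeps the failover decision manual.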
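To achieve an orchestrated failover, I will use Azure Automation. Two PowerShell runbooks will be created: PetriFailover, which moves the service to the secondary site, and PetriFailback, which returns it to the production site.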
The Azure Automation account will also be deployed into the “witness region” (UK South), isolating it from anything bad that might happen in the production or secondary sites.
Note that the following PowerShell modules had to be added to the Azure Automation account:
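And now we get to the magic. To be honest, the runbooks below are quite simple. After signing in to Azure, there are 3 steps in each runbook: disable the currently active Traffic Manager endpoint, fail over the Azure SQL failover group to the other server, and enable the other Traffic Manager endpoint once the database has moved.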
Here is the PetriFailover runbook:
    $connectionName = "AzureRunAsConnection"
    try
    {
        # Get the Run As connection of the Automation account
        $servicePrincipalConnection = Get-AutomationConnection -Name $connectionName

        "Logging in to Azure..."
        Add-AzureRmAccount `
            -ServicePrincipal `
            -TenantId $servicePrincipalConnection.TenantId `
            -ApplicationId $servicePrincipalConnection.ApplicationId `
            -CertificateThumbprint $servicePrincipalConnection.CertificateThumbprint
    }
    catch
    {
        if (!$servicePrincipalConnection)
        {
            $ErrorMessage = "Connection $connectionName not found."
            throw $ErrorMessage
        }
        else
        {
            Write-Error -Message $_.Exception
            throw $_.Exception
        }
    }

    # Failover starts here
    $Start = Get-Date

    # SQL failover group variables
    $PrimaryResourceGroupName = "petri"
    $PrimaryServerName = "sqlsvr-petri"
    $SecondaryResourceGroupName = "petrifo"
    $SecondaryServerName = "sqlsvr-petrifo"
    $FailoverGroupName = "sqlfog-petri"

    # Traffic Manager variables
    $TMResourceGroup = "petridr"
    $TMProfileName = "petri"
    $PriEndpoint = "PrimaryEndpoint"
    $SecEndpoint = "SecondaryEndpoint"

    # Check the primary Traffic Manager endpoint
    $PrimaryEndpointStatus = (Get-AzureRmTrafficManagerEndpoint -Name $PriEndpoint -Type AzureEndpoints -ProfileName $TMProfileName -ResourceGroupName $TMResourceGroup).EndpointStatus
    if ($PrimaryEndpointStatus -eq "Enabled")
    {
        # Disable the primary Traffic Manager endpoint if it is enabled
        Write-Output "Disable the primary Traffic Manager endpoint"
        Disable-AzureRmTrafficManagerEndpoint -Name $PriEndpoint -Type AzureEndpoints -ProfileName $TMProfileName -ResourceGroupName $TMResourceGroup -Force
    }

    # Verify that the primary server currently holds the Primary role in the failover group.
    # ReplicationRole is used because ServerName only echoes the server that was queried.
    $PreFailoverRole = (Get-AzureRmSqlDatabaseFailoverGroup -ResourceGroupName $PrimaryResourceGroupName -ServerName $PrimaryServerName -FailoverGroupName $FailoverGroupName).ReplicationRole
    if ($PreFailoverRole -eq "Primary")
    {
        # Fail over the SQL failover group to the secondary server
        Write-Output "Failover Azure SQL database failover group"
        Switch-AzureRmSqlDatabaseFailoverGroup -ResourceGroupName $SecondaryResourceGroupName -ServerName $SecondaryServerName -FailoverGroupName $FailoverGroupName
    }

    # Check that the secondary server now holds the Primary role
    $PostFailoverRole = (Get-AzureRmSqlDatabaseFailoverGroup -ResourceGroupName $SecondaryResourceGroupName -ServerName $SecondaryServerName -FailoverGroupName $FailoverGroupName).ReplicationRole
    if ($PostFailoverRole -eq "Primary")
    {
        # Enable the secondary Traffic Manager endpoint
        Write-Output "Enable the secondary Traffic Manager endpoint"
        Enable-AzureRmTrafficManagerEndpoint -Name $SecEndpoint -Type AzureEndpoints -ProfileName $TMProfileName -ResourceGroupName $TMResourceGroup
    }

    $Stop = Get-Date
    $TimeTaken = ($Stop - $Start).TotalSeconds
    Write-Output "The time to run this script was $TimeTaken seconds"
And here is the PetriFailback runbook:
    $connectionName = "AzureRunAsConnection"
    try
    {
        # Get the Run As connection of the Automation account
        $servicePrincipalConnection = Get-AutomationConnection -Name $connectionName

        "Logging in to Azure..."
        Add-AzureRmAccount `
            -ServicePrincipal `
            -TenantId $servicePrincipalConnection.TenantId `
            -ApplicationId $servicePrincipalConnection.ApplicationId `
            -CertificateThumbprint $servicePrincipalConnection.CertificateThumbprint
    }
    catch
    {
        if (!$servicePrincipalConnection)
        {
            $ErrorMessage = "Connection $connectionName not found."
            throw $ErrorMessage
        }
        else
        {
            Write-Error -Message $_.Exception
            throw $_.Exception
        }
    }

    # Failback starts here
    $Start = Get-Date

    # SQL failover group variables
    $PrimaryResourceGroupName = "petri"
    $PrimaryServerName = "sqlsvr-petri"
    $SecondaryResourceGroupName = "petrifo"
    $SecondaryServerName = "sqlsvr-petrifo"
    $FailoverGroupName = "sqlfog-petri"

    # Traffic Manager variables
    $TMResourceGroup = "petridr"
    $TMProfileName = "petri"
    $PriEndpoint = "PrimaryEndpoint"
    $SecEndpoint = "SecondaryEndpoint"

    # Check the secondary Traffic Manager endpoint
    $SecondaryEndpointStatus = (Get-AzureRmTrafficManagerEndpoint -Name $SecEndpoint -Type AzureEndpoints -ProfileName $TMProfileName -ResourceGroupName $TMResourceGroup).EndpointStatus
    if ($SecondaryEndpointStatus -eq "Enabled")
    {
        # Disable the secondary Traffic Manager endpoint if it is enabled
        Write-Output "Disable the secondary Traffic Manager endpoint"
        Disable-AzureRmTrafficManagerEndpoint -Name $SecEndpoint -Type AzureEndpoints -ProfileName $TMProfileName -ResourceGroupName $TMResourceGroup -Force
    }

    # Verify that the secondary server currently holds the Primary role in the failover group.
    # ReplicationRole is used because ServerName only echoes the server that was queried.
    $PreFailbackRole = (Get-AzureRmSqlDatabaseFailoverGroup -ResourceGroupName $SecondaryResourceGroupName -ServerName $SecondaryServerName -FailoverGroupName $FailoverGroupName).ReplicationRole
    if ($PreFailbackRole -eq "Primary")
    {
        # Fail the SQL failover group back to the primary server
        Write-Output "Failback Azure SQL database failover group"
        Switch-AzureRmSqlDatabaseFailoverGroup -ResourceGroupName $PrimaryResourceGroupName -ServerName $PrimaryServerName -FailoverGroupName $FailoverGroupName
    }

    # Check that the primary server holds the Primary role again
    $PostFailbackRole = (Get-AzureRmSqlDatabaseFailoverGroup -ResourceGroupName $PrimaryResourceGroupName -ServerName $PrimaryServerName -FailoverGroupName $FailoverGroupName).ReplicationRole
    if ($PostFailbackRole -eq "Primary")
    {
        # Enable the primary Traffic Manager endpoint
        Write-Output "Enable the primary Traffic Manager endpoint"
        Enable-AzureRmTrafficManagerEndpoint -Name $PriEndpoint -Type AzureEndpoints -ProfileName $TMProfileName -ResourceGroupName $TMResourceGroup
    }

    $Stop = Get-Date
    $TimeTaken = ($Stop - $Start).TotalSeconds
    Write-Output "The time to run this script was $TimeTaken seconds"
A failover of the Traffic Manager profile and the SQL failover group (with one database) will normally take no more than 3 minutes to execute.
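If you would rather trigger the failover from a PowerShell session than from the portal, something along the following lines should work. Treat it as a sketch under assumptions: the Automation account name `petri-automation` is hypothetical, placing the account in the `petridr` resource group is a guess (the post only says that it lives in the witness region), and the 10-minute wait limit is arbitrary. It assumes an authenticated AzureRM session with the AzureRM.Automation and AzureRM.TrafficManager modules available.

```powershell
# Assumed location and name of the Automation account (not stated in the post)
$AutomationResourceGroup = "petridr"
$AutomationAccountName   = "petri-automation"

# Start the failover runbook and wait for it to finish (up to 10 minutes).
# With -Wait, the runbook's output is returned to the caller.
Start-AzureRmAutomationRunbook `
    -ResourceGroupName $AutomationResourceGroup `
    -AutomationAccountName $AutomationAccountName `
    -Name "PetriFailover" `
    -Wait `
    -MaxWaitSeconds 600

# Confirm which Traffic Manager endpoint is now enabled
$TMProfile = Get-AzureRmTrafficManagerProfile -Name "petri" -ResourceGroupName "petridr"
$TMProfile.Endpoints | Select-Object Name, EndpointStatus, EndpointMonitorStatus
```

After a successful failover, SecondaryEndpoint should be listed as Enabled and PrimaryEndpoint as Disabled. Running the same command with `-Name "PetriFailback"` reverses the process.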
And that’s it. In less than 1 hour, using the power of Azure’s platform, you could deploy:
Try doing that in colo-hosting, on-premises, or even with virtual machines in the cloud!