How to Deploy Traffic Manager for Azure App Services for Disaster Recovery


The Production and DR Deployments

Two Azure App Services websites/plans have been deployed:

  • Petriprod: A production app service is hosted on an app service plan in North Europe.
  • Petridr: A disaster recovery (secondary) app service is hosted on an app service plan in West Europe.

The production and disaster recovery Azure App Services deployments [Image Credit: Aidan Finn]
It is the responsibility of the developer or operator to duplicate the web app content from the app service in North Europe to the app service in West Europe. This could be done via the publishing or DevOps mechanism(s), or via an app service extension.

Create Traffic Manager Profile

The Traffic Manager Profile is a DNS abstraction mechanism that is hosted globally in Azure. Clients will browse to the DNS name of the profile, via a CNAME alias for their website URL, and the profile will direct them to the production or failover site, depending on the situation.
To create a Traffic Manager profile, click Create a Resource in the Azure Portal, search for and select Traffic Manager Profile, and click Create. Enter the following information in the Create Traffic Manager Profile blade:

  • Name: A globally unique name that will form the prefix of a Microsoft-managed .trafficmanager.net domain name.
  • Routing Method: Priority is the method that should be used for failover/disaster recovery.
  • Subscription: Select the subscription that the profile will be created in.
  • Resource Group: Select a resource group to create the profile in.
  • Resource Group Location: The traffic manager profile is global, but the resource group has a location – more on this in a moment.
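The same choices can also be written down as data. The following is a minimal sketch, not the portal's own output: it builds a dictionary shaped like the ARM resource body for a Traffic Manager profile, assuming the property names used by the Microsoft.Network/trafficManagerProfiles resource type. The monitor settings (HTTPS on port 443, path "/") are placeholder assumptions, and the API version is deliberately left out.

```python
import json


def build_profile(name: str, ttl_seconds: int = 60) -> dict:
    """Return an ARM-style resource body for a Priority-routed profile."""
    return {
        "type": "Microsoft.Network/trafficManagerProfiles",
        "name": name,
        # The profile itself is global; only its resource group has a region.
        "location": "global",
        "properties": {
            "profileStatus": "Enabled",
            "trafficRoutingMethod": "Priority",
            "dnsConfig": {"relativeName": name, "ttl": ttl_seconds},
            # Assumed probe settings; adjust to whatever the web app exposes.
            "monitorConfig": {"protocol": "HTTPS", "port": 443, "path": "/"},
        },
    }


profile = build_profile("tm-petridrmgmt")
# The profile name becomes the first label of the Microsoft-managed DNS name.
fqdn = profile["properties"]["dnsConfig"]["relativeName"] + ".trafficmanager.net"
print(fqdn)
print(json.dumps(profile, indent=2))
```

Note that the resource group and its location do not appear in the body at all; they are chosen when the resource is deployed.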

Creating a new Traffic Manager Profile in Azure [Image Credit: Aidan Finn]
The globally hosted Traffic Manager Profile will be created in a resource group that does have a single location. I take no chances with “how things should work in a regional outage” – I recommend placing the resource group in a “witness region”, or a third region that will be independent of the production and failover regions. In my example, I am deploying the resource group for Traffic Manager into France Central.
Deploying Traffic Manager to a witness region resource group [Image Credit: Aidan Finn]

Traffic Manager Endpoints

An endpoint is any destination that Traffic Manager can direct clients to. In our case, those endpoints will be the App Services that are deployed into North Europe and West Europe.
To add an endpoint, open the Traffic Manager Profile, browse to Settings > Endpoints, and click + Add. An Add Endpoint blade appears; enter the following information into this blade:

  • Type: Azure Endpoint
  • Name: Name the endpoint – I will use Production for the production site and Secondary for the secondary/DR site.
  • Target Resource Type: App Service
  • Target Resource: Select the appropriate app service.
  • Priority: This determines the order in which Traffic Manager will attempt to send traffic to the app services – more on this in a moment.
  • Custom Header Settings: Talk to your devs to determine if this is required.
  • Add As Disabled: Is the endpoint going to be available for redirection or not? More on this in a moment.

The typical blog post you will see online will tell you to set up two endpoints:

  • Production: Priority 10
  • Secondary: Priority 20

In theory, Traffic Manager will:

  1. Test Production and Secondary for availability.
  2. Direct clients to Secondary if Production goes offline.
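The automatic behaviour described in these two steps can be modelled in a few lines. This is a simplified sketch of priority routing, not Traffic Manager's actual implementation: the Endpoint class and pick_endpoint function are hypothetical names, and the model does not attempt to reproduce what the real service does when no endpoint is healthy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    name: str
    priority: int          # lower number is preferred
    enabled: bool = True   # the endpoint's Enabled/Disabled status
    healthy: bool = True   # stand-in for the result of the health probe


def pick_endpoint(endpoints: List[Endpoint]) -> Optional[Endpoint]:
    """Return the enabled, healthy endpoint with the lowest priority number."""
    candidates = [e for e in endpoints if e.enabled and e.healthy]
    return min(candidates, key=lambda e: e.priority, default=None)


endpoints = [Endpoint("Production", 10), Endpoint("Secondary", 20)]
print(pick_endpoint(endpoints).name)   # Production is preferred while healthy

endpoints[0].healthy = False           # simulate Production going offline
print(pick_endpoint(endpoints).name)   # traffic now goes to Secondary
```

The second call shows the automatic failover: nothing but the probe result changed, and clients are sent to Secondary without anyone deciding that they should be.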

However, there are two issues:

  • What if there is a backend database, such as Azure SQL, that is configured in active/passive mode and has not failed over? The site will not work then.
  • There is a lack of human control. DR experts greatly dislike automatic failover.

My preferred approach is to take control of the failover as follows:

  • Production: Priority 10, Enabled.
  • Secondary: Priority 20, Disabled.

The secondary Traffic Manager endpoint is disabled [Image Credit: Aidan Finn]
I can now use a process, such as Azure Automation, to enable the Secondary endpoint and disable the Production endpoint; this will give me full control over the failover and allow me to orchestrate failover of other components of the solution, such as Azure SQL.
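As an illustration of that kind of orchestration, here is a minimal sketch in plain Python. It is not an Azure Automation runbook and calls no Azure API: manual_failover and fail_over_database are hypothetical names, and running the other components' steps before switching the endpoints is my assumption about a sensible order, not something the portal enforces.

```python
from typing import Callable, Dict, Iterable


def manual_failover(status: Dict[str, str],
                    pre_steps: Iterable[Callable[[], bool]]) -> Dict[str, str]:
    """Return new endpoint statuses, or the unchanged ones if a pre-step fails."""
    for step in pre_steps:
        if not step():
            # Abort: leave clients pointed at Production until a human decides.
            return dict(status)
    new_status = dict(status)
    new_status["Secondary"] = "Enabled"
    new_status["Production"] = "Disabled"
    return new_status


def fail_over_database() -> bool:
    # Hypothetical placeholder for failing over a backend such as Azure SQL.
    return True


before = {"Production": "Enabled", "Secondary": "Disabled"}
after = manual_failover(before, [fail_over_database])
print(after)
```

If any pre-step reports failure, the endpoint statuses are returned untouched, so the website name is never switched to a site whose dependencies are not ready.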

Configure the Traffic Manager Profile

If you open the Traffic Manager Profile and browse to Settings > Configuration, you will find a number of settings for changing the behaviour of the profile, including:

  • Routing method
  • DNS record caching timeout (TTL)
  • The method for monitoring the endpoints
  • How to determine that an endpoint has failed

Because I have taken control of failover, I am only interested in the DNS Time To Live (TTL). This determines how long clients of the website may cache the answer before they need to resolve the name again and be redirected by Traffic Manager. The longer this is, the longer a site might appear offline after failover; the default is 60 seconds.
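The trade-off is simple arithmetic, sketched below. The function name and the 120-second switch time are illustrative assumptions, and the estimate ignores resolvers or browsers that hold on to an answer for longer than the TTL.

```python
def worst_case_stale_seconds(ttl_seconds: int, switch_seconds: int = 0) -> int:
    """Upper bound on how long a client may keep using the old answer.

    switch_seconds is the time taken to flip the endpoints; a client that
    resolved the name just before the flip keeps its answer for a full TTL.
    """
    return switch_seconds + ttl_seconds


# Compare a few TTL values against an assumed 120-second manual switch.
for ttl in (30, 60, 300):
    print(ttl, worst_case_stale_seconds(ttl, switch_seconds=120))
```

A shorter TTL narrows the window at the cost of more frequent DNS lookups by clients.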

DNS

The new traffic manager profile has an Internet-resolvable domain name in the form of <profilename>.trafficmanager.net, which you can see in Overview or Properties. Copy this name and create a CNAME record for your desired website URL. In my example, the DNS domain is hosted in Azure – Azure DNS is a global service, but the resource group is also in the witness site of France Central.

Creating a CNAME record to redirect the website URL to Traffic Manager [Image Credit: Aidan Finn]
Now if a user browses to www.joeelway.com, the name resolves through the CNAME record to tm-petridrmgmt.trafficmanager.net, and the Traffic Manager Profile answers the lookup with the active endpoint, so the prospective customer reaches the site that I have decided should be online.
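The lookup chain can be pictured with a small model. This sketch performs no real DNS queries; it follows CNAME records held in a dictionary. The two azurewebsites.net hostnames are assumptions based on the app service names, and resolve_chain is a hypothetical helper.

```python
from typing import Dict, List


def resolve_chain(name: str, cnames: Dict[str, str], max_hops: int = 8) -> List[str]:
    """Follow CNAME records from name, stopping at max_hops to avoid loops."""
    chain = [name]
    while chain[-1] in cnames and len(chain) <= max_hops:
        chain.append(cnames[chain[-1]])
    return chain


records = {
    # The CNAME record created for the website URL.
    "www.joeelway.com": "tm-petridrmgmt.trafficmanager.net",
    # What the profile answers with while Production is the active endpoint
    # (assumed hostname).
    "tm-petridrmgmt.trafficmanager.net": "petriprod.azurewebsites.net",
}
print(" -> ".join(resolve_chain("www.joeelway.com", records)))

# After a manual failover, the profile answers with the Secondary endpoint.
records["tm-petridrmgmt.trafficmanager.net"] = "petridr.azurewebsites.net"
print(" -> ".join(resolve_chain("www.joeelway.com", records)))
```

Only the profile's answer changes between the two runs; the CNAME record for the website URL stays the same throughout a failover.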