Exchange Server Database Availability Groups don't always guarantee high availability.
The Database Availability Group (DAG) model of high availability (HA) was introduced in Exchange 2010 and has since become the mainstay of message continuity in organizations. A high availability setup relies on DAGs as the critical component for deploying a “highly available” messaging service, with mailbox database copies that can span multiple Active Directory (AD) sites for resilience.
A DAG is a group of up to 16 mailbox servers that host redundant copies of the mailbox databases. This redundancy enables recovery from the outage of a database, server, or datacenter, whether through a switchover (scheduled) or a failover (unexpected).
The DAG is created as an empty object in Active Directory. Mailbox servers are then added to the DAG object. Adding the first mailbox server automatically creates a dedicated failover cluster for the DAG. This failover cluster monitors the DAG’s crucial information, including:
Mailbox servers that are subsequently added to the DAG are joined to the failover cluster.
Whether a DAG keeps functioning after a member server fails is determined by the quorum model of the failover cluster. Quorum requires that a majority of the voting cluster members remain functional, which is what allows the cluster to keep serving “high availability” and “responsiveness”. When a DAG has an even number of members, a file share witness provides the extra, tie-breaking vote. Windows failover clustering in general also supports a disk witness and, in newer versions, a cloud witness.
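The majority rule behind quorum can be sketched in a few lines of Python. This is an illustrative model only, not how the cluster service is implemented: the function name has_quorum and its parameters are hypothetical, every member is assumed to hold exactly one vote, and dynamic quorum adjustments are ignored.

```python
def has_quorum(total_members: int, online_members: int,
               witness_configured: bool = False,
               witness_online: bool = False) -> bool:
    """Return True if the online voters form a strict majority of all voters.

    Simplified model (assumption): one vote per DAG member, plus one vote
    for the witness when it is configured.
    """
    total_votes = total_members + (1 if witness_configured else 0)
    online_votes = online_members
    if witness_configured and witness_online:
        online_votes += 1
    # A tie is not a majority, so the comparison is strict.
    return 2 * online_votes > total_votes


# Three members, no witness: two online members keep quorum, one does not.
print(has_quorum(3, 2))
print(has_quorum(3, 1))
# Four members: a 2-2 split loses quorum unless a witness casts the extra vote.
print(has_quorum(4, 2))
print(has_quorum(4, 2, witness_configured=True, witness_online=True))
```

The strict comparison is the point of the sketch: with an even number of voters, half of the votes is a tie rather than a majority, which is why a witness is added in that case.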
There are two types of mechanisms for handling outage or downtime in a high availability Exchange Server organization:
Switchover comes into play when there is planned or scheduled downtime of the database, server, or datacenter, typically for maintenance, hardware upgrades, and Windows updates. In a switchover, the administrator “manually” initiates the move by using the Exchange Admin Center (EAC) or the Exchange Management Shell. A switchover moves one or more active database copies to other mailbox server(s) in the DAG.
In contrast, Failover is an “automatic” response to an unexpected outage due to the failure of a database, server or datacenter, and it involves automated recovery by moving the active database copies to another mailbox server in the DAG, which was previously a passive server.
Switchover and failover mechanisms rely on Active Manager, a component that runs inside the Microsoft Exchange Replication service on all mailbox servers in the DAG, to manage the switching of active databases to other servers.
Here is the process for switchover and failover, particularly in the case of a “database” outage.
Here’s how an admin performs a database switchover in an Exchange Server DAG.
*PAM (Primary Active Manager) is the Active Manager role that decides which database copies are active and which are passive in a DAG. The PAM role resides on the DAG member that owns the cluster quorum resource.
In the case of a failover, here’s what happens:
**Crimson channel is a category of event logs in Windows Server that records the events associated with a single application or component, which in this case are High Availability and replication of mailbox databases.
Certain situations can disrupt mailbox connectivity even in a DAG HA setup. The following example illustrates one:
Consider a DAG setup, comprising three member servers, namely MB01, MB02, and MB03, where each member server hosts a copy of three mailbox databases: DB 1, DB 2, and DB 3. The database copies are mirrored across each server.
Fig. 1: DAG with mirrored database copies
Note, this is a DAG comprising an odd number of members, so it is governed by the Node Majority Quorum model.
The DAG is running in a healthy state and is providing high availability. Next, the administrator needs to perform maintenance on MB02 and initiates a switchover. As a result, the active copy of DB 2 is moved from MB02 to MB01, and MB02 then goes offline. The DAG can still maintain high availability with MB01 and MB03; MB01 is now running two active database copies, those of DB 1 and DB 2.
Now, imagine that while MB02 is down for maintenance, MB01 suffers a sudden hardware failure and crashes. The result is a loss of cluster quorum: only MB03 is left, and one vote out of three is not a majority. The DAG cannot initiate a failover and fails completely, and the databases dismount. This happens by design; as a failsafe, the cluster is shut down when quorum is lost.
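The sequence of events in this example can be traced with a small Python model. It is a sketch under stated assumptions rather than a representation of Exchange internals: the function names are hypothetical, DB 3 is assumed to be active on MB03 (the example does not say), and the model uses a static Node Majority rule without dynamic quorum.

```python
def has_node_majority(online: set, members: set) -> bool:
    """Strict majority of members must be online (static Node Majority)."""
    return 2 * len(online) > len(members)


def simulate():
    members = {"MB01", "MB02", "MB03"}
    online = set(members)
    # Assumption: each server starts with one active copy.
    active = {"DB 1": "MB01", "DB 2": "MB02", "DB 3": "MB03"}
    events = []

    # Step 1: planned switchover, then MB02 goes offline for maintenance.
    active["DB 2"] = "MB01"
    online.discard("MB02")
    events.append(("MB02 offline for maintenance",
                   has_node_majority(online, members)))

    # Step 2: MB01 crashes while MB02 is still offline.
    online.discard("MB01")
    quorum = has_node_majority(online, members)
    events.append(("MB01 crashed", quorum))

    # Without quorum the cluster shuts down and every database dismounts,
    # even though MB03 still holds a copy of each database.
    mounted = sorted(active) if quorum else []
    return events, mounted


events, mounted = simulate()
for description, quorum in events:
    print(f"{description}: quorum held = {quorum}")
print("Mounted databases:", mounted)
```

After the first step two of three members are online and quorum holds; after the second only one remains, quorum is lost, and the list of mounted databases is empty.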
This situation leads to an extended outage until the failed member server is recovered to reinstate the quorum and restore the DAG. It requires manual intervention of the administrator to recover the server.
In the meantime, and until the HA setup is restored, mailbox connectivity in the organization will remain disrupted, despite the “availability” of the mailbox database copies. Moreover, mere availability does not guarantee restoration of the “latest mailboxes”, because MB01, the server that crashed, was hosting the active copies of DB 1 and DB 2. There is therefore a significant chance that the database copies hosted on MB01 are in a Dirty Shutdown state.
Follow these steps to check the state of the database and restore it to a consistent state:
You can check the state of the database files by using the ESEUTIL /MH command, which reads the header of the database while it is offline (dismounted), as follows:
eseutil /mh "c:\DB 1.edb"
The above command checks the state of the database file named DB 1.edb. In this scenario, it reports the database state as “Dirty Shutdown” and lists, in the Log Required field, the range of log files needed to bring the database to a consistent state.
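If you want to check the header from a script, the two fields of interest can be picked out of the command's text output. The following Python sketch assumes the output contains lines of the form "State: ..." and "Log Required: ..."; the sample text is hand-written for illustration, not captured from a real server, and the function names are hypothetical.

```python
def parse_header(text: str) -> dict:
    """Extract the State and Log Required fields from header dump text."""
    wanted = ("State", "Log Required")
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values containing colons survive.
        key, sep, value = line.partition(":")
        if sep and key.strip() in wanted:
            fields[key.strip()] = value.strip()
    return fields


def needs_recovery(fields: dict) -> bool:
    """A database that is not cleanly shut down must be recovered before mounting."""
    return fields.get("State") != "Clean Shutdown"


# Hand-written sample (assumption), shaped like the fields described above.
sample = """\
            State: Dirty Shutdown
     Log Required: 5-7 (0x5-0x7)
"""
header = parse_header(sample)
print(header)
print("Recovery needed:", needs_recovery(header))
```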
Use the ESEUTIL /R command to replay the transaction logs and restore the database to a consistent state, a process also known as soft recovery. The following is the syntax of the ESEUTIL /R command, further illustrated with an example:
ESEUTIL /r <log_prefix> /l <path_to_the_folder_with_log_files> /d <path_to_the_folder_with_the_database>
Example:
ESEUTIL /r E00 /l "C:\Program Files\Microsoft\Exchange Server\V15\MB01\DB 1" /d "C:\Program Files\Microsoft\Exchange Server\V15\MB01\DB 1"
This example illustrates the use of the ESEUTIL /R command to perform soft recovery of the database named DB 1.
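Paths that contain spaces, such as the ones above, are easy to get wrong when quoting by hand. As a hedged sketch, a script could assemble the same command as an argument list and let the operating system layer handle quoting. The helper name soft_recovery_command is hypothetical, and the sketch only builds the arguments; it does not run anything.

```python
def soft_recovery_command(log_prefix: str, log_dir: str, db_dir: str) -> list:
    """Build the argument list for a soft recovery, mirroring the syntax above."""
    return ["eseutil", "/r", log_prefix, "/l", log_dir, "/d", db_dir]


# Paths taken from the example above; raw strings keep the backslashes intact.
folder = r"C:\Program Files\Microsoft\Exchange Server\V15\MB01\DB 1"
command = soft_recovery_command("E00", folder, folder)
print(command)

# To actually run it on the Exchange server, the list could be handed to
# subprocess.run(command, check=True), which passes each path as a single
# argument without manual quoting.
```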
Next, recheck the database state by using the ESEUTIL /MH command.
By using the ESEUTIL commands, you should be able to restore the database to a clean shutdown state and mount it successfully, reinstating access to the latest copies of users’ mailboxes.
If soft recovery fails, you can attempt a hard repair (often called hard recovery) by using the ESEUTIL /P command. Be aware that this method discards data it cannot reconcile in order to make the database consistent, and it therefore results in data loss. Use it only when there is no other option. ESEUTIL itself prompts you to acknowledge the risk of data loss before it proceeds.
Hard recovery is not a guaranteed solution, and it leaves a permanent mark: the repair is recorded in the database header. If you have a support agreement with Microsoft and ask for assistance after running a hard recovery, expect the repaired database to be treated as unsupported; Microsoft's guidance is to move the mailboxes to a new database.
As an alternative, you can use a third-party tool to repair corrupted Exchange Server databases and restore Exchange services, with the aim of minimizing both the impact on the business and the loss of data.
High availability is the preferred architecture in Exchange organizations worldwide, as it supports business continuity. Based on DAGs, HA can be extended beyond a single datacenter to multiple AD sites for site resilience, which is a dream setup for any Exchange administrator. However, as with any other system, there are unanticipated failure scenarios that could lead to extended disruptions, which in the case of Exchange Server could mean email downtime and even data loss.
For instance, incidents such as server crashes can lead to complete failure of a DAG and leave the database copies corrupt, putting administrators in a tight spot. In this case, apart from ESEUTIL, there is not much built into Exchange that can reliably fix the database corruption and allow the database to mount.
Data loss is an added risk if hard recovery comes into play. You must also factor in the downtime, administrative effort, and resources needed to restore the services. So, owning a third-party Exchange database recovery tool can be a smart move for handling such situations, given that database corruption can happen at any time and for reasons beyond your control.