Microsoft Exchange Server High Availability
History of Exchange HA
Prior to Exchange 2007, high availability and disaster recovery features were fairly limited, and even up until Exchange 2010 these features relied heavily upon expensive technologies that were complex to implement. Exchange 2007 introduced Local Continuous Replication (LCR), Cluster Continuous Replication (CCR) and Standby Continuous Replication (SCR), although the latter arrived with Exchange 2007 SP1. LCR works much as the name describes: a copy of the storage group is created on a second set of disks locally connected to the mailbox server. As you can tell, this left the server itself as a single point of failure, which is why CCR was much more popular; it utilized Windows Failover Clustering to provide redundancy at both the hardware and storage levels. SCR then built on the same log shipping technology as LCR and CCR to provide site resilience, making it possible to ship log files to another Exchange 2007 mailbox server. Exchange 2010 dropped LCR and combined CCR and SCR to create Database Availability Groups (DAGs).
Database Availability Groups
At the heart of Microsoft Exchange Server's high availability and site resilience framework is the Database Availability Group. Introduced in Exchange 2010, enhanced in 2013 and still used in Exchange 2016, a DAG is simply a group of up to 16 Mailbox servers, with each server hosting a set of databases. When a DAG member fails, any active mailbox databases fail over to another DAG member. The introduction of the DAG removed several single points of failure, as there is no longer a reliance upon a single instance of a database: up to 16 copies of each database can be distributed globally across the group. This not only provides further resiliency but also reduces the need for technologies such as RAID or other traditional backups; if a hard drive were to fail, numerous other database copies would already be available to activate. The evolution of the DAG has also made it incredibly simple to deploy, and with the removal of the requirement for expensive high-performing storage, high availability is now an affordable option for most installations.
New Additions to Exchange HA
Most of the high availability enhancements within Exchange have been centred around improving the capabilities of the DAG, including the introduction of lagged database copies. A lagged database copy is a copy of the database that isn't updated by replaying transactions as they become available; instead, the transaction logs are held for a defined period and then replayed. The primary reason is to provide access to a database at a point in time when it was known to be in a good state, acting as an insurance policy against any form of corruption. If you were in the unfortunate situation where the active database had become corrupt, the lagged database copy would enable you to bring the database back to a point prior to the corruption.
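The hold-then-replay behaviour of a lagged copy can be sketched as a simple queue of timestamped log generations. This is an illustrative sketch only, not Exchange's actual implementation; the `replay_eligible` function and its parameters are hypothetical names chosen for this example.

```python
# Illustrative sketch (not Exchange's actual code) of a lagged database copy:
# transaction logs are held for a configured lag window and only replayed once
# they are older than that window.
from collections import deque

LAG_HOURS = 24.0  # example lag window; Exchange supports replay lag of up to 14 days

def replay_eligible(log_queue, now_hours):
    """Pop and return logs whose age meets or exceeds the lag window."""
    replayed = []
    while log_queue and now_hours - log_queue[0] >= LAG_HOURS:
        replayed.append(log_queue.popleft())
    return replayed

logs = deque([0.0, 1.0, 30.0])  # log generation timestamps in hours
print(replay_eligible(logs, now_hours=26.0))  # logs from hours 0 and 1 are now old enough
```

The key point the sketch captures is that the lagged copy is always a fixed window behind the active copy, which is what makes it usable as a point-in-time recovery source.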
Replay Lag Manager
Replay Lag Manager was introduced in Exchange 2013, refined in 2016 and will be enabled by default in Exchange 2016 CU1. It enables Exchange to convert a lagged copy into a highly available copy when needed. Once Replay Lag Manager is enabled, log replay will automatically play down the log files in the following scenarios:
- Low disk space threshold has been reached
- Physical corruption to the lagged copy
- When there have been fewer than three available healthy highly available copies for more than 24 hours
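The three trigger conditions above can be expressed as a simple decision check. The sketch below is a hedged illustration of that logic, not Exchange's actual implementation; the `LaggedCopyState` fields and the `should_play_down` function are hypothetical names invented for this example, with the thresholds taken from the list above.

```python
# Illustrative sketch (not Exchange's actual code) of the Replay Lag Manager
# trigger conditions described above.
from dataclasses import dataclass

@dataclass
class LaggedCopyState:
    low_disk_space: bool      # low disk space threshold reached
    physically_corrupt: bool  # physical corruption detected on the lagged copy
    healthy_ha_copies: int    # count of healthy highly available copies
    hours_below_three: float  # how long that count has been below three

def should_play_down(state: LaggedCopyState) -> bool:
    """Return True if any of the three play down triggers applies."""
    if state.low_disk_space:
        return True
    if state.physically_corrupt:
        return True
    # Fewer than three healthy highly available copies for more than 24 hours
    if state.healthy_ha_copies < 3 and state.hours_below_three > 24:
        return True
    return False

print(should_play_down(LaggedCopyState(False, False, 2, 30)))  # True
print(should_play_down(LaggedCopyState(False, False, 2, 2)))   # False
```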
Deferred Lagged Copy Play Down
This is an enhancement to the previously mentioned Replay Lag Manager. If one of the above scenarios is triggered, a play down event is initiated. The idea of deferred lagged copy play down is to assess the disk's IO latency: if the disk's read IO latency rises above 35ms, the play down event is deferred; once the read IO latency drops below 25ms, the play down event resumes. This ensures that replaying the transaction logs doesn't impact the performance of any other databases on the same disk.
The graph below illustrates how this would work. As you can see, between 1am and 9am the disk IO latency stays below 25ms, so lagged copy replay is allowed. At 10am the latency rises above 35ms and stays there until 2pm, so within this period lagged copy replay is deferred. Once the latency drops back below 25ms at 2pm, lagged copy replay resumes. This cycle then continues any time the latency exceeds 35ms.
Take for example the following scenario: you have a disk that holds four database copies, two active, one passive and one lagged. The above conditions are met and a play down event is triggered. The resulting IO generated by replaying the log files has the potential to impact the active copies on that disk, which would in turn affect any users accessing their data on those active copies. With deferred lagged copy play down, however, this is kept under control, preventing any possible disruption during peak hours. It's important to note that the maximum amount of time a play down event can be deferred is 24 hours, although this can be adjusted.
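The threshold behaviour described above, defer above 35ms, resume below 25ms, with a cap on total deferral, is a classic hysteresis pattern. The sketch below is an illustration under those stated thresholds, not Exchange's actual implementation; the function and constant names are invented for this example.

```python
# Illustrative sketch (not Exchange's actual code) of deferred lagged copy play
# down: replay is deferred when read IO latency exceeds 35ms and resumed once it
# drops below 25ms, with a cap (24 hours by default) on total deferral time.
DEFER_MS = 35.0         # defer play down above this read latency
RESUME_MS = 25.0        # resume play down below this read latency
MAX_DEFER_HOURS = 24.0  # default maximum deferral, adjustable

def play_down_allowed(latency_samples_ms):
    """Yield (latency, allowed) per hourly sample, tracking deferral time."""
    deferred = False
    deferred_hours = 0.0
    for latency in latency_samples_ms:
        if latency > DEFER_MS:
            deferred = True
        elif latency < RESUME_MS:
            deferred = False
            deferred_hours = 0.0
        # Between 25ms and 35ms the current state is kept (hysteresis).
        if deferred:
            deferred_hours += 1.0
            if deferred_hours > MAX_DEFER_HOURS:
                # Cap reached: play down proceeds regardless of latency.
                deferred = False
                deferred_hours = 0.0
        yield latency, not deferred

samples = [20, 22, 40, 38, 30, 24, 21]  # hourly read latencies in ms
for latency, allowed in play_down_allowed(samples):
    print(f"{latency}ms -> {'replay' if allowed else 'deferred'}")
```

Note how the 30ms sample stays deferred even though it is below the 35ms trigger: the two separate thresholds prevent replay from flapping on and off when latency hovers near a single cut-off.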