Building a disaster recovery (DR) solution: Handling MongoDB
Introduction
Building a robust disaster recovery (DR) solution requires balancing latency, failover times, and data consistency. For our use case — achieving an RTO and RPO of 1 hour with seamless failover between Mumbai and Hyderabad regions — MongoDB Atlas posed unique challenges and opportunities. Here’s how we approached it step by step.
To follow along, you’ll need some experience with MongoDB cluster provisioning and a basic understanding of fault tolerance and high availability. For a detailed explainer on those concepts, you can go through this earlier article I published: https://www.linkedin.com/pulse/fault-tolerance-concepts-analysis-shishir-khandelwal/
Approach 1: Direct Attach Approach
MongoDB Atlas offers a “Direct Attach” restoration method, where the Cloud Backup service replaces the target cluster’s storage volumes with snapshots. To implement this, we would have maintained an identical cluster running in Hyderabad. In the event of a disaster, we would copy the snapshot from Mumbai to Hyderabad. Atlas would then use the snapshot to create new EBS volumes in Hyderabad and replace the EBS volumes attached to the existing cluster nodes with the newly created ones.
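To make this flow concrete, here is a minimal sketch of how such a restore could be triggered programmatically. It is an illustration only: the cluster names are hypothetical, and the restore-jobs endpoint, payload fields, and API version header are assumptions to verify against the current Atlas Administration API documentation.

```python
# Hypothetical sketch: trigger an Atlas Cloud Backup restore of a Mumbai
# snapshot onto a standby Hyderabad cluster. Cluster names are placeholders,
# and the endpoint path, payload fields, and API version header should be
# verified against the Atlas Administration API documentation.
import os
import requests
from requests.auth import HTTPDigestAuth

BASE = "https://cloud.mongodb.com/api/atlas/v2"
HEADERS = {"Accept": "application/vnd.atlas.2023-02-01+json"}  # assumed API version
PROJECT_ID = os.environ["ATLAS_PROJECT_ID"]
SOURCE_CLUSTER = "prod-mumbai"    # hypothetical source cluster
TARGET_CLUSTER = "dr-hyderabad"   # hypothetical pre-provisioned target cluster
SNAPSHOT_ID = os.environ["ATLAS_SNAPSHOT_ID"]  # most recent snapshot of the source

auth = HTTPDigestAuth(os.environ["ATLAS_PUBLIC_KEY"], os.environ["ATLAS_PRIVATE_KEY"])

resp = requests.post(
    f"{BASE}/groups/{PROJECT_ID}/clusters/{SOURCE_CLUSTER}/backup/restoreJobs",
    auth=auth,
    headers=HEADERS,
    json={
        "deliveryType": "automated",       # Atlas-managed restore onto an existing cluster
        "snapshotId": SNAPSHOT_ID,
        "targetClusterName": TARGET_CLUSTER,
        "targetGroupId": PROJECT_ID,
    },
    timeout=30,
)
resp.raise_for_status()
print("Restore job submitted:", resp.json().get("id"))
```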
While straightforward, this approach fell short for us due to the following reasons:
Backup Time Constraints: Backups for larger clusters (e.g., 2 TB of data) often exceed one hour, violating the RTO requirement.
Regional Data Storage Limitations: MongoDB Atlas stores backup snapshots in the same region as the cluster. Creating a cluster in Hyderabad requires transferring this data from Mumbai, introducing significant delays due to inter-region network latency.
Limitations of Direct Attach: While generally faster than a streaming restore, it still has bottlenecks in snapshot preparation and volume mounting, making it infeasible for our DR requirements. Despite these constraints, we tested the approach, only to conclude that it wouldn’t fulfill the desired RTO/RPO.
Approach 2: Streaming Approach
The Streaming method downloads snapshot data from the backup storage source to the destination cluster, offering more flexibility. We enabled multi-region backups to optimize this, specifically targeting the Hyderabad region.
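For illustration, here is a hedged sketch of enabling such snapshot copies to Hyderabad through the cluster’s backup schedule. The endpoint path, the copySettings field names, and the Atlas region name for Hyderabad are assumptions to check against the Atlas Administration API documentation.

```python
# Hypothetical sketch: add a snapshot copy region (Hyderabad) to the cluster's
# backup schedule so restores in that region do not first pull data from Mumbai.
# The endpoint, the copySettings field names, and the Atlas region name for
# Hyderabad are assumptions to check against the Atlas Administration API docs.
import os
import requests
from requests.auth import HTTPDigestAuth

BASE = "https://cloud.mongodb.com/api/atlas/v2"
HEADERS = {"Accept": "application/vnd.atlas.2023-02-01+json"}  # assumed API version
PROJECT_ID = os.environ["ATLAS_PROJECT_ID"]
CLUSTER = "prod-mumbai"  # hypothetical cluster name

auth = HTTPDigestAuth(os.environ["ATLAS_PUBLIC_KEY"], os.environ["ATLAS_PRIVATE_KEY"])

resp = requests.patch(
    f"{BASE}/groups/{PROJECT_ID}/clusters/{CLUSTER}/backup/schedule",
    auth=auth,
    headers=HEADERS,
    json={
        "copySettings": [
            {
                "cloudProvider": "AWS",
                "regionName": "AP_SOUTH_2",   # assumed Atlas name for AWS Hyderabad
                "frequencies": ["HOURLY", "DAILY"],
                "shouldCopyOplogs": True,     # keeps point-in-time restores possible in the copy region
            }
        ]
    },
    timeout=30,
)
resp.raise_for_status()
print("Updated copy settings:", resp.json().get("copySettings"))
```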
Observed Success with Small Clusters: Restoration times were acceptable for smaller clusters, making this a viable solution for less critical applications.
Challenges with Larger Clusters: Restoration times for larger clusters (e.g., 1TB+) often exceeded the 1-hour RTO.
Key bottlenecks: Preparing and mounting the initial volume is time-consuming because it requires disk warming and snapshot data transfer, both of which depend on data size and network latency.
Although the Streaming approach performed better than Direct Attach, it still didn’t meet our requirements for critical clusters with larger storage volumes.
Approach 3: Pre-Provisioning Nodes in Target Region
Our learnings from the Streaming method led us to explore a hybrid solution. While testing, we observed that restoration times improve significantly if at least one node is pre-provisioned in the target cluster.
Key Insight: The first node in the cluster takes the longest to restore due to EBS initialization and snapshot mounting. Additional nodes synchronize much faster once the first is operational.
There were additional considerations we needed to address, as outlined below.
Approach 3.1: Adding a Read-Only Node in Hyderabad
To address the latency and restoration concerns, we experimented with pre-provisioning a read-only node in Hyderabad. This allowed:
- Faster restoration times during failover, as the cluster wasn’t starting from scratch.
- A potential reduction in inter-region data transfer delays.
While this seemed promising, it ultimately fell short for several reasons:
Inadequate Fault Tolerance:
- In normal conditions, the cluster would have three electable nodes in AWS Mumbai and one non-electable read-only node in Hyderabad.
- During a disaster (e.g., Mumbai going down), only the single read-only node in Hyderabad would remain, failing to meet fault tolerance and high availability requirements.
Atlas Limitations:
- MongoDB Atlas does not allow clusters with only one node to function.
- Post-disaster, manual intervention would be needed to add two more electable nodes in Hyderabad, convert the existing read-only node to electable status, and reconfigure the cluster.
- While node addition typically takes under 10 minutes, Atlas does not guarantee strict SLAs, leaving restoration times unpredictable.
This approach was ultimately rejected due to its dependency on manual intervention and insufficient fault tolerance.
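For reference, the non-electable status of such a read-only node is visible directly from a driver: in the standard hello command response, electable members are listed under hosts, while priority-0 members (which is how Atlas deploys read-only nodes) appear under passives. The sketch below uses a placeholder connection string.

```python
# Sketch: how a read-only node's non-electable status shows up from a driver.
# In the standard `hello` response, electable members appear under "hosts",
# while priority-0 members (which is how Atlas runs read-only nodes) appear
# under "passives". The connection string is a placeholder.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://user:pass@prod-cluster.example.mongodb.net")  # placeholder URI

hello = client.admin.command("hello")
print("Electable members :", hello.get("hosts", []))
print("Priority-0 members:", hello.get("passives", []))  # the Hyderabad read-only node would appear here
print("Current primary   :", hello.get("primary"))
```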
Approach 3.2: Redistributing Electable Nodes
The next idea involved reducing the number of electable nodes in Mumbai from three to two and adding one electable node in Hyderabad. This seemed cost-effective, as the total number of nodes remained the same.
Pros:
- Retained quorum for elections, as three electable nodes are sufficient.
- Avoided any additional node costs.
Cons:
Fault Tolerance Concerns:
- If AWS Mumbai went down, only one electable node in Hyderabad would remain. A single node out of three voting members cannot form a majority on its own, so the cluster would be left without a primary and without any redundancy.
- Increased risk of resource usage spikes, as all remaining read traffic and client retries would converge on that single node, which is common during disaster scenarios.
Uncertainty in Atlas Behavior:
- The behavior of Atlas in such a configuration was uncertain, particularly during outages.
- Resource spikes could lead to performance degradation, undermining the goals of high availability and fault tolerance.
This approach was also discarded for failing to meet critical disaster recovery requirements.
Approach 3.3: Adding Two Electable Nodes in Hyderabad
This approach involved pre-provisioning two additional electable nodes in Hyderabad, resulting in:
- Three electable nodes in AWS Mumbai.
- Two electable nodes in AWS Hyderabad.
Pros:
- Ensured fault tolerance and high availability for most scenarios.
- If AWS Hyderabad went down, AWS Mumbai nodes would still maintain quorum and function seamlessly.
Cons:
Manual Intervention Required:
- If AWS Mumbai went down, only two of the five voting nodes would remain, short of a majority, so a third electable node would need to be added in Hyderabad to restore quorum.
- Manual intervention at this stage introduced a risk of delays, especially if response times were not immediate.
Cost Increase:
- Adding two electable nodes significantly increased operational costs, particularly for larger clusters.
While this approach improved fault tolerance, the need for manual intervention kept it from being the ideal solution.
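To show what that manual intervention would actually involve, here is a hypothetical sketch of bumping the Hyderabad electable node count from two to three during a Mumbai outage. The advanced-clusters endpoint, the version header, and the replicationSpecs/regionConfigs field names are assumptions; confirm them (and the exact spec Atlas expects back) against the Atlas Administration API documentation.

```python
# Hypothetical sketch of the manual step in Approach 3.3: during a Mumbai
# outage, raise the Hyderabad electable node count from two to three so a
# voting majority exists again. The endpoint, version header, and the
# replicationSpecs/regionConfigs field names are assumptions; confirm them
# (and the exact spec Atlas expects back) against the Admin API documentation.
import os
import requests
from requests.auth import HTTPDigestAuth

BASE = "https://cloud.mongodb.com/api/atlas/v2"
HEADERS = {"Accept": "application/vnd.atlas.2023-02-01+json"}  # assumed API version
PROJECT_ID = os.environ["ATLAS_PROJECT_ID"]
CLUSTER = "prod-cluster"  # hypothetical cluster name

auth = HTTPDigestAuth(os.environ["ATLAS_PUBLIC_KEY"], os.environ["ATLAS_PRIVATE_KEY"])
url = f"{BASE}/groups/{PROJECT_ID}/clusters/{CLUSTER}"

# Read the current topology, then bump the Hyderabad electable count.
spec = requests.get(url, auth=auth, headers=HEADERS, timeout=30).json()
for region in spec["replicationSpecs"][0]["regionConfigs"]:
    if region["regionName"] == "AP_SOUTH_2":        # assumed Atlas name for AWS Hyderabad
        region["electableSpecs"]["nodeCount"] = 3

resp = requests.patch(
    url,
    auth=auth,
    headers=HEADERS,
    json={"replicationSpecs": spec["replicationSpecs"]},
    timeout=30,
)
resp.raise_for_status()
print("Reconfiguration submitted")
```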
Approach 3.4: Adding Electable Nodes in non-Mumbai and non-Hyderabad regions
To address the shortcomings of the previous approaches, we considered a multi-region deployment (a configuration sketch follows the list below):
Configuration:
- Two electable nodes in AWS Mumbai.
- Two electable nodes in AWS Hyderabad.
- One electable node in a third region (e.g., AWS Singapore, GCP Delhi).
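As referenced above, here is a sketch of how this 2-2-1 layout might be declared in the regionConfigs style used by the Atlas Admin API and its Terraform provider. The region names, instance size, and field names are illustrative assumptions; the descending priorities simply keep Mumbai preferred as the primary region during normal operation.

```python
# Illustrative sketch of the 2-2-1 layout expressed in the regionConfigs style
# used by the Atlas Admin API and its Terraform provider. Region names,
# instance size, and field names are assumptions; the descending priorities
# keep Mumbai preferred as the primary region during normal operation.
replication_specs = [{
    "regionConfigs": [
        {"providerName": "AWS", "regionName": "AP_SOUTH_1",      # Mumbai
         "priority": 7, "electableSpecs": {"instanceSize": "M60", "nodeCount": 2}},
        {"providerName": "AWS", "regionName": "AP_SOUTH_2",      # Hyderabad
         "priority": 6, "electableSpecs": {"instanceSize": "M60", "nodeCount": 2}},
        {"providerName": "AWS", "regionName": "AP_SOUTHEAST_1",  # Singapore (third region)
         "priority": 5, "electableSpecs": {"instanceSize": "M60", "nodeCount": 1}},
    ],
}]
```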
Pros:
Improved Fault Tolerance:
- If AWS Mumbai goes down: Three electable nodes (two in Hyderabad, one in Singapore) maintain quorum, ensuring the cluster remains operational.
- If AWS Hyderabad goes down: Three electable nodes (two in Mumbai, one in Singapore) maintain quorum, ensuring the cluster remains operational.
- If AWS Singapore goes down: Four electable nodes remain (two in Mumbai, two in Hyderabad), minimizing the impact on elections and maintaining high availability. Having four electable nodes can also result in a tie during elections. Let’s explore what happens in such a scenario.
Resilience in Tie Situations:
MongoDB’s replica set election protocol retries (with randomized election timeouts) until a primary is elected, ensuring continuity even in edge cases like election ties. As long as a majority of voting nodes remains reachable, an election will always end with exactly one node becoming the primary.
Automatic Failover:
Eliminated the need for manual intervention, as quorum would be maintained automatically in every single-region failure scenario.
Cons:
Increased Costs:
- Adding a third region significantly increases expenses due to cross-region replication and additional nodes.
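A quick way to sanity-check the fault-tolerance argument above is to run the quorum math for every single-region outage: with five voting members split 2-2-1, any one region going down still leaves a strict majority.

```python
# Quick sanity check of the quorum math behind this layout: with five voting
# members split 2-2-1 across regions, any single-region outage still leaves a
# strict majority, so a primary can be elected without manual intervention.
votes = {"AWS Mumbai": 2, "AWS Hyderabad": 2, "Third region": 1}
total = sum(votes.values())
majority = total // 2 + 1   # 3 of 5

for down, lost in votes.items():
    surviving = total - lost
    status = "quorum kept" if surviving >= majority else "NO QUORUM"
    print(f"{down} down: {surviving}/{total} voting nodes remain -> {status}")
```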
Considerations Post Approach 3.4
While Approach 3.4 addressed many challenges, further refinements were necessary to ensure an optimal solution.
Consideration of ‘Pre-Warming’ under Approach 3.4
One critical realization from our tests was the inefficiency of approaches that required adding new nodes during a disaster. MongoDB Atlas employs a pre-warming process whenever a new node is added to a cluster, even in the same region where existing nodes already operate.
What is Pre-Warming?
- When a new node is added, Atlas copies data from the existing node’s EBS storage to the new node.
- It then performs an optimized sync process, which includes pre-warming.
- Pre-warming involves retrieving snapshot files from S3, storing them on the local disk, and touching each block to initialize the disk. This ensures that the node is optimized for future operations and that frequently accessed data is loaded into memory to improve performance (especially for larger clusters). A simplified sketch of the block-touching step follows below.
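The sketch below illustrates only the block-touching idea, similar in spirit to the dd- or fio-based EBS initialization that AWS recommends. Atlas performs its own pre-warming internally, so this is not what Atlas runs; it just shows why the step scales with disk size. The device path is a placeholder, and reading it requires root privileges.

```python
# Illustration only: sequentially "touching" every block of a restored volume,
# similar in spirit to the dd/fio-based EBS initialization AWS recommends.
# Atlas performs its own pre-warming internally; this just shows why the step
# scales with disk size. The device path is a placeholder and needs root access.
import time

DEVICE = "/dev/nvme1n1"        # placeholder block device for the restored volume
BLOCK_SIZE = 1024 * 1024       # read 1 MiB at a time

start = time.time()
read_bytes = 0
with open(DEVICE, "rb", buffering=0) as dev:
    while True:
        chunk = dev.read(BLOCK_SIZE)   # first read of each block hydrates it from the snapshot store
        if not chunk:
            break
        read_bytes += len(chunk)

elapsed = time.time() - start
print(f"Touched {read_bytes / 1e9:.1f} GB in {elapsed / 60:.1f} minutes")
```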
Challenges of Pre-Warming:
Time-Intensive Process:
For clusters with high storage (e.g., 4 TB), pre-warming can take significant time, depending on factors like disk size, network throughput, and cloud provider infrastructure. The lack of predictability in completion times adds further uncertainty.
Dependency on Multiple Variables:
Disk warming performance varies across cloud providers and regions. For instance, AWS might handle this differently than GCP due to differences in storage and network configurations.
Impact on DR Solution:
- Any approach that required adding new nodes during a disaster was deemed impractical, even for non-critical production clusters.
- This solidified our preference for the multi-region cluster setup in Approach 3.4, as it avoids the need for node addition during failover.
Consideration of ‘Third Region’ under Approach 3.4
Under the multi-region cluster setup, the choice of the third region became a pivotal decision. Our goal was to ensure minimal latency even in extreme scenarios (e.g., AWS Mumbai or AWS Hyderabad becoming unavailable).
Options Considered:
- GCP Delhi
- AWS Singapore
Key Factors Influencing the Decision:
NVMe Support:
- Regions offering NVMe-backed storage have better IOPS performance, which is critical for database-intensive workloads.
- The availability of NVMe in specific regions varied by cloud provider, influencing our selection.
Tier Availability:
- MongoDB Atlas provides various instance types (e.g., M60, M80, M140) based on region and cloud provider.
- Some regions lacked certain tiers, limiting flexibility in scaling.
Cost Considerations:
- Cost differences between AWS and GCP were a significant factor; in some cases, AWS was more expensive than GCP for the same configuration.
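As a practical aid for the tier-availability check, the sketch below queries which regions offer a given instance size. The provider-regions endpoint, its query parameters, and the response shape are assumptions to verify against the Atlas Administration API documentation; the same information is also published in Atlas’s region and tier documentation.

```python
# Hypothetical sketch: check which regions offer a given instance tier before
# committing to a third region. The provider-regions endpoint, its query
# parameters, and the response shape are assumptions to verify against the
# Atlas Administration API docs; the same data appears in Atlas's region docs.
import os
import requests
from requests.auth import HTTPDigestAuth

BASE = "https://cloud.mongodb.com/api/atlas/v2"
HEADERS = {"Accept": "application/vnd.atlas.2023-02-01+json"}  # assumed API version
PROJECT_ID = os.environ["ATLAS_PROJECT_ID"]

auth = HTTPDigestAuth(os.environ["ATLAS_PUBLIC_KEY"], os.environ["ATLAS_PRIVATE_KEY"])

resp = requests.get(
    f"{BASE}/groups/{PROJECT_ID}/clusters/provider/regions",
    params={"providers": ["AWS", "GCP"], "tier": "M60"},   # assumed query parameters
    auth=auth,
    headers=HEADERS,
    timeout=30,
)
resp.raise_for_status()
# The response (shape assumed) lists, per provider, the instance sizes and the
# regions in which each size is available.
print(resp.json())
```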
The final choice of third region was made by weighing these factors for each cluster.
Conclusion
The multi-region approach (Approach 3.4) with careful consideration of pre-warming challenges and third-region selection proved to be the most resilient and practical solution. It minimized downtime, ensured fault tolerance, and maintained high availability, all while balancing cost and latency.
By addressing these nuances — like pre-warming and third-region selection — we ensured that our active-active DR strategy was robust enough to handle real-world scenarios across critical and non-critical clusters.