Description
Key Focus Areas:
Disaster Recovery & Business Continuity
Azure Site Recovery Architecture
Automated DR Testing & Validation
RTO/RPO Governance
Multi-Region Failover Orchestration
Recovery Readiness Compliance
Executive Summary
Architected a cloud-native Disaster Recovery and Business Continuity platform on Microsoft Azure integrating Azure Site Recovery multi-region replication, automated failover orchestration, Infrastructure-as-Code deployment, continuous DR testing pipelines, immutable backup protection, and centralised RTO/RPO compliance dashboards.
The architecture establishes a multi-region active-passive DR platform — primary production workloads in East US with continuous ASR replication to a preconfigured secondary DR environment in Central US — capable of automated failover execution within defined RTO objectives and continuous replication within defined RPO objectives.
The primary differentiator of this platform is continuous DR validation — automated test failover pipelines executing regularly against isolated recovery environments, validating that recovery procedures remain operational rather than assuming readiness based on replication health metrics alone.
Business Drivers
Traditional disaster recovery approaches rely on passive backup retention and annual manual DR tests — neither of which provides confidence that recovery procedures will work when an actual incident occurs. Replication health metrics confirm data is being replicated but do not validate that recovered workloads will start, configure correctly, and serve traffic within RTO objectives.
This architecture was designed to address the BCDR requirements of organisations where existing approaches result in:
Manual and error-prone failover procedures — recovery steps documented in runbooks but untested under realistic conditions
Unverified DR readiness — replication health is monitored but actual recovery capability is assumed rather than validated
Inconsistent RTO/RPO enforcement — recovery objectives defined in policy but not measurably tracked against actual replication and recovery performance
Absence of automated DR testing — annual manual tests are disruptive, infrequent, and fail to detect recovery procedure drift between tests
Limited visibility into recovery health — no executive dashboard demonstrating DR readiness for governance and regulatory audit purposes
Compliance pressure from regulated industries — financial services, healthcare, and critical infrastructure regulators increasingly require demonstrable and tested recovery capabilities
Operational Constraints
The architecture was designed to operate within the following constraints typical of enterprise multi-region DR environments:
Cross-region replication must maintain RPO compliance continuously — replication lag must not exceed defined RPO thresholds without alerting
DR testing must not impact production environments — test failovers must execute in isolated environments without disrupting primary region workloads
Administrative access must remain secure and functional during DR scenarios — secondary region must have equivalent administrative access capability
Compliance reporting must demonstrate measurable RTO/RPO performance — audit evidence requires tracked metrics, not claimed objectives
Backup systems must provide ransomware resilience — immutable vault protection preventing backup deletion or modification
Recovery automation must scale with the number of protected workloads; manual orchestration does not keep pace as the workload estate grows
Secondary region infrastructure must be preconfigured — cold-start infrastructure provisioning during an actual incident extends RTO beyond acceptable targets
Recovery Objectives
| Workload Tier | Target RTO | Target RPO | Replication Mechanism | Test Frequency |
|---|---|---|---|---|
| Tier 1: Mission Critical | 2 hours | 15 minutes | ASR continuous replication | Monthly |
| Tier 2: Business Important | 4 hours | 1 hour | ASR continuous replication | Quarterly |
| Tier 3: Standard Operations | 8 hours | 4 hours | ASR + Azure Backup | Twice yearly |
| Database Tier | 1 hour | 5 minutes | ASR + SQL geo-replication | Monthly |
These recovery objectives represent design targets. Production RTO/RPO commitments require validation through load-tested recovery plan execution under realistic infrastructure conditions.
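The tier targets above can be held as data so that measured values are compared against them mechanically. The following is a minimal sketch, not the platform's implementation: the `RecoveryObjective` type, the dictionary keys, and `meets_objective` are hypothetical names introduced for illustration, and only the RTO/RPO values come from the table.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    tier: str
    rto: timedelta
    rpo: timedelta


# Design targets copied from the recovery objectives table (keys are hypothetical).
OBJECTIVES = {
    "tier1": RecoveryObjective("Tier 1", timedelta(hours=2), timedelta(minutes=15)),
    "tier2": RecoveryObjective("Tier 2", timedelta(hours=4), timedelta(hours=1)),
    "tier3": RecoveryObjective("Tier 3", timedelta(hours=8), timedelta(hours=4)),
    "database": RecoveryObjective("Database", timedelta(hours=1), timedelta(minutes=5)),
}


def meets_objective(tier_key: str, measured_rto: timedelta, measured_rpo: timedelta) -> bool:
    """Return True when both measured values are within the tier's targets."""
    target = OBJECTIVES[tier_key]
    return measured_rto <= target.rto and measured_rpo <= target.rpo
```

For example, a Tier 1 test that recovered in 95 minutes from a 10-minute-old recovery point satisfies both targets, while a database recovery from a 7-minute-old recovery point misses the 5-minute RPO.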
Architecture Principles
Recovery readiness by design — recovery capability must be continuously validated, not assumed from replication health metrics
Automated failover orchestration — recovery plans execute through predefined automation rather than manual runbook steps
Separation of replication and backup functions: ASR handles availability recovery (RTO), while Azure Backup handles data protection and long-term retention (RPO and compliance)
Immutable data protection — backup vaults configured with immutability and soft-delete preventing ransomware-driven backup deletion
Continuous DR validation — automated test failover pipelines executing on defined schedules detecting recovery procedure drift before incidents occur
Secure DR operations — secondary region maintains equivalent security controls and administrative access to primary region
Infrastructure automation and repeatability — secondary region infrastructure preconfigured through Terraform ensuring consistent recovery environment without cold-start provisioning delays
Centralised RTO/RPO observability — replication health, recovery performance, and test results tracked in unified compliance dashboards
Architecture Overview
The solution is structured as a six-layer multi-region BCDR platform integrating primary production hosting, secondary DR infrastructure, backup and retention, automation and DR testing, monitoring and observability, and governance and compliance.
1. Primary Production Region — East US
The primary region hosts production workloads with full security controls — serving as the operational baseline from which ASR replication targets the secondary region.
Workload Configuration:
Windows Server 2022 and RHEL virtual machines hosting application workloads
Private-only networking — no public IP addresses on workload VMs
NSG-enforced least-privilege ingress and egress traffic controls
ASR Mobility Service agent installed on all protected VMs — enabling continuous replication to secondary region
Secure Administrative Access:
Jumpbox VM in dedicated management subnet providing administrative access without public VM exposure
NSG on management subnet permitting inbound RDP/SSH from authorised administrative IP ranges only
Future evolution: Azure Bastion replacement eliminating jumpbox VM management overhead
ASR Replication Policy Configuration:
| Parameter | Tier 1 | Tier 2 | Tier 3 |
|---|---|---|---|
| RPO threshold alert | 15 minutes | 1 hour | 4 hours |
| App-consistent snapshot | Every 1 hour | Every 4 hours | Every 6 hours |
| Crash-consistent snapshot | Every 5 minutes | Every 5 minutes | Every 5 minutes |
| Recovery point retention | 72 hours | 24 hours | 15 days |
Multi-VM Consistency Groups: Related VMs sharing application dependencies are grouped in ASR multi-VM consistency groups — ensuring all VMs in a group are replicated to the same crash-consistent and application-consistent recovery points simultaneously. Without consistency groups, web, application, and database VMs may replicate to different points-in-time creating application-inconsistent recovery scenarios.
Example consistency group: order-processing-group containing order-web-vm, order-app-vm, and order-db-vm — all replicated to the same recovery point ensuring the recovered application stack is internally consistent.
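The effect of a consistency group can be illustrated by asking which recovery point every VM in the stack has in common. This is a conceptual sketch only: it does not call any ASR API, the function name is hypothetical, and the timestamps in the usage note are invented for illustration.

```python
from datetime import datetime
from typing import Optional


def latest_common_recovery_point(
    points_by_vm: dict[str, set[datetime]],
) -> Optional[datetime]:
    """Return the newest recovery point shared by every VM, or None if none is shared."""
    if not points_by_vm:
        return None
    # Only timestamps present for all VMs give an internally consistent stack.
    common = set.intersection(*points_by_vm.values())
    return max(common) if common else None
```

If `order-web-vm`, `order-app-vm`, and `order-db-vm` each snapshot at unrelated times, the intersection may be empty and no consistent stack-wide recovery point exists. When the three VMs share recovery points, as a consistency group is meant to ensure, the function returns the most recent shared one.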
2. Secondary Disaster Recovery Region — Central US
The secondary region serves as the preconfigured DR target — infrastructure deployed and validated before incidents occur, enabling rapid workload activation during failover.
Secondary Region Infrastructure (Preconfigured Through Terraform): All secondary region network infrastructure, NSGs, load balancers, and recovery vault configuration are deployed through Terraform in advance, eliminating cold-start infrastructure provisioning time from RTO calculations. Workload VMs are the only components not running in the secondary region during normal operations; ASR failover creates and starts them.
Azure Site Recovery — Failover Architecture:
Automated Network Mapping: ASR network mapping connects primary region subnets to corresponding secondary region subnets — failed-over VMs automatically receive IPs from the mapped recovery network without manual network reconfiguration during failover execution.
Recovery Plan Structure:
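ASR recovery plans start VMs in ordered groups, with optional actions between groups. One way to picture such a plan for the order-processing stack is sketched below. The group names, the database-first ordering, the action descriptions, and the `RecoveryGroup` type are all illustrative assumptions rather than the plan used by this platform, and nothing here calls ASR.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveryGroup:
    name: str
    vms: list[str]
    # Free-text descriptions of checks to run after the group starts (hypothetical).
    post_actions: list[str] = field(default_factory=list)


# Hypothetical ordering: data tier first, then application, then web front end.
ORDER_PROCESSING_PLAN = [
    RecoveryGroup("group-1-database", ["order-db-vm"], ["verify database accepts connections"]),
    RecoveryGroup("group-2-application", ["order-app-vm"]),
    RecoveryGroup("group-3-web", ["order-web-vm"], ["run HTTP health check"]),
]


def boot_sequence(plan: list[RecoveryGroup]) -> list[str]:
    """Flatten the plan into the order in which VMs would be started."""
    return [vm for group in plan for vm in group.vms]
```

Keeping the plan as ordered data makes it straightforward to review the start sequence whenever the workload topology changes.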
3. Backup & Retention Layer
Azure Backup provides data protection complementing ASR replication — addressing long-term retention, compliance retention, and cyber recovery scenarios that continuous replication alone cannot serve.
ASR vs Azure Backup — Complementary Functions:
| Capability | Azure Site Recovery | Azure Backup |
|---|---|---|
| Primary purpose | Availability: fast RTO | Data protection: RPO and compliance |
| Recovery granularity | Full VM failover | File, folder, VM, SQL point-in-time |
| Retention window | 24 hours to 15 days in this design (configurable per replication policy) | Years: compliance retention |
| Ransomware protection | Limited: replicates deletions and encrypted writes | Immutable vault: tamper-proof |
| Use case | Regional outage recovery | Data corruption, accidental deletion, compliance |
Azure Backup Configuration:
Recovery Services Vault with immutability enabled and locked, blocking operations that would delete recovery points or shorten backup retention
Soft delete with 14-day retention window providing secondary protection against accidental deletion
VM backup policy: daily backups with 30-day retention for operational recovery, weekly backups retained 52 weeks for compliance
SQL Server backup: full weekly, differential daily, transaction log every 15 minutes, supporting a 15-minute RPO for backup-based database restores (the 5-minute Database Tier RPO relies on ASR and SQL geo-replication)
Immutable Vault Configuration: Vault immutability prevents backup deletion and retention period shortening — even by subscription administrators. For regulated workloads, immutable vaults provide the tamper-proof backup retention evidence required by financial services and healthcare regulatory frameworks.
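The guarantee described above can be modelled in a few lines: extending retention is allowed, while shortening retention or deleting backup data is refused once the vault is immutable. This is a toy model for reasoning about the behaviour. `VaultModel` and `ImmutabilityViolation` are hypothetical names and do not correspond to any Azure SDK type.

```python
from dataclasses import dataclass


class ImmutabilityViolation(Exception):
    """Raised when a change would reduce protection on an immutable vault."""


@dataclass
class VaultModel:
    immutable: bool
    retention_days: int

    def set_retention(self, days: int) -> None:
        # Extending retention is always allowed; shortening is blocked when immutable.
        if self.immutable and days < self.retention_days:
            raise ImmutabilityViolation("retention cannot be shortened on an immutable vault")
        self.retention_days = days

    def delete_backup(self) -> None:
        # In this model a non-immutable vault simply permits the deletion.
        if self.immutable:
            raise ImmutabilityViolation("backup data cannot be deleted on an immutable vault")
```

The privilege level of the caller does not appear in the model at all, which mirrors the point that the restriction applies regardless of who requests the change.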
4. Automation & DR Testing Layer
The automated DR testing pipeline is the primary differentiator of this platform — continuous recovery validation detecting procedure drift before actual incidents occur.
DR Testing Pipeline Architecture:
DR Test Isolation — No Production Impact: Test failovers execute in an isolated network (test-failover-vnet) with no connectivity to production systems or external networks — recovered VMs start in a sandbox environment where application health can be validated without any risk of split-brain scenarios or production traffic routing to test-recovered VMs.
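The control flow of one test cycle can be sketched as: trigger the test failover, validate health inside the isolated network, then always clean up the test VMs. The sketch below assumes the three steps are supplied as callables; `run_dr_test` and its parameters are hypothetical names, and the real steps would invoke ASR and the validation scripts.

```python
import time
from typing import Callable


def run_dr_test(
    start_test_failover: Callable[[], None],
    validate_health: Callable[[], bool],
    cleanup_test_failover: Callable[[], None],
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Run one test failover cycle and report health and elapsed time."""
    started = clock()
    try:
        start_test_failover()            # recover VMs into the isolated test network
        healthy = validate_health()      # probe application health endpoints
        elapsed = clock() - started      # failover trigger to health confirmation
    finally:
        cleanup_test_failover()          # always remove test VMs, even on failure
    return {"healthy": healthy, "elapsed_seconds": elapsed}
```

The `finally` block is the important design choice: a failed validation or an exception during failover must not leave test-recovered VMs running in the sandbox network.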
DR Test Result Tracking:
| Test Metric | Target | Measurement Method | Pass Criterion |
|---|---|---|---|
| Actual RTO achieved | ≤ 2 hours | Measured from failover trigger to health confirmation | Pass if ≤ target |
| Actual RPO at recovery | ≤ 15 minutes | Recovery point timestamp vs test execution time | Pass if ≤ target |
| Application health validation | 100% endpoints healthy | HTTP health check success rate | Pass if 100% |
| Test failover completion | ≤ 30 minutes | ASR test failover execution duration | Pass if ≤ target |
Test results are published to the DR compliance dashboard and stored in Log Analytics — providing a continuous record of recovery capability validation for regulatory audit evidence.
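The pass criteria in the table can be evaluated by a small validation routine before results are published. The sketch below uses the Tier 1 targets from the table; the `DrTestMeasurement` type and `evaluate` function are hypothetical names, and treating a run with zero checked endpoints as a health failure is an assumption.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrTestMeasurement:
    rto: timedelta                 # failover trigger to health confirmation
    rpo: timedelta                 # test execution time minus recovery point timestamp
    endpoints_checked: int
    endpoints_healthy: int
    failover_duration: timedelta   # ASR test failover execution time


# Targets taken from the DR test result tracking table.
TARGET_RTO = timedelta(hours=2)
TARGET_RPO = timedelta(minutes=15)
TARGET_FAILOVER = timedelta(minutes=30)


def evaluate(m: DrTestMeasurement) -> dict[str, bool]:
    """Return a pass/fail flag per metric; the test passes only if all are True."""
    return {
        "rto": m.rto <= TARGET_RTO,
        "rpo": m.rpo <= TARGET_RPO,
        # Assumption: a run that checked no endpoints cannot count as healthy.
        "health": m.endpoints_checked > 0 and m.endpoints_healthy == m.endpoints_checked,
        "failover": m.failover_duration <= TARGET_FAILOVER,
    }
```

An overall verdict is then `all(evaluate(m).values())`, and the per-metric flags can be written alongside the raw measurements for trend reporting.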
5. Monitoring & Observability Layer
Centralised monitoring provides operational visibility across replication health, backup status, DR test results, and RTO/RPO compliance tracking.
Azure Monitor — Replication Health Alerting:
ASR replication health alerts — notification when any protected VM deviates from Normal replication state
RPO breach alerts — notification when replication lag approaches or exceeds defined RPO thresholds
Backup job failure alerts — immediate notification of backup job failures before next scheduled backup window
Recovery vault health alerts — notification of vault configuration changes or immutability violations
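The "approaches or exceeds" behaviour of the RPO breach alert amounts to a two-level threshold. The sketch below is illustrative only: the 80% warning ratio is an assumption rather than a value from this design, and in practice the rule would be expressed as an Azure Monitor alert rather than Python.

```python
from datetime import timedelta

# Assumption: warn once lag reaches 80% of the tier's RPO threshold.
WARNING_RATIO = 0.8


def classify_rpo_lag(lag: timedelta, threshold: timedelta) -> str:
    """Classify replication lag as 'ok', 'warning' (approaching) or 'breach' (exceeded)."""
    if lag > threshold:
        return "breach"
    if lag >= threshold * WARNING_RATIO:
        return "warning"
    return "ok"
```

With the Tier 1 threshold of 15 minutes, a 13-minute lag would raise a warning before the objective is actually breached.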
Azure Log Analytics — DR Operational Analytics:
ASR replication event logs — failover executions, replication state changes, and recovery plan operations
Azure Backup job completion logs — backup success, failure, and retention compliance tracking
DR test pipeline execution results — actual RTO and RPO achieved per test, trend over time
Immutable vault audit logs — any access or modification attempt against protected backup vaults
Power BI — RTO/RPO Compliance Dashboards:
| Dashboard | Audience | Content |
|---|---|---|
| DR Readiness Executive Summary | CISO / CTO | Overall DR readiness score, last test date, RTO/RPO compliance rate |
| Replication Health Dashboard | IT Operations | Per-VM replication health, RPO lag, consistency group status |
| DR Test History | Governance / Audit | Historical test results, RTO/RPO trend, pass/fail per test |
| Backup Compliance Report | Compliance Team | Backup coverage, retention compliance, vault integrity status |
| Recovery Time Performance | IT Management | Actual vs target RTO by workload tier, trend analysis |
DR Readiness Score Methodology:
6. Governance & Compliance Layer
Azure Policy — DR Compliance Enforcement:
Deny unprotected VM deployment in production resource groups — VMs must be enrolled in ASR replication
Audit backup coverage — alert on VMs without Azure Backup policy assignment
Require Recovery Services Vault immutability for production vaults
Enforce approved recovery regions — replication target must be the designated DR region
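Azure Policy rules are written as JSON definitions and evaluated by the platform, but the checks listed above can be modelled in Python to show what each rule looks for. In this sketch the `VmRecord` fields and the `findings` function are hypothetical, and `centralus` is used as the designated DR region because Central US is the secondary region in this design.

```python
from dataclasses import dataclass
from typing import Optional

DR_REGION = "centralus"  # designated DR region in this design


@dataclass(frozen=True)
class VmRecord:
    name: str
    asr_target_region: Optional[str]   # None when the VM is not replicated
    has_backup_policy: bool


def findings(vm: VmRecord) -> list[str]:
    """Return the DR compliance issues found for one production VM."""
    issues = []
    if vm.asr_target_region is None:
        issues.append("not enrolled in ASR replication")
    elif vm.asr_target_region != DR_REGION:
        issues.append("replicates to a non-approved region")
    if not vm.has_backup_policy:
        issues.append("no Azure Backup policy assigned")
    return issues
```

An empty list means the VM satisfies the replication, region, and backup checks. The vault immutability requirement applies to vaults rather than VMs and is not covered by this model.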
Microsoft Defender for Cloud — Security Posture in DR Context:
Security recommendations for DR-related configurations — unprotected VMs, unencrypted backup vaults
Regulatory compliance assessment against resilience-related framework controls
Threat protection on replicated VMs — security monitoring continues in secondary region
Terraform — Infrastructure Governance: All primary and secondary region infrastructure managed through Terraform — consistent deployment, version-controlled configuration, and auditable change history for both production and DR environments.
Architecture Diagram

Technologies Used
| Category | Technologies |
|---|---|
| Disaster Recovery | Azure Site Recovery (ASR) |
| Backup & Retention | Azure Backup, Immutable Recovery Services Vault |
| DR Testing | Azure DevOps YAML Pipelines, Python validation scripts |
| Infrastructure as Code | Terraform |
| Cloud Platform | Azure VMs (Windows Server 2022, RHEL), Azure VNets, NSGs |
| Administrative Access | Jumpbox VMs (interim; Bastion planned) |
| Monitoring | Azure Monitor, Log Analytics |
| Reporting | Power BI, Azure Workbooks |
| Governance | Azure Policy, Microsoft Defender for Cloud |
| Compliance Frameworks | ISO 22301 (Business Continuity), NIST SP 800-34, PCI DSS v4.0 |
Key Challenges Addressed
Ensuring reliable cross-region replication without data inconsistency — addressed through multi-VM consistency groups ensuring related VMs replicate to the same recovery point simultaneously — preventing application-inconsistent recovery scenarios where web, application, and database tiers recover to different points-in-time.
Validating RTO/RPO targets under realistic operational conditions — addressed through automated test failover pipeline measuring actual failover execution time and recovery point timestamp — providing empirical RTO/RPO validation data rather than theoretical estimates.
Automating DR testing without production impact — addressed through test failover execution in isolated networks with no production connectivity — recovered VMs operate in a sandbox environment with application health validation but no production traffic routing risk.
Maintaining secure access during failover scenarios — addressed through preconfigured secondary region administrative infrastructure — jumpbox VMs and NSG configurations deployed in the secondary region before incidents occur, ensuring administrative access remains operational immediately after failover.
Protecting backups against ransomware and destructive operations — addressed through immutable vault configuration preventing backup deletion or retention period modification — complementing ASR replication which would replicate ransomware encryption to the secondary region without backup protection.
Providing measurable RTO/RPO compliance evidence — addressed through automated DR test result collection, Power BI compliance dashboards, and Log Analytics trend storage — producing auditable, continuously updated recovery performance evidence for regulatory review.
Design Decisions & Rationale
Active-Passive over Active-Active DR Model : Active-active multi-region deployment provides near-zero RTO but requires significantly higher infrastructure cost, because full production capacity runs in two regions simultaneously. Active-passive provides acceptable RTO (2 hours for Tier 1) at significantly lower cost, since secondary region compute resources are not running until failover activation. For most enterprise workloads where 2-hour RTO is acceptable, active-passive provides the appropriate cost-to-resilience balance.
Separation of ASR Replication and Azure Backup : ASR replication is optimised for availability — fast RTO through continuous replication and orchestrated failover. However, ASR replication faithfully replicates data corruption and ransomware encryption to the secondary region — it provides no protection against data integrity failures. Azure Backup provides independent, immutable data protection covering corruption, accidental deletion, and long-term compliance retention that ASR cannot serve. The two mechanisms address different failure scenarios and must coexist.
Automated DR Testing over Annual Manual Tests : Annual manual DR tests are expensive, disruptive, and infrequent — a recovery procedure that worked in January may have drifted by October due to infrastructure changes, application updates, or network reconfigurations. Monthly automated test failover pipelines detect procedure drift continuously — the cost of a failed automated test is minimal; the cost of a failed actual recovery is catastrophic.
Preconfigured Secondary Region Infrastructure : Cold-start secondary region infrastructure provisioning during an actual incident extends RTO beyond acceptable targets — Terraform deployment of network infrastructure typically takes 15-30 minutes before ASR failover can begin. Preconfiguring secondary region network infrastructure, NSGs, and load balancers before incidents means failover can begin immediately — compute resources start through ASR failover while network infrastructure is already operational.
Immutable Vault for Production Backup Protection : Standard Recovery Services Vaults permit backup deletion and retention period modification by administrators, so a ransomware actor with sufficient Azure access can delete backup copies before triggering encryption. Enabling and locking vault immutability blocks operations that would delete recovery points or shorten retention, regardless of administrative privilege level, providing tamper-proof backup protection. The operational constraint (the lock is irreversible and retention periods cannot be shortened afterwards) is an acceptable trade-off for the protection it provides.
Azure-Native Services over Third-Party DR Platforms : Third-party DR platforms introduce additional licensing cost, operational tooling complexity, and Azure integration overhead. Azure Site Recovery and Azure Backup provide native integration with Azure VMs, Azure networking, Azure Policy governance, and Azure Monitor — reducing operational complexity while maintaining enterprise-grade DR capability appropriate for most workload categories.
Trade-offs & Design Constraints
Active-Passive RTO Dependency on Secondary Region Readiness : The 2-hour RTO target depends on secondary region network infrastructure being preconfigured before incidents. If Terraform-managed secondary region infrastructure is not maintained in sync with primary region changes — new subnets added in primary not replicated to secondary, NSG rules updated in primary not applied to secondary — failover may encounter infrastructure mismatches extending actual RTO beyond the 2-hour target. Infrastructure drift detection between primary and secondary regions should be monitored through scheduled Terraform plan runs comparing state.
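A scheduled drift check can rely on `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The sketch below is a hedged example: the function names and status strings are hypothetical, and it detects drift between the Terraform configuration and the deployed secondary region rather than diffing the two regions directly.

```python
import subprocess


def interpret_plan_exit_code(code: int) -> str:
    """Map `terraform plan -detailed-exitcode` exit codes to a drift status."""
    if code == 0:
        return "in-sync"   # plan succeeded with an empty diff
    if code == 2:
        return "drift"     # plan succeeded and found pending changes
    return "error"         # 1 (or anything unexpected) means the plan itself failed


def check_drift(working_dir: str) -> str:
    """Run a plan in the given Terraform directory and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=working_dir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(result.returncode)
```

A scheduler would call `check_drift` against the secondary region's configuration directory and raise an alert on either `drift` or `error`, since a plan that cannot run is also a readiness risk.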
ASR Replication Faithfully Replicates Corruption : ASR continuous replication does not distinguish between healthy writes and ransomware encryption writes — it replicates all changes to the secondary region. If ransomware encrypts files in the primary region, the encrypted versions are replicated to secondary within the RPO window. Recovery from ransomware scenarios requires Azure Backup restore from a pre-infection recovery point — not ASR failover to the secondary region. The architecture must clearly document which failure scenarios are addressed by ASR (regional outage) versus Azure Backup (data integrity failure, ransomware).
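The scenario-to-mechanism mapping described above can be documented as a simple lookup that runbooks or tooling consult before starting recovery. The scenario keys and the `Mechanism` enum below are hypothetical names; the mapping itself follows the split stated in this document.

```python
from enum import Enum


class Mechanism(Enum):
    ASR_FAILOVER = "ASR failover to the secondary region"
    BACKUP_RESTORE = "Azure Backup restore from a pre-incident recovery point"


# Which mechanism addresses which failure scenario (keys are illustrative).
RECOVERY_PATH = {
    "regional_outage": Mechanism.ASR_FAILOVER,
    "ransomware": Mechanism.BACKUP_RESTORE,
    "data_corruption": Mechanism.BACKUP_RESTORE,
    "accidental_deletion": Mechanism.BACKUP_RESTORE,
}


def recovery_mechanism(scenario: str) -> Mechanism:
    """Return the documented recovery mechanism for a failure scenario."""
    try:
        return RECOVERY_PATH[scenario]
    except KeyError:
        raise ValueError(f"no documented recovery path for scenario: {scenario}") from None
```

Raising on an unknown scenario is deliberate: an unmapped failure type should prompt a decision by the incident lead rather than default silently to failover.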
Test Failover Isolated Network Validation Limitations : Test failovers execute in isolated networks — application health endpoints are validated against the isolated test environment, not against actual production dependencies (external APIs, on-premises systems, DNS resolution). Validation scripts must account for these isolation boundaries — testing that the application starts and responds to health checks in isolation, not that it can process live production transactions. Full end-to-end production traffic validation requires planned failover (actual failover with production traffic) rather than test failover.
Recovery Plan Maintenance Overhead : As application workloads evolve — new VMs added, services decomposed, dependencies changed — recovery plans must be updated to reflect current architecture. Stale recovery plans that do not match current workload topology cause failover failures or incorrect recovery sequences. Recovery plan definitions should be managed through Terraform with mandatory update procedures triggered by infrastructure change events.
Multi-VM Consistency Group Performance Impact : ASR multi-VM consistency groups generate application-consistent snapshots across all VMs in the group simultaneously — requiring VSS quiescence for Windows VMs. At high frequency (every hour for Tier 1), this quiescence can briefly impact application performance during snapshot operations. Consistency group snapshot frequency must be balanced against performance impact — for latency-sensitive applications, less frequent consistency snapshots (every 4 hours) with crash-consistent replication (every 5 minutes) may be the appropriate trade-off.
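The trade-off can be made concrete by counting how many snapshot operations each interval produces per day. This is simple arithmetic on the intervals quoted above; the helper name is hypothetical.

```python
from datetime import timedelta


def snapshots_per_day(interval: timedelta) -> int:
    """Number of whole snapshot intervals that fit into 24 hours."""
    return timedelta(days=1) // interval


hourly_app_consistent = snapshots_per_day(timedelta(hours=1))       # 24
four_hourly_app_consistent = snapshots_per_day(timedelta(hours=4))  # 6
crash_consistent = snapshots_per_day(timedelta(minutes=5))          # 288
```

Moving application-consistent snapshots from hourly to every 4 hours cuts quiescence events from 24 to 6 per day, while the 288 crash-consistent points per day still bound data loss at roughly 5 minutes.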
Projected Outcomes
The architecture is designed to deliver the following resilience and governance outcomes in a production enterprise environment:
Measurable RTO compliance through automated DR test pipeline validation — empirical recovery time measurement replacing theoretical RTO estimates
Continuous RPO enforcement through ASR replication health monitoring with threshold-based alerting before RPO objectives are breached
Near real-time cross-region replication for Tier 1 workloads maintaining 15-minute RPO under normal operating conditions
Automated monthly DR testing providing continuous recovery procedure validation — detecting drift before actual incidents require recovery
Immutable backup protection preventing ransomware-driven backup deletion independent of ASR replication integrity
Executive DR readiness dashboards providing governance-ready evidence of recovery capability for regulatory audit responses
Preconfigured secondary region infrastructure enabling immediate failover initiation without cold-start provisioning delays
Auditable DR test history stored in Log Analytics — continuous compliance evidence record for regulatory frameworks requiring demonstrable recovery capability
Future Evolution
Multi-region active-active recovery models for highest-criticality Tier 0 workloads where 2-hour RTO is not acceptable
AI-assisted failover optimisation through Azure Monitor intelligent alerting predicting replication degradation before RPO breach
Automated chaos engineering through Azure Chaos Studio validating application resilience under component failure scenarios beyond regional outage
Self-healing infrastructure remediation detecting and correcting secondary region infrastructure drift automatically
Cross-cloud disaster recovery federation extending BCDR coverage to workloads in AWS or GCP through multi-cloud ASR equivalent tooling
Continuous compliance validation automation through Azure Policy initiative tracking DR coverage requirements across all production workloads
Advanced ransomware recovery orchestration through dedicated cyber recovery vault with airgapped isolation
Integrated cyber recovery vault architecture providing isolated recovery environment for incidents where primary and secondary regions are simultaneously compromised
Key Takeaways
Disaster recovery requires continuous validation, not passive replication monitoring — replication health metrics confirm data is moving but do not validate that recovered workloads will start correctly and serve traffic within RTO objectives
Automated monthly DR testing is the most impactful operational maturity improvement for enterprise BCDR — annual manual tests are too infrequent to catch recovery procedure drift
ASR and Azure Backup are complementary, not redundant: ASR addresses regional outage recovery (RTO), while Azure Backup addresses data integrity protection and compliance retention (RPO and regulatory requirements)
Preconfiguring secondary region infrastructure before incidents is essential for achieving aggressive RTO targets — cold-start infrastructure provisioning during actual incidents extends recovery time unpredictably
ASR faithfully replicates ransomware encryption — regional replication does not protect against data integrity failures; immutable Azure Backup vaults are the correct protection for cyber recovery scenarios
Active-passive architecture provides the appropriate cost-to-resilience balance for most enterprise workloads; active-active provides near-zero RTO at significantly higher infrastructure cost, justified only for the highest-criticality workloads
Recovery plan maintenance must be treated as an ongoing operational requirement — stale plans not reflecting current workload topology are a primary cause of DR test failures
