Cloud-Native Disaster Recovery & Business Continuity Platform

Multi-Region Active-Passive BCDR with Automated Recovery Validation & RTO/RPO Governance

GitHub: https://github.com/sergeksfumey/cloud-native-bcdr-platform

ARCHITECTURE OVERVIEW

Cloud-Native BCDR Platform — Azure Site Recovery, Immutable Backup, and Policy-as-Code Compliance

Cross-region VM replication from East US to Central US, long-term immutable backup retention, automated DR testing via Azure DevOps, and real-time Power BI compliance dashboards, aligned to ISO 27001, PCI DSS, and NIST 800-53

The architecture delivers a fully cloud-native disaster recovery and business continuity platform with no dependency on on-premises infrastructure. All workloads run in a private-only virtual network (vnet-bcdr, East US) with no direct internet exposure to application VMs. Administrative access flows exclusively through a Windows 11 jumpbox — NSGs on vm-app1 and vm-app2 enforce allow rules scoped to the jumpbox IP only, with a deny-all default blocking all other inbound traffic. This jumpbox architecture eliminates the attack surface associated with publicly exposed RDP and SSH endpoints.

Azure Site Recovery continuously replicates all three VMs — vm-jumpbox, vm-app1 (Windows Server 2022), and vm-app2 (RHEL 9) — from East US to Central US through a cache storage account. Network mapping pre-configures the recovery network in Central US, meaning that failover can be executed without manual network reconfiguration. Multi-VM consistency groups ensure that interdependent workloads fail over together, maintaining application-level coherence. Test failover plans and cleanup procedures are defined and automated through Azure DevOps Pipelines and GitHub Actions, enabling regular DR drills without impacting production.

Azure Backup protects all three VMs through a comprehensive retention policy — daily, weekly, monthly, and yearly recovery points — stored in immutable Recovery Services Vaults with soft delete enabled. Immutability ensures that backup data cannot be deleted or modified during the retention period, closing the ransomware and insider threat vectors against the backup infrastructure itself.

The compliance and monitoring plane tracks RTO and RPO metrics continuously through Azure Monitor and Log Analytics, with Power BI dashboards providing real-time visualisation of recovery posture and policy compliance against a design target of 95%+ compliance across all environments. Over 100 Azure Policy definitions are deployed via Terraform, enforced through Microsoft Defender for Cloud initiatives, and auto-remediated through Logic Apps and Azure Functions, with the aim of reducing mean time to remediation by 80% compared to manual compliance processes.

Description

This case study is an independent architecture design exercise developed to demonstrate cloud-native Disaster Recovery and Business Continuity (BCDR) platform architecture for enterprise Azure environments. It was not associated with a production deployment. The scenario is based on the resilience engineering and recovery governance requirements typical of organisations operating regulated workloads with defined RTO/RPO obligations across multiple Azure regions. This study focuses on the comprehensive BCDR platform — multi-region ASR replication, automated failover orchestration, continuous DR testing pipelines, and RTO/RPO compliance dashboards. Immutable backup protection and ransomware recovery are covered in depth in the Immutable Backup and Ransomware Recovery Framework and Hybrid Backup Architecture for Compliance Retention case studies.

Key Focus Areas:

Disaster Recovery & Business Continuity
Azure Site Recovery Architecture
Automated DR Testing & Validation
RTO/RPO Governance
Multi-Region Failover Orchestration
Recovery Readiness Compliance

Executive Summary

Architected a cloud-native Disaster Recovery and Business Continuity platform on Microsoft Azure integrating Azure Site Recovery multi-region replication, automated failover orchestration, Infrastructure-as-Code deployment, continuous DR testing pipelines, immutable backup protection, and centralised RTO/RPO compliance dashboards.

The architecture establishes a multi-region active-passive DR platform — primary production workloads in East US with continuous ASR replication to a preconfigured secondary DR environment in Central US — capable of automated failover execution within defined RTO objectives and continuous replication within defined RPO objectives.

The primary differentiator of this platform is continuous DR validation — automated test failover pipelines executing regularly against isolated recovery environments, validating that recovery procedures remain operational rather than assuming readiness based on replication health metrics alone.

Business Drivers

Traditional disaster recovery approaches rely on passive backup retention and annual manual DR tests — neither of which provides confidence that recovery procedures will work when an actual incident occurs. Replication health metrics confirm data is being replicated but do not validate that recovered workloads will start, configure correctly, and serve traffic within RTO objectives.

This architecture was designed to address the BCDR requirements of organisations where existing approaches result in:

Manual and error-prone failover procedures — recovery steps documented in runbooks but untested under realistic conditions
Unverified DR readiness — replication health is monitored but actual recovery capability is assumed rather than validated
Inconsistent RTO/RPO enforcement — recovery objectives defined in policy but not measurably tracked against actual replication and recovery performance
Absence of automated DR testing — annual manual tests are disruptive, infrequent, and fail to detect recovery procedure drift between tests
Limited visibility into recovery health — no executive dashboard demonstrating DR readiness for governance and regulatory audit purposes
Compliance pressure from regulated industries — financial services, healthcare, and critical infrastructure regulators increasingly require demonstrable and tested recovery capabilities

Operational Constraints

The architecture was designed to operate within the following constraints typical of enterprise multi-region DR environments:

Cross-region replication must maintain RPO compliance continuously — replication lag must not exceed defined RPO thresholds without alerting
DR testing must not impact production environments — test failovers must execute in isolated environments without disrupting primary region workloads
Administrative access must remain secure and functional during DR scenarios — secondary region must have equivalent administrative access capability
Compliance reporting must demonstrate measurable RTO/RPO performance — audit evidence requires tracked metrics, not claimed objectives
Backup systems must provide ransomware resilience — immutable vault protection preventing backup deletion or modification
Recovery automation must be scalable across workload count — manual orchestration does not scale as workload estate grows
Secondary region infrastructure must be preconfigured — cold-start infrastructure provisioning during an actual incident extends RTO beyond acceptable targets

Recovery Objectives

Workload Tier	Target RTO	Target RPO	Replication Mechanism	Test Frequency
Tier 1 — Mission Critical	2 hours	15 minutes	ASR continuous replication	Monthly
Tier 2 — Business Important	4 hours	1 hour	ASR continuous replication	Quarterly
Tier 3 — Standard Operations	8 hours	4 hours	ASR + Azure Backup	Bi-annually
Database Tier	1 hour	5 minutes	ASR + SQL geo-replication	Monthly

These recovery objectives represent design targets. Production RTO/RPO commitments require validation through load-tested recovery plan execution under realistic infrastructure conditions.
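
The tier targets above can be held as data so that test tooling compares measured values against them instead of hard-coding numbers. The following is a minimal sketch under that assumption; the tier keys and the `meets_objectives` helper are illustrative names, not part of the platform.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    """Design-target RTO and RPO for one workload tier."""
    rto: timedelta
    rpo: timedelta


# Values copied from the Recovery Objectives table; keys are hypothetical identifiers.
OBJECTIVES = {
    "tier1": RecoveryObjective(rto=timedelta(hours=2), rpo=timedelta(minutes=15)),
    "tier2": RecoveryObjective(rto=timedelta(hours=4), rpo=timedelta(hours=1)),
    "tier3": RecoveryObjective(rto=timedelta(hours=8), rpo=timedelta(hours=4)),
    "database": RecoveryObjective(rto=timedelta(hours=1), rpo=timedelta(minutes=5)),
}


def meets_objectives(tier: str, measured_rto: timedelta, measured_rpo: timedelta) -> bool:
    """Return True only if both measured values are within the tier's targets."""
    target = OBJECTIVES[tier]
    return measured_rto <= target.rto and measured_rpo <= target.rpo
```

For example, a Tier 1 test that recovers in 95 minutes from a 10-minute-old recovery point passes, while a Database Tier test with a 6-minute-old recovery point fails on RPO.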

Architecture Principles

Recovery readiness by design — recovery capability must be continuously validated, not assumed from replication health metrics
Automated failover orchestration — recovery plans execute through predefined automation rather than manual runbook steps
Separation of replication and backup functions — ASR handles availability recovery (RTO), Azure Backup handles data protection and long-term retention (RPO and compliance)
Immutable data protection — backup vaults configured with immutability and soft-delete preventing ransomware-driven backup deletion
Continuous DR validation — automated test failover pipelines executing on defined schedules detecting recovery procedure drift before incidents occur
Secure DR operations — secondary region maintains equivalent security controls and administrative access to primary region
Infrastructure automation and repeatability — secondary region infrastructure preconfigured through Terraform ensuring consistent recovery environment without cold-start provisioning delays
Centralised RTO/RPO observability — replication health, recovery performance, and test results tracked in unified compliance dashboards

Architecture Overview

The solution is structured as a six-layer multi-region BCDR platform integrating primary production hosting, secondary DR infrastructure, backup and retention, automation and DR testing, monitoring and observability, and governance and compliance.

1. Primary Production Region — East US

The primary region hosts production workloads with full security controls — serving as the operational baseline from which ASR replication targets the secondary region.

Workload Configuration:

Windows Server 2022 and RHEL virtual machines hosting application workloads
Private-only networking — no public IP addresses on workload VMs
NSG-enforced least-privilege ingress and egress traffic controls
ASR Mobility Service agent installed on all protected VMs — enabling continuous replication to secondary region

Secure Administrative Access:

Jumpbox VM in dedicated management subnet providing administrative access without public VM exposure
NSG on management subnet permitting inbound RDP/SSH from authorised administrative IP ranges only
Future evolution: Azure Bastion replacement eliminating jumpbox VM management overhead

ASR Replication Policy Configuration:

Parameter	Tier 1	Tier 2	Tier 3
RPO threshold alert	15 minutes	1 hour	4 hours
App-consistent snapshot	Every 1 hour	Every 4 hours	Every 6 hours
Crash-consistent snapshot	Every 5 minutes	Every 5 minutes	Every 5 minutes
Recovery point retention	72 hours	24 hours	15 days

Multi-VM Consistency Groups: Related VMs sharing application dependencies are grouped in ASR multi-VM consistency groups — ensuring all VMs in a group are replicated to the same crash-consistent and application-consistent recovery points simultaneously. Without consistency groups, web, application, and database VMs may replicate to different points-in-time creating application-inconsistent recovery scenarios.

Example consistency group: order-processing-group containing order-web-vm, order-app-vm, and order-db-vm — all replicated to the same recovery point ensuring the recovered application stack is internally consistent.
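
To illustrate why a shared recovery point matters, the sketch below finds the newest timestamp that every VM in a group has in common. It is a conceptual model only: ASR coordinates consistency groups internally, and the function name and input shape here are assumptions made for illustration.

```python
from datetime import datetime
from typing import Optional


def latest_common_recovery_point(
    points_by_vm: dict[str, set[datetime]],
) -> Optional[datetime]:
    """Return the newest recovery point present for every VM, or None if none is shared."""
    if not points_by_vm:
        return None
    # Only timestamps that exist for all VMs give an internally consistent stack.
    common = set.intersection(*points_by_vm.values())
    return max(common) if common else None
```

If order-web-vm and order-app-vm hold points at 10:00 and 10:05 but order-db-vm only holds 10:00, the function returns 10:00. If the VMs share no timestamp it returns None, which is the application-inconsistent situation that consistency groups are meant to prevent.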

2. Secondary Disaster Recovery Region — Central US

The secondary region serves as the preconfigured DR target — infrastructure deployed and validated before incidents occur, enabling rapid workload activation during failover.

Secondary Region Infrastructure (Preconfigured Through Terraform): All secondary region network infrastructure, NSGs, load balancers, and recovery vault configuration are deployed through Terraform in advance, which removes cold-start infrastructure provisioning time from RTO calculations. Compute is the only resource type not running in the secondary region during normal operations; ASR creates and starts the replica VMs when a failover executes.

Azure Site Recovery — Failover Architecture:

Primary Region (East US)           Secondary Region (Central US)
─────────────────────────          ──────────────────────────────
production-vnet (10.0.0.0/16)  →  recovery-vnet (10.1.0.0/16)
  workload-subnet                    recovery-subnet
  management-subnet                  recovery-management-subnet
  NSG-production                     NSG-recovery (pre-deployed)
  Load Balancer (active)             Load Balancer (standby)
  VM-web-01 (running)                VM-web-01 (replicated — not running)
  VM-app-01 (running)                VM-app-01 (replicated — not running)
  VM-db-01 (running)                 VM-db-01 (replicated — not running)

Automated Network Mapping: ASR network mapping connects primary region subnets to corresponding secondary region subnets — failed-over VMs automatically receive IPs from the mapped recovery network without manual network reconfiguration during failover execution.

Recovery Plan Structure:

Recovery Plan: order-processing-recovery

Group 1 (execute first):
  - Script: validate-recovery-network-connectivity
  - Script: start-recovery-database-services

Group 2 (execute after Group 1 completes):
  - Failover: VM-db-01 (database tier)
  - Wait: 5 minutes (database startup validation)

Group 3 (execute after Group 2 completes):
  - Failover: VM-app-01 (application tier)
  - Wait: 3 minutes (application startup validation)

Group 4 (execute after Group 3 completes):
  - Failover: VM-web-01 (web tier)
  - Script: validate-application-health-endpoint
  - Script: update-dns-records-to-recovery-region
  - Script: notify-operations-team-failover-complete
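
The ordering rule in the plan above (each group starts only after the previous group completes) can be sketched as a small runner. This is an illustrative model rather than how ASR executes recovery plans; `run_recovery_plan` and the step callables are hypothetical.

```python
from typing import Callable

# Each step is (description, action); an action returns True on success.
Step = tuple[str, Callable[[], bool]]


def run_recovery_plan(groups: list[list[Step]]) -> list[str]:
    """Run groups in order and stop at the first failed step.

    Returns the descriptions of the steps that completed successfully.
    """
    completed: list[str] = []
    for group in groups:
        for description, action in group:
            if not action():
                return completed  # later groups never start
            completed.append(description)
    return completed
```

With a four-group plan in which the application tier step fails, only the network validation and database failover steps are reported as completed, and the web tier failover is never attempted.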

3. Backup & Retention Layer

Azure Backup provides data protection complementing ASR replication — addressing long-term retention, compliance retention, and cyber recovery scenarios that continuous replication alone cannot serve.

ASR vs Azure Backup — Complementary Functions:

Capability	Azure Site Recovery	Azure Backup
Primary purpose	Availability — fast RTO	Data protection — RPO and compliance
Recovery granularity	Full VM failover	File, folder, VM, SQL point-in-time
Retention window	72 hours (configurable)	Years — compliance retention
Ransomware protection	Limited — replicates deletions	Immutable vault — tamper-proof
Use case	Regional outage recovery	Data corruption, accidental deletion, compliance

Azure Backup Configuration:

Recovery Services Vault with immutability enabled and locked, preventing deletion of backup data and shortening of retention
Soft delete with 14-day retention window providing secondary protection against accidental deletion
VM backup policy: daily backups with 30-day retention for operational recovery, weekly backups retained 52 weeks for compliance
SQL Server backup: full weekly, differential daily, transaction log every 15 minutes, supporting a 15-minute RPO for backup-based database restores (the 5-minute Database Tier RPO relies on ASR and SQL geo-replication)
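
As a rough model of the VM backup policy above, the sketch below decides whether a backup taken on a given day would still be retained. It assumes the weekly recovery point is the Sunday backup, which the policy description does not specify, and it is not how Azure Backup evaluates retention internally.

```python
from datetime import date, timedelta

DAILY_RETENTION = timedelta(days=30)    # operational recovery
WEEKLY_RETENTION = timedelta(weeks=52)  # compliance retention
WEEKLY_BACKUP_WEEKDAY = 6               # Sunday; an assumption, not stated in the policy


def is_retained(backup_day: date, today: date) -> bool:
    """Return True if a backup from backup_day is still within retention on today."""
    age = today - backup_day
    if age < timedelta(0):
        raise ValueError("backup_day is in the future")
    if age <= DAILY_RETENTION:
        return True
    # Older backups survive only if they are the weekly point and under 52 weeks old.
    return backup_day.weekday() == WEEKLY_BACKUP_WEEKDAY and age <= WEEKLY_RETENTION
```

Evaluated on Sunday 30 June 2024, a backup from 10 June is kept under the daily rule, a Wednesday backup from 1 May has expired, and the Sunday backup from 28 April is kept under the weekly rule.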

Immutable Vault Configuration: Vault immutability prevents backup deletion and retention period shortening — even by subscription administrators. For regulated workloads, immutable vaults provide the tamper-proof backup retention evidence required by financial services and healthcare regulatory frameworks.

4. Automation & DR Testing Layer

The automated DR testing pipeline is the primary differentiator of this platform — continuous recovery validation detecting procedure drift before actual incidents occur.

DR Testing Pipeline Architecture:

yaml

# Azure DevOps pipeline — scheduled DR test execution
trigger: none
schedules:
  - cron: "0 2 1 * *"        # Monthly, 02:00 UTC on the 1st of the month
    displayName: 'Monthly DR Test'
    branches:
      include: [main]
    always: true             # Run even when the branch has no new commits

stages:
  - stage: PreTestValidation
    jobs:
      - job: ValidateReplicationHealth
        steps:
          - script: |
              set -euo pipefail   # Fail the step if the az call itself fails

              # List names of protected VMs whose ASR replication health is not Normal
              az site-recovery replicated-item list \
                --resource-group "$RECOVERY_RG" \
                --vault-name "$RECOVERY_VAULT" \
                --query "[?properties.providerSpecificDetails.replicationHealth!='Normal'].name" \
                --output tsv > unhealthy_vms.txt

              # Fail pipeline if any VMs are not in Normal replication state.
              # tsv output is empty when nothing matches; JSON output would be "[]",
              # which makes a file-size test always succeed.
              if [ -s unhealthy_vms.txt ]; then
                echo "ERROR: VMs not in healthy replication state"
                cat unhealthy_vms.txt
                exit 1
              fi
            displayName: 'Validate ASR Replication Health'

  - stage: TestFailoverExecution
    dependsOn: PreTestValidation
    jobs:
      - job: ExecuteTestFailover
        steps:
          - script: |
              # Execute test failover on an isolated network (no production impact)
              az site-recovery replication-recovery-plan \
                test-failover \
                --resource-group "$RECOVERY_RG" \
                --vault-name "$RECOVERY_VAULT" \
                --name "order-processing-recovery" \
                --recovery-point-type "Latest" \
                --network-type "VmNetwork" \
                --network "/subscriptions/.../test-failover-vnet"
            displayName: 'Execute Test Failover (Isolated Network)'

          - script: |
              # Wait for test failover completion
              python scripts/wait_for_failover_completion.py \
                --plan-name "order-processing-recovery" \
                --timeout-minutes 30
            displayName: 'Wait for Test Failover Completion'

  - stage: RecoveryValidation
    dependsOn: TestFailoverExecution
    jobs:
      - job: ValidateRecoveredWorkloads
        steps:
          - script: |
              # Validate application health endpoint on recovered VMs
              python scripts/validate_application_health.py \
                --environment test-failover \
                --expected-status 200 \
                --timeout-seconds 300
            displayName: 'Validate Application Health Endpoints'

          - script: |
              # Measure actual RTO from failover initiation to health confirmation
              python scripts/calculate_actual_rto.py \
                --plan-name "order-processing-recovery" \
                --target-rto-minutes 120
            displayName: 'Calculate and Validate Actual RTO'

          - script: |
              # Validate RPO — check recovery point timestamp vs test execution time
              python scripts/validate_rpo_compliance.py \
                --plan-name "order-processing-recovery" \
                --target-rpo-minutes 15
            displayName: 'Validate RPO Compliance'

  - stage: TestCleanup
    dependsOn: RecoveryValidation
    condition: always()   # Always cleanup — even if validation fails
    jobs:
      - job: CleanupTestFailover
        steps:
          - script: |
              az site-recovery replication-recovery-plan \
                cleanup-test-failover \
                --resource-group "$RECOVERY_RG" \
                --vault-name "$RECOVERY_VAULT" \
                --name "order-processing-recovery"
            displayName: 'Cleanup Test Failover Environment'

  - stage: ReportResults
    dependsOn: TestCleanup
    condition: always()   # Publish the report for failed tests as well
    jobs:
      - job: PublishDRTestReport
        steps:
          - script: |
              python scripts/generate_dr_test_report.py \
                --output-format pdf \
                --include-rto-compliance \
                --include-rpo-compliance \
                --send-to governance-team@company.com
            displayName: 'Generate and Distribute DR Test Report'

DR Test Isolation — No Production Impact: Test failovers execute in an isolated network (test-failover-vnet) with no connectivity to production systems or external networks — recovered VMs start in a sandbox environment where application health can be validated without any risk of split-brain scenarios or production traffic routing to test-recovered VMs.

DR Test Result Tracking:

Test Metric	Target	Measured	Pass/Fail
Actual RTO achieved	≤ 2 hours	Measured from failover trigger to health confirmation	Pass if ≤ target
Actual RPO at recovery	≤ 15 minutes	Recovery point timestamp vs test execution time	Pass if ≤ target
Application health validation	100% endpoints healthy	HTTP health check success rate	Pass if 100%
Test failover completion	≤ 30 minutes	ASR test failover execution duration	Pass if ≤ target

Test results are published to the DR compliance dashboard and stored in Log Analytics — providing a continuous record of recovery capability validation for regulatory audit evidence.
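
The pass/fail rules in the table can be expressed directly in code. The sketch below is one possible shape for that evaluation; the `DrTestRun` fields and the `evaluate` function are hypothetical and are not taken from the pipeline's helper scripts, whose contents are not shown in this case study.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrTestRun:
    failover_triggered: datetime   # test failover started
    failover_completed: datetime   # ASR reported the test failover as complete
    health_confirmed: datetime     # all application health checks passed
    recovery_point: datetime       # timestamp of the recovery point used
    endpoints_checked: int
    endpoints_healthy: int


def evaluate(
    run: DrTestRun,
    target_rto: timedelta = timedelta(hours=2),
    target_rpo: timedelta = timedelta(minutes=15),
    target_failover: timedelta = timedelta(minutes=30),
) -> dict[str, bool]:
    """Apply the four pass/fail rules from the DR test result table."""
    return {
        "rto": run.health_confirmed - run.failover_triggered <= target_rto,
        "rpo": run.failover_triggered - run.recovery_point <= target_rpo,
        "health": run.endpoints_checked > 0
        and run.endpoints_healthy == run.endpoints_checked,
        "failover_duration": run.failover_completed - run.failover_triggered
        <= target_failover,
    }
```

The default targets are the Tier 1 values; other tiers would pass their own targets. A run with no endpoints checked is treated as a health failure so that a skipped validation cannot count as a pass.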

5. Monitoring & Observability Layer

Centralised monitoring provides operational visibility across replication health, backup status, DR test results, and RTO/RPO compliance tracking.

Azure Monitor — Replication Health Alerting:

ASR replication health alerts — notification when any protected VM deviates from Normal replication state
RPO breach alerts — notification when replication lag approaches or exceeds defined RPO thresholds
Backup job failure alerts — immediate notification of backup job failures before next scheduled backup window
Recovery vault health alerts — notification of vault configuration changes or immutability violations
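
The RPO breach alert distinguishes lag that approaches the threshold from lag that exceeds it. A minimal sketch of that classification follows; the 80% warning fraction is an assumed value, since the case study does not define when lag counts as approaching the threshold.

```python
from datetime import timedelta


def classify_rpo_lag(
    lag: timedelta, threshold: timedelta, warn_fraction: float = 0.8
) -> str:
    """Classify replication lag as 'ok', 'warning', or 'breach' against an RPO threshold."""
    if lag > threshold:
        return "breach"
    # warn_fraction is a hypothetical tuning knob, not an Azure Monitor setting.
    if lag >= threshold * warn_fraction:
        return "warning"
    return "ok"
```

Against the Tier 1 threshold of 15 minutes, a 5-minute lag is ok, a 13-minute lag raises a warning, and a 16-minute lag is a breach.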

Azure Log Analytics — DR Operational Analytics:

ASR replication event logs — failover executions, replication state changes, and recovery plan operations
Azure Backup job completion logs — backup success, failure, and retention compliance tracking
DR test pipeline execution results — actual RTO and RPO achieved per test, trend over time
Immutable vault audit logs — any access or modification attempt against protected backup vaults

Power BI — RTO/RPO Compliance Dashboards:

Dashboard	Audience	Content
DR Readiness Executive Summary	CISO / CTO	Overall DR readiness score, last test date, RTO/RPO compliance rate
Replication Health Dashboard	IT Operations	Per-VM replication health, RPO lag, consistency group status
DR Test History	Governance / Audit	Historical test results, RTO/RPO trend, pass/fail per test
Backup Compliance Report	Compliance Team	Backup coverage, retention compliance, vault integrity status
Recovery Time Performance	IT Management	Actual vs target RTO by workload tier, trend analysis

DR Readiness Score Methodology:

DR Readiness Score = 
  (Replication Health Weight × Replication Score) +
  (RTO Compliance Weight × RTO Test Score) +
  (RPO Compliance Weight × RPO Test Score) +
  (Backup Coverage Weight × Backup Score)

Where:
  Replication Score = % of protected VMs in Normal replication state
  RTO Test Score = % of recent DR tests achieving RTO target
  RPO Test Score = % of recent DR tests achieving RPO target
  Backup Score = % of required workloads with compliant backup coverage

Weights: Replication 30%, RTO 25%, RPO 25%, Backup 20%
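
As a worked example, the weighted sum can be computed as follows. The weights come from the methodology above; the function name and the sample component scores are hypothetical and purely illustrative.

```python
# Weights from the methodology: Replication 30%, RTO 25%, RPO 25%, Backup 20%.
WEIGHTS = {"replication": 0.30, "rto": 0.25, "rpo": 0.25, "backup": 0.20}


def dr_readiness_score(scores: dict) -> float:
    """Weighted sum of component scores, each expressed as a percentage (0-100)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError(f"expected exactly these components: {sorted(WEIGHTS)}")
    for name, value in scores.items():
        if not 0 <= value <= 100:
            raise ValueError(f"{name} score must be between 0 and 100")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Illustrative inputs: all VMs replicating normally, 3 of 4 recent tests
# met the RTO target, all met the RPO target, 90% backup coverage.
example = {"replication": 100.0, "rto": 75.0, "rpo": 100.0, "backup": 90.0}
print(round(dr_readiness_score(example), 2))
```

With these inputs the score is 30 + 18.75 + 25 + 18 = 91.75.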

6. Governance & Compliance Layer

Azure Policy — DR Compliance Enforcement:

Deny unprotected VM deployment in production resource groups — VMs must be enrolled in ASR replication
Audit backup coverage — alert on VMs without Azure Backup policy assignment
Require Recovery Services Vault immutability for production vaults
Enforce approved recovery regions — replication target must be the designated DR region
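
The intent of these four rules can be sketched as a check against a VM inventory record. This is not an Azure Policy definition (those are authored in JSON); it is an illustration of the logic only, and the `VmRecord` fields, the `policy_findings` function, and the `centralus` region constant are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

DR_REGION = "centralus"  # assumed identifier for the designated DR region


@dataclass
class VmRecord:
    """Hypothetical inventory record for one VM."""
    name: str
    is_production: bool
    asr_enabled: bool
    asr_target_region: Optional[str]
    backup_policy: Optional[str]
    vault_immutable: bool


def policy_findings(vm: VmRecord) -> List[str]:
    """Return the DR policy violations for a VM (empty list means compliant)."""
    findings: List[str] = []
    if not vm.is_production:
        return findings  # the rules in this sketch target production only
    if not vm.asr_enabled:
        findings.append("deny: VM is not enrolled in ASR replication")
    elif vm.asr_target_region != DR_REGION:
        findings.append("deny: replication target is not the designated DR region")
    if vm.backup_policy is None:
        findings.append("audit: no Azure Backup policy assigned")
    elif not vm.vault_immutable:
        findings.append("deny: backup vault does not have immutability enabled")
    return findings


compliant = VmRecord("vm-app1", True, True, "centralus", "daily-30d", True)
unprotected = VmRecord("vm-new", True, False, None, None, False)
print(policy_findings(compliant))
print(policy_findings(unprotected))
```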

Microsoft Defender for Cloud — Security Posture in DR Context:

Security recommendations for DR-related configurations — unprotected VMs, unencrypted backup vaults
Regulatory compliance assessment against resilience-related framework controls
Threat protection on replicated VMs — security monitoring continues in secondary region

Terraform — Infrastructure Governance: All primary and secondary region infrastructure managed through Terraform — consistent deployment, version-controlled configuration, and auditable change history for both production and DR environments.

Technologies Used

Category	Technologies
Disaster Recovery	Azure Site Recovery (ASR)
Backup & Retention	Azure Backup, Immutable Recovery Services Vault
DR Testing	Azure DevOps YAML Pipelines, Python validation scripts
Infrastructure as Code	Terraform
Cloud Platform	Azure VMs (Windows Server 2022, RHEL), Azure VNets, NSGs
Administrative Access	Jumpbox VMs (interim — Bastion planned)
Monitoring	Azure Monitor, Log Analytics
Reporting	Power BI, Azure Workbooks
Governance	Azure Policy, Microsoft Defender for Cloud
Compliance Frameworks	ISO 22301 (Business Continuity), NIST SP 800-34, PCI DSS v4.0

Key Challenges Addressed

Ensuring reliable cross-region replication without data inconsistency: addressed through multi-VM consistency groups, which replicate related VMs to a shared recovery point. This prevents application-inconsistent recoveries in which the web, application, and database tiers recover to different points in time.

Validating RTO/RPO targets under realistic operational conditions: addressed through an automated test failover pipeline that measures actual failover execution time and the recovery point timestamp, providing empirical RTO/RPO validation data rather than theoretical estimates.

Automating DR testing without production impact: addressed through test failover execution in isolated networks with no production connectivity. Recovered VMs run in a sandbox environment where application health is validated without any risk of routing production traffic.

Maintaining secure access during failover scenarios: addressed through preconfigured administrative infrastructure in the secondary region. Jumpbox VMs and NSG configurations are deployed there before incidents occur, so administrative access is available immediately after failover.

Protecting backups against ransomware and destructive operations: addressed through immutable vault configuration that prevents backup deletion or retention period modification. This complements ASR replication, which on its own would carry ransomware encryption over to the secondary region.

Providing measurable RTO/RPO compliance evidence: addressed through automated DR test result collection, Power BI compliance dashboards, and Log Analytics trend storage, producing auditable, continuously updated recovery performance evidence for regulatory review.

Design Decisions & Rationale

Active-Passive over Active-Active DR Model : Active-active multi-region deployment provides near-zero RTO but at significantly higher infrastructure cost, because full production capacity runs in two regions simultaneously. Active-passive provides an acceptable RTO (2 hours for Tier 1) at significantly lower cost, because secondary region compute resources do not run until failover is activated. For most enterprise workloads where a 2-hour RTO is acceptable, active-passive provides the appropriate cost-to-resilience balance.

Separation of ASR Replication and Azure Backup : ASR replication is optimised for availability — fast RTO through continuous replication and orchestrated failover. However, ASR replication faithfully replicates data corruption and ransomware encryption to the secondary region — it provides no protection against data integrity failures. Azure Backup provides independent, immutable data protection covering corruption, accidental deletion, and long-term compliance retention that ASR cannot serve. The two mechanisms address different failure scenarios and must coexist.

Automated DR Testing over Annual Manual Tests : Annual manual DR tests are expensive, disruptive, and infrequent — a recovery procedure that worked in January may have drifted by October due to infrastructure changes, application updates, or network reconfigurations. Monthly automated test failover pipelines detect procedure drift continuously — the cost of a failed automated test is minimal; the cost of a failed actual recovery is catastrophic.

Preconfigured Secondary Region Infrastructure : Cold-start secondary region infrastructure provisioning during an actual incident extends RTO beyond acceptable targets — Terraform deployment of network infrastructure typically takes 15-30 minutes before ASR failover can begin. Preconfiguring secondary region network infrastructure, NSGs, and load balancers before incidents means failover can begin immediately — compute resources start through ASR failover while network infrastructure is already operational.

Immutable Vault for Production Backup Protection : Standard Recovery Services Vaults permit backup deletion and retention period modification by administrators, so a ransomware actor with sufficient Azure access can delete backup copies before triggering encryption. An immutable vault with the immutability setting locked blocks operations that would remove recovery points or shorten retention, regardless of administrative privilege level, providing tamper-resistant backup protection. The operational constraint (retention periods cannot be shortened and the lock cannot be reversed once applied) is an acceptable trade-off for the protection it provides.

Azure-Native Services over Third-Party DR Platforms : Third-party DR platforms introduce additional licensing cost, operational tooling complexity, and Azure integration overhead. Azure Site Recovery and Azure Backup provide native integration with Azure VMs, Azure networking, Azure Policy governance, and Azure Monitor — reducing operational complexity while maintaining enterprise-grade DR capability appropriate for most workload categories.

Trade-offs & Design Constraints

Active-Passive RTO Dependency on Secondary Region Readiness : The 2-hour RTO target depends on secondary region network infrastructure being preconfigured before incidents. If Terraform-managed secondary region infrastructure is not maintained in sync with primary region changes — new subnets added in primary not replicated to secondary, NSG rules updated in primary not applied to secondary — failover may encounter infrastructure mismatches extending actual RTO beyond the 2-hour target. Infrastructure drift detection between primary and secondary regions should be monitored through scheduled Terraform plan runs comparing state.
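
A scheduled drift check of this kind could be sketched as below. It relies on the `-detailed-exitcode` flag of `terraform plan`, which returns 0 when there are no changes, 1 on error, and 2 when changes are present. Note that a plan compares the secondary region's configuration with its deployed state; it does not by itself compare the primary and secondary regions. The function names and the `./secondary-region` path are hypothetical.

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`.
DRIFT_STATUS = {0: "in_sync", 1: "error", 2: "drift_detected"}


def interpret_plan_exit_code(code: int) -> str:
    """Map a terraform plan exit code to a drift status."""
    return DRIFT_STATUS.get(code, "error")


def check_drift(working_dir: str) -> str:
    """Run a read-only plan in working_dir and report whether drift exists."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=working_dir,
        capture_output=True,
        text=True,
        check=False,  # exit code 2 is an expected outcome, not a failure
    )
    return interpret_plan_exit_code(result.returncode)


if __name__ == "__main__":
    # Hypothetical path to the secondary region's Terraform root module.
    print(check_drift("./secondary-region"))
```

Running this on a schedule and alerting on `drift_detected` surfaces mismatches before a failover depends on them.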

ASR Replication Faithfully Replicates Corruption : ASR continuous replication does not distinguish between healthy writes and ransomware encryption writes — it replicates all changes to the secondary region. If ransomware encrypts files in the primary region, the encrypted versions are replicated to secondary within the RPO window. Recovery from ransomware scenarios requires Azure Backup restore from a pre-infection recovery point — not ASR failover to the secondary region. The architecture must clearly document which failure scenarios are addressed by ASR (regional outage) versus Azure Backup (data integrity failure, ransomware).
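
One lightweight way to document this split is a lookup from failure scenario to recovery path, which runbooks and tooling can share. The sketch below is illustrative; the enum members and the wording of each recovery path are assumptions derived from the scenarios named in this document.

```python
from enum import Enum


class Scenario(Enum):
    REGIONAL_OUTAGE = "regional outage"
    RANSOMWARE = "ransomware"
    DATA_CORRUPTION = "data corruption"
    ACCIDENTAL_DELETION = "accidental deletion"


# ASR replicates every write, so it only helps when the data itself is intact.
RECOVERY_PATH = {
    Scenario.REGIONAL_OUTAGE: "ASR failover to the secondary region",
    Scenario.RANSOMWARE: "Azure Backup restore from a pre-infection recovery point",
    Scenario.DATA_CORRUPTION: "Azure Backup restore from a pre-corruption recovery point",
    Scenario.ACCIDENTAL_DELETION: "Azure Backup restore",
}


def recovery_path(scenario: Scenario) -> str:
    """Return the documented recovery mechanism for a failure scenario."""
    return RECOVERY_PATH[scenario]


for scenario in Scenario:
    print(f"{scenario.value}: {recovery_path(scenario)}")
```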

Test Failover Isolated Network Validation Limitations : Test failovers execute in isolated networks — application health endpoints are validated against the isolated test environment, not against actual production dependencies (external APIs, on-premises systems, DNS resolution). Validation scripts must account for these isolation boundaries — testing that the application starts and responds to health checks in isolation, not that it can process live production transactions. Full end-to-end production traffic validation requires planned failover (actual failover with production traffic) rather than test failover.

Recovery Plan Maintenance Overhead : As application workloads evolve — new VMs added, services decomposed, dependencies changed — recovery plans must be updated to reflect current architecture. Stale recovery plans that do not match current workload topology cause failover failures or incorrect recovery sequences. Recovery plan definitions should be managed through Terraform with mandatory update procedures triggered by infrastructure change events.

Multi-VM Consistency Group Performance Impact : ASR multi-VM consistency groups generate application-consistent snapshots across all VMs in the group simultaneously — requiring VSS quiescence for Windows VMs. At high frequency (every hour for Tier 1), this quiescence can briefly impact application performance during snapshot operations. Consistency group snapshot frequency must be balanced against performance impact — for latency-sensitive applications, less frequent consistency snapshots (every 4 hours) with crash-consistent replication (every 5 minutes) may be the appropriate trade-off.

Projected Outcomes

The architecture is designed to deliver the following resilience and governance outcomes in a production enterprise environment:

Measurable RTO compliance through automated DR test pipeline validation — empirical recovery time measurement replacing theoretical RTO estimates
Continuous RPO enforcement through ASR replication health monitoring with threshold-based alerting before RPO objectives are breached
Near real-time cross-region replication for Tier 1 workloads maintaining 15-minute RPO under normal operating conditions
Automated monthly DR testing providing continuous recovery procedure validation — detecting drift before actual incidents require recovery
Immutable backup protection preventing ransomware-driven backup deletion independent of ASR replication integrity
Executive DR readiness dashboards providing governance-ready evidence of recovery capability for regulatory audit responses
Preconfigured secondary region infrastructure enabling immediate failover initiation without cold-start provisioning delays
Auditable DR test history stored in Log Analytics — continuous compliance evidence record for regulatory frameworks requiring demonstrable recovery capability

Future Evolution

Multi-region active-active recovery models for highest-criticality Tier 0 workloads where 2-hour RTO is not acceptable
AI-assisted failover optimisation through Azure Monitor intelligent alerting predicting replication degradation before RPO breach
Automated chaos engineering through Azure Chaos Studio validating application resilience under component failure scenarios beyond regional outage
Self-healing infrastructure remediation detecting and correcting secondary region infrastructure drift automatically
Cross-cloud disaster recovery federation extending BCDR coverage to workloads in AWS or GCP through multi-cloud ASR equivalent tooling
Continuous compliance validation automation through Azure Policy initiative tracking DR coverage requirements across all production workloads
Advanced ransomware recovery orchestration through dedicated cyber recovery vault with airgapped isolation
Integrated cyber recovery vault architecture providing isolated recovery environment for incidents where primary and secondary regions are simultaneously compromised

Key Takeaways

Disaster recovery requires continuous validation, not passive replication monitoring — replication health metrics confirm data is moving but do not validate that recovered workloads will start correctly and serve traffic within RTO objectives
Automated monthly DR testing is the most impactful operational maturity improvement for enterprise BCDR — annual manual tests are too infrequent to catch recovery procedure drift
ASR and Azure Backup are complementary, not redundant: ASR addresses regional outage recovery (RTO), while Azure Backup addresses data integrity protection and compliance retention (RPO and regulatory requirements)
Preconfiguring secondary region infrastructure before incidents is essential for achieving aggressive RTO targets — cold-start infrastructure provisioning during actual incidents extends recovery time unpredictably
ASR faithfully replicates ransomware encryption — regional replication does not protect against data integrity failures; immutable Azure Backup vaults are the correct protection for cyber recovery scenarios
Active-passive architecture provides the appropriate cost-to-resilience balance for most enterprise workloads, while active-active provides near-zero RTO at a significantly higher infrastructure cost that is justified only for the highest-criticality workloads
Recovery plan maintenance must be treated as an ongoing operational requirement — stale plans not reflecting current workload topology are a primary cause of DR test failures

Executive Summary

Business Drivers

This architecture was designed to address the BCDR requirements of organisations where existing approaches result in:

Manual and error-prone failover procedures — recovery steps documented in runbooks but untested under realistic conditions
Unverified DR readiness — replication health is monitored but actual recovery capability is assumed rather than validated
Inconsistent RTO/RPO enforcement — recovery objectives defined in policy but not measurably tracked against actual replication and recovery performance
Absence of automated DR testing — annual manual tests are disruptive, infrequent, and fail to detect recovery procedure drift between tests
Limited visibility into recovery health — no executive dashboard demonstrating DR readiness for governance and regulatory audit purposes
Compliance pressure from regulated industries — financial services, healthcare, and critical infrastructure regulators increasingly require demonstrable and tested recovery capabilities

Operational Constraints

The architecture was designed to operate within the following constraints typical of enterprise multi-region DR environments:

Cross-region replication must maintain RPO compliance continuously — replication lag must not exceed defined RPO thresholds without alerting
DR testing must not impact production environments — test failovers must execute in isolated environments without disrupting primary region workloads
Administrative access must remain secure and functional during DR scenarios — secondary region must have equivalent administrative access capability
Compliance reporting must demonstrate measurable RTO/RPO performance — audit evidence requires tracked metrics, not claimed objectives
Backup systems must provide ransomware resilience — immutable vault protection preventing backup deletion or modification
Recovery automation must be scalable across workload count — manual orchestration does not scale as workload estate grows
Secondary region infrastructure must be preconfigured — cold-start infrastructure provisioning during an actual incident extends RTO beyond acceptable targets

Recovery Objectives

Workload Tier	Target RTO	Target RPO	Replication Mechanism	Test Frequency
Tier 1 — Mission Critical	2 hours	15 minutes	ASR continuous replication	Monthly
Tier 2 — Business Important	4 hours	1 hour	ASR continuous replication	Quarterly
Tier 3 — Standard Operations	8 hours	4 hours	ASR + Azure Backup	Semi-annually
Database Tier	1 hour	5 minutes	ASR + SQL geo-replication	Monthly

These recovery objectives represent design targets. Production RTO/RPO commitments require validation through load-tested recovery plan execution under realistic infrastructure conditions.
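
The table above can also be held as data so that tooling can flag overdue tests. This is a sketch under assumptions: the interval lengths in days approximate the stated frequencies (the Tier 3 frequency is read as twice per year), and the `TierObjective` type and `is_test_overdue` function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class TierObjective:
    rto: timedelta
    rpo: timedelta
    test_interval_days: int  # approximate length of the test cycle


TIERS = {
    "tier1": TierObjective(timedelta(hours=2), timedelta(minutes=15), 31),
    "tier2": TierObjective(timedelta(hours=4), timedelta(hours=1), 92),
    "tier3": TierObjective(timedelta(hours=8), timedelta(hours=4), 183),
    "database": TierObjective(timedelta(hours=1), timedelta(minutes=5), 31),
}


def is_test_overdue(tier: str, last_test: date, today: date) -> bool:
    """True when more than one test interval has passed since the last DR test."""
    return (today - last_test).days > TIERS[tier].test_interval_days


last = date(2024, 1, 1)
now = date(2024, 2, 15)  # 45 days later
print(is_test_overdue("tier1", last, now), is_test_overdue("tier2", last, now))
```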

Architecture Principles

Recovery readiness by design — recovery capability must be continuously validated, not assumed from replication health metrics
Automated failover orchestration — recovery plans execute through predefined automation rather than manual runbook steps
Separation of replication and backup functions — ASR handles availability recovery (RTO), Azure Backup handles data protection and long-term retention (RPO and compliance)
Immutable data protection — backup vaults configured with immutability and soft-delete preventing ransomware-driven backup deletion
Continuous DR validation — automated test failover pipelines executing on defined schedules detecting recovery procedure drift before incidents occur
Secure DR operations — secondary region maintains equivalent security controls and administrative access to primary region
Infrastructure automation and repeatability — secondary region infrastructure preconfigured through Terraform ensuring consistent recovery environment without cold-start provisioning delays
Centralised RTO/RPO observability — replication health, recovery performance, and test results tracked in unified compliance dashboards

Architecture Overview

1. Primary Production Region — East US

The primary region hosts production workloads with full security controls — serving as the operational baseline from which ASR replication targets the secondary region.

Workload Configuration:

Windows Server 2022 and RHEL virtual machines hosting application workloads
Private-only networking — no public IP addresses on workload VMs
NSG-enforced least-privilege ingress and egress traffic controls
ASR Mobility Service agent installed on all protected VMs — enabling continuous replication to secondary region

Secure Administrative Access:

Jumpbox VM in dedicated management subnet providing administrative access without public VM exposure
NSG on management subnet permitting inbound RDP/SSH from authorised administrative IP ranges only
Future evolution: Azure Bastion replacement eliminating jumpbox VM management overhead

ASR Replication Policy Configuration:

Parameter	Tier 1	Tier 2	Tier 3
RPO threshold alert	15 minutes	1 hour	4 hours
App-consistent snapshot	Every 1 hour	Every 4 hours	Every 6 hours
Crash-consistent snapshot	Every 5 minutes	Every 5 minutes	Every 5 minutes
Recovery point retention	72 hours	24 hours	15 days

2. Secondary Disaster Recovery Region — Central US

The secondary region serves as the preconfigured DR target — infrastructure deployed and validated before incidents occur, enabling rapid workload activation during failover.

Azure Site Recovery — Failover Architecture:

Primary Region (East US)           Secondary Region (Central US)
─────────────────────────          ──────────────────────────────
production-vnet (10.0.0.0/16)  →  recovery-vnet (10.1.0.0/16)
  workload-subnet                    recovery-subnet
  management-subnet                  recovery-management-subnet
  NSG-production                     NSG-recovery (pre-deployed)
  Load Balancer (active)             Load Balancer (standby)
  VM-web-01 (running)                VM-web-01 (replicated — not running)
  VM-app-01 (running)                VM-app-01 (replicated — not running)
  VM-db-01 (running)                 VM-db-01 (replicated — not running)

Recovery Plan Structure:

Recovery Plan: order-processing-recovery

Group 1 (execute first):
  - Script: validate-recovery-network-connectivity
  - Script: start-recovery-database-services

Group 2 (execute after Group 1 completes):
  - Failover: VM-db-01 (database tier)
  - Wait: 5 minutes (database startup validation)

Group 3 (execute after Group 2 completes):
  - Failover: VM-app-01 (application tier)
  - Wait: 3 minutes (application startup validation)

Group 4 (execute after Group 3 completes):
  - Failover: VM-web-01 (web tier)
  - Script: validate-application-health-endpoint
  - Script: update-dns-records-to-recovery-region
  - Script: notify-operations-team-failover-complete
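
The ordering rule in this plan (each group starts only after the previous one completes) can be sketched as a small runner. ASR orchestrates recovery plans itself, so this is only an illustration of the sequencing and stop-on-failure logic; the `run_recovery_plan` function and the placeholder step actions are hypothetical, and the wait steps are omitted.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_recovery_plan(groups: List[List[Step]]) -> Tuple[bool, List[str]]:
    """Run groups in order; stop at the first failed step."""
    log: List[str] = []
    for index, group in enumerate(groups, start=1):
        for name, action in group:
            ok = action()
            log.append(f"group {index}: {name}: {'ok' if ok else 'FAILED'}")
            if not ok:
                return False, log  # later groups depend on this one
    return True, log


def succeed() -> bool:
    """Placeholder for a real script or failover action."""
    return True


plan = [
    [("validate-recovery-network-connectivity", succeed),
     ("start-recovery-database-services", succeed)],
    [("failover VM-db-01", succeed)],
    [("failover VM-app-01", succeed)],
    [("failover VM-web-01", succeed),
     ("validate-application-health-endpoint", succeed),
     ("update-dns-records-to-recovery-region", succeed),
     ("notify-operations-team-failover-complete", succeed)],
]
success, log = run_recovery_plan(plan)
print(success)
print("\n".join(log))
```

Stopping at the first failure keeps the web tier from starting against a database tier that never came up.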

3. Backup & Retention Layer

ASR vs Azure Backup — Complementary Functions:

Capability	Azure Site Recovery	Azure Backup
Primary purpose	Availability — fast RTO	Data protection — RPO and compliance
Recovery granularity	Full VM failover	File, folder, VM, SQL point-in-time
Retention window	72 hours (configurable)	Years — compliance retention
Ransomware protection	Limited — replicates deletions	Immutable vault — tamper-proof
Use case	Regional outage recovery	Data corruption, accidental deletion, compliance

Azure Backup Configuration:

Recovery Services Vault with immutability enabled and locked, preventing deletion of backup data and reduction of retention periods
Soft delete with 14-day retention window providing secondary protection against accidental deletion
VM backup policy: daily backups with 30-day retention for operational recovery, weekly backups retained 52 weeks for compliance
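SQL Server backup: full weekly, differential daily, transaction log every 15 minutes, supporting a 15-minute RPO for backup-based database restores (the 5-minute Database Tier target relies on replication rather than backup)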

4. Automation & DR Testing Layer

The automated DR testing pipeline is the primary differentiator of this platform — continuous recovery validation detecting procedure drift before actual incidents occur.

DR Testing Pipeline Architecture:

yaml

# Azure DevOps pipeline — scheduled DR test execution
trigger: none
schedules:
  - cron: "0 2 1 * *"        # Monthly: 02:00 UTC on the 1st of the month
    displayName: 'Monthly DR Test'
    branches:
      include: [main]
    always: true             # Run even when main has no new commits since the last run

stages:
  - stage: PreTestValidation
    jobs:
      - job: ValidateReplicationHealth
        steps:
          - script: |
              set -euo pipefail
              # Count protected VMs whose ASR replication health is not Normal.
              # An empty query result is still written as "[]", so a file-size
              # test (-s) would always fail; compare the count instead.
              unhealthy_count=$(az site-recovery replicated-item list \
                --resource-group "$RECOVERY_RG" \
                --vault-name "$RECOVERY_VAULT" \
                --query "length([?properties.replicationHealth!='Normal'])" \
                --output tsv)

              # Fail pipeline if any VMs are not in Normal replication state
              if [ "$unhealthy_count" -gt 0 ]; then
                echo "ERROR: $unhealthy_count VM(s) not in healthy replication state"
                exit 1
              fi
            displayName: 'Validate ASR Replication Health'

  - stage: TestFailoverExecution
    dependsOn: PreTestValidation
    jobs:
      - job: ExecuteTestFailover
        steps:
          - script: |
              # Execute test failover in an isolated network (no production impact)
              az site-recovery replication-recovery-plan \
                test-failover \
                --resource-group "$RECOVERY_RG" \
                --vault-name "$RECOVERY_VAULT" \
                --name "order-processing-recovery" \
                --recovery-point-type "Latest" \
                --network-type "VmNetwork" \
                --network "/subscriptions/.../test-failover-vnet"
            displayName: 'Execute Test Failover (Isolated Network)'

          - script: |
              # Wait for test failover completion
              python scripts/wait_for_failover_completion.py \
                --plan-name "order-processing-recovery" \
                --timeout-minutes 30
            displayName: 'Wait for Test Failover Completion'

  - stage: RecoveryValidation
    dependsOn: TestFailoverExecution
    jobs:
      - job: ValidateRecoveredWorkloads
        steps:
          - script: |
              # Validate application health endpoint on recovered VMs
              python scripts/validate_application_health.py \
                --environment test-failover \
                --expected-status 200 \
                --timeout-seconds 300
            displayName: 'Validate Application Health Endpoints'

          - script: |
              # Measure actual RTO from failover initiation to health confirmation
              python scripts/calculate_actual_rto.py \
                --plan-name "order-processing-recovery" \
                --target-rto-minutes 120
            displayName: 'Calculate and Validate Actual RTO'

          - script: |
              # Validate RPO — check recovery point timestamp vs test execution time
              python scripts/validate_rpo_compliance.py \
                --plan-name "order-processing-recovery" \
                --target-rpo-minutes 15
            displayName: 'Validate RPO Compliance'

  - stage: TestCleanup
    dependsOn: RecoveryValidation
    condition: always()   # Always cleanup — even if validation fails
    jobs:
      - job: CleanupTestFailover
        steps:
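          - script: |
              # Remove the test failover VMs created in the isolated network
              az site-recovery replication-recovery-plan \
                cleanup-test-failover \
                --resource-group "$RECOVERY_RG" \
                --vault-name "$RECOVERY_VAULT" \
                --name "order-processing-recovery"
            displayName: 'Cleanup Test Failover Environment'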

  - stage: ReportResults
    dependsOn: TestCleanup
    condition: always()   # Publish results for failed tests too
    jobs:
      - job: PublishDRTestReport
        steps:
          - script: |
              python scripts/generate_dr_test_report.py \
                --output-format pdf \
                --include-rto-compliance \
                --include-rpo-compliance \
                --send-to governance-team@company.com
            displayName: 'Generate and Distribute DR Test Report'
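
The pipeline calls `scripts/validate_application_health.py`, which is not shown in this document. A minimal sketch of what such a script might do is given below: poll each health endpoint until it returns the expected status or the timeout expires. Everything here is an assumption, including the function names, the polling interval, and the example URL; the `fetch`, `clock`, and `sleep` parameters exist so the logic can be exercised without network access.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered with an error status
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout


def wait_until_healthy(urls: Iterable[str],
                       expected_status: int = 200,
                       timeout_seconds: float = 300,
                       poll_seconds: float = 10,
                       fetch: Callable[[str], int] = http_status,
                       clock: Callable[[], float] = time.monotonic,
                       sleep: Callable[[float], None] = time.sleep) -> Dict[str, bool]:
    """Poll every URL until all return expected_status or the timeout expires."""
    deadline = clock() + timeout_seconds
    healthy = {url: False for url in urls}
    while True:
        for url in list(healthy):
            if not healthy[url]:
                healthy[url] = fetch(url) == expected_status
        if all(healthy.values()) or clock() >= deadline:
            return healthy
        sleep(poll_seconds)


if __name__ == "__main__":
    # Hypothetical endpoint inside the isolated test failover network.
    result = wait_until_healthy(["http://10.1.0.10/health"])
    raise SystemExit(0 if result and all(result.values()) else 1)
```

Returning a per-endpoint result, rather than a single flag, makes it possible to report which recovered workload failed its health check.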

DR Test Result Tracking:

Test Metric	Target	How Measured	Pass Criterion
Actual RTO achieved	≤ 2 hours	Measured from failover trigger to health confirmation	Pass if ≤ target
Actual RPO at recovery	≤ 15 minutes	Recovery point timestamp vs test execution time	Pass if ≤ target
Application health validation	100% endpoints healthy	HTTP health check success rate	Pass if 100%
Test failover completion	≤ 30 minutes	ASR test failover execution duration	Pass if ≤ target

Test results are published to the DR compliance dashboard and stored in Log Analytics — providing a continuous record of recovery capability validation for regulatory audit evidence.
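The pass/fail logic in the table above can be sketched in a few lines of Python. This is a minimal illustration, not the contents of the pipeline's validation scripts: the `DrTestRun` record, its field names, and the example timestamps are hypothetical, and RPO is taken here as the gap between the recovery point and the failover trigger.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrTestRun:
    """Timestamps and health counts from one test failover (hypothetical shape)."""
    failover_triggered_at: datetime
    failover_completed_at: datetime
    health_confirmed_at: datetime
    recovery_point_at: datetime
    endpoints_checked: int
    endpoints_healthy: int


# Targets taken from the DR test result tracking table.
RTO_TARGET = timedelta(hours=2)
RPO_TARGET = timedelta(minutes=15)
FAILOVER_TARGET = timedelta(minutes=30)


def evaluate_dr_test(run: DrTestRun) -> dict:
    """Return a pass/fail flag per tracked metric, plus an overall result."""
    actual_rto = run.health_confirmed_at - run.failover_triggered_at
    actual_rpo = run.failover_triggered_at - run.recovery_point_at
    failover_duration = run.failover_completed_at - run.failover_triggered_at
    results = {
        "rto": actual_rto <= RTO_TARGET,
        "rpo": actual_rpo <= RPO_TARGET,
        # 100% of endpoints must be healthy, and at least one must be checked.
        "health": run.endpoints_checked > 0
        and run.endpoints_healthy == run.endpoints_checked,
        "failover": failover_duration <= FAILOVER_TARGET,
    }
    results["overall"] = all(results.values())
    return results


# Illustrative values only, not measured results.
start = datetime(2024, 1, 1, 2, 0)
example = DrTestRun(
    failover_triggered_at=start,
    failover_completed_at=start + timedelta(minutes=22),
    health_confirmed_at=start + timedelta(minutes=95),
    recovery_point_at=start - timedelta(minutes=4),
    endpoints_checked=2,
    endpoints_healthy=2,
)
print(evaluate_dr_test(example))
```

Keeping each metric as a separate flag means the report can show which target was missed rather than a single failed test.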

5. Monitoring & Observability Layer

Centralised monitoring provides operational visibility across replication health, backup status, DR test results, and RTO/RPO compliance tracking.

Azure Monitor — Replication Health Alerting:

ASR replication health alerts — notification when any protected VM deviates from Normal replication state
RPO breach alerts — notification when replication lag approaches or exceeds defined RPO thresholds
Backup job failure alerts — immediate notification of backup job failures before next scheduled backup window
Recovery vault health alerts — notification of vault configuration changes or immutability violations
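As a rough sketch of the "approaches or exceeds" behaviour described for RPO breach alerts, the function below classifies a replication lag value against the 15-minute RPO. The 80% warning fraction and the function name are assumptions made for illustration; in the platform itself this logic would live in Azure Monitor alert rules rather than in Python.

```python
def classify_rpo_lag(lag_minutes: float,
                     target_rpo_minutes: float = 15,
                     warning_fraction: float = 0.8) -> str:
    """Classify replication lag as 'ok', 'warning', or 'breach'.

    warning_fraction is an assumed early-warning threshold: a lag at or
    above this share of the RPO target counts as approaching the limit.
    """
    if lag_minutes < 0:
        raise ValueError("lag_minutes must not be negative")
    if lag_minutes > target_rpo_minutes:
        return "breach"
    if lag_minutes >= warning_fraction * target_rpo_minutes:
        return "warning"
    return "ok"


for lag in (5, 13, 16):
    print(lag, classify_rpo_lag(lag))
```

Separating the warning band from the breach band gives operators time to investigate replication degradation before the objective is actually missed.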

Azure Log Analytics — DR Operational Analytics:

ASR replication event logs — failover executions, replication state changes, and recovery plan operations
Azure Backup job completion logs — backup success, failure, and retention compliance tracking
DR test pipeline execution results — actual RTO and RPO achieved per test, trend over time
Immutable vault audit logs — any access or modification attempt against protected backup vaults

Power BI — RTO/RPO Compliance Dashboards:

Dashboard	Audience	Content
DR Readiness Executive Summary	CISO / CTO	Overall DR readiness score, last test date, RTO/RPO compliance rate
Replication Health Dashboard	IT Operations	Per-VM replication health, RPO lag, consistency group status
DR Test History	Governance / Audit	Historical test results, RTO/RPO trend, pass/fail per test
Backup Compliance Report	Compliance Team	Backup coverage, retention compliance, vault integrity status
Recovery Time Performance	IT Management	Actual vs target RTO by workload tier, trend analysis

DR Readiness Score Methodology:

DR Readiness Score = 
  (Replication Health Weight × Replication Score) +
  (RTO Compliance Weight × RTO Test Score) +
  (RPO Compliance Weight × RPO Test Score) +
  (Backup Coverage Weight × Backup Score)

Where:
  Replication Score = % of protected VMs in Normal replication state
  RTO Test Score = % of recent DR tests achieving RTO target
  RPO Test Score = % of recent DR tests achieving RPO target
  Backup Score = % of required workloads with compliant backup coverage

Weights: Replication 30%, RTO 25%, RPO 25%, Backup 20%
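A short Python sketch of the weighted sum, assuming each component score is expressed as a percentage from 0 to 100. The function name, dictionary keys, and sample inputs are illustrative and not taken from the dashboard implementation.

```python
# Component weights from the methodology (they sum to 1.0).
WEIGHTS = {"replication": 0.30, "rto": 0.25, "rpo": 0.25, "backup": 0.20}


def dr_readiness_score(scores: dict) -> float:
    """Combine component scores (each 0-100) into one weighted readiness score."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing component scores: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= scores[name] <= 100:
            raise ValueError(f"{name} score must be between 0 and 100")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Illustrative inputs: all VMs replicating normally, 3 of 4 recent tests
# met RTO, all met RPO, and 90% of required workloads have compliant backup.
sample = {"replication": 100, "rto": 75, "rpo": 100, "backup": 90}
print(round(dr_readiness_score(sample), 2))
```

With these inputs the terms are 30 + 18.75 + 25 + 18, giving a score of 91.75 out of 100.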

6. Governance & Compliance Layer

Azure Policy — DR Compliance Enforcement:

Deny unprotected VM deployment in production resource groups — VMs must be enrolled in ASR replication
Audit backup coverage — alert on VMs without Azure Backup policy assignment
Require Recovery Services Vault immutability for production vaults
Enforce approved recovery regions — replication target must be the designated DR region
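The intent of the VM-level rules above can be illustrated with a small local check. This is only a sketch of the rule logic, not an Azure Policy definition: the inventory records, their field names, and the `daily-immutable` policy name are hypothetical, and the vault immutability rule is left out for brevity.

```python
# Designated DR region, per the East US to Central US replication design.
APPROVED_DR_REGION = "centralus"


def audit_vm(vm: dict) -> list:
    """Return the list of DR compliance findings for one VM record."""
    findings = []
    if not vm.get("asr_enabled", False):
        findings.append("not enrolled in ASR replication")
    elif vm.get("asr_target_region") != APPROVED_DR_REGION:
        findings.append("replicates to an unapproved region")
    if not vm.get("backup_policy"):
        findings.append("no Azure Backup policy assigned")
    return findings


# Hypothetical inventory records for illustration.
inventory = [
    {"name": "vm-app1", "asr_enabled": True,
     "asr_target_region": "centralus", "backup_policy": "daily-immutable"},
    {"name": "vm-app2", "asr_enabled": True,
     "asr_target_region": "westus", "backup_policy": None},
]

for vm in inventory:
    findings = audit_vm(vm)
    print(vm["name"], "compliant" if not findings else findings)
```

In the platform itself, the deny rule would block a non-compliant deployment outright, while the audit rules would surface findings like these without blocking.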

Microsoft Defender for Cloud — Security Posture in DR Context:

Security recommendations for DR-related configurations — unprotected VMs, unencrypted backup vaults
Regulatory compliance assessment against resilience-related framework controls
Threat protection on replicated VMs — security monitoring continues in secondary region

Technologies Used

Category	Technologies
Disaster Recovery	Azure Site Recovery (ASR)
Backup & Retention	Azure Backup, Immutable Recovery Services Vault
DR Testing	Azure DevOps YAML Pipelines, Python validation scripts
Infrastructure as Code	Terraform
Cloud Platform	Azure VMs (Windows Server 2022, RHEL), Azure VNets, NSGs
Administrative Access	Jumpbox VMs (interim — Bastion planned)
Monitoring	Azure Monitor, Log Analytics
Reporting	Power BI, Azure Workbooks
Governance	Azure Policy, Microsoft Defender for Cloud
Compliance Frameworks	ISO 22301 (Business Continuity), NIST SP 800-34, PCI DSS v4.0

Key Challenges Addressed

Design Decisions & Rationale

Trade-offs & Design Constraints

Projected Outcomes

The architecture is designed to deliver the following resilience and governance outcomes in a production enterprise environment:

Measurable RTO compliance through automated DR test pipeline validation — empirical recovery time measurement replacing theoretical RTO estimates
Continuous RPO enforcement through ASR replication health monitoring with threshold-based alerting before RPO objectives are breached
Near real-time cross-region replication for Tier 1 workloads maintaining 15-minute RPO under normal operating conditions
Automated monthly DR testing providing continuous recovery procedure validation — detecting drift before actual incidents require recovery
Immutable backup protection preventing ransomware-driven backup deletion independent of ASR replication integrity
Executive DR readiness dashboards providing governance-ready evidence of recovery capability for regulatory audit responses
Preconfigured secondary region infrastructure enabling immediate failover initiation without cold-start provisioning delays
Auditable DR test history stored in Log Analytics — continuous compliance evidence record for regulatory frameworks requiring demonstrable recovery capability

Future Evolution

Multi-region active-active recovery models for highest-criticality Tier 0 workloads where 2-hour RTO is not acceptable
AI-assisted failover optimisation through Azure Monitor intelligent alerting predicting replication degradation before RPO breach
Automated chaos engineering through Azure Chaos Studio validating application resilience under component failure scenarios beyond regional outage
Self-healing infrastructure remediation detecting and correcting secondary region infrastructure drift automatically
Cross-cloud disaster recovery federation extending BCDR coverage to workloads in AWS or GCP through ASR-equivalent tooling on those platforms
Continuous compliance validation automation through Azure Policy initiative tracking DR coverage requirements across all production workloads
Advanced ransomware recovery orchestration through dedicated cyber recovery vault with airgapped isolation
Integrated cyber recovery vault architecture providing isolated recovery environment for incidents where primary and secondary regions are simultaneously compromised

Key Takeaways

Disaster recovery requires continuous validation, not passive replication monitoring — replication health metrics confirm data is moving but do not validate that recovered workloads will start correctly and serve traffic within RTO objectives
Automated monthly DR testing is the most impactful operational maturity improvement for enterprise BCDR — annual manual tests are too infrequent to catch recovery procedure drift
ASR and Azure Backup are complementary, not redundant: ASR addresses regional outage recovery (RTO), while Azure Backup addresses data integrity protection and compliance retention (RPO and regulatory requirements)
Preconfiguring secondary region infrastructure before incidents is essential for achieving aggressive RTO targets — cold-start infrastructure provisioning during actual incidents extends recovery time unpredictably
ASR faithfully replicates ransomware encryption — regional replication does not protect against data integrity failures; immutable Azure Backup vaults are the correct protection for cyber recovery scenarios
Active-passive architecture provides the appropriate cost-to-resilience balance for most enterprise workloads; active-active can bring RTO close to zero, but at a significantly higher infrastructure cost that is justified only for the highest-criticality workloads
Recovery plan maintenance must be treated as an ongoing operational requirement — stale plans not reflecting current workload topology are a primary cause of DR test failures

Open to discussing infrastructure architecture, cloud transformation, or high-availability system design.

Whether the objective is infrastructure modernization, operational resilience, hybrid cloud transformation, or enterprise security architecture, I am always interested in discussing complex infrastructure environments and strategic technical initiatives.

Get in touch
