Policy-as-Code Infrastructure Compliance Platform

Enterprise-Scale Azure Governance with Event-Driven Remediation & Multi-Subscription Compliance Analytics

Description

This case study is an independent architecture design exercise developed to demonstrate enterprise-scale Policy-as-Code governance platform architecture for multi-subscription Azure environments. It was not associated with a production deployment. The scenario is based on the compliance governance requirements typical of organisations managing Azure infrastructure across multiple subscriptions and environments with automated enforcement, real-time remediation, and executive compliance reporting requirements. This study focuses on enterprise governance operations at scale — multi-subscription policy management, environment-aware initiative design, event-driven remediation orchestration, Azure Resource Graph compliance analytics, and Power BI executive reporting. Security scanning in IaC pipelines and deployment-time governance are covered in depth in the Security-as-Code & DevSecOps Governance case study.

Key Focus Areas:

  • Policy-as-Code & Cloud Governance

  • Multi-Subscription Compliance Management

  • Event-Driven Remediation Orchestration

  • Azure Resource Graph Compliance Analytics

  • Environment-Aware Initiative Design

  • Executive Compliance Reporting


Executive Summary

Architected a cloud-native Policy-as-Code governance platform on Microsoft Azure enabling automated compliance enforcement, event-driven remediation, cross-subscription visibility, and executive reporting across Development, Test, and Production environments at enterprise scale.

The platform integrates Terraform-managed Azure Policy definitions and initiatives, GitOps-driven policy lifecycle governance, CI/CD-integrated deployment workflows, Azure Event Grid-triggered remediation orchestration through Logic Apps and Azure Automation Runbooks, Azure Resource Graph cross-subscription compliance querying, and Power BI executive compliance dashboards.

The design is differentiated from deployment-time security governance studies by its focus on operational compliance at scale — what happens after policies are deployed across a large Azure estate: how violations are detected in real time, how remediation is automated without introducing operational instability, how compliance state is queried across hundreds of resources across multiple subscriptions, and how governance evidence is surfaced to executive stakeholders.

Business Drivers

As organisations expand Azure adoption across multiple subscriptions and environments, point-in-time compliance audits and manual policy management become operationally unsustainable. Compliance drift — where resources that were compliant at deployment gradually deviate through configuration changes, new resource deployments, or policy scope expansion — is the most common enterprise governance failure in large Azure estates.

This architecture was designed to address the enterprise governance requirements of organisations where existing approaches result in:

  • Compliance drift between environments — policy changes applied to production not propagated to development and test environments creating inconsistent governance posture

  • Limited real-time visibility into policy violations — compliance state only known at scheduled audit intervals rather than continuously

  • Slow manual remediation cycles — non-compliant resources identified in audits but remediated through manual operational tickets extending exposure windows

  • Weak integration between governance controls and infrastructure delivery — policies applied after infrastructure is deployed rather than governed through the same delivery lifecycle

  • Difficulty scaling governance across multiple subscriptions — manual policy management across dozens of subscriptions creates inconsistency and coverage gaps

  • Compliance evidence requiring manual collection — audit responses built from portal exports rather than continuously maintained and queryable compliance state

Operational Constraints

The architecture was designed to operate within the following constraints typical of enterprise multi-subscription Azure governance environments:

  • Governance controls must be consistent across Development, Test, and Production environments but with environment-specific enforcement severity — development teams require operational flexibility that production cannot afford

  • Policy deployment workflows must integrate into CI/CD pipelines — governance changes must flow through the same review and approval process as infrastructure changes

  • Automated remediation must avoid operational instability — not all compliance violations should trigger immediate automated remediation; high-impact remediations require human approval

  • Azure Resource Graph queries must support cross-subscription compliance reporting — no single-subscription visibility model is adequate for enterprise estates

  • Compliance reporting must serve two audiences — technical operators requiring resource-level violation details and executive stakeholders requiring KPI-level governance posture visibility

  • Policy exceptions must be manageable at environment scope — development environments may legitimately require exemptions from controls mandatory in production

  • Multi-subscription governance must follow management group hierarchy — policies assigned at management group level propagate to child subscriptions consistently

Objectives

  • Design a management group hierarchy enabling consistent policy inheritance across Dev, Test, and Production subscription tiers

  • Develop environment-specific policy initiatives with differentiated enforcement severity per environment tier

  • Automate policy lifecycle management through Terraform with GitOps governance and CI/CD deployment

  • Implement event-driven remediation architecture detecting violations in real time and triggering automated corrective actions

  • Design Azure Resource Graph queries providing cross-subscription compliance visibility beyond Azure Policy portal limitations

  • Build Power BI executive compliance dashboards and Azure Workbooks technical compliance dashboards

  • Define Mean Time to Remediation (MTTR) targets per violation severity — distinguishing automated from human-approved remediation paths

  • Establish compliance exemption governance — controlling and auditing policy exemptions across the enterprise estate

Management Group Hierarchy & Policy Inheritance

The management group hierarchy is the foundational governance design decision: policy assignments made at management group level are inherited automatically by all child subscriptions.

Tenant Root Group
└── Enterprise Management Group
    ├── Platform Management Group              Platform baseline policies
    │   ├── Identity Subscription
    │   └── Connectivity Subscription
    ├── Landing Zones Management Group         Workload governance policies
    │   ├── Production Management Group        Strict enforcement initiatives
    │   │   ├── Prod-Sub-01
    │   │   └── Prod-Sub-02
    │   ├── Test Management Group              Moderate enforcement initiatives
    │   │   └── Test-Sub-01
    │   └── Development Management Group       Flexible enforcement initiatives
    │       └── Dev-Sub-01
    └── Sandbox Management Group               Minimal governance, exploration only
        └── Sandbox-Sub-01

Policy Inheritance Design:

  • Policies assigned at Enterprise Management Group level apply to all subscriptions — foundational security controls with no environment exceptions

  • Environment-specific initiatives assigned at Production, Test, and Development management group levels — providing differentiated enforcement without duplicating universal controls

  • Sandbox subscriptions have minimal governance — intentional for innovation and exploration without compliance friction
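The inheritance rules above can be illustrated with a small model. This is a minimal sketch, not the platform's implementation: the `Scope` class and `effective_assignments` helper are hypothetical, and the initiative names are borrowed loosely from the repository layout purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Scope:
    """A management group or subscription in the hierarchy."""
    name: str
    parent: Optional["Scope"] = None
    assignments: list[str] = field(default_factory=list)


def effective_assignments(scope: Scope) -> list[str]:
    """Collect assignments from the root of the hierarchy down to the given scope."""
    chain = []
    node = scope
    while node is not None:
        chain.append(node)
        node = node.parent
    result = []
    for node in reversed(chain):  # root first, so inherited assignments come first
        result.extend(node.assignments)
    return result


# Illustrative initiative names (assumptions, not confirmed assignment names)
enterprise = Scope("Enterprise Management Group", assignments=["platform-universal"])
landing_zones = Scope("Landing Zones Management Group", parent=enterprise)
production = Scope("Production Management Group", parent=landing_zones,
                   assignments=["production-baseline"])
development = Scope("Development Management Group", parent=landing_zones,
                    assignments=["dev-baseline"])
sandbox = Scope("Sandbox Management Group", parent=enterprise)

prod_sub_01 = Scope("Prod-Sub-01", parent=production)
dev_sub_01 = Scope("Dev-Sub-01", parent=development)
sandbox_sub_01 = Scope("Sandbox-Sub-01", parent=sandbox)

print(effective_assignments(prod_sub_01))     # universal controls plus the production initiative
print(effective_assignments(sandbox_sub_01))  # universal controls only
```

In this model a sandbox subscription still receives the universal controls because it sits under the Enterprise Management Group, which matches the "minimal governance" intent rather than no governance at all.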

Environment-Aware Policy Initiative Design

Three-Tier Governance Model:

Control Category         Development   Test     Production
Public IP on VMs         Audit         Deny     Deny
Diagnostic settings      Audit         Audit    DeployIfNotExists
Resource tagging         Audit         Deny     Deny
TLS minimum version      Audit         Deny     Deny
Approved VM SKUs         Disabled      Audit    Deny
Storage HTTPS only       Audit         Deny     Deny
Key Vault soft delete    Audit         Deny     Deny
Approved locations       Disabled      Audit    Deny
MFA for management       Audit         Audit    Deny

Rationale for Environment Differentiation: Applying Deny effects to every control in development creates operational friction that slows delivery without a proportional security benefit; developers testing configurations there need room to iterate. Audit effects in development surface non-compliance without blocking operations. Production uses Deny for every security-critical control that can be blocked at deployment time, and DeployIfNotExists where the control is a companion configuration such as diagnostic settings, so non-compliance is not permitted regardless of operational convenience.
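One way to keep the three-tier model reviewable is to hold the effect matrix as data and derive per-environment values from it. The sketch below is illustrative only: `EFFECT_MATRIX`, `effects_for`, `never_loosens` and the `STRICTNESS` ranking are hypothetical names and assumptions, and the effect values are transcribed from the governance model described in this section.

```python
ENVIRONMENTS = ("development", "test", "production")

# Effects per control, listed in ENVIRONMENTS order
EFFECT_MATRIX = {
    "Public IP on VMs":      ("Audit", "Deny", "Deny"),
    "Diagnostic settings":   ("Audit", "Audit", "DeployIfNotExists"),
    "Resource tagging":      ("Audit", "Deny", "Deny"),
    "TLS minimum version":   ("Audit", "Deny", "Deny"),
    "Approved VM SKUs":      ("Disabled", "Audit", "Deny"),
    "Storage HTTPS only":    ("Audit", "Deny", "Deny"),
    "Key Vault soft delete": ("Audit", "Deny", "Deny"),
    "Approved locations":    ("Disabled", "Audit", "Deny"),
    "MFA for management":    ("Audit", "Audit", "Deny"),
}

# Ranking used only for the sanity check below (an assumption of this sketch)
STRICTNESS = {"Disabled": 0, "Audit": 1, "DeployIfNotExists": 2, "Deny": 2}


def effects_for(environment: str) -> dict[str, str]:
    """Return the effect of every control for one environment tier."""
    index = ENVIRONMENTS.index(environment)
    return {control: effects[index] for control, effects in EFFECT_MATRIX.items()}


def never_loosens(matrix: dict[str, tuple[str, ...]]) -> bool:
    """True if no control is weaker in a later tier than in an earlier one."""
    return all(
        STRICTNESS[earlier] <= STRICTNESS[later]
        for effects in matrix.values()
        for earlier, later in zip(effects, effects[1:])
    )


print(effects_for("test")["Approved VM SKUs"])
print(never_loosens(EFFECT_MATRIX))
```

A check like `never_loosens` could run in a pull request pipeline to catch an accidental change that makes production more permissive than test.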

Architecture Overview

The solution is structured as a seven-layer enterprise governance platform integrating policy definition and IaC, GitOps governance, CI/CD automation, compliance enforcement, event-driven remediation, monitoring and analytics, and executive reporting.

1. Policy Definition & Infrastructure-as-Code Layer

All governance definitions are managed as Terraform code — version-controlled, peer-reviewed, and deployed through CI/CD pipelines.

Terraform Module Structure:

governance/
├── modules/
│   ├── policy-definition/      # Custom policy definition module
│   ├── policy-initiative/      # Initiative (policy set) module
│   ├── policy-assignment/      # Assignment at MG/subscription scope
│   ├── policy-exemption/       # Exemption management with expiry
│   └── remediation-task/       # Remediation task creation
├── initiatives/
│   ├── production-baseline/    # Production strict initiative
│   ├── test-baseline/          # Test moderate initiative
│   ├── dev-baseline/           # Dev flexible initiative
│   └── platform-universal/     # Universal controls, all environments
├── definitions/
│   ├── network/                # Network security policy definitions
│   ├── identity/               # Identity governance definitions
│   ├── data-protection/        # Encryption and data policies
│   └── operational/            # Tagging, diagnostics, monitoring
└── assignments/
    ├── enterprise-mg.tf        # Enterprise MG universal assignments
    ├── production-mg.tf        # Production MG strict assignments
    ├── test-mg.tf              # Test MG moderate assignments
    └── dev-mg.tf               # Dev MG flexible assignments

Example Custom Policy Definition — Terraform:

hcl

resource "azurerm_policy_definition" "require_diagnostic_settings" {
  name         = "require-diagnostic-settings-storage"
  policy_type  = "Custom"
  mode         = "Indexed"
  display_name = "Deploy diagnostic settings for Storage Accounts"

  metadata = jsonencode({
    category = "Monitoring"
    version  = "1.2.0"
  })

  parameters = jsonencode({
    logAnalyticsWorkspaceId = {
      type     = "String"
      metadata = { displayName = "Log Analytics Workspace ID" }
    }
  })

  policy_rule = jsonencode({
    if = {
      field  = "type"
      equals = "Microsoft.Storage/storageAccounts"
    }
    then = {
      effect = "DeployIfNotExists"
      details = {
        type = "Microsoft.Insights/diagnosticSettings"
        roleDefinitionIds = [
          "/providers/Microsoft.Authorization/roleDefinitions/b24988ac-6180-42a0-ab88-20f7382dd24c"
        ]
        deployment = {
          properties = {
            # ... diagnostic settings deployment template ...
          }
        }
      }
    }
  })
}

2. GitOps Governance Layer

Git repositories serve as the authoritative source of truth for all governance definitions — policy changes require pull request review and approval before reaching any environment.

Branch Strategy for Governance:

main              Production policy assignments (approval required)
test              Test environment policies (approval required)
develop           Development policies (automated)
feature/policy-*  New policy definition development

Pull Request Governance Requirements:

  • Policy definition changes require review from at least one governance team member

  • Production policy assignment changes require review from two approvers — governance team and security team

  • All PR checks must pass: terraform validate, a tfsec scan of the Terraform policy code, and a terraform plan whose output is posted as a PR comment for review

  • Policy definition changes include impact assessment — identifying which existing resources would be affected by the new policy
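The branch strategy and approval rules above amount to a small lookup from branch to deployment target. The following is a minimal sketch under that reading; `Target`, `BRANCH_RULES` and `deployment_target` are hypothetical names, and treating feature branches as validation-only is an assumption rather than something the strategy states explicitly.

```python
from typing import NamedTuple, Optional


class Target(NamedTuple):
    environment: str
    approval_required: bool


# Long-lived branches and the environment each one deploys to
BRANCH_RULES = {
    "main": Target("production", approval_required=True),
    "test": Target("test", approval_required=True),
    "develop": Target("development", approval_required=False),
}


def deployment_target(branch: str) -> Optional[Target]:
    """Return the environment a branch deploys to, or None if it never deploys."""
    # Anything else, e.g. feature/policy-* branches, is assumed to run PR validation only
    return BRANCH_RULES.get(branch)


print(deployment_target("main"))
print(deployment_target("feature/policy-tls-minimum"))
```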

3. CI/CD Automation Layer

Governance deployment pipelines enforce validation, planning, approval, and deployment stages independently per environment tier.

Pipeline Stage Architecture:

yaml

stages:
  - stage: ValidateGovernance
    jobs:
      - job: PolicyValidation
        steps:
          - script: terraform validate
          - script: tfsec . --minimum-severity MEDIUM
          - script: |
              # Custom policy definition schema validation
              python scripts/validate_policy_json.py
          - script: |
              # Check for policy assignments without corresponding definitions
              python scripts/validate_assignment_references.py

  - stage: PlanGovernance
    dependsOn: ValidateGovernance
    jobs:
      - job: TerraformPlan
        steps:
          - script: terraform plan -out=governance.tfplan
          - script: |
              # Generate human-readable policy impact report
              python scripts/generate_impact_report.py governance.tfplan
          - publish: governance.tfplan

  - stage: ApproveProduction
    dependsOn: PlanGovernance
    condition: eq(variables['targetEnvironment'], 'production')
    jobs:
      - deployment: GovernanceApproval
        environment: governance-production   # Two-approver gate

  - stage: DeployGovernance
    dependsOn: ApproveProduction
    jobs:
      - job: TerraformApply
        steps:
          - download: governance.tfplan
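          - script: terraform apply governance.tfplan   # assumed completion of the truncated step: apply the reviewed plan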

Policy Impact Report Generation: Before production governance changes are approved, an automated impact report is generated showing how many existing resources would be affected by each policy change — enabling approvers to make informed decisions about production policy deployment timing and potential operational impact.
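As a starting point for such a report, the planned policy changes can be summarised from the JSON form of the plan (`terraform show -json governance.tfplan`), whose `resource_changes` entries list the actions Terraform intends to take. The sketch below is not the `generate_impact_report.py` script referenced in the pipeline: `summarize_policy_changes` is a hypothetical helper, the sample plan is hand-written and trimmed, and counting the existing Azure resources affected by each change would need an additional compliance query that is not shown here.

```python
import json
from collections import Counter


def summarize_policy_changes(plan: dict) -> Counter:
    """Count planned actions per policy-related Terraform resource type."""
    summary = Counter()
    for change in plan.get("resource_changes", []):
        actions = change["change"]["actions"]
        # Skip non-policy resources and resources the plan leaves untouched
        if "policy" not in change["type"] or actions == ["no-op"]:
            continue
        summary[(change["type"], "/".join(actions))] += 1
    return summary


# Trimmed, hand-written stand-in for `terraform show -json governance.tfplan`
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "azurerm_policy_definition.require_diagnostic_settings",
     "type": "azurerm_policy_definition",
     "change": {"actions": ["update"]}},
    {"address": "azurerm_management_group_policy_assignment.universal_baseline",
     "type": "azurerm_management_group_policy_assignment",
     "change": {"actions": ["no-op"]}},
    {"address": "azurerm_resource_policy_exemption.dev_public_ip_exemption",
     "type": "azurerm_resource_policy_exemption",
     "change": {"actions": ["delete", "create"]}}
  ]
}
""")

for (resource_type, action), count in sorted(summarize_policy_changes(sample_plan).items()):
    print(f"{resource_type}: {action} x{count}")
```

Replacements appear as `delete/create`, which is worth surfacing to approvers because a replaced assignment briefly stops being enforced.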

4. Compliance Enforcement Layer

Azure Policy enforces governance controls across the management group hierarchy with environment-appropriate effects.

Universal Controls — All Environments:

hcl

# Applied at the Enterprise Management Group with no exceptions
resource "azurerm_management_group_policy_assignment" "universal_baseline" {
  name                 = "universal-sec-baseline" # Management group assignment names are limited to 24 characters
  management_group_id  = data.azurerm_management_group.enterprise.id
  policy_definition_id = azurerm_policy_set_definition.universal_baseline.id

  # Controls that apply everywhere without exception:
  # - Deny MMA legacy agent (deprecated)
  # - Require HTTPS on storage
  # - Deny classic resources
  # - Require Key Vault soft delete
  # - Deny public IP on management subnets
}

Compliance Exemption Governance:

hcl

# Time-limited exemption: requires a justification and an expiry date
resource "azurerm_resource_policy_exemption" "dev_public_ip_exemption" {
  name                 = "dev-public-ip-testing-exemption"
  resource_id          = azurerm_virtual_machine.dev_test_vm.id
  policy_assignment_id = azurerm_management_group_policy_assignment.deny_public_ip.id
  exemption_category   = "Waiver"
  expires_on           = "2025-12-31T00:00:00Z"    # Mandatory expiry; no permanent exemptions

  description = "Temporary exemption for public IP testing in dev environment. JIRA-1234. Expires 2025-12-31."
}

All exemptions are version-controlled, require justification comments, have mandatory expiry dates, and are reviewed quarterly — permanent exemptions are not permitted.
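These exemption rules lend themselves to an automated check in the pull request pipeline or the quarterly review. The following is a minimal sketch, assuming exemptions have been exported as dictionaries with `expires_on` and `description` keys; `exemption_problems` is a hypothetical helper and the 180-day ceiling is an illustrative assumption, since the rules above only mandate that an expiry exists.

```python
from datetime import datetime, timedelta, timezone

MAX_EXEMPTION_DAYS = 180  # hypothetical ceiling, not a stated platform rule


def exemption_problems(exemption: dict, now: datetime) -> list[str]:
    """Return the reasons an exemption record breaks the governance rules."""
    problems = []
    if not exemption.get("description", "").strip():
        problems.append("missing justification")

    expires_on = exemption.get("expires_on")
    if not expires_on:
        problems.append("missing expiry (permanent exemptions are not permitted)")
        return problems

    # Accept the trailing "Z" used in the Terraform examples
    expiry = datetime.fromisoformat(expires_on.replace("Z", "+00:00"))
    if expiry <= now:
        problems.append("already expired")
    elif expiry - now > timedelta(days=MAX_EXEMPTION_DAYS):
        problems.append(f"expiry more than {MAX_EXEMPTION_DAYS} days away")
    return problems


now = datetime(2025, 9, 1, tzinfo=timezone.utc)  # fixed time for a repeatable example
dev_exemption = {
    "expires_on": "2025-12-31T00:00:00Z",
    "description": "Temporary exemption for public IP testing in dev environment. JIRA-1234.",
}
print(exemption_problems(dev_exemption, now))
print(exemption_problems({"description": "No end date"}, now))
```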

5. Event-Driven Remediation Layer

The remediation layer detects compliance violations in real time and orchestrates automated corrective actions — with human approval gates for high-impact remediations.

Remediation Architecture Flow:

Azure Policy violation detected
         ↓
Azure Event Grid (Policy compliance state change event)
         ↓
Event Grid subscription filters by violation severity
         │
         ├── HIGH/CRITICAL → Azure Logic App (approval workflow)
         │        ↓
         │   Approval notification to governance team
         │        ↓
         │   Human approves/rejects remediation
         │        ↓
         │   Approved → Azure Automation Runbook executes remediation
         │
         └── MEDIUM/LOW → Azure Function (automated remediation)
                  ↓
             Automated corrective action executed
             without human intervention

Concrete Remediation Example — Public IP Detected on VM:

python

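from datetime import datetime, timezone

# Azure Function triggered by an Event Grid policy compliance state change event.
# azure_client, log_remediation_action and send_notification are platform helpers
# assumed to be defined elsewhere in the function app.
def remediate_public_ip_violation(event):
    # Policy state events carry the resource ID in `subject` and the full
    # policy definition ID in `data.policyDefinitionId`
    resource_id = event['subject']
    violation_type = event['data']['policyDefinitionId'].rsplit('/', 1)[-1]

    # Act only when the resource has become non-compliant with the public IP policy
    if event['data'].get('complianceState') != "NonCompliant":
        return
    if violation_type != "deny-public-ip-vm":
        return

    # Get NIC associated with the VM
    nic = azure_client.network.get_nic_from_vm(resource_id)

    # Remove the public IP association from every IP configuration on the NIC
    for ip_config in nic.ip_configurations:
        ip_config.public_ip_address = None
    azure_client.network.update_nic(nic)

    # One timezone-aware timestamp shared by the log entry and the notification
    remediated_at = datetime.now(timezone.utc)

    # Log remediation action
    log_remediation_action(
        resource_id=resource_id,
        action="removed_public_ip",
        triggered_by="automated_policy_remediation",
        timestamp=remediated_at
    )

    # Notify governance team
    send_notification(
        subject=f"Auto-remediated: Public IP removed from {resource_id}",
        body=f"Policy violation auto-remediated at {remediated_at.isoformat()}"
    )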

Remediation Decision Matrix:

| Violation Type | Severity | Remediation Path | Human Approval | Target MTTR |
|---|---|---|---|---|
| Public IP on workload VM | High | Automated Function | No | 5 minutes |
| Missing diagnostic settings | Medium | DeployIfNotExists Policy | No | 30 minutes |
| Missing resource tags | Low | Automated Function | No | 15 minutes |
| Non-compliant VM SKU | High | Logic App workflow | Yes | 4 hours |
| Public storage blob access | Critical | Automated Function | No | 2 minutes |
| Missing NSG on subnet | High | Logic App workflow | Yes | 2 hours |
| Encryption disabled | Critical | Logic App workflow | Yes | 1 hour |

Why Automated and Approved Remediation Are Separated: Automatically remediating every violation regardless of impact risks operational disruption. For example, resizing a VM with a non-compliant SKU without warning interrupts whatever the owning team is running on it. Automated remediation is appropriate only where the non-compliant configuration has no plausible legitimate use case and the corrective action has low operational impact. High-impact remediations route through approval workflows, so that human judgement is applied before irreversible or operationally disruptive actions are taken.
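
The decision matrix can be expressed as a small lookup that a remediation dispatcher consults before acting. The following is a minimal sketch under stated assumptions: the violation keys, the path labels, and the conservative default for unknown violations are all hypothetical, while the severity, approval, and MTTR values mirror the matrix rows.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RemediationRule:
    severity: str
    path: str                 # component that performs the fix
    requires_approval: bool
    target_mttr: timedelta


# Keys and path labels are hypothetical identifiers; values mirror the matrix.
DECISION_MATRIX = {
    "public-ip-on-vm": RemediationRule("High", "function", False, timedelta(minutes=5)),
    "missing-diagnostic-settings": RemediationRule("Medium", "deploy-if-not-exists", False, timedelta(minutes=30)),
    "missing-resource-tags": RemediationRule("Low", "function", False, timedelta(minutes=15)),
    "non-compliant-vm-sku": RemediationRule("High", "logic-app", True, timedelta(hours=4)),
    "public-blob-access": RemediationRule("Critical", "function", False, timedelta(minutes=2)),
    "missing-nsg-on-subnet": RemediationRule("High", "logic-app", True, timedelta(hours=2)),
    "encryption-disabled": RemediationRule("Critical", "logic-app", True, timedelta(hours=1)),
}

# Assumption: anything not in the matrix goes to the approval path, never auto-fix.
DEFAULT_RULE = RemediationRule("Unknown", "logic-app", True, timedelta(hours=4))


def route_violation(violation_type: str) -> RemediationRule:
    """Return the remediation rule for a violation, defaulting to human approval."""
    return DECISION_MATRIX.get(violation_type, DEFAULT_RULE)
```

Defaulting unknown violation types to the approval path keeps the automated set an explicit allow-list, which matches the intent of bounding automation to low-impact cases.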

6. Monitoring & Analytics Layer

Compliance analytics use Azure Resource Graph for cross-subscription querying. It supports the custom, enterprise-wide aggregation that the Azure Policy portal's built-in compliance views are not designed to provide.

Azure Resource Graph — Cross-Subscription Compliance Queries:

kusto

// Non-compliant resources across all subscriptions — grouped by policy
PolicyResources
| where type == "microsoft.policyinsights/policystates"
| where properties.complianceState == "NonCompliant"
| summarize
    NonCompliantCount = count(),
    AffectedSubscriptions = dcount(subscriptionId)
    by PolicyDefinitionName = tostring(properties.policyDefinitionName),
       ResourceType = tostring(properties.resourceType)
| order by NonCompliantCount desc
| take 20

kusto

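// Compliance rate by last-evaluation day over the past 30 days
// Note: Resource Graph reflects the most recent evaluation of each resource,
// so a true historical trend needs daily snapshots persisted elsewhere
// (for example in Log Analytics)
PolicyResources
| where type == "microsoft.policyinsights/policystates"
| extend EvaluatedAt = todatetime(properties.timestamp)
| where EvaluatedAt > ago(30d)
| summarize
    CompliantCount = countif(properties.complianceState == "Compliant"),
    NonCompliantCount = countif(properties.complianceState == "NonCompliant")
    by Day = bin(EvaluatedAt, 1d)
| where CompliantCount + NonCompliantCount > 0
| project Day, ComplianceRate = (CompliantCount * 100.0) / (CompliantCount + NonCompliantCount)
| order by Day asc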

kusto

// Resources without required tags — cross-subscription
Resources
| where isnull(tags.environment) or isnull(tags.owner) or isnull(tags.cost_centre)
| project
    ResourceId = id,
    ResourceType = type,
    Subscription = subscriptionId,
    MissingTags = pack(
        "environment", isnull(tags.environment),
        "owner", isnull(tags.owner),
        "cost_centre", isnull(tags.cost_centre)
    )
| order by Subscription asc

Azure Monitor — Governance Telemetry:

  • Azure Policy compliance state change events logged to Log Analytics

  • Remediation task execution results — success, failure, and partial remediation outcomes

  • Policy assignment deployment events from CI/CD pipeline

  • Alert rules for compliance rate degradation — alerting when subscription compliance rate drops below defined threshold

7. Visualisation & Reporting Layer

Compliance reporting serves two audiences through separate visualisation tools — technical operators and executive governance stakeholders.

Azure Workbooks — Technical Compliance Dashboard:

  • Per-subscription compliance rate by policy initiative

  • Non-compliant resource inventory with drill-down to resource-level violation details

  • Remediation task status — pending, in-progress, completed, failed

  • Policy assignment coverage map — which initiatives are assigned to which management groups

  • Recent policy deployment history from CI/CD pipeline

Power BI — Executive Compliance Dashboard:

| Report | Audience | Refresh Frequency | Purpose |
|---|---|---|---|
| Enterprise Compliance Score | Executive / CISO | Daily | Overall governance posture KPI |
| Environment Compliance Comparison | Governance team | Daily | Dev/Test/Prod compliance rate comparison |
| Compliance Trend | Executive / Governance | Weekly | 90-day compliance rate trend |
| Top Violations | Governance team | Daily | Most frequent policy violations requiring attention |
| Remediation Performance | Operations | Daily | MTTR by violation type vs targets |
| Exemption Register | Compliance/Audit | Weekly | Active exemptions with expiry dates |

Compliance Score Methodology:

Enterprise Compliance Score = 
  (Compliant Resources / Total Assessed Resources) × 100

Weighted Score (optional) = 
  Σ (Policy Weight × Compliance Rate per Policy) / Σ Policy Weights

Where policy weights reflect regulatory significance:
  - PCI DSS controls: weight 3
  - Security baseline controls: weight 2
  - Operational controls: weight 1

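As a worked illustration of the score methodology, the sketch below computes both the simple and the weighted score. The function names and the sample compliance rates are assumptions made for the example; only the formulas and the 3/2/1 weights come from the methodology.

```python
from typing import Iterable, Tuple


def compliance_score(compliant: int, total_assessed: int) -> float:
    """Enterprise Compliance Score: compliant resources as a percentage."""
    if total_assessed == 0:
        raise ValueError("no assessed resources")
    return compliant * 100 / total_assessed


def weighted_score(policies: Iterable[Tuple[float, float]]) -> float:
    """Weighted score from (policy weight, compliance rate in percent) pairs."""
    pairs = list(policies)
    total_weight = sum(weight for weight, _ in pairs)
    if total_weight == 0:
        raise ValueError("policy weights sum to zero")
    return sum(weight * rate for weight, rate in pairs) / total_weight


# Hypothetical rates: PCI DSS 90%, security baseline 80%, operational 70%
sample = [(3, 90.0), (2, 80.0), (1, 70.0)]
print(round(weighted_score(sample), 2))  # (270 + 160 + 70) / 6
```

With these sample rates the unweighted mean is 80, while the weighted score is about 83.33, because the PCI DSS controls carry the largest weight and also have the highest rate in this example.
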
Architecture Diagram

Technologies Used

| Category | Technologies |
|---|---|
| Infrastructure as Code | Terraform |
| CI/CD & GitOps | GitHub Actions, Azure DevOps, YAML Pipelines |
| Policy Governance | Azure Policy, Azure Initiative Definitions, Management Groups |
| Cross-Subscription Analytics | Azure Resource Graph, KQL |
| Event-Driven Remediation | Azure Event Grid, Logic Apps, Azure Functions, Azure Automation Runbooks |
| Monitoring | Azure Monitor, Log Analytics |
| Reporting | Power BI, Azure Workbooks |
| Security & Compliance | Microsoft Defender for Cloud, Azure RBAC, Azure Key Vault |
| Compliance Frameworks | CIS Azure Benchmark v2.0, NIST SP 800-53, PCI DSS v4.0 |

Key Challenges Addressed

Maintaining policy consistency across multiple subscriptions: addressed through management group policy inheritance. Universal controls assigned at the enterprise management group level propagate to all subscriptions automatically, with no per-subscription assignment management.

Integrating governance into CI/CD without slowing delivery: addressed through environment-tiered deployment pipelines. Development governance changes deploy automatically, while production governance changes pass a two-approver gate, which maintains delivery velocity in lower environments without sacrificing production governance rigour.

Providing near real-time cross-subscription compliance visibility: addressed through Azure Resource Graph KQL queries that aggregate compliance state across all subscriptions in a single query, which the Azure Policy portal's built-in compliance views are not designed to do.

Automating remediation without introducing instability: addressed through a remediation decision matrix that separates automated low-impact remediations from human-approved high-impact actions. Automated remediation applies only where the non-compliant configuration has no plausible legitimate business use case.

Supporting environment-specific governance flexibility: addressed through a three-tier initiative design using Audit effects in development and Deny effects in production. The policy definitions are the same; only the enforcement severity differs per environment management group.

Scaling exemption governance: addressed through Terraform-managed exemptions with mandatory expiry dates, justification requirements, and a version-controlled audit trail, which prevents the accumulation of exemptions that gradually erodes compliance posture.

Design Decisions & Rationale

Management Group Hierarchy as the Policy Distribution Foundation : Assigning policies at individual subscription level creates management overhead that scales linearly with subscription count. Management group hierarchy enables hierarchical inheritance — universal controls assigned once at the top propagate automatically to all child subscriptions. Environment-specific initiatives assigned at environment management group level apply consistently to all subscriptions within that environment tier without per-subscription configuration.
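
The inheritance behaviour described above can be modelled as a walk from a subscription up to the root management group. This is only an illustrative sketch: the management group names, subscription names, and initiative names are hypothetical, and real evaluation is performed by Azure Policy itself rather than by code like this.

```python
from typing import Dict, List, Optional

# Hypothetical hierarchy: child scope -> parent scope (None marks the root)
PARENT: Dict[str, Optional[str]] = {
    "mg-enterprise": None,
    "mg-production": "mg-enterprise",
    "mg-development": "mg-enterprise",
    "sub-payments-prod": "mg-production",
    "sub-sandbox-dev": "mg-development",
}

# Hypothetical initiative assignments made directly at each scope
ASSIGNMENTS: Dict[str, List[str]] = {
    "mg-enterprise": ["universal-baseline"],
    "mg-production": ["prod-deny-initiative"],
    "mg-development": ["dev-audit-initiative"],
}


def effective_assignments(scope: str) -> List[str]:
    """Collect assignments inherited from the root down to the given scope."""
    chain = []
    current: Optional[str] = scope
    while current is not None:
        chain.append(current)
        current = PARENT[current]
    inherited: List[str] = []
    for node in reversed(chain):  # root first, so universal controls lead
        inherited.extend(ASSIGNMENTS.get(node, []))
    return inherited


print(effective_assignments("sub-payments-prod"))
```

Neither subscription has an assignment of its own in this model, yet each still resolves to the universal baseline plus its environment initiative, which is the overhead saving the design relies on.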

Environment-Aware Initiative Design with Effect Differentiation: Uniform Deny effects across all environments block legitimate development activities, because developers testing configurations need flexibility that production cannot permit. The three-tier initiative design maps enforcement severity to operational risk: development receives Audit awareness without operational blocking, while production receives Deny enforcement without exception. The same policy definitions serve all environments with environment-specific effect parameters.
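
One way to picture effect differentiation is a per-environment parameter payload passed to the same definition. In this sketch the parameter name `effect` and the value chosen for the `test` tier are assumptions; the development and production values follow the text.

```python
from typing import Dict

# Development and production follow the design; the "test" value is assumed.
EFFECT_BY_ENVIRONMENT: Dict[str, str] = {
    "development": "Audit",
    "test": "Audit",
    "production": "Deny",
}


def assignment_parameters(environment: str) -> Dict[str, Dict[str, str]]:
    """Build the parameter payload for one environment's initiative assignment."""
    if environment not in EFFECT_BY_ENVIRONMENT:
        raise KeyError(f"unknown environment tier: {environment}")
    return {"effect": {"value": EFFECT_BY_ENVIRONMENT[environment]}}


print(assignment_parameters("production"))
```

Keeping the effect in a parameter rather than in the rule means a change in enforcement severity is a one-line change to the assignment, not a new policy definition.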

Event-Driven Remediation over Scheduled Remediation : Scheduled remediation runs (e.g. daily compliance remediation jobs) leave non-compliant resources exposed for the interval between runs. Event Grid-triggered remediation responds to compliance state changes in near real time — reducing the exposure window from hours to minutes for automated remediations and providing immediate notification for human-approved remediations.

Azure Resource Graph over Azure Policy Portal for Compliance Analytics: The Azure Policy compliance portal presents built-in views for one selected scope at a time, which is inadequate for custom analysis across enterprise estates spanning dozens of subscriptions. Azure Resource Graph queries execute across all subscriptions simultaneously, enabling cross-subscription compliance aggregation, trend analysis over stored query results, and KQL-based custom compliance reporting that the portal cannot provide.

Mandatory Exemption Expiry Dates : Permanent policy exemptions accumulate over time as environments evolve — exempt resources become forgotten compliance gaps. Mandatory expiry dates on all exemptions through Terraform enforcement ensure exemptions are reviewed and either renewed with justification or removed when the underlying business need expires. Quarterly exemption review processes validate that active exemptions remain justified.
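
A quarterly review of the kind described could start from a check like the one below. It is a sketch under stated assumptions: the `Exemption` record shape, the finding labels, and the 30-day warning window are hypothetical and would in practice be derived from the Terraform-managed exemption definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple


@dataclass
class Exemption:
    name: str
    justification: str
    expires_on: Optional[datetime]


def review_exemptions(
    exemptions: List[Exemption],
    now: datetime,
    warn_within: timedelta = timedelta(days=30),  # assumed warning window
) -> List[Tuple[str, str]]:
    """Return (exemption name, finding) pairs for exemptions needing attention."""
    findings: List[Tuple[str, str]] = []
    for exemption in exemptions:
        if exemption.expires_on is None:
            findings.append((exemption.name, "missing-expiry"))
        elif exemption.expires_on <= now:
            findings.append((exemption.name, "expired"))
        elif exemption.expires_on - now <= warn_within:
            findings.append((exemption.name, "expiring-soon"))
        if not exemption.justification.strip():
            findings.append((exemption.name, "missing-justification"))
    return findings
```

Running the same check in the CI pipeline would reject an exemption without an expiry date before it is ever applied, rather than discovering it at review time.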

Separation of Enforcement and Remediation Layers : Combining Azure Policy enforcement and remediation in a single workflow creates risk — a misconfigured remediation action could cascade across large numbers of resources simultaneously. Separating enforcement (Azure Policy — detects violations) from remediation (Event Grid → Functions/Runbooks — corrects violations) enables independent testing, independent failure modes, and granular control over which violations trigger automated vs human-approved remediation.

Trade-offs & Design Constraints

Azure Policy DeployIfNotExists Remediation Timing Gap: The DeployIfNotExists effect deploys the missing configuration asynchronously, after the resource has already been created or updated, and resources that predate the assignment are corrected only when a remediation task runs. In both cases there is a window in which resources exist without required configurations. For compliance-critical controls (diagnostic settings, encryption), Terraform should configure these settings explicitly rather than relying on Policy remediation. Policy remediation should serve as a backstop for resources deployed outside IaC governance, not as the primary configuration mechanism for IaC-managed resources.

Event Grid Compliance Event Volume at Scale : In large Azure estates with frequent resource changes, Azure Policy generates high volumes of compliance state change events. Event Grid handles high throughput but downstream Logic Apps and Azure Functions must be designed for concurrent execution — a compliance event storm following a large deployment could trigger thousands of simultaneous remediation events. Rate limiting, dead letter queuing, and idempotent remediation function design are essential for production remediation reliability.
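
The three safeguards named above can be sketched together in a few lines. This is an illustration rather than a production handler: the event shape (`resourceId`, `policy`), the injected `remediate` coroutine, and the in-memory `seen` set and dead-letter list are assumptions. A real deployment would keep the deduplication keys and dead letters in durable storage.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

Event = Dict[str, str]


async def process_events(
    events: List[Event],
    remediate: Callable[[str], Awaitable[str]],
    max_concurrent: int = 5,  # assumed concurrency cap
) -> Tuple[List[str], List[Tuple[Event, str]]]:
    """Remediate each (resource, policy) pair once, with bounded concurrency."""
    seen = set()
    semaphore = asyncio.Semaphore(max_concurrent)
    remediated: List[str] = []
    dead_letter: List[Tuple[Event, str]] = []

    async def handle(event: Event) -> None:
        async with semaphore:  # never more than max_concurrent in flight
            try:
                remediated.append(await remediate(event["resourceId"]))
            except Exception as exc:
                # Park the failure for later inspection instead of losing it
                dead_letter.append((event, str(exc)))

    tasks = []
    for event in events:
        key = (event["resourceId"], event["policy"])
        if key in seen:  # a repeated delivery becomes a no-op
            continue
        seen.add(key)
        tasks.append(handle(event))

    await asyncio.gather(*tasks)
    return remediated, dead_letter
```

Deduplicating on the resource and policy pair means a redelivered event does not trigger a second remediation, while the semaphore keeps an event storm from overwhelming downstream APIs.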

Resource Graph Query Throttling: Azure Resource Graph throttles queries per user; the default quota is 15 queries within each 5-second window. Power BI dashboards that refresh compliance data through Resource Graph queries must implement query result caching and refresh scheduling to stay within that quota. Direct Power BI → Resource Graph integration without caching creates throttling risk at enterprise scale.
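
Query result caching can be as simple as a time-to-live wrapper around whatever function executes the query. In the sketch below, `TTLQueryCache`, the injected `run_query` callable, and the 15-minute default are assumptions made for illustration; no Azure SDK call is shown.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLQueryCache:
    """Serve repeated queries from memory until their entry expires."""

    def __init__(
        self,
        run_query: Callable[[str], Any],
        ttl_seconds: float = 900.0,  # assumed 15-minute freshness window
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._run_query = run_query
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, query: str) -> Any:
        now = self._clock()
        cached = self._entries.get(query)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # still fresh, no call to the backend
        result = self._run_query(query)
        self._entries[query] = (now, result)
        return result
```

Several dashboard visuals sharing one cached result then cost a single backend query per freshness window instead of one query per visual per refresh.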

Terraform Policy State Import Complexity : Importing existing manually configured Azure Policy definitions and assignments into Terraform state requires careful attribute mapping — policy rule JSON in existing definitions must exactly match Terraform resource attribute structure. Mismatches generate plan drift requiring careful reconciliation. A discovery-first approach — using Azure CLI to export existing policy definitions before writing Terraform resources — reduces import complexity.

Remediation Identity Permissions Scope : Azure Automation Runbooks and Azure Functions executing remediation actions require Azure RBAC permissions — typically Contributor on affected resource groups. Overly broad remediation identity permissions create risk if the remediation service is compromised. Permissions should be scoped to the minimum required for each remediation action type — separate managed identities per remediation function with purpose-specific role assignments rather than a single broadly-scoped remediation identity.
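
A simple guard can flag remediation identities that still hold a broad role. The function names, identity names, and placeholder scopes below are hypothetical; the role names are Azure built-in roles, and treating Owner and Contributor as too broad is an assumption of this sketch.

```python
from typing import Dict, List

# Assumption: these built-in roles are considered too broad for remediation.
BROAD_ROLES = {"Owner", "Contributor"}

# Hypothetical mapping of remediation function -> role assignment
REMEDIATION_IDENTITIES: Dict[str, Dict[str, str]] = {
    "remove-public-ip": {
        "role": "Network Contributor",
        "scope": "/subscriptions/<subscription-id>/resourceGroups/<network-rg>",
    },
    "apply-required-tags": {
        "role": "Tag Contributor",
        "scope": "/subscriptions/<subscription-id>",
    },
    "disable-public-blob-access": {
        "role": "Storage Account Contributor",
        "scope": "/subscriptions/<subscription-id>/resourceGroups/<storage-rg>",
    },
}


def overly_broad(identities: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the remediation functions whose identity holds a broad role."""
    return sorted(
        name for name, assignment in identities.items()
        if assignment["role"] in BROAD_ROLES
    )


print(overly_broad(REMEDIATION_IDENTITIES))
```

One identity per remediation function limits what a compromised function can change to the resource type it was built to correct.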

Projected Outcomes

The architecture is designed to deliver the following governance and operational outcomes in a production enterprise environment:

  • Consistent policy enforcement across all subscriptions through management group hierarchy inheritance — universal controls applied without per-subscription configuration management

  • Environment-appropriate governance through three-tier initiative design — development flexibility and production strictness enforced through the same policy definitions with differentiated effects

  • Near real-time compliance violation detection and automated remediation for defined violation categories through Event Grid-triggered orchestration

  • Cross-subscription compliance visibility through Azure Resource Graph KQL queries — enterprise-wide compliance posture queryable on demand

  • Executive governance reporting through Power BI dashboards with daily compliance score, trend analysis, and MTTR performance tracking

  • Auditable governance lifecycle through Terraform-managed policy definitions, GitOps version control, and CI/CD deployment history

  • Controlled exemption governance through mandatory expiry dates, justification requirements, and quarterly review processes preventing exemption accumulation

Future Evolution

  • OPA/Gatekeeper integration for Kubernetes workload governance — extending Policy-as-Code governance to AKS admission control through the same GitOps governance model

  • AI-assisted compliance anomaly detection — identifying unusual compliance degradation patterns indicating potential security incidents rather than routine configuration drift

  • Automated risk scoring and violation prioritisation — weighting compliance violations by asset criticality and regulatory impact for intelligent remediation sequencing

  • Cross-cloud governance federation — extending management group-equivalent governance patterns to AWS Organizations and GCP Resource Manager through Terraform multi-cloud provider management

  • Continuous compliance validation pipelines — scheduled Resource Graph compliance scans triggering pipeline alerts when compliance rate drops below defined thresholds

  • Self-healing remediation workflows — expanding automated remediation coverage as remediation patterns are proven stable through operational experience

  • FinOps governance integration — Azure Policy enforcement of cost governance controls (approved VM SKUs, required shutdown schedules, resource lifecycle tagging) through the same governance platform

  • Security posture benchmarking automation — automated CIS Azure Benchmark and NIST SP 800-53 compliance scoring through Defender for Cloud regulatory compliance integration

Key Takeaways

  • Management group hierarchy is the foundational Policy-as-Code design decision — policy inheritance eliminates per-subscription management overhead that scales unsustainably with subscription count

  • Environment-aware initiative design with effect differentiation is essential — uniform Deny enforcement across all environments blocks legitimate development activities; the same policy definitions should serve all environments with environment-specific effect parameters

  • Event-driven remediation reduces compliance violation exposure windows from hours to minutes compared to scheduled remediation, because it responds in near real time rather than on hourly or daily remediation cycles

  • Azure Resource Graph is the correct tool for cross-subscription compliance analytics, because the Azure Policy portal's built-in compliance views are not designed for custom enterprise-scale compliance aggregation

  • Automated remediation must be bounded by a decision matrix — not all violations should trigger automated corrective actions; high-impact remediations require human approval to prevent operational disruption

  • Exemption governance requires mandatory expiry enforcement — permanent exemptions accumulate into compliance debt; Terraform-managed exemptions with expiry dates prevent this erosion

  • Separation of enforcement and remediation layers enables independent failure modes — Azure Policy detecting violations and Functions/Runbooks correcting them can be tested, operated, and failed independently

Open to discussing infrastructure architecture, cloud transformation, or high-availability system design.

Whether the objective is infrastructure modernization, operational resilience, hybrid cloud transformation, or enterprise security architecture, I am always interested in discussing complex infrastructure environments and strategic technical initiatives.

ENTERPRISE INFRASTRUCTURE ARCHITECTURE

My work focuses on ensuring service continuity, optimizing performance, and supporting large-scale infrastructure transformations across multi-site and hybrid environments.
