Policy-as-Code Infrastructure Compliance Platform

Enterprise-Scale Azure Governance with Event-Driven Remediation & Multi-Subscription Compliance Analytics

GitHub: https://github.com/sergeksfumey/policy-as-code-compliance

ARCHITECTURE OVERVIEW

Policy-as-Code Infrastructure Compliance Platform — Terraform, Azure Policy & GitOps-Driven Governance

Automated deployment of 100+ Azure Policy definitions via Terraform and GitHub Actions, environment-specific initiative enforcement across Dev/Test/Prod, Event Grid-triggered Logic App remediation, and real-time Power BI compliance dashboards achieving 95%+ compliance posture

The platform treats cloud governance as software — every policy definition, initiative grouping, and assignment is version-controlled in Git, reviewed through pull requests, and deployed through automated CI/CD pipelines. Terraform modules define both custom and built-in Azure Policy structures, parameterised by environment so that the same codebase deploys flexible tagging controls to Dev, moderate restrictions to Test, and locked-down security baselines — NSG rules, encryption enforcement, diagnostic settings — to Production with mandatory approval gates before apply.

On commit or pull request, GitHub Actions or Azure DevOps Pipelines execute a Terraform Plan to validate policy definitions against the target environment, followed by Terraform Apply to deploy and assign initiatives to the appropriate subscription and resource group scope. Over 100 policy definitions are deployed through this pipeline, with Microsoft Defender for Cloud security baseline initiatives layered on top. The practical effect is demonstrated by the "Enforce Environment Tag" deny policy: attempting to create a VM without the required environment tag fails at deployment time with a policy validation error — governance is enforced at the control plane, not after the fact.

Non-compliance events propagate through an event-driven remediation chain. Azure Policy evaluation results trigger Event Grid events on compliance state changes, which activate Logic App workflows that parse the violation data, format alert messages, and dispatch notifications via Teams, Slack, email, or SMS. The same Logic App triggers downstream remediation — Azure Functions handle stateless auto-tagging and diagnostic enablement, while Azure Automation Runbooks manage bulk remediation sweeps for known policy violations. This automation chain achieves an 80% auto-remediation success rate, eliminating the manual triage backlog that characterises traditional compliance management.

Compliance posture is continuously visualised through Power BI dashboards fed by Azure Monitor and Azure Resource Graph — displaying per-environment compliance rates, policy evaluation results, remediation success trends, and historical drift analysis across all subscriptions. Tag compliance Workbooks break down non-compliant resources by type, providing operations and audit teams with actionable, drill-down visibility into exactly which resources require attention and why — sustaining 95%+ compliance across all environments.

Description

This case study is an independent architecture design exercise developed to demonstrate enterprise-scale Policy-as-Code governance platform architecture for multi-subscription Azure environments. It was not associated with a production deployment. The scenario is based on the compliance governance requirements typical of organisations managing Azure infrastructure across multiple subscriptions and environments with automated enforcement, real-time remediation, and executive compliance reporting requirements. This study focuses on enterprise governance operations at scale — multi-subscription policy management, environment-aware initiative design, event-driven remediation orchestration, Azure Resource Graph compliance analytics, and Power BI executive reporting. Security scanning in IaC pipelines and deployment-time governance are covered in depth in the Security-as-Code & DevSecOps Governance case study.

Key Focus Areas:

Policy-as-Code & Cloud Governance
Multi-Subscription Compliance Management
Event-Driven Remediation Orchestration
Azure Resource Graph Compliance Analytics
Environment-Aware Initiative Design
Executive Compliance Reporting

Executive Summary

Architected a cloud-native Policy-as-Code governance platform on Microsoft Azure enabling automated compliance enforcement, event-driven remediation, cross-subscription visibility, and executive reporting across Development, Test, and Production environments at enterprise scale.

The platform integrates Terraform-managed Azure Policy definitions and initiatives, GitOps-driven policy lifecycle governance, CI/CD-integrated deployment workflows, Azure Event Grid-triggered remediation orchestration through Logic Apps and Azure Automation Runbooks, Azure Resource Graph cross-subscription compliance querying, and Power BI executive compliance dashboards.

The design is differentiated from deployment-time security governance studies by its focus on operational compliance at scale, that is, what happens after policies are deployed across a large Azure estate: how violations are detected in real time, how remediation is automated without introducing operational instability, how compliance state is queried for hundreds of resources spread over multiple subscriptions, and how governance evidence is surfaced to executive stakeholders.

Business Drivers

As organisations expand Azure adoption across multiple subscriptions and environments, point-in-time compliance audits and manual policy management become operationally unsustainable. Compliance drift — where resources that were compliant at deployment gradually deviate through configuration changes, new resource deployments, or policy scope expansion — is the most common enterprise governance failure in large Azure estates.

This architecture was designed to address the enterprise governance requirements of organisations where existing approaches result in:

Compliance drift between environments: policy changes applied to production are not propagated to development and test environments, creating an inconsistent governance posture
Limited real-time visibility into policy violations: compliance state is known only at scheduled audit intervals rather than continuously
Slow manual remediation cycles: non-compliant resources are identified in audits but remediated through manual operational tickets, extending exposure windows
Weak integration between governance controls and infrastructure delivery: policies are applied after infrastructure is deployed rather than governed through the same delivery lifecycle
Difficulty scaling governance across multiple subscriptions: manual policy management across dozens of subscriptions creates inconsistency and coverage gaps
Manual collection of compliance evidence: audit responses are built from portal exports rather than from continuously maintained, queryable compliance state

Operational Constraints

The architecture was designed to operate within the following constraints typical of enterprise multi-subscription Azure governance environments:

Governance controls must be consistent across Development, Test, and Production environments but with environment-specific enforcement severity — development teams require operational flexibility that production cannot afford
Policy deployment workflows must integrate into CI/CD pipelines — governance changes must flow through the same review and approval process as infrastructure changes
Automated remediation must avoid operational instability — not all compliance violations should trigger immediate automated remediation; high-impact remediations require human approval
Azure Resource Graph queries must support cross-subscription compliance reporting — no single-subscription visibility model is adequate for enterprise estates
Compliance reporting must serve two audiences — technical operators requiring resource-level violation details and executive stakeholders requiring KPI-level governance posture visibility
Policy exceptions must be manageable at environment scope — development environments may legitimately require exemptions from controls mandatory in production
Multi-subscription governance must follow management group hierarchy — policies assigned at management group level propagate to child subscriptions consistently

Objectives

Design a management group hierarchy enabling consistent policy inheritance across Dev, Test, and Production subscription tiers
Develop environment-specific policy initiatives with differentiated enforcement severity per environment tier
Automate policy lifecycle management through Terraform with GitOps governance and CI/CD deployment
Implement event-driven remediation architecture detecting violations in real time and triggering automated corrective actions
Design Azure Resource Graph queries providing cross-subscription compliance visibility beyond Azure Policy portal limitations
Build Power BI executive compliance dashboards and Azure Workbooks technical compliance dashboards
Define Mean Time to Remediation (MTTR) targets per violation severity — distinguishing automated from human-approved remediation paths
Establish compliance exemption governance — controlling and auditing policy exemptions across the enterprise estate

Management Group Hierarchy & Policy Inheritance

The management group hierarchy is the foundational governance design decision — policy assignments at management group level inherit to all child subscriptions automatically.

Tenant Root Group
└── Enterprise Management Group
    ├── Platform Management Group          ← Platform baseline policies
    │   ├── Identity Subscription
    │   └── Connectivity Subscription
    ├── Landing Zones Management Group     ← Workload governance policies
    │   ├── Production Management Group   ← Strict enforcement initiatives
    │   │   ├── Prod-Sub-01
    │   │   └── Prod-Sub-02
    │   ├── Test Management Group         ← Moderate enforcement initiatives
    │   │   └── Test-Sub-01
    │   └── Development Management Group  ← Flexible enforcement initiatives
    │       └── Dev-Sub-01
    └── Sandbox Management Group          ← Minimal governance — exploration only
        └── Sandbox-Sub-01

Policy Inheritance Design:

Policies assigned at Enterprise Management Group level apply to all subscriptions — foundational security controls with no environment exceptions
Environment-specific initiatives assigned at Production, Test, and Development management group levels — providing differentiated enforcement without duplicating universal controls
Sandbox subscriptions have minimal governance — intentional for innovation and exploration without compliance friction
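
The inheritance rules above can be illustrated with a small Python sketch. It models the hierarchy as a parent lookup and walks from a subscription up to the root to collect the initiatives that would apply. The dictionary layout is an assumption made for illustration, and the initiative names are borrowed from the Terraform examples in this study; Azure resolves inheritance itself, so this only demonstrates the logic.

```python
# Hypothetical in-memory model of the management group hierarchy.
# Keys are scopes, values are their parent scope (None for the root).
PARENT = {
    "Tenant Root Group": None,
    "Enterprise": "Tenant Root Group",
    "Platform": "Enterprise",
    "Landing Zones": "Enterprise",
    "Sandbox": "Enterprise",
    "Production": "Landing Zones",
    "Test": "Landing Zones",
    "Development": "Landing Zones",
    "Prod-Sub-01": "Production",
    "Prod-Sub-02": "Production",
    "Test-Sub-01": "Test",
    "Dev-Sub-01": "Development",
    "Sandbox-Sub-01": "Sandbox",
}

# Initiatives assigned directly at each scope (illustrative names).
ASSIGNMENTS = {
    "Enterprise": ["universal-security-baseline"],
    "Production": ["production-baseline"],
    "Test": ["test-baseline"],
    "Development": ["dev-baseline"],
}


def effective_assignments(scope: str) -> list[str]:
    """Return every initiative that applies to a scope, root-most first."""
    chain = []
    node = scope
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    result = []
    for ancestor in reversed(chain):
        result.extend(ASSIGNMENTS.get(ancestor, []))
    return result


if __name__ == "__main__":
    for sub in ("Prod-Sub-01", "Dev-Sub-01", "Sandbox-Sub-01"):
        print(sub, effective_assignments(sub))
```

In this model a sandbox subscription still inherits the universal baseline from the Enterprise Management Group, which matches the statement that Enterprise-level policies apply to all subscriptions.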

Environment-Aware Policy Initiative Design

Three-Tier Governance Model:

Control Category	Development	Test	Production
Public IP on VMs	Audit	Deny	Deny
Diagnostic settings	Audit	Audit	DeployIfNotExists
Resource tagging	Audit	Deny	Deny
TLS minimum version	Audit	Deny	Deny
Approved VM SKUs	Disabled	Audit	Deny
Storage HTTPS only	Audit	Deny	Deny
Key Vault soft delete	Audit	Deny	Deny
Approved locations	Disabled	Audit	Deny
MFA for management	Audit	Audit	Deny

Rationale for Environment Differentiation: Applying Deny effects to every control in development creates operational friction that slows development velocity without proportional security benefit; developers testing configurations there need room to iterate. Audit effects in development surface compliance awareness without blocking operations. Production uses Deny effects for security-critical controls (diagnostic settings are the exception, enforced through DeployIfNotExists), so non-compliant deployments are rejected regardless of operational convenience.
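
As a rough illustration, the three-tier matrix can be expressed as data and queried per environment. The sketch below copies the effects from the table; the function names `effect_for` and `blocking_controls` are hypothetical helpers, not part of the repository.

```python
# Effects per control and environment, transcribed from the three-tier table.
EFFECTS = {
    "Public IP on VMs":      {"Development": "Audit",    "Test": "Deny",  "Production": "Deny"},
    "Diagnostic settings":   {"Development": "Audit",    "Test": "Audit", "Production": "DeployIfNotExists"},
    "Resource tagging":      {"Development": "Audit",    "Test": "Deny",  "Production": "Deny"},
    "TLS minimum version":   {"Development": "Audit",    "Test": "Deny",  "Production": "Deny"},
    "Approved VM SKUs":      {"Development": "Disabled", "Test": "Audit", "Production": "Deny"},
    "Storage HTTPS only":    {"Development": "Audit",    "Test": "Deny",  "Production": "Deny"},
    "Key Vault soft delete": {"Development": "Audit",    "Test": "Deny",  "Production": "Deny"},
    "Approved locations":    {"Development": "Disabled", "Test": "Audit", "Production": "Deny"},
    "MFA for management":    {"Development": "Audit",    "Test": "Audit", "Production": "Deny"},
}


def effect_for(control: str, environment: str) -> str:
    """Look up the policy effect for one control in one environment."""
    return EFFECTS[control][environment]


def blocking_controls(environment: str) -> list[str]:
    """Controls that reject a deployment outright (Deny) in the environment."""
    return sorted(c for c, per_env in EFFECTS.items() if per_env[environment] == "Deny")


if __name__ == "__main__":
    for env in ("Development", "Test", "Production"):
        print(env, len(blocking_controls(env)), "blocking controls")
```

Keeping the matrix in one structure like this is one way to feed the same values into each environment's initiative parameters, so the table and the deployed effects cannot drift apart.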

Architecture Overview

The solution is structured as a seven-layer enterprise governance platform integrating policy definition and IaC, GitOps governance, CI/CD automation, compliance enforcement, event-driven remediation, monitoring and analytics, and executive reporting.

1. Policy Definition & Infrastructure-as-Code Layer

All governance definitions are managed as Terraform code — version-controlled, peer-reviewed, and deployed through CI/CD pipelines.

Terraform Module Structure:

governance/
├── modules/
│   ├── policy-definition/      # Custom policy definition module
│   ├── policy-initiative/      # Initiative (policy set) module
│   ├── policy-assignment/      # Assignment at MG/subscription scope
│   ├── policy-exemption/       # Exemption management with expiry
│   └── remediation-task/       # Remediation task creation
├── initiatives/
│   ├── production-baseline/    # Production strict initiative
│   ├── test-baseline/          # Test moderate initiative
│   ├── dev-baseline/           # Dev flexible initiative
│   └── platform-universal/     # Universal controls — all environments
├── definitions/
│   ├── network/                # Network security policy definitions
│   ├── identity/               # Identity governance definitions
│   ├── data-protection/        # Encryption and data policies
│   └── operational/            # Tagging, diagnostics, monitoring
└── assignments/
    ├── enterprise-mg.tf        # Enterprise MG universal assignments
    ├── production-mg.tf        # Production MG strict assignments
    ├── test-mg.tf              # Test MG moderate assignments
    └── dev-mg.tf               # Dev MG flexible assignments

Example Custom Policy Definition — Terraform:

hcl

resource "azurerm_policy_definition" "require_diagnostic_settings" {
  name         = "require-diagnostic-settings-storage"
  policy_type  = "Custom"
  mode         = "Indexed"
  display_name = "Deploy diagnostic settings for Storage Accounts"

  metadata = jsonencode({
    category = "Monitoring"
    version  = "1.2.0"
  })

  parameters = jsonencode({
    logAnalyticsWorkspaceId = {
      type     = "String"
      metadata = { displayName = "Log Analytics Workspace ID" }
    }
  })

  policy_rule = jsonencode({
    if = {
      field  = "type"
      equals = "Microsoft.Storage/storageAccounts"
    }
    then = {
      effect = "DeployIfNotExists"
      details = {
        type = "Microsoft.Insights/diagnosticSettings"
        roleDefinitionIds = [
          "/providers/Microsoft.Authorization/roleDefinitions/b24988ac-6180-42a0-ab88-20f7382dd24c"
        ]
        deployment = {
          properties = {
            # ... diagnostic settings deployment template ...
          }
        }
      }
    }
  })
}

2. GitOps Governance Layer

Git repositories serve as the authoritative source of truth for all governance definitions — policy changes require pull request review and approval before reaching any environment.

Branch Strategy for Governance:

main              → Production policy assignments (approval required)
test              → Test environment policies (approval required)
develop           → Development policies (automated)
feature/policy-*  → New policy definition development
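
A pipeline could resolve its target environment from the branch name along the lines of the sketch below. The rule table mirrors the branch strategy above; the function name `resolve_target` and the choice to return no environment for feature branches (validate only, no deployment) are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# (branch pattern, target environment, approval required before apply)
BRANCH_RULES = [
    ("main", "production", True),
    ("test", "test", True),
    ("develop", "development", False),
    ("feature/policy-*", None, False),  # assumed: validated but never deployed
]


def resolve_target(branch: str):
    """Return (environment, needs_approval) for a branch name."""
    for pattern, environment, needs_approval in BRANCH_RULES:
        if fnmatchcase(branch, pattern):
            return environment, needs_approval
    raise ValueError(f"No governance rule for branch {branch!r}")


if __name__ == "__main__":
    print(resolve_target("main"))
    print(resolve_target("feature/policy-require-tags"))
```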

Pull Request Governance Requirements:

Policy definition changes require review from at least one governance team member
Production policy assignment changes require review from two approvers — governance team and security team
All PR checks must pass: terraform validate, a tfsec scan of the policy definitions, and review of the terraform plan output posted as a PR comment
Policy definition changes include impact assessment — identifying which existing resources would be affected by the new policy

3. CI/CD Automation Layer

Governance deployment pipelines enforce validation, planning, approval, and deployment stages independently per environment tier.

Pipeline Stage Architecture:

yaml

stages:
  - stage: ValidateGovernance
    jobs:
      - job: PolicyValidation
        steps:
          - script: terraform validate
          - script: tfsec . --minimum-severity MEDIUM
          - script: |
              # Custom policy definition schema validation
              python scripts/validate_policy_json.py
          - script: |
              # Check for policy assignments without corresponding definitions
              python scripts/validate_assignment_references.py

  - stage: PlanGovernance
    dependsOn: ValidateGovernance
    jobs:
      - job: TerraformPlan
        steps:
          - script: terraform plan -out=governance.tfplan
          - script: |
              # Generate human-readable policy impact report
              python scripts/generate_impact_report.py governance.tfplan
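          - publish: governance.tfplan
            artifact: governance-plan

  - stage: ApproveProduction
    dependsOn: PlanGovernance
    condition: eq(variables['targetEnvironment'], 'production')
    jobs:
      - deployment: GovernanceApproval
        environment: governance-production   # Two-approver gate

  - stage: DeployGovernance
    dependsOn: ApproveProduction
    jobs:
      - job: TerraformApply
        steps:
          - download: current
            artifact: governance-plan
          - script: |
              # Assumes providers and backend are initialised in this job (terraform init)
              terraform apply -input=false "$(Pipeline.Workspace)/governance-plan/governance.tfplan"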

Policy Impact Report Generation: Before production governance changes are approved, an automated impact report is generated showing how many existing resources would be affected by each policy change — enabling approvers to make informed decisions about production policy deployment timing and potential operational impact.
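
A first step of such a report might look like the sketch below, which summarises policy-related changes from the JSON form of a saved plan (produced with `terraform show -json governance.tfplan`). It only counts which policy objects the plan would create, update, replace, or delete; estimating how many existing Azure resources each change affects would need an additional compliance query and is not shown. The function names and the list of resource types are assumptions, not the contents of `scripts/generate_impact_report.py`.

```python
import json
from collections import defaultdict

# azurerm resource types treated as governance objects in this sketch.
POLICY_TYPES = {
    "azurerm_policy_definition",
    "azurerm_policy_set_definition",
    "azurerm_management_group_policy_assignment",
    "azurerm_subscription_policy_assignment",
    "azurerm_resource_group_policy_assignment",
    "azurerm_resource_policy_exemption",
}


def classify(actions):
    """Collapse a Terraform action list into a single label."""
    if actions in (["no-op"], ["read"]):
        return None
    if "create" in actions and "delete" in actions:
        return "replace"
    return actions[0]


def summarize_policy_changes(plan: dict) -> dict:
    """Count planned changes per policy resource type and action."""
    summary = defaultdict(lambda: defaultdict(int))
    for change in plan.get("resource_changes", []):
        if change["type"] not in POLICY_TYPES:
            continue
        label = classify(change["change"]["actions"])
        if label is not None:
            summary[change["type"]][label] += 1
    return {rtype: dict(counts) for rtype, counts in summary.items()}


def load_plan(path: str) -> dict:
    """Read the output of `terraform show -json <planfile>` from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A pipeline step could call `summarize_policy_changes(load_plan("plan.json"))` and post the result as a PR comment, giving approvers a compact view before the production gate.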

4. Compliance Enforcement Layer

Azure Policy enforces governance controls across the management group hierarchy with environment-appropriate effects.

Universal Controls — All Environments:

hcl

# Applied at Enterprise Management Group — no exceptions
resource "azurerm_management_group_policy_assignment" "universal_baseline" {
  name                 = "universal-security-baseline"
  management_group_id  = data.azurerm_management_group.enterprise.id
  policy_definition_id = azurerm_policy_set_definition.universal_baseline.id

  # Controls that apply everywhere without exception:
  # - Deny MMA legacy agent (deprecated)
  # - Require HTTPS on storage
  # - Deny classic resources
  # - Require Key Vault soft delete
  # - Deny public IP on management subnets
}

Compliance Exemption Governance:

hcl

# Time-limited exemption — requires justification and expiry date
resource "azurerm_resource_policy_exemption" "dev_public_ip_exemption" {
  name                 = "dev-public-ip-testing-exemption"
  resource_id          = azurerm_virtual_machine.dev_test_vm.id
  policy_assignment_id = azurerm_management_group_policy_assignment.deny_public_ip.id
  exemption_category   = "Waiver"
  expires_on           = "2025-12-31T00:00:00Z"    # Mandatory expiry — no permanent exemptions

  description = "Temporary exemption for public IP testing in dev environment. JIRA-1234. Expires 2025-12-31."
}

All exemptions are version-controlled, require justification comments, have mandatory expiry dates, and are reviewed quarterly — permanent exemptions are not permitted.
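
The expiry rule lends itself to an automated check. The sketch below flags exemptions that have no expiry, have already expired, or expire within a warning window. The input shape (a list of dictionaries with `name` and `expires_on`) and the 30-day window are assumptions; in practice the records could come from the Terraform state or from an exemption listing.

```python
from datetime import datetime, timedelta, timezone


def audit_exemptions(exemptions, now, warn_within_days=30):
    """Return (name, finding) pairs for exemptions that need attention."""
    findings = []
    for exemption in exemptions:
        expires_on = exemption.get("expires_on")
        if not expires_on:
            findings.append((exemption["name"], "missing expiry"))
            continue
        # Accept the trailing "Z" used in the Terraform example.
        expiry = datetime.fromisoformat(expires_on.replace("Z", "+00:00"))
        if expiry <= now:
            findings.append((exemption["name"], "expired"))
        elif expiry - now <= timedelta(days=warn_within_days):
            findings.append((exemption["name"], "expiring soon"))
    return findings


if __name__ == "__main__":
    sample = [
        {"name": "dev-public-ip-testing-exemption", "expires_on": "2025-12-31T00:00:00Z"},
        {"name": "no-expiry-example", "expires_on": None},
    ]
    print(audit_exemptions(sample, now=datetime.now(timezone.utc)))
```

Running a check like this in the validation stage would turn the "no permanent exemptions" rule into a failing build rather than a quarterly review finding.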

5. Event-Driven Remediation Layer

The remediation layer detects compliance violations in real time and orchestrates automated corrective actions — with human approval gates for high-impact remediations.

Remediation Architecture Flow:

Azure Policy violation detected
         ↓
Azure Event Grid (Policy compliance state change event)
         ↓
Event Grid subscription filters by violation severity
         ↓
         ├── HIGH/CRITICAL → Azure Logic App (approval workflow)
         │         ↓
         │   Approval notification to governance team
         │         ↓
         │   Human approves/rejects remediation
         │         ↓
         │   Approved → Azure Automation Runbook executes remediation
         │
         └── MEDIUM/LOW → Azure Function (automated remediation)
                   ↓
           Automated corrective action executed
           without human intervention
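
The routing decision in the flow above can be sketched as follows. Policy state events do not themselves carry a severity, so this sketch assumes a lookup table that maps policy definition names to a severity. The table contents, the route labels, and the choice to treat unknown policies as HIGH are all illustrative assumptions.

```python
# Hypothetical severity classification per policy definition name.
SEVERITY_BY_POLICY = {
    "deny-public-ip-vm": "MEDIUM",
    "require-diagnostic-settings-storage": "LOW",
    "require-environment-tag": "LOW",          # hypothetical name
    "deny-unencrypted-storage": "CRITICAL",    # hypothetical name
}

APPROVAL_ROUTE = "logic-app-approval"
AUTOMATED_ROUTE = "function-auto-remediation"


def route_violation(policy_name: str) -> str:
    """Pick the remediation path for a non-compliant policy evaluation."""
    # Unknown policies default to HIGH so they always get human review.
    severity = SEVERITY_BY_POLICY.get(policy_name, "HIGH")
    if severity in {"HIGH", "CRITICAL"}:
        return APPROVAL_ROUTE
    return AUTOMATED_ROUTE


if __name__ == "__main__":
    for name in ("deny-public-ip-vm", "deny-unencrypted-storage", "unlisted-policy"):
        print(name, "->", route_violation(name))
```

Defaulting unclassified policies to the approval path keeps new definitions from being auto-remediated before anyone has assessed their operational impact.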

Concrete Remediation Example — Public IP Detected on VM:

python

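# Azure Function triggered by an Event Grid policy compliance event.
# NOTE: azure_client, log_remediation_action and send_notification are
# project-level helpers that are not shown here. The event field names
# follow this example's assumed payload shape and should be checked
# against the Event Grid policy state schema in use.
from datetime import datetime, timezone


def remediate_public_ip_violation(event):
    resource_id = event['data']['resourceId']
    violation_type = event['data']['policyDefinitionName']

    # Only act on the policy this function is responsible for
    if violation_type != "deny-public-ip-vm":
        return

    # Capture one timestamp so the log entry and the notification agree
    remediated_at = datetime.now(timezone.utc)

    # Get NIC associated with the VM
    nic = azure_client.network.get_nic_from_vm(resource_id)

    # Remove the public IP association from every IP configuration on the NIC
    for ip_config in nic.ip_configurations:
        ip_config.public_ip_address = None
    azure_client.network.update_nic(nic)

    # Log remediation action
    log_remediation_action(
        resource_id=resource_id,
        action="removed_public_ip",
        triggered_by="automated_policy_remediation",
        timestamp=remediated_at
    )

    # Notify governance team
    send_notification(
        subject=f"Auto-remediated: Public IP removed from {resource_id}",
        body=f"Policy violation auto-remediated at {remediated_at.isoformat()}"
    )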

Remediation Decision Matrix:

Violation Type	Severity	Remediation Path	Human Approval	Target MTTR
Public IP on workload VM	High	Automated Function	No	5 minutes
Missing diagnostic settings	Medium	DeployIfNotExists Policy	No	30 minutes
Missing resource tags	Low	Automated Function	No	15 minutes
Non-compliant VM SKU	High	Logic App workflow	Yes	4 hours
Public storage blob access	Critical	Automated Function	No	2 minutes
Missing NSG on subnet	High	Logic App workflow	Yes	2 hours
Encryption disabled	Critical	Logic App workflow	Yes	1 hour

Why Automated and Approved Remediation Are Separated: Automatically remediating every violation regardless of impact risks operational disruption. Removing a public IP from a VM that a team intentionally exposed for legitimate testing, for example, breaks their workflow without notice. Automated remediation is therefore appropriate only where the corrective action has low operational impact and the non-compliant configuration has no plausible legitimate use. High-impact remediations route through approval workflows, so that human judgement is applied before irreversible or operationally disruptive actions are taken.
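
The decision matrix above can be expressed as a small lookup table that a dispatcher consults before acting. The following is a minimal sketch, not the platform's actual routing code: the violation slugs, the `RemediationRule` type and the conservative default for unknown violation types are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemediationRule:
    severity: str
    path: str                 # "function", "policy" or "logic_app"
    needs_approval: bool
    target_mttr_minutes: int


# Rows transcribed from the decision matrix; the keys are hypothetical slugs.
DECISION_MATRIX = {
    "public-ip-on-vm": RemediationRule("High", "function", False, 5),
    "missing-diagnostic-settings": RemediationRule("Medium", "policy", False, 30),
    "missing-resource-tags": RemediationRule("Low", "function", False, 15),
    "non-compliant-vm-sku": RemediationRule("High", "logic_app", True, 240),
    "public-storage-blob-access": RemediationRule("Critical", "function", False, 2),
    "missing-nsg-on-subnet": RemediationRule("High", "logic_app", True, 120),
    "encryption-disabled": RemediationRule("Critical", "logic_app", True, 60),
}

# Assumption: anything not in the matrix goes to a human rather than being auto-fixed.
DEFAULT_RULE = RemediationRule("Unknown", "logic_app", True, 240)


def route(violation_type: str) -> RemediationRule:
    """Return the remediation rule for a violation type."""
    return DECISION_MATRIX.get(violation_type, DEFAULT_RULE)
```

Keeping the matrix as data rather than branching logic means a change of remediation path is a one-line, reviewable diff, which fits the pull-request workflow used for the policies themselves.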

6. Monitoring & Analytics Layer

Compliance analytics use Azure Resource Graph for cross-subscription querying. It provides custom, estate-wide compliance aggregation that the Azure Policy portal's built-in compliance views are not designed to deliver.

Azure Resource Graph — Cross-Subscription Compliance Queries:

kusto

// Non-compliant resources across all subscriptions — grouped by policy
PolicyResources
| where type == "microsoft.policyinsights/policystates"
| where properties.complianceState == "NonCompliant"
| summarize
    NonCompliantCount = count(),
    AffectedSubscriptions = dcount(subscriptionId)
    by PolicyDefinitionName = tostring(properties.policyDefinitionName),
       ResourceType = tostring(properties.resourceType)
| order by NonCompliantCount desc
| take 20

kusto

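// Compliance rate by day over the last 30 days, for detecting drift
// Note: policy states in Resource Graph reflect the latest evaluation of each
// resource, so each bin is keyed on when resources were last evaluated.
// Persist periodic snapshots if a true historical trend is required.
PolicyResources
| where type == "microsoft.policyinsights/policystates"
| where todatetime(properties.timestamp) > ago(30d)
| summarize
    CompliantCount = countif(properties.complianceState == "Compliant"),
    NonCompliantCount = countif(properties.complianceState == "NonCompliant")
    by Day = bin(todatetime(properties.timestamp), 1d)
| project Day, ComplianceRate = (CompliantCount * 100.0) / (CompliantCount + NonCompliantCount)
| order by Day asc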

kusto

// Resources without required tags — cross-subscription
Resources
| where isnull(tags.environment) or isnull(tags.owner) or isnull(tags.cost_centre)
| project
    ResourceId = id,
    ResourceType = type,
    Subscription = subscriptionId,
    MissingTags = pack(
        "environment", isnull(tags.environment),
        "owner", isnull(tags.owner),
        "cost_centre", isnull(tags.cost_centre)
    )
| order by Subscription asc

Azure Monitor — Governance Telemetry:

Azure Policy compliance state change events logged to Log Analytics
Remediation task execution results — success, failure, and partial remediation outcomes
Policy assignment deployment events from CI/CD pipeline
Alert rules for compliance rate degradation — alerting when subscription compliance rate drops below defined threshold

7. Visualisation & Reporting Layer

Compliance reporting serves two audiences through separate visualisation tools — technical operators and executive governance stakeholders.

Azure Workbooks — Technical Compliance Dashboard:

Per-subscription compliance rate by policy initiative
Non-compliant resource inventory with drill-down to resource-level violation details
Remediation task status — pending, in-progress, completed, failed
Policy assignment coverage map — which initiatives are assigned to which management groups
Recent policy deployment history from CI/CD pipeline

Power BI — Executive Compliance Dashboard:

Report	Audience	Refresh Frequency	Purpose
Enterprise Compliance Score	Executive / CISO	Daily	Overall governance posture KPI
Environment Compliance Comparison	Governance team	Daily	Dev/Test/Prod compliance rate comparison
Compliance Trend	Executive / Governance	Weekly	90-day compliance rate trend
Top Violations	Governance team	Daily	Most frequent policy violations requiring attention
Remediation Performance	Operations	Daily	MTTR by violation type vs targets
Exemption Register	Compliance/Audit	Weekly	Active exemptions with expiry dates

Compliance Score Methodology:

Enterprise Compliance Score = 
  (Compliant Resources / Total Assessed Resources) × 100

Weighted Score (optional) = 
  Σ (Policy Weight × Compliance Rate per Policy) / Σ Policy Weights

Where policy weights reflect regulatory significance:
  - PCI DSS controls: weight 3
  - Security baseline controls: weight 2
  - Operational controls: weight 1
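
As a worked illustration of the two formulas above, the sketch below computes both scores in Python. The category keys (`pci_dss`, `security_baseline`, `operational`) and the function names are illustrative assumptions; only the weights 3, 2 and 1 come from the methodology.

```python
# Weights from the methodology above; the category keys are hypothetical names.
CATEGORY_WEIGHTS = {"pci_dss": 3, "security_baseline": 2, "operational": 1}


def enterprise_score(compliant: int, total_assessed: int) -> float:
    """Unweighted score: compliant resources as a percentage of assessed resources."""
    if total_assessed == 0:
        return 0.0
    return compliant * 100 / total_assessed


def weighted_score(policies) -> float:
    """Weighted score over (category, compliance_rate_percent) pairs."""
    weighted_sum = 0.0
    weight_total = 0
    for category, rate in policies:
        weight = CATEGORY_WEIGHTS[category]
        weighted_sum += weight * rate
        weight_total += weight
    return weighted_sum / weight_total if weight_total else 0.0


example = [("pci_dss", 90.0), ("security_baseline", 80.0), ("operational", 100.0)]
# (3*90 + 2*80 + 1*100) / (3 + 2 + 1) = 530 / 6
print(round(weighted_score(example), 2))
```

In this example the weighted score is lower than the simple average of the three rates (90), because the weakest rate among the heavily weighted controls pulls the result down. That is the intended effect of weighting by regulatory significance.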

Technologies Used

Category	Technologies
Infrastructure as Code	Terraform
CI/CD & GitOps	GitHub Actions, Azure DevOps, YAML Pipelines
Policy Governance	Azure Policy, Azure Initiative Definitions, Management Groups
Cross-Subscription Analytics	Azure Resource Graph, KQL
Event-Driven Remediation	Azure Event Grid, Logic Apps, Azure Functions, Azure Automation Runbooks
Monitoring	Azure Monitor, Log Analytics
Reporting	Power BI, Azure Workbooks
Security & Compliance	Microsoft Defender for Cloud, Azure RBAC, Azure Key Vault
Compliance Frameworks	CIS Azure Benchmark v2.0, NIST SP 800-53, PCI DSS v4.0

Key Challenges Addressed

Maintaining policy consistency across multiple subscriptions — addressed through management group hierarchy policy inheritance — universal controls assigned at enterprise management group level propagate to all subscriptions automatically without per-subscription assignment management.

Integrating governance into CI/CD without slowing delivery: addressed through environment-tiered deployment pipelines, where development governance changes deploy automatically while production governance changes require a two-approver gate. This maintains delivery velocity in lower environments without sacrificing production governance rigour.

Providing real-time cross-subscription compliance visibility: addressed through Azure Resource Graph KQL queries that aggregate compliance state across all subscriptions in a single query, a level of custom aggregation that the Azure Policy portal's built-in compliance views do not offer.

Automating remediation without introducing instability — addressed through remediation decision matrix separating automated low-impact remediations from human-approved high-impact actions — automated remediation applies only where corrective actions have no plausible legitimate business use case.

Supporting environment-specific governance flexibility — addressed through three-tier initiative design using Audit effects in development and Deny effects in production — same policy definitions, different enforcement severity per environment management group.

Scaling exemption governance — addressed through Terraform-managed exemptions with mandatory expiry dates, justification requirements, and version-controlled audit trail — preventing exemption accumulation that gradually erodes compliance posture.

Design Decisions & Rationale

Management Group Hierarchy as the Policy Distribution Foundation : Assigning policies at individual subscription level creates management overhead that scales linearly with subscription count. Management group hierarchy enables hierarchical inheritance — universal controls assigned once at the top propagate automatically to all child subscriptions. Environment-specific initiatives assigned at environment management group level apply consistently to all subscriptions within that environment tier without per-subscription configuration.

Environment-Aware Initiative Design with Effect Differentiation : Uniform Deny effects across all environments block legitimate development activities, because developers testing configurations need flexibility that production cannot permit. The three-tier initiative design maps enforcement severity to operational risk: development receives Audit awareness without operational blocking, while production receives Deny enforcement without exception. The same policy definitions serve all environments with environment-specific effect parameters.

Event-Driven Remediation over Scheduled Remediation : Scheduled remediation runs (e.g. daily compliance remediation jobs) leave non-compliant resources exposed for the interval between runs. Event Grid-triggered remediation responds to compliance state changes in near real time — reducing the exposure window from hours to minutes for automated remediations and providing immediate notification for human-approved remediations.

Azure Resource Graph over Azure Policy Portal for Compliance Analytics : The Azure Policy compliance portal presents fixed, scope-by-scope compliance views, which are inadequate for custom reporting across enterprise estates spanning dozens of subscriptions. Azure Resource Graph queries execute across all accessible subscriptions in a single call, enabling cross-subscription compliance aggregation, trend analysis, and KQL-based custom compliance reporting that the portal does not provide.

Mandatory Exemption Expiry Dates : Permanent policy exemptions accumulate over time as environments evolve — exempt resources become forgotten compliance gaps. Mandatory expiry dates on all exemptions through Terraform enforcement ensure exemptions are reviewed and either renewed with justification or removed when the underlying business need expires. Quarterly exemption review processes validate that active exemptions remain justified.
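
The quarterly review described above could be supported by a simple check over the exemption register. This is a sketch under stated assumptions: the `Exemption` record, the finding names and the 30-day warning window are hypothetical and are not taken from the platform's Terraform modules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Exemption:
    name: str
    justification: str
    expires_on: Optional[datetime]


def review_exemptions(exemptions, now, warn_within=timedelta(days=30)):
    """Group exemption names by the governance problem they exhibit."""
    findings = {
        "missing_expiry": [],
        "missing_justification": [],
        "expired": [],
        "expiring_soon": [],
    }
    for exemption in exemptions:
        if not exemption.justification.strip():
            findings["missing_justification"].append(exemption.name)
        if exemption.expires_on is None:
            findings["missing_expiry"].append(exemption.name)
        elif exemption.expires_on <= now:
            findings["expired"].append(exemption.name)
        elif exemption.expires_on - now <= warn_within:
            findings["expiring_soon"].append(exemption.name)
    return findings
```

A check of this kind could run as a pipeline step, failing the build on `missing_expiry` or `expired` findings and only warning on `expiring_soon`.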

Separation of Enforcement and Remediation Layers : Combining Azure Policy enforcement and remediation in a single workflow creates risk — a misconfigured remediation action could cascade across large numbers of resources simultaneously. Separating enforcement (Azure Policy — detects violations) from remediation (Event Grid → Functions/Runbooks — corrects violations) enables independent testing, independent failure modes, and granular control over which violations trigger automated vs human-approved remediation.

Trade-offs & Design Constraints

Azure Policy DeployIfNotExists Remediation Timing Gap : The DeployIfNotExists effect deploys its configuration asynchronously, after the resource has been created or updated and then evaluated, and existing non-compliant resources are only corrected when a remediation task runs. There is therefore a window between resource deployment and remediation completion in which resources exist without the required configuration. For compliance-critical controls (diagnostic settings, encryption), Terraform should configure these settings explicitly rather than relying on Policy remediation. Policy remediation should serve as a backstop for resources deployed outside IaC governance, not as the primary configuration mechanism for IaC-managed resources.

Event Grid Compliance Event Volume at Scale : In large Azure estates with frequent resource changes, Azure Policy generates high volumes of compliance state change events. Event Grid handles high throughput but downstream Logic Apps and Azure Functions must be designed for concurrent execution — a compliance event storm following a large deployment could trigger thousands of simultaneous remediation events. Rate limiting, dead letter queuing, and idempotent remediation function design are essential for production remediation reliability.
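
To make the idempotency and dead-lettering requirement concrete, the sketch below deduplicates events by a key derived from the resource and policy, retries a bounded number of times, and parks failures in a dead-letter list. The `RemediationDispatcher` class is hypothetical, and the in-memory set and list stand in for the durable store and dead-letter queue a production deployment would need.

```python
import hashlib


def idempotency_key(resource_id: str, policy_definition_id: str) -> str:
    """Stable key so duplicate events for the same violation collapse to one action."""
    raw = f"{resource_id.lower()}|{policy_definition_id.lower()}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class RemediationDispatcher:
    def __init__(self, remediate, max_attempts=3):
        self._remediate = remediate        # callable taking a resource ID
        self._max_attempts = max_attempts
        self._done = set()                 # stand-in for a durable store
        self.dead_letters = []             # stand-in for a dead-letter queue

    def handle(self, resource_id, policy_definition_id):
        key = idempotency_key(resource_id, policy_definition_id)
        if key in self._done:
            return "skipped"               # already remediated, do nothing

        last_error = None
        for _ in range(self._max_attempts):
            try:
                self._remediate(resource_id)
            except Exception as exc:       # sketch only: catch broadly and retry
                last_error = exc
            else:
                self._done.add(key)
                return "remediated"

        # Retries exhausted: park the event for later inspection
        self.dead_letters.append((resource_id, policy_definition_id, repr(last_error)))
        return "dead_lettered"
```

Because the key is recorded only after a successful remediation, a redelivered event for a failed attempt is retried rather than silently dropped.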

Resource Graph Query Throttling : Azure Resource Graph throttles queries per user; by default a user can send at most 15 queries in every 5-second window. Power BI dashboards that refresh compliance data through Resource Graph queries must therefore cache query results and schedule refreshes to stay within that quota. Direct Power BI → Resource Graph integration without caching creates throttling risk at enterprise scale.
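
One way to combine result caching with client-side pacing is sketched below. It assumes a quota of 15 queries per 5-second window and a 15-minute cache lifetime; the `ThrottledCachedQuery` class and the injected `run_query` callable are hypothetical and stand in for whatever client actually calls Resource Graph.

```python
import time
from collections import deque


class ThrottledCachedQuery:
    def __init__(self, run_query, max_calls=15, window_seconds=5.0,
                 ttl_seconds=900.0, clock=time.monotonic, sleep=time.sleep):
        self._run_query = run_query
        self._max_calls = max_calls
        self._window = window_seconds
        self._ttl = ttl_seconds
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()   # start times of recent uncached queries
        self._cache = {}        # query text -> (stored_at, result)

    def query(self, kql):
        cached = self._cache.get(kql)
        if cached is not None and self._clock() - cached[0] < self._ttl:
            return cached[1]    # fresh enough: no call to the backend
        self._wait_for_slot()
        result = self._run_query(kql)
        self._cache[kql] = (self._clock(), result)
        return result

    def _wait_for_slot(self):
        while True:
            now = self._clock()
            # Forget calls that have left the sliding window
            while self._calls and now - self._calls[0] >= self._window:
                self._calls.popleft()
            if len(self._calls) < self._max_calls:
                self._calls.append(now)
                return
            # Window is full: wait until the oldest call expires
            self._sleep(self._window - (now - self._calls[0]))
```

Injecting `clock` and `sleep` keeps the pacing logic testable without real delays. A shared cache in front of the dashboards would serve the same purpose at larger scale.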

Terraform Policy State Import Complexity : Importing existing manually configured Azure Policy definitions and assignments into Terraform state requires careful attribute mapping — policy rule JSON in existing definitions must exactly match Terraform resource attribute structure. Mismatches generate plan drift requiring careful reconciliation. A discovery-first approach — using Azure CLI to export existing policy definitions before writing Terraform resources — reduces import complexity.

Remediation Identity Permissions Scope : Azure Automation Runbooks and Azure Functions executing remediation actions require Azure RBAC permissions — typically Contributor on affected resource groups. Overly broad remediation identity permissions create risk if the remediation service is compromised. Permissions should be scoped to the minimum required for each remediation action type — separate managed identities per remediation function with purpose-specific role assignments rather than a single broadly-scoped remediation identity.

Projected Outcomes

The architecture is designed to deliver the following governance and operational outcomes in a production enterprise environment:

Consistent policy enforcement across all subscriptions through management group hierarchy inheritance — universal controls applied without per-subscription configuration management
Environment-appropriate governance through three-tier initiative design — development flexibility and production strictness enforced through the same policy definitions with differentiated effects
Near real-time compliance violation detection and automated remediation for defined violation categories through Event Grid-triggered orchestration
Cross-subscription compliance visibility through Azure Resource Graph KQL queries — enterprise-wide compliance posture queryable on demand
Executive governance reporting through Power BI dashboards with daily compliance score, trend analysis, and MTTR performance tracking
Auditable governance lifecycle through Terraform-managed policy definitions, GitOps version control, and CI/CD deployment history
Controlled exemption governance through mandatory expiry dates, justification requirements, and quarterly review processes preventing exemption accumulation

Future Evolution

OPA/Gatekeeper integration for Kubernetes workload governance — extending Policy-as-Code governance to AKS admission control through the same GitOps governance model
AI-assisted compliance anomaly detection — identifying unusual compliance degradation patterns indicating potential security incidents rather than routine configuration drift
Automated risk scoring and violation prioritisation — weighting compliance violations by asset criticality and regulatory impact for intelligent remediation sequencing
Cross-cloud governance federation — extending management group-equivalent governance patterns to AWS Organizations and GCP Resource Manager through Terraform multi-cloud provider management
Continuous compliance validation pipelines — scheduled Resource Graph compliance scans triggering pipeline alerts when compliance rate drops below defined thresholds
Self-healing remediation workflows — expanding automated remediation coverage as remediation patterns are proven stable through operational experience
FinOps governance integration — Azure Policy enforcement of cost governance controls (approved VM SKUs, required shutdown schedules, resource lifecycle tagging) through the same governance platform
Security posture benchmarking automation — automated CIS Azure Benchmark and NIST SP 800-53 compliance scoring through Defender for Cloud regulatory compliance integration

Key Takeaways

Management group hierarchy is the foundational Policy-as-Code design decision — policy inheritance eliminates per-subscription management overhead that scales unsustainably with subscription count
Environment-aware initiative design with effect differentiation is essential — uniform Deny enforcement across all environments blocks legitimate development activities; the same policy definitions should serve all environments with environment-specific effect parameters
Event-driven remediation dramatically reduces compliance violation exposure windows compared to scheduled remediation — real-time response versus hourly or daily remediation cycles
Azure Resource Graph is the correct tool for cross-subscription compliance analytics: the Azure Policy portal's built-in compliance views do not support the custom, enterprise-scale aggregation that KQL queries provide
Automated remediation must be bounded by a decision matrix — not all violations should trigger automated corrective actions; high-impact remediations require human approval to prevent operational disruption
Exemption governance requires mandatory expiry enforcement — permanent exemptions accumulate into compliance debt; Terraform-managed exemptions with expiry dates prevent this erosion
Separation of enforcement and remediation layers enables independent failure modes — Azure Policy detecting violations and Functions/Runbooks correcting them can be tested, operated, and failed independently

Executive Summary

Business Drivers

This architecture was designed to address the enterprise governance requirements of organisations where existing approaches result in:

Compliance drift between environments — policy changes applied to production not propagated to development and test environments creating inconsistent governance posture
Limited real-time visibility into policy violations — compliance state only known at scheduled audit intervals rather than continuously
Slow manual remediation cycles — non-compliant resources identified in audits but remediated through manual operational tickets extending exposure windows
Weak integration between governance controls and infrastructure delivery — policies applied after infrastructure is deployed rather than governed through the same delivery lifecycle
Difficulty scaling governance across multiple subscriptions — manual policy management across dozens of subscriptions creates inconsistency and coverage gaps
Compliance evidence requiring manual collection — audit responses built from portal exports rather than continuously maintained and queryable compliance state

Operational Constraints

The architecture was designed to operate within the following constraints typical of enterprise multi-subscription Azure governance environments:

Governance controls must be consistent across Development, Test, and Production environments but with environment-specific enforcement severity — development teams require operational flexibility that production cannot afford
Policy deployment workflows must integrate into CI/CD pipelines — governance changes must flow through the same review and approval process as infrastructure changes
Automated remediation must avoid operational instability — not all compliance violations should trigger immediate automated remediation; high-impact remediations require human approval
Azure Resource Graph queries must support cross-subscription compliance reporting — no single-subscription visibility model is adequate for enterprise estates
Compliance reporting must serve two audiences — technical operators requiring resource-level violation details and executive stakeholders requiring KPI-level governance posture visibility
Policy exceptions must be manageable at environment scope — development environments may legitimately require exemptions from controls mandatory in production
Multi-subscription governance must follow management group hierarchy — policies assigned at management group level propagate to child subscriptions consistently

Objectives

Design a management group hierarchy enabling consistent policy inheritance across Dev, Test, and Production subscription tiers
Develop environment-specific policy initiatives with differentiated enforcement severity per environment tier
Automate policy lifecycle management through Terraform with GitOps governance and CI/CD deployment
Implement event-driven remediation architecture detecting violations in real time and triggering automated corrective actions
Design Azure Resource Graph queries providing cross-subscription compliance visibility beyond Azure Policy portal limitations
Build Power BI executive compliance dashboards and Azure Workbooks technical compliance dashboards
Define Mean Time to Remediation (MTTR) targets per violation severity — distinguishing automated from human-approved remediation paths
Establish compliance exemption governance — controlling and auditing policy exemptions across the enterprise estate

Management Group Hierarchy & Policy Inheritance

The management group hierarchy is the foundational governance design decision — policy assignments at management group level inherit to all child subscriptions automatically.

Tenant Root Group
└── Enterprise Management Group
    ├── Platform Management Group          ← Platform baseline policies
    │   ├── Identity Subscription
    │   └── Connectivity Subscription
    ├── Landing Zones Management Group     ← Workload governance policies
    │   ├── Production Management Group   ← Strict enforcement initiatives
    │   │   ├── Prod-Sub-01
    │   │   └── Prod-Sub-02
    │   ├── Test Management Group         ← Moderate enforcement initiatives
    │   │   └── Test-Sub-01
    │   └── Development Management Group  ← Flexible enforcement initiatives
    │       └── Dev-Sub-01
    └── Sandbox Management Group          ← Minimal governance — exploration only
        └── Sandbox-Sub-01

Policy Inheritance Design:

Policies assigned at Enterprise Management Group level apply to all subscriptions — foundational security controls with no environment exceptions
Environment-specific initiatives assigned at Production, Test, and Development management group levels — providing differentiated enforcement without duplicating universal controls
Sandbox subscriptions have minimal governance — intentional for innovation and exploration without compliance friction
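
The inheritance rules above amount to walking from a subscription up to the root and collecting every assignment found on the way. The sketch below models that walk in plain Python. The shortened group names and the mapping of initiatives to management groups are assumptions for illustration, borrowing the initiative names from the Terraform module layout.

```python
# Child -> parent links for the hierarchy shown above (names shortened).
PARENT = {
    "Enterprise": "Tenant Root Group",
    "Platform": "Enterprise",
    "Landing Zones": "Enterprise",
    "Sandbox": "Enterprise",
    "Production": "Landing Zones",
    "Test": "Landing Zones",
    "Development": "Landing Zones",
    "Prod-Sub-01": "Production",
    "Prod-Sub-02": "Production",
    "Test-Sub-01": "Test",
    "Dev-Sub-01": "Development",
    "Sandbox-Sub-01": "Sandbox",
}

# Assumed placement of initiatives on management groups.
ASSIGNMENTS = {
    "Enterprise": ["platform-universal"],
    "Production": ["production-baseline"],
    "Test": ["test-baseline"],
    "Development": ["dev-baseline"],
}


def effective_initiatives(scope: str) -> list:
    """Initiatives that apply to a scope, ordered from the root downwards."""
    chain = []
    node = scope
    while node is not None:
        chain.append(node)
        node = PARENT.get(node)   # None once the root is passed
    effective = []
    for ancestor in reversed(chain):
        effective.extend(ASSIGNMENTS.get(ancestor, []))
    return effective
```

With this model, `effective_initiatives("Prod-Sub-01")` yields the universal initiative followed by the production baseline, while a sandbox subscription inherits only the universal controls.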

Environment-Aware Policy Initiative Design

Three-Tier Governance Model:

Control Category	Development	Test	Production
Public IP on VMs	Audit	Deny	Deny
Diagnostic settings	Audit	Audit	DeployIfNotExists
Resource tagging	Audit	Deny	Deny
TLS minimum version	Audit	Deny	Deny
Approved VM SKUs	Disabled	Audit	Deny
Storage HTTPS only	Audit	Deny	Deny
Key Vault soft delete	Audit	Deny	Deny
Approved locations	Disabled	Audit	Deny
MFA for management	Audit	Audit	Deny
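
The three-tier table can be read as a single set of controls with one effect parameter per environment. The sketch below encodes the table as data and derives the per-environment parameters from it. The environment keys (`dev`, `test`, `prod`) and the function names are illustrative assumptions, not the platform's Terraform variables.

```python
ENVIRONMENTS = ("dev", "test", "prod")

# Effects transcribed from the table above: (dev, test, prod) per control.
EFFECT_MATRIX = {
    "Public IP on VMs": ("Audit", "Deny", "Deny"),
    "Diagnostic settings": ("Audit", "Audit", "DeployIfNotExists"),
    "Resource tagging": ("Audit", "Deny", "Deny"),
    "TLS minimum version": ("Audit", "Deny", "Deny"),
    "Approved VM SKUs": ("Disabled", "Audit", "Deny"),
    "Storage HTTPS only": ("Audit", "Deny", "Deny"),
    "Key Vault soft delete": ("Audit", "Deny", "Deny"),
    "Approved locations": ("Disabled", "Audit", "Deny"),
    "MFA for management": ("Audit", "Audit", "Deny"),
}


def effects_for(environment: str) -> dict:
    """Effect parameter for every control in the given environment."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    index = ENVIRONMENTS.index(environment)
    return {control: effects[index] for control, effects in EFFECT_MATRIX.items()}


def blocking_controls(environment: str) -> list:
    """Controls that would block a deployment (Deny) in the given environment."""
    return sorted(c for c, effect in effects_for(environment).items() if effect == "Deny")
```

Listing the blocking controls per environment makes the escalation visible: none in development, five in test, and eight in production.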

Architecture Overview

1. Policy Definition & Infrastructure-as-Code Layer

All governance definitions are managed as Terraform code — version-controlled, peer-reviewed, and deployed through CI/CD pipelines.

Terraform Module Structure:

governance/
├── modules/
│   ├── policy-definition/      # Custom policy definition module
│   ├── policy-initiative/      # Initiative (policy set) module
│   ├── policy-assignment/      # Assignment at MG/subscription scope
│   ├── policy-exemption/       # Exemption management with expiry
│   └── remediation-task/       # Remediation task creation
├── initiatives/
│   ├── production-baseline/    # Production strict initiative
│   ├── test-baseline/          # Test moderate initiative
│   ├── dev-baseline/           # Dev flexible initiative
│   └── platform-universal/     # Universal controls — all environments
├── definitions/
│   ├── network/                # Network security policy definitions
│   ├── identity/               # Identity governance definitions
│   ├── data-protection/        # Encryption and data policies
│   └── operational/            # Tagging, diagnostics, monitoring
└── assignments/
    ├── enterprise-mg.tf        # Enterprise MG universal assignments
    ├── production-mg.tf        # Production MG strict assignments
    ├── test-mg.tf              # Test MG moderate assignments
    └── dev-mg.tf               # Dev MG flexible assignments

governance/
├── modules/
│   ├── policy-definition/      # Custom policy definition module
│   ├── policy-initiative/      # Initiative (policy set) module
│   ├── policy-assignment/      # Assignment at MG/subscription scope
│   ├── policy-exemption/       # Exemption management with expiry
│   └── remediation-task/       # Remediation task creation
├── initiatives/
│   ├── production-baseline/    # Production strict initiative
│   ├── test-baseline/          # Test moderate initiative
│   ├── dev-baseline/           # Dev flexible initiative
│   └── platform-universal/     # Universal controls — all environments
├── definitions/
│   ├── network/                # Network security policy definitions
│   ├── identity/               # Identity governance definitions
│   ├── data-protection/        # Encryption and data policies
│   └── operational/            # Tagging, diagnostics, monitoring
└── assignments/
    ├── enterprise-mg.tf        # Enterprise MG universal assignments
    ├── production-mg.tf        # Production MG strict assignments
    ├── test-mg.tf              # Test MG moderate assignments
    └── dev-mg.tf               # Dev MG flexible assignments

Example Custom Policy Definition — Terraform:

hcl

resource "azurerm_policy_definition" "require_diagnostic_settings" {
  name         = "require-diagnostic-settings-storage"
  policy_type  = "Custom"
  mode         = "Indexed"
  display_name = "Deploy diagnostic settings for Storage Accounts"

  metadata = jsonencode({
    category = "Monitoring"
    version  = "1.2.0"
  })

  parameters = jsonencode({
    logAnalyticsWorkspaceId = {
      type     = "String"
      metadata = { displayName = "Log Analytics Workspace ID" }
    }
  })

  policy_rule = jsonencode({
    if = {
      field  = "type"
      equals = "Microsoft.Storage/storageAccounts"
    }
    then = {
      effect = "DeployIfNotExists"
      details = {
        type = "Microsoft.Insights/diagnosticSettings"
        # Built-in Contributor role, granted to the assignment's managed identity for remediation
        roleDefinitionIds = [
          "/providers/Microsoft.Authorization/roleDefinitions/b24988ac-6180-42a0-ab88-20f7382dd24c"
        ]
        deployment = {
          properties = {
            # ... diagnostic settings deployment template ...
          }
        }
      }
    }
  })
}
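The validation stage of the pipeline calls a policy schema check (scripts/validate_policy_json.py). As a rough illustration of the kind of structural checks such a script might run against the rendered policy rule above, here is a minimal Python sketch. The function name, the error messages, and the non-exhaustive effect list are assumptions for illustration, not the contents of the repository's script.

```python
# Non-exhaustive set of Azure Policy effects accepted by this sketch (assumption).
ALLOWED_EFFECTS = {
    "Append", "Audit", "AuditIfNotExists", "Deny",
    "DeployIfNotExists", "Disabled", "Modify",
}


def validate_policy_rule(rule: dict) -> list:
    """Return a list of structural problems found in a policy rule dict."""
    errors = []
    if "if" not in rule:
        errors.append("missing 'if' block")

    then = rule.get("then", {})
    effect = then.get("effect")
    if effect is None:
        errors.append("missing 'then.effect'")
    elif effect not in ALLOWED_EFFECTS and not effect.startswith("[parameters("):
        errors.append(f"unknown effect: {effect}")

    # DeployIfNotExists needs a target type, a role for the managed
    # identity, and a deployment template to be able to remediate.
    if effect == "DeployIfNotExists":
        details = then.get("details", {})
        for key in ("type", "roleDefinitionIds", "deployment"):
            if key not in details:
                errors.append(f"DeployIfNotExists requires details.{key}")
    return errors


rule_ok = {
    "if": {"field": "type", "equals": "Microsoft.Storage/storageAccounts"},
    "then": {
        "effect": "DeployIfNotExists",
        "details": {
            "type": "Microsoft.Insights/diagnosticSettings",
            "roleDefinitionIds": ["<role-definition-id>"],
            "deployment": {"properties": {}},
        },
    },
}
rule_bad = {
    "if": {"field": "type", "equals": "Microsoft.Storage/storageAccounts"},
    "then": {
        "effect": "DeployIfNotExists",
        "details": {
            "type": "Microsoft.Insights/diagnosticSettings",
            "deployment": {"properties": {}},
        },
    },
}

print(validate_policy_rule(rule_ok))
print(validate_policy_rule(rule_bad))
```

Returning a list of messages rather than raising on the first problem lets a pipeline step report every issue in one run.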

2. GitOps Governance Layer

Git repositories serve as the authoritative source of truth for all governance definitions — policy changes require pull request review and approval before reaching any environment.

Branch Strategy for Governance:

main              → Production policy assignments (approval required)
test              → Test environment policies (approval required)
develop           → Development policies (automated)
feature/policy-*  → New policy definition development
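The branch mapping above can be expressed as a small lookup that a pipeline could consult before deciding what to deploy. This is a hedged Python sketch: the rule table mirrors the four branches listed, while the `deploys` flag and the treatment of feature branches as validate-only are assumptions.

```python
from fnmatch import fnmatchcase

# (pattern, environment, approval_required, deploys); first match wins.
BRANCH_RULES = [
    ("main", "production", True, True),
    ("test", "test", True, True),
    ("develop", "development", False, True),
    # Assumption: feature branches are validated and planned, never applied.
    ("feature/policy-*", None, False, False),
]


def resolve_branch(branch: str) -> dict:
    """Map a Git branch name to its governance deployment behaviour."""
    for pattern, environment, approval_required, deploys in BRANCH_RULES:
        if fnmatchcase(branch, pattern):
            return {
                "environment": environment,
                "approval_required": approval_required,
                "deploys": deploys,
            }
    raise ValueError(f"branch {branch!r} is not part of the governance branch strategy")


print(resolve_branch("main"))
print(resolve_branch("feature/policy-tls-minimum"))
```

Rejecting unknown branches outright, instead of defaulting them to Dev, keeps unreviewed branch names from reaching any environment.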

Pull Request Governance Requirements:

Policy definition changes require review from at least one governance team member
Production policy assignment changes require review from two approvers — governance team and security team
All PR checks must pass: terraform validate, a tfsec scan of the governance code, and a terraform plan whose output is posted as a PR comment for review
Policy definition changes include impact assessment — identifying which existing resources would be affected by the new policy
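The impact assessment in the last requirement can be approximated before merge by evaluating the proposed rule against a resource inventory. The Python sketch below does this for a simple "required tag on a resource type" rule. The `Resource` shape and function names are hypothetical, and a real assessment would more likely draw its inventory from Azure Resource Graph.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Minimal stand-in for an inventory record (hypothetical shape)."""
    id: str
    type: str
    tags: dict = field(default_factory=dict)


def violates_required_tag(resource: Resource, resource_type: str, tag: str) -> bool:
    """True when the resource matches the policy's type and lacks the tag."""
    same_type = resource.type.lower() == resource_type.lower()
    return same_type and not resource.tags.get(tag)


def assess_impact(resources, resource_type: str, tag: str) -> list:
    """Return IDs of existing resources the new policy would flag."""
    return [r.id for r in resources if violates_required_tag(r, resource_type, tag)]


inventory = [
    Resource("vm-1", "Microsoft.Compute/virtualMachines", {"environment": "dev"}),
    Resource("vm-2", "Microsoft.Compute/virtualMachines"),
    Resource("st-1", "Microsoft.Storage/storageAccounts"),
]
affected = assess_impact(inventory, "Microsoft.Compute/virtualMachines", "environment")
print(affected)  # ['vm-2']
```

Attaching the resulting list to the pull request gives reviewers a concrete view of what a switch from Audit to Deny would block.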

3. CI/CD Automation Layer

Governance deployment pipelines enforce validation, planning, approval, and deployment stages independently per environment tier.

Pipeline Stage Architecture:

yaml

stages:
  - stage: ValidateGovernance
    jobs:
      - job: PolicyValidation
        steps:
          - script: terraform validate
          - script: tfsec . --minimum-severity MEDIUM
          - script: |
              # Custom policy definition schema validation
              python scripts/validate_policy_json.py
          - script: |
              # Check for policy assignments without corresponding definitions
              python scripts/validate_assignment_references.py

  - stage: PlanGovernance
    dependsOn: ValidateGovernance
    jobs:
      - job: TerraformPlan
        steps:
          - script: terraform plan -out=governance.tfplan
          - script: |
              # Generate human-readable policy impact report
              python scripts/generate_impact_report.py governance.tfplan
          - publish: governance.tfplan

  - stage: ApproveProduction
    dependsOn: PlanGovernance
    condition: eq(variables['targetEnvironment'], 'production')
    jobs:
      - deployment: GovernanceApproval
        environment: governance-production   # Two-approver gate

  - stage: DeployGovernance
    dependsOn: ApproveProduction
    jobs:
      - job: TerraformApply
        steps:
          - download: governance.tfplan
          - script

4. Compliance Enforcement Layer

Azure Policy enforces governance controls across the management group hierarchy with environment-appropriate effects.

Universal Controls — All Environments:

hcl

# Applied at Enterprise Management Group — no exceptions
resource "azurerm_management_group_policy_assignment" "universal_baseline" {
  name                 = "universal-sec-baseline"   # Management group assignment names are limited to 24 characters
  management_group_id  = data.azurerm_management_group.enterprise.id
  policy_definition_id = azurerm_policy_set_definition.universal_baseline.id

  # Controls that apply everywhere without exception:
  # - Deny MMA legacy agent (deprecated)
  # - Require HTTPS on storage
  # - Deny classic resources
  # - Require Key Vault soft delete
  # - Deny public IP on management subnets
}

Compliance Exemption Governance:

hcl

# Time-limited exemption — requires justification and expiry date
resource "azurerm_resource_policy_exemption" "dev_public_ip_exemption" {
  name                 = "dev-public-ip-testing-exemption"
  resource_id          = azurerm_virtual_machine.dev_test_vm.id
  policy_assignment_id = azurerm_management_group_policy_assignment.deny_public_ip.id
  exemption_category   = "Waiver"
  expires_on           = "2025-12-31T00:00:00Z"    # Mandatory expiry — no permanent exemptions

  description = "Temporary exemption for public IP testing in dev environment. JIRA-1234. Expires 2025-12-31."
}

All exemptions are version-controlled, require justification comments, have mandatory expiry dates, and are reviewed quarterly — permanent exemptions are not permitted.
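The quarterly review described above could be supported by a small audit that classifies each exemption by its expiry. This Python sketch assumes exemptions are available as dictionaries with `name` and `expires_on` keys (a hypothetical export format) and uses a 30-day warning window chosen purely for illustration.

```python
from datetime import datetime, timedelta, timezone


def audit_exemptions(exemptions, now, warn_days=30):
    """Classify exemptions as missing-expiry, expired, expiring-soon, or ok."""
    report = {"missing_expiry": [], "expired": [], "expiring_soon": [], "ok": []}
    for exemption in exemptions:
        expires_on = exemption.get("expires_on")
        if not expires_on:
            report["missing_expiry"].append(exemption["name"])
            continue
        # Accept the trailing "Z" used in the Terraform example above.
        expiry = datetime.fromisoformat(expires_on.replace("Z", "+00:00"))
        if expiry <= now:
            report["expired"].append(exemption["name"])
        elif expiry - now <= timedelta(days=warn_days):
            report["expiring_soon"].append(exemption["name"])
        else:
            report["ok"].append(exemption["name"])
    return report


exemptions = [
    {"name": "dev-public-ip-testing-exemption", "expires_on": "2025-12-31T00:00:00Z"},
    {"name": "legacy-exemption", "expires_on": None},
    {"name": "old-exemption", "expires_on": "2025-06-30T00:00:00Z"},
]
review_date = datetime(2025, 12, 15, tzinfo=timezone.utc)
report = audit_exemptions(exemptions, review_date)
print(report)
```

Any entry under `missing_expiry` would violate the no-permanent-exemptions rule and could fail a pipeline check.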

5. Event-Driven Remediation Layer

The remediation layer detects compliance violations in near real time and orchestrates automated corrective actions, with human approval gates for high-impact remediations.

Remediation Architecture Flow:

Azure Policy violation detected
         ↓
Azure Event Grid (Policy compliance state change event)
         ↓
Event Grid subscription filters by violation severity
         ↓
         ├── HIGH/CRITICAL → Azure Logic App (approval workflow)
         │         ↓
         │   Approval notification to governance team
         │         ↓
         │   Human approves/rejects remediation
         │         ↓
         │   Approved → Azure Automation Runbook executes remediation
         │
         └── MEDIUM/LOW → Azure Function (automated remediation)
                   ↓
           Automated corrective action executed
           without human intervention

Concrete Remediation Example — Public IP Detected on VM:

python

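from datetime import datetime, timezone

# azure_client, log_remediation_action and send_notification are assumed to be
# project helpers defined elsewhere in the Function app, not Azure SDK calls.


# Azure Function triggered by Event Grid compliance violation event
def remediate_public_ip_violation(event):
    # Assumes the event payload carries these two fields
    resource_id = event['data']['resourceId']
    violation_type = event['data']['policyDefinitionName']

    if violation_type != "deny-public-ip-vm":
        return  # Other violation types are handled by their own remediators

    # Get NIC associated with the VM
    nic = azure_client.network.get_nic_from_vm(resource_id)

    # Remove the public IP association from every IP configuration on the NIC
    for ip_config in nic.ip_configurations:
        ip_config.public_ip_address = None
    azure_client.network.update_nic(nic)

    # One timezone-aware timestamp shared by the log entry and the notification
    remediated_at = datetime.now(timezone.utc)

    # Log remediation action
    log_remediation_action(
        resource_id=resource_id,
        action="removed_public_ip",
        triggered_by="automated_policy_remediation",
        timestamp=remediated_at
    )

    # Notify governance team
    send_notification(
        subject=f"Auto-remediated: Public IP removed from {resource_id}",
        body=f"Policy violation auto-remediated at {remediated_at.isoformat()}"
    )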

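Remediation Decision Matrix:

Severity sets the default route shown in the flow above, and the matrix overrides it per violation type: public IP and public blob exposure are remediated automatically despite their High and Critical ratings.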

Violation Type	Severity	Remediation Path	Human Approval	Target MTTR
Public IP on workload VM	High	Automated Function	No	5 minutes
Missing diagnostic settings	Medium	DeployIfNotExists Policy	No	30 minutes
Missing resource tags	Low	Automated Function	No	15 minutes
Non-compliant VM SKU	High	Logic App workflow	Yes	4 hours
Public storage blob access	Critical	Automated Function	No	2 minutes
Missing NSG on subnet	High	Logic App workflow	Yes	2 hours
Encryption disabled	Critical	Logic App workflow	Yes	1 hour
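The matrix lends itself to a lookup table that the orchestration layer consults per event. A minimal Python sketch follows. The dictionary keys are hypothetical identifiers for the rows above, and sending unknown violation types to the approval-gated path is an assumption rather than documented behaviour.

```python
from typing import NamedTuple


class Route(NamedTuple):
    path: str
    needs_approval: bool
    target_mttr_minutes: int


# Rows transcribed from the decision matrix; keys are hypothetical identifiers.
DECISION_MATRIX = {
    "public-ip-on-vm": Route("function", False, 5),
    "missing-diagnostic-settings": Route("deploy-if-not-exists", False, 30),
    "missing-resource-tags": Route("function", False, 15),
    "non-compliant-vm-sku": Route("logic-app", True, 240),
    "public-blob-access": Route("function", False, 2),
    "missing-nsg-on-subnet": Route("logic-app", True, 120),
    "encryption-disabled": Route("logic-app", True, 60),
}

# Assumption: anything not in the matrix takes the safest, approval-gated path.
DEFAULT_ROUTE = Route("logic-app", True, 240)


def route_violation(violation_type: str) -> Route:
    """Pick the remediation path for a violation type."""
    return DECISION_MATRIX.get(violation_type, DEFAULT_ROUTE)


print(route_violation("public-blob-access"))
print(route_violation("something-new"))
```

Keeping the matrix as data means a new automated remediation is enabled by a reviewed one-line change, not by new branching logic.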

6. Monitoring & Analytics Layer

Azure Resource Graph — Cross-Subscription Compliance Queries:

kusto

// Non-compliant resources across all subscriptions — grouped by policy
PolicyResources
| where type == "microsoft.policyinsights/policystates"
| where properties.complianceState == "NonCompliant"
| summarize
    NonCompliantCount = count(),
    AffectedSubscriptions = dcount(subscriptionId)
    by PolicyDefinitionName = tostring(properties.policyDefinitionName),
       ResourceType = tostring(properties.resourceType)
| order by NonCompliantCount desc
| take 20

kusto

// Compliance rate by evaluation day over the last 30 days (drift indicator)
// Note: PolicyResources holds the latest state per resource, so a full historical
// trend needs periodic snapshots of this result (for example in Log Analytics)
PolicyResources
| where type == "microsoft.policyinsights/policystates"
| extend EvaluatedAt = todatetime(properties.timestamp)
| where EvaluatedAt > ago(30d)
| summarize
    CompliantCount = countif(properties.complianceState == "Compliant"),
    NonCompliantCount = countif(properties.complianceState == "NonCompliant")
    by Day = bin(EvaluatedAt, 1d)
| project Day, ComplianceRate = (CompliantCount * 100.0) / (CompliantCount + NonCompliantCount)
| order by Day asc

kusto

// Resources without required tags — cross-subscription
Resources
| where isnull(tags.environment) or isnull(tags.owner) or isnull(tags.cost_centre)
| project
    ResourceId = id,
    ResourceType = type,
    Subscription = subscriptionId,
    MissingTags = pack(
        "environment", isnull(tags.environment),
        "owner", isnull(tags.owner),
        "cost_centre", isnull(tags.cost_centre)
    )
| order by Subscription asc

Azure Monitor — Governance Telemetry:

Azure Policy compliance state change events logged to Log Analytics
Remediation task execution results — success, failure, and partial remediation outcomes
Policy assignment deployment events from CI/CD pipeline
Alert rules for compliance rate degradation — alerting when subscription compliance rate drops below defined threshold
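The degradation alert in the last item reduces to a per-subscription rate check against a threshold. The Python sketch below shows that logic in isolation. The 95% default and the input shape (subscription mapped to compliant and non-compliant counts) are assumptions, and in the platform itself this would more likely be an Azure Monitor alert rule than application code.

```python
def compliance_rate(compliant: int, non_compliant: int) -> float:
    """Percentage of assessed resources that are compliant (100.0 when none assessed)."""
    total = compliant + non_compliant
    return 100.0 if total == 0 else compliant * 100.0 / total


def subscriptions_below_threshold(counts: dict, threshold: float = 95.0) -> dict:
    """Map subscription -> rate for every subscription under the threshold."""
    rates = {sub: compliance_rate(c, n) for sub, (c, n) in counts.items()}
    return {sub: rate for sub, rate in rates.items() if rate < threshold}


counts = {"sub-prod": (190, 10), "sub-dev": (80, 20), "sub-new": (0, 0)}
alerts = subscriptions_below_threshold(counts)
print(alerts)  # {'sub-dev': 80.0}
```

Treating a subscription with no assessed resources as fully compliant avoids a division by zero and a spurious alert on newly created subscriptions.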

7. Visualisation & Reporting Layer

Compliance reporting serves two audiences through separate visualisation tools — technical operators and executive governance stakeholders.

Azure Workbooks — Technical Compliance Dashboard:

Per-subscription compliance rate by policy initiative
Non-compliant resource inventory with drill-down to resource-level violation details
Remediation task status — pending, in-progress, completed, failed
Policy assignment coverage map — which initiatives are assigned to which management groups
Recent policy deployment history from CI/CD pipeline

Power BI — Executive Compliance Dashboard:

Report	Audience	Refresh Frequency	Purpose
Enterprise Compliance Score	Executive / CISO	Daily	Overall governance posture KPI
Environment Compliance Comparison	Governance team	Daily	Dev/Test/Prod compliance rate comparison
Compliance Trend	Executive / Governance	Weekly	90-day compliance rate trend
Top Violations	Governance team	Daily	Most frequent policy violations requiring attention
Remediation Performance	Operations	Daily	MTTR by violation type vs targets
Exemption Register	Compliance/Audit	Weekly	Active exemptions with expiry dates
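The Remediation Performance report compares mean time to remediate (MTTR) per violation type against its target. A hedged Python sketch of that calculation follows. The event tuple layout and the violation type identifiers are hypothetical, and the targets are taken from the decision matrix.

```python
from datetime import datetime
from statistics import mean


def mttr_report(events, targets):
    """Compute mean minutes from detection to remediation per violation type.

    events: iterable of (violation_type, detected_at, remediated_at)
    targets: dict of violation_type -> target MTTR in minutes
    """
    durations = {}
    for violation_type, detected_at, remediated_at in events:
        minutes = (remediated_at - detected_at).total_seconds() / 60
        durations.setdefault(violation_type, []).append(minutes)

    report = {}
    for violation_type, values in durations.items():
        mttr = mean(values)
        target = targets.get(violation_type)
        report[violation_type] = {
            "mttr_minutes": mttr,
            "target_minutes": target,
            # None when no target is defined for this violation type.
            "within_target": None if target is None else mttr <= target,
        }
    return report


targets = {"missing-resource-tags": 15, "public-blob-access": 2}
events = [
    ("missing-resource-tags", datetime(2025, 1, 6, 9, 0), datetime(2025, 1, 6, 9, 10)),
    ("missing-resource-tags", datetime(2025, 1, 6, 11, 0), datetime(2025, 1, 6, 11, 14)),
    ("public-blob-access", datetime(2025, 1, 6, 12, 0), datetime(2025, 1, 6, 12, 3)),
]
report = mttr_report(events, targets)
print(report)
```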

Compliance Score Methodology:

Enterprise Compliance Score = 
  (Compliant Resources / Total Assessed Resources) × 100

Weighted Score (optional) = 
  Σ (Policy Weight × Compliance Rate per Policy) / Σ Policy Weights

Where policy weights reflect regulatory significance:
  - PCI DSS controls: weight 3
  - Security baseline controls: weight 2
  - Operational controls: weight 1
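A short worked example may make the difference between the two scores concrete. The Python sketch below implements both formulas as written. The per-category compliance rates and resource counts are invented purely for illustration.

```python
def enterprise_score(compliant: int, total_assessed: int) -> float:
    """(Compliant Resources / Total Assessed Resources) x 100."""
    return compliant * 100.0 / total_assessed


def weighted_score(policies) -> float:
    """Sum(weight x rate) / Sum(weights); policies is a list of (weight, rate_percent)."""
    policies = list(policies)
    total_weight = sum(weight for weight, _ in policies)
    return sum(weight * rate for weight, rate in policies) / total_weight


# Illustrative rates: PCI DSS (weight 3), security baseline (2), operational (1).
policies = [(3, 100.0), (2, 90.0), (1, 60.0)]

print(enterprise_score(940, 1000))  # 94.0
print(weighted_score(policies))     # 90.0
```

With these inputs the plain average of the three rates would be about 83.3, while the weighted score is 90.0, because the heavily weighted PCI DSS controls are fully compliant and the weak operational controls count least.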

Technologies Used

Category	Technologies
Infrastructure as Code	Terraform
CI/CD & GitOps	GitHub Actions, Azure DevOps, YAML Pipelines
Policy Governance	Azure Policy, Azure Initiative Definitions, Management Groups
Cross-Subscription Analytics	Azure Resource Graph, KQL
Event-Driven Remediation	Azure Event Grid, Logic Apps, Azure Functions, Azure Automation Runbooks
Monitoring	Azure Monitor, Log Analytics
Reporting	Power BI, Azure Workbooks
Security & Compliance	Microsoft Defender for Cloud, Azure RBAC, Azure Key Vault
Compliance Frameworks	CIS Azure Benchmark v2.0, NIST SP 800-53, PCI DSS v4.0

Key Challenges Addressed

Design Decisions & Rationale

Trade-offs & Design Constraints

Projected Outcomes

The architecture is designed to deliver the following governance and operational outcomes in a production enterprise environment:

Consistent policy enforcement across all subscriptions through management group hierarchy inheritance — universal controls applied without per-subscription configuration management
Environment-appropriate governance through three-tier initiative design — development flexibility and production strictness enforced through the same policy definitions with differentiated effects
Near real-time compliance violation detection and automated remediation for defined violation categories through Event Grid-triggered orchestration
Cross-subscription compliance visibility through Azure Resource Graph KQL queries — enterprise-wide compliance posture queryable on demand
Executive governance reporting through Power BI dashboards with daily compliance score, trend analysis, and MTTR performance tracking
Auditable governance lifecycle through Terraform-managed policy definitions, GitOps version control, and CI/CD deployment history
Controlled exemption governance through mandatory expiry dates, justification requirements, and quarterly review processes preventing exemption accumulation

Future Evolution

OPA/Gatekeeper integration for Kubernetes workload governance — extending Policy-as-Code governance to AKS admission control through the same GitOps governance model
AI-assisted compliance anomaly detection — identifying unusual compliance degradation patterns indicating potential security incidents rather than routine configuration drift
Automated risk scoring and violation prioritisation — weighting compliance violations by asset criticality and regulatory impact for intelligent remediation sequencing
Cross-cloud governance federation — extending management group-equivalent governance patterns to AWS Organizations and GCP Resource Manager through Terraform multi-cloud provider management
Continuous compliance validation pipelines — scheduled Resource Graph compliance scans triggering pipeline alerts when compliance rate drops below defined thresholds
Self-healing remediation workflows — expanding automated remediation coverage as remediation patterns are proven stable through operational experience
FinOps governance integration — Azure Policy enforcement of cost governance controls (approved VM SKUs, required shutdown schedules, resource lifecycle tagging) through the same governance platform
Security posture benchmarking automation — automated CIS Azure Benchmark and NIST SP 800-53 compliance scoring through Defender for Cloud regulatory compliance integration

Key Takeaways

Management group hierarchy is the foundational Policy-as-Code design decision — policy inheritance eliminates per-subscription management overhead that scales unsustainably with subscription count
Environment-aware initiative design with effect differentiation is essential — uniform Deny enforcement across all environments blocks legitimate development activities; the same policy definitions should serve all environments with environment-specific effect parameters
Event-driven remediation dramatically reduces compliance violation exposure windows compared to scheduled remediation — real-time response versus hourly or daily remediation cycles
Azure Resource Graph is the right tool for cross-subscription compliance analytics: a single KQL query aggregates policy state across every subscription the caller can read, which the Azure Policy portal's per-scope compliance views do not offer at enterprise scale
Automated remediation must be bounded by a decision matrix — not all violations should trigger automated corrective actions; high-impact remediations require human approval to prevent operational disruption
Exemption governance requires mandatory expiry enforcement — permanent exemptions accumulate into compliance debt; Terraform-managed exemptions with expiry dates prevent this erosion
Separation of enforcement and remediation layers enables independent failure modes — Azure Policy detecting violations and Functions/Runbooks correcting them can be tested, operated, and failed independently

Open to discussing infrastructure architecture, cloud transformation, or high-availability system design.

Whether the objective is infrastructure modernization, operational resilience, hybrid cloud transformation, or enterprise security architecture, I am always interested in discussing complex infrastructure environments and strategic technical initiatives.

Get in touch
