Description
Key Focus Areas:
Policy-as-Code & Cloud Governance
Multi-Subscription Compliance Management
Event-Driven Remediation Orchestration
Azure Resource Graph Compliance Analytics
Environment-Aware Initiative Design
Executive Compliance Reporting
Executive Summary
Architected a cloud-native Policy-as-Code governance platform on Microsoft Azure enabling automated compliance enforcement, event-driven remediation, cross-subscription visibility, and executive reporting across Development, Test, and Production environments at enterprise scale.
The platform integrates Terraform-managed Azure Policy definitions and initiatives, GitOps-driven policy lifecycle governance, CI/CD-integrated deployment workflows, Azure Event Grid-triggered remediation orchestration through Logic Apps and Azure Automation Runbooks, Azure Resource Graph cross-subscription compliance querying, and Power BI executive compliance dashboards.
The design is differentiated from deployment-time security governance studies by its focus on operational compliance at scale — what happens after policies are deployed across a large Azure estate: how violations are detected in real time, how remediation is automated without introducing operational instability, how compliance state is queried across hundreds of resources across multiple subscriptions, and how governance evidence is surfaced to executive stakeholders.
Business Drivers
As organisations expand Azure adoption across multiple subscriptions and environments, point-in-time compliance audits and manual policy management become operationally unsustainable. Compliance drift, where resources that were compliant at deployment gradually deviate through configuration changes, new resource deployments, or policy scope expansion, is among the most common governance failures in large Azure estates.
This architecture was designed to address the enterprise governance requirements of organisations where existing approaches result in:
Compliance drift between environments: policy changes applied to production are not propagated to development and test environments, creating an inconsistent governance posture
Limited real-time visibility into policy violations: compliance state is known only at scheduled audit intervals rather than continuously
Slow manual remediation cycles: non-compliant resources are identified in audits but remediated through manual operational tickets, extending exposure windows
Weak integration between governance controls and infrastructure delivery: policies are applied after infrastructure is deployed rather than governed through the same delivery lifecycle
Difficulty scaling governance across multiple subscriptions: manual policy management across dozens of subscriptions creates inconsistency and coverage gaps
Compliance evidence requiring manual collection: audit responses are built from portal exports rather than from continuously maintained, queryable compliance state
Operational Constraints
The architecture was designed to operate within the following constraints typical of enterprise multi-subscription Azure governance environments:
Governance controls must be consistent across Development, Test, and Production environments but with environment-specific enforcement severity — development teams require operational flexibility that production cannot afford
Policy deployment workflows must integrate into CI/CD pipelines — governance changes must flow through the same review and approval process as infrastructure changes
Automated remediation must avoid operational instability — not all compliance violations should trigger immediate automated remediation; high-impact remediations require human approval
Azure Resource Graph queries must support cross-subscription compliance reporting — no single-subscription visibility model is adequate for enterprise estates
Compliance reporting must serve two audiences — technical operators requiring resource-level violation details and executive stakeholders requiring KPI-level governance posture visibility
Policy exceptions must be manageable at environment scope — development environments may legitimately require exemptions from controls mandatory in production
Multi-subscription governance must follow management group hierarchy — policies assigned at management group level propagate to child subscriptions consistently
Objectives
Design a management group hierarchy enabling consistent policy inheritance across Dev, Test, and Production subscription tiers
Develop environment-specific policy initiatives with differentiated enforcement severity per environment tier
Automate policy lifecycle management through Terraform with GitOps governance and CI/CD deployment
Implement event-driven remediation architecture detecting violations in real time and triggering automated corrective actions
Design Azure Resource Graph queries providing cross-subscription compliance visibility beyond Azure Policy portal limitations
Build Power BI executive compliance dashboards and Azure Workbooks technical compliance dashboards
Define Mean Time to Remediation (MTTR) targets per violation severity — distinguishing automated from human-approved remediation paths
Establish compliance exemption governance — controlling and auditing policy exemptions across the enterprise estate
Management Group Hierarchy & Policy Inheritance
The management group hierarchy is the foundational governance design decision: policy assignments at management group level are inherited automatically by all child subscriptions.
Policy Inheritance Design:
Policies assigned at Enterprise Management Group level apply to all subscriptions — foundational security controls with no environment exceptions
Environment-specific initiatives assigned at Production, Test, and Development management group levels — providing differentiated enforcement without duplicating universal controls
Sandbox subscriptions have minimal governance — intentional for innovation and exploration without compliance friction
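The inheritance behaviour described above can be illustrated with a small in-memory model. This is only a sketch: the `ManagementGroup` class, the initiative names, and the placement of Sandbox directly under the enterprise root are assumptions made for illustration, not details of the actual hierarchy or of any Azure SDK.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ManagementGroup:
    """A node in a hypothetical management group tree."""
    name: str
    parent: Optional["ManagementGroup"] = None
    initiatives: List[str] = field(default_factory=list)


def effective_initiatives(group: ManagementGroup) -> List[str]:
    """Collect initiatives from the root down to the given group."""
    chain = []
    node = group
    while node is not None:
        chain.append(node)
        node = node.parent
    result: List[str] = []
    for node in reversed(chain):  # root first, so universal controls lead
        result.extend(node.initiatives)
    return result


# Hypothetical names; the real hierarchy may differ.
enterprise = ManagementGroup("Enterprise", initiatives=["universal-security-baseline"])
production = ManagementGroup("Production", parent=enterprise,
                             initiatives=["production-initiative"])
sandbox = ManagementGroup("Sandbox", parent=enterprise)

print(effective_initiatives(production))
print(effective_initiatives(sandbox))
```

In this model a subscription under Production receives both the universal baseline and the production initiative, while a Sandbox subscription receives only whatever is assigned above it.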
Environment-Aware Policy Initiative Design
Three-Tier Governance Model:
| Control Category | Development | Test | Production |
|---|---|---|---|
| Public IP on VMs | Audit | Deny | Deny |
| Diagnostic settings | Audit | Audit | DeployIfNotExists |
| Resource tagging | Audit | Deny | Deny |
| TLS minimum version | Audit | Deny | Deny |
| Approved VM SKUs | Disabled | Audit | Deny |
| Storage HTTPS only | Audit | Deny | Deny |
| Key Vault soft delete | Audit | Deny | Deny |
| Approved locations | Disabled | Audit | Deny |
| MFA for management | Audit | Audit | Deny |
Rationale for Environment Differentiation: Applying Deny effects to every control in development creates operational friction that slows development velocity without proportional security benefit; developers testing configurations in development should have the flexibility to iterate. Audit effects in development surface compliance awareness without blocking operations. Production uses Deny effects for all security-critical controls: non-compliance is not permitted, regardless of operational convenience.
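One way to picture "same definitions, different effects" is a lookup that turns the three-tier table into per-environment effect parameters. The sketch below transcribes the table rows; the function names `effect_parameters` and `blocking_controls` are hypothetical helpers, not part of Azure Policy or Terraform.

```python
# Effects per control in (Development, Test, Production) order,
# transcribed from the three-tier governance table.
ENVIRONMENTS = ("Development", "Test", "Production")

EFFECTS = {
    "Public IP on VMs": ("Audit", "Deny", "Deny"),
    "Diagnostic settings": ("Audit", "Audit", "DeployIfNotExists"),
    "Resource tagging": ("Audit", "Deny", "Deny"),
    "TLS minimum version": ("Audit", "Deny", "Deny"),
    "Approved VM SKUs": ("Disabled", "Audit", "Deny"),
    "Storage HTTPS only": ("Audit", "Deny", "Deny"),
    "Key Vault soft delete": ("Audit", "Deny", "Deny"),
    "Approved locations": ("Disabled", "Audit", "Deny"),
    "MFA for management": ("Audit", "Audit", "Deny"),
}


def effect_parameters(environment: str) -> dict:
    """Return the effect to pass to each shared policy definition."""
    column = ENVIRONMENTS.index(environment)  # raises ValueError if unknown
    return {control: effects[column] for control, effects in EFFECTS.items()}


def blocking_controls(environment: str) -> list:
    """Controls that would reject a non-compliant deployment outright."""
    params = effect_parameters(environment)
    return sorted(control for control, effect in params.items() if effect == "Deny")


for env in ENVIRONMENTS:
    print(env, len(blocking_controls(env)), "blocking controls")
```

Counting the Deny entries per tier makes the gradient visible: no control blocks deployments in Development, while most do in Production.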
Architecture Overview
The solution is structured as a seven-layer enterprise governance platform integrating policy definition and IaC, GitOps governance, CI/CD automation, compliance enforcement, event-driven remediation, monitoring and analytics, and executive reporting.
1. Policy Definition & Infrastructure-as-Code Layer
All governance definitions are managed as Terraform code — version-controlled, peer-reviewed, and deployed through CI/CD pipelines.
Terraform Module Structure:
Example Custom Policy Definition — Terraform:
hcl
2. GitOps Governance Layer
Git repositories serve as the authoritative source of truth for all governance definitions — policy changes require pull request review and approval before reaching any environment.
Branch Strategy for Governance:
Pull Request Governance Requirements:
Policy definition changes require review from at least one governance team member
Production policy assignment changes require review from two approvers — governance team and security team
All PR checks must pass: terraform validate, tfsec scan of policy definitions, and terraform plan output reviewed as a PR comment
Policy definition changes include an impact assessment identifying which existing resources would be affected by the new policy
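The reviewer rules above amount to a small predicate that a PR check could evaluate. The following is a minimal sketch under assumptions: the change type labels and team names are hypothetical strings, and in practice this logic would more likely live in branch protection or code owner rules than in custom code.

```python
def approvals_satisfied(change_type: str, approver_teams: list) -> bool:
    """Check the reviewer rule for a governance pull request.

    change_type is a hypothetical label: "policy_definition" or
    "production_assignment". approver_teams lists the teams of the
    reviewers who approved the PR.
    """
    teams = set(approver_teams)
    if change_type == "production_assignment":
        # Two approvers: one from governance, one from security.
        return {"governance", "security"} <= teams
    if change_type == "policy_definition":
        # At least one governance team member.
        return "governance" in teams
    raise ValueError(f"unknown change type: {change_type}")


print(approvals_satisfied("policy_definition", ["governance"]))
print(approvals_satisfied("production_assignment", ["governance"]))
print(approvals_satisfied("production_assignment", ["security", "governance"]))
```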
3. CI/CD Automation Layer
Governance deployment pipelines enforce validation, planning, approval, and deployment stages independently per environment tier.
Pipeline Stage Architecture:
yaml
Policy Impact Report Generation: Before production governance changes are approved, an automated impact report is generated showing how many existing resources would be affected by each policy change — enabling approvers to make informed decisions about production policy deployment timing and potential operational impact.
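As a rough illustration of what such an impact report computes, the sketch below counts how many resources in an inventory would violate each proposed rule. The inventory records, rule names, and predicate functions are all hypothetical; a real report would draw on Azure Policy evaluation or Resource Graph data rather than hand-written predicates.

```python
from typing import Callable, Dict, List

Resource = dict  # simplified inventory record, for illustration only


def impact_report(resources: List[Resource],
                  rules: Dict[str, Callable[[Resource], bool]]) -> Dict[str, int]:
    """Count resources that would be flagged by each proposed rule."""
    return {
        name: sum(1 for resource in resources if violates(resource))
        for name, violates in rules.items()
    }


# Hypothetical inventory and rules.
inventory = [
    {"id": "st1", "type": "storage", "https_only": False, "tags": {"owner": "data"}},
    {"id": "st2", "type": "storage", "https_only": True, "tags": {}},
    {"id": "vm1", "type": "vm", "tags": {}},
]

proposed_rules = {
    "storage-https-only": lambda r: r["type"] == "storage" and not r.get("https_only", False),
    "require-owner-tag": lambda r: "owner" not in r.get("tags", {}),
}

report = impact_report(inventory, proposed_rules)
print(report)
```

An approver reading this output would see that the tagging rule touches more resources than the HTTPS rule and could schedule its production rollout accordingly.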
4. Compliance Enforcement Layer
Azure Policy enforces governance controls across the management group hierarchy with environment-appropriate effects.
Universal Controls — All Environments:
hcl
Compliance Exemption Governance:
hcl
All exemptions are version-controlled, require justification comments, have mandatory expiry dates, and are reviewed quarterly — permanent exemptions are not permitted.
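The exemption rules above (justification required, expiry mandatory, nothing permanent) lend themselves to an automated check in the pipeline. This sketch assumes a simplified `Exemption` record with hypothetical field names; it is not the Azure Policy exemption schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Exemption:
    """Simplified exemption record (hypothetical fields)."""
    name: str
    scope: str
    justification: str
    expires_on: Optional[date]


def validate(exemption: Exemption, today: date) -> List[str]:
    """Return a list of governance problems; empty means acceptable."""
    problems = []
    if not exemption.justification.strip():
        problems.append("missing justification")
    if exemption.expires_on is None:
        problems.append("missing expiry date")  # permanent exemptions not allowed
    elif exemption.expires_on <= today:
        problems.append("expired")
    return problems


today = date(2025, 1, 15)
ok = Exemption("dev-public-ip", "/dev/rg-lab", "load test until Q2", date(2025, 6, 30))
bad = Exemption("legacy-app", "/prod/rg-legacy", "", None)

print(validate(ok, today))
print(validate(bad, today))
```

Running a check like this on every pull request would reject permanent or unjustified exemptions before they reach any environment, and running it on a schedule would surface expired ones for the quarterly review.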
5. Event-Driven Remediation Layer
The remediation layer detects compliance violations in near real time and orchestrates automated corrective actions, with human approval gates for high-impact remediations.
Remediation Architecture Flow:
Concrete Remediation Example — Public IP Detected on VM:
python
Remediation Decision Matrix:
| Violation Type | Severity | Remediation Path | Human Approval | Target MTTR |
|---|---|---|---|---|
| Public IP on workload VM | High | Automated Function | No | 5 minutes |
| Missing diagnostic settings | Medium | DeployIfNotExists Policy | No | 30 minutes |
| Missing resource tags | Low | Automated Function | No | 15 minutes |
| Non-compliant VM SKU | High | Logic App workflow | Yes | 4 hours |
| Public storage blob access | Critical | Automated Function | No | 2 minutes |
| Missing NSG on subnet | High | Logic App workflow | Yes | 2 hours |
| Encryption disabled | Critical | Logic App workflow | Yes | 1 hour |
Why Separation of Automated and Approved Remediation: Automatically remediating all violations regardless of impact risks operational disruption. Removing a public IP from a VM that a team intentionally exposed for legitimate testing breaks their workflow without notice. Automated remediation is appropriate only for violations where the non-compliant configuration has no plausible legitimate use case and the corrective action has low operational impact. High-impact remediations route through approval workflows, ensuring human judgement is applied before irreversible or operationally disruptive actions are taken.
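The decision matrix can be encoded as data so that the routing logic stays trivial and the policy lives in one reviewable table. The sketch below transcribes the matrix rows; the `RemediationRule` type, the `route` function, and the fallback for unknown violation types (route to human approval) are assumptions added for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class RemediationRule:
    severity: str
    path: str
    needs_approval: bool
    target_mttr: Optional[timedelta]


# Rows transcribed from the remediation decision matrix.
MATRIX = {
    "Public IP on workload VM": RemediationRule("High", "Automated Function", False, timedelta(minutes=5)),
    "Missing diagnostic settings": RemediationRule("Medium", "DeployIfNotExists Policy", False, timedelta(minutes=30)),
    "Missing resource tags": RemediationRule("Low", "Automated Function", False, timedelta(minutes=15)),
    "Non-compliant VM SKU": RemediationRule("High", "Logic App workflow", True, timedelta(hours=4)),
    "Public storage blob access": RemediationRule("Critical", "Automated Function", False, timedelta(minutes=2)),
    "Missing NSG on subnet": RemediationRule("High", "Logic App workflow", True, timedelta(hours=2)),
    "Encryption disabled": RemediationRule("Critical", "Logic App workflow", True, timedelta(hours=1)),
}

# Assumption: anything not in the matrix is never auto-remediated.
DEFAULT_RULE = RemediationRule("Unknown", "Logic App workflow", True, None)


def route(violation_type: str) -> RemediationRule:
    """Pick the remediation path for a violation type."""
    return MATRIX.get(violation_type, DEFAULT_RULE)


rule = route("Public storage blob access")
print(rule.path, rule.needs_approval, rule.target_mttr)
```

Defaulting unknown violation types to the approval path keeps new policies from silently gaining automated remediation before someone has classified them.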
6. Monitoring & Analytics Layer
Compliance analytics leverage Azure Resource Graph for cross-subscription querying, providing enterprise-scale compliance visibility across subscription boundaries that the Azure Policy portal's built-in views do not offer.
Azure Resource Graph — Cross-Subscription Compliance Queries:
kusto
kusto
kusto
Azure Monitor — Governance Telemetry:
Azure Policy compliance state change events logged to Log Analytics
Remediation task execution results — success, failure, and partial remediation outcomes
Policy assignment deployment events from CI/CD pipeline
Alert rules for compliance rate degradation — alerting when subscription compliance rate drops below defined threshold
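The threshold alert in the last item reduces to a simple rate calculation per subscription. The sketch below shows that arithmetic only; the subscription names, the counts, and the choice to treat an empty subscription as fully compliant are assumptions, and a real deployment would express this as an Azure Monitor alert rule over Log Analytics data.

```python
from typing import Dict, List, Tuple


def compliance_rate(compliant: int, total: int) -> float:
    """Percentage of compliant resources; an empty scope counts as 100%."""
    if total == 0:
        return 100.0
    return 100.0 * compliant / total


def below_threshold(counts: Dict[str, Tuple[int, int]], threshold_pct: float) -> List[str]:
    """Subscriptions whose compliance rate is under the alert threshold."""
    return sorted(
        sub for sub, (compliant, total) in counts.items()
        if compliance_rate(compliant, total) < threshold_pct
    )


# Hypothetical (compliant, total) resource counts per subscription.
counts = {
    "sub-prod-01": (95, 100),
    "sub-dev-01": (70, 100),
    "sub-empty": (0, 0),
}

print(below_threshold(counts, threshold_pct=90.0))
```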
7. Visualisation & Reporting Layer
Compliance reporting serves two audiences through separate visualisation tools — technical operators and executive governance stakeholders.
Azure Workbooks — Technical Compliance Dashboard:
Per-subscription compliance rate by policy initiative
Non-compliant resource inventory with drill-down to resource-level violation details
Remediation task status — pending, in-progress, completed, failed
Policy assignment coverage map — which initiatives are assigned to which management groups
Recent policy deployment history from CI/CD pipeline
Power BI — Executive Compliance Dashboard:
| Report | Audience | Refresh Frequency | Purpose |
|---|---|---|---|
| Enterprise Compliance Score | Executive / CISO | Daily | Overall governance posture KPI |
| Environment Compliance Comparison | Governance team | Daily | Dev/Test/Prod compliance rate comparison |
| Compliance Trend | Executive / Governance | Weekly | 90-day compliance rate trend |
| Top Violations | Governance team | Daily | Most frequent policy violations requiring attention |
| Remediation Performance | Operations | Daily | MTTR by violation type vs targets |
| Exemption Register | Compliance/Audit | Weekly | Active exemptions with expiry dates |
Compliance Score Methodology:
Architecture Diagram

Technologies Used
| Category | Technologies |
|---|---|
| Infrastructure as Code | Terraform |
| CI/CD & GitOps | GitHub Actions, Azure DevOps, YAML Pipelines |
| Policy Governance | Azure Policy, Azure Initiative Definitions, Management Groups |
| Cross-Subscription Analytics | Azure Resource Graph, KQL |
| Event-Driven Remediation | Azure Event Grid, Logic Apps, Azure Functions, Azure Automation Runbooks |
| Monitoring | Azure Monitor, Log Analytics |
| Reporting | Power BI, Azure Workbooks |
| Security & Compliance | Microsoft Defender for Cloud, Azure RBAC, Azure Key Vault |
| Compliance Frameworks | CIS Azure Benchmark v2.0, NIST SP 800-53, PCI DSS v4.0 |
Key Challenges Addressed
Maintaining policy consistency across multiple subscriptions: addressed through management group hierarchy policy inheritance. Universal controls assigned at enterprise management group level propagate to all subscriptions automatically, without per-subscription assignment management.
Integrating governance into CI/CD without slowing delivery: addressed through environment-tiered deployment pipelines in which development governance changes deploy automatically while production governance changes pass a two-approver gate. This maintains delivery velocity in lower environments without sacrificing production governance rigour.
Providing near real-time cross-subscription compliance visibility: addressed through Azure Resource Graph KQL queries that aggregate compliance state across all subscriptions in one query, supporting custom aggregation that the Azure Policy portal's built-in compliance views do not offer.
Automating remediation without introducing instability: addressed through a remediation decision matrix that separates automated low-impact remediations from human-approved high-impact actions. Automated remediation applies only where the non-compliant configuration has no plausible legitimate business use case.
Supporting environment-specific governance flexibility: addressed through a three-tier initiative design using Audit effects in development and Deny effects in production. The same policy definitions apply everywhere, with enforcement severity set per environment management group.
Scaling exemption governance: addressed through Terraform-managed exemptions with mandatory expiry dates, justification requirements, and a version-controlled audit trail, preventing the exemption accumulation that gradually erodes compliance posture.
Design Decisions & Rationale
Management Group Hierarchy as the Policy Distribution Foundation : Assigning policies at individual subscription level creates management overhead that scales linearly with subscription count. Management group hierarchy enables hierarchical inheritance — universal controls assigned once at the top propagate automatically to all child subscriptions. Environment-specific initiatives assigned at environment management group level apply consistently to all subscriptions within that environment tier without per-subscription configuration.
Environment-Aware Initiative Design with Effect Differentiation : Uniform Deny effects across all environments block legitimate development activities: developers testing configurations need flexibility that production cannot permit. A three-tier initiative design maps enforcement severity to operational risk, so development receives Audit awareness without operational blocking while production receives Deny enforcement without exception. The same policy definitions serve all environments with environment-specific effect parameters.
Event-Driven Remediation over Scheduled Remediation : Scheduled remediation runs (e.g. daily compliance remediation jobs) leave non-compliant resources exposed for the interval between runs. Event Grid-triggered remediation responds to compliance state changes in near real time — reducing the exposure window from hours to minutes for automated remediations and providing immediate notification for human-approved remediations.
Azure Resource Graph over Azure Policy Portal for Compliance Analytics : The Azure Policy compliance portal presents compliance through fixed, scope-based views, which is inadequate for custom reporting across enterprise estates spanning dozens of subscriptions. Azure Resource Graph queries execute across all subscriptions in one request, enabling cross-subscription compliance aggregation, trend analysis, and KQL-based custom compliance reporting that the portal cannot provide.
Mandatory Exemption Expiry Dates : Permanent policy exemptions accumulate over time as environments evolve — exempt resources become forgotten compliance gaps. Mandatory expiry dates on all exemptions through Terraform enforcement ensure exemptions are reviewed and either renewed with justification or removed when the underlying business need expires. Quarterly exemption review processes validate that active exemptions remain justified.
Separation of Enforcement and Remediation Layers : Combining Azure Policy enforcement and remediation in a single workflow creates risk — a misconfigured remediation action could cascade across large numbers of resources simultaneously. Separating enforcement (Azure Policy — detects violations) from remediation (Event Grid → Functions/Runbooks — corrects violations) enables independent testing, independent failure modes, and granular control over which violations trigger automated vs human-approved remediation.
Trade-offs & Design Constraints
Azure Policy DeployIfNotExists Remediation Timing Gap : DeployIfNotExists effect creates a remediation task that runs asynchronously after resource creation — there is a window between resource deployment and remediation completion where resources exist without required configurations. For compliance-critical controls (diagnostic settings, encryption), Terraform should explicitly configure these settings rather than relying on Policy remediation — Policy remediation should serve as a backstop for resources deployed outside IaC governance, not the primary configuration mechanism for IaC-managed resources.
Event Grid Compliance Event Volume at Scale : In large Azure estates with frequent resource changes, Azure Policy generates high volumes of compliance state change events. Event Grid handles high throughput but downstream Logic Apps and Azure Functions must be designed for concurrent execution — a compliance event storm following a large deployment could trigger thousands of simultaneous remediation events. Rate limiting, dead letter queuing, and idempotent remediation function design are essential for production remediation reliability.
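Two of the safeguards named above, idempotent handling and dead-lettering, can be sketched in a few lines. This is an in-memory illustration only: the `RemediationDispatcher` class and the event fields are hypothetical, rate limiting is not shown, and a production handler would keep its deduplication state in durable storage and rely on the platform's own retry and dead-letter features.

```python
from typing import Callable, List, Set, Tuple


class RemediationDispatcher:
    """Deduplicates remediation events and dead-letters repeated failures."""

    def __init__(self, remediate: Callable[[dict], None], max_attempts: int = 3):
        self._remediate = remediate
        self._max_attempts = max_attempts
        self._done: Set[Tuple[str, str]] = set()  # (resource, policy) already fixed
        self.dead_letter: List[dict] = []

    def handle(self, event: dict) -> str:
        key = (event["resource_id"], event["policy"])
        if key in self._done:
            return "skipped"  # duplicate delivery: do nothing
        for _ in range(self._max_attempts):
            try:
                self._remediate(event)
            except Exception:
                continue  # retry until attempts are exhausted
            self._done.add(key)
            return "remediated"
        self.dead_letter.append(event)  # keep for manual follow-up
        return "dead-lettered"


fixed = []
dispatcher = RemediationDispatcher(lambda e: fixed.append(e["resource_id"]))
event = {"resource_id": "vm1", "policy": "no-public-ip"}
print(dispatcher.handle(event))
print(dispatcher.handle(event))  # same event delivered again
```

Keying on the resource and policy pair means a burst of duplicate events for the same violation results in a single corrective action.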
Resource Graph Query Throttling : Azure Resource Graph queries are subject to throttling. The default quota is roughly 15 queries per 5-second window, allocated per user rather than per tenant. Power BI dashboards refreshing compliance data through Resource Graph queries must implement query result caching and refresh scheduling to avoid throttling. Direct Power BI → Resource Graph integration without caching creates throttling risk at enterprise scale.
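The caching recommendation can be sketched as a time-to-live wrapper around whatever function actually runs the query. Everything here is illustrative: `CachedQueryRunner` and the fake backend are hypothetical, and the wrapper makes no Azure calls itself.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CachedQueryRunner:
    """Serve repeated queries from a cache until the TTL expires."""

    def __init__(self, run_query: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._run_query = run_query
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the cache can be tested
        self._cache: Dict[str, Tuple[float, Any]] = {}

    def get(self, query: str) -> Any:
        now = self._clock()
        entry = self._cache.get(query)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh enough: no backend call
        result = self._run_query(query)
        self._cache[query] = (now, result)
        return result


backend_calls = []


def fake_backend(query: str) -> int:
    """Stand-in for a real Resource Graph call; returns the call count."""
    backend_calls.append(query)
    return len(backend_calls)


runner = CachedQueryRunner(fake_backend, ttl_seconds=300)
runner.get("policyresources | summarize count()")
runner.get("policyresources | summarize count()")
print(len(backend_calls), "backend call(s) for two dashboard refreshes")
```

With a TTL aligned to the dashboard refresh schedule, many viewers opening the same report produce one backend query per interval instead of one per view.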
Terraform Policy State Import Complexity : Importing existing manually configured Azure Policy definitions and assignments into Terraform state requires careful attribute mapping — policy rule JSON in existing definitions must exactly match Terraform resource attribute structure. Mismatches generate plan drift requiring careful reconciliation. A discovery-first approach — using Azure CLI to export existing policy definitions before writing Terraform resources — reduces import complexity.
Remediation Identity Permissions Scope : Azure Automation Runbooks and Azure Functions executing remediation actions require Azure RBAC permissions — typically Contributor on affected resource groups. Overly broad remediation identity permissions create risk if the remediation service is compromised. Permissions should be scoped to the minimum required for each remediation action type — separate managed identities per remediation function with purpose-specific role assignments rather than a single broadly-scoped remediation identity.
Projected Outcomes
The architecture is designed to deliver the following governance and operational outcomes in a production enterprise environment:
Consistent policy enforcement across all subscriptions through management group hierarchy inheritance — universal controls applied without per-subscription configuration management
Environment-appropriate governance through three-tier initiative design — development flexibility and production strictness enforced through the same policy definitions with differentiated effects
Near real-time compliance violation detection and automated remediation for defined violation categories through Event Grid-triggered orchestration
Cross-subscription compliance visibility through Azure Resource Graph KQL queries — enterprise-wide compliance posture queryable on demand
Executive governance reporting through Power BI dashboards with daily compliance score, trend analysis, and MTTR performance tracking
Auditable governance lifecycle through Terraform-managed policy definitions, GitOps version control, and CI/CD deployment history
Controlled exemption governance through mandatory expiry dates, justification requirements, and quarterly review processes preventing exemption accumulation
Future Evolution
OPA/Gatekeeper integration for Kubernetes workload governance — extending Policy-as-Code governance to AKS admission control through the same GitOps governance model
AI-assisted compliance anomaly detection — identifying unusual compliance degradation patterns indicating potential security incidents rather than routine configuration drift
Automated risk scoring and violation prioritisation — weighting compliance violations by asset criticality and regulatory impact for intelligent remediation sequencing
Cross-cloud governance federation — extending management group-equivalent governance patterns to AWS Organizations and GCP Resource Manager through Terraform multi-cloud provider management
Continuous compliance validation pipelines — scheduled Resource Graph compliance scans triggering pipeline alerts when compliance rate drops below defined thresholds
Self-healing remediation workflows — expanding automated remediation coverage as remediation patterns are proven stable through operational experience
FinOps governance integration — Azure Policy enforcement of cost governance controls (approved VM SKUs, required shutdown schedules, resource lifecycle tagging) through the same governance platform
Security posture benchmarking automation — automated CIS Azure Benchmark and NIST SP 800-53 compliance scoring through Defender for Cloud regulatory compliance integration
Key Takeaways
Management group hierarchy is the foundational Policy-as-Code design decision — policy inheritance eliminates per-subscription management overhead that scales unsustainably with subscription count
Environment-aware initiative design with effect differentiation is essential — uniform Deny enforcement across all environments blocks legitimate development activities; the same policy definitions should serve all environments with environment-specific effect parameters
Event-driven remediation dramatically reduces compliance violation exposure windows compared to scheduled remediation — real-time response versus hourly or daily remediation cycles
Azure Resource Graph is the correct tool for cross-subscription compliance analytics: the Azure Policy portal's built-in compliance views cannot support custom enterprise-scale compliance aggregation
Automated remediation must be bounded by a decision matrix — not all violations should trigger automated corrective actions; high-impact remediations require human approval to prevent operational disruption
Exemption governance requires mandatory expiry enforcement — permanent exemptions accumulate into compliance debt; Terraform-managed exemptions with expiry dates prevent this erosion
Separation of enforcement and remediation layers enables independent failure modes — Azure Policy detecting violations and Functions/Runbooks correcting them can be tested, operated, and failed independently
