Background
Cloud adoption had outpaced any attempt to govern it. Teams provisioned what they needed, resources accumulated with no attribution, and by the time finance flagged the monthly bill, nobody could answer the basic question: where is the money going?
Two years of decentralized Azure expansion had produced a sprawling environment across multiple subscriptions — inconsistent naming, no tagging, no centralized view of what was running or why. The executive team was paying the bill without understanding it.
Anil Choudhary led the cost optimization initiative, from the initial audit through governance implementation and the setup of an ongoing review framework.
The Challenge
No Cost Visibility
The Azure bill arrived as a monolithic number. Decomposing it into meaningful cost centres was not possible because:
- No tagging existed on the majority of resources — 70%+ of resources had no owner, environment, or application tag
- Multiple subscriptions with no consistent naming made identifying which teams owned which resources an exercise in guesswork
- Azure Cost Management showed cost by service type (compute, storage, networking) but could not attribute cost by team, application, or business unit
- Engineering teams had no visibility into the cost of their own workloads — there was no feedback loop connecting technical decisions to financial impact
Uncontrolled Provisioning
Resources were being created without any approval or review process:
- Development subscriptions had Contributor access for large groups of engineers
- No policy prevented provisioning of expensive resource types (high-SKU VMs, premium storage, large databases) without justification
- Resources created for one-time tasks or experiments were never decommissioned
- No auto-shutdown policies existed on development or test VMs
Idle and Oversized Resources
A detailed resource utilization audit revealed significant waste:
- VMs running 24/7 with CPU consistently below 5% — clearly oversized for their workload
- Development and test VMs running overnight and on weekends when no development was in progress
- Databases provisioned at high service tiers with minimal actual usage
- Storage accounts with data never accessed in months — retained "just in case"
- Orphaned resources: managed disks with no attached VM, public IPs with no associated resource, network interfaces unattached
No Reserved Instance Coverage
The organization was paying on-demand prices for workloads that had been running continuously for over a year. Reserved instances (1-year or 3-year commitments) on stable workloads offered significant discounts — but no one had assessed what was eligible.
Implementation
Phase 1: Tagging Policy and Remediation
A mandatory tagging policy was the foundation everything else depended on. Without tags, cost attribution was impossible.
Required Tags (enforced via Azure Policy)
| Tag | Values | Purpose |
|---|---|---|
| `environment` | dev / test / staging / prod | Environment identification |
| `owner` | Team or individual name | Accountability |
| `cost-centre` | Finance cost centre code | Financial allocation |
| `application` | Application or service name | Workload identification |
| `expiry-date` | ISO date or `permanent` | Identifies temporary resources |
Policy Implementation
```json
{
  "mode": "Indexed",
  "policyRule": {
    "if": {
      "allOf": [
        {
          "field": "type",
          "notIn": [
            "Microsoft.Resources/subscriptions",
            "Microsoft.Resources/resourceGroups"
          ]
        },
        {
          "anyOf": [
            { "field": "tags['environment']", "exists": false },
            { "field": "tags['owner']", "exists": false },
            { "field": "tags['cost-centre']", "exists": false },
            { "field": "tags['application']", "exists": false }
          ]
        }
      ]
    },
    "then": {
      "effect": "deny"
    }
  }
}
```
The policy was applied in audit mode first for 2 weeks to measure the scope of non-compliance, then switched to deny for all new resources. Existing untagged resources were remediated via a bulk tagging exercise using Azure Resource Graph queries to identify them and Azure CLI scripts to apply tags at scale.
```bash
# Bulk tag remediation — apply owner tag to all untagged resources in a subscription
# (--is-incremental appends tags rather than replacing any existing ones)
az resource list --subscription "$SUB_ID" --query "[?tags.owner==null].id" -o tsv | \
  xargs -I{} az resource tag --is-incremental --ids {} --tags owner="$TEAM_NAME" cost-centre="$CC_CODE"
```
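The deny rule in the policy above can be mirrored locally. A minimal Python sketch of the same `allOf`/`anyOf` evaluation (the resource dicts below are hypothetical, not real Azure objects, and Azure Policy itself evaluates this server-side):

```python
# Local sketch of the tagging policy's deny rule (illustrative only; the
# sample resources are made up).
REQUIRED_TAGS = ("environment", "owner", "cost-centre", "application")
EXEMPT_TYPES = {
    "Microsoft.Resources/subscriptions",
    "Microsoft.Resources/resourceGroups",
}

def is_denied(resource: dict) -> bool:
    """Deny when the resource is in scope and any required tag is missing."""
    if resource["type"] in EXEMPT_TYPES:  # mirrors the notIn condition
        return False
    tags = resource.get("tags") or {}
    # mirrors the anyOf block: one missing tag is enough to deny
    return any(tag not in tags for tag in REQUIRED_TAGS)

partially_tagged = {
    "type": "Microsoft.Compute/virtualMachines",
    "tags": {"owner": "platform-team"},
}
fully_tagged = {
    "type": "Microsoft.Compute/virtualMachines",
    "tags": {
        "environment": "dev",
        "owner": "platform-team",
        "cost-centre": "CC-1001",
        "application": "billing-api",
    },
}
print(is_denied(partially_tagged))  # → True
print(is_denied(fully_tagged))      # → False
```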
Phase 2: Cost Visibility Dashboard
With tagging in place, a Cost Management dashboard was built to give every stakeholder a view of their costs:
Executive Dashboard
- Total monthly spend vs. budget with trend line
- Top 5 cost centres by spend
- Month-over-month change by team
- Forecast vs. actual for current month
Team Dashboards
Each team received their own dashboard scope filtered to their cost-centre tag:
- Their current month's spend by application
- Top 5 most expensive resources
- Resources with `expiry-date` in the past (overdue for decommission)
- Recommendations from Azure Advisor for their resources
Budget Alerts
Budgets were set per cost centre at 80% and 100% thresholds, with email and Teams notifications to the team owner and their manager.
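The threshold logic behind those alerts is simple to sketch. A hedged Python illustration (the 80%/100% values come from the text; the function name and figures are assumptions):

```python
# Sketch of per-cost-centre budget alerting: return which thresholds have
# been crossed for a given spend-to-date (80% and 100%, as configured).
def fired_thresholds(spend: float, budget: float, thresholds=(0.80, 1.00)):
    return [t for t in thresholds if spend >= budget * t]

print(fired_thresholds(850.0, 1000.0))   # → [0.8]
print(fired_thresholds(1050.0, 1000.0))  # → [0.8, 1.0]
print(fired_thresholds(500.0, 1000.0))   # → []
```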
Phase 3: Idle and Oversized Resource Remediation
The utilization audit identified resources in three categories:
Category 1: Terminate immediately
- Orphaned managed disks (no attached VM) — 47 found, costing ~$1,200/month
- Unattached public IP addresses — 23 found
- Stopped VMs still incurring storage and IP costs — 31 found
- Expired temporary resources — 18 resources with past expiry dates
These were confirmed with resource owners and deleted within the first two weeks.
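Orphan detection of this kind reduces to a null-field filter over an inventory dump. A Python sketch: the field names (`managedBy` on disks, `ipConfiguration` on public IPs) match the Azure CLI's JSON output, but the sample records here are hypothetical:

```python
# Flag orphaned resources from inventory dumps (e.g. `az disk list` and
# `az network public-ip list -o json`). Sample data below is made up.
def find_orphans(disks, public_ips):
    orphan_disks = [d["name"] for d in disks if not d.get("managedBy")]
    orphan_ips = [p["name"] for p in public_ips if not p.get("ipConfiguration")]
    return orphan_disks, orphan_ips

disks = [
    {"name": "vm1-osdisk", "managedBy": ".../virtualMachines/vm1"},
    {"name": "old-data-disk", "managedBy": None},  # no attached VM: orphan
]
public_ips = [
    {"name": "lb-ip", "ipConfiguration": {"id": ".../ipConfigurations/ipc1"}},
    {"name": "stale-ip", "ipConfiguration": None},  # unattached: orphan
]
print(find_orphans(disks, public_ips))  # → (['old-data-disk'], ['stale-ip'])
```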
Category 2: Rightsize
VMs with consistently low CPU and memory utilization over 30 days were identified via Azure Monitor metrics:
```kusto
Perf
| where TimeGenerated > ago(30d)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize AvgCPU = avg(CounterValue), MaxCPU = max(CounterValue) by Computer
| where AvgCPU < 10 and MaxCPU < 30
| join kind=inner (
    Heartbeat | summarize by Computer, ResourceGroup, SubscriptionId
) on Computer
```
34 VMs were identified for rightsizing. A comparison of current SKU to recommended SKU (based on Azure Advisor recommendations and actual utilization) showed an average cost reduction of 40% per machine after resizing.
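The per-machine saving can be estimated before any resize by comparing SKU prices. A Python sketch with hypothetical monthly prices (the figures are placeholders constructed to illustrate the 40% average from the text, not real Azure rates):

```python
# Estimate rightsizing savings: current vs. recommended SKU monthly cost.
# Prices are hypothetical placeholders, not real Azure pricing.
candidates = [
    {"vm": "app-vm-01", "current": 280.0, "recommended": 140.0},  # 50% cheaper
    {"vm": "app-vm-02", "current": 200.0, "recommended": 140.0},  # 30% cheaper
]

def reduction(c):
    return 1 - c["recommended"] / c["current"]

avg = sum(reduction(c) for c in candidates) / len(candidates)
print(f"{avg:.0%}")  # → 40%
```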
Category 3: Implement auto-shutdown
Development and test VMs running outside of business hours were placed under an auto-shutdown schedule:
```bash
# Apply auto-shutdown to all VMs tagged environment=dev
# (az vm auto-shutdown interprets --time as UTC)
az vm list --query "[?tags.environment=='dev'].id" -o tsv | \
while read -r vm_id; do
  az vm auto-shutdown --ids "$vm_id" --time 1900
done
```
Development VMs shutting down at 19:00 UTC (7pm) and restarting manually when needed reduced compute spend on the development fleet by ~62%.
Phase 4: Reserved Instance Strategy
An analysis of the production fleet identified workloads with stable, predictable resource requirements — candidates for reserved instance commitments:
| Workload Type | Current Cost (on-demand) | Reserved 1-yr | Reserved 3-yr | Recommendation |
|---|---|---|---|---|
| Production web tier VMs | $8,400/mo | $5,460/mo (35% saving) | $4,200/mo (50% saving) | 1-year RI |
| Database servers | $6,200/mo | $3,720/mo (40% saving) | $2,790/mo (55% saving) | 3-year RI |
| App Service Plans | $3,100/mo | $2,170/mo (30% saving) | $1,860/mo (40% saving) | 1-year RI |
Reserved instances were purchased in phases — starting with the workloads with the highest confidence in stability. The total annual commitment reduced compute costs on covered workloads by an average of 38%.
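The reserved prices in the table follow directly from the on-demand rate and the discount for the recommended term. A quick Python check of that arithmetic, using the figures from the table above:

```python
# Recompute the reserved-instance prices from the table: on-demand rate
# times (1 - discount) for the recommended term.
workloads = [
    ("Production web tier VMs", 8400, 0.35),  # 1-year RI
    ("Database servers",        6200, 0.55),  # 3-year RI
    ("App Service Plans",       3100, 0.30),  # 1-year RI
]
for name, on_demand, discount in workloads:
    reserved = on_demand * (1 - discount)
    print(f"{name}: ${reserved:,.0f}/mo")
# → Production web tier VMs: $5,460/mo
# → Database servers: $2,790/mo
# → App Service Plans: $2,170/mo
```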
Phase 5: FinOps Governance Framework
The final phase established ongoing governance to prevent cost sprawl from recurring:
Monthly Cost Review
A structured monthly review was established with team leads and the finance team:
- Actual vs. budget per cost centre
- Resources flagged for decommission
- New resource spending reviewed against justification
- Reserved instance utilization reviewed (unused RI capacity is wasted)
Cost Review Board
For any resource with a monthly cost projection above a defined threshold, pre-provisioning approval was required from the Cost Review Board — a lightweight process (async Teams message with cost estimate) rather than a formal committee meeting.
Savings Tracking
A running savings log was maintained to track the cumulative impact of optimization activities, providing visibility for executive reporting.
Results
| Category | Monthly Saving |
|---|---|
| Terminated idle/orphaned resources | ~$4,200 |
| VM rightsizing | ~$6,800 |
| Dev/test auto-shutdown | ~$5,100 |
| Reserved instance commitments | ~$8,300 |
| Database tier optimization | ~$3,600 |
| Total monthly reduction | ~$28,000 (35%) |

| Governance Metric | Before | After |
|---|---|---|
| Resources with required tags | Below 30% | 100% (enforced) |
| Cost attribution by team | Impossible | Real-time dashboard per cost centre |
| Budget alerts | None | 80% and 100% thresholds per team |
| Reserved instance coverage | 0% | 68% of eligible production workloads |
| Dev VM overnight running | ~100% | Near zero — auto-shutdown enforced |
| Orphaned resource accumulation | Ongoing | Monthly review with expiry-date enforcement |
The 35% reduction in monthly spend happened within 10 weeks. But the more durable outcome is the governance framework. Cost visibility, tagging enforcement, and monthly reviews mean the savings compound rather than quietly erode. Without the governance layer, the same patterns would have re-emerged within a year.
