Background
The company had grown from a handful of engineers to 50+ without anyone stepping back to think about how deployments actually worked. The result was a patchwork: manual steps living in Confluence, environment scripts that only two people could reliably run, and approval gates that had accumulated over time with nobody sure which ones still served a purpose.
Features sat in staging for days. Velocity relative to competitors was slipping, and everyone felt it.
The Challenge
Three problems, each reinforcing the others:
- No pipeline standardisation: 14 microservices each had its own deployment script, maintained by a different team, with no shared pattern
- Manual gate overload: 6 approval steps between commit and production, most rubber-stamped without review because reviewers lacked the context to evaluate them
- Zero deployment observability: no way to see which version was running in which environment, whether a deployment succeeded, or what changed between releases
Approach
The solution was built in three layers: standardised builds, automated delivery, and integrated observability.
Containerised Build Standard
Every service was migrated to a common Docker build pattern:
```dockerfile
# Build stage: install all dependencies (including dev) so build tooling is available
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: fresh install of production dependencies only
FROM node:20-alpine AS runtime
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=builder /app/dist ./dist
USER node
EXPOSE 3000
CMD ["node", "dist/index.js"]
```
Images are tagged with the Git SHA, pushed to Azure Container Registry, and vulnerability-scanned with Trivy before promotion.
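The tag-scan-push flow can be sketched as a small pipeline script. The registry and service names (`myregistry.azurecr.io`, `orders-api`) are placeholders, and the scan step mirrors the Trivy invocation used in the pipeline:

```shell
#!/usr/bin/env sh
set -eu

# Compose the fully qualified image reference: <registry>/<service>:<git-sha>
image_ref() {
  printf '%s/%s:%s' "$1" "$2" "$3"
}

# Build, scan, and push one service image. Nothing is pushed
# unless the Trivy scan passes.
build_scan_push() {
  registry="$1"; service="$2"
  sha="$(git rev-parse --short HEAD)"
  image="$(image_ref "$registry" "$service" "$sha")"

  docker build -t "$image" .
  # --exit-code 1 makes HIGH/CRITICAL findings fail the pipeline step
  trivy image --exit-code 1 --severity HIGH,CRITICAL "$image"
  docker push "$image"
}
```

Usage from a pipeline step would look like `build_scan_push myregistry.azurecr.io orders-api` (after `az acr login`), with both names substituted for real ones.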
Azure DevOps Pipeline Architecture
```yaml
stages:
  - stage: Build
    jobs:
      - job: BuildAndScan
        steps:
          - task: Docker@2
            displayName: Build image
          - script: trivy image --exit-code 1 --severity HIGH,CRITICAL $(imageTag)
            displayName: Security scan

  - stage: Deploy_Dev
    condition: succeeded()
    jobs:
      - deployment: DeployDev
        environment: development

  - stage: Deploy_Staging
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - deployment: DeployStaging
        environment: staging

  - stage: Deploy_Production
    condition: succeeded()
    jobs:
      - deployment: DeployProd
        environment: production  # single approval gate
```
The key decision was collapsing 6 manual gates into a single production approval — backed by automated quality checks that gave the approver actual signal rather than just a checkbox to click through.
AKS Deployment with Rollback Safety
Deployments use Kubernetes rolling updates with readiness probes. If the new pods fail their readiness checks within the timeout window, the pipeline detects the stalled rollout and rolls the deployment back to the previous version.
```hcl
resource "azurerm_kubernetes_cluster" "main" {
  sku_tier = "Standard"

  default_node_pool {
    name                = "system"
    vm_size             = "Standard_D4s_v3"
    enable_auto_scaling = true
    min_count           = 2
    max_count           = 10
  }

  auto_scaler_profile {
    balance_similar_node_groups = true
    scale_down_delay_after_add  = "10m"
  }
}
```
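The rollback path can be sketched as a post-deploy pipeline step, assuming `kubectl` is already authenticated against the cluster; the deployment and namespace names are placeholders:

```shell
#!/usr/bin/env sh
set -eu

# Wait for the rollout to complete. If readiness probes keep failing,
# `kubectl rollout status` exits non-zero after the timeout and we
# undo to the previous ReplicaSet.
safe_rollout() {
  deployment="$1"; namespace="$2"
  if ! kubectl rollout status "deploy/${deployment}" \
      -n "$namespace" --timeout=300s; then
    echo "Rollout of ${deployment} failed, rolling back" >&2
    kubectl rollout undo "deploy/${deployment}" -n "$namespace"
    return 1
  fi
}
```

Returning non-zero after the undo keeps the pipeline run marked as failed, so a rollback is never silently recorded as a successful deployment.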
Datadog Observability Integration
Every deployment emits an event to Datadog with the service name, version, environment, and Git SHA. This surfaces as a deployment marker on every metric graph, making it trivial to correlate a metric change with a specific release.
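One way to emit such an event, sketched with `curl` against Datadog's v1 events API; the service, version, and tag scheme shown are illustrative, and `DD_API_KEY` is assumed to come from the pipeline's secret variables:

```shell
#!/usr/bin/env sh
set -eu

# Build the JSON payload for a Datadog deployment event.
deploy_event_payload() {
  service="$1"; version="$2"; environment="$3"; sha="$4"
  printf '{"title":"Deployed %s %s","text":"Git SHA %s","tags":["service:%s","version:%s","env:%s"]}' \
    "$service" "$version" "$sha" "$service" "$version" "$environment"
}

# Post the event to the Datadog events endpoint.
post_deploy_event() {
  curl -sS -X POST "https://api.datadoghq.com/api/v1/events" \
    -H "DD-API-KEY: ${DD_API_KEY}" \
    -H "Content-Type: application/json" \
    -d "$(deploy_event_payload "$@")"
}
```

Tagging the event with `service`, `version`, and `env` is what lets Datadog overlay it as a deployment marker on the matching metric graphs.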
Custom dashboards were built for:
- Deployment frequency per service
- Change failure rate (deployments that triggered an incident within 1 hour)
- Mean time to restore for rollback events
Results
| Metric | Before | After |
|---|---|---|
| Release cycle | 5 days | 90 minutes |
| Manual approval gates | 6 | 1 |
| Rollback time | 2-4 hours | 8 minutes (automated) |
| Deployment visibility | None | Full Datadog trace |
| Failed deployments caught pre-prod | ~40% | 94% |
Within two months of going live, the team went from one release per week to multiple per day. The pipeline didn't just get faster — it removed the reason to batch releases in the first place.
