Background
Years of cloud expansion had left the privileged access model in a difficult state. Admin accounts with permanent, always-on permissions had spread across Azure subscriptions, on-premises Active Directory, and a secondary cloud environment. Active 24 hours a day, whether or not anyone was actually doing admin work. Each one a credential waiting to be exploited.
The server estate was in a similar position. Three hosting environments, no consistent configuration baseline, security agents deployed manually at different times. Some servers were covered. Some had outdated agents. A subset had nothing at all. Nobody had a reliable picture of which was which.
Anil Choudhary led the design, implementation, and handover end-to-end — covering Azure PIM, Ansible automation, and the Zero Trust model that tied it together.
The Challenge
Privileged Access as a Persistent Attack Surface
In financial services, privileged accounts are the highest-value targets for attackers. The existing access model had several critical weaknesses:
- Always-on Global Administrators — multiple accounts with permanent subscription-level admin rights, active whether or not any administrative task was underway
- No time-bounding — once a role was assigned, it remained assigned indefinitely with no expiry
- No approval workflow — developers and engineers could self-assign elevated permissions in some areas without oversight
- No MFA on role activation — privileged operations could be performed without any additional authentication challenge
- No audit trail — there was no centralized log of who had elevated access, when they used it, or what actions they took while elevated
A single compromised credential with persistent Global Administrator rights would give an attacker unrestricted access to the entire Azure tenant — no time limit, no automatic detection. In financial services, that's an existential risk.
Hybrid Infrastructure Complexity
The server estate spanned three environments with different management planes:
- Azure VMs — managed through Azure Resource Manager, policy-eligible, Arc-connected where needed
- On-premises servers — Windows Server and Linux, managed through legacy tooling, no consistent agent baseline
- Secondary cloud workloads — VMs in a non-Azure environment with no current integration into the organization's security monitoring
Ensuring consistent configuration — security agents, hardening baselines, patch levels — across all three required a tool and approach that could operate uniformly regardless of hosting location.
Configuration Drift and Manual Deployment
The existing deployment model was fully manual:
- Security agents were installed one server at a time via RDP/SSH
- No configuration baseline existed — servers were provisioned and configured differently depending on who did the work
- Drift accumulation meant the security posture of any individual server was unknown without connecting to it directly
- Scaling security tooling to new servers required manual intervention each time
- The time between a server being provisioned and a security agent being installed on it was measured in days, not minutes
For an organization operating under financial services compliance requirements, this was a gap any auditor would flag.
No Zero Trust Enforcement
The security model in place was implicitly perimeter-based: once inside the network, entities were broadly trusted. There was no verification layer applied to internal traffic, privileged operations, or lateral movement within the environment.
Architecture
The solution was built on three interdependent pillars: identity governance through Azure PIM, automated workload protection through Ansible-driven Defender deployment, and continuous verification through integrated monitoring.
Zero Trust Control Mapping
| Zero Trust Principle | Implementation |
|---|---|
| Never trust, always verify | Azure PIM: every privileged action requires activation, MFA, and approval |
| Use least privilege access | Eligible assignments only — no permanent admin roles |
| Assume breach | Defender for Servers: continuous vulnerability scanning and threat detection |
| Verify explicitly | Conditional Access: device compliance + MFA required for all admin sessions |
| Automate response | Ansible: consistent enforcement across the full estate; Defender playbooks for remediation |
Azure Privileged Identity Management
Design Philosophy
The PIM implementation was built around a single principle: no account should hold elevated permissions unless actively performing a task that requires them. The window between activation and expiry should be as short as the task permits.
Role Assignment Model
Two assignment types were configured, with clear policies on when each applies:
```
Eligible Assignment (default for all admin roles)
└── User must explicitly activate the role
└── Activation requires: MFA challenge + justification text + optional approval
└── Role is active for a configured duration (max 8 hours)
└── Role expires automatically — no manual deactivation needed

Active Assignment (exceptions only, requires documented business justification)
└── Break-glass emergency accounts only
└── Monitored by Azure Sentinel with immediate alert on any usage
└── Reviewed quarterly — active assignments not re-justified are reverted to eligible
```
Permanent Active assignments for operational roles were removed entirely during the implementation. Every operational admin account — including Global Administrator — was converted to Eligible.
Role Configuration by Persona
| Role | Assignment Type | Activation Duration | Approval Required | MFA Required |
|---|---|---|---|---|
| Global Administrator | Eligible | 4 hours | Yes — two approvers | Yes |
| Privileged Role Administrator | Eligible | 4 hours | Yes — one approver | Yes |
| Subscription Owner | Eligible | 8 hours | Yes — one approver | Yes |
| Contributor | Eligible | 8 hours | No | Yes |
| Security Reader | Eligible | 8 hours | No | No |
| Break-glass account | Active (permanent) | N/A | N/A | N/A — monitored |
Activation Workflow
When an engineer needs elevated access, the process is:
- Navigate to PIM in the Azure portal or trigger activation via PowerShell/CLI
- Select the role and specify the duration required (up to the configured maximum)
- Provide a justification describing the task (stored in the audit log)
- Complete an MFA challenge
- If approval is required: wait for an approver to review and approve (approvers are notified via email and Teams)
- Role becomes active — the engineer receives notification and the activation is logged
- Role expires automatically at the end of the configured duration
```powershell
# Activating a role via PowerShell (AzureADPreview module, for scripted workflows)
$schedule = New-Object Microsoft.Open.MSGraph.Model.AzureADMSPrivilegedSchedule
$schedule.Type = "Once"
$schedule.Duration = "PT4H"   # ISO 8601 duration: 4 hours
$schedule.StartDateTime = (Get-Date).ToUniversalTime().ToString("yyyy-MM-ddTHH:mm:ss.fffZ")

Open-AzureADMSPrivilegedRoleAssignmentRequest `
    -ProviderId "aadRoles" `
    -ResourceId "<tenant-id>" `
    -RoleDefinitionId "<role-definition-id>" `
    -SubjectId "<user-object-id>" `
    -AssignmentState "Active" `
    -Type "UserAdd" `
    -Reason "Deploying security policy update - incident INC-20231105" `
    -Schedule $schedule
```
Access Reviews
Quarterly access reviews were configured for all PIM-eligible assignments:
- Review owners: team leads for each function (not the security team — ownership sits with the business)
- Scope: all eligible and active role assignments
- Outcome: assignments not actively re-justified within the review window are automatically removed
- Results reported to the compliance team for audit evidence
Audit and Monitoring
All PIM activations, approvals, and denials are logged to Azure AD audit logs and forwarded to the Log Analytics workspace:
```kusto
// PIM role activations in the last 24 hours
// (the activation event, not the eligible/permanent assignment events)
AuditLogs
| where TimeGenerated > ago(24h)
| where OperationName has "Add member to role in PIM completed (PIM activation)"
| extend
    RoleName = tostring(TargetResources[0].displayName),
    ActivatedBy = tostring(InitiatedBy.user.userPrincipalName),
    Justification = tostring(AdditionalDetails[0].value)
| project TimeGenerated, ActivatedBy, RoleName, Justification, Result
| order by TimeGenerated desc
```
A Sentinel analytics rule fires an alert on any Global Administrator or Privileged Role Administrator activation outside business hours — a pattern inconsistent with normal operations that warrants investigation.
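A minimal sketch of such a rule's query, assuming activations appear under the operation name shown and using an illustrative 07:00–19:00 UTC business-hours window:

```kusto
// Sketch: GA/PRA activations outside business hours (window boundaries are assumptions)
AuditLogs
| where OperationName has "Add member to role in PIM completed (PIM activation)"
| extend RoleName = tostring(TargetResources[0].displayName)
| where RoleName in ("Global Administrator", "Privileged Role Administrator")
| extend ActivationHour = hourofday(TimeGenerated)
| where ActivationHour < 7 or ActivationHour >= 19
| project TimeGenerated, RoleName,
    Activator = tostring(InitiatedBy.user.userPrincipalName)
```

In practice the window would be expressed in local time and exclude weekends; the rule's value is the signal, not the precise boundary.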
Workload Protection: Defender for Servers
Deployment Decision
Defender for Servers Plan 2 was selected over Plan 1 for its inclusion of Microsoft Defender for Endpoint (MDE) integration, file integrity monitoring, and the 500 MB/day free Log Analytics data ingestion per covered server — significant at scale.
Hybrid Onboarding via Azure Arc
The on-premises and secondary cloud servers were onboarded into Azure management via Azure Arc before Defender deployment. Arc extends the Azure control plane to non-Azure machines, enabling policy assignment, Defender coverage, and monitoring through the same tooling used for Azure VMs.
Arc onboarding was scripted and included in the Ansible playbooks so that new servers automatically joined the Arc management plane on first provisioning:
```yaml
# tasks/arc-onboarding.yml
- name: Download Azure Arc agent
  get_url:
    url: "https://aka.ms/azcmagent-linux"
    dest: /tmp/install_linux_azcmagent.sh
    mode: '0755'

- name: Install Azure Arc agent
  command: /tmp/install_linux_azcmagent.sh
  args:
    creates: /usr/bin/azcmagent

- name: Connect to Azure Arc
  command: >
    azcmagent connect
    --subscription-id "{{ azure_subscription_id }}"
    --resource-group "{{ azure_resource_group }}"
    --tenant-id "{{ azure_tenant_id }}"
    --location "{{ azure_location }}"
    --cloud AzureCloud
  environment:
    AZURE_CLIENT_ID: "{{ arc_service_principal_id }}"
    AZURE_CLIENT_SECRET: "{{ arc_service_principal_secret }}"
  register: arc_connect_result

- name: Verify Arc connection
  command: azcmagent show
  register: arc_status
  failed_when: "'Connected' not in arc_status.stdout"
```
Server Coverage Matrix
| Environment | Onboarding Method | Defender Plan | Agent Type |
|---|---|---|---|
| Azure VMs | Native — Defender for Cloud policy | Plan 2 | MMA / AMA |
| On-premises Windows | Azure Arc | Plan 2 | AMA via Arc |
| On-premises Linux | Azure Arc | Plan 2 | AMA via Arc |
| Secondary cloud VMs | Azure Arc | Plan 2 | AMA via Arc |
A Defender for Cloud policy was assigned at the management group level with DeployIfNotExists effect — any Azure VM not covered by Defender for Servers Plan 2 is automatically remediated by the policy engine, with no manual intervention required.
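As a sketch, assigning a DeployIfNotExists policy with a system-assigned managed identity (required for the policy engine to remediate) and kicking off remediation of existing machines might look like the following; the management group name, policy definition ID, and region are placeholders, not values from this engagement:

```
# Sketch: DeployIfNotExists policy assignment at management group scope (IDs are placeholders)
az policy assignment create \
  --name "deploy-defender-servers-p2" \
  --scope "/providers/Microsoft.Management/managementGroups/<mg-name>" \
  --policy "<built-in-policy-definition-id>" \
  --mi-system-assigned \
  --location "<region>"

# Remediate VMs that were already non-compliant at assignment time
az policy remediation create \
  --name "remediate-defender-servers" \
  --management-group "<mg-name>" \
  --policy-assignment "deploy-defender-servers-p2"
```

New and newly Arc-connected machines are evaluated and remediated automatically; the explicit remediation task is only needed for the existing backlog.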
Ansible Automation Framework
Architecture
The Ansible control plane was deployed as a dedicated Linux VM inside the management subnet, accessible only via Azure Bastion. No SSH port was exposed externally.
```
Management Subnet
└── Ansible Control Node (Ubuntu 22.04)
    └── SSH key-based authentication (no password auth)
    └── Inventory: dynamic (Azure Resource Graph + on-prem static)
    └── Vault: credentials stored in Ansible Vault, keys in Azure Key Vault

Client Nodes (all environments)
└── SSH key installed at provisioning time
└── Dedicated ansible service account (no shell login, sudo for specific commands only)
└── Firewall rule: inbound SSH from Ansible control node IP only
```
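The control-node setup above can be captured in `ansible.cfg`; a minimal sketch, assuming the vault password is fetched from Azure Key Vault by a small wrapper script (the script path and inventory paths are illustrative, not taken from the engagement):

```ini
# ansible.cfg on the control node (illustrative paths)
[defaults]
inventory = ./inventory/azure_rm.yml,./inventory/onprem.ini
remote_user = ansible-svc
private_key_file = /etc/ansible/keys/ansible_rsa
# Wrapper script that pulls the vault password from Azure Key Vault at runtime
vault_password_file = /usr/local/bin/fetch-vault-pass.sh
host_key_checking = True

[privilege_escalation]
become = True
become_method = sudo
```

Keeping the vault password out of the filesystem (fetched at runtime rather than stored) is what lets the control node itself hold no standing secrets.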
Inventory Management
A hybrid inventory was built to cover all three environments:
Azure VMs — queried dynamically via the Azure Resource Manager inventory plugin:
```yaml
# inventory/azure_rm.yml
plugin: azure.azcollection.azure_rm
auth_source: auto
include_vm_resource_groups:
  - rg-production-workloads
  - rg-development
keyed_groups:
  - key: tags.environment
    prefix: env
  - key: tags.os_type
    prefix: os
```
On-premises and Arc-connected servers — maintained in a static inventory file with group variables:
```ini
# inventory/onprem.ini
[linux_servers]
srv-prod-01.internal ansible_host=10.10.1.11
srv-prod-02.internal ansible_host=10.10.1.12
srv-dev-01.internal  ansible_host=10.10.2.11

[windows_servers]
win-prod-01.internal ansible_host=10.10.1.21 ansible_connection=winrm
win-prod-02.internal ansible_host=10.10.1.22 ansible_connection=winrm

[linux_servers:vars]
ansible_user=ansible-svc
ansible_ssh_private_key_file=/etc/ansible/keys/ansible_rsa
ansible_become=true
ansible_become_method=sudo
```
Core Playbooks
Security Baseline Playbook
Applied to every server on first provisioning and run on a weekly schedule via cron to detect and correct drift:
```yaml
# playbooks/security-baseline.yml
---
- name: Apply security baseline
  hosts: all
  become: true
  vars_files:
    - ../vault/secrets.yml

  tasks:
    - name: Ensure auditd is installed and running
      package:
        name: auditd
        state: present
      notify: restart auditd

    - name: Disable root SSH login
      lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^PermitRootLogin'
        line: 'PermitRootLogin no'
        state: present
      notify: restart sshd

    - name: Disable password authentication
      lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^PasswordAuthentication'
        line: 'PasswordAuthentication no'
        state: present
      notify: restart sshd

    - name: Set login banner
      copy:
        content: |
          Authorized access only. All activity is monitored and logged.
        dest: /etc/issue.net

    - name: Configure SSH to use login banner
      lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^Banner'
        line: 'Banner /etc/issue.net'
        state: present
      notify: restart sshd

    - name: Ensure unattended-upgrades is configured
      apt:
        name: unattended-upgrades
        state: present
      when: ansible_os_family == "Debian"

    - name: Enable automatic security updates
      debconf:
        name: unattended-upgrades
        question: unattended-upgrades/enable_auto_updates
        vtype: boolean
        value: 'true'
      when: ansible_os_family == "Debian"

  handlers:
    - name: restart sshd
      service:
        name: sshd
        state: restarted

    - name: restart auditd
      service:
        name: auditd
        state: restarted
```
Defender for Servers Deployment Playbook
The core automation that replaced manual agent installation across the entire server estate:
```yaml
# playbooks/deploy-defender.yml
---
- name: Deploy Microsoft Defender for Servers agent
  hosts: linux_servers
  become: true
  vars:
    workspace_id: "{{ lookup('azure_keyvault_secret', 'law-workspace-id') }}"
    workspace_key: "{{ lookup('azure_keyvault_secret', 'law-workspace-key') }}"

  tasks:
    - name: Check if OMS agent already installed
      stat:
        path: /opt/microsoft/omsagent/bin/omsagent
      register: oms_installed

    - name: Download OMS agent installer
      get_url:
        url: "https://raw.githubusercontent.com/Microsoft/OMS-Agent-for-Linux/master/installer/scripts/onboard_agent.sh"
        dest: /tmp/onboard_agent.sh
        mode: '0755'
      when: not oms_installed.stat.exists

    - name: Install and onboard OMS agent
      command: >
        /tmp/onboard_agent.sh
        -w "{{ workspace_id }}"
        -s "{{ workspace_key }}"
        -d opinsights.azure.com
      when: not oms_installed.stat.exists
      register: oms_install_result

    - name: Verify agent is running
      service:
        name: omsagent
        state: started
        enabled: yes
      register: agent_status

    - name: Confirm agent visible in workspace
      uri:
        url: "https://management.azure.com/subscriptions/{{ azure_subscription_id }}/resourceGroups/{{ resource_group }}/providers/Microsoft.OperationalInsights/workspaces/{{ workspace_name }}/computers?api-version=2020-08-01"
        method: GET
        headers:
          Authorization: "Bearer {{ azure_access_token }}"
      register: workspace_check
      until: workspace_check.json.value | selectattr('name', 'equalto', inventory_hostname) | list | length > 0
      retries: 12
      delay: 30

    - name: Log deployment result
      debug:
        msg: "Defender agent successfully deployed and confirmed on {{ inventory_hostname }}"
      when: workspace_check is succeeded
```
Drift Detection and Remediation
A weekly scheduled run compared the actual state of each server against the baseline and generated a compliance report:
```yaml
# playbooks/compliance-check.yml
---
- name: Security compliance check
  hosts: all
  become: true
  gather_facts: true

  tasks:
    - name: Check SSH root login disabled
      command: grep -c "^PermitRootLogin no" /etc/ssh/sshd_config
      register: root_login_check
      changed_when: false
      failed_when: false

    # One service_facts call populates the 'services' dict for all checks below
    - name: Gather service states (Defender agent, auditd)
      service_facts:

    - name: Compile compliance status
      set_fact:
        compliance:
          hostname: "{{ inventory_hostname }}"
          ssh_root_disabled: "{{ root_login_check.rc == 0 }}"
          defender_running: "{{ 'omsagent' in services and services['omsagent'].state == 'running' }}"
          auditd_running: "{{ 'auditd' in services and services['auditd'].state == 'running' }}"

    - name: Output compliance report
      delegate_to: localhost
      lineinfile:
        path: /var/log/ansible/compliance-{{ ansible_date_time.date }}.csv
        line: "{{ compliance.hostname }},{{ compliance.ssh_root_disabled }},{{ compliance.defender_running }},{{ compliance.auditd_running }}"
        create: yes
```
Windows Server Deployment via SCCM
For the Windows-heavy segments of the on-premises estate, SCCM (System Center Configuration Manager) was used as an alternative deployment mechanism alongside Ansible. SCCM provided:
- Centralized software distribution with pre-configured packages
- Staged deployment to pilot groups before full rollout
- Compliance reporting via the SCCM console
- Integration with Windows Server Update Services (WSUS) for patch management
The Defender for Endpoint package was distributed via SCCM collection targeting — servers in the "Security Unprotected" collection (automatically populated via compliance baseline evaluation) received the agent package automatically.
This dual-path approach — Ansible for Linux and mixed environments, SCCM for Windows-heavy on-premises — was a deliberate decision to use the most operationally efficient tool for each context rather than forcing a single deployment method across all environments.
Security Operations Integration
Continuous Monitoring
With PIM activation events and Defender for Servers telemetry both flowing into the Log Analytics workspace, a unified monitoring layer was built:
Privileged Access Monitoring
```kusto
// Privileged role activations with duration and justification
AuditLogs
| where TimeGenerated > ago(7d)
| where Category == "RoleManagement"
| where OperationName contains "Add member to role in PIM completed"
| extend
    RoleName = tostring(TargetResources[0].displayName),
    ActivatedBy = tostring(InitiatedBy.user.userPrincipalName),
    Justification = tostring(AdditionalDetails[0].value),
    Duration = tostring(AdditionalDetails[1].value)
| summarize ActivationCount = count() by ActivatedBy, RoleName, bin(TimeGenerated, 1d)
| order by ActivationCount desc
```
Defender Coverage Gap Detection
```kusto
// Servers whose last heartbeat is more than 12 hours old
// Note: the lookback must be wide, or servers silent for longer than the
// window would drop out of the results entirely instead of being flagged
Heartbeat
| where TimeGenerated > ago(30d)
| summarize LastSeen = max(TimeGenerated) by Computer, OSType, ResourceGroup
| where LastSeen < ago(12h)
| project Computer, OSType, ResourceGroup, LastSeen,
    HoursSinceLastSeen = datetime_diff('hour', now(), LastSeen)
| order by HoursSinceLastSeen desc
```
Configuration Drift Alert
A Sentinel analytics rule fired when Ansible's compliance report (ingested as a custom log) showed any server with Defender agent in a stopped state:
```kusto
AnsibleComplianceReport_CL
| where TimeGenerated > ago(25h)   // slight overlap with the 24h report cadence
| where defender_running_b == false
| project TimeGenerated, hostname_s, ssh_root_disabled_b, auditd_running_b
```
Incident Response
For Defender-generated critical alerts, a Logic App playbook was configured to:
- Extract the affected server hostname from the Sentinel incident
- Check PIM logs to identify any privileged activations on that server in the preceding 4 hours
- If a PIM activation correlates with the alert timeline: automatically notify the security team with the activation details and justification provided
- If no PIM activation: escalate as a potential unauthorized access incident with higher priority
- Create an ITSM ticket with the full context — Defender alert, correlated PIM events, server compliance status
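The correlation step above can be sketched in KQL, assuming the Logic App injects the incident's alert time as a parameter before running the query (the `{alert_time}` placeholder and the 4-hour window are illustrative):

```kusto
// Sketch: PIM activations in the 4 hours preceding a Defender alert
// {alert_time} is a placeholder substituted by the Logic App at runtime
let AlertTime = datetime({alert_time});
AuditLogs
| where TimeGenerated between ((AlertTime - 4h) .. AlertTime)
| where OperationName has "Add member to role in PIM completed"
| project TimeGenerated,
    ActivatedBy = tostring(InitiatedBy.user.userPrincipalName),
    RoleName = tostring(TargetResources[0].displayName),
    Justification = tostring(AdditionalDetails[0].value)
```

An empty result set is the interesting case: privileged activity on the server with no corresponding PIM activation is what triggers the higher-priority escalation path.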
Results
| Dimension | Before | After |
|---|---|---|
| Standing privileged accounts | Multiple permanent Global Admins | Zero — all converted to eligible with JIT |
| Average privilege exposure window | Always-on (24/7/365) | Maximum 8 hours per activation, audit-logged |
| MFA on privileged operations | Not enforced | Mandatory on every role activation |
| Approval workflow | None | Configured for all Contributor and above |
| Server security agent coverage | Inconsistent — unknown actual percentage | 100% — enforced via policy + Ansible |
| Time to deploy agent on new server | Days (manual) | Minutes (Ansible playbook on first provision) |
| Configuration drift detection | None | Weekly automated scan with remediation |
| Cross-environment visibility | Siloed per hosting platform | Unified in Log Analytics via Azure Arc |
| Compliance reporting | Manual extraction per audit cycle | Continuous — Defender for Cloud compliance dashboard |
| Privileged access audit trail | Fragmented AD logs | Centralised PIM audit log with full activation history |
Compromising an admin credential now doesn't get an attacker much. The account is eligible but inactive — it still needs MFA, a justification, and in most cases, approval before anything actually happens. That's a fundamentally different risk profile.
The Ansible automation brought the server estate under a consistent baseline for the first time. Weekly drift detection means it can't quietly degrade between audits. The security posture the audit sees is the one that's actually in place.
