SRE
SLOs, incidents, postmortems.
The SRE is the persona that keeps production honest. In an AI-native SDLC, the SRE operates a stack of validated primitives, not a wall of dashboards.
Executive summary
The SRE owns production reliability: service level objectives, incident response, on-call operations, toil reduction, and postmortems. In an AI-native SDLC, the SRE operates inside the Operation phase with a fixed set of primitives: one incident agent, four slash prompts, scoped instructions, schema-validated hooks, and a curated list of validated MCPs. SLOs are defined in Azure Monitor, incident communication flows through Microsoft Teams via the M365 Agents SDK, and postmortems live as versioned markdown in GitHub. The primary outputs are SLO definitions, incident briefs, postmortem documents, and toil reduction proposals.
Role and responsibilities
Think of the SRE like a hospital attending physician on night shift. They do not build the hospital, nor design the treatments, but when a patient crashes they lead the room: triage, stabilize, diagnose, document, and feed the learnings into protocol. The measure of an attending is not heroics on a single night; it is the rate at which the same crisis stops happening. In an AI-native SDLC, the hospital is the production estate, the patient is the SLO, and the protocol is the runbook library backed by agentic primitives.
Primary responsibilities:
- Define and maintain SLOs and error budgets in Azure Monitor workbooks
- Lead incident response: triage, mitigation, communication, recovery
- Coordinate on-call via Microsoft Teams with the M365 Agents SDK-backed incident bot
- Author postmortems in GitHub and drive the action items to closure
- Reduce toil through runbook automation and hook enforcement
- Partner with DevOps Engineer and Release Manager on deployment safety
- Operate the Incident Captain agent and the /slo-review, /incident-brief, /postmortem-draft, and /toil-scan prompts
Jobs to be done
- As an SRE, I want SLOs reviewed monthly with error budget burn, so that reliability is a budget, not a mood.
- As an SRE, I want an incident brief drafted in the first 5 minutes, so that responders work from the same understanding.
- As an SRE, I want postmortems drafted from incident telemetry, so that the document writes itself while humans focus on action items.
- As an SRE, I want toil scanned continuously, so that repetitive manual work is converted to code, not absorbed.
- As an SRE, I want runbooks executable from the incident bot in Teams, so that mitigation is a conversation, not a wiki hunt.
- As an SRE, I want the same incident class to never ship twice, so that the action item backlog is short and closed.
Pain points before AI-native
- SLOs as slogans. Every service claims 99.9 percent; nobody computes burn. The first real outage exposes the lie.
- Incident chaos. The first 15 minutes are “who is seeing what.” Facts are collected on Slack, lost when the thread scrolls.
- Postmortems late and thin. Written two weeks later, signed off without action items, filed in a folder nobody reads.
- Toil invisible. Engineers burn afternoons restarting pods and rotating certs. Nobody measures it; nobody budgets against it.
- Runbooks rot. The runbook was correct three architectures ago. Responders skip it and improvise.
AI-native daily workflow
The SRE operates a fixed loop each day. The loop uses GitHub Copilot primitives inside Visual Studio Code and Claude Code at the terminal, plus a small catalog of validated MCPs for external context.
Morning setup
- Open the reliability repo in Visual Studio Code. GitHub Copilot Chat loads AGENTS.md and the scoped .github/instructions/reliability.instructions.md.
- In Claude Code, run the daily reliability briefing that queries the Azure MCP for the previous 24 hours of SLO burn, Azure Monitor alerts, and Application Insights anomalies.
- Review the open incident backlog and action item aging in GitHub Projects.
- Confirm the on-call schedule for the day in Microsoft Teams.
Midday execution
Each midday cycle is a single reliability improvement or planned incident drill, typically 2 to 3 hours of focused work.
- SLO review. Invoke /slo-review monthly to recompute error budgets, identify services burning faster than expected, and flag SLOs that are no longer meaningful.
- Toil scan. Invoke /toil-scan to read the previous sprint’s runbook executions and manual interventions. The agent proposes automations, prioritized by hours saved.
- Runbook update. Edit the runbooks affected by recent architecture changes; the pre-merge hook validates the runbook schema and the linked dashboards.
- Pull request. Open PRs for SLO changes, runbook updates, and toil reduction automation. GitHub Copilot Code Review scans diffs.
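The error budget recomputation behind the SLO review step can be sketched as follows. This is a minimal illustration, not the prompt's actual logic: the class and function names, the 28-day window, and the request counts are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one service over one SLO window (hypothetical shape)."""
    target: float          # e.g. 0.999 for a 99.9 percent success SLO
    total_requests: int
    failed_requests: int


def error_budget_consumed(w: SloWindow) -> float:
    """Fraction of the window's error budget already spent (1.0 = fully spent)."""
    allowed_failures = (1.0 - w.target) * w.total_requests
    if allowed_failures == 0:
        return float("inf") if w.failed_requests else 0.0
    return w.failed_requests / allowed_failures


def allowed_downtime_minutes(target: float, window_days: int = 28) -> float:
    """Time-based view of the same budget: minutes of full outage tolerated."""
    return (1.0 - target) * window_days * 24 * 60


# Illustrative numbers only.
checkout = SloWindow(target=0.999, total_requests=2_000_000, failed_requests=1_500)
print(f"budget consumed: {error_budget_consumed(checkout):.0%}")
print(f"allowed downtime: {allowed_downtime_minutes(0.999):.1f} min")
```

With these sample numbers, 1,500 failures against an allowance of 2,000 means about three quarters of the budget is gone, and a 99.9 percent target over 28 days tolerates roughly 40 minutes of full outage.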
Afternoon incident response (when an incident fires)
- Brief. When Azure Monitor fires an SLO-breaching alert, the M365 Agents SDK-backed incident bot opens a Teams channel and invokes /incident-brief. The Incident Captain agent produces a 5-minute brief from alerts, recent deploys, and Application Insights errors.
- Stabilize. Responders follow the linked runbook; mitigation commands are executed from the incident bot with audit trail in the Teams channel.
- Communicate. Status updates are posted to Teams on a cadence (5 min for high-severity, 15 min for medium). Stakeholders read the channel, not separate emails.
- Postmortem. Within 48 hours of resolution, invoke /postmortem-draft to produce a timeline, contributing factors, and action items from the incident telemetry. The SRE edits for narrative and assigns owners in GitHub Projects.
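The communication cadence above is simple enough to express as code. The sketch below only computes when updates are due; the function name and the idea of a fixed horizon are assumptions for illustration, and posting to Teams is out of scope.

```python
from datetime import datetime, timedelta

# Cadence from the workflow above: 5 minutes for high severity, 15 for medium.
UPDATE_INTERVAL = {
    "high": timedelta(minutes=5),
    "medium": timedelta(minutes=15),
}


def update_schedule(opened_at: datetime, severity: str,
                    horizon: timedelta) -> list[datetime]:
    """Return the times at which a status update is due within the horizon."""
    interval = UPDATE_INTERVAL[severity]
    due = []
    t = opened_at + interval
    while t <= opened_at + horizon:
        due.append(t)
        t += interval
    return due


opened = datetime(2026, 4, 24, 14, 0)
for t in update_schedule(opened, "high", timedelta(minutes=30)):
    print(t.strftime("%H:%M"))
```

A bot could compare this schedule against the timestamps of posted updates and nudge the incident lead when one is overdue.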
Recommended primitives
Agents
| Agent | File | Purpose |
|---|---|---|
| incident-captain | .github/agents/incident-captain.agent.md | Lead incident brief, postmortem draft, SLO review, and toil scan |
The Incident Captain agent uses claude-sonnet-4-6 by default. Its tools are read, edit, search, grep, glob, and bash, plus MCP bindings to Azure MCP, GitHub MCP, and Microsoft 365 Agents SDK MCP. Extended thinking is enabled for causal reasoning in postmortems.
Prompts
| Command | File | Purpose |
|---|---|---|
| /slo-review | .github/prompts/slo-review.prompt.md | Review error budget burn and flag SLOs that need revision |
| /incident-brief | .github/prompts/incident-brief.prompt.md | Produce a 5-minute incident brief from alerts, recent deploys, and errors |
| /postmortem-draft | .github/prompts/postmortem-draft.prompt.md | Draft the postmortem from incident telemetry, conversation, and runbook execution logs |
| /toil-scan | .github/prompts/toil-scan.prompt.md | Identify toil from the previous sprint and propose automations prioritized by hours saved |
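The "prioritized by hours saved" ordering that /toil-scan proposes can be illustrated with a small ranking routine. This is a sketch under assumptions: the record shape, task names, and numbers are hypothetical, and the real prompt works from runbook execution logs rather than a hand-built list.

```python
from dataclasses import dataclass


@dataclass
class ToilItem:
    """One recurring manual task observed in a sprint (hypothetical shape)."""
    name: str
    runs_per_sprint: int
    minutes_per_run: float

    @property
    def hours_per_sprint(self) -> float:
        return self.runs_per_sprint * self.minutes_per_run / 60


def prioritize(items: list[ToilItem]) -> list[ToilItem]:
    """Order automation candidates by hours they would save, largest first."""
    return sorted(items, key=lambda i: i.hours_per_sprint, reverse=True)


observed = [
    ToilItem("restart stuck pods", runs_per_sprint=24, minutes_per_run=10),
    ToilItem("rotate certificates", runs_per_sprint=3, minutes_per_run=90),
    ToilItem("clear dead-letter queue", runs_per_sprint=6, minutes_per_run=15),
]
for item in prioritize(observed):
    print(f"{item.hours_per_sprint:4.1f} h  {item.name}")
```

Ranking by total hours rather than by frequency matters here: the certificate rotation runs only three times but costs more than the two dozen pod restarts.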
Instructions
Scoped applyTo reduces token cost by approximately 68 percent compared to global instructions.
| Scope (applyTo) | File | Purpose |
|---|---|---|
| slo/**/*.yaml | .github/instructions/slo.instructions.md | SLO definition schema, burn rate windows, budget policies |
| runbooks/**/*.md | .github/instructions/runbook.instructions.md | Runbook format: symptoms, checks, mitigations, validations |
| postmortems/**/*.md | .github/instructions/postmortem.instructions.md | Blameless postmortem template, action item discipline |
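Scoping works because only the instruction files whose pattern matches the file being edited are loaded. The sketch below approximates that selection with a hand-written glob translator. The glob semantics in the docstring are an assumption; the matching rules actually applied to applyTo may differ.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a path glob into a compiled regex.

    Assumed semantics: "**/" spans zero or more directories, "*" stays
    within one path segment, "?" matches one non-separator character.
    """
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append(r"(?:[^/]+/)*")
            i += 3
        elif pattern.startswith("**", i):
            out.append(r".*")
            i += 2
        elif pattern[i] == "*":
            out.append(r"[^/]*")
            i += 1
        elif pattern[i] == "?":
            out.append(r"[^/]")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")


# applyTo scopes and instruction files from the table above.
SCOPES = {
    "slo/**/*.yaml": ".github/instructions/slo.instructions.md",
    "runbooks/**/*.md": ".github/instructions/runbook.instructions.md",
    "postmortems/**/*.md": ".github/instructions/postmortem.instructions.md",
}


def instructions_for(path: str) -> list[str]:
    """Return the instruction files whose scope matches the given repo path."""
    return [f for glob, f in SCOPES.items() if glob_to_regex(glob).match(path)]


print(instructions_for("slo/checkout/availability.yaml"))
print(instructions_for("src/app.py"))
```

A file outside all three scopes loads none of these instruction files, which is where the token saving comes from.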
Skills
Skills are lazy-loaded, so the SRE can install many and pay tokens only for the ones that trigger.
- burn-rate-reader: calls the Azure MCP to compute multi-window multi-burn-rate alerts on SLOs
- action-item-tracker: ensures every postmortem action item is opened as a GitHub issue with owner and due date
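A multi-window multi-burn-rate check, the technique named for burn-rate-reader, can be sketched as below. The window pairs and thresholds follow the commonly cited Google SRE Workbook example for a 30-day SLO window; they are an assumption here, not values the skill is known to use, and the error ratios are made up.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1.0 - slo_target)


# (long window, short window, burn rate threshold). Assumed values, see above.
RULES = [
    ("1h", "5m", 14.4),
    ("6h", "30m", 6.0),
]


def should_page(error_ratios: dict[str, float], slo_target: float) -> bool:
    """Page only when both the long and the short window exceed the threshold.

    The long window shows the burn is significant; the short window shows it
    is still happening, so the alert clears quickly after recovery.
    """
    for long_w, short_w, threshold in RULES:
        long_burn = burn_rate(error_ratios[long_w], slo_target)
        short_burn = burn_rate(error_ratios[short_w], slo_target)
        if long_burn >= threshold and short_burn >= threshold:
            return True
    return False


# Error ratios per window, as they might come back from a metrics query.
ongoing = {"1h": 0.02, "5m": 0.03, "6h": 0.004, "30m": 0.02}
recovered = {"1h": 0.02, "5m": 0.0002, "6h": 0.004, "30m": 0.001}
print(should_page(ongoing, 0.999))
print(should_page(recovered, 0.999))
```

The second sample shows why two windows are used: the one-hour window still looks bad after recovery, but the five-minute window has dropped, so no page is sent.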
Hooks
Hooks cost zero LLM tokens. They are the strongest governance layer.
- pre-commit: validate SLO YAML schema and runbook front matter
- pre-merge: require action item issues linked on every merged postmortem
- post-incident: open the postmortem draft PR within 48 hours, escalate if not merged within 7 days
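A hook costs no tokens because it is an ordinary deterministic check. As a rough sketch of what the pre-commit validation might look like, the example below checks an SLO definition that has already been parsed from YAML into a dictionary. The field names are hypothetical; the real schema is whatever the repository defines.

```python
# Hypothetical SLO definition fields; the real schema lives in the repo.
REQUIRED_FIELDS = {
    "service": str,
    "objective": float,
    "window_days": int,
    "owner": str,
}


def validate_slo(definition: dict) -> list[str]:
    """Return human-readable schema violations (an empty list means valid)."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in definition:
            errors.append(f"missing field: {field}")
        elif not isinstance(definition[field], expected):
            errors.append(f"{field} must be {expected.__name__}")
    objective = definition.get("objective")
    if isinstance(objective, float) and not 0.0 < objective < 1.0:
        errors.append("objective must be a fraction between 0 and 1")
    return errors


good = {"service": "checkout", "objective": 0.999, "window_days": 28, "owner": "sre-team"}
bad = {"service": "checkout", "objective": 99.9, "window_days": "28"}
print(validate_slo(good))
print(validate_slo(bad))
```

In a real hook the script would exit non-zero when the list is non-empty, which blocks the commit.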
Validated MCPs
Every MCP below is registered in the MCP catalog. Do not reference any MCP that is not in the catalog.
| MCP | Status | Use in this persona |
|---|---|---|
| Azure MCP Server | Official (Microsoft) | Query Azure Monitor, Application Insights, Log Analytics for SLO burn, alerts, and incident telemetry |
| GitHub MCP Server | Official | Open postmortem PRs, track action items, read deploy history |
| Microsoft 365 Agents SDK MCP | Official (Microsoft) | Operate the incident bot in Teams: channel creation, status updates, runbook execution audit |
| Azure DevOps MCP Server | Official (Microsoft) | Read release pipelines and link incidents to release trains |
| Microsoft Learn Docs MCP | Official | Fetch Azure reliability and observability reference guidance while authoring runbooks |
| Playwright MCP | Official (Microsoft) | Run synthetic probes from the incident bot to validate recovery |
Real examples
Scenario A: a p0 incident fires on checkout latency
Input: Azure Monitor fires a 2-hour burn rate alert on the checkout SLO (99.9 percent success, 300 ms p95). Application Insights shows a 4x error rate on the CheckoutController.
Invocation: The incident bot in Teams auto-invokes /incident-brief on alert fire.
Expected output:
- A Teams channel inc-2026-04-24-checkout-latency created within 60 seconds.
- An incident brief posted: current SLO burn, last 3 deploys (service and dependency), top 5 error signatures from Application Insights, linked runbook.
- Responders execute the mitigation step (“scale out to 2x, flip feature flag new-pricing-engine off”) from the bot with audit trail.
- Status updates every 5 minutes for the first 30 minutes; incident resolved within 42 minutes.
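The brief in Scenario A is mostly selection and ranking over telemetry that has already been fetched. The sketch below shows that assembly step only. Function names, the input shapes, the service names, and the runbook path are all hypothetical, and the SLO burn figure is omitted to keep it short.

```python
import re
from collections import Counter
from datetime import date, datetime


def channel_name(opened_on: date, summary: str) -> str:
    """Build a channel name of the form inc-YYYY-MM-DD-short-summary."""
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    return f"inc-{opened_on.isoformat()}-{slug}"


def build_brief(summary, opened_on, deploys, error_signatures, runbook):
    """Assemble the brief fields from already-fetched telemetry."""
    return {
        "channel": channel_name(opened_on, summary),
        # Most recent deploys first, capped at three.
        "last_deploys": sorted(deploys, key=lambda d: d[1], reverse=True)[:3],
        # Most frequent error signatures, capped at five.
        "top_errors": Counter(error_signatures).most_common(5),
        "runbook": runbook,
    }


# Illustrative inputs: (service, deployed_at) pairs and raw error signatures.
deploys = [
    ("checkout-api", datetime(2026, 4, 24, 9, 30)),
    ("pricing-engine", datetime(2026, 4, 24, 13, 40)),
    ("payments-gateway", datetime(2026, 4, 23, 16, 0)),
    ("checkout-web", datetime(2026, 4, 22, 11, 0)),
]
errors = ["TimeoutException"] * 7 + ["NullReferenceException"] * 2 + ["SqlException"]

brief = build_brief("Checkout latency", date(2026, 4, 24), deploys, errors,
                    runbook="runbooks/checkout-latency.md")
print(brief["channel"])
print(brief["last_deploys"][0][0], brief["top_errors"][0])
```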
Scenario B: draft a postmortem
Input: The checkout incident from Scenario A is resolved. The SRE invokes /postmortem-draft the next morning.
Invocation: /postmortem-draft with the incident channel ID.
Expected output:
- A postmortem postmortems/2026-04-24-checkout-latency.md with timeline, contributing factors, impact, and proposed action items.
- Five action items opened as GitHub issues, each with owner and due date, grouped by theme (observability, rollout safety, rollback automation).
- A link to the related release train and the feature flag that was flipped off during mitigation.
- A draft PR that the SRE edits for narrative before merging.
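Before such a draft PR merges, a check can confirm that every action item points at an issue. The sketch below is one possible form of that check. It assumes the postmortem has an "Action items" markdown heading and that linked items contain a reference like #123; both conventions are assumptions, not the documented template.

```python
import re

ISSUE_REF = re.compile(r"#\d+")


def unlinked_action_items(markdown: str) -> list[str]:
    """Return action item bullets that do not reference a GitHub issue."""
    missing, in_section = [], False
    for line in markdown.splitlines():
        if line.startswith("#"):
            # Any heading either opens or closes the "Action items" section.
            in_section = line.lstrip("# ").strip().lower() == "action items"
        elif in_section and line.lstrip().startswith("- "):
            if not ISSUE_REF.search(line):
                missing.append(line.strip())
    return missing


draft = """\
# Checkout latency

## Timeline
- 14:00 alert fired

## Action items
- Add burn rate alert for pricing dependency (#101)
- Automate feature flag rollback

## Impact
- Elevated checkout errors
"""
print(unlinked_action_items(draft))
```

Here the second action item has no issue reference, so the check reports it and the merge would be blocked until an issue is opened and linked.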
Anti-patterns
- SLOs without burn rate alerts. SLOs defined but nothing alerts until the budget is fully consumed. Mitigation: multi-window multi-burn-rate alerts via the burn-rate-reader skill.
- Heroic incident response. One senior engineer in their DMs fixes every incident. Mitigation: incident bot in Teams enforces brief, channel, and audit trail for every p0 and p1.
- Postmortems filed, not read. Written, signed, ignored. Mitigation: action-item-tracker ensures each postmortem opens real issues with owners and due dates.
- Toil absorbed. Engineers rotate certs and restart pods without logging. Mitigation: /toil-scan reads the runbook execution logs and proposes automations.
- Runbook drift. Runbooks describe an old architecture. Mitigation: the pre-merge hook validates the runbook schema and checks that the linked dashboards return 200.
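The runbook drift mitigation amounts to extracting links and checking their status. In the sketch below the HTTP call is passed in as a function so the example runs without network access; a real hook might wrap urllib.request or an authenticated client. The URLs and the function names are placeholders.

```python
import re
from typing import Callable

# Matches http(s) URLs up to whitespace or a closing bracket.
URL = re.compile(r"https?://[^\s)>\]]+")


def broken_links(runbook_markdown: str,
                 fetch_status: Callable[[str], int]) -> list[str]:
    """Return linked URLs whose status code is not 200."""
    urls = sorted(set(URL.findall(runbook_markdown)))
    return [u for u in urls if fetch_status(u) != 200]


runbook = (
    "Check the [latency dashboard](https://example.com/dash/latency) and\n"
    "the [error dashboard](https://example.com/dash/errors).\n"
)

# Stand-in for real HTTP requests: a fixed table of status codes.
fake_status = {
    "https://example.com/dash/latency": 200,
    "https://example.com/dash/errors": 404,
}
print(broken_links(runbook, lambda url: fake_status.get(url, 404)))
```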
KPIs and impact metrics
The SRE persona is evaluated with a mix of SRE and DORA metrics.
| Metric | Baseline (manual) | Target (agentic) | Measurement |
|---|---|---|---|
| SLO coverage | 30 percent of services | > 95 percent | Services with defined SLOs in Azure Monitor |
| Error budget burn visibility | Weekly | Continuous | Burn rate dashboards per service |
| Incident brief latency | 30 min | < 5 min | Time from alert fire to brief posted |
| Mean time to mitigate | 90 min | < 30 min | Azure Monitor incident duration |
| Mean time to restore | 4 hours | < 1 hour | Full recovery to SLO |
| Postmortem completion rate | 50 percent | > 95 percent | Incidents with merged postmortem within 7 days |
| Action item closure rate | 40 percent | > 80 percent | Action items closed within quarter |
| Toil percentage | 50 percent | < 25 percent | On-call hours on manual interventions |
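Two of the metrics in the table can be computed directly from incident timestamps. The sketch below assumes a simple record per incident and counts the 7 days from resolution; the record shape and that starting point are assumptions, and the sample data is invented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    """Timestamps for one incident (hypothetical record shape)."""
    alert_fired: datetime
    brief_posted: datetime
    resolved: datetime
    postmortem_merged: Optional[datetime] = None


def mean_brief_latency_minutes(incidents: list[Incident]) -> float:
    """Average time from alert fire to brief posted, in minutes."""
    total = sum((i.brief_posted - i.alert_fired).total_seconds() for i in incidents)
    return total / 60 / len(incidents)


def postmortem_completion_rate(incidents: list[Incident],
                               limit: timedelta = timedelta(days=7)) -> float:
    """Share of incidents whose postmortem merged within the limit.

    Assumption: the limit is counted from incident resolution.
    """
    done = sum(
        1 for i in incidents
        if i.postmortem_merged is not None
        and i.postmortem_merged - i.resolved <= limit
    )
    return done / len(incidents)


history = [
    Incident(datetime(2026, 4, 24, 14, 0), datetime(2026, 4, 24, 14, 4),
             datetime(2026, 4, 24, 14, 42), datetime(2026, 4, 27, 10, 0)),
    Incident(datetime(2026, 4, 10, 9, 0), datetime(2026, 4, 10, 9, 6),
             datetime(2026, 4, 10, 11, 0)),
]
print(f"brief latency: {mean_brief_latency_minutes(history):.1f} min")
print(f"postmortem completion: {postmortem_completion_rate(history):.0%}")
```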
Maturity in four levels
| Level | Name | Markers |
|---|---|---|
| L1 | Manual | SLOs as slogans, incidents managed in Slack threads, postmortems optional |
| L2 | Assisted | GitHub Copilot helps draft postmortems, some SLOs defined, no incident bot |
| L3 | Augmented | One Incident Captain agent, four slash prompts, scoped instructions, Azure and M365 MCPs, incident bot live |
| L4 | Agentic | Full primitives kit, hooks enforced, SLO coverage > 95 percent, incident brief in < 5 min, action items closed > 80 percent |
Integration with other personas
Handoffs:
- From Release Manager: release train, risk tiers, rollback plan, canary report
- From DevOps Engineer: deployed artifact, dashboards, alert rules
- To Developer: incident findings and action items as issues with reproduction steps
- To Platform Architect: toil scan results feed the capability matrix roadmap
- To InfoSec Officer: incidents classified as security are co-owned, with a joint postmortem
Glossary
- Agent: a configured LLM role with tools, instructions, and a defined output shape.
- Prompt: a reusable slash command that invokes an agent with a specific task.
- Instructions: scoped guidance applied by pattern match on file paths via applyTo.
- Skill: a lazy-loaded capability that activates on keyword match.
- Hook: a zero-token rule enforced at a specific lifecycle event.
- MCP: Model Context Protocol server that exposes external systems to the agent.
- SLO: Service Level Objective, the reliability target (e.g., 99.9 percent success at 300 ms p95).
- Error budget: the allowed unreliability derived from the SLO (e.g., 0.1 percent over 28 days).
- Burn rate: the rate at which error budget is being consumed, relative to the rate that would exactly exhaust it by the end of the window; alerts use multi-window, multi-burn-rate rules.
- Toil: manual, repetitive, automatable operational work that scales linearly with service growth.
- Postmortem: a blameless document capturing timeline, contributing factors, impact, and action items.
References
- Google SRE Book — canonical reference for SLOs, error budgets, and toil
- Azure Monitor documentation — alerts, workbooks, and SLO dashboards
- Application Insights documentation — distributed tracing and anomaly detection
- Log Analytics KQL reference — query language for incident investigation
- Microsoft 365 Agents SDK — incident bot integration with Teams
- GitHub Projects documentation — action item tracking and aging
- DORA metrics research — mean time to restore and change failure rate foundations