03 Reliability & Incidents
Incident metrics track how quickly you detect, respond to, and recover from failures — and how long the system runs between them.
- MTTR: Mean Time to Recovery. Average time from detection to full restoration. After an SLO breach, it is the main factor in how much availability is lost.
- MTBF: Mean Time Between Failures. Average operational time between incidents. A rising MTBF indicates a more stable system.
- MTTD: Mean Time to Detect. Average time from the start of a failure until the team becomes aware of it. Poor observability inflates MTTD, so minutes are lost before recovery even begins.
- Incident Rate: Number of incidents per time period. Shows whether reliability investments are reducing how often failures occur, or only speeding up recovery.
- Toil %: Percentage of engineering time spent on manual, repetitive operational work. SRE practice aims to keep toil below 50% so that time remains for reliability improvements.
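The definitions above can be turned into arithmetic over a list of incident records. The following is a minimal sketch, not a standard implementation: the `Incident` fields, the function names, and the 30-day normalization are all hypothetical choices made for illustration. It assumes MTTR is measured from detection (as defined above) and that MTBF is total uptime in the observation window divided by the number of incidents; teams often use slightly different conventions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record layout; real incident trackers will differ.
    started: datetime   # when the failure began
    detected: datetime  # when the team became aware of it
    resolved: datetime  # when service was fully restored


def _minutes(delta: timedelta) -> float:
    return delta.total_seconds() / 60


def incident_metrics(incidents, window_start, window_end):
    """Compute MTTD, MTTR, MTBF (in minutes) and incidents per 30 days."""
    if not incidents:
        raise ValueError("need at least one incident to compute means")

    # MTTD: failure start -> detection. MTTR: detection -> restoration.
    mttd = mean(_minutes(i.detected - i.started) for i in incidents)
    mttr = mean(_minutes(i.resolved - i.detected) for i in incidents)

    # MTBF (assumed convention): uptime in the window / number of incidents.
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    uptime = (window_end - window_start) - downtime
    mtbf = _minutes(uptime) / len(incidents)

    # Incident rate, normalized to a 30-day period.
    window_days = (window_end - window_start) / timedelta(days=1)
    rate = len(incidents) * 30 / window_days

    return {
        "mttd_min": mttd,
        "mttr_min": mttr,
        "mtbf_min": mtbf,
        "incidents_per_30d": rate,
    }


def toil_percent(toil_hours: float, total_hours: float) -> float:
    """Share of working time spent on manual, repetitive ops work."""
    return 100 * toil_hours / total_hours


# Example: two made-up incidents in a 30-day window.
incidents = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 10),
             datetime(2024, 1, 5, 11, 0)),
    Incident(datetime(2024, 1, 20, 2, 0), datetime(2024, 1, 20, 2, 20),
             datetime(2024, 1, 20, 3, 30)),
]
m = incident_metrics(incidents, datetime(2024, 1, 1), datetime(2024, 1, 31))
print(m)
print(toil_percent(12, 40))
```

With this invented data, detection took 10 and 20 minutes (MTTD of 15 minutes) and recovery took 50 and 70 minutes after detection (MTTR of 60 minutes). Keeping MTTD and MTTR separate makes it visible whether time is being lost to slow detection or to slow repair.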