Golden Signals — Real-time (All healthy)

Golden signals dashboard: Latency 42ms p50, Traffic 1.2k requests per second, Error rate 0.03%, Saturation (CPU 72%, Memory 58%, Disk 41%, Network 23%).

01

Golden Signals

The four golden signals are the minimum viable observability set for any production service. Instrument these first; everything else is refinement.

Latency
Time to serve a request. Track both successful and failed latencies separately — errors that return fast can mask real degradation.
Traffic
Demand placed on the system — requests per second, transactions, concurrent users. Establishes baseline and detects anomalies.
Errors
Rate of requests that fail, either explicitly (5xx), implicitly (wrong content), or by policy (SLO violation).
Saturation
How full your service is — CPU, memory, I/O, queue depth. Most services degrade before hitting 100%.
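
A minimal sketch of how a Python service might export the four signals above, assuming the open-source prometheus_client library; the metric names, labels, and the backend callable are illustrative, not a fixed convention.

```python
# Sketch: exporting the four golden signals from a request handler,
# assuming the prometheus_client library. Names and labels are illustrative.
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter(
    "http_requests_total", "Traffic: total requests served", ["status"]
)
LATENCY = Histogram(
    "http_request_latency_seconds",
    "Latency: request duration, labelled by outcome",
    ["outcome"],  # success and error latency tracked separately
)
# Saturation: set by whoever owns the queue, e.g. SATURATION.set(queue.qsize())
SATURATION = Gauge("worker_queue_depth", "Saturation: items waiting in the queue")


def handle(request, backend):
    """Wrap real work with golden-signal instrumentation (backend is illustrative)."""
    start = time.monotonic()
    try:
        response = backend(request)          # hypothetical downstream call
        REQUESTS.labels(status="2xx").inc()  # traffic; error rate derives from status
        LATENCY.labels(outcome="success").observe(time.monotonic() - start)
        return response
    except Exception:
        REQUESTS.labels(status="5xx").inc()
        LATENCY.labels(outcome="error").observe(time.monotonic() - start)
        raise


if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for scraping
```

Latency is recorded separately for successes and errors, per the note above, and the error rate falls out of the status-labelled counter.
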
02

SLIs / SLOs / Error Budget

SLIs measure what users experience. SLOs set the target. Error budgets quantify how much unreliability remains before the team must stop shipping and focus on reliability.

Availability
The ratio of good requests to total requests. The SLI that defines "the service is working" from the user's perspective.
Error Budget
The amount of unreliability allowed before breaching SLO. At 99.9%, you have 43 minutes of downtime per month — every minute consumed is a minute less to ship.
Error Budget Burn Rate
How fast you consume your error budget. A burn rate of 1x means you'll exactly exhaust the budget by month-end. Above 2x triggers escalation.
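
A worked sketch of the budget arithmetic above in plain Python; the 30-day window and the sample inputs are assumptions for illustration.

```python
# Sketch: error-budget and burn-rate arithmetic for an availability SLO.
# The 30-day window and sample inputs are illustrative assumptions.
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window


def error_budget_minutes(slo: float) -> float:
    """Allowed downtime in the window, e.g. 99.9% -> ~43.2 minutes."""
    return WINDOW_MINUTES * (1.0 - slo)


def burn_rate(bad_minutes: float, elapsed_minutes: float, slo: float) -> float:
    """How fast the budget is being consumed relative to a steady 1x pace.

    1.0 means the budget runs out exactly at the end of the window;
    2.0 means it runs out halfway through.
    """
    budget_spent = bad_minutes / error_budget_minutes(slo)
    window_elapsed = elapsed_minutes / WINDOW_MINUTES
    return budget_spent / window_elapsed


if __name__ == "__main__":
    slo = 0.999
    print(f"budget: {error_budget_minutes(slo):.1f} min")        # ~43.2
    # 10 bad minutes in the first 5 days of the window:
    print(f"burn rate: {burn_rate(10, 5 * 24 * 60, slo):.2f}x")  # ~1.39x
```
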
Availability — 30-Day Window (SLO met)

Availability gauge showing 99.97% uptime over the 30-day window against an SLO target of 99.95%. Error budget 62% remaining, roughly 13 minutes of allowed downtime left; burn rate 0.8x.

Incident Metrics — 6 Months (MTTR improving)

Incident metrics line chart over 6 months showing MTTR decreasing from 48 to 8 minutes, MTTD decreasing from 36 to 5 minutes, and MTBF increasing from 12 to 72 hours. Target line at 15 minutes.

03

Reliability & Incidents

Incident metrics track how quickly you detect, respond to, and recover from failures — and how long the system runs between them.

MTTR
Mean Time to Recovery — average time from detection to full restoration. Together with MTBF, the primary driver of overall availability.
MTBF
Mean Time Between Failures — average operational time between incidents. Rising MTBF indicates a more stable system.
MTTD
Mean Time to Detect — how long before the team becomes aware. Poor observability inflates MTTD, costing minutes before recovery begins.
Incident Rate
Number of incidents per time period. Tracks whether investments reduce failure frequency or just improve recovery speed.
Toil %
Percentage of time on manual, repetitive ops work. SRE targets keeping toil below 50% to preserve time for reliability improvements.
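
A plain-Python sketch of how MTTD, MTTR, MTBF, and the incident rate might be derived from an incident log; the Incident fields and the weekly incident-rate unit are assumptions.

```python
# Sketch: deriving MTTD, MTTR, MTBF, and incident rate from an incident log.
# The Incident fields and units are illustrative assumptions.
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime   # failure begins
    detected: datetime  # team becomes aware
    resolved: datetime  # service fully restored


def incident_metrics(incidents: list[Incident], window: timedelta) -> dict:
    incidents = sorted(incidents, key=lambda i: i.started)
    mttd = mean((i.detected - i.started).total_seconds() for i in incidents) / 60
    mttr = mean((i.resolved - i.detected).total_seconds() for i in incidents) / 60
    # MTBF: operational time between the end of one incident and the start of the next.
    gaps = [
        (nxt.started - cur.resolved).total_seconds() / 3600
        for cur, nxt in zip(incidents, incidents[1:])
    ]
    return {
        "mttd_min": round(mttd, 1),
        "mttr_min": round(mttr, 1),
        "mtbf_hours": round(mean(gaps), 1) if gaps else None,
        "incidents_per_week": len(incidents) / (window.days / 7),
    }
```
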
04

Performance

Latency percentiles, throughput, and Apdex reveal how users experience your service. Averages lie — percentile distributions show exactly what fraction of users are waiting too long.

Latency Percentiles
p50 is the median; p95 captures most users; p99 exposes the worst-case tail. SLOs on p99 protect the most impacted users.
Throughput
Successful requests per second. Combined with latency, reveals capacity boundaries where adding load increases latency non-linearly.
Apdex Score
Application Performance Index — classifies requests as satisfied, tolerating, or frustrated. Apdex > 0.94 is excellent.
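
A plain-Python sketch of the percentile and Apdex arithmetic above; the nearest-rank method, the 100 ms threshold, and the sample latencies are assumptions for illustration.

```python
# Sketch: latency percentiles (nearest-rank) and Apdex from raw samples.
# The 100 ms threshold and sample data are illustrative assumptions.
import math


def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile, e.g. p=0.99 for p99."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]


def apdex(samples_ms: list[float], threshold_ms: float) -> float:
    """Apdex = (satisfied + tolerating / 2) / total.

    satisfied: <= T, tolerating: <= 4T, frustrated: > 4T.
    """
    satisfied = sum(1 for s in samples_ms if s <= threshold_ms)
    tolerating = sum(1 for s in samples_ms if threshold_ms < s <= 4 * threshold_ms)
    return (satisfied + tolerating / 2) / len(samples_ms)


if __name__ == "__main__":
    latencies = [42, 55, 61, 89, 97, 127, 158, 187, 410, 950]  # ms, illustrative
    for p in (0.50, 0.95, 0.99):
        print(f"p{int(p * 100)}: {percentile(latencies, p)} ms")
    print(f"Apdex (T=100 ms): {apdex(latencies, 100):.2f}")
```
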
Latency Distribution — Last 24h (p99 < 200ms)

Latency histogram showing percentile distribution: p50 42ms, p75 89ms, p90 127ms, p95 158ms, p99 187ms. All percentiles below the 200ms SLO target.

Infrastructure Utilization — Live (4 nodes healthy)

Infrastructure utilization ring gauges: CPU 72% of 4 cores, Memory 58% of 16 GB, Disk 41% of 500 GB, Network 23% of 1 Gbps. All 4 nodes healthy, zero restarts.

05

Infrastructure

Infrastructure metrics track the health of the compute layer — CPU, memory, disk, and network. Saturated resources cause latency spikes before they cause outages.

CPU / Memory / Disk Utilization
Percentage of resource capacity in use. A node at 80% CPU during normal traffic has little headroom left for incident spikes.
Network I/O
Bytes in/out per second. Network saturation causes latency and dropped connections before throughput limits are reached.
Container Restarts
Non-zero restarts indicate OOM kills, liveness probe failures, or crashes — each one is a brief outage for that workload.
Node Availability
Healthy nodes vs expected. Node loss reduces redundancy and may trigger cascade if remaining capacity is insufficient.
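
A minimal sketch of a node-level saturation snapshot, assuming the psutil library; the alert thresholds are illustrative, not a standard.

```python
# Sketch: node-level saturation snapshot, assuming the psutil library.
# Threshold values are illustrative assumptions, not a standard.
import psutil

THRESHOLDS = {"cpu": 80.0, "memory": 85.0, "disk": 90.0}  # percent


def saturation_snapshot() -> dict[str, float]:
    return {
        "cpu": psutil.cpu_percent(interval=1),
        "memory": psutil.virtual_memory().percent,
        "disk": psutil.disk_usage("/").percent,
    }


def over_threshold(snapshot: dict[str, float]) -> list[str]:
    """Resources that have crossed their saturation threshold."""
    return [name for name, pct in snapshot.items() if pct >= THRESHOLDS[name]]


if __name__ == "__main__":
    snap = saturation_snapshot()
    print(snap)
    print("saturated:", over_threshold(snap) or "none")
```
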
06

Deployment & Change

DORA metrics measure software delivery performance. High-performing teams deploy frequently with low failure rates — speed and stability are not a trade-off.

Deploy Frequency
How often the team ships to production. Elite teams deploy multiple times per day. Frequent small deploys reduce blast radius.
Lead Time for Changes
Time from commit to production. Short lead times (under one hour) indicate mature CI/CD and fast feedback loops.
Change Failure Rate
Percentage of deployments causing incidents or rollbacks. Elite teams maintain below 15%.
Rollback Rate
Fraction of deploys requiring revert. A leading indicator of change failure — rolling back early keeps incidents from escalating.
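
A plain-Python sketch of how these four delivery measures might be computed from a deploy log; the Deploy record fields and failure classification are assumptions.

```python
# Sketch: DORA-style delivery metrics from a deploy log.
# The Deploy fields and classification rules are illustrative assumptions.
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deploy:
    committed_at: datetime   # first commit in the change
    deployed_at: datetime    # landed in production
    caused_incident: bool
    rolled_back: bool


def dora_metrics(deploys: list[Deploy], window: timedelta) -> dict:
    lead_times_h = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deploys
    ]
    return {
        "deploys_per_day": len(deploys) / window.days,
        "median_lead_time_hours": round(median(lead_times_h), 2),
        "change_failure_rate": sum(d.caused_incident for d in deploys) / len(deploys),
        "rollback_rate": sum(d.rolled_back for d in deploys) / len(deploys),
    }
```
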
Deploy Frequency — 12 Weeks (4.2x/day avg)

Deploy frequency bar chart over 12 weeks showing an upward trend from roughly 15 to 35 deploys per week, averaging 4.2 deploys per day. Change failure rate at 4.2%.

Capacity Planning — Current State (Scaling ready)

Capacity planning horizontal bar chart: Headroom 34%, Cost per request $0.003, Scaling events 12 per month. System is scaling-ready with adequate headroom.

07

Capacity Planning

Capacity planning ensures the system handles projected load growth without degradation — and that you're not over-provisioning. The goal is headroom without waste.

Resource Headroom
Percentage of provisioned capacity unused under peak load. 30–40% headroom balances cost against absorbing traffic spikes.
Cost per Request
Total infra cost divided by request volume. As traffic grows, cost per request should decrease through optimization.
Scaling Events
Autoscaling triggers per month. Rare events may mean over-provisioning; excessive events signal under-provisioning.
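
A closing sketch of the headroom and unit-cost arithmetic in plain Python; the 30–40% target band echoes the guideline above, and the sample figures are assumptions.

```python
# Sketch: headroom and cost-per-request arithmetic for capacity planning.
# Target band and sample figures are illustrative assumptions.

def headroom_pct(peak_utilization: float) -> float:
    """Unused share of provisioned capacity at peak load, as a percentage."""
    return (1.0 - peak_utilization) * 100


def cost_per_request(monthly_infra_cost: float, monthly_requests: float) -> float:
    return monthly_infra_cost / monthly_requests


def capacity_verdict(peak_utilization: float, low: float = 30, high: float = 40) -> str:
    """Flag under- or over-provisioning against a 30-40% headroom target."""
    hr = headroom_pct(peak_utilization)
    if hr < low:
        return f"{hr:.0f}% headroom: add capacity before the next traffic spike"
    if hr > high:
        return f"{hr:.0f}% headroom: likely over-provisioned, consider scaling down"
    return f"{hr:.0f}% headroom: within target band"


if __name__ == "__main__":
    print(capacity_verdict(peak_utilization=0.66))               # 34% headroom
    print(f"${cost_per_request(9_300, 3_100_000):.4f}/request")  # ~$0.0030
```
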