Golden Signals — Real-time (All healthy)

Golden signals dashboard: Latency 42ms p50, Traffic 1.2k requests per second, Error rate 0.03%, Saturation (CPU 72%, Memory 58%, Disk 41%, Network 23%).

01

Golden Signals

The four golden signals are the minimum viable observability set for any production service. Instrument these first; everything else is refinement.

Latency
Time to serve a request. Track both successful and failed latencies separately — errors that return fast can mask real degradation.
Traffic
Demand placed on the system — requests per second, transactions, concurrent users. Establishes baseline and detects anomalies.
Errors
Rate of requests that fail, either explicitly (5xx), implicitly (wrong content), or by policy (SLO violation).
Saturation
How full your service is — CPU, memory, I/O, queue depth. Most services degrade before hitting 100%.
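
A minimal sketch of how a Python service might export the four signals above, assuming the open-source prometheus_client library; the metric names, labels, and the backend callable are illustrative, not a fixed convention.

```python
# Sketch: exporting the four golden signals from a request handler,
# assuming the prometheus_client library. Names and labels are illustrative.
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter(
    "http_requests_total", "Traffic: total requests served", ["status"]
)
LATENCY = Histogram(
    "http_request_latency_seconds",
    "Latency: request duration, labelled by outcome",
    ["outcome"],  # success and error latency tracked separately
)
# Saturation: set by whoever owns the queue, e.g. SATURATION.set(queue.qsize())
SATURATION = Gauge("worker_queue_depth", "Saturation: items waiting in the queue")


def handle(request, backend):
    """Wrap real work with golden-signal instrumentation (backend is illustrative)."""
    start = time.monotonic()
    try:
        response = backend(request)          # hypothetical downstream call
        REQUESTS.labels(status="2xx").inc()  # traffic; error rate derives from status
        LATENCY.labels(outcome="success").observe(time.monotonic() - start)
        return response
    except Exception:
        REQUESTS.labels(status="5xx").inc()
        LATENCY.labels(outcome="error").observe(time.monotonic() - start)
        raise


if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for scraping
```

Latency is recorded separately for successes and errors, per the note above, and the error rate falls out of the status-labelled counter.
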
02

SLIs / SLOs / Error Budget

SLIs measure what users experience. SLOs set the target. Error budgets quantify how much unreliability remains before the team must stop shipping and focus on reliability.

Availability
The ratio of good requests to total requests. The SLI that defines "the service is working" from the user's perspective.
Error Budget
The amount of unreliability allowed before breaching SLO. At 99.9%, you have 43 minutes of downtime per month — every minute consumed is a minute less to ship.
Error Budget Burn Rate
How fast you consume your error budget. A burn rate of 1x means you'll exactly exhaust the budget by month-end. Above 2x triggers escalation.
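
A worked sketch of the budget arithmetic above in plain Python; the 30-day window and the sample inputs are assumptions for illustration.

```python
# Sketch: error-budget and burn-rate arithmetic for an availability SLO.
# The 30-day window and sample inputs are illustrative assumptions.
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window


def error_budget_minutes(slo: float) -> float:
    """Allowed downtime in the window, e.g. 99.9% -> ~43.2 minutes."""
    return WINDOW_MINUTES * (1.0 - slo)


def burn_rate(bad_minutes: float, elapsed_minutes: float, slo: float) -> float:
    """How fast the budget is being consumed relative to a steady 1x pace.

    1.0 means the budget runs out exactly at the end of the window;
    2.0 means it runs out halfway through.
    """
    budget_spent = bad_minutes / error_budget_minutes(slo)
    window_elapsed = elapsed_minutes / WINDOW_MINUTES
    return budget_spent / window_elapsed


if __name__ == "__main__":
    slo = 0.999
    print(f"budget: {error_budget_minutes(slo):.1f} min")        # ~43.2
    # 10 bad minutes in the first 5 days of the window:
    print(f"burn rate: {burn_rate(10, 5 * 24 * 60, slo):.2f}x")  # ~1.39x
```
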
Availability — 30-Day Window (SLO met)

Availability gauge showing 99.97% uptime over the 30-day window against an SLO target of 99.95%. Error budget 62% remaining, roughly 13 minutes of allowed downtime left; burn rate 0.8x.

Incident Metrics — 6 Months (MTTR improving)

Incident metrics line chart over 6 months showing MTTR decreasing from 48 to 8 minutes, MTTD decreasing from 36 to 5 minutes, and MTBF increasing from 12 to 72 hours. Target line at 15 minutes.

03

Reliability & Incidents

Incident metrics track how quickly you detect, respond to, and recover from failures — and how long the system runs between them.

MTTR
Mean Time to Recovery — average time from detection to full restoration. Together with MTBF, the primary driver of overall availability.
MTBF
Mean Time Between Failures — average operational time between incidents. Rising MTBF indicates a more stable system.
MTTD
Mean Time to Detect — how long before the team becomes aware. Poor observability inflates MTTD, costing minutes before recovery begins.
Incident Rate
Number of incidents per time period. Tracks whether investments reduce failure frequency or just improve recovery speed.
Toil %
Percentage of time on manual, repetitive ops work. SRE targets keeping toil below 50% to preserve time for reliability improvements.
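
A plain-Python sketch of how MTTD, MTTR, MTBF, and the incident rate might be derived from an incident log; the Incident fields and the weekly incident-rate unit are assumptions.

```python
# Sketch: deriving MTTD, MTTR, MTBF, and incident rate from an incident log.
# The Incident fields and units are illustrative assumptions.
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime   # failure begins
    detected: datetime  # team becomes aware
    resolved: datetime  # service fully restored


def incident_metrics(incidents: list[Incident], window: timedelta) -> dict:
    incidents = sorted(incidents, key=lambda i: i.started)
    mttd = mean((i.detected - i.started).total_seconds() for i in incidents) / 60
    mttr = mean((i.resolved - i.detected).total_seconds() for i in incidents) / 60
    # MTBF: operational time between the end of one incident and the start of the next.
    gaps = [
        (nxt.started - cur.resolved).total_seconds() / 3600
        for cur, nxt in zip(incidents, incidents[1:])
    ]
    return {
        "mttd_min": round(mttd, 1),
        "mttr_min": round(mttr, 1),
        "mtbf_hours": round(mean(gaps), 1) if gaps else None,
        "incidents_per_week": len(incidents) / (window.days / 7),
    }
```
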
04

Performance

Latency percentiles, throughput, and Apdex reveal how users experience your service. Averages lie — percentile distributions show exactly what fraction of users are waiting too long.

Latency Percentiles
p50 is the median; p95 captures most users; p99 exposes the worst-case tail. SLOs on p99 protect the most impacted users.
Throughput
Successful requests per second. Combined with latency, reveals capacity boundaries where adding load increases latency non-linearly.
Apdex Score
Application Performance Index — classifies requests as satisfied, tolerating, or frustrated. Apdex > 0.94 is excellent.
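
A plain-Python sketch of the percentile and Apdex arithmetic above; the nearest-rank method, the 100 ms threshold, and the sample latencies are assumptions for illustration.

```python
# Sketch: latency percentiles (nearest-rank) and Apdex from raw samples.
# The 100 ms threshold and sample data are illustrative assumptions.
import math


def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile, e.g. p=0.99 for p99."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]


def apdex(samples_ms: list[float], threshold_ms: float) -> float:
    """Apdex = (satisfied + tolerating / 2) / total.

    satisfied: <= T, tolerating: <= 4T, frustrated: > 4T.
    """
    satisfied = sum(1 for s in samples_ms if s <= threshold_ms)
    tolerating = sum(1 for s in samples_ms if threshold_ms < s <= 4 * threshold_ms)
    return (satisfied + tolerating / 2) / len(samples_ms)


if __name__ == "__main__":
    latencies = [42, 55, 61, 89, 97, 127, 158, 187, 410, 950]  # ms, illustrative
    for p in (0.50, 0.95, 0.99):
        print(f"p{int(p * 100)}: {percentile(latencies, p)} ms")
    print(f"Apdex (T=100 ms): {apdex(latencies, 100):.2f}")
```
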
Latency Distribution — Last 24h (p99 < 200ms)

Latency histogram showing percentile distribution: p50 42ms, p75 89ms, p90 127ms, p95 158ms, p99 187ms. All percentiles below the 200ms SLO target.

Infrastructure Utilization — Live (4 nodes healthy)

Infrastructure utilization ring gauges: CPU 72% of 4 cores, Memory 58% of 16 GB, Disk 41% of 500 GB, Network 23% of 1 Gbps. All 4 nodes healthy, zero restarts.

05

Infrastructure

Infrastructure metrics track the health of the compute layer — CPU, memory, disk, and network. Saturated resources cause latency spikes before they cause outages.

CPU / Memory / Disk Utilization
Percentage of resource capacity in use. A node at 80% CPU during normal traffic has little headroom left for incident spikes.
Network I/O
Bytes in/out per second. Network saturation causes latency and dropped connections before throughput limits are reached.
Container Restarts
Non-zero restarts indicate OOM kills, liveness probe failures, or crashes — each one is a brief outage for that workload.
Node Availability
Healthy nodes vs expected. Node loss reduces redundancy and may trigger cascade if remaining capacity is insufficient.
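
A minimal sketch of a node-level saturation snapshot, assuming the psutil library; the alert thresholds are illustrative, not a standard.

```python
# Sketch: node-level saturation snapshot, assuming the psutil library.
# Threshold values are illustrative assumptions, not a standard.
import psutil

THRESHOLDS = {"cpu": 80.0, "memory": 85.0, "disk": 90.0}  # percent


def saturation_snapshot() -> dict[str, float]:
    return {
        "cpu": psutil.cpu_percent(interval=1),
        "memory": psutil.virtual_memory().percent,
        "disk": psutil.disk_usage("/").percent,
    }


def over_threshold(snapshot: dict[str, float]) -> list[str]:
    """Resources that have crossed their saturation threshold."""
    return [name for name, pct in snapshot.items() if pct >= THRESHOLDS[name]]


if __name__ == "__main__":
    snap = saturation_snapshot()
    print(snap)
    print("saturated:", over_threshold(snap) or "none")
```
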
06

Deployment & Change

DORA metrics measure software delivery performance. High-performing teams deploy frequently with low failure rates — speed and stability are not a trade-off.

Deploy Frequency
How often the team ships to production. Elite teams deploy multiple times per day. Frequent small deploys reduce blast radius.
Lead Time for Changes
Time from commit to production. Short lead times (under one hour) indicate mature CI/CD and fast feedback loops.
Change Failure Rate
Percentage of deployments causing incidents or rollbacks. Elite teams maintain below 15%.
Rollback Rate
Fraction of deploys requiring revert. A leading indicator of change failure — rolling back early keeps incidents from escalating.
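
A plain-Python sketch of how these four delivery measures might be computed from a deploy log; the Deploy record fields and failure classification are assumptions.

```python
# Sketch: DORA-style delivery metrics from a deploy log.
# The Deploy fields and classification rules are illustrative assumptions.
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deploy:
    committed_at: datetime   # first commit in the change
    deployed_at: datetime    # landed in production
    caused_incident: bool
    rolled_back: bool


def dora_metrics(deploys: list[Deploy], window: timedelta) -> dict:
    lead_times_h = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deploys
    ]
    return {
        "deploys_per_day": len(deploys) / window.days,
        "median_lead_time_hours": round(median(lead_times_h), 2),
        "change_failure_rate": sum(d.caused_incident for d in deploys) / len(deploys),
        "rollback_rate": sum(d.rolled_back for d in deploys) / len(deploys),
    }
```
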
Deploy Frequency — 12 Weeks (4.2x/day avg)

Deploy frequency bar chart over 12 weeks showing an upward trend from roughly 15 to 35 deploys per week, averaging 4.2 deploys per day. Change failure rate at 4.2%.

Capacity Planning — Current State (Scaling ready)

Capacity planning horizontal bar chart: Headroom 34%, Cost per request $0.003, Scaling events 12 per month. System is scaling-ready with adequate headroom.

07

Capacity Planning

Capacity planning ensures the system handles projected load growth without degradation — and that you're not over-provisioning. The goal is headroom without waste.

Resource Headroom
Percentage of provisioned capacity unused under peak load. 30–40% headroom balances cost against absorbing traffic spikes.
Cost per Request
Total infra cost divided by request volume. As traffic grows, cost per request should decrease through optimization.
Scaling Events
Autoscaling triggers per month. Rare events may mean over-provisioning; excessive events signal under-provisioning.
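
A closing sketch of the headroom and unit-cost arithmetic in plain Python; the 30–40% target band echoes the guideline above, and the sample figures are assumptions.

```python
# Sketch: headroom and cost-per-request arithmetic for capacity planning.
# Target band and sample figures are illustrative assumptions.

def headroom_pct(peak_utilization: float) -> float:
    """Unused share of provisioned capacity at peak load, as a percentage."""
    return (1.0 - peak_utilization) * 100


def cost_per_request(monthly_infra_cost: float, monthly_requests: float) -> float:
    return monthly_infra_cost / monthly_requests


def capacity_verdict(peak_utilization: float, low: float = 30, high: float = 40) -> str:
    """Flag under- or over-provisioning against a 30-40% headroom target."""
    hr = headroom_pct(peak_utilization)
    if hr < low:
        return f"{hr:.0f}% headroom: add capacity before the next traffic spike"
    if hr > high:
        return f"{hr:.0f}% headroom: likely over-provisioned, consider scaling down"
    return f"{hr:.0f}% headroom: within target band"


if __name__ == "__main__":
    print(capacity_verdict(peak_utilization=0.66))               # 34% headroom
    print(f"${cost_per_request(9_300, 3_100_000):.4f}/request")  # ~$0.0030
```
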