Risk-Driven Development: Risk and Observability from Design

What is Risk-Driven Development

Risk-Driven Development (RDD) is the principle that design, testing and observability decisions are prioritized by the risk they mitigate, not by uniform coverage. In an environment with limited resources — one person, a legacy stack, a team without dedicated SRE — this is critical: you cannot instrument everything, but you can instrument what hurts the most.

Risk analysis is not something done “after production”. It is a design activity that occurs when planning each feature, each architectural change and each migration.

Development cycle with integrated risk

Each feature or significant change goes through an expanded cycle:

1. Plan feature: define scope and requirements
2. Analyze risk: identify failure modes before writing code
3. Design observability: define metrics, logs, alerts, traces
4. Implement: write the code with risk mitigations built in
5. Validate: test against identified failure scenarios
6. Monitor: verify in production that risk assumptions hold

The key shift: steps 2 and 3 happen before implementation, not after.

The 4 FMEA questions as a design methodology

Before a single line of code is written, each feature must pass through four fundamental questions taken from Failure Mode and Effects Analysis (FMEA) and applied to software design. These questions are the core of risk analysis, and they belong to feature design, not to a post-development review.

1. What can fail? (Failure mode)

Identify the failure modes this change introduces or modifies. A change in an API can break consumers. A data migration can lose records. A new endpoint can saturate the database. If you don’t identify how it can fail, you’re not ready to implement.

2. How serious is it if it fails? (Severity)

Evaluate the business impact if the failure occurs. On a 1–10 scale, data loss is severity 9–10; latency degradation may be 5–6. The score determines whether the change needs automatic rollback, a feature flag or a canary deploy, or whether a smoke test is sufficient.

3. How likely is it to fail? (Occurrence)

Estimate the probability based on change complexity, failure history in similar components, and quality of available tests. A change in critical business logic with partial tests has high occurrence.

4. Would we detect it in time? (Detection)

The most important question for operations. If the failure occurs, how long would it take us to know? If the answer is “when the customer calls”, the Detection score is 9–10 (a high score means the failure is hard to detect) and we need to design metrics, alerts and logs before deploying.

Rule: if you cannot answer the 4 questions, the change is not ready to be implemented.
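
The three scored dimensions are conventionally combined into a risk priority number (RPN), the product S × O × D. A minimal Python sketch, assuming the usual 1–10 scale for each score; the class name, field names and example scores are illustrative and not taken from any real FMEA sheet:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One failure mode with its three FMEA scores (1 = best, 10 = worst)."""

    description: str
    severity: int    # S: business impact if the failure occurs
    occurrence: int  # O: how likely the failure is
    detection: int   # D: 10 means we only find out when the customer calls

    def __post_init__(self) -> None:
        # Reject scores outside the assumed 1-10 scale.
        for name in ("severity", "occurrence", "detection"):
            score = getattr(self, name)
            if not 1 <= score <= 10:
                raise ValueError(f"{name} must be in 1..10, got {score}")

    @property
    def rpn(self) -> int:
        """Risk priority number: S x O x D."""
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes: list[FailureMode]) -> list[FailureMode]:
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


# Hypothetical scores, for illustration only.
modes = [
    FailureMode("New endpoint saturates the database", severity=6, occurrence=4, detection=3),
    FailureMode("Data migration loses records", severity=10, occurrence=3, detection=8),
]
for mode in rank_by_rpn(modes):
    print(f"{mode.rpn:4d}  {mode.description}")
```

Ranking by RPN puts the poorly detected data-loss scenario first even though its occurrence is lower, which is the kind of prioritization the four questions are meant to produce.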

Quick reference: questions by FMEA dimension

Failure mode — “What can fail?”

Example question	What it reveals
If this endpoint receives double the traffic, what happens?	Identifies saturation, bottlenecks and undersized dependencies
What happens if the database takes 10x longer than normal?	Reveals absence of timeouts, circuit breakers or back-pressure queues
If the upstream service changes its contract, what breaks here?	Exposes fragile coupling and absence of contract tests
What happens if this job runs twice in a row?	Validates idempotency; if not idempotent, there is risk of duplicates or corruption
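
The idempotency question in the last row can be answered with a small experiment: run the job twice with the same input and check that the state does not change the second time. A sketch, assuming a hypothetical job that records order amounts and a caller-supplied idempotency key (all names here are illustrative):

```python
class OrderJob:
    """Hypothetical job that appends order amounts to a ledger."""

    def __init__(self) -> None:
        self.processed_keys: set[str] = set()
        self.ledger: list[int] = []

    def run(self, idempotency_key: str, amount: int) -> bool:
        """Apply the order once; return False if the key was already processed."""
        if idempotency_key in self.processed_keys:
            return False  # second run is a no-op, so no duplicate entry
        self.ledger.append(amount)
        self.processed_keys.add(idempotency_key)
        return True


job = OrderJob()
first = job.run("order-123", 50)
second = job.run("order-123", 50)  # simulated accidental re-run
```

If the ledger grew on the second call, the failure mode "duplicates or corruption" would be real and would belong in the risk analysis.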

Severity — “How serious is it?”

Example question	What it reveals
If this change fails, does the user lose data or lose functionality?	Distinguishes between critical failure (S=9–10) and contained degradation (S=4–6)
Does the failure affect one user, a segment or all?	Determines blast radius and whether it needs a feature flag or canary
Is there regulatory, contractual or reputational impact?	Identifies severities that are not technical but are business-critical

Occurrence — “How likely is it?”

Example question	What it reveals
How many times has something similar failed in the last 6 months?	Uses incident history and postmortems as evidence. Don’t guess: measure
What test coverage does the logic I’m modifying have?	Code without tests has high occurrence by definition; QA reduces O directly
Does the change touch legacy code without documentation?	Undocumented code is more likely to fail because its expected behavior is unknown

Detection — “Would we detect it?”

Example question	What it reveals
If this fails at 3am, does anyone find out before the customer?	Evaluates whether actionable alerts exist. If the answer is no, D=9–10
Do I have a metric that shows degradation before total failure?	Detecting progressive degradation is better than waiting for complete failure
Does the post-deploy smoke test cover this flow?	If the flow has no smoke test, a failure can escape to production undetected

Occurrence: QA and Chaos Engineering as measurement tools

Occurrence (O) is not just a subjective estimate — it can and should be actively measured:

QA directly reduces Occurrence: each test that covers a failure mode lowers the probability of that failure reaching production. Unit tests on critical logic, contract tests on integrations and post-deploy smoke tests are controls that measurably reduce O.
Chaos engineering measures Occurrence under real conditions: injecting controlled failures (latency, errors, dependency outages) in pre-production or production environments with limited blast radius answers “how likely is this to fail” with data, not intuition.
Both complement each other: QA prevents known failures (lowers O before deploy). Chaos engineering discovers unknown failures (measures real O post-deploy). A mature program uses both.
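
Turning measurements into an O score needs a scoring rule. The sketch below assumes a chaos experiment or incident log that yields a count of failures over a number of trials; the rate bands are hypothetical and would have to be calibrated by each team:

```python
def failure_rate(failures: int, trials: int) -> float:
    """Observed fraction of trials (deploys, injected faults) that failed."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= failures <= trials:
        raise ValueError("failures must be between 0 and trials")
    return failures / trials


# Hypothetical bands: (upper bound of failure rate, Occurrence score).
OCCURRENCE_BANDS = [(0.001, 2), (0.01, 4), (0.05, 6), (0.20, 8)]


def occurrence_score(rate: float) -> int:
    """Map an observed failure rate to an Occurrence score on a 1-10 scale."""
    for upper_bound, score in OCCURRENCE_BANDS:
        if rate < upper_bound:
            return score
    return 10  # at or above the highest band


# Example: 3 failed runs out of 200 injected-latency experiments.
rate = failure_rate(failures=3, trials=200)
score = occurrence_score(rate)
```

The exact bands matter less than applying the same rule every time, so that O scores stay comparable across features.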

Architectural context questions

Once the 4 FMEA questions are answered, expand with the specific system context:

Context question	What it informs
What services does this change touch?	Blast radius scope. If it touches multiple services, it requires contract tests
Is it reversible? What is the rollback?	Recovery capability. If not reversible, it needs more gates and validations
What business flows does it impact?	Criticality. Map to the sequence diagram to identify dependencies
Are there known failure modes in similar services?	Consult existing SFMEA for the service. Reuse lessons learned from postmortems
What observability do I need to know it works?	Design metrics, logs and alerts before implementing

Design observability before implementing

Principle: If you can’t observe it, don’t deploy it. Observability is not a post-hoc add-on — it is a design requirement.

For each feature with medium or high risk, define before coding:

Metrics: what RED (Rate, Errors, Duration) signals do I need? Where do I emit them? What dashboard shows them?
Logs: what events should I log with sufficient context? Correlation ID, user ID, transaction ID, result.
Alerts: what condition triggers an alert? What is the threshold? Who receives the alert? What runbook addresses it?
Traces: if the flow crosses multiple services, how do I follow the transaction end-to-end?
SLI/SLO: does this change affect any existing SLO? Does it need a new SLI?
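
As a rough illustration of the metrics and logs items above, the following sketch wraps an operation so that it emits RED signals and one structured log line with correlation ID, user ID, transaction ID and result. It uses only the Python standard library; the in-memory `RED` dictionary stands in for a real metrics client, and the function and field names are assumptions of this example:

```python
import json
import logging
import time
import uuid
from collections import defaultdict

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("feature")

# In-memory stand-in for a real metrics client (an assumption of this sketch).
RED = {
    "rate": defaultdict(int),       # requests per operation
    "errors": defaultdict(int),     # failed requests per operation
    "duration": defaultdict(list),  # elapsed seconds per request
}


def observed(operation, handler, *, user_id, transaction_id, correlation_id=None):
    """Run handler() while recording RED signals and one structured log line."""
    correlation_id = correlation_id or str(uuid.uuid4())
    started = time.perf_counter()
    result = "ok"
    try:
        return handler()
    except Exception:
        result = "error"
        RED["errors"][operation] += 1
        raise  # the caller still sees the original failure
    finally:
        elapsed = time.perf_counter() - started
        RED["rate"][operation] += 1
        RED["duration"][operation].append(elapsed)
        logger.info(json.dumps({
            "operation": operation,
            "correlation_id": correlation_id,
            "user_id": user_id,
            "transaction_id": transaction_id,
            "result": result,
            "duration_s": round(elapsed, 6),
        }))
```

A call such as `observed("create_order", handler, user_id="u1", transaction_id="t1")` would then feed both the dashboard counters and the log stream from a single place, which makes it harder to ship the feature without its signals.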

Definition of Done with risk and observability

The definition of “done” for any change in a tier-1 service must include:

Criterion	Detail
Tests adequate to risk	If it touches critical logic: unit tests. If it touches contracts: contract/integration tests. If it touches infrastructure: smoke tests
Documented rollback plan	How to revert if it fails. Feature flag, previous version, rollback script
Verified observability	Metrics emitted, logs with context, alert configured, dashboard updated
SFMEA updated	If the change introduces a new failure mode or modifies an existing one, update the service’s SFMEA sheet
Runbook updated	If the change modifies operational behavior, the runbook must reflect it
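
The criteria in this table can also be checked mechanically, for example in a merge gate. A minimal sketch, assuming each criterion is recorded as a boolean; the class and field names are hypothetical, and for the SFMEA and runbook criteria `True` is taken to mean "updated or not applicable":

```python
from dataclasses import dataclass, fields


@dataclass
class DefinitionOfDone:
    """Checklist for a change in a tier-1 service (field names are illustrative)."""

    tests_adequate_to_risk: bool = False
    rollback_plan_documented: bool = False
    observability_verified: bool = False
    sfmea_updated: bool = False    # True also covers "no failure mode changed"
    runbook_updated: bool = False  # True also covers "no operational change"


def missing_criteria(checklist: DefinitionOfDone) -> list[str]:
    """Return the names of the criteria that are not yet satisfied."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]


def is_done(checklist: DefinitionOfDone) -> bool:
    """A change is done only when no criterion is missing."""
    return not missing_criteria(checklist)


checklist = DefinitionOfDone(
    tests_adequate_to_risk=True,
    rollback_plan_documented=True,
    observability_verified=True,
    sfmea_updated=True,
)
print(missing_criteria(checklist))
```

Here the gate would report `runbook_updated` as the only open item.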

When each level of analysis is triggered

Not every change requires a full fault tree analysis (FTA). The intensity of analysis scales with risk:

Type of change	Analysis level	What is done
Simple bug fix, config change	Basic checklist	PR review, smoke test, verify rollback
New feature in existing service	Lightweight SFMEA + observability	Risk questions, update SFMEA if applicable, design metrics and alerts
Change in integration between services	SFMEA + contract tests	Analyze impact on consumers, pact tests, update sequence diagram
New service or migration	FTA + SFMEA + HAZOP	Full fault tree mapped to architecture, deployment flow HAZOP, end-to-end observability
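
The table above can be encoded as a simple lookup, for instance to pre-fill a pull request template. A sketch under the assumption that every change is classified into exactly one of the four types; the enum and function names are illustrative:

```python
from enum import Enum


class ChangeType(Enum):
    BUG_FIX_OR_CONFIG = "simple bug fix, config change"
    NEW_FEATURE = "new feature in existing service"
    INTEGRATION_CHANGE = "change in integration between services"
    NEW_SERVICE_OR_MIGRATION = "new service or migration"


# Analysis level and activities, copied from the table above.
ANALYSIS_BY_CHANGE = {
    ChangeType.BUG_FIX_OR_CONFIG: (
        "Basic checklist",
        ["PR review", "smoke test", "verify rollback"],
    ),
    ChangeType.NEW_FEATURE: (
        "Lightweight SFMEA + observability",
        ["risk questions", "update SFMEA if applicable", "design metrics and alerts"],
    ),
    ChangeType.INTEGRATION_CHANGE: (
        "SFMEA + contract tests",
        ["analyze impact on consumers", "pact tests", "update sequence diagram"],
    ),
    ChangeType.NEW_SERVICE_OR_MIGRATION: (
        "FTA + SFMEA + HAZOP",
        ["full fault tree mapped to architecture", "deployment flow HAZOP",
         "end-to-end observability"],
    ),
}


def required_analysis(change_type: ChangeType) -> tuple[str, list[str]]:
    """Return the analysis level and a copy of the activities for a change type."""
    level, activities = ANALYSIS_BY_CHANGE[change_type]
    return level, list(activities)


level, activities = required_analysis(ChangeType.INTEGRATION_CHANGE)
```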

Example: feature planning with RDD

Feature: Migrate Order Processor from VB to .NET 8

Risk: the current service has a risk priority number (RPN) of 504 for dropped work. The migration may introduce new failure modes if transactional semantics are not preserved.
Updated FTA: add branches for the new service — data migration timeout, format incompatibility, dual write during transition period.
Observability designed beforehand: reconciliation metrics (incoming vs. persisted orders), transaction logs with correlation ID, and an alert when divergence exceeds 0.1%.
Planned tests: contract tests against existing consumers, comparative performance baseline (old vs. new), complete transaction smoke tests.
Rollback plan: feature flag to redirect traffic to old service if the new one fails. Dual-run period with reconciliation.
Updated SFMEA: new failure modes for the .NET 8 service, with estimated S-O-D and preventive actions.
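
The reconciliation check in this plan could look roughly like the following. This is a sketch, assuming that incoming and persisted order counts are available for the same time window and that divergence is measured relative to incoming orders; the function names are hypothetical, while the 0.1% threshold comes from the plan above:

```python
DIVERGENCE_THRESHOLD = 0.001  # 0.1%, as stated in the observability design


def divergence(incoming: int, persisted: int) -> float:
    """Relative gap between incoming and persisted orders in one window."""
    if incoming < 0 or persisted < 0:
        raise ValueError("counts must be non-negative")
    if incoming == 0:
        # Persisted orders with no incoming ones is itself an anomaly.
        return 0.0 if persisted == 0 else float("inf")
    return abs(incoming - persisted) / incoming


def should_alert(incoming: int, persisted: int,
                 threshold: float = DIVERGENCE_THRESHOLD) -> bool:
    """True when the divergence exceeds the threshold."""
    return divergence(incoming, persisted) > threshold


# 10,000 incoming orders with 9,980 persisted is a 0.2% gap: alert.
alert = should_alert(incoming=10_000, persisted=9_980)
# 10,000 incoming orders with 9,995 persisted is a 0.05% gap: no alert.
quiet = should_alert(incoming=10_000, persisted=9_995)
```

Running the same check against both the old and the new service during the dual-run period gives an early signal of dropped work before traffic is fully switched.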

The impact: Without Risk-Driven Development, risk analysis only occurs post-incident (reactive). With RDD, risk is analyzed during design (proactive), observability is built before deploy, and postmortems validate whether the analysis was correct — closing the continuous improvement loop.