
Risk-Driven Development: Risk and Observability from Design

How to integrate FMEA-based risk analysis into your development cycle so that design, testing, and observability decisions are prioritized by the risk they mitigate — not by uniform coverage.

What is Risk-Driven Development

Risk-Driven Development (RDD) is the principle that design, testing and observability decisions are prioritized by the risk they mitigate, not by uniform coverage. In an environment with limited resources — one person, a legacy stack, a team without dedicated SRE — this is critical: you cannot instrument everything, but you can instrument what hurts the most.

Risk analysis is not something done “after production”. It is a design activity that occurs when planning each feature, each architectural change and each migration.

Development cycle with integrated risk

Each feature or significant change goes through an expanded cycle:

  1. Plan feature — define scope and requirements
  2. Analyze risk — identify failure modes before writing code
  3. Design observability — define metrics, logs, alerts, traces
  4. Implement — write the code with risk mitigations built in
  5. Validate — test against identified failure scenarios
  6. Monitor — verify in production that risk assumptions hold

The key shift: steps 2 and 3 happen before implementation, not after.

The 4 FMEA questions as a design methodology

Before writing a single line of code, each feature must pass through 4 fundamental FMEA questions applied to software design. These questions are the core of risk analysis. They are not a post-development exercise — they are part of feature design.

1. What can fail? (Failure mode)

Identify the failure modes this change introduces or modifies. A change in an API can break consumers. A data migration can lose records. A new endpoint can saturate the database. If you don’t identify how it can fail, you’re not ready to implement.

2. How serious is it if it fails? (Severity)

Evaluate the business impact if the failure occurs, on a 1–10 scale. Data loss is severity 9–10. Latency degradation may be 5–6. The score determines whether the change needs automatic rollback, a feature flag or a canary deploy, or whether a smoke test is sufficient.

3. How likely is it to fail? (Occurrence)

Estimate the probability based on change complexity, failure history in similar components, and quality of available tests. A change in critical business logic with partial tests has high occurrence.

4. Would we detect it in time? (Detection)

The most important question for operations. If the failure occurs, how long would it take us to know? The detection score is inverted: a high value means the failure is hard to detect. If the answer is “when the customer calls”, detection is 9–10 and we need to design metrics, alerts and logs before deploying.

Rule: if you cannot answer the 4 questions, the change is not ready to be implemented.
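
In FMEA the three scores are conventionally multiplied into a Risk Priority Number, RPN = S × O × D, which gives a simple way to rank failure modes. The following is a minimal sketch under that convention; the `FailureMode` class, the `prioritize` helper and the example scores are hypothetical illustrations, not values from a real SFMEA sheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One row of a hypothetical SFMEA sheet; each score is on a 1-10 scale."""
    description: str
    severity: int    # 10 = worst business impact
    occurrence: int  # 10 = almost certain to happen
    detection: int   # 10 = we would only find out from the customer

    def __post_init__(self) -> None:
        for name in ("severity", "occurrence", "detection"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        """Risk Priority Number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes: list[FailureMode]) -> list[FailureMode]:
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Illustrative scores only (assumed, not measured).
modes = [
    FailureMode("Migration loses records", severity=10, occurrence=3, detection=8),
    FailureMode("New endpoint saturates the database", severity=6, occurrence=5, detection=4),
    FailureMode("API change breaks a consumer", severity=7, occurrence=6, detection=7),
]

for mode in prioritize(modes):
    print(f"{mode.rpn:4d}  {mode.description}")
```

Ranking by RPN is only a starting point: a failure mode with severity 9–10 usually deserves attention even when its product is modest.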

Quick reference: questions by FMEA dimension

Failure mode — “What can fail?”

| Example question | What it reveals |
| --- | --- |
| If this endpoint receives double the traffic, what happens? | Identifies saturation, bottlenecks and undersized dependencies |
| What happens if the database takes 10x longer than normal? | Reveals absence of timeouts, circuit breakers or back-pressure queues |
| If the upstream service changes its contract, what breaks here? | Exposes fragile coupling and absence of contract tests |
| What happens if this job runs twice in a row? | Validates idempotency; if not idempotent, there is risk of duplicates or corruption |

Severity — “How serious is it?”

| Example question | What it reveals |
| --- | --- |
| If this change fails, does the user lose data or lose functionality? | Distinguishes between critical failure (S=9–10) and contained degradation (S=4–6) |
| Does the failure affect one user, a segment or all? | Determines blast radius and whether it needs a feature flag or canary |
| Is there regulatory, contractual or reputational impact? | Identifies severities that are not technical but are business-critical |

Occurrence — “How likely is it?”

| Example question | What it reveals |
| --- | --- |
| How many times has something similar failed in the last 6 months? | Uses incident history and postmortems as evidence. Don’t guess: measure |
| What test coverage does the logic I’m modifying have? | Code without tests has high occurrence by definition; QA reduces O directly |
| Does the change touch legacy code without documentation? | Code without documentation has higher failure probability due to unknown expected behavior |

Detection — “Would we detect it?”

| Example question | What it reveals |
| --- | --- |
| If this fails at 3am, does anyone find out before the customer? | Evaluates whether actionable alerts exist. If the answer is no, D=9–10 |
| Do I have a metric that shows degradation before total failure? | Detecting progressive degradation is better than waiting for complete failure |
| Does the post-deploy smoke test cover this flow? | If the flow has no smoke test, a failure can escape to production undetected |

Occurrence: QA and Chaos Engineering as measurement tools

Occurrence (O) is not just a subjective estimate — it can and should be actively measured:

  • QA directly reduces Occurrence: each test that covers a failure mode lowers the probability of that failure reaching production. Unit tests on critical logic, contract tests on integrations and post-deploy smoke tests are controls that measurably reduce O.

  • Chaos engineering measures Occurrence under real conditions: injecting controlled failures (latency, errors, dependency outages) in pre-production or production environments with limited blast radius answers “how likely is this to fail” with data, not intuition.

  • Both complement each other: QA prevents known failures (lowers O before deploy). Chaos engineering discovers unknown failures (measures real O post-deploy). A mature program uses both.
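
To make “measure, don’t guess” concrete, an observed failure rate from incident history or from chaos experiments can be mapped onto the 1–10 Occurrence scale. This is a minimal sketch: the band thresholds in `OCCURRENCE_BANDS` are hypothetical and should be calibrated by each team, and the sample counts are invented for illustration.

```python
def failure_rate(failures: int, trials: int) -> float:
    """Observed failure rate from incident history or chaos experiment runs."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= failures <= trials:
        raise ValueError("failures must be between 0 and trials")
    return failures / trials


# Hypothetical bands: (upper bound of failure rate, Occurrence score).
OCCURRENCE_BANDS = [
    (0.0001, 1),
    (0.001, 2),
    (0.005, 3),
    (0.01, 4),
    (0.02, 5),
    (0.05, 6),
    (0.10, 7),
    (0.20, 8),
    (0.50, 9),
]


def occurrence_score(rate: float) -> int:
    """Map a failure rate in [0, 1] to a 1-10 Occurrence score."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    for upper_bound, score in OCCURRENCE_BANDS:
        if rate <= upper_bound:
            return score
    return 10


# Invented examples: 3 failures in 40 latency injections, 1 failure in 500 deploys.
chaos_o = occurrence_score(failure_rate(3, 40))
history_o = occurrence_score(failure_rate(1, 500))
print(chaos_o, history_o)
```

With small sample sizes the rate is noisy, so the resulting score is best treated as evidence to discuss rather than a precise measurement.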

Architectural context questions

Once the 4 FMEA questions are answered, expand with the specific system context:

| Context question | What it informs |
| --- | --- |
| What services does this change touch? | Blast radius scope. If it touches multiple services, it requires contract tests |
| Is it reversible? What is the rollback? | Recovery capability. If not reversible, it needs more gates and validations |
| What business flows does it impact? | Criticality. Map to the sequence diagram to identify dependencies |
| Are there known failure modes in similar services? | Consult existing SFMEA for the service. Reuse lessons learned from postmortems |
| What observability do I need to know it works? | Design metrics, logs and alerts before implementing |

Design observability before implementing

Principle: If you can’t observe it, don’t deploy it. Observability is not a post-hoc add-on — it is a design requirement.

For each feature with medium or high risk, define before coding:

  • Metrics: what RED (Rate, Errors, Duration) signals do I need? Where do I emit them? What dashboard shows them?
  • Logs: what events should I log with sufficient context? Correlation ID, user ID, transaction ID, result.
  • Alerts: what condition triggers an alert? What is the threshold? Who receives the alert? What runbook addresses it?
  • Traces: if the flow crosses multiple services, how do I follow the transaction end-to-end?
  • SLI/SLO: does this change affect any existing SLO? Does it need a new SLI?
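
As a rough illustration of the metrics and logs items in the checklist above, the sketch below wraps an operation so that every call emits RED signals and one structured log line with a correlation ID. It uses only the Python standard library; `RedMetrics` is an in-memory stand-in for whatever metrics client you actually use, and `instrumented` and `place_order` are hypothetical names.

```python
import json
import logging
import time
import uuid
from collections import defaultdict

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("orders")


class RedMetrics:
    """In-memory stand-in for a metrics client (Rate, Errors, Duration)."""

    def __init__(self) -> None:
        self.requests = defaultdict(int)
        self.errors = defaultdict(int)
        self.durations = defaultdict(list)

    def observe(self, operation: str, seconds: float, ok: bool) -> None:
        self.requests[operation] += 1
        if not ok:
            self.errors[operation] += 1
        self.durations[operation].append(seconds)

    def error_ratio(self, operation: str) -> float:
        total = self.requests[operation]
        return self.errors[operation] / total if total else 0.0


metrics = RedMetrics()


def instrumented(operation, handler, payload, correlation_id=None):
    """Run handler(payload), recording RED metrics and one structured log line."""
    correlation_id = correlation_id or str(uuid.uuid4())
    start = time.perf_counter()
    ok = False
    try:
        result = handler(payload)
        ok = True
        return result
    finally:
        # Runs on success and on failure, so errors are always counted.
        elapsed = time.perf_counter() - start
        metrics.observe(operation, elapsed, ok)
        logger.info(json.dumps({
            "operation": operation,
            "correlation_id": correlation_id,
            "result": "ok" if ok else "error",
            "duration_ms": round(elapsed * 1000, 2),
        }))


def place_order(payload):
    """Hypothetical business operation used only for the example."""
    if payload["quantity"] <= 0:
        raise ValueError("quantity must be positive")
    return {"status": "accepted"}


instrumented("place_order", place_order, {"quantity": 2})
```

An alert rule could then be expressed against `metrics.error_ratio("place_order")` with a threshold agreed during design, which keeps the alert condition tied to the failure mode it is meant to detect.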

Definition of Done with risk and observability

The definition of “done” for any change in a tier-1 service must include:

| Criterion | Detail |
| --- | --- |
| Tests adequate to risk | If it touches critical logic: unit tests. If it touches contracts: contract/integration tests. If it touches infrastructure: smoke tests |
| Documented rollback plan | How to revert if it fails. Feature flag, previous version, rollback script |
| Verified observability | Metrics emitted, logs with context, alert configured, dashboard updated |
| SFMEA updated | If the change introduces a new failure mode or modifies an existing one, update the service’s SFMEA sheet |
| Runbook updated | If the change modifies operational behavior, the runbook must reflect it |

When each level of analysis is triggered

Not every change requires a full fault tree analysis (FTA). The intensity of analysis scales with risk:

| Type of change | Analysis level | What is done |
| --- | --- | --- |
| Simple bug fix, config change | Basic checklist | PR review, smoke test, verify rollback |
| New feature in existing service | Lightweight SFMEA + observability | Risk questions, update SFMEA if applicable, design metrics and alerts |
| Change in integration between services | SFMEA + contract tests | Analyze impact on consumers, pact tests, update sequence diagram |
| New service or migration | FTA + SFMEA + HAZOP | Full fault tree mapped to architecture, deployment flow HAZOP, end-to-end observability |
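
The mapping from change type to analysis level can be encoded so that a PR template or CI check can look up what a change requires. A minimal sketch; the `ChangeType` enum and `required_analysis` function are hypothetical names, and the entries simply mirror the four change types described in this section.

```python
from enum import Enum


class ChangeType(Enum):
    BUG_FIX_OR_CONFIG = "simple bug fix, config change"
    NEW_FEATURE = "new feature in existing service"
    INTEGRATION_CHANGE = "change in integration between services"
    NEW_SERVICE_OR_MIGRATION = "new service or migration"


# Analysis level and activities per change type.
ANALYSIS_PLAN = {
    ChangeType.BUG_FIX_OR_CONFIG: (
        "Basic checklist",
        ["PR review", "smoke test", "verify rollback"],
    ),
    ChangeType.NEW_FEATURE: (
        "Lightweight SFMEA + observability",
        ["risk questions", "update SFMEA if applicable", "design metrics and alerts"],
    ),
    ChangeType.INTEGRATION_CHANGE: (
        "SFMEA + contract tests",
        ["analyze impact on consumers", "pact tests", "update sequence diagram"],
    ),
    ChangeType.NEW_SERVICE_OR_MIGRATION: (
        "FTA + SFMEA + HAZOP",
        ["full fault tree mapped to architecture", "deployment flow HAZOP",
         "end-to-end observability"],
    ),
}


def required_analysis(change_type: ChangeType) -> tuple[str, list[str]]:
    """Return (analysis level, activities) for a given type of change."""
    return ANALYSIS_PLAN[change_type]


level, activities = required_analysis(ChangeType.INTEGRATION_CHANGE)
print(level)
for activity in activities:
    print(f"- {activity}")
```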

Example: feature planning with RDD

Feature: Migrate Order Processor from VB to .NET 8

  1. Risk: the current service has a Risk Priority Number (RPN = S × O × D) of 504 for dropped work. The migration may introduce new failure modes if transactional semantics are not preserved.

  2. Updated FTA: add branches for the new service — data migration timeout, format incompatibility, dual write during transition period.

  3. Observability designed beforehand: reconciliation metrics (incoming vs. persisted orders), transaction logs with correlation ID, divergence alert >0.1%.

  4. Planned tests: contract tests against existing consumers, comparative performance baseline (old vs. new), complete transaction smoke tests.

  5. Rollback plan: feature flag to redirect traffic to old service if the new one fails. Dual-run period with reconciliation.

  6. Updated SFMEA: new failure modes for the .NET 8 service, with estimated S-O-D and preventive actions.
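
The reconciliation alert from step 3 can be sketched as a small check that compares incoming and persisted order counts over a window. This is an illustration under assumptions: divergence is defined here as the absolute difference relative to incoming orders, and the function names and sample counts are hypothetical.

```python
DIVERGENCE_THRESHOLD = 0.001  # 0.1%, the alert threshold from step 3


def divergence(incoming: int, persisted: int) -> float:
    """Fraction of orders that differ between intake and persistence."""
    if incoming < 0 or persisted < 0:
        raise ValueError("counts must be non-negative")
    if incoming == 0:
        # Nothing came in: any persisted order is a full divergence.
        return 0.0 if persisted == 0 else 1.0
    return abs(incoming - persisted) / incoming


def should_alert(incoming: int, persisted: int,
                 threshold: float = DIVERGENCE_THRESHOLD) -> bool:
    """True when divergence in the window exceeds the threshold."""
    return divergence(incoming, persisted) > threshold


# Invented counts for one reconciliation window.
print(should_alert(incoming=10_000, persisted=9_995))  # 0.05% divergence
print(should_alert(incoming=10_000, persisted=9_980))  # 0.2% divergence
```

Using the absolute difference means the check also fires when the new service persists more orders than it received, which is the duplicate-write symptom that a dual-write period can produce.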

The impact: Without Risk-Driven Development, risk analysis only occurs post-incident (reactive). With RDD, risk is analyzed during design (proactive), observability is built before deploy, and postmortems validate whether the analysis was correct — closing the continuous improvement loop.