Risk-Driven Development: Risk and Observability from Design
How to integrate FMEA-based risk analysis into your development cycle so that design, testing, and observability decisions are prioritized by the risk they mitigate — not by uniform coverage.
What is Risk-Driven Development
Risk-Driven Development (RDD) is the principle that design, testing and observability decisions are prioritized by the risk they mitigate, not by uniform coverage. In an environment with limited resources — one person, a legacy stack, a team without dedicated SRE — this is critical: you cannot instrument everything, but you can instrument what hurts the most.
Risk analysis is not something done “after production”. It is a design activity that occurs when planning each feature, each architectural change and each migration.
Development cycle with integrated risk
Each feature or significant change goes through an expanded cycle:
- Plan feature — define scope and requirements
- Analyze risk — identify failure modes before writing code
- Design observability — define metrics, logs, alerts, traces
- Implement — write the code with risk mitigations built in
- Validate — test against identified failure scenarios
- Monitor — verify in production that risk assumptions hold
The key shift: risk analysis and observability design happen before implementation, not after.
The 4 FMEA questions as a design methodology
Before any code is written, each feature must pass through four questions taken from FMEA (Failure Mode and Effects Analysis) and applied to software design. These questions are the core of the risk analysis, and they belong to feature design, not to a post-development review.
1. What can fail? (Failure mode)
Identify the failure modes this change introduces or modifies. A change in an API can break consumers. A data migration can lose records. A new endpoint can saturate the database. If you don’t identify how it can fail, you’re not ready to implement.
2. How serious is it if it fails? (Severity)
Rate the business impact of the failure on a 1–10 scale. Data loss is severity 9–10; latency degradation may be 5–6. The score determines whether the change needs automatic rollback, a feature flag or a canary deploy, or whether a smoke test is sufficient.
3. How likely is it to fail? (Occurrence)
Estimate the probability based on change complexity, failure history in similar components, and quality of available tests. A change in critical business logic with partial tests has high occurrence.
4. Would we detect it in time? (Detection)
This is the most important question for operations. If the failure occurs, how long would it take us to know? If the answer is “when the customer calls”, detection is 9–10 (a higher score means the failure is harder to detect) and we need to design metrics, alerts and logs before deploying.
Rule: if you cannot answer the 4 questions, the change is not ready to be implemented.
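The three scores can be combined into a single number for ranking. The sketch below assumes the conventional FMEA Risk Priority Number, RPN = S × O × D, with each score on a 1–10 scale; the class name, field names and example scores are illustrative and are not taken from any existing SFMEA sheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One row of a hypothetical FMEA sheet; each score is on a 1-10 scale."""
    description: str
    severity: int    # 10 = worst business impact
    occurrence: int  # 10 = almost certain to happen
    detection: int   # 10 = nobody notices until a customer reports it

    def __post_init__(self) -> None:
        # Reject scores outside the assumed 1-10 scale.
        for name in ("severity", "occurrence", "detection"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        """Conventional Risk Priority Number: S x O x D."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes: list[FailureMode]) -> list[FailureMode]:
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Illustrative scores only.
modes = [
    FailureMode("New endpoint saturates the database", severity=6, occurrence=5, detection=3),
    FailureMode("Data migration loses records", severity=9, occurrence=4, detection=8),
]
ranked = prioritize(modes)
```

Ranking by RPN is only a starting point: a failure mode with severity 9–10 usually deserves attention even when its product is low.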
Quick reference: questions by FMEA dimension
Failure mode — “What can fail?”
| Example question | What it reveals |
|---|---|
| If this endpoint receives double the traffic, what happens? | Identifies saturation, bottlenecks and undersized dependencies |
| What happens if the database takes 10x longer than normal? | Reveals absence of timeouts, circuit breakers or back-pressure queues |
| If the upstream service changes its contract, what breaks here? | Exposes fragile coupling and absence of contract tests |
| What happens if this job runs twice in a row? | Validates idempotency; if not idempotent, there is risk of duplicates or corruption |
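The idempotency question in the table above can be checked with a very small test. The following is a minimal sketch that assumes each order carries a unique ID; the `OrderJob` class is hypothetical, and its in-memory dictionary stands in for whatever persistent store a real job would use.

```python
class OrderJob:
    """Hypothetical batch job that applies each order at most once."""

    def __init__(self) -> None:
        self.processed: dict[str, float] = {}  # order_id -> amount already applied

    def run(self, orders: list[tuple[str, float]]) -> int:
        """Apply orders not seen before and return how many were applied."""
        applied = 0
        for order_id, amount in orders:
            if order_id in self.processed:
                continue  # already applied: running the job again is harmless
            self.processed[order_id] = amount
            applied += 1
        return applied


job = OrderJob()
batch = [("o-1", 10.0), ("o-2", 25.5)]
first_run = job.run(batch)
second_run = job.run(batch)  # simulates the job running twice in a row
```

If the second run applied anything, the job would not be idempotent and the failure mode "duplicates or corruption" would belong in the analysis.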
Severity — “How serious is it?”
| Example question | What it reveals |
|---|---|
| If this change fails, does the user lose data or lose functionality? | Distinguishes between critical failure (S=9–10) and contained degradation (S=4–6) |
| Does the failure affect one user, a segment or all? | Determines blast radius and whether it needs a feature flag or canary |
| Is there regulatory, contractual or reputational impact? | Identifies severities that are not technical but are business-critical |
Occurrence — “How likely is it?”
| Example question | What it reveals |
|---|---|
| How many times has something similar failed in the last 6 months? | Uses incident history and postmortems as evidence. Don’t guess: measure |
| What test coverage does the logic I’m modifying have? | Untested code should be scored with high occurrence by default; tests that cover the modified logic lower O directly |
| Does the change touch legacy code without documentation? | Undocumented code has a higher failure probability because its expected behavior is unknown |
Detection — “Would we detect it?”
| Example question | What it reveals |
|---|---|
| If this fails at 3am, does anyone find out before the customer? | Evaluates whether actionable alerts exist. If the answer is no, D=9–10 |
| Do I have a metric that shows degradation before total failure? | Detecting progressive degradation is better than waiting for complete failure |
| Does the post-deploy smoke test cover this flow? | If the flow has no smoke test, a failure can escape to production undetected |
Occurrence: QA and Chaos Engineering as measurement tools
Occurrence (O) is not just a subjective estimate — it can and should be actively measured:
- QA directly reduces Occurrence: each test that covers a failure mode lowers the probability of that failure reaching production. Unit tests on critical logic, contract tests on integrations and post-deploy smoke tests are controls that measurably reduce O.
- Chaos engineering measures Occurrence under real conditions: injecting controlled failures (latency, errors, dependency outages) in pre-production or production environments with limited blast radius answers “how likely is this to fail” with data, not intuition.
- Both complement each other: QA prevents known failures (lowers O before deploy). Chaos engineering discovers unknown failures (measures real O post-deploy). A mature program uses both.
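To make the idea of measuring O concrete, here is a minimal fault-injection sketch. It does not use any chaos engineering tool: a wrapper fails a call with a configurable probability, and the observed failure rate is compared with and without a retry. Every name, the 30% injection rate and the retry count are assumptions chosen for illustration.

```python
import random


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a dependency failure."""


def with_fault_injection(func, failure_rate: float, rng: random.Random):
    """Wrap func so that each call fails with probability failure_rate."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("injected dependency failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts: int):
    """Call func up to `attempts` times and re-raise the last injected fault."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as exc:
            last_error = exc
    raise last_error


def observed_failure_rate(func, trials: int) -> float:
    """Fraction of calls that end in an injected fault."""
    failures = 0
    for _ in range(trials):
        try:
            func()
        except InjectedFault:
            failures += 1
    return failures / trials


rng = random.Random(42)  # fixed seed so the experiment is repeatable
flaky_dependency = with_fault_injection(lambda: "ok", failure_rate=0.3, rng=rng)

rate_without_retry = observed_failure_rate(flaky_dependency, trials=1000)
rate_with_retry = observed_failure_rate(
    lambda: call_with_retry(flaky_dependency, attempts=3), trials=1000
)
```

Comparing the two rates shows how much a mitigation such as a retry lowers the occurrence seen by the caller, which is the kind of evidence that can replace a guessed O score.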
Architectural context questions
Once the 4 FMEA questions are answered, expand with the specific system context:
| Context question | What it informs |
|---|---|
| What services does this change touch? | Blast radius scope. If it touches multiple services, it requires contract tests |
| Is it reversible? What is the rollback? | Recovery capability. If not reversible, it needs more gates and validations |
| What business flows does it impact? | Criticality. Map to the sequence diagram to identify dependencies |
| Are there known failure modes in similar services? | Consult the existing SFMEA (software FMEA) for the service. Reuse lessons learned from postmortems |
| What observability do I need to know it works? | Design metrics, logs and alerts before implementing |
Design observability before implementing
Principle: If you can’t observe it, don’t deploy it. Observability is not a post-hoc add-on — it is a design requirement.
For each feature with medium or high risk, define before coding:
- Metrics: what RED (Rate, Errors, Duration) signals do I need? Where do I emit them? What dashboard shows them?
- Logs: what events should I log with sufficient context? Correlation ID, user ID, transaction ID, result.
- Alerts: what condition triggers an alert? What is the threshold? Who receives the alert? What runbook addresses it?
- Traces: if the flow crosses multiple services, how do I follow the transaction end-to-end?
- SLI/SLO: does this change affect any existing SLO? Does it need a new SLI?
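The metrics and logs items above can be sketched with the standard library alone. The example below is an illustration, not a recommended stack: `RedMetrics` is a hypothetical in-memory stand-in for a real metrics backend, and `create_order` is an invented operation used only to exercise the decorator. It records the three RED signals and writes one structured log line per call that carries the correlation ID and the result.

```python
import functools
import json
import logging
import time
from collections import defaultdict

logger = logging.getLogger("rdd.example")  # hypothetical logger name


class RedMetrics:
    """In-memory stand-in for a metrics backend (illustration only)."""

    def __init__(self) -> None:
        self.requests = defaultdict(int)    # Rate: calls per operation
        self.errors = defaultdict(int)      # Errors: failed calls per operation
        self.durations = defaultdict(list)  # Duration: seconds per call

    def record(self, operation: str, seconds: float, ok: bool) -> None:
        self.requests[operation] += 1
        if not ok:
            self.errors[operation] += 1
        self.durations[operation].append(seconds)

    def error_ratio(self, operation: str) -> float:
        total = self.requests.get(operation, 0)
        return self.errors.get(operation, 0) / total if total else 0.0


metrics = RedMetrics()


def observed(operation: str):
    """Decorator: record RED signals and one structured log line per call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(correlation_id: str, *args, **kwargs):
            start = time.perf_counter()
            ok = False
            try:
                result = func(correlation_id, *args, **kwargs)
                ok = True
                return result
            finally:
                # Runs on success and on failure, so errors are always counted.
                elapsed = time.perf_counter() - start
                metrics.record(operation, elapsed, ok)
                logger.info(json.dumps({
                    "operation": operation,
                    "correlation_id": correlation_id,
                    "result": "ok" if ok else "error",
                    "duration_ms": round(elapsed * 1000, 3),
                }))
        return wrapper
    return decorator


@observed("create_order")
def create_order(correlation_id: str, amount: float) -> dict:
    """Hypothetical business operation used only to exercise the decorator."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    return {"correlation_id": correlation_id, "amount": amount}
```

An alert condition can then be expressed against `metrics.error_ratio("create_order")`. The threshold, the recipient and the runbook still have to be decided as part of the design.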
Definition of Done with risk and observability
The definition of “done” for any change in a tier-1 service must include:
| Criterion | Detail |
|---|---|
| Tests adequate to risk | If it touches critical logic: unit tests. If it touches contracts: contract/integration tests. If it touches infrastructure: smoke tests |
| Documented rollback plan | How to revert if it fails. Feature flag, previous version, rollback script |
| Verified observability | Metrics emitted, logs with context, alert configured, dashboard updated |
| SFMEA updated | If the change introduces a new failure mode or modifies an existing one, update the service’s SFMEA sheet |
| Runbook updated | If the change modifies operational behavior, the runbook must reflect it |
When each level of analysis is triggered
Not every change requires a full FTA (fault tree analysis) or a HAZOP (hazard and operability study). The intensity of analysis scales with risk:
| Type of change | Analysis level | What is done |
|---|---|---|
| Simple bug fix, config change | Basic checklist | PR review, smoke test, verify rollback |
| New feature in existing service | Lightweight SFMEA + observability | Risk questions, update SFMEA if applicable, design metrics and alerts |
| Change in integration between services | SFMEA + contract tests | Analyze impact on consumers, pact tests, update sequence diagram |
| New service or migration | FTA + SFMEA + HAZOP | Full fault tree mapped to architecture, deployment flow HAZOP, end-to-end observability |
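The table above can be encoded as data so that a PR template or review checklist can look up the expected analysis. This is a sketch under the assumption that changes are classified by hand into one of the four types; the enum, the dataclass and the function name are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class ChangeType(Enum):
    BUG_FIX_OR_CONFIG = "simple bug fix or config change"
    NEW_FEATURE = "new feature in existing service"
    INTEGRATION_CHANGE = "change in integration between services"
    NEW_SERVICE_OR_MIGRATION = "new service or migration"


@dataclass(frozen=True)
class AnalysisPlan:
    level: str
    activities: tuple[str, ...]


# Direct transcription of the table; nothing here is computed.
ANALYSIS_BY_CHANGE = {
    ChangeType.BUG_FIX_OR_CONFIG: AnalysisPlan(
        "Basic checklist",
        ("PR review", "smoke test", "verify rollback"),
    ),
    ChangeType.NEW_FEATURE: AnalysisPlan(
        "Lightweight SFMEA + observability",
        ("risk questions", "update SFMEA if applicable", "design metrics and alerts"),
    ),
    ChangeType.INTEGRATION_CHANGE: AnalysisPlan(
        "SFMEA + contract tests",
        ("analyze impact on consumers", "pact tests", "update sequence diagram"),
    ),
    ChangeType.NEW_SERVICE_OR_MIGRATION: AnalysisPlan(
        "FTA + SFMEA + HAZOP",
        ("full fault tree mapped to architecture", "deployment flow HAZOP",
         "end-to-end observability"),
    ),
}


def required_analysis(change_type: ChangeType) -> AnalysisPlan:
    """Look up the analysis level and activities for a type of change."""
    return ANALYSIS_BY_CHANGE[change_type]
```

For example, `required_analysis(ChangeType.INTEGRATION_CHANGE).level` returns `"SFMEA + contract tests"`.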
Example: feature planning with RDD
Feature: Migrate Order Processor from VB to .NET 8
- Risk: the current service has an RPN (risk priority number) of 504 for dropped work. The migration may introduce new failure modes if transactional semantics are not preserved.
- Updated FTA: add branches for the new service (data migration timeout, format incompatibility, dual write during the transition period).
- Observability designed beforehand: reconciliation metrics (incoming vs. persisted orders), transaction logs with correlation ID, divergence alert >0.1%.
- Planned tests: contract tests against existing consumers, comparative performance baseline (old vs. new), complete transaction smoke tests.
- Rollback plan: feature flag to redirect traffic to the old service if the new one fails. Dual-run period with reconciliation.
- Updated SFMEA: new failure modes for the .NET 8 service, with estimated S-O-D and preventive actions.
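The reconciliation metric and its divergence alert can be sketched as follows. The 0.1% threshold comes from the example above; the function names, the choice to measure divergence relative to incoming orders, and the handling of a window with no incoming orders are assumptions.

```python
DIVERGENCE_THRESHOLD = 0.001  # 0.1%, as in the example above


def divergence_ratio(incoming: int, persisted: int) -> float:
    """Relative gap between orders received and orders persisted."""
    if incoming < 0 or persisted < 0:
        raise ValueError("counts must be non-negative")
    if incoming == 0:
        # Assumption: persisted orders with no incoming ones is always a divergence.
        return 0.0 if persisted == 0 else float("inf")
    return abs(incoming - persisted) / incoming


def should_alert(incoming: int, persisted: int,
                 threshold: float = DIVERGENCE_THRESHOLD) -> bool:
    """True when the divergence for the window exceeds the threshold."""
    return divergence_ratio(incoming, persisted) > threshold


# Illustrative counts for one reconciliation window.
within_tolerance = should_alert(incoming=10_000, persisted=9_995)  # 0.05% gap
needs_alert = should_alert(incoming=10_000, persisted=9_980)       # 0.2% gap
```

During the dual-run period the same check could run once per reconciliation window against counts from the old and new services.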
The impact: Without Risk-Driven Development, risk analysis only occurs post-incident (reactive). With RDD, risk is analyzed during design (proactive), observability is built before deploy, and postmortems validate whether the analysis was correct — closing the continuous improvement loop.