
W3-D3: Outage Reproduction, Postmortem, ADR, Cost Model

1. Definitions

| Term | Definition |
|------|------------|
| Postmortem | Document that analyzes an incident after it has been resolved: timeline, root cause, contributing factors, action items |
| Blameless principle | The postmortem focuses on systemic causes, not on individuals. Rationale: a blame culture makes people report fewer problems, so bugs stay hidden longer |
| ADR (Architecture Decision Record) | One-page document that records a single architecture decision: context, decision, alternatives, consequences |
| MTTR / MTTD / MTBF | Mean Time To Recover / Mean Time To Detect / Mean Time Between Failures |
| Error budget | (1 - SLO) × total events; the quota of events that are allowed to fail |
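To make the error budget and the mean-time metrics concrete, the sketch below computes them from made-up numbers. The function names, the sample SLO, and the incident timestamps are illustrative assumptions, and MTTR is measured here from incident start (some teams measure it from detection instead).

```python
from datetime import datetime, timedelta


def error_budget(slo: float, total_events: int) -> float:
    """Events allowed to fail in the window: (1 - SLO) x total events."""
    return (1 - slo) * total_events


def mean_minutes(durations: list[timedelta]) -> float:
    """Average a list of durations and return the result in minutes."""
    return sum(d.total_seconds() for d in durations) / len(durations) / 60


# Hypothetical service: 99.9% SLO over 2,000,000 requests in the window.
budget = error_budget(0.999, 2_000_000)
print(f"error budget: {budget:.0f} failed requests allowed")

# Hypothetical incidents as (start, detected, recovered), all in UTC.
incidents = [
    (datetime(2024, 1, 5, 2, 14), datetime(2024, 1, 5, 2, 20), datetime(2024, 1, 5, 3, 4)),
    (datetime(2024, 2, 9, 14, 0), datetime(2024, 2, 9, 14, 2), datetime(2024, 2, 9, 14, 30)),
]
mttd = mean_minutes([detected - start for start, detected, _ in incidents])
mttr = mean_minutes([recovered - start for start, _, recovered in incidents])
print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")
```

With these sample values the budget is about 2,000 failed requests, and the two incidents average 4 minutes to detect and 40 minutes to recover.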


2. Postmortem template: standard Google SRE fields

# Postmortem: <short incident name>

**Status:** complete | draft  
**Date:** YYYY-MM-DD  
**Authors:** <names>  
**Severity:** SEV1 | SEV2 | SEV3  
**Duration:** <minutes> (start UTC → end UTC)

## Summary
<2-4 sentences: what happened, who was affected, how it was fixed>

## Impact
- Users affected: <number / %>
- Revenue impact: $<estimate>
- SLO budget consumed: <%>
- External communication: <status page updates, blog post>

## Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | Trigger event (deploy, config push, traffic surge) |
| HH:MM | First user-visible symptom |
| HH:MM | First page fired |
| HH:MM | On-call ack |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
| HH:MM | Full recovery |

## Root cause
<technical explanation: what broke, why, why detection delayed>

## Contributing factors
- <factor 1: e.g., insufficient canary, missing alert>
- <factor 2>

## Detection
- How was the incident detected? (user report, alert, dashboard)
- Could it have been detected earlier?

## Response
- What went well
- What went poorly
- Where we got lucky

## Action items
| Item | Owner | Due | Priority |
|------|-------|-----|----------|
| <action 1> | <name> | <date> | P0/P1/P2 |

2.1 Blameless wording

| Blame-y (avoid) | Blameless (use) |
|-----------------|-----------------|
| “Alice pushed bad config” | “Config push pipeline allowed invalid YAML through” |
| “On-call was slow to respond” | “Alert routing didn’t reach on-call’s primary device” |
| “Engineer forgot to test” | “Pre-merge test suite didn’t cover this scenario” |

Focus: the system and the process, not the person.


3. Root cause analysis: 5 Whys vs Causal Tree

3.1 5 Whys (Toyota, 1950s)

A linear chain: ask “why?” 5 times in a row:

Symptom: API returns 500 at 02:14 UTC
Why? → DB connection pool exhausted
Why? → 1 query locks the whole pool and blocks for 30s
Why? → Query is missing an index, so it does a full table scan
Why? → Index was removed by last week's migration
Why? → Migration review did not catch the performance regression

Strength: simple, anyone can use it.
Limit: it assumes a single linear cause. Real outages usually have several branches.

3.2 Causal Tree

Multiple branches, each of which can have its own cause:

                     Outage 24h11m (GitHub 2018-10-21)
                              │
            ┌─────────────────┼─────────────────┐
            │                 │                 │
       Network blip       Orchestrator      Consistency-first
       43 seconds         failover triggered  recovery policy
            │                 │                 │
       Routine optical    Quorum logic      Engineering decision
       maintenance        deemed primary     (no data loss > faster)
       (BGP convergence   unreachable
        slow)

Use it when:

  • The outage has more than 1 failure mode at the same time (Roblox: Consul streaming + BoltDB)
  • An architectural decision contributed (consistency vs availability trade-off)

Source: Allspaw & Robbins, Web Operations, O’Reilly 2010, Ch 10. Causal tree analysis pattern.
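A causal tree maps naturally onto a small recursive data structure. The sketch below encodes the GitHub tree from this section and lists the deepest cause on each branch; the `Cause` class and `leaf_causes` helper are illustrative names, not part of any referenced tool.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    label: str
    children: list["Cause"] = field(default_factory=list)


def leaf_causes(node: Cause) -> list[str]:
    """Return the labels of all leaves, i.e. the deepest cause on each branch."""
    if not node.children:
        return [node.label]
    leaves: list[str] = []
    for child in node.children:
        leaves.extend(leaf_causes(child))
    return leaves


tree = Cause("Outage 24h11m (GitHub 2018-10-21)", [
    Cause("Network blip, 43 seconds", [Cause("Routine optical maintenance")]),
    Cause("Orchestrator failover triggered", [Cause("Quorum logic deemed primary unreachable")]),
    Cause("Consistency-first recovery policy", [Cause("Engineering decision (no data loss > faster)")]),
])

for leaf in leaf_causes(tree):
    print("-", leaf)
```

Each leaf is a candidate for its own action item, which is exactly what a single 5 Whys chain would collapse into one.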


4. Failure mode catalog: 6 patterns with real incident citations

| Pattern | Mechanism | Real incident |
|---------|-----------|---------------|
| Cascading failure | A fails → retry storm → B saturates → C saturates | AWS Lambda 2018, DynamoDB 2015 |
| Split-brain | Network partition → 2 nodes each believe they are primary → divergent state | GitHub MySQL 2018-10-21 |
| Catastrophic backtracking | Regex / parser takes exponential time on adversarial input | Cloudflare 2019-07-02 |
| Capacity exhaustion at boundary | File descriptors, connection pool, or thread pool full | LinkedIn 2017 conn pool, Cloudflare 2022 fd leak |
| Monitoring dependency loop | AIOps stack depends on a monitored service → service fails → monitoring goes blind | Roblox 2021-10-28 (Consul) |
| Operator action without guardrail | Typo or wrong-scope command takes down prod | AWS S3 2017-02-28, GitLab 2017-01-31 (db delete) |

4.1 Pattern detail: Cascading failure

User traffic
    ↓
[Service A] ──fail──→ retries with backoff
    ↓                       ↓
[Service B] ←──retry storm (10×)──
    ↓ (CPU saturated by retry handling)
[Service C] ←──fails to get response from B
    ↓
Whole system degraded

Detection trap: the alert count is highest at C (downstream), not at A (the root). A naive RCA picks C, which is wrong. RCA needs to be topology-aware and use causal-lag analysis.
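The trap can be shown with a few lines of Python. This is a deliberately simplified sketch: the alert counts and first-drift times are invented, and ranking by earliest drift is only a stand-in for real topology-aware and causal-lag analysis.

```python
from dataclasses import dataclass


@dataclass
class ServiceSignal:
    name: str
    alert_count: int
    first_drift_s: float  # seconds into the window when the metric first deviated


def root_by_count(signals: list[ServiceSignal]) -> str:
    """Naive RCA: the service with the most alerts is declared the root."""
    return max(signals, key=lambda s: s.alert_count).name


def root_by_first_drift(signals: list[ServiceSignal]) -> str:
    """Earliest drift wins; alert volume is only used to break ties."""
    return min(signals, key=lambda s: (s.first_drift_s, -s.alert_count)).name


# Hypothetical cascade: A fails first, the retry storm then hits B and C.
signals = [
    ServiceSignal("A", alert_count=3, first_drift_s=0.0),
    ServiceSignal("B", alert_count=12, first_drift_s=8.0),
    ServiceSignal("C", alert_count=40, first_drift_s=15.0),
]

print("count-based root:", root_by_count(signals))
print("first-drift root:", root_by_first_drift(signals))
```

On this data the count-based ranking blames C, while the drift-based ranking points at A.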

4.2 Pattern detail: Split-brain (GitHub 2018)

GitHub 2018-10-21, 22:52 UTC. Routine optical equipment maintenance caused a 43-second loss of connectivity between the US East Coast network hub and the US East Coast data center. Orchestrator (the MySQL failover tool), running on the US West side, saw East as unreachable and triggered a failover, promoting a West replica to primary. When East reconnected 43 seconds later there were 2 primaries with divergent writes.

Recovery took 24h 11m because GitHub chose consistency over speed: it replayed binary logs from East to West rather than serving possibly stale data.

Source: github.blog/2018-10-30-oct21-post-incident-analysis
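Why reconciliation is slow becomes clearer with a toy model of two write logs that share a prefix and then diverge. This is only an illustration under assumed data, not GitHub's tooling; `split_logs` and the log entries are hypothetical.

```python
def split_logs(east: list[str], west: list[str]) -> tuple[list[str], list[str], list[str]]:
    """Return (shared prefix, writes only on east, writes only on west)."""
    shared = 0
    for e, w in zip(east, west):
        if e != w:
            break
        shared += 1
    return east[:shared], east[shared:], west[shared:]


# Hypothetical write logs. "e4" reached East just before the partition and was
# never replicated; "w4" and "w5" were accepted by West after it was promoted.
east_log = ["t1", "t2", "t3", "e4"]
west_log = ["t1", "t2", "t3", "w4", "w5"]

shared, east_only, west_only = split_logs(east_log, west_log)
print("shared:", shared)
print("east only:", east_only)
print("west only:", west_only)
```

Both `east_only` and `west_only` are non-empty, so neither side can simply be discarded without losing data; every unreplicated write has to be reconciled.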

4.3 Pattern detail: Catastrophic backtracking (Cloudflare 2019)

2019-07-02, 13:42 UTC. Cloudflare deployed a WAF rule globally in a single step. The rule contained this regex:

(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

On input such as xxxxx=xxxxxx, the regex engine backtracks through a rapidly growing number of ways to split the input between the .* groups → CPU at 100% on every edge machine worldwide. The outage lasted 27 minutes and traffic dropped 82%.

Counter: test the regex with a ReDoS (catastrophic backtracking) detector before deploy; roll out as a canary, 1% → 10% → 100%, instead of a global atomic push.

Source: blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019
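The 1% → 10% → 100% canary rollout can be sketched as a loop with a health gate between stages. This is a minimal sketch under stated assumptions: `staged_rollout`, the error-rate probe, and the 1% threshold are hypothetical, not Cloudflare's deployment system.

```python
from typing import Callable


def staged_rollout(
    stages: list[float],
    error_rate_at: Callable[[float], float],
    max_error_rate: float = 0.01,
) -> tuple[bool, float]:
    """Advance through traffic fractions and stop at the first unhealthy stage.

    Returns (completed, last_fraction_reached).
    """
    reached = 0.0
    for fraction in stages:
        reached = fraction
        if error_rate_at(fraction) > max_error_rate:
            return False, reached
    return True, reached


def bad_rule(fraction: float) -> float:
    """Hypothetical probe: the rule breaks half of the requests it touches."""
    return 0.5


def good_rule(fraction: float) -> float:
    """Hypothetical probe: the rule behaves normally."""
    return 0.001


stages = [0.01, 0.10, 1.00]
print(staged_rollout(stages, bad_rule))
print(staged_rollout(stages, good_rule))
```

With the bad rule the rollout halts at the first stage, so only 1% of traffic is exposed instead of every edge at once.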

4.4 Pattern detail: Monitoring loop (Roblox 2021)

2021-10-28 → 10-31, 73 hours. Roblox enabled the Consul streaming feature under production load. Streaming uses fewer Go channels than long-polling → contention under high concurrent read+write load → blocked writes. The BoltDB freelist algorithm is O(n²) at scale → write latency spikes.

Critical detail: the Roblox monitoring stack depended on Consul for service discovery. Consul slow → monitoring queries time out → on-call had no visibility for the first ~12 hours. Diagnosis took 60+ hours because 2 unrelated issues (streaming contention + BoltDB) overlapped.

Source: about.roblox.com/newsroom/2022/01/roblox-return-to-service-10-28-10-31-2021
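A monitoring dependency loop can be found ahead of time by walking the dependency graph from the monitoring stack and checking whether it reaches a service it is supposed to watch. The graph below is a hypothetical example loosely modeled on the Roblox case; `depends_on` is an illustrative helper.

```python
def depends_on(graph: dict[str, list[str]], start: str, target: str) -> bool:
    """True if `start` reaches `target` by following dependency edges (iterative DFS)."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return False


# Hypothetical dependency graph: each key depends on the services in its list.
deps = {
    "monitoring": ["service-discovery"],
    "service-discovery": ["consul"],
    "app": ["consul"],
    "consul": [],
}
monitored = ["app", "consul"]

# A monitored service that monitoring itself depends on is a blind spot.
blind_spots = [svc for svc in monitored if depends_on(deps, "monitoring", svc)]
print("blind spots:", blind_spots)
```

Here `consul` is flagged: if it degrades, the monitoring stack degrades with it and cannot report the failure.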


5. Outage catalog — pick 1 to reproduce

| # | Incident | Date | Duration | Failure mode | Reproduce difficulty | Postmortem URL |
|---|----------|------|----------|--------------|----------------------|----------------|
| 1 | AWS S3 us-east-1 | 2017-02-28 | ~4h | Operator typo, insufficient blast radius | Easy | https://aws.amazon.com/message/41926/ |
| 2 | GitHub MySQL split-brain | 2018-10-21 | 24h 11m | Network partition + Orchestrator failover | Hard | https://github.blog/2018-10-30-oct21-post-incident-analysis |
| 3 | Cloudflare WAF regex | 2019-07-02 | 27 min | Catastrophic backtracking + global deploy | Medium | https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019 |
| 4 | Roblox Consul + BoltDB | 2021-10-28 | 73h | Streaming contention + freelist + monitoring loop | Hard | https://about.roblox.com/newsroom/2022/01/roblox-return-to-service-10-28-10-31-2021 |
| 5 | Slack Jan 4 2021 | 2021-01-04 | ~6h | Provisioning system overload post-holiday | Medium | https://slack.engineering/slacks-outage-on-january-4th-2021/ |

danluu/post-mortems archive (community-curated): github.com/danluu/post-mortems


6. Reproduction patterns

Each outage has a minimal docker-compose setup that can trigger the same failure mode.

6.1 AWS S3 2017 reproduction (operator typo)

Simulation: a script meant to remove servers from “subsystem A” (billing) receives a mismatched input and also takes out the servers of “subsystem B” (index) and “subsystem C” (placement).

# docker-compose.yml
services:
  billing:
    image: alpine
    command: sleep infinity
  index:
    image: alpine
    command: sleep infinity
  placement:
    image: alpine
    command: sleep infinity

# bad-command.sh
# Intended: docker compose stop billing
# Typo: the service name is missing, so every service in the project is stopped.
docker compose stop

Result: running the script brings all 3 services down together. With index and placement down, the system can neither serve object metadata nor decide where objects are placed.

Postmortem to write: why can 1 command wipe out 3 services? Which guardrail was missing?
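One possible guardrail is a pre-flight check that rejects a removal request when it touches a subsystem outside the intended scope or would leave a subsystem below its minimum capacity. The sketch below is an assumption-laden illustration: `check_removal`, the server counts, and the minimums are hypothetical and not taken from AWS tooling.

```python
def check_removal(
    requested: dict[str, int],
    running: dict[str, int],
    minimum: dict[str, int],
    intended_subsystem: str,
) -> list[str]:
    """Return a list of violations; an empty list means the command may proceed."""
    violations = []
    for subsystem, count in requested.items():
        if subsystem != intended_subsystem:
            violations.append(f"{subsystem}: outside intended scope '{intended_subsystem}'")
        remaining = running.get(subsystem, 0) - count
        floor = minimum.get(subsystem, 0)
        if remaining < floor:
            violations.append(f"{subsystem}: {remaining} left, minimum is {floor}")
    return violations


# Hypothetical fleet and minimum capacity per subsystem.
running = {"billing": 4, "index": 3, "placement": 3}
minimum = {"billing": 1, "index": 2, "placement": 2}

typo_request = {"billing": 1, "index": 3, "placement": 3}
ok_request = {"billing": 1}

for problem in check_removal(typo_request, running, minimum, "billing"):
    print("BLOCKED:", problem)
print("ok request violations:", check_removal(ok_request, running, minimum, "billing"))
```

The mistyped request is blocked on four counts (two scope violations, two capacity violations), while the intended request passes.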

6.2 Cloudflare 2019 reproduction (regex CPU)

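import re
import time

# Simplified stand-in for the Cloudflare rule: a repeated group that contains .*
# followed by another .* forces the engine to backtrack heavily.
EVIL = r'(?:(?:\"|\d|.*)+(?:.*=.*))'
# Runtime grows very quickly with the number of trailing x characters;
# lower the repeat count if the run takes too long on your machine.
INPUT = "x=" + "x" * 30

t0 = time.perf_counter()
re.match(EVIL, INPUT)
print(f"matched in {time.perf_counter() - t0:.2f}s (a healthy regex finishes in < 0.01s)")
# Reported run: 8-15 seconds on commodity hardware, with the CPU pegged

Wrap this into an HTTP middleware and the server becomes unresponsive within seconds. Does the AIOps pipeline catch it?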

6.3 GitHub 2018 reproduction (simplified split-brain)

services:
  mysql-primary:
    image: mysql:8
    networks: [east]
  mysql-replica:
    image: mysql:8
    networks: [west]
  orchestrator:
    image: openark/orchestrator:latest
    networks: [east, west]

networks:
  east: {}
  west: {}

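# Trigger (Compose prefixes network and container names with the project name;
# check `docker network ls` and `docker ps` for the actual names):
# docker network disconnect east orchestrator   # keep disconnected for 43s
# (orchestrator, still connected to west, sees east as unreachable and promotes the replica)
# docker network connect east orchestrator
# Both DBs now think they're primary → write conflicts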

7. Architecture Decision Record (ADR)

7.1 Format Nygard (2011)

# ADR-NNN: <short title of decision>

## Status
Proposed | Accepted | Deprecated | Superseded by ADR-XXX

## Context
<situation prompting the decision: forces at play, constraints>

## Decision
<the change we're making>

## Alternatives considered
- Alternative A — pros, cons, why rejected
- Alternative B — pros, cons, why rejected

## Consequences
- Positive consequence 1
- Negative consequence 1 (trade-off accepted)
- Risk 1, mitigation

7.2 Example ADR for an AIOps platform

# ADR-007: Use topology-aware RCA over count-based ranking

## Status
Accepted

## Context
RCA has to pick the root service out of N services that are firing alerts.
Count-based ranking (service with the most alerts = root) fails on cascading
failure: a downstream service retries → fires more alerts than the upstream root.

## Decision
RCA combine 3 signals:
1. Topology distance from edge (upstream-bias)
2. First-drift time (causal lag analysis via Granger causality)
3. Alert volume (tiebreaker only)

## Alternatives considered
- Count-only ranking — simple, fast, BUT fails retry storm. Rejected.
- LLM-only RCA — flexible, BUT hallucinate confident-wrong root. Rejected as primary.
- Graph PageRank only — captures topology BUT not temporal causality. Rejected as standalone.

## Consequences
+ Catches cascading patterns missed by count-only (verified vs Roblox-style scenario)
+ Composable: each signal degrades gracefully if data missing
− Higher compute cost (Granger causality O(n × lag_window))
− Requires topology graph kept up-to-date — adds operational burden
- Risk: signal weights need tuning per environment, not auto.

8. Cost model: break-even for an AIOps platform

8.1 Cost side

| Component | Unit | Order of magnitude |
|-----------|------|--------------------|
| Metric ingestion | $/series/month | $0.0001 (Prometheus self-host) to $0.30 (Datadog) |
| Log ingestion | $/GB | $0.30 (S3 + Athena), $2.50 (Datadog), $5 (Splunk) |
| Trace ingestion | $/million spans | $0.50 (Tempo) to $2 (Datadog APM) |
| Model inference compute | $/hour | $0.05 (CPU c6i.large) to $3.06 (GPU g5.xlarge) |
| Storage hot/warm/cold | $/GB/month | $0.023 (S3), $0.10 (gp3), $0.30 (Prometheus local) |
| SRE/AIOps engineer | $/year | $120k–$250k loaded cost |
| On-call rotation overhead | hours/engineer/month | 40–80h ≈ $5k–$15k/month opportunity cost |
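The unit prices above only turn into a monthly bill once they are multiplied by volumes. The sketch below uses the low-end prices from the table together with entirely hypothetical volumes (series count, log GB, span count, and so on) to show how the line items add up.

```python
# Low-end unit prices taken from the table above.
UNIT_PRICE = {
    "metric_series": 0.0001,         # $/series/month, Prometheus self-host
    "log_gb": 0.30,                  # $/GB, S3 + Athena
    "trace_million_spans": 0.50,     # $/million spans, Tempo
    "inference_hours": 0.05,         # $/hour, CPU instance
    "storage_gb": 0.023,             # $/GB/month, S3
    "engineer_fte": 120_000 / 12,    # $/month for one engineer at $120k/year
}

# Hypothetical monthly volumes for a mid-sized stack (assumptions, not measurements).
volumes = {
    "metric_series": 500_000,
    "log_gb": 2_000,
    "trace_million_spans": 300,
    "inference_hours": 730,
    "storage_gb": 5_000,
    "engineer_fte": 0.5,
}


def monthly_cost(volumes: dict[str, float]) -> dict[str, float]:
    """Multiply each volume by its unit price and return cost per line item."""
    return {item: UNIT_PRICE[item] * qty for item, qty in volumes.items()}


costs = monthly_cost(volumes)
infra = sum(cost for item, cost in costs.items() if item != "engineer_fte")
total = sum(costs.values())
print(f"infrastructure: ${infra:,.2f}/month")
print(f"total incl. engineer time: ${total:,.2f}/month")
```

Under these assumptions half an engineer costs several times more than all infrastructure combined, which is why a cost model that omits engineer time underestimates badly.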

8.2 Value side

$$ \text{value\_per\_year} = \text{MTTR\_reduction\_hours} \times \text{incident\_per\_year} \times \text{downtime\_cost\_per\_hour} $$
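As a worked example under assumed inputs: with 5 incidents per month (60 per year), 2 hours per incident, a 40% MTTR reduction (0.8 hours saved per incident), and a downtime cost of $20k per hour, the formula gives:

$$ \text{value\_per\_year} = 0.8\ \text{h} \times 60 \times \$20{,}000/\text{h} = \$960{,}000 $$

That is $80,000 per month, which is the figure to compare against the monthly platform cost.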

Downtime cost per hour:

| Business type | Order of magnitude |
|---------------|--------------------|
| E-commerce mid-tier | $5k–$50k/hour |
| Large e-commerce (Amazon-scale) | $200k+/hour |
| Financial trading | $1M+/hour |
| Internal SaaS | $500–$5k/hour |
| Streaming (Netflix, etc.) | $50k–$500k/hour |

Source: ITIC 2024 Hourly Cost of Downtime Survey, Gartner 2014 (often-cited $5,600/min baseline).

8.3 Break-even formula

def is_worth_it(
    num_services: int,
    incidents_per_month: int,
    avg_incident_duration_hours: float,
    downtime_cost_per_hour: float,
    expected_mttr_reduction_pct: float = 0.4,
    aiops_monthly_cost: float = 15_000,
) -> dict:
    monthly_downtime_hours = incidents_per_month * avg_incident_duration_hours
    monthly_value = (
        monthly_downtime_hours
        * expected_mttr_reduction_pct
        * downtime_cost_per_hour
    )
    roi = monthly_value / aiops_monthly_cost
    payback_months = aiops_monthly_cost / monthly_value if monthly_value > 0 else float("inf")
    return {
        "monthly_value": monthly_value,
        "monthly_cost": aiops_monthly_cost,
        "roi": roi,
        "payback_months": payback_months,
        "verdict": "worth_it" if roi > 1.5 else "marginal" if roi > 1.0 else "not_worth_it",
    }

8.4 Break-even examples

| Scenario | Verdict | Reason |
|----------|---------|--------|
| 20 services, 2 incidents/mo × 1h, $10k/h, $15k AIOps | ROI 0.53 → not_worth_it | Too few incidents to justify |
| 100 services, 5 incidents/mo × 2h, $20k/h, $25k AIOps | ROI 3.2 → worth_it | Right size, real downtime cost |
| 500 services, 10 incidents/mo × 1.5h, $50k/h, $60k AIOps | ROI 5.0 → worth_it | Scale + cost both high |
| 10 services, 1 incident/mo × 30min, $1k/h, $10k AIOps | ROI 0.02 → not_worth_it | Hire a good SRE instead |
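Rearranging the ROI formula gives the downtime cost per hour at which the platform exactly pays for itself (ROI = 1.0). The helper below is a hypothetical companion to `is_worth_it`, reusing its parameter names; it is a sketch, not part of the required deliverable.

```python
def break_even_downtime_cost(
    incidents_per_month: int,
    avg_incident_duration_hours: float,
    aiops_monthly_cost: float,
    expected_mttr_reduction_pct: float = 0.4,
) -> float:
    """Downtime cost per hour at which monthly value equals monthly cost (ROI = 1.0)."""
    saved_hours = (
        incidents_per_month * avg_incident_duration_hours * expected_mttr_reduction_pct
    )
    if saved_hours <= 0:
        return float("inf")  # nothing is saved, so no downtime cost can justify it
    return aiops_monthly_cost / saved_hours


# First two scenarios from the table above.
small = break_even_downtime_cost(2, 1, 15_000)
medium = break_even_downtime_cost(5, 2, 25_000)
print(f"20-service scenario breaks even at ${small:,.0f}/hour")
print(f"100-service scenario breaks even at ${medium:,.0f}/hour")
```

The first scenario would need a downtime cost of about $18,750 per hour to break even, well above its $10k; the second breaks even at $6,250 per hour, well below its $20k.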

8.5 When NOT to do AIOps

  • < 30 services and < 3 incidents/month
  • Downtime cost < $1k/hour (internal tools, hobby)
  • Observability stack not yet mature (no SLO, no centralized logging) → AIOps has no clean signal to work with
  • Postmortem culture not yet established → AIOps surfaces signals but nobody acts on them

The right recipe in these cases: invest in good observability, SLOs, and on-call culture, NOT in AIOps.


9. Exercise

9.1 Provided setup

Download pack:

wget https://learning-notes-dz2.pages.dev/aiops-w3/lab/w3-d3-pack.zip
unzip w3-d3-pack.zip -d w3-d3-pack/
cd w3-d3-pack/

Pack structure:

outage_catalog.yaml           # 5 outages with ready-made reproduction templates
reproduction_templates/
├── aws_s3_2017/              # operator typo + over-broad command
├── github_mysql_2018/        # MySQL primary-replica + orchestrator
├── cloudflare_regex_2019/    # FastAPI app with evil regex middleware
├── roblox_consul_2021/       # Consul + internal dependency loop demo
└── slack_2022/               # provisioning queue overload demo
pipeline/                     # AIOps pipeline (W1+W2 wired, endpoints ready)
templates/
├── postmortem_template.md    # field-by-field template
├── adr_template.md           # Nygard format
└── spec_template.md          # SPEC.md outline
scripts/
├── start_reproduction.sh     # spin up chosen reproduction stack
├── inject.sh                 # trigger failure mode
└── capture_timeline.py       # record event timeline with UTC timestamps

9.2 Step 1: Pick an outage

Choose 1 outage from outage_catalog.yaml (the 5 options in §5).

In SUBMIT.md Section 0, write:

## Outage chosen
- ID: <1-5>
- Name: <e.g. AWS S3 2017-02-28>
- Why this one: <2-3 sentences on why this pattern interests you>
- Failure mode: <pick from §4: cascading | split-brain | regex | capacity | monitoring-loop | operator>

9.3 Step 2: Reproduce

cd reproduction_templates/<chosen_outage>/
bash ../../scripts/start_reproduction.sh
# wait for the healthcheck to pass
bash ../../scripts/inject.sh
# capture the timeline for 600 seconds
python ../../scripts/capture_timeline.py --duration 600 --out timeline.json

timeline.json contains the events captured from Prometheus, container events, and pipeline output, each with a UTC timestamp.

9.4 Step 3: Run the AIOps pipeline on the reproduction

The pipeline is already running in the background (port 8000). Query it:

curl "http://localhost:8000/alerts?since=<inject_start_ts>" > alerts_observed.json
curl -X POST http://localhost:8000/rca \
     -H "Content-Type: application/json" \
     -d '{"window_start": <ts>, "window_end": <ts+600>}' \
     > rca_observed.json

Compare with the expected behavior (per the original postmortem):

  • Did the pipeline detect the incident within N seconds? (target < 30s)
  • Did the pipeline pick the correct root service?
  • Is there any pattern the pipeline missed entirely?

Note at least 2 specific gaps in the “Detection” section of postmortem.md.
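The comparison can be scripted once the two JSON files are loaded. The sketch below assumes a schema that the pack may not actually use: alerts as a list of objects with a numeric `ts` (epoch seconds) and a `service`, and the RCA result as an object with a `root_service` key. Adjust the field names to match the real output.

```python
def evaluate_detection(
    alerts: list[dict],
    rca: dict,
    inject_start_ts: float,
    expected_root: str,
    target_seconds: float = 30.0,
) -> dict:
    """Compare pipeline output with the expected outcome of the reproduction."""
    # Only alerts fired at or after the injection count as detections.
    after_inject = [a["ts"] for a in alerts if a["ts"] >= inject_start_ts]
    latency = min(after_inject) - inject_start_ts if after_inject else None
    return {
        "detection_latency_s": latency,
        "detected_in_time": latency is not None and latency < target_seconds,
        "root_correct": rca.get("root_service") == expected_root,
    }


# Hypothetical pipeline output (assumed schema, see the note above).
alerts = [
    {"ts": 1042.0, "service": "C"},
    {"ts": 1050.0, "service": "B"},
]
rca = {"root_service": "C"}

report = evaluate_detection(alerts, rca, inject_start_ts=1000.0, expected_root="A")
print(report)
```

In this made-up run the first alert arrives 42 seconds after injection and the RCA blames the wrong service, which would be two concrete gaps to record.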

9.5 Step 4: Write postmortem.md

Follow the template in §2. Every field must be filled in; none may be left blank. The timeline must have at least 8 events with UTC timestamps (taken from timeline.json).

Required wording check: 0 instances of “<person> did X”; only blameless wording (§2.1) is accepted.
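A rough automated pass can catch the most obvious blame wording before review. The pattern below is a heuristic with a hypothetical subject and verb list; it will produce false positives and miss subtler phrasing, so it complements a human read and does not replace it.

```python
import re

# Heuristic: a capitalized name or a role word directly followed by a blame-style verb.
BLAME_PATTERN = re.compile(
    r"\b(?:[A-Z][a-z]+|[Oo]n-call|[Ee]ngineer|[Oo]perator)\s+"
    r"(?:pushed|forgot|failed|was slow|did|deleted|broke|missed)\b"
)


def blame_lines(text: str) -> list[str]:
    """Return the lines of `text` that match the blame heuristic."""
    return [line for line in text.splitlines() if BLAME_PATTERN.search(line)]


blamey = """Alice pushed bad config
On-call was slow to respond
Engineer forgot to test"""

blameless = """Config push pipeline allowed invalid YAML through
Alert routing didn't reach on-call's primary device
Pre-merge test suite didn't cover this scenario"""

print("flagged in blame-y sample:", len(blame_lines(blamey)))
print("flagged in blameless sample:", len(blame_lines(blameless)))
```

Run against the examples from §2.1, the heuristic flags all three blame-y sentences and none of the blameless rewrites.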

9.6 Step 5: Write ADR.md

Write 1 ADR for 1 design decision of the AIOps platform, following the Nygard template in §7.1. The decision must:

  • Have at least 2 alternatives, each with pros and cons
  • Have at least 2 consequences (1 positive, 1 trade-off)
  • Reference a gap observed in §9.4

Example ADR topics that fit:

  • RCA: count-based vs topology-aware vs causal-lag, and which to pick
  • Alert routing: page-everyone vs tier-based on-call rotation
  • Detector: single threshold vs ensemble (3σ + IF + LSTM-AE)
  • Storage: hot Prometheus for 2 weeks vs S3+Athena cold long-term
  • LLM: GPT-style cloud API vs self-hosted Llama vs no LLM

9.7 Step 6: Write cost_model.py

Implement the function with exactly this signature:

def is_worth_it(
    num_services: int,
    incidents_per_month: int,
    avg_incident_duration_hours: float,
    downtime_cost_per_hour: float,
    expected_mttr_reduction_pct: float = 0.4,
    aiops_monthly_cost: float = 15_000,
) -> dict:
    """
    Returns:
      {
        "monthly_value": float,
        "monthly_cost": float,
        "roi": float,
        "payback_months": float,  # or float('inf')
        "verdict": "worth_it" | "marginal" | "not_worth_it"
      }
    Verdict rule:
      roi > 1.5 → worth_it
      1.0 < roi ≤ 1.5 → marginal
      roi ≤ 1.0 → not_worth_it
    """

Plus 3 worked example scenarios in the same file (call the function and print the result):

if __name__ == "__main__":
    print(is_worth_it(num_services=20, incidents_per_month=2,
                      avg_incident_duration_hours=1, downtime_cost_per_hour=10_000,
                      aiops_monthly_cost=15_000))
    print(is_worth_it(num_services=100, incidents_per_month=5,
                      avg_incident_duration_hours=2, downtime_cost_per_hour=20_000,
                      aiops_monthly_cost=25_000))
    # 1 scenario of your own: pick an industry and defend your downtime cost choice in a comment

9.8 Step 7: Write SPEC.md (consolidates W3)

Outline:

# AIOps Mini-Platform Spec — <your name>

## 1. Platform overview
[2-3 sentences: the stack being monitored, scope, users of the platform]

## 2. SLO definition (from W3-D1)
[paste/reference slo_spec.yaml — 3 service × SLI+SLO+budget]

## 3. Detection + Correlation + RCA stack (from W1+W2)
[1 paragraph per layer: high-level approach + ADR reference]

## 4. Reliability validation (from W3-D2)
[paste chaos_report.md scoreboard + top 3 gaps]

## 5. Operational pattern (from W3-D3)
[reproduced outage + key learning + ADR-001 reference]

## 6. Cost model (from W3-D3)
[paste cost_model.py output for the current stack + break-even point]

## 7. Open risks
[3-5 known gaps not yet fixed, each with severity + mitigation plan]

9.9 Step 8: SUBMIT.md

# W3-D3 Submission — <your name>

## Outage chosen
[section §9.2]

## 3 things I learned from this outage
1. ...
2. ...
3. ...

## 1 thing my pipeline would still miss if this outage happened for real
- Pattern: ...
- Why miss: ...
- Mitigation idea: ...

## 1 decision in the ADR that I am not fully sure about
...

## Cost model verdict for my stack
- ROI: __
- Payback: __ months
- Verdict: __

9.10 Acceptance checklist

  • The reproduction runs, and inject.sh triggers an observable failure mode
  • timeline.json has ≥ 8 events with UTC timestamps
  • postmortem.md has every field from the §2 template, 0 blame wording, a timeline with ≥ 8 events, and ≥ 2 detection gaps noted
  • ADR.md has ≥ 2 alternatives, each with pros and cons, ≥ 2 consequences, and references 1 gap from §9.4
  • cost_model.py parses, is_worth_it() returns the correct schema, and there are 3 worked examples
  • SPEC.md has all 7 sections
  • SUBMIT.md has all 5 sections

10. Deliverable summary

| File | Description | Spec |
|------|-------------|------|
| reproduction/ | Outage reproduction stack (docker-compose + inject + timeline) | §9.3 |
| timeline.json | Captured events with UTC timestamps | §9.3 |
| alerts_observed.json + rca_observed.json | Pipeline output on the reproduction | §9.4 |
| postmortem.md | Blameless postmortem following the Google SRE template | §9.5, §2 |
| ADR.md | 1 architecture decision record for the AIOps platform | §9.6, §7.1 |
| cost_model.py | is_worth_it() + 3 scenarios | §9.7 |
| SPEC.md | Master spec consolidating the W3 deliverables | §9.8 |
| SUBMIT.md | 5-section reflection | §9.9 |

Submission path: aiops-<name>/w3/d3/.


11. Anti-patterns

| Anti-pattern | Consequence |
|--------------|-------------|
| Postmortem blames individuals | Blame culture → bugs stay hidden longer → worse outages |
| Copying the original postmortem instead of writing from the reproduction | No learning; the document is fiction |
| ADR without Alternatives | Not a decision record, just an announcement |
| Cost model ignores engineer time | Underestimates cost by 3-5× |
| Reproducing the outage 1:1 with prod | Burns money and causes scope creep. A minimal environment is enough to trigger the pattern |
| Action item without owner + due date | The postmortem becomes an archive file that nobody acts on |
| 5 Whys when the outage has > 1 cause | Misses contributing factors → incomplete fix |

12. References

| Source | Topic | URL |
|--------|-------|-----|
| Beyer et al. SRE Book Ch 15 | Postmortem Culture (canonical Google framework) | https://sre.google/sre-book/postmortem-culture/ |
| Michael Nygard (2011) | ADR format | https://www.cognitect.com/blog/2011/11/15/documenting-architecture-decisions |
| Joel Parker Henderson | ADR templates collection | https://github.com/joelparkerhenderson/architecture-decision-record |
| danluu/post-mortems | Curated archive of public postmortems | https://github.com/danluu/post-mortems |
| GitHub Engineering | Oct 2018 MySQL split-brain analysis | https://github.blog/2018-10-30-oct21-post-incident-analysis |
| Cloudflare | July 2019 regex outage detail | https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019 |
| Roblox | Oct 2021 73-hour Consul outage | https://about.roblox.com/newsroom/2022/01/roblox-return-to-service-10-28-10-31-2021 |
| AWS | Feb 2017 S3 us-east-1 disruption | https://aws.amazon.com/message/41926/ |
| Slack Engineering | Jan 2021 outage retrospective | https://slack.engineering/slacks-outage-on-january-4th-2021/ |
| Allspaw & Robbins | Web Operations, O’Reilly 2010 (causal tree pattern) | https://www.oreilly.com/library/view/web-operations/9781449377465/ |
| Etsy | Blameless postmortem culture | https://www.etsy.com/codeascraft/blameless-postmortems |
| ITIC 2024 | Hourly Cost of Downtime Survey | https://itic-corp.com/ |
| Atlassian | Incident response handbook | https://www.atlassian.com/incident-management |