# Incident Response

## Response Metrics

- **MTTD** (Mean Time to Detect): Target < 5 minutes
- **MTTA** (Mean Time to Acknowledge): Target < 5 minutes
- **MTTR** (Mean Time to Resolve): Target < 30 minutes
- **MTBF** (Mean Time Between Failures): Maximize

### Severity Levels

| Level | Impact | Response | Example |
|-------|--------|----------|---------|
| SEV1 | Complete outage | Immediate | Database down, payment failed |
| SEV2 | Major degradation | 15 min | API latency >5s, 50% errors |
| SEV3 | Minor degradation | 1 hour | Non-critical feature broken |
| SEV4 | Low impact | Business hours | UI glitch, logging issues |

## Runbook Template

```markdown
# Runbook: High API Error Rate

## Symptoms
- Alert: `api_error_rate > 0.05`
- Dashboard: https://grafana.example.com/d/api-errors

## Impact
Users cannot complete purchases (~$X per minute)

## Triage
1. Check dashboard for affected endpoints
2. Check recent deployments: `kubectl rollout history deployment/api`
3. Check dependencies: database, redis, external APIs

## Resolution

### Option 1: Rollback
kubectl rollout undo deployment/api -n production

### Option 2: Scale Up
kubectl scale deployment/api --replicas=10 -n production

### Option 3: Fix Config
kubectl set env deployment/api DB_POOL_SIZE=50 -n production

## Verification
- [ ] Error rate <1%
- [ ] P95 latency <500ms
- [ ] Health checks passing

## Communication
- Update status page
- Notify #incidents
- Post if user-facing
```

## Auto-Remediation Script

```python
#!/usr/bin/env python3
import kubernetes, prometheus_api_client

class IncidentRemediator:
    def check_high_error_rate(self):
        query = 'rate(http_requests_total{status=~"5.."}[5m]) > 0.05'
        result = self.prometheus.custom_query(query)
        return len(result) > 0

    def rollback_deployment(self, namespace, deployment):
        body = {'spec': {'rollbackTo': {'revision': 0}}}
        self.k8s.patch_namespaced_deployment(deployment, namespace, body)

    def remediate(self):
        if self.check_high_error_rate():
            if self.rollback_deployment('production', 'api'):
                time.sleep(120)
                if not self.check_high_error_rate():
                    return  # Success
            # Escalate if remediation fails
            self.create_incident("Auto-remediation failed")
```

## Postmortem Template

```markdown
# Postmortem: API Outage - 2024-01-15

**Date**: January 15, 2024
**Duration**: 45 minutes (14:23 - 15:08 UTC)
**Severity**: SEV1
**Impact**: Complete API outage, ~$25K revenue loss

## Summary
API became unresponsive due to database connection pool exhaustion
from slow query in v2.3.1.

## Timeline (UTC)
- 14:23 - Alert fired
- 14:27 - Incident declared SEV1
- 14:30 - Rollback initiated
- 14:45 - Identified slow query
- 14:50 - Killed queries
- 15:08 - Resolved

## Root Cause
New query missing index on `user_id`, causing full table scans that
exhausted connection pool under load.

## Impact
- 100% API failure for 45 minutes
- 15,000 users affected
- $25K revenue loss
- 200+ support tickets

## Action Items
| Action | Owner | Deadline |
|--------|-------|----------|
| Add index on user_id | DB team | 2024-01-16 |
| Add query perf testing | Platform | 2024-01-22 |
| Increase staging DB size | Infra | 2024-01-30 |

## Lessons Learned
- Performance testing must use production-scale data
- Connection pool exhaustion needs active intervention
- Consider circuit breakers for DB operations
```

## PagerDuty Configuration

```yaml
schedules:
  - name: Primary On-Call
    time_zone: America/New_York
    layers:
      - rotation_turn_length_seconds: 604800  # 1 week
        users: [PXXXXXX, PXXXXXX, PXXXXXX]

escalation_policies:
  - name: Production
    rules:
      - escalation_delay_in_minutes: 0
        targets: [{type: schedule, id: primary}]
      - escalation_delay_in_minutes: 15
        targets: [{type: schedule, id: secondary}]
      - escalation_delay_in_minutes: 30
        targets: [{type: user, id: manager}]
```

## Chaos Engineering

```yaml
# chaos-mesh: Pod failure test
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-test
spec:
  action: pod-failure
  mode: one
  duration: "30s"
  selector:
    namespaces: [production]
    labelSelectors:
      app: api
  scheduler:
    cron: "@every 2h"
```

```bash
#!/bin/bash
# Game Day: Database failover drill

echo "🎮 Game Day: Database failover"
slack-cli -d incidents "Starting failover drill"

# Simulate failure
kubectl delete pod postgres-0 -n production

# Monitor recovery
start=$(date +%s)
while ! kubectl get pod postgres-1 | grep Running; do
  sleep 5
done
duration=$(($(date +%s) - start))

echo "Failover: ${duration}s" >> results.md
curl -f https://api.example.com/health || echo "FAIL"
```

## Evidence Collection & Forensics

```bash
#!/bin/bash
# collect-evidence.sh - Preserve incident evidence

INCIDENT_ID=$1
EVIDENCE_DIR="incidents/${INCIDENT_ID}/evidence"
mkdir -p $EVIDENCE_DIR

# Preserve logs
kubectl logs -l app=api --all-containers --timestamps \
  --since=2h > $EVIDENCE_DIR/pod-logs.txt

# Capture current state
kubectl get all -n production -o yaml > $EVIDENCE_DIR/k8s-state.yaml
kubectl describe pods -n production > $EVIDENCE_DIR/pod-details.txt

# Network traces
kubectl exec -n production deploy/api -- \
  tcpdump -i any -w /tmp/capture.pcap -G 60 -W 5 &

# Memory/CPU snapshot
kubectl top pods -n production > $EVIDENCE_DIR/resource-usage.txt

# Git commit at time of incident
git log --since="2 hours ago" --oneline > $EVIDENCE_DIR/recent-commits.txt

# Database queries
psql -c "SELECT * FROM pg_stat_activity" > $EVIDENCE_DIR/db-activity.txt

# Create timeline
echo "$(date): Evidence collection completed" >> $EVIDENCE_DIR/timeline.txt
```

## Communication Templates

```markdown
## SEV1 Initial Notification

**INCIDENT ALERT - SEV1**

**Status**: Investigating
**Impact**: Payment API unavailable (100% error rate)
**Started**: 2024-01-15 14:23 UTC
**Affected**: All users (~15K active sessions)
**Lead**: @oncall-engineer
**War Room**: https://zoom.us/incident-123

Updates every 15 minutes or on major change.

---

## SEV1 Resolution Notification

**INCIDENT RESOLVED**

**Summary**: Payment API restored after database connection pool exhaustion
**Duration**: 45 minutes (14:23 - 15:08 UTC)
**Resolution**: Rollback to v2.3.0 + query optimization
**Impact**: 15K users, ~$25K revenue loss
**Next Steps**: Postmortem scheduled for Jan 16 10am

Thanks to @oncall-team for rapid response.
```

## Incident Classification

| Type | Examples | Response Team | Escalation |
|------|----------|---------------|------------|
| **Security** | Breach, data leak, unauthorized access | Security + DevOps | CISO, Legal |
| **Service** | Outage, degradation, errors | DevOps + SRE | Engineering VP |
| **Data** | Corruption, loss, sync issues | DBA + DevOps | CTO |
| **Compliance** | GDPR, SOC2, audit violations | Compliance + Legal | CEO |
| **Third-party** | Provider outage, API failures | DevOps + Product | Account manager |

## Security Incident Specifics

```bash
# Compromise investigation checklist
□ Isolate affected systems
□ Preserve logs and memory dumps
□ Identify attack vector
□ Check for lateral movement
□ Scan for malware/backdoors
□ Review access logs for 30 days
□ Identify data accessed
□ Assess exfiltration risk
□ Check for persistence mechanisms
□ Coordinate with security team
□ Notify legal if PII involved
□ Document chain of custody
```

## Compliance Requirements

```yaml
# Incident notification requirements
gdpr:
  notification_deadline: 72h
  authority: Data Protection Officer
  required_info:
    - Nature of breach
    - Data categories affected
    - Number of individuals
    - Consequences
    - Remediation measures

sox:
  notification_deadline: immediate
  authority: Audit Committee
  documentation:
    - Financial impact
    - Control failures
    - Remediation plan

pci_dss:
  notification_deadline: 24h
  authority: Card brands + acquirer
  required_info:
    - Cardholder data affected
    - Incident timeline
    - Forensic investigation
```

## Best Practices

- Maintain runbooks for critical services
- Practice with game days monthly
- Automate common remediation
- Keep postmortems blameless
- Track incident metrics
- Test recovery procedures
- Document all incidents
- Improve detection continuously
- Preserve evidence chain properly
- Coordinate communication clearly
- Escalate security incidents immediately
- Understand compliance obligations
- Train team on response procedures
- Review and update playbooks quarterly