---
name: systematic-debugging
description: Use when encountering any bug, test failure, or unexpected behavior. 4-phase root cause investigation process - NO fixes without understanding the problem first.
version: 1.0.0
author: Hermes Agent (adapted from Superpowers)
license: MIT
metadata:
  hermes:
    tags: [debugging, troubleshooting, problem-solving, root-cause, investigation]
    related_skills: [test-driven-development, writing-plans, subagent-driven-development]
---

# Systematic Debugging

## Overview

Random fixes waste time and create new bugs. Quick patches mask underlying issues.

**Core principle:** ALWAYS find root cause before attempting fixes. Symptom fixes are failure.

**Violating the letter of this process is violating the spirit of debugging.**

## The Iron Law

```
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
```

If you haven't completed Phase 1, you cannot propose fixes.

## When to Use

Use for ANY technical issue:
- Test failures
- Bugs in production
- Unexpected behavior
- Performance problems
- Build failures
- Integration issues

**Use this ESPECIALLY when:**
- Under time pressure (emergencies make guessing tempting)
- "Just one quick fix" seems obvious
- You've already tried multiple fixes
- Previous fix didn't work
- You don't fully understand the issue

**Don't skip when:**
- Issue seems simple (simple bugs have root causes too)
- You're in a hurry (rushing guarantees rework)
- Manager wants it fixed NOW (systematic is faster than thrashing)

## The Four Phases

You MUST complete each phase before proceeding to the next.

---

## Phase 1: Root Cause Investigation

**BEFORE attempting ANY fix:**

### 1. Read Error Messages Carefully

- Don't skip past errors or warnings
- They often contain the exact solution
- Read stack traces completely
- Note line numbers, file paths, error codes

**Action:** Copy full error message to your notes.

### 2. Reproduce Consistently

- Can you trigger it reliably?
- What are the exact steps?
- Does it happen every time?
- If not reproducible → gather more data, don't guess

**Action:** Write down exact reproduction steps.

### 3. Check Recent Changes

- What changed that could cause this?
- Git diff, recent commits
- New dependencies, config changes
- Environmental differences

**Commands:**
```bash
# Recent commits
git log --oneline -10

# Uncommitted changes
git diff

# Changes in specific file
git log -p --follow src/problematic_file.py
```

### 4. Gather Evidence in Multi-Component Systems

**WHEN system has multiple components (CI pipeline, API service, database layer):**

**BEFORE proposing fixes, add diagnostic instrumentation:**

For EACH component boundary:
- Log what data enters component
- Log what data exits component
- Verify environment/config propagation
- Check state at each layer

Run once to gather evidence showing WHERE it breaks
THEN analyze evidence to identify failing component
THEN investigate that specific component

**Example (multi-layer system):**
```python
# Layer 1: Entry point
def entry_point(input_data):
    print(f"DEBUG: Input received: {input_data}")
    result = process_layer1(input_data)
    print(f"DEBUG: Layer 1 output: {result}")
    return result

# Layer 2: Processing
def process_layer1(data):
    print(f"DEBUG: Layer 1 received: {data}")
    # ... processing ...
    print(f"DEBUG: Layer 1 returning: {result}")
    return result
```

**Action:** Add logging, run once, analyze output.

### 5. Isolate the Problem

- Comment out code until problem disappears
- Binary search through recent changes
- Create minimal reproduction case
- Test with fresh environment

**Action:** Create minimal reproduction case.

### Phase 1 Completion Checklist

- [ ] Error messages fully read and understood
- [ ] Issue reproduced consistently
- [ ] Recent changes identified and reviewed
- [ ] Evidence gathered (logs, state)
- [ ] Problem isolated to specific component/code
- [ ] Root cause hypothesis formed

**STOP:** Do not proceed to Phase 2 until you understand WHY it's happening.

---

## Phase 2: Solution Design

**Given the root cause, design the fix:**

### 1. Understand the Fix Area

- Read relevant code thoroughly
- Understand data flow
- Identify affected components
- Check for similar issues elsewhere

**Action:** Read all relevant code files.

### 2. Design Minimal Fix

- Smallest change that fixes root cause
- Avoid scope creep
- Don't refactor while fixing
- Fix one issue at a time

**Action:** Write down the exact fix before coding.

### 3. Consider Side Effects

- What else could this change affect?
- Are there dependencies?
- Will this break other functionality?

**Action:** Identify potential side effects.

### Phase 2 Completion Checklist

- [ ] Fix area code fully understood
- [ ] Minimal fix designed
- [ ] Side effects identified
- [ ] Fix approach documented

---

## Phase 3: Implementation

**Make the fix:**

### 1. Write Test First (if possible)

```python
def test_should_handle_empty_input():
    """Regression test for bug #123"""
    result = process_data("")
    assert result == expected_empty_result
```

### 2. Implement Fix

```python
# Before (buggy)
def process_data(data):
    return data.split(",")[0]

# After (fixed)
def process_data(data):
    if not data:
        return ""
    return data.split(",")[0]
```

### 3. Verify Fix

```bash
# Run the specific test
pytest tests/test_data.py::test_should_handle_empty_input -v

# Run all tests to check for regressions
pytest
```

### Phase 3 Completion Checklist

- [ ] Test written that reproduces the bug
- [ ] Minimal fix implemented
- [ ] Test passes
- [ ] No regressions introduced

---

## Phase 4: Verification

**Confirm it's actually fixed:**

### 1. Reproduce Original Issue

- Follow original reproduction steps
- Verify issue is resolved
- Test edge cases

### 2. Regression Testing

```bash
# Full test suite
pytest

# Integration tests
pytest tests/integration/

# Check related areas
pytest -k "related_feature"
```

### 3. Monitor After Deploy

- Watch logs for related errors
- Check metrics
- Verify fix in production

### Phase 4 Completion Checklist

- [ ] Original issue cannot be reproduced
- [ ] All tests pass
- [ ] No new warnings/errors
- [ ] Fix documented (commit message, comments)

---

## Debugging Techniques

### Root Cause Tracing

Ask "why" 5 times:
1. Why did it fail? → Null pointer
2. Why was it null? → Function returned null
3. Why did function return null? → Missing validation
4. Why was validation missing? → Assumed input always valid
5. Why was that assumption wrong? → API changed

**Root cause:** API change not accounted for

### Defense in Depth

Don't fix just the symptom:

**Bad:** Add null check at crash site
**Good:** 
1. Add validation at API boundary
2. Add null check at crash site
3. Add test for both
4. Document API behavior

### Condition-Based Waiting

For timing/race conditions:

```python
# Bad - arbitrary sleep
import time
time.sleep(5)  # "Should be enough"

# Good - wait for condition
from tenacity import retry, wait_exponential, stop_after_attempt

@retry(wait=wait_exponential(multiplier=1, min=4, max=10),
       stop=stop_after_attempt(5))
def wait_for_service():
    response = requests.get(health_url)
    assert response.status_code == 200
```

---

## Common Debugging Pitfalls

### Fix Without Understanding

**Symptom:** "Just add a try/catch"
**Problem:** Masks the real issue
**Solution:** Complete Phase 1 before any fix

### Shotgun Debugging

**Symptom:** Change 5 things at once
**Problem:** Don't know what fixed it
**Solution:** One change at a time, verify each

### Premature Optimization

**Symptom:** Rewrite while debugging
**Problem:** Introduces new bugs
**Solution:** Fix first, refactor later

### Assuming Environment

**Symptom:** "Works on my machine"
**Problem:** Environment differences
**Solution:** Check environment variables, versions, configs

---

## Language-Specific Tools

### Python

```python
# Add debugger
import pdb; pdb.set_trace()

# Or use ipdb for better experience
import ipdb; ipdb.set_trace()

# Log state
import logging
logging.debug(f"Variable state: {variable}")

# Stack trace
import traceback
traceback.print_exc()
```

### JavaScript/TypeScript

```javascript
// Debugger
debugger;

// Console with context
console.log("State:", { var1, var2, var3 });

// Stack trace
console.trace("Here");

// Error with context
throw new Error(`Failed with input: ${JSON.stringify(input)}`);
```

### Go

```go
// Print state
fmt.Printf("Debug: variable=%+v\n", variable)

// Stack trace
import "runtime/debug"
debug.PrintStack()

// Panic with context
if err != nil {
    panic(fmt.Sprintf("unexpected error: %v", err))
}
```

---

## Integration with Other Skills

### With test-driven-development

When debugging:
1. Write test that reproduces bug
2. Debug systematically
3. Fix root cause
4. Test passes

### With writing-plans

Include debugging tasks in plans:
- "Add diagnostic logging"
- "Create reproduction test"
- "Verify fix resolves issue"

### With subagent-driven-development

If subagent gets stuck:
1. Switch to systematic debugging
2. Analyze root cause
3. Provide findings to subagent
4. Resume implementation

---

## Remember

```
PHASE 1: Investigate → Understand WHY
PHASE 2: Design → Plan the fix
PHASE 3: Implement → Make the fix
PHASE 4: Verify → Confirm it's fixed
```

**No shortcuts. No guessing. Systematic always wins.**