# Wayback Machine — Internet Archive CDX

## 1. Summary

The Internet Archive's Wayback Machine has captured ~900B+ web pages since
1996. The CDX server API indexes those captures by URL, timestamp, and
content hash. Free, anonymous, no auth.

## 2. Access Methods

- **CDX server:** `https://web.archive.org/cdx/search/cdx`
- **Wayback URL:** `https://web.archive.org/web/<timestamp>/<url>`
- **Save Page Now (write):** `https://web.archive.org/save/<url>` (different API)
- **Auth:** None
- **Rate limit:** Generous; be polite (~1 req/s)

## 3. Data Schema

Key fields emitted by `fetch_wayback.py`:

| Column | Type | Description |
|--------|------|-------------|
| `url` | str | Original URL captured |
| `timestamp` | str | YYYYMMDDHHMMSS (CDX format) |
| `wayback_url` | str | Direct replay URL |
| `mimetype` | str | Content-type at capture |
| `status` | str | HTTP status (typically 200) |
| `digest` | str | SHA1 of capture content (collapse-friendly) |
| `length` | str | Byte length of capture |

## 4. Coverage

- 1996 → present
- ~900B+ captures across ~700M domains
- Updated continuously by automated crawls + manual saves
- Some domains have aggressive coverage (news), others sparse (private)

## 5. Cross-Reference Potential

- **Wikipedia** ↔ Reverse-lookup pages cited as references that have since
  disappeared
- **News URLs** ↔ Original article content when present-day URLs 404
- **Corporate websites** ↔ Historical "About" pages, executive bios that
  have been scrubbed

The Wayback CDX is most useful as a **content-recovery** layer when other
sources point to URLs that no longer exist.

## 6. Data Quality

- robots.txt-blocked domains may have spotty or no coverage
- Captures vary in completeness (HTML may be saved without CSS/JS)
- Some content is excluded by domain owner request (DMCA, etc.)
- Coverage of "deep links" (URLs with query strings) is uneven
- Time resolution is per-capture, not continuous — gaps are common

## 7. Acquisition Script

Path: `scripts/fetch_wayback.py`

```bash
# All captures of a specific URL
python3 SKILL_DIR/scripts/fetch_wayback.py --url "https://example.com/page" \
    --out data/wb.csv

# All captures of a host
python3 SKILL_DIR/scripts/fetch_wayback.py --url "example.com" \
    --match host --out data/wb.csv

# All captures of a domain + subdomains
python3 SKILL_DIR/scripts/fetch_wayback.py --url "example.com" \
    --match domain --out data/wb.csv

# Only unique-content captures within a date window
python3 SKILL_DIR/scripts/fetch_wayback.py --url "example.com" \
    --match host --collapse digest \
    --from-date 2020-01-01 --to-date 2023-12-31 \
    --out data/wb.csv
```

## 8. Legal & Licensing

- Internet Archive captures are made under fair-use research provisions
- Replay URLs are stable references — citing them is encouraged
- Internet Archive non-profit terms of use govern content
- Some content is rights-restricted; replay may be blocked even if the
  CDX entry shows it as captured

## 9. References

- CDX server docs: https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md
- Wayback API: https://archive.org/help/wayback_api.php
- Internet Archive: https://archive.org/