# Wayback Machine — Internet Archive CDX

## 1. Summary

The Internet Archive's Wayback Machine has captured ~900B+ web pages since 1996. The CDX server API indexes those captures by URL, timestamp, and content hash. Free, anonymous, no auth.

## 2. Access Methods

- **CDX server:** `https://web.archive.org/cdx/search/cdx`
- **Wayback URL:** `https://web.archive.org/web//`
- **Save Page Now (write):** `https://web.archive.org/save/` (different API)
- **Auth:** None
- **Rate limit:** Generous; be polite (~1 req/s)

## 3. Data Schema

Key fields emitted by `fetch_wayback.py`:

| Column | Type | Description |
|--------|------|-------------|
| `url` | str | Original URL captured |
| `timestamp` | str | YYYYMMDDHHMMSS (CDX format) |
| `wayback_url` | str | Direct replay URL |
| `mimetype` | str | Content-type at capture |
| `status` | str | HTTP status (typically 200) |
| `digest` | str | SHA1 of capture content (collapse-friendly) |
| `length` | str | Byte length of capture |

## 4. Coverage

- 1996 → present
- ~900B+ captures across ~700M domains
- Updated continuously by automated crawls + manual saves
- Some domains have aggressive coverage (news), others sparse (private)

## 5. Cross-Reference Potential

- **Wikipedia** ↔ Reverse-lookup pages cited as references that have since disappeared
- **News URLs** ↔ Original article content when present-day URLs 404
- **Corporate websites** ↔ Historical "About" pages, executive bios that have been scrubbed

The Wayback CDX is most useful as a **content-recovery** layer when other sources point to URLs that no longer exist.

## 6. Data Quality

- robots.txt-blocked domains may have spotty or no coverage
- Captures vary in completeness (HTML may be saved without CSS/JS)
- Some content is excluded by domain owner request (DMCA, etc.)
- Coverage of "deep links" (URLs with query strings) is uneven
- Time resolution is per-capture, not continuous — gaps are common

## 7. Acquisition Script

Path: `scripts/fetch_wayback.py`

```bash
# All captures of a specific URL
python3 SKILL_DIR/scripts/fetch_wayback.py --url "https://example.com/page" \
    --out data/wb.csv

# All captures of a host
python3 SKILL_DIR/scripts/fetch_wayback.py --url "example.com" \
    --match host --out data/wb.csv

# All captures of a domain + subdomains
python3 SKILL_DIR/scripts/fetch_wayback.py --url "example.com" \
    --match domain --out data/wb.csv

# Only unique-content captures within a date window
python3 SKILL_DIR/scripts/fetch_wayback.py --url "example.com" \
    --match host --collapse digest \
    --from-date 2020-01-01 --to-date 2023-12-31 \
    --out data/wb.csv
```

## 8. Legal & Licensing

- Internet Archive captures are made under fair-use research provisions
- Replay URLs are stable references — citing them is encouraged
- Internet Archive non-profit terms of use govern content
- Some content is rights-restricted; replay may be blocked even if the CDX entry shows it as captured

## 9. References

- CDX server docs: https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md
- Wayback API: https://archive.org/help/wayback_api.php
- Internet Archive: https://archive.org/
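
For readers without access to `scripts/fetch_wayback.py`, the CDX endpoint from Section 2 and the columns from Section 3 can be tied together in a short sketch. This is not the script's actual implementation. The function names, the mapping from CDX fields (`original`, `statuscode`) to the documented columns (`url`, `status`), and the replay URL pattern `https://web.archive.org/web/<timestamp>/<original>` are assumptions made here for illustration. The query parameters (`url`, `matchType`, `collapse`, `from`, `to`, `output`) follow the CDX server docs listed in the references.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
REPLAY_PREFIX = "https://web.archive.org/web"  # assumed replay URL prefix


def build_cdx_query(url, match_type="exact", collapse=None,
                    from_date=None, to_date=None):
    """Return a CDX search URL that requests JSON output."""
    params = {"url": url, "matchType": match_type, "output": "json"}
    if collapse:
        params["collapse"] = collapse
    # CDX timestamps are plain digit strings, so 2020-01-01 becomes 20200101.
    if from_date:
        params["from"] = from_date.replace("-", "")
    if to_date:
        params["to"] = to_date.replace("-", "")
    return CDX_ENDPOINT + "?" + urlencode(params)


def rows_to_records(rows):
    """Convert CDX JSON (a header row followed by data rows) into dicts.

    The output keys mirror the columns in Section 3; this mapping is an
    assumption, not something confirmed from fetch_wayback.py.
    """
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    records = []
    for row in data:
        raw = dict(zip(header, row))
        records.append({
            "url": raw["original"],
            "timestamp": raw["timestamp"],
            "wayback_url": f"{REPLAY_PREFIX}/{raw['timestamp']}/{raw['original']}",
            "mimetype": raw["mimetype"],
            "status": raw["statuscode"],
            "digest": raw["digest"],
            "length": raw["length"],
        })
    return records


def fetch_captures(url, **query_options):
    """Run one CDX query over the network and return capture records."""
    with urlopen(build_cdx_query(url, **query_options), timeout=30) as resp:
        body = resp.read().decode("utf-8")
    # Treat an empty response body as "no captures".
    return rows_to_records(json.loads(body)) if body.strip() else []


if __name__ == "__main__":
    # Prints the query URL only; no network request is made here.
    print(build_cdx_query("example.com", match_type="host",
                          collapse="digest",
                          from_date="2020-01-01", to_date="2023-12-31"))
```

When calling `fetch_captures` in a loop, sleeping about one second between requests keeps the client within the polite rate suggested in Section 2.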