diff --git a/optional-skills/research/osint-investigation/SKILL.md b/optional-skills/research/osint-investigation/SKILL.md
new file mode 100644
index 00000000000..b2da82fbd00
--- /dev/null
+++ b/optional-skills/research/osint-investigation/SKILL.md
@@ -0,0 +1,277 @@
+---
+name: osint-investigation
+description: Public-records OSINT investigation framework — SEC EDGAR filings, USAspending contracts, Senate lobbying, OFAC sanctions, ICIJ offshore leaks, NYC property records (ACRIS), OpenCorporates registries, CourtListener court records, Wayback Machine archives, Wikipedia + Wikidata, GDELT news monitoring. Entity resolution across sources, cross-link analysis, timing correlation, evidence chains. Python stdlib only.
+version: 0.1.0
+platforms: [linux, macos, windows]
+author: Hermes Agent (adapted from ShinMegamiBoson/OpenPlanter, MIT)
+metadata:
+  hermes:
+    tags: [osint, investigation, public-records, sec, sanctions, corporate-registry, property, courts, due-diligence, journalism]
+    category: research
+    related_skills: [domain-intel, arxiv]
+---
+
+# OSINT Investigation — Public Records Cross-Reference
+
+Investigative framework for public-records OSINT: government contracts,
+corporate filings, lobbying, sanctions, offshore leaks, property records,
+court records, web archives, knowledge bases, and global news. Resolve
+entities across heterogeneous sources, build cross-links with explicit
+confidence, run statistical timing tests, and produce structured evidence
+chains.
+
+**Python stdlib only.** Zero install. Works on Linux, macOS, Windows. Most
+sources work with no API key (OpenCorporates has an optional free token
+that raises rate limits).
+
+Adapted from the MIT-licensed ShinMegamiBoson/OpenPlanter project; expanded
+to cover identity / property / litigation / archives / news sources that
+the original didn't address.
+
+## When to use this skill
+
+Use when the user asks for:
+
+- "follow the money" — government contracts, lobbying → legislation, sanctions
+- corporate due diligence — who controls company X, where are they
+  incorporated, who serves on their boards, what filings have they made
+- sanctions screening — is entity X on OFAC SDN, ICIJ offshore leaks
+- pay-to-play investigation — contractors with offshore ties, lobbying
+  clients winning awards
+- property ownership — find recorded deeds/mortgages by name or address
+  (NYC; for other counties point users at the relevant recorder)
+- litigation history — find federal + state court opinions and PACER dockets
+- multi-source entity resolution where naming varies (LLC suffixes, abbreviations)
+- evidence-chain construction with explicit confidence levels
+- "what's been said about X" — international news (GDELT) + Wikipedia
+  narrative + Wayback Machine to recover dead URLs
+
+Do NOT use this skill for:
+
+- general web research → `web_search` / `web_extract`
+- domain/infrastructure OSINT → `domain-intel` skill
+- academic literature → `arxiv` skill
+- social-media profile discovery → `sherlock` skill (optional)
+- US **federal** campaign finance — FEC is intentionally NOT covered here
+  (the API is unreliable for ad-hoc contributor-name queries on the free
+  DEMO_KEY tier). For federal donations, point users at
+  https://www.fec.gov/data/ directly.
+
+## Workflow
+
+The agent runs scripts via the `terminal` tool. `SKILL_DIR` is the directory
+holding this SKILL.md.
+
+### 1. Identify which sources apply
+
+Read the data-source wiki entries to plan the investigation:
+
+```
+ls SKILL_DIR/references/sources/
+
+# Federal financial / regulatory
+cat SKILL_DIR/references/sources/sec-edgar.md       # corporate filings
+cat SKILL_DIR/references/sources/usaspending.md     # federal contracts
+cat SKILL_DIR/references/sources/senate-ld.md       # lobbying
+cat SKILL_DIR/references/sources/ofac-sdn.md        # sanctions
+cat SKILL_DIR/references/sources/icij-offshore.md   # offshore leaks
+
+# Identity / property / litigation / archives / news
+cat SKILL_DIR/references/sources/nyc-acris.md       # NYC property records
+cat SKILL_DIR/references/sources/opencorporates.md  # global corporate registry
+cat SKILL_DIR/references/sources/courtlistener.md   # court records (federal + state)
+cat SKILL_DIR/references/sources/wayback.md         # Wayback Machine archives
+cat SKILL_DIR/references/sources/wikipedia.md       # Wikipedia + Wikidata
+cat SKILL_DIR/references/sources/gdelt.md           # global news monitoring
+```
+
+Each entry follows a 9-section template: summary, access, schema, coverage,
+cross-reference potential, data quality, acquisition, legal, references.
+
+The **cross-reference potential** section maps join keys between sources — read
+those first to pick the right pair.
+
+### 2. Acquire data
+
+Each source has a stdlib-only fetch script in `SKILL_DIR/scripts/`:
+
+**Federal financial / regulatory**
+
+```bash
+# SEC EDGAR filings (corporate disclosures)
+python3 SKILL_DIR/scripts/fetch_sec_edgar.py --cik 0000320193 \
+    --types 10-K,10-Q --out data/edgar_filings.csv
+
+# USAspending federal contracts
+python3 SKILL_DIR/scripts/fetch_usaspending.py --recipient "EXAMPLE CORP" \
+    --fy 2024 --out data/contracts.csv
+
+# Senate LD-1 / LD-2 lobbying disclosures
+python3 SKILL_DIR/scripts/fetch_senate_ld.py --client "EXAMPLE CORP" \
+    --year 2024 --out data/lobbying.csv
+
+# OFAC SDN sanctions list (full snapshot)
+python3 SKILL_DIR/scripts/fetch_ofac_sdn.py --out data/ofac_sdn.csv
+
+# ICIJ Offshore Leaks — downloads ~70 MB bulk CSV on first use,
+# then searches it locally. Cached for 30 days under
+# $HERMES_OSINT_CACHE/icij/ (default: ~/.cache/hermes-osint/icij/).
+python3 SKILL_DIR/scripts/fetch_icij_offshore.py --entity "EXAMPLE CORP" \
+    --out data/icij.csv
+```
+
+**Identity / property / litigation / archives / news**
+
+```bash
+# NYC property records (deeds, mortgages, liens) — ACRIS via Socrata
+python3 SKILL_DIR/scripts/fetch_nyc_acris.py --name "SMITH, JOHN" \
+    --out data/acris.csv
+python3 SKILL_DIR/scripts/fetch_nyc_acris.py --address "571 HUDSON" \
+    --out data/acris_addr.csv
+
+# OpenCorporates — 130+ jurisdiction corporate registry
+# (free token required; set OPENCORPORATES_API_TOKEN or pass --token)
+python3 SKILL_DIR/scripts/fetch_opencorporates.py --query "Example Corp" \
+    --jurisdiction us_ny --out data/opencorporates.csv
+
+# CourtListener — federal + state court opinions, PACER dockets
+python3 SKILL_DIR/scripts/fetch_courtlistener.py --query "Smith v. Example Corp" \
+    --type opinions --out data/courts.csv
+
+# Wayback Machine — historical web captures
+python3 SKILL_DIR/scripts/fetch_wayback.py --url "example.com" \
+    --match host --collapse digest --out data/wayback.csv
+
+# Wikipedia + Wikidata — narrative bio + structured facts
+# Set HERMES_OSINT_UA=your-app/1.0 (your@email) to identify yourself
+python3 SKILL_DIR/scripts/fetch_wikipedia.py --query "Bill Gates" \
+    --out data/wp.csv
+
+# GDELT — global news in 100+ languages, ~2015→present
+python3 SKILL_DIR/scripts/fetch_gdelt.py --query '"Example Corp"' \
+    --timespan 1y --out data/gdelt.csv
+```
+
+All fetch outputs are normalized CSV with a header row. Scripts are idempotent, so re-running them is safe.
+
+When a private individual won't be in a source (e.g. SEC EDGAR for a non-public-
+company person, USAspending for someone who isn't a federal contractor, Senate
+LDA for someone who isn't a lobbying client), the script returns 0 rows with a
+clear warning rather than silently writing an empty CSV. EDGAR specifically
+flags when the company-name resolver matched an individual Form 3/4/5 filer
+rather than a corporate registrant.
+
+Rate-limit notes are in each source's wiki entry. Default fetchers sleep
+politely between paginated requests. **API keys raise rate limits** for
+sources that support them (`SEC_USER_AGENT`, `SENATE_LDA_TOKEN`,
+`OPENCORPORATES_API_TOKEN`, `COURTLISTENER_TOKEN`). All scripts surface
+429 responses immediately with the upstream's quota message so the user
+knows to slow down or supply a key.
+
+### 3. Resolve entities across sources
+
+Normalize names and find matches between two CSV files:
+
+```bash
+# Match lobbying clients (Senate LDA) against contract recipients (USAspending)
+python3 SKILL_DIR/scripts/entity_resolution.py \
+    --left  data/lobbying.csv   --left-name-col  client_name \
+    --right data/contracts.csv  --right-name-col recipient_name \
+    --out data/cross_links.csv
+```
+
+Three matching tiers with explicit confidence:
+
+| Tier | Method | Confidence |
+|------|--------|------------|
+| `exact` | Normalized strings equal after suffix/punctuation strip | high |
+| `fuzzy` | Sorted-token equality (word-bag match) | medium |
+| `token_overlap` | ≥60% token overlap, ≥2 shared tokens, tokens ≥4 chars | low |
+
+Output `cross_links.csv` columns: `match_type, confidence, left_name,
+right_name, left_normalized, right_normalized, left_row, right_row`.
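+
+The tiers above can be sketched in a few lines. This is a minimal
+illustration, not the code in `entity_resolution.py`: the suffix list, the
+helper names `normalize` and `match_tier`, and the choice to measure overlap
+against the smaller token set are all assumptions.
+
+```python
+import re
+
+# Assumed suffix list for illustration; the real script may strip a different set.
+SUFFIXES = {"llc", "inc", "corp", "corporation", "co", "ltd", "lp", "llp", "company"}
+
+
+def normalize(name):
+    """Lowercase, drop punctuation, and remove the assumed corporate suffixes."""
+    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
+    return " ".join(t for t in tokens if t not in SUFFIXES)
+
+
+def match_tier(left, right):
+    """Return (match_type, confidence) or None, checking the strictest tier first."""
+    a, b = normalize(left), normalize(right)
+    if not a or not b:
+        return None
+    if a == b:
+        return ("exact", "high")
+    if sorted(a.split()) == sorted(b.split()):
+        return ("fuzzy", "medium")
+    # Only tokens of 4+ characters count toward overlap.
+    ta = {t for t in a.split() if len(t) >= 4}
+    tb = {t for t in b.split() if len(t) >= 4}
+    shared = ta & tb
+    # Assumption: overlap is measured against the smaller token set.
+    if len(shared) >= 2 and len(shared) / min(len(ta), len(tb)) >= 0.6:
+        return ("token_overlap", "low")
+    return None
+```
+
+Checking tiers from strictest to loosest means each pair is reported once, at
+the highest confidence it earns.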
+
+### 4. Statistical timing correlation (optional)
+
+Test whether two time series cluster suspiciously close together — e.g.
+lobbying filings near contract awards — using a permutation test:
+
+```bash
+python3 SKILL_DIR/scripts/timing_analysis.py \
+    --donations data/lobbying.csv --donation-date-col filing_date \
+        --donation-amount-col income --donation-donor-col client_name \
+        --donation-recipient-col registrant_name \
+    --contracts data/contracts.csv --contract-date-col award_date \
+        --contract-vendor-col recipient_name \
+    --cross-links data/cross_links.csv \
+    --permutations 1000 \
+    --out data/timing.json
+```
+
+The script's column flags are intentionally generic — the original tool was
+written for donations vs awards, but it works for any (event, payee) time
+series joined through cross-links. Null hypothesis: event timing is
+independent of award dates. One-tailed p-value = fraction of permutations
+with mean nearest-award distance ≤ observed. Minimum 3 events per (payer,
+vendor) pair to run the test.
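+
+The test can be sketched as follows. This is a hedged illustration rather
+than the logic in `timing_analysis.py`: it uses a Monte Carlo randomization
+that redraws event dates uniformly inside the observed window, which is an
+assumed null model, and the function names are hypothetical.
+
+```python
+import random
+from datetime import date
+
+
+def mean_nearest_distance(event_days, award_days):
+    """Mean absolute gap in days from each event to its nearest award."""
+    return sum(min(abs(e - a) for a in award_days) for e in event_days) / len(event_days)
+
+
+def permutation_p_value(events, awards, permutations=1000, seed=0):
+    """One-tailed p-value: share of randomized runs at least as tight as observed."""
+    if len(events) < 3 or not awards:
+        return None  # too few events for the test to be meaningful
+    rng = random.Random(seed)
+    event_days = [d.toordinal() for d in events]
+    award_days = [d.toordinal() for d in awards]
+    observed = mean_nearest_distance(event_days, award_days)
+    # Assumed null model: events fall uniformly at random inside the observed window.
+    lo = min(event_days + award_days)
+    hi = max(event_days + award_days)
+    hits = 0
+    for _ in range(permutations):
+        redrawn = [rng.randint(lo, hi) for _ in event_days]
+        if mean_nearest_distance(redrawn, award_days) <= observed:
+            hits += 1
+    return hits / permutations
+
+
+events = [date(2024, 1, 2), date(2024, 6, 2), date(2024, 11, 2)]
+awards = [date(2024, 1, 3), date(2024, 6, 3), date(2024, 11, 3)]
+p_value = permutation_p_value(events, awards)
+```
+
+A small `p_value` says the events sit closer to awards than random timing
+would usually produce. It says nothing about why.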
+
+### 5. Build the findings JSON (evidence chain)
+
+```bash
+python3 SKILL_DIR/scripts/build_findings.py \
+    --cross-links data/cross_links.csv \
+    --timing data/timing.json \
+    --out data/findings.json
+```
+
+Every finding has `id, title, severity, confidence, summary, evidence[], sources[]`.
+Each evidence item points back to a specific row in a source CSV. The user (or a
+follow-up agent) can verify every claim against its source.
+
+## Confidence and evidence discipline
+
+This is the load-bearing rule of the skill. Tell the user:
+
+- Every claim must trace to a record. No naked assertions.
+- Confidence tier travels with the claim. `match_type=fuzzy` is "probable",
+  not "confirmed."
+- Entity resolution produces candidates, NOT conclusions. A `fuzzy` match
+  between "ACME LLC" and "Acme Holdings Group" is a lead, not a fact.
+- Statistical significance ≠ wrongdoing. p < 0.05 means the timing pattern
+  is unlikely under the null. It does not establish corruption.
+- All data sources here are public records. They may still contain
+  inaccuracies, stale info, or redactions (GDPR, sealed records).
+
+## Adding a new data source
+
+Use the template:
+
+```bash
+cp SKILL_DIR/templates/source-template.md \
+    SKILL_DIR/references/sources/<your-source>.md
+```
+
+Fill in all 9 sections. Write a `fetch_<source>.py` script in `scripts/` that
+uses stdlib only and writes a normalized CSV. Update the source list in the
+"When to use" section above.
+
+## Tools and their limits
+
+- `entity_resolution.py` does NOT use external fuzzy libraries (no rapidfuzz,
+  no jellyfish). Token-bag matching is the upper bound here. If you need
+  Levenshtein, transliteration, or phonetic matching, pip-install separately.
+- `timing_analysis.py` uses Python's `random` for permutations. For
+  reproducibility, pass `--seed N`.
+- `fetch_*.py` scripts use `urllib.request` and respect `Retry-After`. Heavy
+  bulk usage may still violate ToS — read each source's legal section first.
+
+## Legal note
+
+All sources in this skill are public records. Bulk acquisition is permitted under
+their respective access terms (FOIA, public records law, ICIJ explicit
+publication, OFAC public data). However:
+
+- Some sources rate-limit aggressively. Respect their headers.
+- Some redact registrant info (GDPR on WHOIS, sealed filings).
+- Cross-referencing public records to identify private individuals can have
+  ethical implications. The skill produces evidence chains, not accusations.
diff --git a/optional-skills/research/osint-investigation/references/sources/courtlistener.md b/optional-skills/research/osint-investigation/references/sources/courtlistener.md
new file mode 100644
index 00000000000..0365b2ba0b1
--- /dev/null
+++ b/optional-skills/research/osint-investigation/references/sources/courtlistener.md
@@ -0,0 +1,98 @@
+# CourtListener — Free Law Project
+
+## 1. Summary
+
+CourtListener (Free Law Project) aggregates court opinions, dockets, oral
+arguments, and judge data. Covers ~10M federal and state court opinions
+back to colonial America, plus PACER docket data from RECAP submissions.
+
+## 2. Access Methods
+
+- **REST API v4:** `https://www.courtlistener.com/api/rest/v4/`
+- **Auth:** Anonymous reads allowed on most endpoints; token raises rate
+  limits and unlocks bulk export
+- **Rate limit:** ~5,000 req/hour unauthenticated for search; higher with token
+
+Set the `COURTLISTENER_TOKEN` environment variable. To get a free token, sign
+in at https://www.courtlistener.com/sign-in/ and create an API key.
+
+## 3. Data Schema
+
+Key fields emitted by `fetch_courtlistener.py`:
+
+| Column | Type | Description |
+|--------|------|-------------|
+| `case_name` | str | Case name |
+| `court` | str | Court name |
+| `court_id` | str | Court ID (e.g. `nysd`, `scotus`, `ca9`) |
+| `date_filed` | str | YYYY-MM-DD |
+| `docket_number` | str | Court docket number |
+| `judge` | str | Judge name(s) |
+| `citation` | str | Reporter citation(s) |
+| `result_type` | str | opinions / dockets / oral / people |
+| `snippet` | str | Search-match snippet (up to 500 chars) |
+| `absolute_url` | str | Direct CourtListener URL |
+
+## 4. Coverage
+
+- Federal: all circuit and district courts, SCOTUS
+- State: all 50 state supreme/appellate courts, many trial courts
+- Opinions: ~10M back to 1600s (colonial), full coverage 1950 → present
+- Dockets via RECAP: ~3M+ from user-submitted PACER PDFs
+- Updated continuously
+
+## 5. Cross-Reference Potential
+
+- **OpenCorporates** ↔ `case_name` (corporate litigation)
+- **SEC EDGAR** ↔ `case_name` (securities class actions)
+- **OFAC SDN** ↔ `case_name` (sanctions-related civil/criminal cases)
+
+Join key: party name from `case_name`. Note: `case_name` often abbreviates
+("Smith v. Jones" rather than full party names) — use the full case URL
+to get all parties.
+
+## 6. Data Quality
+
+- Older opinions (pre-1990) often lack docket numbers and judges
+- State coverage is more uneven than federal
+- PACER docket coverage depends on RECAP user submissions — not exhaustive
+- Sealed documents are excluded
+- Party names in case captions don't always match filing names exactly
+
+## 7. Acquisition Script
+
+Path: `scripts/fetch_courtlistener.py`
+
+```bash
+# Search opinions for a party / keyword
+python3 SKILL_DIR/scripts/fetch_courtlistener.py --query "Example Corp" \
+    --out data/cl.csv
+
+# PACER dockets (best for recent litigation)
+python3 SKILL_DIR/scripts/fetch_courtlistener.py --query "Example Corp" \
+    --type dockets --out data/cl_dockets.csv
+
+# Restrict to a court
+python3 SKILL_DIR/scripts/fetch_courtlistener.py --query "Microsoft" \
+    --court ca9 --out data/cl_9th.csv
+
+# Date range
+python3 SKILL_DIR/scripts/fetch_courtlistener.py --query "Example Corp" \
+    --date-from 2020-01-01 --date-to 2024-12-31 --out data/cl.csv
+```
+
+Pass `--token` or set `COURTLISTENER_TOKEN`.
+
+## 8. Legal & Licensing
+
+- Court opinions are public domain
+- Free Law Project provides the data under CC0 / public domain dedication
+- No commercial use restrictions on opinion text or metadata
+- Some PACER PDFs have copyright on layout (not text) — fair use applies
+
+## 9. References
+
+- API docs: https://www.courtlistener.com/help/api/rest/
+- Court IDs: https://www.courtlistener.com/api/jurisdictions/
+- RECAP archive: https://www.courtlistener.com/recap/
+- Bulk data: https://www.courtlistener.com/help/api/bulk-data/
diff --git a/optional-skills/research/osint-investigation/references/sources/gdelt.md b/optional-skills/research/osint-investigation/references/sources/gdelt.md
new file mode 100644
index 00000000000..785c171a0c9
--- /dev/null
+++ b/optional-skills/research/osint-investigation/references/sources/gdelt.md
@@ -0,0 +1,104 @@
+# GDELT — Global News Monitoring
+
+## 1. Summary
+
+GDELT (Global Database of Events, Language, and Tone) monitors world news
+in 100+ languages with full-text indexing. Updated every 15 minutes.
+~2015 → present, ~1B+ articles indexed. Free anonymous access.
+
+GDELT is wider than Google News (more international, more long-tail
+sources) and indexed by tone/sentiment, GKG themes, people, and
+organizations.
+
+## 2. Access Methods
+
+- **DOC 2.0 API:** `https://api.gdeltproject.org/api/v2/doc/doc`
+- **Events / GKG 2.0:** `https://api.gdeltproject.org/api/v2/events/events`
+- **Auth:** None
+- **Rate limit:** **1 request per 5 seconds** for the DOC API — strict
+
+The fetch script automatically retries after a 6-second sleep when a
+429 is received.
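+
+The retry behaviour described above could look roughly like this. It is a
+sketch under stated assumptions, not the code in `fetch_gdelt.py`: the
+function name, the retry count, and the `opener` / `sleep` parameters
+(hypothetical hooks that make the loop testable offline) are illustrative.
+
+```python
+import time
+import urllib.error
+import urllib.request
+
+
+def fetch_with_retry(url, retries=3, wait_seconds=6,
+                     opener=urllib.request.urlopen, sleep=time.sleep):
+    """Fetch a URL, sleeping and retrying whenever the server answers 429."""
+    for attempt in range(retries + 1):
+        try:
+            with opener(url) as response:
+                return response.read()
+        except urllib.error.HTTPError as err:
+            # Re-raise anything that is not a rate-limit response, or the final 429.
+            if err.code != 429 or attempt == retries:
+                raise
+            sleep(wait_seconds)
+```
+
+Sleeping 6 seconds keeps each retry just outside the 5-second window.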
+
+## 3. Data Schema
+
+Key fields emitted by `fetch_gdelt.py`:
+
+| Column | Type | Description |
+|--------|------|-------------|
+| `title` | str | Article title |
+| `url` | str | Article URL |
+| `seen_date` | str | When GDELT first saw the article (UTC) |
+| `domain` | str | Publisher domain |
+| `language` | str | Source language |
+| `source_country` | str | 2-letter country code |
+| `tone` | str | GDELT-computed tone score (negative = negative coverage) |
+| `social_image` | str | Open Graph image URL when available |
+
+## 4. Coverage
+
+- Worldwide news in 100+ languages
+- ~2015 → present (Events back to 1979 via a separate stream)
+- Update frequency: 15 minutes
+- Bias: heavily Anglophone in volume but very wide source list overall
+
+## 5. Cross-Reference Potential
+
+- **All sources** ↔ `title` / `url` (news context for any subject)
+- **Wikipedia** ↔ event timeline for notable entities
+- **Wayback Machine** ↔ recover articles whose URLs have died
+- **OFAC SDN** ↔ news context for sanctions designations
+- **SEC EDGAR** ↔ news context for 8-K material events
+
+Join key: entity name appearing in article title or full-text. GDELT also
+extracts named entities into a separate stream (GKG) not exposed by this
+fetcher — query GDELT directly for entity-level filtering.
+
+## 6. Data Quality
+
+- Title extraction is automated and can be wrong (sometimes captures the
+  site name + delimiter + article title; sometimes a generic page title)
+- Sentiment / tone is computed by GDELT, not source-supplied
+- Some domains are oversampled (newswires, aggregators)
+- Source country is inferred from domain registration / TLD — can be
+  wrong for international news sites with country-neutral domains
+- Article URLs can rot — pair with Wayback Machine to preserve content
+
+## 7. Acquisition Script
+
+Path: `scripts/fetch_gdelt.py`
+
+```bash
+# Recent news mentioning an entity
+python3 SKILL_DIR/scripts/fetch_gdelt.py --query "Nous Research" \
+    --timespan 6m --out data/gdelt.csv
+
+# Phrase-exact (use double quotes inside single quotes for the shell)
+python3 SKILL_DIR/scripts/fetch_gdelt.py --query '"Dillon Rolnick"' \
+    --timespan 1y --out data/gdelt.csv
+
+# Filter to a country / language
+python3 SKILL_DIR/scripts/fetch_gdelt.py --query "Microsoft" \
+    --source-country US --source-lang English --out data/gdelt.csv
+
+# Date range
+python3 SKILL_DIR/scripts/fetch_gdelt.py --query "Microsoft" \
+    --start 2024-01-01 --end 2024-12-31 --out data/gdelt.csv
+```
+
+GDELT supports its own query operators: phrase quoting, AND/OR/NOT,
+`sourcecountry:US`, `theme:ECON_BANKRUPTCY`, `tone<-5`, etc.
+See https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/ for syntax.
+
+## 8. Legal & Licensing
+
+- GDELT data is provided free for academic and journalistic use
+- Article URLs link out to original publishers — copyright remains with
+  the publisher
+- GDELT is NOT a content archive; it's a metadata index
+
+## 9. References
+
+- DOC 2.0 API: https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/
+- Themes & query syntax: https://blog.gdeltproject.org/gkg-2-0-our-global-knowledge-graph-2-0-amazing-data-at-your-fingertips/
+- Project home: https://www.gdeltproject.org/
diff --git a/optional-skills/research/osint-investigation/references/sources/icij-offshore.md b/optional-skills/research/osint-investigation/references/sources/icij-offshore.md
new file mode 100644
index 00000000000..99e2abcb24b
--- /dev/null
+++ b/optional-skills/research/osint-investigation/references/sources/icij-offshore.md
@@ -0,0 +1,104 @@
+# ICIJ Offshore Leaks Database
+
+## 1. Summary
+
+The International Consortium of Investigative Journalists (ICIJ) publishes a
+combined database of offshore entities from the Panama Papers, Paradise Papers,
+Pandora Papers, Bahamas Leaks, and Offshore Leaks. It holds roughly 810,000
+offshore entities with their officers, intermediaries, and addresses.
+
+## 2. Access Methods
+
+- **Bulk download (primary):** `https://offshoreleaks-data.icij.org/offshoreleaks/csv/full-oldb.LATEST.zip` (~70 MB ZIP, refreshed periodically)
+- **Search UI (human):** `https://offshoreleaks.icij.org/`
+- **Auth:** None
+- **Note:** The previous Open Refine reconciliation endpoint at
+  `/reconcile` now returns 404. ICIJ has removed it. The bulk ZIP is the
+  remaining stable access path. The skill's `fetch_icij_offshore.py` caches
+  the ZIP locally (default `~/.cache/hermes-osint/icij/`, refreshes after
+  30 days) and searches it offline.
+
+## 3. Data Schema
+
+Key fields emitted by `fetch_icij_offshore.py`:
+
+| Column | Type | Description |
+|--------|------|-------------|
+| `node_id` | int | ICIJ canonical node ID |
+| `name` | str | Entity / officer / intermediary name |
+| `node_type` | str | entity / officer / intermediary / address |
+| `country_codes` | str | Semicolon-separated ISO codes |
+| `countries` | str | Country names |
+| `jurisdiction` | str | Offshore jurisdiction (BVI, Panama, etc.) |
+| `incorporation_date` | str | YYYY-MM-DD |
+| `inactivation_date` | str | YYYY-MM-DD (if struck) |
+| `source` | str | Panama Papers / Paradise Papers / Pandora Papers / etc. |
+| `entity_url` | str | Link to ICIJ page |
+| `connections` | str | Semicolon-separated node IDs of related entities |
+
+## 4. Coverage
+
+- Worldwide offshore entity records
+- Earliest records: 1970s (Bahamas Leaks). Most data 1990–2018.
+- NOT updated in real-time — new leaks added when ICIJ publishes them
+- ~810,000 offshore entities + ~750,000 officers + ~150,000 intermediaries
+
+## 5. Cross-Reference Potential
+
+- **SEC EDGAR** ↔ `name` (public companies with offshore arms)
+- **USAspending** ↔ `name` (federal contractors with offshore structure)
+- **OFAC SDN** ↔ `name` (sanctioned entities using offshore vehicles)
+
+Join key: normalized entity/officer name. `node_id` is canonical for cross-
+referencing within ICIJ. Connections graph traversal is in-script (BFS over
+`connections`).
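+
+A breadth-first walk over `connections` might look like the sketch below. It
+assumes rows shaped like the CSV schema in section 3, read as strings (for
+example via `csv.DictReader`). The function name and the `max_depth`
+parameter are hypothetical and may not match `fetch_icij_offshore.py`.
+
+```python
+from collections import deque
+
+
+def connected_nodes(rows, start_id, max_depth=2):
+    """Breadth-first walk over `connections`, returning {node_id: depth}."""
+    graph = {
+        row["node_id"]: [n.strip() for n in row.get("connections", "").split(";") if n.strip()]
+        for row in rows
+    }
+    depths = {start_id: 0}
+    queue = deque([start_id])
+    while queue:
+        node = queue.popleft()
+        if depths[node] >= max_depth:
+            continue  # do not expand beyond the requested depth
+        for neighbor in graph.get(node, []):
+            if neighbor not in depths:
+                depths[neighbor] = depths[node] + 1
+                queue.append(neighbor)
+    return depths
+
+
+rows = [
+    {"node_id": "1", "connections": "2;3"},
+    {"node_id": "2", "connections": "4"},
+    {"node_id": "4", "connections": "5"},
+]
+reachable = connected_nodes(rows, "1")
+```
+
+Capping the depth keeps a search from pulling in large parts of the graph
+through a single shared intermediary.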
+
+## 6. Data Quality
+
+- Offshore entity names sometimes appear in multiple leaks with slight variations
+- Officers may be nominees (front persons), not beneficial owners
+- Some entries have minimal info (just a name + jurisdiction)
+- The connections graph is incomplete — some relationships are documented in
+  source materials but not in the structured database
+- Inactive/struck-off entities are still included with `inactivation_date`
+
+## 7. Acquisition Script
+
+Path: `scripts/fetch_icij_offshore.py`
+
+```bash
+# Search by entity name (case-insensitive substring across the bulk DB)
+python3 SKILL_DIR/scripts/fetch_icij_offshore.py --entity "EXAMPLE CORP" \
+    --out data/icij.csv
+
+# Search by officer (individual person)
+python3 SKILL_DIR/scripts/fetch_icij_offshore.py --officer "SMITH JOHN" \
+    --out data/icij.csv
+
+# Narrow an officer search to one jurisdiction (filter applied to the cached data)
+python3 SKILL_DIR/scripts/fetch_icij_offshore.py --officer "SMITH" \
+    --jurisdiction "BRITISH VIRGIN ISLANDS" --out data/icij_bvi.csv
+
+# Force a fresh download (default refresh window is 30 days)
+python3 SKILL_DIR/scripts/fetch_icij_offshore.py --entity "EXAMPLE CORP" \
+    --force-refresh --out data/icij.csv
+```
+
+First call downloads the ~70 MB ZIP under `~/.cache/hermes-osint/icij/`
+(or `$HERMES_OSINT_CACHE/icij/`). Subsequent calls reuse the cache for 30 days.
+
+## 8. Legal & Licensing
+
+- Data that ICIJ has explicitly chosen to publish
+- No copyright on the underlying facts (entity names, jurisdictions)
+- ICIJ asks for attribution if used in derivative reporting
+- **Ethical note**: Presence in this database does NOT imply wrongdoing. Many
+  offshore structures are legal. The database is a research tool, not a list of
+  criminals.
+
+## 9. References
+
+- Database: https://offshoreleaks.icij.org/
+- About the data: https://offshoreleaks.icij.org/pages/about
+- Methodology: https://www.icij.org/investigations/panama-papers/
+- Retired API: the Open Refine reconciliation endpoint at `https://offshoreleaks.icij.org/reconcile` now returns 404 (see section 2)
diff --git a/optional-skills/research/osint-investigation/references/sources/nyc-acris.md b/optional-skills/research/osint-investigation/references/sources/nyc-acris.md
new file mode 100644
index 00000000000..4b20169bf3e
--- /dev/null
+++ b/optional-skills/research/osint-investigation/references/sources/nyc-acris.md
@@ -0,0 +1,90 @@
+# NYC ACRIS — NYC Real Property Records
+
+## 1. Summary
+
+The Automated City Register Information System (ACRIS) is NYC's index of
+recorded property documents: deeds, mortgages, satisfactions, liens, UCC
+filings. Covers Manhattan, Bronx, Brooklyn, and Queens (not Staten Island).
+Published as 4 linked Socrata datasets on the NYC Open Data portal.
+
+## 2. Access Methods
+
+- **Socrata API:** `https://data.cityofnewyork.us/resource/636b-3b5g.json` (Parties)
+- **Other datasets:** `bnx9-e6tj` (Master), `8h5j-fqxa` (Legal), `uqqa-hym2` (References)
+- **Auth:** None for read access (Socrata `$app_token` raises rate limits if needed)
+- **Rate limit:** Generous (~1000 req/hour unauthenticated)
+
+## 3. Data Schema
+
+Key fields emitted by `fetch_nyc_acris.py` (Parties joined to Master):
+
+| Column | Type | Description |
+|--------|------|-------------|
+| `document_id` | str | ACRIS document ID |
+| `name` | str | Party name as recorded (often "LAST, FIRST" but varies) |
+| `party_type` | str | 1=grantor, 2=grantee, 3=other |
+| `party_role` | str | Human-readable role label |
+| `address_1` | str | Property or party address line 1 |
+| `city`, `state`, `zip`, `country` | str | Address parts |
+| `doc_type` | str | DEED, MTGE (mortgage), SAT (satisfaction), AGMT, etc. |
+| `doc_date`, `recorded_date` | str | YYYY-MM-DD |
+| `borough` | str | Manhattan / Bronx / Brooklyn / Queens / Staten Island |
+| `amount` | str | Document amount (USD, when applicable) |
+| `filing_url` | str | Direct ACRIS DocumentImageView link |
+
+## 4. Coverage
+
+- NYC only, excluding Staten Island (held by the Richmond County Clerk); other counties have their own recorders
+- 1966 → present (older filings exist on microfilm at the County Clerk)
+- Updated nightly
+- ~70M+ party records cumulative
+
+## 5. Cross-Reference Potential
+
+- **SEC EDGAR** ↔ `name` (insider filers with NYC property)
+- **USAspending** ↔ `name` (federal contractors with NYC property)
+- **Senate LDA** ↔ `name` (lobbyists / clients with NYC property)
+- **ICIJ Offshore** ↔ `name` (NYC properties owned via offshore vehicles)
+
+Join key: normalized party name. NYC property records typically store names
+as "LAST, FIRST" or full LLC names — use `entity_resolution.py`.
+
+## 6. Data Quality
+
+- Same person appears with multiple name formats over time
+- LLC and trust ownership obscures beneficial owners
+- Recording lag can be 2-4 weeks after closing
+- Older documents have spottier address data
+- Sealed records (e.g. domestic violence shelters) are excluded by law
+
+## 7. Acquisition Script
+
+Path: `scripts/fetch_nyc_acris.py`
+
+```bash
+# By party name
+python3 SKILL_DIR/scripts/fetch_nyc_acris.py --name "ROLNICK" --out data/acris.csv
+
+# By address (useful when you know the property but not the names)
+python3 SKILL_DIR/scripts/fetch_nyc_acris.py --address "571 HUDSON" --out data/acris.csv
+
+# Restrict to grantees (buyers / mortgagees)
+python3 SKILL_DIR/scripts/fetch_nyc_acris.py --name "ROLNICK" --party-type 2 \
+    --out data/acris_buyers.csv
+```
+
+The script joins Parties → Master to populate doc_type, dates, borough, and
+amount. Pass `--no-enrich` to skip the join (faster, fewer columns).
+
+## 8. Legal & Licensing
+
+- Public record under NYS Real Property Law and NYC Charter
+- No commercial use restrictions on the data
+- All ACRIS data is public information by statute
+
+## 9. References
+
+- ACRIS portal: https://a836-acris.nyc.gov/CP/
+- NYC Open Data: https://data.cityofnewyork.us/
+- Parties dataset: https://data.cityofnewyork.us/City-Government/ACRIS-Real-Property-Parties/636b-3b5g
+- Document type codes: https://www1.nyc.gov/site/finance/taxes/acris.page
diff --git a/optional-skills/research/osint-investigation/references/sources/ofac-sdn.md b/optional-skills/research/osint-investigation/references/sources/ofac-sdn.md
new file mode 100644
index 00000000000..ab3602031f1
--- /dev/null
+++ b/optional-skills/research/osint-investigation/references/sources/ofac-sdn.md
@@ -0,0 +1,92 @@
+# OFAC SDN — Specially Designated Nationals List
+
+## 1. Summary
+
+The Office of Foreign Assets Control (OFAC) publishes the Specially Designated
+Nationals and Blocked Persons List (SDN). US persons are generally prohibited
+from dealing with individuals and entities on this list. Also published:
+non-SDN consolidated lists (FSE, SSI, etc.).
+
+## 2. Access Methods
+
+- **Full XML:** `https://www.treasury.gov/ofac/downloads/sdn.xml`
+- **Delimited:** `https://www.treasury.gov/ofac/downloads/sdn.csv`
+- **Consolidated:** `https://www.treasury.gov/ofac/downloads/consolidated/consolidated.xml`
+- **Auth:** None
+- **Rate limit:** None (static file downloads). Updated whenever OFAC announces changes.
+
+## 3. Data Schema
+
+Key fields emitted by `fetch_ofac_sdn.py`:
+
+| Column | Type | Description |
+|--------|------|-------------|
+| `entity_id` | int | OFAC unique ID |
+| `name` | str | Primary name |
+| `entity_type` | str | individual / entity / vessel / aircraft |
+| `program_list` | str | Semicolon-separated sanctions programs (e.g. SDGT;IRAN) |
+| `title` | str | For individuals: title/role |
+| `nationalities` | str | Semicolon-separated country codes |
+| `aka_list` | str | Semicolon-separated "also known as" names |
+| `addresses` | str | Semicolon-separated known addresses |
+| `dob` | str | Date of birth (individuals) |
+| `pob` | str | Place of birth (individuals) |
+| `remarks` | str | OFAC's free-text remarks |
+| `last_updated` | str | YYYY-MM-DD (publication date) |
+
+## 4. Coverage
+
+- Worldwide — all entities sanctioned by US Treasury
+- ~10,000 entries on SDN, ~15,000 on consolidated lists
+- Updated irregularly, sometimes daily during active enforcement
+- Includes AKAs (very common, can be 10+ per entity)
+
+## 5. Cross-Reference Potential
+
+- **SEC EDGAR** ↔ `name` (public companies sanctioned)
+- **USAspending** ↔ `name` (sanctioned entity as federal contractor — should
+  be impossible but verify)
+- **ICIJ Offshore** ↔ `name` (offshore entities also sanctioned)
+
+Join key: normalized name. **CRITICAL**: must match against `aka_list` too.
+Many sanctioned entities are caught only via aliases.
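+
+Alias-aware screening can be sketched as below. This is a minimal
+exact-match illustration over rows shaped like the CSV schema in section 3.
+The function names are hypothetical, and real screening would normally add
+the fuzzy tiers from `entity_resolution.py` on top.
+
+```python
+import re
+
+
+def normalize(name):
+    """Uppercase and collapse punctuation so spelling variants compare equal."""
+    return " ".join(re.sub(r"[^A-Z0-9 ]", " ", name.upper()).split())
+
+
+def screen(candidate, sdn_rows):
+    """Return (entity_id, matched_on) for rows whose name or any alias equals the candidate."""
+    target = normalize(candidate)
+    hits = []
+    for row in sdn_rows:
+        if normalize(row["name"]) == target:
+            hits.append((row["entity_id"], "name"))
+            continue
+        # `aka_list` is semicolon-separated; skip empty fragments.
+        aliases = [a for a in row.get("aka_list", "").split(";") if a.strip()]
+        if any(normalize(a) == target for a in aliases):
+            hits.append((row["entity_id"], "aka_list"))
+    return hits
+
+
+sdn_rows = [
+    {"entity_id": "1", "name": "EXAMPLE TRADING CO", "aka_list": "ETC Holdings; Example Trade"},
+    {"entity_id": "2", "name": "OTHER LTD", "aka_list": ""},
+]
+alias_hits = screen("etc holdings", sdn_rows)
+```
+
+Recording whether a hit came from `name` or `aka_list` keeps the evidence
+chain explicit about how the match was made.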
+
+## 6. Data Quality
+
+- Names are transliterated from many scripts — multiple romanizations possible
+- AKAs often differ wildly from primary name
+- Some entries have minimal info (no DOB, no address) for individuals
+- Free-text `remarks` contain critical context — read them
+- "Specially Designated Global Terrorists" (SDGT) and "Cyber-related" (CYBER2)
+  programs add and remove entries frequently
+
+## 7. Acquisition Script
+
+Path: `scripts/fetch_ofac_sdn.py`
+
+```bash
+# Full snapshot
+python3 SKILL_DIR/scripts/fetch_ofac_sdn.py --out data/ofac_sdn.csv
+
+# Filter to specific program
+python3 SKILL_DIR/scripts/fetch_ofac_sdn.py --program SDGT --out data/sdn_sdgt.csv
+
+# Entities only (skip individuals, vessels, aircraft)
+python3 SKILL_DIR/scripts/fetch_ofac_sdn.py --entity-type entity --out data/sdn_entities.csv
+```
+
+## 8. Legal & Licensing
+
+- Public record under Executive Order authority and statutory sanctions programs
+- US persons must not deal with listed parties; civil liability is strict, so
+  screening against this list is expected in practice
+- No restrictions on the data itself; restrictions are on transactions with
+  the listed entities
+- No penalty for "over-matching": false positives must be cleared but are not
+  prohibited
+
+## 9. References
+
+- OFAC home: https://ofac.treasury.gov/
+- SDN list: https://ofac.treasury.gov/specially-designated-nationals-and-blocked-persons-list-sdn-human-readable-lists
+- Data formats: https://ofac.treasury.gov/sdn-list/sanctions-list-search-tool
+- Recent actions (designations and removals): https://ofac.treasury.gov/recent-actions
diff --git a/optional-skills/research/osint-investigation/references/sources/opencorporates.md b/optional-skills/research/osint-investigation/references/sources/opencorporates.md
new file mode 100644
index 00000000000..0bd190a2f49
--- /dev/null
+++ b/optional-skills/research/osint-investigation/references/sources/opencorporates.md
@@ -0,0 +1,103 @@
+# OpenCorporates — Global Corporate Registry
+
+## 1. Summary
+
+OpenCorporates aggregates corporate registry data from 130+ jurisdictions
+worldwide (~200M companies). Covers US state-level filings (NY DOS, Delaware
+DOC, California SOS, etc.), UK Companies House, EU registries, and most
+common-law jurisdictions.
+
+## 2. Access Methods
+
+- **REST API:** `https://api.opencorporates.com/v0.4/`
+- **HTML fallback:** `https://opencorporates.com/companies?q=...`
+- **Auth:** API token required (free tier 500 calls/month, paid plans available)
+- **Rate limit:** Token-bound; un-tokened requests return 401
+
+Set `OPENCORPORATES_API_TOKEN` env var. Get a free token at
+https://opencorporates.com/api_accounts/new.
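+
+As a rough sketch of what a tokened request looks like, the snippet below
+builds a company-search URL against the REST base above. It assumes the
+`companies/search` endpoint with `q`, `jurisdiction_code`, and `api_token`
+query parameters; `search_url` is a hypothetical helper, not part of
+`fetch_opencorporates.py`:
+
+```python
+import os
+from urllib.parse import urlencode
+
+API_BASE = "https://api.opencorporates.com/v0.4/"
+
+
+def search_url(query, jurisdiction=None, token=None):
+    """Build a company-search URL; the token falls back to the env var."""
+    params = {"q": query}
+    if jurisdiction:
+        params["jurisdiction_code"] = jurisdiction
+    token = token or os.environ.get("OPENCORPORATES_API_TOKEN")
+    if token:
+        params["api_token"] = token
+    return f"{API_BASE}companies/search?{urlencode(params)}"
+
+
+print(search_url("Example Corp", jurisdiction="us_ny", token="xxx"))
+```
+
+Keeping the token out of logs is worth the extra care, since it travels in
+the query string.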
+
+## 3. Data Schema
+
+Key fields emitted by `fetch_opencorporates.py`:
+
+| Column | Type | Description |
+|--------|------|-------------|
+| `name` | str | Company legal name |
+| `company_number` | str | Registry-assigned number |
+| `jurisdiction_code` | str | e.g. `us_ny`, `us_de`, `gb` |
+| `jurisdiction_name` | str | Human-readable jurisdiction |
+| `incorporation_date` | str | YYYY-MM-DD |
+| `dissolution_date` | str | YYYY-MM-DD (empty if active) |
+| `company_type` | str | Domestic LLC / Foreign Corp / etc. |
+| `status` | str | Active / Inactive / Dissolved |
+| `registered_address` | str | Registered office address |
+| `opencorporates_url` | str | Link to OpenCorporates entity page |
+| `officers_count` | str | Total officers on record |
+| `source` | str | `api`, `html`, or `html-fallback` |
+
+## 4. Coverage
+
+- US: all 50 states + DC at state level (LLCs, corps, LPs)
+- International: UK, EU, Canada, Australia, NZ, many APAC + LATAM jurisdictions
+- ~200M company records cumulative
+- Update frequency varies by jurisdiction (UK CH is near-realtime; some
+  state registries lag months)
+
+## 5. Cross-Reference Potential
+
+- **NYC ACRIS** ↔ `name` (LLC/corp owners of NYC property)
+- **USAspending** ↔ `name` (corporate federal contractors)
+- **SEC EDGAR** ↔ `name` (public companies + their subsidiaries)
+- **ICIJ Offshore** ↔ `name` (international corporate structures)
+
+Join key: normalized company name. Some entries have `previous_names` arrays
+which are not currently exported by the fetch script — query OC directly
+for that.
+
+## 6. Data Quality
+
+- Company-name spellings vary across re-incorporations and renames
+- Officer records are spottier than company records (many jurisdictions
+  don't require officer disclosure)
+- Beneficial-ownership data is generally NOT here — most jurisdictions
+  don't require it. UK Companies House has PSC (people with significant
+  control) but that's not universal.
+- Cross-jurisdictional links (parent / subsidiary) are based on registry
+  filings only; corporate trees are often incomplete
+
+## 7. Acquisition Script
+
+Path: `scripts/fetch_opencorporates.py`
+
+```bash
+# Search globally by name
+python3 SKILL_DIR/scripts/fetch_opencorporates.py --query "Example Corp" \
+    --out data/oc.csv
+
+# Restrict to a jurisdiction
+python3 SKILL_DIR/scripts/fetch_opencorporates.py --query "Example Corp" \
+    --jurisdiction us_ny --out data/oc_ny.csv
+
+# Set token via env or flag
+OPENCORPORATES_API_TOKEN=xxx python3 SKILL_DIR/scripts/fetch_opencorporates.py \
+    --query "Microsoft" --out data/oc.csv
+```
+
+Without a token the script falls back to scraping the HTML search page.
+The fallback is brittle and only fills in `name`, `jurisdiction_code`,
+`opencorporates_url` — set the token for serious work.
+
+## 8. Legal & Licensing
+
+- OpenCorporates aggregates public records — the underlying facts are
+  public domain
+- OpenCorporates' own database is licensed under ODbL 1.0 (share-alike); attribution required
+- API ToS prohibits redistributing the full dataset; per-record reference
+  is fine
+
+## 9. References
+
+- API docs: https://api.opencorporates.com/documentation/API-Reference
+- Jurisdiction codes: https://api.opencorporates.com/v0.4/jurisdictions.json
+- Schema: https://opencorporates.com/info/our_data
diff --git a/optional-skills/research/osint-investigation/references/sources/sec-edgar.md b/optional-skills/research/osint-investigation/references/sources/sec-edgar.md
new file mode 100644
index 00000000000..55a33d70258
--- /dev/null
+++ b/optional-skills/research/osint-investigation/references/sources/sec-edgar.md
@@ -0,0 +1,83 @@
+# SEC EDGAR — Corporate Filings
+
+## 1. Summary
+
+EDGAR (Electronic Data Gathering, Analysis, and Retrieval) is the SEC's system
+for corporate disclosure filings: 10-K (annual), 10-Q (quarterly), 8-K (current
+events), DEF 14A (proxy), Form 4 (insider transactions), 13F (institutional holdings).
+
+## 2. Access Methods
+
+- **API:** `https://data.sec.gov/submissions/CIK<10-digit-padded>.json` (no auth)
+- **Filing index:** `https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=...`
+- **Full-text search:** `https://efts.sec.gov/LATEST/search-index?q=...`
+- **Auth:** None — requires `User-Agent` header with contact info per SEC policy
+- **Rate limit:** 10 requests/second per IP (enforced)
+
+## 3. Data Schema
+
+Key fields emitted by `fetch_sec_edgar.py` (filings index):
+
+| Column | Type | Description |
+|--------|------|-------------|
+| `cik` | str | Central Index Key (10-digit padded) |
+| `company_name` | str | Registrant name |
+| `form_type` | str | 10-K, 10-Q, 8-K, etc. |
+| `filing_date` | str | YYYY-MM-DD |
+| `accession_number` | str | Filing accession (e.g. 0000320193-24-000123) |
+| `primary_document` | str | Filename of main document |
+| `filing_url` | str | Direct URL to filing index |
+| `reporting_period` | str | Period of report (where applicable) |
+
+## 4. Coverage
+
+- All public US registrants from 1993 → present
+- Coverage for 1993-2000 is spotty (paper-to-electronic migration)
+- ~12M filings cumulative
+- Updated within minutes of filing acceptance
+
+## 5. Cross-Reference Potential
+
+- **USAspending** ↔ `company_name` (public companies as federal contractors)
+- **Senate LD** ↔ `company_name` (public companies hire lobbyists)
+- **OFAC SDN** ↔ `company_name` (sanctions screening of public registrants)
+
+Join key: company name OR CIK if you have it. CIK is canonical and stable.
+
+## 6. Data Quality
+
+- Subsidiaries often file under the parent's CIK, so be careful with name matches
+- Name changes over time (rebrands, acquisitions) — CIK remains constant
+- 10-K Item 1A Risk Factors are free-form text — useful for `web_extract`-style
+  parsing, not structured queries
+- Foreign private issuers file 20-F instead of 10-K
+
+## 7. Acquisition Script
+
+Path: `scripts/fetch_sec_edgar.py`
+
+```bash
+# By CIK
+python3 SKILL_DIR/scripts/fetch_sec_edgar.py --cik 0000320193 \
+    --types 10-K,10-Q --out data/edgar_filings.csv
+
+# By company name (resolves to CIK first via name search)
+python3 SKILL_DIR/scripts/fetch_sec_edgar.py --company "APPLE INC" \
+    --types 8-K --since 2024-01-01 --out data/edgar_filings.csv
+```
+
+Set `SEC_USER_AGENT` env var with your contact email (SEC requirement).
+Example: `SEC_USER_AGENT="Research example@example.com"`.
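+
+A small sketch of the two requirements above, the 10-digit zero-padded CIK
+and the identifying `User-Agent`. The helper names are hypothetical, and the
+"contains an @" check is only a crude heuristic for "has contact info", not
+an SEC rule:
+
+```python
+import os
+import urllib.request
+
+
+def submissions_url(cik):
+    """Zero-pad a CIK to 10 digits and build the submissions endpoint URL."""
+    digits = str(cik).strip().lstrip("0") or "0"
+    if not digits.isdigit() or len(digits) > 10:
+        raise ValueError(f"not a valid CIK: {cik!r}")
+    return f"https://data.sec.gov/submissions/CIK{digits.zfill(10)}.json"
+
+
+def build_request(cik, user_agent=None):
+    """Prepare (but do not send) a request carrying a contact User-Agent."""
+    ua = user_agent or os.environ.get("SEC_USER_AGENT", "")
+    if "@" not in ua:
+        raise ValueError("set SEC_USER_AGENT to a string with a contact email")
+    return urllib.request.Request(submissions_url(cik), headers={"User-Agent": ua})
+
+
+req = build_request(320193, user_agent="Research example@example.com")
+print(req.full_url)
+```
+
+Passing the request to `urllib.request.urlopen` would perform the actual
+fetch; stay under the 10 requests/second limit when looping over many CIKs.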
+
+## 8. Legal & Licensing
+
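+- Public disclosure filings made under the Securities Act of 1933 and the
+  Securities Exchange Act of 1934; electronic filing is governed by
+  Regulation S-T (17 CFR Part 232)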
+- No commercial use restrictions on filing content
+- SEC asks all bulk users to include a `User-Agent` with contact info and to
+  respect 10 req/s — failure to do so can result in IP blocking
+
+## 9. References
+
+- Developer docs: https://www.sec.gov/edgar/sec-api-documentation
+- EDGAR full-text search: https://efts.sec.gov/LATEST/search-index
+- Fair access policy: https://www.sec.gov/os/accessing-edgar-data
diff --git a/optional-skills/research/osint-investigation/references/sources/senate-ld.md b/optional-skills/research/osint-investigation/references/sources/senate-ld.md
new file mode 100644
index 00000000000..5142dc6ea41
--- /dev/null
+++ b/optional-skills/research/osint-investigation/references/sources/senate-ld.md
@@ -0,0 +1,89 @@
+# Senate LD — Lobbying Disclosure (LD-1 / LD-2)
+
+## 1. Summary
+
+The Senate Office of Public Records publishes lobbying disclosures under the
+Lobbying Disclosure Act of 1995 (LDA, as amended by HLOGA 2007). LD-1 is
+registration of a new client-lobbyist relationship; LD-2 is the quarterly
+activity report.
+
+## 2. Access Methods
+
+- **API:** `https://lda.senate.gov/api/v1/` (no auth required for read-only)
+- **Bulk download:** `https://lda.senate.gov/api/v1/filings/?format=csv` (paginated)
+- **Auth:** Token required for >120 req/hour — register at https://lda.senate.gov/api/auth/register/
+- **Rate limit:** 120 req/hour unauthenticated, 1,200 req/hour authenticated
+
+## 3. Data Schema
+
+Key fields emitted by `fetch_senate_ld.py`:
+
+| Column | Type | Description |
+|--------|------|-------------|
+| `filing_uuid` | str | Unique filing ID |
+| `filing_type` | str | LD-1, LD-2, LD-203, etc. |
+| `filing_year` | int | Year |
+| `filing_period` | str | Q1/Q2/Q3/Q4 or annual |
+| `registrant_name` | str | Lobbying firm or organization |
+| `registrant_id` | str | Senate-assigned registrant ID |
+| `client_name` | str | Client being represented |
+| `client_id` | str | Senate-assigned client ID |
+| `client_general_description` | str | Client industry / business |
+| `income` | float | LD-2 income from client this quarter (USD) |
+| `expenses` | float | LD-2 expenses (in-house lobbying) |
+| `lobbyists` | str | Semicolon-separated lobbyist names |
+| `issues` | str | Semicolon-separated issue areas |
+| `government_entities` | str | Agencies/chambers contacted |
+| `filing_date` | str | YYYY-MM-DD |
+
+## 4. Coverage
+
+- US federal lobbying only (state lobbying handled by individual state ethics offices)
+- 1999 → present (full electronic coverage from 2008)
+- Quarterly reporting cycle (LD-2)
+- ~1M+ filings cumulative
+
+## 5. Cross-Reference Potential
+
+- **USAspending** ↔ `client_name` (clients lobbying for contracts)
+- **SEC EDGAR** ↔ `client_name` (public companies as lobbying clients)
+- **OFAC SDN** ↔ `client_name` (sanctions screening of lobbying clients)
+
+Join key: normalized `client_name`. `registrant_id` and `client_id` are canonical
+when joining Senate-internal records.
+
+## 6. Data Quality
+
+- Many lobbyist names appear in multiple registrants over time (job changes)
+- `issues` and `government_entities` are free-text, with inconsistent capitalization
+- Foreign agents register under FARA (Department of Justice), NOT here
+- Income/expenses are reported in $10,000 brackets in some older filings
+
+## 7. Acquisition Script
+
+Path: `scripts/fetch_senate_ld.py`
+
+```bash
+# By client
+python3 SKILL_DIR/scripts/fetch_senate_ld.py --client "EXAMPLE CORP" \
+    --year 2024 --out data/lobbying.csv
+
+# By registrant (lobbying firm)
+python3 SKILL_DIR/scripts/fetch_senate_ld.py --registrant "BIG K STREET LLP" \
+    --year 2024 --out data/lobbying.csv
+```
+
+Set `SENATE_LDA_TOKEN` env var if you have one (or pass `--token`).
+Defaults to anonymous (120 req/hour).
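+
+To stay inside those quotas when scripting your own calls, requests can be
+spaced evenly: 120 per hour works out to one every 30 seconds, 1,200 per hour
+to one every 3 seconds. The `Pacer` class below is a hypothetical sketch and
+not something `fetch_senate_ld.py` is known to contain:
+
+```python
+import time
+
+
+class Pacer:
+    """Space calls evenly so at most `per_hour` requests start per hour."""
+
+    def __init__(self, per_hour, clock=time.monotonic, sleep=time.sleep):
+        self.interval = 3600.0 / per_hour
+        self._clock = clock
+        self._sleep = sleep
+        self._next_ok = None  # earliest time the next request may start
+
+    def wait(self):
+        """Block until the next request is allowed, then reserve a slot."""
+        now = self._clock()
+        if self._next_ok is not None and now < self._next_ok:
+            self._sleep(self._next_ok - now)
+            now = self._next_ok
+        self._next_ok = now + self.interval
+
+
+anonymous = Pacer(per_hour=120)
+authenticated = Pacer(per_hour=1200)
+print(anonymous.interval, authenticated.interval)
+```
+
+Call `wait()` immediately before each request. The injectable `clock` and
+`sleep` arguments exist only so the pacing can be exercised without real
+delays.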
+
+## 8. Legal & Licensing
+
+- Public record under 2 U.S.C. § 1604 (LDA)
+- No commercial use restrictions
+- Reuse is unconditional — see Senate Public Records Office disclaimer
+
+## 9. References
+
+- API docs: https://lda.senate.gov/api/redoc/v1/
+- LDA guidance: https://lobbyingdisclosure.house.gov/ld_guidance.pdf
+- Senate Public Records: https://lda.senate.gov/
diff --git a/optional-skills/research/osint-investigation/references/sources/usaspending.md b/optional-skills/research/osint-investigation/references/sources/usaspending.md
new file mode 100644
index 00000000000..6477272293b
--- /dev/null
+++ b/optional-skills/research/osint-investigation/references/sources/usaspending.md
@@ -0,0 +1,97 @@
+# USAspending — Federal Government Contracts and Grants
+
+## 1. Summary
+
+USAspending.gov is the official source of federal spending data. Coverage:
+contracts, grants, loans, direct payments, sub-awards. Required by the DATA Act
+of 2014 — all federal agencies must report to a single schema.
+
+## 2. Access Methods
+
+- **API v2:** `https://api.usaspending.gov/api/v2/` (no auth, no key)
+- **Bulk:** `https://files.usaspending.gov/` (CSV / Parquet by award type)
+- **Auth:** None
+- **Rate limit:** Not strictly enforced, but be polite — keep to <10 req/s
+
+## 3. Data Schema
+
+Key fields emitted by `fetch_usaspending.py` (prime awards):
+
+| Column | Type | Description |
+|--------|------|-------------|
+| `award_id` | str | Federal award ID (PIID for contracts, FAIN for grants) |
+| `recipient_name` | str | Awardee legal name |
+| `recipient_uei` | str | Unique Entity Identifier (replaced DUNS in 2022) |
+| `recipient_duns` | str | Legacy DUNS number (historical only) |
+| `recipient_parent_name` | str | Ultimate parent organization |
+| `recipient_state` | str | Recipient state |
+| `awarding_agency` | str | Department / agency name |
+| `awarding_sub_agency` | str | Sub-tier (e.g. DoD → Army) |
+| `award_type` | str | Contract / Grant / Loan / Direct Payment |
+| `award_amount` | float | Current total obligation in USD |
+| `award_date` | str | Action / signed date YYYY-MM-DD |
+| `period_of_performance_start` | str | YYYY-MM-DD |
+| `period_of_performance_end` | str | YYYY-MM-DD |
+| `naics_code` | str | Industry classification |
+| `psc_code` | str | Product / Service Code |
+| `competition_extent` | str | Full / limited / sole-source |
+| `description` | str | Award description (free-text) |
+
+## 4. Coverage
+
+- US federal awards only (state/local not included)
+- FY 2008 → present (full coverage from FY 2017)
+- Updated bi-weekly from agency reporting
+- ~100M+ transaction records cumulative
+
+## 5. Cross-Reference Potential
+
+- **SEC EDGAR** ↔ `recipient_name` (public companies as contractors)
+- **Senate LD** ↔ `recipient_name` (lobbying clients winning contracts)
+- **OFAC SDN** ↔ `recipient_name` (sanctions screening of contractors; SAM.gov
+  should already filter these out, but verify)
+- **ICIJ Offshore** ↔ `recipient_name` (offshore-linked contractors)
+
+Join key: normalized recipient name. UEI is canonical when present.
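+
+One way to express "UEI when present, otherwise the normalized name" is a
+tagged key, sketched below. `join_key` and `_norm` are hypothetical helpers,
+the UEI values are made-up placeholders, and the normalization is a
+simplified stand-in for what `entity_resolution.py` does:
+
+```python
+import re
+
+
+def _norm(name):
+    """Uppercase, turn punctuation into spaces, collapse whitespace."""
+    return " ".join(re.sub(r"[^\w\s]", " ", (name or "").upper()).split())
+
+
+def join_key(row):
+    """Return a (kind, value) key: the UEI if present, else the normalized name."""
+    uei = (row.get("recipient_uei") or "").strip().upper()
+    if uei:
+        return ("uei", uei)
+    return ("name", _norm(row.get("recipient_name")))
+
+
+# Placeholder rows; the UEI strings are not real identifiers.
+rows = [
+    {"recipient_name": "ACME, INC", "recipient_uei": "ABC123DEF456"},
+    {"recipient_name": "Acme Inc.", "recipient_uei": "abc123def456"},
+    {"recipient_name": "ACME, INC", "recipient_uei": ""},
+]
+for r in rows:
+    print(join_key(r))
+```
+
+Tagging the key with its kind keeps a UEI from ever colliding with a name
+string. The third row shows the cost: a record without a UEI only links to
+the others through a separate name-matching pass.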
+
+## 6. Data Quality
+
+- DUNS → UEI transition (April 2022) — old records have DUNS, new records have UEI
+- Some sub-awards aren't reported (the FFATA reporting threshold is $30k, so
+  smaller sub-awards are exempt)
+- Award amount changes over time (mod actions) — fetch script reports current total
+- `competition_extent` field is free-text in older records — `fetch_usaspending.py`
+  normalizes to canonical values
+- Recipient name variations are extensive — "ACME LLC", "Acme L.L.C.", "ACME, INC"
+  all appear. Use `entity_resolution.py`.
+
+## 7. Acquisition Script
+
+Path: `scripts/fetch_usaspending.py`
+
+```bash
+# By recipient name
+python3 SKILL_DIR/scripts/fetch_usaspending.py --recipient "EXAMPLE CORP" \
+    --fy 2024 --out data/contracts.csv
+
+# By awarding agency
+python3 SKILL_DIR/scripts/fetch_usaspending.py --agency "Department of Defense" \
+    --fy 2024 --out data/contracts.csv
+
+# Filter to sole-source only
+python3 SKILL_DIR/scripts/fetch_usaspending.py --recipient "EXAMPLE CORP" \
+    --fy 2024 --sole-source-only --out data/contracts.csv
+```
+
+## 8. Legal & Licensing
+
+- Public record under the Federal Funding Accountability and Transparency Act
+  (FFATA, 2006) and DATA Act (2014)
+- No commercial use restrictions on the data
+- Personal information of award recipients (e.g. small business owners' addresses
+  in some grants) should be handled per the source agency's privacy notice
+
+## 9. References
+
+- API docs: https://api.usaspending.gov/
+- Data dictionary: https://www.usaspending.gov/data-dictionary
+- Award schema: https://files.usaspending.gov/docs/Data_Dictionary_Crosswalk.xlsx
diff --git a/optional-skills/research/osint-investigation/references/sources/wayback.md b/optional-skills/research/osint-investigation/references/sources/wayback.md
new file mode 100644
index 00000000000..f397c093a23
--- /dev/null
+++ b/optional-skills/research/osint-investigation/references/sources/wayback.md
@@ -0,0 +1,93 @@
+# Wayback Machine — Internet Archive CDX
+
+## 1. Summary
+
+The Internet Archive's Wayback Machine has captured ~900B+ web pages since
+1996. The CDX server API indexes those captures by URL, timestamp, and
+content hash. Free, anonymous, no auth.
+
+## 2. Access Methods
+
+- **CDX server:** `https://web.archive.org/cdx/search/cdx`
+- **Wayback URL:** `https://web.archive.org/web/<timestamp>/<url>`
+- **Save Page Now (write):** `https://web.archive.org/save/<url>` (different API)
+- **Auth:** None
+- **Rate limit:** Generous; be polite (~1 req/s)
+
+## 3. Data Schema
+
+Key fields emitted by `fetch_wayback.py`:
+
+| Column | Type | Description |
+|--------|------|-------------|
+| `url` | str | Original URL captured |
+| `timestamp` | str | YYYYMMDDHHMMSS (CDX format) |
+| `wayback_url` | str | Direct replay URL |
+| `mimetype` | str | Content-type at capture |
+| `status` | str | HTTP status (typically 200) |
+| `digest` | str | SHA1 of capture content (collapse-friendly) |
+| `length` | str | Byte length of capture |
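+
+The `timestamp` and `wayback_url` columns are related by simple string
+assembly. A short sketch, assuming 14-digit CDX timestamps interpreted as
+UTC; the function names are hypothetical:
+
+```python
+from datetime import datetime, timezone
+
+
+def parse_cdx_timestamp(ts):
+    """Parse a 14-digit CDX timestamp (YYYYMMDDHHMMSS) as a UTC datetime."""
+    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
+
+
+def replay_url(ts, url):
+    """Build the replay URL for one capture."""
+    return f"https://web.archive.org/web/{ts}/{url}"
+
+
+ts = "20200115123456"
+print(parse_cdx_timestamp(ts).isoformat())
+print(replay_url(ts, "https://example.com/page"))
+```
+
+Parsing the timestamp into a real datetime makes it straightforward to sort
+captures or compare them against filing and award dates from other sources.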
+
+## 4. Coverage
+
+- 1996 → present
+- ~900B+ captures across ~700M domains
+- Updated continuously by automated crawls + manual saves
+- Some domains have aggressive coverage (news), others sparse (private)
+
+## 5. Cross-Reference Potential
+
+- **Wikipedia** ↔ Reverse-lookup pages cited as references that have since
+  disappeared
+- **News URLs** ↔ Original article content when present-day URLs 404
+- **Corporate websites** ↔ Historical "About" pages, executive bios that
+  have been scrubbed
+
+The Wayback CDX is most useful as a **content-recovery** layer when other
+sources point to URLs that no longer exist.
+
+## 6. Data Quality
+
+- robots.txt-blocked domains may have spotty or no coverage
+- Captures vary in completeness (HTML may be saved without CSS/JS)
+- Some content is excluded by domain owner request (DMCA, etc.)
+- Coverage of "deep links" (URLs with query strings) is uneven
+- Time resolution is per-capture, not continuous — gaps are common
+
+## 7. Acquisition Script
+
+Path: `scripts/fetch_wayback.py`
+
+```bash
+# All captures of a specific URL
+python3 SKILL_DIR/scripts/fetch_wayback.py --url "https://example.com/page" \
+    --out data/wb.csv
+
+# All captures of a host
+python3 SKILL_DIR/scripts/fetch_wayback.py --url "example.com" \
+    --match host --out data/wb.csv
+
+# All captures of a domain + subdomains
+python3 SKILL_DIR/scripts/fetch_wayback.py --url "example.com" \
+    --match domain --out data/wb.csv
+
+# Only unique-content captures within a date window
+python3 SKILL_DIR/scripts/fetch_wayback.py --url "example.com" \
+    --match host --collapse digest \
+    --from-date 2020-01-01 --to-date 2023-12-31 \
+    --out data/wb.csv
+```
+
+## 8. Legal & Licensing
+
+- Internet Archive captures are made under fair-use research provisions
+- Replay URLs are stable references — citing them is encouraged
+- Internet Archive non-profit terms of use govern content
+- Some content is rights-restricted; replay may be blocked even if the
+  CDX entry shows it as captured
+
+## 9. References
+
+- CDX server docs: https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md
+- Wayback API: https://archive.org/help/wayback_api.php
+- Internet Archive: https://archive.org/
diff --git a/optional-skills/research/osint-investigation/references/sources/wikipedia.md b/optional-skills/research/osint-investigation/references/sources/wikipedia.md
new file mode 100644
index 00000000000..1a004bf2e8d
--- /dev/null
+++ b/optional-skills/research/osint-investigation/references/sources/wikipedia.md
@@ -0,0 +1,107 @@
+# Wikipedia + Wikidata
+
+## 1. Summary
+
+Wikipedia is the canonical narrative-bio source for notable people, places,
+and organizations. Wikidata is its structured-data counterpart: ~110M
+items, each with claims, dates, identifiers, and cross-references to
+external authorities (VIAF, ISNI, ORCID, GRID, etc.).
+
+Together they're a high-precision entity-resolution layer — the bar for
+inclusion is real, but anything past that bar is well-cross-referenced.
+
+## 2. Access Methods
+
+- **Wikipedia OpenSearch:** `https://en.wikipedia.org/w/api.php?action=opensearch`
+- **Wikipedia REST summary:** `https://en.wikipedia.org/api/rest_v1/page/summary/<title>`
+- **Wikidata Action API:** `https://www.wikidata.org/w/api.php?action=wbgetentities`
+- **Wikidata SPARQL:** `https://query.wikidata.org/sparql` (more powerful but aggressively rate-limited)
+- **Auth:** None, but **a meaningful User-Agent is required**
+
+Set `HERMES_OSINT_UA` to something identifying (e.g. `your-app/1.0 (you@example.com)`).
+Wikimedia returns HTTP 429 to generic UAs.
+
+## 3. Data Schema
+
+Key fields emitted by `fetch_wikipedia.py`:
+
+| Column | Type | Description |
+|--------|------|-------------|
+| `source` | str | `wikipedia` or `wikipedia+wikidata` |
+| `label` | str | Wikipedia article title |
+| `description` | str | Short Wikidata description |
+| `qid` | str | Wikidata QID (e.g. Q2283 for Microsoft) |
+| `wikipedia_title`, `wikipedia_url` | str | Article identifier + URL |
+| `wikidata_url` | str | Wikidata entity URL |
+| `instance_of` | str | What kind of thing it is (P31) |
+| `country` | str | Country (P17 for orgs/places, P27 for people) |
+| `occupation` | str | P106 |
+| `employer` | str | P108 |
+| `date_of_birth` | str | P569, YYYY-MM-DD |
+| `place_of_birth` | str | P19 |
+| `summary` | str | Wikipedia REST extract (~1000 chars) |
+
+The fetch script uses Wikidata's Action API (NOT SPARQL) for structured
+facts — far more lenient on rate limits.
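+
+For orientation, a minimal sketch of an Action API lookup: it builds a
+`wbgetentities` URL and pulls item QIDs out of a claims block such as P31.
+The helper names and the sample entity are hypothetical, and the real
+`fetch_wikipedia.py` may structure this differently:
+
+```python
+from urllib.parse import urlencode
+
+WIKIDATA_API = "https://www.wikidata.org/w/api.php"
+
+
+def wbgetentities_url(qids, props=("labels", "descriptions", "claims"), lang="en"):
+    """Build a wbgetentities URL for one or more QIDs."""
+    params = {
+        "action": "wbgetentities",
+        "ids": "|".join(qids),
+        "props": "|".join(props),
+        "languages": lang,
+        "format": "json",
+    }
+    return f"{WIKIDATA_API}?{urlencode(params)}"
+
+
+def claim_item_ids(entity, prop):
+    """Return the QIDs that an entity's claims for `prop` point to."""
+    out = []
+    for claim in entity.get("claims", {}).get(prop, []):
+        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
+        if isinstance(value, dict) and "id" in value:
+            out.append(value["id"])
+    return out
+
+
+# Hand-written fragment in the shape of an API response, for illustration.
+sample = {"claims": {"P31": [{"mainsnak": {"datavalue": {"value": {"id": "Q4830453"}}}}]}}
+print(wbgetentities_url(["Q2283"]))
+print(claim_item_ids(sample, "P31"))
+```
+
+Claims with no value (for example "unknown value" snaks) have no `datavalue`
+key, which is why every level is read with `.get`.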
+
+## 4. Coverage
+
+- Wikipedia EN: ~7M articles
+- Wikidata: ~110M items, ~1.5B statements
+- Updated continuously; abuse filters and bots run constantly
+- High notability bar — most private individuals are not in Wikipedia
+
+## 5. Cross-Reference Potential
+
+- **All sources** ↔ `label` (entity identity resolution)
+- **SEC EDGAR** ↔ `label` (public companies)
+- **CourtListener** ↔ `label` (parties to notable litigation)
+- **Wikidata external identifiers** (not currently in this fetcher's output)
+  link to VIAF, ISNI, ORCID, GRID, GitHub, Twitter, IMDb, ...
+
+Join key: Wikidata QID is canonical. Wikipedia titles are stable for
+most articles but can be renamed.
+
+## 6. Data Quality
+
+- Notability filter — only notable entities (criteria vary by topic)
+- Recency lag — current events take days to weeks to be reflected
+- POV / vandalism — moderated, but edits between sweeps can be bad
+- Living-persons biographies have stricter sourcing requirements
+- Wikidata claims have qualifiers and references — the fetch script
+  doesn't currently export them
+
+## 7. Acquisition Script
+
+Path: `scripts/fetch_wikipedia.py`
+
+```bash
+# Look up a notable entity
+python3 SKILL_DIR/scripts/fetch_wikipedia.py --query "Microsoft" --out data/wp.csv
+
+# A specific person
+python3 SKILL_DIR/scripts/fetch_wikipedia.py --query "Bill Gates" --out data/wp_bg.csv
+
+# Skip the Wikidata enrichment for speed
+python3 SKILL_DIR/scripts/fetch_wikipedia.py --query "Microsoft" --no-wikidata \
+    --limit 5 --out data/wp.csv
+```
+
+OpenSearch matching is fuzzy: `--limit 5` returns the top 5 Wikipedia article
+matches. Each is enriched with the QID + structured facts unless
+`--no-wikidata` is passed.
+
+## 8. Legal & Licensing
+
+- Wikipedia text: CC-BY-SA-4.0 (CC-BY-SA-3.0 before June 2023) / GFDL
+- Wikidata claims: CC0 (public domain)
+- API ToS: respect rate limits, identify your agent
+- Commercial use allowed with attribution
+
+## 9. References
+
+- Wikipedia OpenSearch: https://www.mediawiki.org/wiki/API:Opensearch
+- Wikipedia REST: https://en.wikipedia.org/api/rest_v1/
+- Wikidata Action API: https://www.wikidata.org/wiki/Wikidata:Data_access
+- Wikidata SPARQL: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service
+- User-Agent policy: https://meta.wikimedia.org/wiki/User-Agent_policy
diff --git a/optional-skills/research/osint-investigation/scripts/_http.py b/optional-skills/research/osint-investigation/scripts/_http.py
new file mode 100644
index 00000000000..5da62310b9f
--- /dev/null
+++ b/optional-skills/research/osint-investigation/scripts/_http.py
@@ -0,0 +1,82 @@
+"""Tiny stdlib HTTP helper used by fetch_*.py scripts.
+
+Provides polite retry + JSON convenience + User-Agent enforcement.
+"""
+from __future__ import annotations
+
+import json
+import os
+import time
+import urllib.error
+import urllib.parse
+import urllib.request
+
+DEFAULT_UA = (
+    "hermes-osint-investigation/0.2 "
+    "(+https://github.com/NousResearch/hermes-agent; "
+    "set HERMES_OSINT_UA env var to identify yourself per "
+    "Wikimedia / SEC fair-use guidance)"
+)
+
+
+def get(
+    url: str,
+    *,
+    params: dict | None = None,
+    headers: dict | None = None,
+    user_agent: str | None = None,
+    max_retries: int = 3,
+    backoff: float = 1.5,
+    timeout: float = 30.0,
+) -> bytes:
+    """GET with retry on 5xx and Retry-After honoring.
+
+    429 (rate-limit) is raised IMMEDIATELY with a clear message — retrying
+    when the upstream says "you're over quota" just wastes time. The caller
+    should slow down or supply real credentials.
+    """
+    if params:
+        sep = "&" if "?" in url else "?"
+        url = f"{url}{sep}{urllib.parse.urlencode(params)}"
+    h = {"User-Agent": user_agent or os.environ.get("HERMES_OSINT_UA", DEFAULT_UA)}
+    if headers:
+        h.update(headers)
+
+    last_err: Exception | None = None
+    for attempt in range(max_retries + 1):
+        req = urllib.request.Request(url, headers=h)
+        try:
+            with urllib.request.urlopen(req, timeout=timeout) as resp:
+                return resp.read()
+        except urllib.error.HTTPError as e:
+            if e.code == 429:
+                # Surface immediately. Read the body so the caller sees the
+                # provider's actual message ("OVER_RATE_LIMIT" etc.).
+                try:
+                    body = e.read(2048).decode("utf-8", errors="replace")
+                except Exception:  # noqa: BLE001
+                    body = ""
+                raise RuntimeError(
+                    f"HTTP 429 rate-limited by {urllib.parse.urlsplit(url).netloc}. "
+                    f"Slow down or supply a real API key. Body: {body[:300]}"
+                ) from e
+            if e.code in (500, 502, 503, 504) and attempt < max_retries:
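+                retry_after = e.headers.get("Retry-After") if e.headers else None
+                if retry_after and retry_after.isdigit():
+                    # Server supplied a delay in whole seconds; honor it.
+                    wait = float(retry_after)
+                else:
+                    # No usable Retry-After: fall back to exponential backoff.
+                    wait = backoff ** (attempt + 1)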
+                time.sleep(wait)
+                last_err = e
+                continue
+            raise
+        except urllib.error.URLError as e:
+            if attempt < max_retries:
+                time.sleep(backoff ** (attempt + 1))
+                last_err = e
+                continue
+            raise
+    if last_err:
+        raise last_err
+    raise RuntimeError("unreachable")
+
+
+def get_json(url: str, **kwargs) -> dict | list:
+    """GET `url` via `get()` and decode the UTF-8 body as JSON."""
+    return json.loads(get(url, **kwargs).decode("utf-8"))
diff --git a/optional-skills/research/osint-investigation/scripts/_normalize.py b/optional-skills/research/osint-investigation/scripts/_normalize.py
new file mode 100644
index 00000000000..3c9a197af8b
--- /dev/null
+++ b/optional-skills/research/osint-investigation/scripts/_normalize.py
@@ -0,0 +1,67 @@
+"""Shared entity-name normalization helpers (stdlib-only).
+
+Used by entity_resolution.py and timing_analysis.py.
+"""
+from __future__ import annotations
+
+import re
+
+# Legal suffixes / corporate boilerplate to strip during normalization.
+_SUFFIX_TOKENS = {
+    "INC", "INCORPORATED", "LLC", "LLP", "LP", "LTD", "LIMITED",
+    "CORP", "CORPORATION", "CO", "COMPANY",
+    "GROUP", "GRP", "HOLDINGS", "HOLDING",
+    "PARTNERS", "ASSOCIATES",
+    "INTERNATIONAL", "INTL",
+    "ENTERPRISES", "ENTERPRISE",
+    "SERVICES", "SERVICE", "SVCS",
+    "SOLUTIONS", "MANAGEMENT", "MGMT", "CONSULTING",
+    "TECHNOLOGY", "TECHNOLOGIES", "TECH",
+    "INDUSTRIES", "INDUSTRY",
+    "AMERICA", "AMERICAN",
+    "USA", "US",
+    "PLLC", "PC",
+    "TRUST", "FOUNDATION",
+}
+
+_PUNCT_RE = re.compile(r"[^\w\s]")
+_WS_RE = re.compile(r"\s+")
+
+
+def normalize_name(name: str | None) -> str:
+    """Standard normalization: uppercase, strip suffixes, drop punctuation."""
+    if not name:
+        return ""
+    s = _PUNCT_RE.sub(" ", name.upper())
+    s = _WS_RE.sub(" ", s).strip()
+    tokens = [t for t in s.split() if t and t not in _SUFFIX_TOKENS]
+    return " ".join(tokens)
+
+
+def normalize_aggressive(name: str | None) -> str:
+    """Aggressive normalization: sorted unique tokens (word-bag)."""
+    base = normalize_name(name)
+    if not base:
+        return ""
+    return " ".join(sorted(set(base.split())))
+
+
+def name_tokens(name: str | None, min_len: int = 4) -> set[str]:
+    """Token set used for overlap matching."""
+    base = normalize_name(name)
+    if not base:
+        return set()
+    return {t for t in base.split() if len(t) >= min_len}
+
+
+def token_overlap_ratio(left: str | None, right: str | None) -> tuple[float, int]:
+    """Return (jaccard-like ratio, shared token count) over min-len tokens."""
+    a = name_tokens(left)
+    b = name_tokens(right)
+    if not a or not b:
+        return 0.0, 0
+    shared = a & b
+    if not shared:
+        return 0.0, 0
+    union = a | b
+    return len(shared) / len(union), len(shared)
diff --git a/optional-skills/research/osint-investigation/scripts/build_findings.py b/optional-skills/research/osint-investigation/scripts/build_findings.py
new file mode 100644
index 00000000000..15021eb0878
--- /dev/null
+++ b/optional-skills/research/osint-investigation/scripts/build_findings.py
@@ -0,0 +1,221 @@
+#!/usr/bin/env python3
+"""Build a structured findings.json with evidence chains (stdlib-only).
+
+Aggregates cross_links.csv (entity_resolution output) and an optional
+timing.json (timing_analysis output) into a single evidence-chain document.
+
+Output structure:
+    {
+      "metadata": {...},
+      "findings": [
+        {
+          "id": "F0001",
+          "title": "...",
+          "severity": "HIGH|MEDIUM|LOW",
+          "confidence": "high|medium|low",
+          "summary": "...",
+          "evidence": [
+            {"source": "cross_links.csv", "row": 12, "fields": {...}},
+            ...
+          ],
+          "sources": ["cross_links.csv", "timing.json"]
+        }
+      ]
+    }
+
+Every finding traces to specific source rows. No naked claims.
+"""
+from __future__ import annotations
+
+import argparse
+import csv
+import json
+from collections import defaultdict
+from pathlib import Path
+
+CONFIDENCE_ORDER = {"high": 0, "medium": 1, "low": 2}
+SEVERITY_ORDER = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}
+
+
+def _read_cross_links(path: str) -> list[dict[str, str]]:
+    with open(path, newline="", encoding="utf-8") as fh:
+        return list(csv.DictReader(fh))
+
+
+def build_findings(
+    cross_links_path: str,
+    timing_path: str | None = None,
+    out_path: str = "findings.json",
+    bundled_threshold: int = 3,
+) -> dict:
+    findings: list[dict] = []
+    next_id = 1
+
+    # 1. Match-based findings, grouped by (left_normalized, right_normalized).
+    matches = _read_cross_links(cross_links_path)
+    grouped: dict[tuple[str, str], list[dict[str, str]]] = defaultdict(list)
+    for i, row in enumerate(matches):
+        row["__row__"] = str(i)
+        grouped[(row.get("left_normalized", ""), row.get("right_normalized", ""))].append(row)
+
+    for (left_norm, right_norm), rows in grouped.items():
+        if not left_norm or not right_norm:
+            continue
+        # Use the highest-confidence match for the finding's overall confidence.
+        best = min(rows, key=lambda r: CONFIDENCE_ORDER.get(r.get("confidence", "low"), 2))
+        finding_id = f"F{next_id:04d}"
+        next_id += 1
+        evidence = [
+            {
+                "source": "cross_links.csv",
+                "row": int(r["__row__"]),
+                "fields": {
+                    "match_type": r.get("match_type", ""),
+                    "confidence": r.get("confidence", ""),
+                    "left_name": r.get("left_name", ""),
+                    "right_name": r.get("right_name", ""),
+                    "overlap_ratio": r.get("overlap_ratio", ""),
+                    "shared_tokens": r.get("shared_tokens", ""),
+                },
+            }
+            for r in rows
+        ]
+        findings.append(
+            {
+                "id": finding_id,
+                "title": f"Entity match: {best.get('left_name', '')} ↔ {best.get('right_name', '')}",
+                "severity": "MEDIUM" if best.get("confidence") == "high" else "LOW",
+                "confidence": best.get("confidence", "low"),
+                "summary": (
+                    f"{len(rows)} cross-link record(s) tie "
+                    f"'{best.get('left_name', '')}' to "
+                    f"'{best.get('right_name', '')}' "
+                    f"(best tier: {best.get('match_type', '')})."
+                ),
+                "evidence": evidence,
+                "sources": ["cross_links.csv"],
+            }
+        )
+
+    # 2. Bundled-donations findings (if cross_links carries donor↔candidate pattern).
+    #    Heuristic: many distinct left names sharing the same right name.
+    by_right: dict[str, set[str]] = defaultdict(set)
+    by_right_rows: dict[str, list[dict[str, str]]] = defaultdict(list)
+    for r in matches:
+        right = r.get("right_normalized", "")
+        left_raw = r.get("left_name", "").strip()
+        if right and left_raw:
+            by_right[right].add(left_raw)
+            by_right_rows[right].append(r)
+    for right_norm, lefts in by_right.items():
+        if len(lefts) < bundled_threshold:
+            continue
+        rows = by_right_rows[right_norm]
+        right_raw = rows[0].get("right_name", "")
+        findings.append(
+            {
+                "id": f"F{next_id:04d}",
+                "title": f"Bundled cross-links: {len(lefts)} distinct left entities ↔ '{right_raw}'",
+                "severity": "HIGH",
+                "confidence": "medium",
+                "summary": (
+                    f"{len(lefts)} distinct left-side entities link to "
+                    f"'{right_raw}'. This pattern may indicate a coordinated "
+                    "relationship (e.g. bundled donations, multi-vendor employer)."
+                ),
+                "evidence": [
+                    {
+                        "source": "cross_links.csv",
+                        "row": int(r.get("__row__", "0")),
+                        "fields": {
+                            "left_name": r.get("left_name", ""),
+                            "match_type": r.get("match_type", ""),
+                        },
+                    }
+                    for r in rows
+                ],
+                "sources": ["cross_links.csv"],
+            }
+        )
+        next_id += 1
+
+    # 3. Timing-based findings.
+    if timing_path and Path(timing_path).exists():
+        timing = json.loads(Path(timing_path).read_text(encoding="utf-8"))
+        for r in timing.get("results", []):
+            if not r.get("significant"):
+                continue
+            findings.append(
+                {
+                    "id": f"F{next_id:04d}",
+                    "title": (
+                        "Donation timing significantly clusters near awards: "
+                        f"{r['donor']} ↔ {r['recipient']}"
+                    ),
+                    "severity": "HIGH" if r["p_value"] < 0.01 else "MEDIUM",
+                    "confidence": "medium",
+                    "summary": (
+                        f"Mean nearest-award distance {r['observed_mean_days']} days "
+                        f"(null {r['null_mean_days']} days). p={r['p_value']}, "
+                        f"effect size {r['effect_size_sd']} SD. "
+                        f"{r['n_donations']} donations, {r['n_award_dates']} awards."
+                    ),
+                    "evidence": [
+                        {
+                            "source": "timing.json",
+                            "row": None,
+                            "fields": r,
+                        }
+                    ],
+                    "sources": ["timing.json"],
+                }
+            )
+            next_id += 1
+
+    # Sort: severity → confidence → id.
+    findings.sort(
+        key=lambda f: (
+            SEVERITY_ORDER.get(f["severity"], 3),
+            CONFIDENCE_ORDER.get(f["confidence"], 3),
+            f["id"],
+        )
+    )
+
+    payload = {
+        "metadata": {
+            "n_findings": len(findings),
+            "cross_links_path": cross_links_path,
+            "timing_path": timing_path,
+            "bundled_threshold": bundled_threshold,
+        },
+        "findings": findings,
+    }
+    Path(out_path).write_text(json.dumps(payload, indent=2), encoding="utf-8")
+    return payload
+
+
+def main() -> int:
+    p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    p.add_argument("--cross-links", required=True)
+    p.add_argument("--timing", help="Optional timing.json from timing_analysis.py")
+    p.add_argument("--out", default="findings.json")
+    p.add_argument(
+        "--bundled-threshold",
+        type=int,
+        default=3,
+        help="Minimum distinct left entities to flag as bundled (default 3)",
+    )
+    a = p.parse_args()
+
+    payload = build_findings(
+        cross_links_path=a.cross_links,
+        timing_path=a.timing,
+        out_path=a.out,
+        bundled_threshold=a.bundled_threshold,
+    )
+    print(f"Wrote {payload['metadata']['n_findings']} findings to {a.out}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/optional-skills/research/osint-investigation/scripts/entity_resolution.py b/optional-skills/research/osint-investigation/scripts/entity_resolution.py
new file mode 100644
index 00000000000..26d60d433d4
--- /dev/null
+++ b/optional-skills/research/osint-investigation/scripts/entity_resolution.py
@@ -0,0 +1,228 @@
+#!/usr/bin/env python3
+"""Cross-source entity resolution (stdlib-only).
+
+Given two CSV files with name columns, find candidate matches using three
+tiers of normalization:
+
+  1. exact          — normalized strings equal
+  2. fuzzy          — sorted-token (word-bag) match
+  3. token_overlap  — >=60% Jaccard overlap on >=4-char tokens, >=2 shared
+
+Adapted from ShinMegamiBoson/OpenPlanter (MIT) but generalized: no Boston-
+specific record types, no contribution-code filters, no fixed schemas.
+
+Output CSV columns:
+    match_type, confidence, left_name, right_name,
+    left_normalized, right_normalized, left_row, right_row,
+    overlap_ratio, shared_tokens
+"""
+from __future__ import annotations
+
+import argparse
+import csv
+import sys
+from pathlib import Path
+
+# Allow running directly or as a module.
+sys.path.insert(0, str(Path(__file__).parent))
+from _normalize import (  # noqa: E402
+    normalize_name,
+    normalize_aggressive,
+    token_overlap_ratio,
+)
+
+CONFIDENCE = {
+    "exact": "high",
+    "fuzzy": "medium",
+    "token_overlap": "low",
+}
+
+
+def _read_csv(path: str, name_col: str) -> list[dict[str, str]]:
+    rows = []
+    with open(path, newline="", encoding="utf-8") as fh:
+        reader = csv.DictReader(fh)
+        if name_col not in (reader.fieldnames or []):
+            raise SystemExit(
+                f"Column {name_col!r} not in {path}. "
+                f"Available: {reader.fieldnames}"
+            )
+        for i, row in enumerate(reader):
+            row["__row__"] = str(i)
+            rows.append(row)
+    return rows
+
+
+def _build_index(rows: list[dict[str, str]], name_col: str):
+    """Index by exact-normalized and aggressive (sorted-token) form."""
+    exact: dict[str, list[dict[str, str]]] = {}
+    aggressive: dict[str, list[dict[str, str]]] = {}
+    for row in rows:
+        raw = row.get(name_col, "")
+        n = normalize_name(raw)
+        if n:
+            exact.setdefault(n, []).append(row)
+        a = normalize_aggressive(raw)
+        if a:
+            aggressive.setdefault(a, []).append(row)
+    return exact, aggressive
+
+
+def _emit(
+    out_rows: list[dict[str, str]],
+    seen: set[tuple],
+    match_type: str,
+    left_row: dict[str, str],
+    right_row: dict[str, str],
+    left_col: str,
+    right_col: str,
+    ratio: float = 0.0,
+    shared: int = 0,
+):
+    left_raw = left_row.get(left_col, "")
+    right_raw = right_row.get(right_col, "")
+    key = (
+        left_row["__row__"],
+        right_row["__row__"],
+        match_type,
+    )
+    if key in seen:
+        return
+    seen.add(key)
+    out_rows.append(
+        {
+            "match_type": match_type,
+            "confidence": CONFIDENCE[match_type],
+            "left_name": left_raw,
+            "right_name": right_raw,
+            "left_normalized": normalize_name(left_raw),
+            "right_normalized": normalize_name(right_raw),
+            "left_row": left_row["__row__"],
+            "right_row": right_row["__row__"],
+            "overlap_ratio": f"{ratio:.3f}" if ratio else "",
+            "shared_tokens": str(shared) if shared else "",
+        }
+    )
+
+
+def resolve(
+    left_path: str,
+    left_col: str,
+    right_path: str,
+    right_col: str,
+    out_path: str,
+    overlap_threshold: float = 0.60,
+    min_shared: int = 2,
+    skip_overlap: bool = False,
+) -> int:
+    left_rows = _read_csv(left_path, left_col)
+    right_rows = _read_csv(right_path, right_col)
+
+    right_exact, right_aggressive = _build_index(right_rows, right_col)
+
+    out_rows: list[dict[str, str]] = []
+    seen: set[tuple] = set()
+
+    # Pass 1+2: exact / fuzzy via index lookup.
+    for lrow in left_rows:
+        raw = lrow.get(left_col, "")
+        n = normalize_name(raw)
+        if not n:
+            continue
+        for rrow in right_exact.get(n, []):
+            _emit(out_rows, seen, "exact", lrow, rrow, left_col, right_col)
+        a = normalize_aggressive(raw)
+        if a:
+            for rrow in right_aggressive.get(a, []):
+                _emit(out_rows, seen, "fuzzy", lrow, rrow, left_col, right_col)
+
+    if not skip_overlap:
+        # Pass 3: token overlap (O(N*M) — expensive; allow opt-out).
+        for lrow in left_rows:
+            l_raw = lrow.get(left_col, "")
+            if not normalize_name(l_raw):
+                continue
+            for rrow in right_rows:
+                ratio, shared = token_overlap_ratio(
+                    l_raw, rrow.get(right_col, "")
+                )
+                if ratio >= overlap_threshold and shared >= min_shared:
+                    _emit(
+                        out_rows,
+                        seen,
+                        "token_overlap",
+                        lrow,
+                        rrow,
+                        left_col,
+                        right_col,
+                        ratio=ratio,
+                        shared=shared,
+                    )
+
+    fieldnames = [
+        "match_type",
+        "confidence",
+        "left_name",
+        "right_name",
+        "left_normalized",
+        "right_normalized",
+        "left_row",
+        "right_row",
+        "overlap_ratio",
+        "shared_tokens",
+    ]
+    with open(out_path, "w", newline="", encoding="utf-8") as fh:
+        writer = csv.DictWriter(fh, fieldnames=fieldnames)
+        writer.writeheader()
+        writer.writerows(out_rows)
+    return len(out_rows)
+
+
+def main() -> int:
+    p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    p.add_argument("--left", required=True, help="Left CSV path")
+    p.add_argument(
+        "--left-name-col", required=True, help="Name column in left CSV"
+    )
+    p.add_argument("--right", required=True, help="Right CSV path")
+    p.add_argument(
+        "--right-name-col",
+        required=True,
+        help="Name column in right CSV",
+    )
+    p.add_argument("--out", required=True, help="Output CSV path")
+    p.add_argument(
+        "--overlap-threshold",
+        type=float,
+        default=0.60,
+        help="Jaccard overlap threshold for token_overlap tier (default 0.60)",
+    )
+    p.add_argument(
+        "--min-shared",
+        type=int,
+        default=2,
+        help="Minimum shared tokens for token_overlap tier (default 2)",
+    )
+    p.add_argument(
+        "--skip-overlap",
+        action="store_true",
+        help="Skip the O(N*M) token_overlap pass (much faster on large CSVs)",
+    )
+    args = p.parse_args()
+
+    count = resolve(
+        left_path=args.left,
+        left_col=args.left_name_col,
+        right_path=args.right,
+        right_col=args.right_name_col,
+        out_path=args.out,
+        overlap_threshold=args.overlap_threshold,
+        min_shared=args.min_shared,
+        skip_overlap=args.skip_overlap,
+    )
+    print(f"Wrote {count} match rows to {args.out}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/optional-skills/research/osint-investigation/scripts/fetch_courtlistener.py b/optional-skills/research/osint-investigation/scripts/fetch_courtlistener.py
new file mode 100644
index 00000000000..db5e715bf57
--- /dev/null
+++ b/optional-skills/research/osint-investigation/scripts/fetch_courtlistener.py
@@ -0,0 +1,149 @@
+#!/usr/bin/env python3
+"""Search court records via CourtListener (Free Law Project).
+
+Covers ~10M federal and state court opinions, plus PACER docket data
+where available. Public REST API v4 supports anonymous read access for
+search; some endpoints require a token (free at courtlistener.com).
+
+Set COURTLISTENER_TOKEN to authenticate (raises rate limits).
+"""
+from __future__ import annotations
+
+import argparse
+import csv
+import os
+import sys
+import urllib.parse
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).parent))
+from _http import get_json  # noqa: E402
+
+BASE = "https://www.courtlistener.com/api/rest/v4/search/"
+
+COLUMNS = [
+    "case_name",
+    "court",
+    "court_id",
+    "date_filed",
+    "docket_number",
+    "judge",
+    "citation",
+    "result_type",
+    "snippet",
+    "absolute_url",
+]
+
+SEARCH_TYPES = {
+    "opinions": "o",       # Court opinions
+    "dockets": "r",        # PACER dockets (may require auth depending on coverage)
+    "oral": "oa",          # Oral arguments
+    "people": "p",         # Judges / people
+    "recap": "r",          # Same as dockets in v4
+}
+
+
+def fetch(
+    query: str,
+    search_type: str,
+    court: str | None,
+    date_from: str | None,
+    date_to: str | None,
+    token: str | None,
+    limit: int,
+    out_path: str,
+) -> int:
+    type_code = SEARCH_TYPES.get(search_type, search_type)
+    params = {
+        "q": query,
+        "type": type_code,
+    }
+    if court:
+        params["court"] = court
+    if date_from:
+        params["filed_after"] = date_from
+    if date_to:
+        params["filed_before"] = date_to
+    headers = {"Authorization": f"Token {token}"} if token else None
+
+    rows: list[dict[str, str]] = []
+    next_url: str | None = f"{BASE}?{urllib.parse.urlencode(params)}"
+    while next_url and len(rows) < limit:
+        try:
+            payload = get_json(next_url, headers=headers)
+        except Exception as e:  # noqa: BLE001
+            print(f"CourtListener error: {e}", file=sys.stderr)
+            break
+        if not isinstance(payload, dict):
+            break
+        results = payload.get("results", [])
+        for r in results:
+            if len(rows) >= limit:
+                break
+            rows.append(
+                {
+                    "case_name": r.get("caseName", "") or r.get("case_name", "") or "",
+                    "court": r.get("court", "") or "",
+                    "court_id": r.get("court_id", "") or "",
+                    "date_filed": (r.get("dateFiled", "") or r.get("date_filed", "") or "")[:10],
+                    "docket_number": r.get("docketNumber", "") or r.get("docket_number", "") or "",
+                    "judge": r.get("judge", "") or "",
+                    "citation": "; ".join(r.get("citation", []) or []) if isinstance(r.get("citation"), list) else (r.get("citation") or ""),
+                    "result_type": search_type,
+                    "snippet": (r.get("snippet", "") or "").replace("\n", " ")[:500],
+                    "absolute_url": (
+                        f"https://www.courtlistener.com{r.get('absolute_url')}"
+                        if (r.get("absolute_url") or "").startswith("/")
+                        else (r.get("absolute_url") or "")
+                    ),
+                }
+            )
+        next_url = payload.get("next")
+
+    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
+    with open(out_path, "w", newline="", encoding="utf-8") as fh:
+        w = csv.DictWriter(fh, fieldnames=COLUMNS)
+        w.writeheader()
+        w.writerows(rows)
+    if not rows:
+        print(
+            f"CourtListener: 0 results for type={search_type!r} q={query!r}. "
+            "Most private individuals don't appear in published court records "
+            "unless they were party to a federal or state appellate case.",
+            file=sys.stderr,
+        )
+    return len(rows)
+
+
+def main() -> int:
+    p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    p.add_argument("--query", required=True, help="Search query (party name, case name, keyword)")
+    p.add_argument(
+        "--type",
+        default="opinions",
+        choices=list(SEARCH_TYPES.keys()),
+        help="Search type (default: opinions)",
+    )
+    p.add_argument("--court", help="Court ID filter (e.g. 'nysd' = SDNY, 'scotus' = Supreme Court)")
+    p.add_argument("--date-from", help="Filed-after date YYYY-MM-DD")
+    p.add_argument("--date-to", help="Filed-before date YYYY-MM-DD")
+    p.add_argument("--token", default=os.environ.get("COURTLISTENER_TOKEN"))
+    p.add_argument("--limit", type=int, default=100)
+    p.add_argument("--out", required=True)
+    a = p.parse_args()
+    n = fetch(
+        query=a.query,
+        search_type=a.type,
+        court=a.court,
+        date_from=a.date_from,
+        date_to=a.date_to,
+        token=a.token,
+        limit=a.limit,
+        out_path=a.out,
+    )
+    print(f"Wrote {n} CourtListener rows to {a.out}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/optional-skills/research/osint-investigation/scripts/fetch_gdelt.py b/optional-skills/research/osint-investigation/scripts/fetch_gdelt.py
new file mode 100644
index 00000000000..fa98dabc9bb
--- /dev/null
+++ b/optional-skills/research/osint-investigation/scripts/fetch_gdelt.py
@@ -0,0 +1,162 @@
+#!/usr/bin/env python3
+"""Search the GDELT 2.0 DOC API for news mentions.
+
+GDELT monitors world news in 100+ languages and indexes the full text.
+Free, anonymous, ~15-minute update frequency. Covers ~2015→present.
+
+Useful for surfacing news mentions of a person, company, or topic across
+international media — much wider net than Google News.
+"""
+from __future__ import annotations
+
+import argparse
+import csv
+import json
+import sys
+import time
+import urllib.parse
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).parent))
+from _http import get_json  # noqa: E402
+
+BASE = "https://api.gdeltproject.org/api/v2/doc/doc"
+
+COLUMNS = [
+    "title",
+    "url",
+    "seen_date",
+    "domain",
+    "language",
+    "source_country",
+    "tone",
+    "social_image",
+]
+
+
+def fetch(
+    query: str,
+    mode: str,
+    timespan: str | None,
+    start_datetime: str | None,
+    end_datetime: str | None,
+    source_country: str | None,
+    source_lang: str | None,
+    limit: int,
+    out_path: str,
+) -> int:
+    params: dict[str, str] = {
+        "query": query,
+        "mode": mode,
+        "format": "json",
+        "maxrecords": str(min(limit, 250)),
+        "sort": "datedesc",
+    }
+    if timespan:
+        params["timespan"] = timespan
+    if start_datetime and not timespan:
+        params["startdatetime"] = "".join(c for c in start_datetime if c.isdigit()).ljust(14, "0")
+    if end_datetime and not timespan:
+        params["enddatetime"] = "".join(c for c in end_datetime if c.isdigit()).ljust(14, "0")
+    if source_country:
+        params["query"] += f" sourcecountry:{source_country}"
+    if source_lang:
+        params["query"] += f" sourcelang:{source_lang}"
+
+    url = f"{BASE}?{urllib.parse.urlencode(params)}"
+    payload: dict | list = {}
+    for attempt in range(3):
+        try:
+            payload = get_json(url)
+            break
+        except RuntimeError as e:
+            # GDELT requires 1 request per 5 seconds; back off and retry.
+            if "429" in str(e) and attempt < 2:
+                print(
+                    "GDELT throttle hit; sleeping 6s before retry "
+                    f"(attempt {attempt + 1}/3)",
+                    file=sys.stderr,
+                )
+                time.sleep(6)
+                continue
+            print(f"GDELT error: {e}", file=sys.stderr)
+            payload = {}
+            break
+        except Exception as e:  # noqa: BLE001
+            print(f"GDELT error: {e}", file=sys.stderr)
+            payload = {}
+            break
+
+    rows: list[dict[str, str]] = []
+    if isinstance(payload, dict):
+        articles = payload.get("articles", []) or []
+        for a in articles[:limit]:
+            seen = (a.get("seendate") or "")
+            # GDELT format: 20260319T083000Z → 2026-03-19 08:30:00Z
+            if len(seen) == 16 and "T" in seen:
+                seen = f"{seen[0:4]}-{seen[4:6]}-{seen[6:8]} {seen[9:11]}:{seen[11:13]}:{seen[13:15]}Z"
+            rows.append(
+                {
+                    "title": (a.get("title") or "").replace("\n", " ").strip(),
+                    "url": a.get("url") or "",
+                    "seen_date": seen,
+                    "domain": a.get("domain") or "",
+                    "language": a.get("language") or "",
+                    "source_country": a.get("sourcecountry") or "",
+                    "tone": str(a.get("tone") or ""),
+                    "social_image": a.get("socialimage") or "",
+                }
+            )
+
+    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
+    with open(out_path, "w", newline="", encoding="utf-8") as fh:
+        w = csv.DictWriter(fh, fieldnames=COLUMNS)
+        w.writeheader()
+        w.writerows(rows)
+    if not rows:
+        print(
+            f"GDELT: 0 articles for query={query!r}. "
+            "GDELT indexes ~2015→present. Try widening the timespan or "
+            "checking the query syntax (https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/).",
+            file=sys.stderr,
+        )
+    return len(rows)
+
+
+def main() -> int:
+    p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    p.add_argument("--query", required=True, help='Search query (supports GDELT operators: quoted phrases, AND/OR/NOT, sourcecountry:, theme:)')
+    p.add_argument(
+        "--mode",
+        default="ArtList",
+        choices=["ArtList", "ImageCollage", "TimelineVol", "TimelineTone", "ToneChart"],
+        help="GDELT mode (default ArtList for article list)",
+    )
+    p.add_argument(
+        "--timespan",
+        help="Relative window: e.g. '1d', '1w', '1m', '3m', '1y' (overrides start/end)",
+    )
+    p.add_argument("--start", help="Absolute start YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS")
+    p.add_argument("--end", help="Absolute end YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS")
+    p.add_argument("--source-country", help="2-letter source country (e.g. US, UK)")
+    p.add_argument("--source-lang", help="Source language (e.g. English, Spanish)")
+    p.add_argument("--limit", type=int, default=100)
+    p.add_argument("--out", required=True)
+    a = p.parse_args()
+    n = fetch(
+        query=a.query,
+        mode=a.mode,
+        timespan=a.timespan,
+        start_datetime=a.start,
+        end_datetime=a.end,
+        source_country=a.source_country,
+        source_lang=a.source_lang,
+        limit=a.limit,
+        out_path=a.out,
+    )
+    print(f"Wrote {n} GDELT article rows to {a.out}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/optional-skills/research/osint-investigation/scripts/fetch_icij_offshore.py b/optional-skills/research/osint-investigation/scripts/fetch_icij_offshore.py
new file mode 100644
index 00000000000..8d050b62bf1
--- /dev/null
+++ b/optional-skills/research/osint-investigation/scripts/fetch_icij_offshore.py
@@ -0,0 +1,234 @@
+#!/usr/bin/env python3
+"""Search ICIJ Offshore Leaks via the bulk CSV database.
+
+The old reconcile endpoint (https://offshoreleaks.icij.org/reconcile) returns
+404 — ICIJ has removed it. The remaining stable access path is the public
+bulk download:
+
+    https://offshoreleaks-data.icij.org/offshoreleaks/csv/full-oldb.LATEST.zip
+
+~70 MB, ~6 CSVs inside (nodes-entities, nodes-officers, nodes-intermediaries,
+nodes-addresses, relationships, ...). We cache it under
+$HERMES_OSINT_CACHE/icij/ (default: ~/.cache/hermes-osint/icij/) and search
+locally so the agent doesn't re-download for every query.
+
+Output CSV columns match the original `fetch_icij_offshore.py` contract.
+"""
+from __future__ import annotations
+
+import argparse
+import csv
+import io
+import os
+import re
+import sys
+import time
+import urllib.request
+import zipfile
+from pathlib import Path
+
+BULK_URL = "https://offshoreleaks-data.icij.org/offshoreleaks/csv/full-oldb.LATEST.zip"
+
+COLUMNS = [
+    "node_id",
+    "name",
+    "node_type",
+    "country_codes",
+    "countries",
+    "jurisdiction",
+    "incorporation_date",
+    "inactivation_date",
+    "source",
+    "entity_url",
+    "connections",
+]
+
+
+def _cache_dir() -> Path:
+    base = os.environ.get("HERMES_OSINT_CACHE")
+    if base:
+        return Path(base) / "icij"
+    return Path.home() / ".cache" / "hermes-osint" / "icij"
+
+
+def _download(dest: Path, force: bool = False) -> Path:
+    """Download (or reuse cached) ICIJ bulk ZIP."""
+    dest.mkdir(parents=True, exist_ok=True)
+    zip_path = dest / "full-oldb.zip"
+    if zip_path.exists() and not force:
+        # Re-check age: refetch if older than 30 days.
+        age_days = (time.time() - zip_path.stat().st_mtime) / 86400
+        if age_days < 30:
+            return zip_path
+    print(f"Downloading ICIJ bulk database (~70 MB) to {zip_path}", file=sys.stderr)
+    req = urllib.request.Request(
+        BULK_URL,
+        headers={"User-Agent": "hermes-agent osint-investigation skill"},
+    )
+    with urllib.request.urlopen(req, timeout=120) as resp:  # noqa: S310
+        tmp = zip_path.with_suffix(".zip.tmp")
+        with open(tmp, "wb") as fh:
+            while True:
+                chunk = resp.read(1 << 16)
+                if not chunk:
+                    break
+                fh.write(chunk)
+    tmp.replace(zip_path)
+    return zip_path
+
+
+def _open_csv(zf: zipfile.ZipFile, name_pattern: str):
+    """Open the first CSV matching name_pattern (case-insensitive substring)."""
+    for info in zf.infolist():
+        if name_pattern.lower() in info.filename.lower() and info.filename.lower().endswith(".csv"):
+            return zf.open(info), info.filename
+    return None, None
+
+
+def _match(needle_norm: str, hay: str) -> bool:
+    return needle_norm in _normalize_query(hay or "")
+
+
+def _normalize_query(s: str) -> str:
+    s = s.upper()
+    s = re.sub(r"[^\w\s]", " ", s)
+    s = re.sub(r"\s+", " ", s).strip()
+    return s
+
+
+def fetch(
+    entity: str | None,
+    officer: str | None,
+    jurisdiction: str | None,
+    out_path: str,
+    cache_dir: Path,
+    force_refresh: bool = False,
+    limit: int = 500,
+) -> int:
+    zip_path = _download(cache_dir, force=force_refresh)
+    rows: list[dict[str, str]] = []
+    needles: list[tuple[str, str]] = []  # (kind, normalized needle)
+    if entity:
+        needles.append(("Entity", _normalize_query(entity)))
+    if officer:
+        needles.append(("Officer", _normalize_query(officer)))
+    jur_norm = _normalize_query(jurisdiction) if jurisdiction else None
+
+    targets = [
+        ("Entity", "nodes-entities"),
+        ("Officer", "nodes-officers"),
+        ("Intermediary", "nodes-intermediaries"),
+    ]
+
+    with zipfile.ZipFile(zip_path) as zf:
+        for node_type, csv_substring in targets:
+            # Skip this CSV when no needle targets its node type, unless a
+            # jurisdiction filter is set (then all needles are tried against it).
+            relevant_needles = [n for (_, n) in needles]
+            applicable_needles = [n for (k, n) in needles if k == node_type]
+            if needles and not applicable_needles and not jur_norm:
+                continue
+            stream, fname = _open_csv(zf, csv_substring)
+            if not stream:
+                continue
+            with stream:
+                text = io.TextIOWrapper(stream, encoding="utf-8", errors="replace")
+                reader = csv.DictReader(text)
+                for row in reader:
+                    name = (row.get("name") or "").strip()
+                    if not name:
+                        continue
+                    name_u = name.upper()
+                    matched = False
+                    for n in applicable_needles or relevant_needles:
+                        if _match(n, name_u):
+                            matched = True
+                            break
+                    if not needles:
+                        matched = True  # jurisdiction-only sweep
+                    if not matched:
+                        continue
+                    jur = (row.get("jurisdiction_description") or row.get("country_codes") or "").strip()
+                    if jur_norm and jur_norm not in jur.upper() and jur_norm not in (row.get("countries") or "").upper():
+                        continue
+                    node_id = (row.get("node_id") or "").strip()
+                    rows.append(
+                        {
+                            "node_id": node_id,
+                            "name": name,
+                            "node_type": node_type,
+                            "country_codes": row.get("country_codes", "") or "",
+                            "countries": row.get("countries", "") or "",
+                            "jurisdiction": jur,
+                            "incorporation_date": row.get("incorporation_date", "") or "",
+                            "inactivation_date": row.get("inactivation_date", "") or "",
+                            "source": row.get("sourceID", "") or row.get("source", "") or "",
+                            "entity_url": (
+                                f"https://offshoreleaks.icij.org/nodes/{node_id}" if node_id else ""
+                            ),
+                            "connections": "",
+                        }
+                    )
+                    if len(rows) >= limit:
+                        break
+            if len(rows) >= limit:
+                break
+
+    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
+    with open(out_path, "w", newline="", encoding="utf-8") as fh:
+        w = csv.DictWriter(fh, fieldnames=COLUMNS)
+        w.writeheader()
+        w.writerows(rows)
+    if not rows:
+        bits = []
+        if entity:
+            bits.append(f"entity={entity!r}")
+        if officer:
+            bits.append(f"officer={officer!r}")
+        if jurisdiction:
+            bits.append(f"jurisdiction={jurisdiction!r}")
+        print(
+            f"ICIJ: 0 matches for {', '.join(bits)}. "
+            "The bulk database covers offshore leaks (Panama, Paradise, Pandora, "
+            "Bahamas, Offshore Leaks). Most private US individuals are NOT in it.",
+            file=sys.stderr,
+        )
+    return len(rows)
+
+
+def main() -> int:
+    p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    p.add_argument("--entity", help="Search by entity name (substring, case-insensitive)")
+    p.add_argument("--officer", help="Search by officer / individual name (substring, case-insensitive)")
+    p.add_argument("--jurisdiction", help="Filter results by jurisdiction substring")
+    p.add_argument("--limit", type=int, default=500)
+    p.add_argument("--out", required=True)
+    p.add_argument(
+        "--cache-dir",
+        type=Path,
+        default=None,
+        help="Override cache directory (default: $HERMES_OSINT_CACHE/icij or ~/.cache/hermes-osint/icij)",
+    )
+    p.add_argument(
+        "--force-refresh",
+        action="store_true",
+        help="Re-download the bulk ZIP even if a recent cached copy exists.",
+    )
+    a = p.parse_args()
+    if not (a.entity or a.officer or a.jurisdiction):
+        p.error("must supply at least one of --entity / --officer / --jurisdiction")
+    n = fetch(
+        entity=a.entity,
+        officer=a.officer,
+        jurisdiction=a.jurisdiction,
+        out_path=a.out,
+        cache_dir=a.cache_dir or _cache_dir(),
+        force_refresh=a.force_refresh,
+        limit=a.limit,
+    )
+    print(f"Wrote {n} ICIJ Offshore Leaks rows to {a.out}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/optional-skills/research/osint-investigation/scripts/fetch_nyc_acris.py b/optional-skills/research/osint-investigation/scripts/fetch_nyc_acris.py
new file mode 100644
index 00000000000..6ec448f0f53
--- /dev/null
+++ b/optional-skills/research/osint-investigation/scripts/fetch_nyc_acris.py
@@ -0,0 +1,203 @@
+#!/usr/bin/env python3
+"""Search NYC property records via ACRIS (Automated City Register Information System).
+
+Uses the city's Socrata-backed open data API. No auth required for read access.
+
+Datasets:
+  bnx9-e6tj — Real Property Master (one row per recorded document)
+  636b-3b5g — Real Property Parties (names — grantor, grantee, etc.)
+  8h5j-fqxa — Real Property Legal (lot / property identifiers)
+  uqqa-hym2 — Real Property References
+
+The Parties dataset holds names and addresses. We search it by name and/or
+address, then optionally join to Master for doc type, dates, borough, and amount.
+"""
+from __future__ import annotations
+
+import argparse
+import csv
+import sys
+import urllib.parse
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).parent))
+from _http import get_json  # noqa: E402
+
+PARTIES_URL = "https://data.cityofnewyork.us/resource/636b-3b5g.json"
+MASTER_URL = "https://data.cityofnewyork.us/resource/bnx9-e6tj.json"
+
+PARTY_TYPE = {
+    "1": "grantor (seller / mortgagor / debtor)",
+    "2": "grantee (buyer / mortgagee / creditor)",
+    "3": "other party",
+}
+
+BOROUGH = {
+    "1": "Manhattan",
+    "2": "Bronx",
+    "3": "Brooklyn",
+    "4": "Queens",
+    "5": "Staten Island",
+}
+
+COLUMNS = [
+    "document_id",
+    "name",
+    "party_type",
+    "party_role",
+    "address_1",
+    "address_2",
+    "city",
+    "state",
+    "zip",
+    "country",
+    "doc_type",
+    "doc_date",
+    "recorded_date",
+    "borough",
+    "amount",
+    "filing_url",
+]
+
+
+def _filing_url(document_id: str) -> str:
+    if not document_id:
+        return ""
+    return (
+        f"https://a836-acris.nyc.gov/DS/DocumentSearch/DocumentImageView?doc_id={document_id}"
+    )
+
+
+def fetch(
+    name: str | None,
+    address: str | None,
+    party_type: str | None,
+    limit: int,
+    out_path: str,
+    enrich: bool = True,
+) -> int:
+    if not (name or address):
+        raise SystemExit("must supply --name or --address")
+
+    where_clauses: list[str] = []
+    if name:
+        safe = name.upper().replace("'", "''")
+        where_clauses.append(f"upper(name) like '%{safe}%'")
+    if address:
+        safe_addr = address.upper().replace("'", "''")
+        where_clauses.append(f"upper(address_1) like '%{safe_addr}%'")
+    if party_type and party_type in {"1", "2", "3"}:
+        where_clauses.append(f"party_type='{party_type}'")
+
+    params = {
+        "$where": " AND ".join(where_clauses),
+        "$limit": str(limit),
+    }
+    url = f"{PARTIES_URL}?{urllib.parse.urlencode(params)}"
+    parties = get_json(url)
+    if not isinstance(parties, list):
+        raise SystemExit(f"Unexpected ACRIS response: {parties!r}")
+
+    # Enrich with master record (doc_type, dates, borough, amount).
+    doc_ids: list[str] = sorted({
+        d for d in (p.get("document_id") for p in parties) if d
+    })
+    masters: dict[str, dict] = {}
+    if enrich and doc_ids:
+        # Batch up to 100 doc_ids per request (Socrata IN-list is fine for this).
+        for i in range(0, len(doc_ids), 100):
+            chunk = doc_ids[i : i + 100]
+            id_list = ",".join(f"'{d}'" for d in chunk)
+            master_params = {
+                "$where": f"document_id in ({id_list})",
+                "$limit": "100",
+            }
+            url = f"{MASTER_URL}?{urllib.parse.urlencode(master_params)}"
+            try:
+                rows = get_json(url)
+            except Exception as e:  # noqa: BLE001
+                print(f"ACRIS master lookup failed for chunk: {e}", file=sys.stderr)
+                continue
+            if isinstance(rows, list):
+                for r in rows:
+                    did = r.get("document_id", "")
+                    if did:
+                        masters[did] = r
+
+    out_rows: list[dict[str, str]] = []
+    for p in parties:
+        did = p.get("document_id", "") or ""
+        m = masters.get(did, {})
+        out_rows.append(
+            {
+                "document_id": did,
+                "name": p.get("name", "") or "",
+                "party_type": p.get("party_type", "") or "",
+                "party_role": PARTY_TYPE.get(p.get("party_type", ""), ""),
+                "address_1": p.get("address_1", "") or "",
+                "address_2": p.get("address_2", "") or "",
+                "city": p.get("city", "") or "",
+                "state": p.get("state", "") or "",
+                "zip": p.get("zip", "") or "",
+                "country": p.get("country", "") or "",
+                "doc_type": m.get("doc_type", "") or "",
+                "doc_date": (m.get("document_date", "") or "")[:10],
+                "recorded_date": (m.get("recorded_datetime", "") or "")[:10],
+                "borough": BOROUGH.get(m.get("recorded_borough", ""), m.get("recorded_borough", "")),
+                "amount": m.get("document_amt", "") or "",
+                "filing_url": _filing_url(did),
+            }
+        )
+
+    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
+    with open(out_path, "w", newline="", encoding="utf-8") as fh:
+        w = csv.DictWriter(fh, fieldnames=COLUMNS)
+        w.writeheader()
+        w.writerows(out_rows)
+
+    if not out_rows:
+        filters = []
+        if name:
+            filters.append(f"name={name!r}")
+        if address:
+            filters.append(f"address={address!r}")
+        print(
+            f"NYC ACRIS: 0 records for {', '.join(filters)}. "
+            "ACRIS covers ONLY NYC (5 boroughs). For property records elsewhere, "
+            "search the relevant county recorder directly.",
+            file=sys.stderr,
+        )
+    return len(out_rows)
+
+
+def main() -> int:
+    p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    p.add_argument("--name", help="Party name substring (case-insensitive)")
+    p.add_argument("--address", help="Address line 1 substring")
+    p.add_argument(
+        "--party-type",
+        choices=["1", "2", "3"],
+        help="Filter party type: 1=grantor (seller/mortgagor), 2=grantee (buyer/mortgagee), 3=other",
+    )
+    p.add_argument("--limit", type=int, default=200)
+    p.add_argument(
+        "--no-enrich",
+        action="store_true",
+        help="Skip the master-document lookup that adds doc_type/date/amount",
+    )
+    p.add_argument("--out", required=True)
+    a = p.parse_args()
+    n = fetch(
+        name=a.name,
+        address=a.address,
+        party_type=a.party_type,
+        limit=a.limit,
+        out_path=a.out,
+        enrich=not a.no_enrich,
+    )
+    print(f"Wrote {n} NYC ACRIS rows to {a.out}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/optional-skills/research/osint-investigation/scripts/fetch_ofac_sdn.py b/optional-skills/research/osint-investigation/scripts/fetch_ofac_sdn.py
new file mode 100644
index 00000000000..5233fa09ab8
--- /dev/null
+++ b/optional-skills/research/osint-investigation/scripts/fetch_ofac_sdn.py
@@ -0,0 +1,175 @@
+#!/usr/bin/env python3
+"""Fetch OFAC SDN list (CSV format) and normalize.
+
+Public endpoint: https://www.treasury.gov/ofac/downloads/sdn.csv
+Format reference: https://ofac.treasury.gov/specially-designated-nationals-and-blocked-persons-list-sdn-human-readable-lists
+
+The SDN CSV uses a specific 12-column format with no header row:
+    ent_num, sdn_name, sdn_type, program, title, call_sign, vess_type,
+    tonnage, grt, vess_flag, vess_owner, remarks
+Address and AKA records live in separate files. We fetch all three and join.
+"""
+from __future__ import annotations
+
+import argparse
+import csv
+import io
+import sys
+from collections import defaultdict
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).parent))
+from _http import get  # noqa: E402
+
+SDN_URL = "https://www.treasury.gov/ofac/downloads/sdn.csv"
+ADD_URL = "https://www.treasury.gov/ofac/downloads/add.csv"
+ALT_URL = "https://www.treasury.gov/ofac/downloads/alt.csv"
+
+SDN_COLS = [
+    "ent_num", "sdn_name", "sdn_type", "program", "title",
+    "call_sign", "vess_type", "tonnage", "grt", "vess_flag",
+    "vess_owner", "remarks",
+]
+ADD_COLS = [
+    "ent_num", "add_num", "address", "city_state_zip", "country", "add_remarks",
+]
+ALT_COLS = [
+    "ent_num", "alt_num", "alt_type", "alt_name", "alt_remarks",
+]
+
+COLUMNS = [
+    "entity_id",
+    "name",
+    "entity_type",
+    "program_list",
+    "title",
+    "nationalities",
+    "aka_list",
+    "addresses",
+    "dob",
+    "pob",
+    "remarks",
+    "last_updated",
+]
+
+_TYPE_MAP = {
+    "individual": "individual",
+    "entity": "entity",
+    "vessel": "vessel",
+    "aircraft": "aircraft",
+}
+
+
+def _read_csv(url: str, columns: list[str]) -> list[dict[str, str]]:
+    body = get(url, timeout=60).decode("latin-1", errors="replace")
+    reader = csv.reader(io.StringIO(body))
+    out = []
+    for row in reader:
+        if not row:
+            continue
+        # Pad/truncate to expected width.
+        row = row[: len(columns)] + [""] * (len(columns) - len(row))
+        out.append(dict(zip(columns, row)))
+    return out
+
+
+def _strip_quotes(s: str) -> str:
+    s = s.strip()
+    if s.startswith('"') and s.endswith('"'):
+        s = s[1:-1]
+    if s == "-0-":
+        return ""
+    return s
+
+
+def fetch(
+    program: str | None,
+    entity_type: str | None,
+    out_path: str,
+) -> int:
+    sdn = _read_csv(SDN_URL, SDN_COLS)
+    addresses = _read_csv(ADD_URL, ADD_COLS)
+    akas = _read_csv(ALT_URL, ALT_COLS)
+
+    addr_by_ent: dict[str, list[str]] = defaultdict(list)
+    for a in addresses:
+        ent = _strip_quotes(a["ent_num"])
+        parts = [
+            _strip_quotes(a[c])
+            for c in ("address", "city_state_zip", "country")
+            if _strip_quotes(a[c])
+        ]
+        if parts:
+            addr_by_ent[ent].append(", ".join(parts))
+
+    aka_by_ent: dict[str, list[str]] = defaultdict(list)
+    for k in akas:
+        ent = _strip_quotes(k["ent_num"])
+        name = _strip_quotes(k["alt_name"])
+        if name:
+            aka_by_ent[ent].append(name)
+
+    rows: list[dict[str, str]] = []
+    for r in sdn:
+        ent_num = _strip_quotes(r["ent_num"])
+        if not ent_num:
+            continue
+        sdn_type = _TYPE_MAP.get(_strip_quotes(r["sdn_type"]).lower(), _strip_quotes(r["sdn_type"])) or "entity"  # sdn.csv leaves the type blank ("-0-") for entities
+        if entity_type and sdn_type != entity_type:
+            continue
+        progs = _strip_quotes(r["program"])
+        if program and program.upper() not in {p.strip() for p in progs.upper().split(";")}:
+            continue
+        remarks = _strip_quotes(r["remarks"])
+        # DOB / POB are commonly embedded in remarks for individuals.
+        dob = ""
+        pob = ""
+        if sdn_type == "individual" and remarks:
+            for chunk in remarks.split(";"):
+                ch = chunk.strip()
+                if ch.upper().startswith("DOB"):
+                    dob = ch.split(maxsplit=1)[1] if " " in ch else ""
+                elif ch.upper().startswith("POB"):
+                    pob = ch.split(maxsplit=1)[1] if " " in ch else ""
+        rows.append(
+            {
+                "entity_id": ent_num,
+                "name": _strip_quotes(r["sdn_name"]),
+                "entity_type": sdn_type,
+                "program_list": "; ".join(p.strip() for p in progs.split(";") if p.strip()),
+                "title": _strip_quotes(r["title"]),
+                "nationalities": "",  # not in this CSV; available in XML format
+                "aka_list": "; ".join(aka_by_ent.get(ent_num, [])),
+                "addresses": "; ".join(addr_by_ent.get(ent_num, [])),
+                "dob": dob,
+                "pob": pob,
+                "remarks": remarks,
+                "last_updated": "",
+            }
+        )
+
+    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
+    with open(out_path, "w", newline="", encoding="utf-8") as fh:
+        w = csv.DictWriter(fh, fieldnames=COLUMNS)
+        w.writeheader()
+        w.writerows(rows)
+    return len(rows)
+
+
+def main() -> int:
+    p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    p.add_argument("--program", help="Filter to specific sanctions program (e.g. SDGT, IRAN)")
+    p.add_argument(
+        "--entity-type",
+        choices=["individual", "entity", "vessel", "aircraft"],
+        help="Filter to a specific entity type",
+    )
+    p.add_argument("--out", required=True)
+    a = p.parse_args()
+    n = fetch(program=a.program, entity_type=a.entity_type, out_path=a.out)
+    print(f"Wrote {n} OFAC SDN rows to {a.out}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/optional-skills/research/osint-investigation/scripts/fetch_opencorporates.py b/optional-skills/research/osint-investigation/scripts/fetch_opencorporates.py
new file mode 100644
index 00000000000..6924a8056a6
--- /dev/null
+++ b/optional-skills/research/osint-investigation/scripts/fetch_opencorporates.py
@@ -0,0 +1,192 @@
+#!/usr/bin/env python3
+"""Search OpenCorporates company registry data.
+
+OpenCorporates aggregates ~200M companies from 130+ jurisdictions. The
+public API requires an API token (free tier: 500 calls/month). Set
+OPENCORPORATES_API_TOKEN in env or pass --token.
+
+Without a token, this script falls back to scraping the public HTML
+search page (limited fields, more brittle, no jurisdiction filter).
+"""
+from __future__ import annotations
+
+import argparse
+import csv
+import json
+import os
+import re
+import sys
+import urllib.parse
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).parent))
+from _http import get, get_json  # noqa: E402
+
+API_URL = "https://api.opencorporates.com/v0.4/companies/search"
+HTML_URL = "https://opencorporates.com/companies"
+
+COLUMNS = [
+    "name",
+    "company_number",
+    "jurisdiction_code",
+    "jurisdiction_name",
+    "incorporation_date",
+    "dissolution_date",
+    "company_type",
+    "status",
+    "registered_address",
+    "opencorporates_url",
+    "officers_count",
+    "source",
+]
+
+
+def _via_api(query: str, jurisdiction: str | None, token: str, limit: int) -> list[dict]:
+    params = {
+        "q": query,
+        "api_token": token,
+        "per_page": str(min(limit, 100)),
+    }
+    if jurisdiction:
+        params["jurisdiction_code"] = jurisdiction
+    url = f"{API_URL}?{urllib.parse.urlencode(params)}"
+    payload = get_json(url)
+    if not isinstance(payload, dict):
+        return []
+    results = payload.get("results", {}).get("companies", []) or []
+    return [r.get("company", {}) for r in results if isinstance(r, dict)]
+
+
+def _via_html(query: str, limit: int) -> list[dict]:
+    """Best-effort HTML fallback when no API token is available."""
+    params = {"q": query, "utf8": "✓"}
+    url = f"{HTML_URL}?{urllib.parse.urlencode(params)}"
+    body = get(url, user_agent="Mozilla/5.0 hermes-osint").decode("utf-8", errors="replace")
+    # Each result is in <li class="company"> ... </li> with name, url, status
+    pattern = re.compile(
+        r'<li[^>]*class="[^"]*company[^"]*"[^>]*>.*?'
+        r'<a[^>]+href="(?P<url>/companies/[^"]+)"[^>]*>(?P<name>[^<]+)</a>'
+        r'(?:.*?<span[^>]*class="[^"]*jurisdiction[^"]*"[^>]*>(?P<jur>[^<]+)</span>)?'
+        r"(?:.*?<dt[^>]*>(?:Company\s+Number|Number)</dt>\s*<dd[^>]*>(?P<num>[^<]+)</dd>)?",
+        re.DOTALL | re.IGNORECASE,
+    )
+    out = []
+    for m in pattern.finditer(body):
+        if len(out) >= limit:
+            break
+        url_path = m.group("url").strip()
+        out.append(
+            {
+                "name": (m.group("name") or "").strip(),
+                "opencorporates_url": f"https://opencorporates.com{url_path}",
+                "jurisdiction_code": (m.group("jur") or "").strip(),
+                "company_number": (m.group("num") or "").strip(),
+                "_via": "html",
+            }
+        )
+    return out
+
+
+def fetch(
+    query: str,
+    jurisdiction: str | None,
+    token: str | None,
+    limit: int,
+    out_path: str,
+) -> int:
+    if token:
+        try:
+            companies = _via_api(query, jurisdiction, token, limit)
+            source_tag = "api"
+        except Exception as e:  # noqa: BLE001
+            print(
+                f"OpenCorporates API call failed ({e}); falling back to HTML (no jurisdiction filter).",
+                file=sys.stderr,
+            )
+            companies = _via_html(query, limit)
+            source_tag = "html-fallback"
+    else:
+        print(
+            "OPENCORPORATES_API_TOKEN not set — using HTML fallback (limited fields). "
+            "Get a free token at https://opencorporates.com/api_accounts/new",
+            file=sys.stderr,
+        )
+        companies = _via_html(query, limit)
+        source_tag = "html"
+
+    rows: list[dict[str, str]] = []
+    for c in companies[:limit]:
+        if c.get("_via") == "html":
+            rows.append(
+                {
+                    "name": c.get("name", ""),
+                    "company_number": c.get("company_number", ""),
+                    "jurisdiction_code": c.get("jurisdiction_code", ""),
+                    "jurisdiction_name": "",
+                    "incorporation_date": "",
+                    "dissolution_date": "",
+                    "company_type": "",
+                    "status": "",
+                    "registered_address": "",
+                    "opencorporates_url": c.get("opencorporates_url", ""),
+                    "officers_count": "",
+                    "source": source_tag,
+                }
+            )
+            continue
+        addr = c.get("registered_address_in_full") or ""
+        rows.append(
+            {
+                "name": c.get("name", "") or "",
+                "company_number": c.get("company_number", "") or "",
+                "jurisdiction_code": c.get("jurisdiction_code", "") or "",
+                "jurisdiction_name": "",
+                "incorporation_date": c.get("incorporation_date", "") or "",
+                "dissolution_date": c.get("dissolution_date", "") or "",
+                "company_type": c.get("company_type", "") or "",
+                "status": c.get("current_status", "") or ("inactive" if c.get("inactive") else ""),
+                "registered_address": addr,
+                "opencorporates_url": c.get("opencorporates_url", "") or "",
+                "officers_count": str(c["officers"].get("total_count", "")) if isinstance(c.get("officers"), dict) else "",
+                "source": source_tag,
+            }
+        )
+
+    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
+    with open(out_path, "w", newline="", encoding="utf-8") as fh:
+        w = csv.DictWriter(fh, fieldnames=COLUMNS)
+        w.writeheader()
+        w.writerows(rows)
+    if not rows:
+        print(
+            f"OpenCorporates: 0 matches for query={query!r}"
+            f"{f' jurisdiction={jurisdiction!r}' if jurisdiction else ''}.",
+            file=sys.stderr,
+        )
+    return len(rows)
+
+
+def main() -> int:
+    p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    p.add_argument("--query", required=True, help="Company name search")
+    p.add_argument(
+        "--jurisdiction",
+        help="Jurisdiction code, e.g. 'us_ny', 'us_de', 'gb', 'sg' (lowercased OpenCorporates style)",
+    )
+    p.add_argument("--limit", type=int, default=50)
+    p.add_argument("--token", default=os.environ.get("OPENCORPORATES_API_TOKEN"))
+    p.add_argument("--out", required=True)
+    a = p.parse_args()
+    n = fetch(
+        query=a.query,
+        jurisdiction=a.jurisdiction,
+        token=a.token,
+        limit=a.limit,
+        out_path=a.out,
+    )
+    print(f"Wrote {n} OpenCorporates rows to {a.out}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/optional-skills/research/osint-investigation/scripts/fetch_sec_edgar.py b/optional-skills/research/osint-investigation/scripts/fetch_sec_edgar.py
new file mode 100644
index 00000000000..bd2fda8feb9
--- /dev/null
+++ b/optional-skills/research/osint-investigation/scripts/fetch_sec_edgar.py
@@ -0,0 +1,184 @@
+#!/usr/bin/env python3
+"""Fetch SEC EDGAR filings index for a given CIK or company name.
+
+SEC requires a User-Agent header with contact info. Set SEC_USER_AGENT,
+e.g. SEC_USER_AGENT="Research example@example.com".
+
+Filings JSON is published at:
+    https://data.sec.gov/submissions/CIK<10-digit-padded>.json
+
+Company lookup uses:
+    https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&company=<name>&output=atom
+"""
+from __future__ import annotations
+
+import argparse
+import csv
+import os
+import re
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).parent))
+from _http import get, get_json  # noqa: E402
+
+SUBMISSIONS_URL = "https://data.sec.gov/submissions/CIK{cik}.json"
+COLUMNS = [
+    "cik",
+    "company_name",
+    "form_type",
+    "filing_date",
+    "accession_number",
+    "primary_document",
+    "filing_url",
+    "reporting_period",
+]
+
+
+def _ua() -> str:
+    ua = os.environ.get("SEC_USER_AGENT", "").strip()
+    if not ua:
+        raise SystemExit(
+            "SEC requires a User-Agent with contact info. "
+            "Set SEC_USER_AGENT='Your Name your@email'."
+        )
+    return ua
+
+
+def _resolve_cik(company: str) -> tuple[str, str]:
+    """Resolve a company name to a CIK via EDGAR's atom feed.
+
+    Returns (cik, resolved_company_name). Because owner=include is set, the
+    match may be an individual filer (Form 3/4/5 only) rather than a company.
+    That case is not flagged here; fetch() detects it from the form types.
+    """
+    url = "https://www.sec.gov/cgi-bin/browse-edgar"
+    params = {"action": "getcompany", "company": company, "output": "atom", "owner": "include"}
+    body = get(url, params=params, user_agent=_ua()).decode("utf-8", errors="replace")
+    m = re.search(r"CIK=(\d{10})", body)
+    if not m:
+        raise SystemExit(f"Could not resolve CIK for company={company!r}")
+    cik = m.group(1)
+    name_m = re.search(r"<title>([^<]+)\s*\((\d{10})\)</title>", body)
+    resolved = name_m.group(1).strip() if name_m else ""
+    return cik, resolved
+
+
+def fetch(
+    cik: str | None,
+    company: str | None,
+    types: list[str],
+    since: str | None,
+    out_path: str,
+) -> int:
+    resolved_name = ""
+    if not cik and company:
+        try:
+            cik, resolved_name = _resolve_cik(company)  # type: ignore[assignment]
+        except SystemExit as e:
+            # Write empty CSV with header so downstream tools still work,
+            # and tell the user clearly.
+            print(f"SEC EDGAR: {e}", file=sys.stderr)
+            Path(out_path).parent.mkdir(parents=True, exist_ok=True)
+            with open(out_path, "w", newline="", encoding="utf-8") as fh:
+                csv.DictWriter(fh, fieldnames=COLUMNS).writeheader()
+            return 0
+        if resolved_name:
+            print(
+                f"Resolved company={company!r} → CIK {cik} ({resolved_name})",
+                file=sys.stderr,
+            )
+    if not cik:
+        raise SystemExit("must supply --cik or --company")
+    cik = cik.zfill(10)
+    url = SUBMISSIONS_URL.format(cik=cik)
+    payload = get_json(url, user_agent=_ua())
+    if not isinstance(payload, dict):
+        raise SystemExit(f"Unexpected EDGAR response shape for CIK {cik}")
+    name = payload.get("name", "")
+    recent = (payload.get("filings", {}) or {}).get("recent", {}) or {}
+    form = recent.get("form", [])
+    date = recent.get("filingDate", [])
+    accession = recent.get("accessionNumber", [])
+    primary_doc = recent.get("primaryDocument", [])
+    period = recent.get("reportDate", [])
+
+    # Histogram of available filing types — useful for surfacing why a filter
+    # returned 0 (e.g. user asked for 10-K on an individual Form 4 filer).
+    type_hist: dict[str, int] = {}
+    for ftype in form:
+        type_hist[ftype] = type_hist.get(ftype, 0) + 1
+
+    type_set = {t.strip().upper() for t in types} if types else None
+    rows: list[dict[str, str]] = []
+    for i, ftype in enumerate(form):
+        if type_set and ftype.upper() not in type_set:
+            continue
+        fdate = date[i] if i < len(date) else ""
+        if since and fdate and fdate < since:
+            continue
+        acc = accession[i] if i < len(accession) else ""
+        pdoc = primary_doc[i] if i < len(primary_doc) else ""
+        acc_nodash = acc.replace("-", "")
+        filing_url = (
+            f"https://www.sec.gov/Archives/edgar/data/{int(cik)}/{acc_nodash}/{pdoc}"
+            if acc and pdoc
+            else ""
+        )
+        rows.append(
+            {
+                "cik": cik,
+                "company_name": name,
+                "form_type": ftype,
+                "filing_date": fdate,
+                "accession_number": acc,
+                "primary_document": pdoc,
+                "filing_url": filing_url,
+                "reporting_period": period[i] if i < len(period) else "",
+            }
+        )
+
+    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
+    with open(out_path, "w", newline="", encoding="utf-8") as fh:
+        w = csv.DictWriter(fh, fieldnames=COLUMNS)
+        w.writeheader()
+        w.writerows(rows)
+
+    if not rows and type_hist:
+        top = sorted(type_hist.items(), key=lambda kv: -kv[1])[:8]
+        hist_str = ", ".join(f"{t}={n}" for t, n in top)
+        print(
+            f"Warning: SEC EDGAR CIK {cik} ({name}) has {sum(type_hist.values())} "
+            f"recent filings but NONE match types={types} since={since!r}. "
+            f"Available form types: {hist_str}.",
+            file=sys.stderr,
+        )
+        # Insider-filer heuristic: only Form 3/4/5 → individual person, not a company.
+        company_types = {"10-K", "10-Q", "8-K", "20-F", "DEF 14A", "S-1"}
+        if not (set(type_hist.keys()) & company_types):
+            print(
+                f"Note: CIK {cik} appears to be an INDIVIDUAL filer "
+                f"(insider Form 3/4/5 only), not a corporate registrant. "
+                f"The resolver may have matched an officer/director named "
+                f"{company!r} rather than a company.",
+                file=sys.stderr,
+            )
+    return len(rows)
+
+
+def main() -> int:
+    p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    p.add_argument("--cik", help="Central Index Key (will be 10-digit zero-padded)")
+    p.add_argument("--company", help="Resolve to CIK by company name")
+    p.add_argument("--types", default="", help="Comma-separated form types (e.g. 10-K,10-Q,8-K)")
+    p.add_argument("--since", help="Skip filings before YYYY-MM-DD")
+    p.add_argument("--out", required=True)
+    a = p.parse_args()
+    types = [t for t in (a.types or "").split(",") if t.strip()]
+    n = fetch(cik=a.cik, company=a.company, types=types, since=a.since, out_path=a.out)
+    print(f"Wrote {n} EDGAR filing rows to {a.out}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/optional-skills/research/osint-investigation/scripts/fetch_senate_ld.py b/optional-skills/research/osint-investigation/scripts/fetch_senate_ld.py
new file mode 100644
index 00000000000..3119ff8a9a5
--- /dev/null
+++ b/optional-skills/research/osint-investigation/scripts/fetch_senate_ld.py
@@ -0,0 +1,146 @@
+#!/usr/bin/env python3
+"""Fetch Senate Lobbying Disclosure (LD-1 / LD-2) filings.
+
+Anonymous: 120 req/hour. Token (SENATE_LDA_TOKEN): 1200 req/hour.
+"""
+from __future__ import annotations
+
+import argparse
+import csv
+import os
+import sys
+import time
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).parent))
+from _http import get_json  # noqa: E402
+
+ENDPOINT = "https://lda.senate.gov/api/v1/filings/"
+COLUMNS = [
+    "filing_uuid",
+    "filing_type",
+    "filing_year",
+    "filing_period",
+    "registrant_name",
+    "registrant_id",
+    "client_name",
+    "client_id",
+    "client_general_description",
+    "income",
+    "expenses",
+    "lobbyists",
+    "issues",
+    "government_entities",
+    "filing_date",
+]
+
+
+def fetch(
+    client: str | None,
+    registrant: str | None,
+    year: int,
+    token: str | None,
+    out_path: str,
+    page_size: int = 100,
+    max_pages: int = 25,
+) -> int:
+    params: dict = {"filing_year": year, "page_size": page_size}
+    if client:
+        params["client_name"] = client
+    if registrant:
+        params["registrant_name"] = registrant
+
+    headers = {"Authorization": f"Token {token}"} if token else None
+    rows: list[dict[str, str]] = []
+    url = ENDPOINT
+    page = 0
+    while page < max_pages:
+        try:
+            payload = get_json(url, params=params if page == 0 else None, headers=headers)
+        except Exception as e:  # noqa: BLE001
+            print(f"Senate LDA error on page {page + 1}: {e}", file=sys.stderr)
+            break
+        if not isinstance(payload, dict):
+            break
+        results = payload.get("results", [])
+        for r in results:
+            client_obj = r.get("client") or {}
+            registrant_obj = r.get("registrant") or {}
+            lobbying_activities = r.get("lobbying_activities") or []
+            lobbyists = []
+            issues = []
+            entities = []
+            for la in lobbying_activities:
+                for lob in la.get("lobbyists") or []:
+                    lob_obj = lob.get("lobbyist") or {}
+                    name = " ".join(
+                        x for x in (lob_obj.get("first_name", ""), lob_obj.get("last_name", "")) if x
+                    )
+                    if name:
+                        lobbyists.append(name)
+                desc = la.get("description") or ""
+                if desc:
+                    issues.append(desc)
+                for ge in la.get("government_entities") or []:
+                    nm = ge.get("name") or ""
+                    if nm:
+                        entities.append(nm)
+            rows.append(
+                {
+                    "filing_uuid": r.get("filing_uuid", "") or "",
+                    "filing_type": r.get("filing_type", "") or "",
+                    "filing_year": str(r.get("filing_year", "") or year),
+                    "filing_period": r.get("filing_period", "") or "",
+                    "registrant_name": registrant_obj.get("name", "") or "",
+                    "registrant_id": str(registrant_obj.get("id", "") or ""),
+                    "client_name": client_obj.get("name", "") or "",
+                    "client_id": str(client_obj.get("id", "") or ""),
+                    "client_general_description": client_obj.get("general_description", "") or "",
+                    "income": str(r.get("income", "") or ""),
+                    "expenses": str(r.get("expenses", "") or ""),
+                    "lobbyists": "; ".join(sorted(set(lobbyists))),
+                    "issues": "; ".join(issues),
+                    "government_entities": "; ".join(sorted(set(entities))),
+                    "filing_date": (r.get("dt_posted") or "")[:10],
+                }
+            )
+        next_url = payload.get("next")
+        if not next_url:
+            break
+        url = next_url
+        page += 1
+        time.sleep(1.0 if not token else 0.3)
+
+    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
+    with open(out_path, "w", newline="", encoding="utf-8") as fh:
+        w = csv.DictWriter(fh, fieldnames=COLUMNS)
+        w.writeheader()
+        w.writerows(rows)
+    return len(rows)
+
+
+def main() -> int:
+    p = argparse.ArgumentParser(description=__doc__)
+    p.add_argument("--client", help="Client name filter")
+    p.add_argument("--registrant", help="Registrant (lobbying firm) name filter")
+    p.add_argument("--year", type=int, default=2024)
+    p.add_argument("--token", default=os.environ.get("SENATE_LDA_TOKEN"))
+    p.add_argument("--max-pages", type=int, default=25)
+    p.add_argument("--out", required=True)
+    a = p.parse_args()
+    if not (a.client or a.registrant):
+        p.error("must supply at least one of --client / --registrant")
+    n = fetch(
+        client=a.client,
+        registrant=a.registrant,
+        year=a.year,
+        token=a.token,
+        out_path=a.out,
+        max_pages=a.max_pages,
+    )
+    print(f"Wrote {n} Senate LDA rows to {a.out}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
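
The pagination loop in `fetch_senate_ld.py` follows the `next` URL in each response until it runs out or `max_pages` is reached. A minimal sketch of that pattern in isolation, assuming a hypothetical `fetch_page` callable and an in-memory stand-in for the API (`iter_results` and `FAKE_PAGES` are illustrative names, not part of the script):

```python
from typing import Callable, Iterator


def iter_results(
    fetch_page: Callable[[str], object],
    first_url: str,
    max_pages: int = 25,
) -> Iterator[dict]:
    """Yield result dicts page by page, following each payload's "next" link."""
    url = first_url
    for _ in range(max_pages):
        payload = fetch_page(url)
        if not isinstance(payload, dict):
            return  # unexpected payload shape: stop rather than guess
        yield from payload.get("results", [])
        url = payload.get("next")
        if not url:
            return  # last page reached


# Hypothetical in-memory pages standing in for real HTTP responses.
FAKE_PAGES = {
    "page1": {"results": [{"filing_uuid": "a"}, {"filing_uuid": "b"}], "next": "page2"},
    "page2": {"results": [{"filing_uuid": "c"}], "next": None},
}

rows = list(iter_results(FAKE_PAGES.get, "page1"))
print([r["filing_uuid"] for r in rows])
```

Passing the fetcher in as a callable keeps the loop testable without network access; the real script would also sleep between pages to respect the rate limit.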
diff --git a/optional-skills/research/osint-investigation/scripts/fetch_usaspending.py b/optional-skills/research/osint-investigation/scripts/fetch_usaspending.py
new file mode 100644
index 00000000000..a59c5f17276
--- /dev/null
+++ b/optional-skills/research/osint-investigation/scripts/fetch_usaspending.py
@@ -0,0 +1,170 @@
+#!/usr/bin/env python3
+"""Fetch federal contracts/awards from USAspending.gov API v2.
+
+No auth required. POST to /api/v2/search/spending_by_award/ with filters.
+"""
+from __future__ import annotations
+
+import argparse
+import csv
+import json
+import sys
+import time
+import urllib.request
+from pathlib import Path
+
+ENDPOINT = "https://api.usaspending.gov/api/v2/search/spending_by_award/"
+COLUMNS = [
+    "award_id",
+    "recipient_name",
+    "recipient_uei",
+    "recipient_duns",
+    "recipient_parent_name",
+    "recipient_state",
+    "awarding_agency",
+    "awarding_sub_agency",
+    "award_type",
+    "award_amount",
+    "award_date",
+    "period_of_performance_start",
+    "period_of_performance_end",
+    "naics_code",
+    "psc_code",
+    "competition_extent",
+    "description",
+]
+
+# Field labels requested from the API; each result row is keyed by these same labels.
+_FIELDS = [
+    "Award ID",
+    "Recipient Name",
+    "Recipient UEI",
+    "Recipient DUNS Number",
+    "Recipient Parent Name",
+    "Recipient State Code",
+    "Awarding Agency",
+    "Awarding Sub Agency",
+    "Award Type",
+    "Award Amount",
+    "Start Date",
+    "End Date",
+    "NAICS Code",
+    "PSC Code",
+    "Type of Set Aside",
+    "Description",
+]
+
+
+def _post(body: dict) -> dict:
+    req = urllib.request.Request(
+        ENDPOINT,
+        data=json.dumps(body).encode("utf-8"),
+        headers={"Content-Type": "application/json", "User-Agent": "hermes-agent osint-investigation"},
+        method="POST",
+    )
+    with urllib.request.urlopen(req, timeout=60) as resp:
+        return json.loads(resp.read().decode("utf-8"))
+
+
+def fetch(
+    recipient: str | None,
+    agency: str | None,
+    fy: int,
+    sole_source_only: bool,
+    out_path: str,
+    page_size: int = 100,
+    max_pages: int = 20,
+) -> int:
+    filters: dict = {
+        "time_period": [{"start_date": f"{fy - 1}-10-01", "end_date": f"{fy}-09-30"}],
+        # Contracts only by default; adjust award_type_codes for grants/loans.
+        "award_type_codes": ["A", "B", "C", "D"],
+    }
+    if recipient:
+        filters["recipient_search_text"] = [recipient]
+    if agency:
+        filters["agencies"] = [{"type": "awarding", "tier": "toptier", "name": agency}]
+
+    rows: list[dict[str, str]] = []
+    page = 1
+    while page <= max_pages:
+        body = {
+            "filters": filters,
+            "fields": _FIELDS,
+            "page": page,
+            "limit": page_size,
+            "sort": "Award Amount",
+            "order": "desc",
+        }
+        try:
+            payload = _post(body)
+        except Exception as e:  # noqa: BLE001
+            print(f"USAspending error on page {page}: {e}", file=sys.stderr)
+            break
+        results = payload.get("results", [])
+        if not results:
+            break
+        for r in results:
+            set_aside = r.get("Type of Set Aside", "") or ""
+            if sole_source_only and "sole" not in set_aside.lower():
+                continue
+            rows.append(
+                {
+                    "award_id": r.get("Award ID", "") or "",
+                    "recipient_name": r.get("Recipient Name", "") or "",
+                    "recipient_uei": r.get("Recipient UEI", "") or "",
+                    "recipient_duns": r.get("Recipient DUNS Number", "") or "",
+                    "recipient_parent_name": r.get("Recipient Parent Name", "") or "",
+                    "recipient_state": r.get("Recipient State Code", "") or "",
+                    "awarding_agency": r.get("Awarding Agency", "") or "",
+                    "awarding_sub_agency": r.get("Awarding Sub Agency", "") or "",
+                    "award_type": r.get("Award Type", "") or "",
+                    "award_amount": str(r.get("Award Amount", "") or ""),
+                    "award_date": r.get("Start Date", "") or "",
+                    "period_of_performance_start": r.get("Start Date", "") or "",
+                    "period_of_performance_end": r.get("End Date", "") or "",
+                    "naics_code": str(r.get("NAICS Code", "") or ""),
+                    "psc_code": str(r.get("PSC Code", "") or ""),
+                    "competition_extent": set_aside,
+                    "description": r.get("Description", "") or "",
+                }
+            )
+        meta = payload.get("page_metadata", {})
+        if not meta.get("hasNext"):
+            break
+        page += 1
+        time.sleep(0.5)
+
+    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
+    with open(out_path, "w", newline="", encoding="utf-8") as fh:
+        w = csv.DictWriter(fh, fieldnames=COLUMNS)
+        w.writeheader()
+        w.writerows(rows)
+    return len(rows)
+
+
+def main() -> int:
+    p = argparse.ArgumentParser(description=__doc__)
+    p.add_argument("--recipient", help="Recipient name search")
+    p.add_argument("--agency", help="Awarding agency (top-tier)")
+    p.add_argument("--fy", type=int, default=2024, help="Federal fiscal year")
+    p.add_argument("--sole-source-only", action="store_true", help="Keep only awards whose set-aside type mentions sole source")
+    p.add_argument("--max-pages", type=int, default=20)
+    p.add_argument("--out", required=True)
+    a = p.parse_args()
+    if not (a.recipient or a.agency):
+        p.error("must supply at least one of --recipient / --agency")
+    n = fetch(
+        recipient=a.recipient,
+        agency=a.agency,
+        fy=a.fy,
+        sole_source_only=a.sole_source_only,
+        out_path=a.out,
+        max_pages=a.max_pages,
+    )
+    print(f"Wrote {n} USAspending rows to {a.out}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
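
The `time_period` filter in `fetch_usaspending.py` relies on the US federal fiscal year running from October 1 of the prior calendar year through September 30. A small sketch of that mapping, where `fiscal_year_window` is a hypothetical helper name used only for illustration:

```python
import datetime as dt


def fiscal_year_window(fy: int) -> tuple[str, str]:
    """Return (start_date, end_date) ISO strings for US federal fiscal year `fy`."""
    start = dt.date(fy - 1, 10, 1)  # FY begins October 1 of the previous calendar year
    end = dt.date(fy, 9, 30)        # and ends September 30 of the named year
    return start.isoformat(), end.isoformat()


print(fiscal_year_window(2024))
```

Building the dates through `datetime.date` rather than string formatting means an out-of-range year raises an error early instead of producing a malformed filter.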
diff --git a/optional-skills/research/osint-investigation/scripts/fetch_wayback.py b/optional-skills/research/osint-investigation/scripts/fetch_wayback.py
new file mode 100644
index 00000000000..fb9147f22c2
--- /dev/null
+++ b/optional-skills/research/osint-investigation/scripts/fetch_wayback.py
@@ -0,0 +1,142 @@
+#!/usr/bin/env python3
+"""Search the Internet Archive Wayback Machine via the CDX server.
+
+The CDX API indexes ~900B+ archived web pages. Anonymous read access,
+no auth required. Useful for finding deleted / changed pages by URL,
+domain, or substring match.
+"""
+from __future__ import annotations
+
+import argparse
+import csv
+import sys
+import urllib.parse
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).parent))
+from _http import get_json  # noqa: E402
+
+BASE = "https://web.archive.org/cdx/search/cdx"
+
+COLUMNS = [
+    "url",
+    "timestamp",
+    "wayback_url",
+    "mimetype",
+    "status",
+    "digest",
+    "length",
+]
+
+
+def fetch(
+    url_or_host: str,
+    match_type: str,
+    from_date: str | None,
+    to_date: str | None,
+    status: str | None,
+    mime: str | None,
+    collapse: str | None,
+    limit: int,
+    out_path: str,
+) -> int:
+    params: dict[str, str | list[str]] = {
+        "url": url_or_host,
+        "matchType": match_type,
+        "output": "json",
+        "limit": str(limit),
+    }
+    if from_date:
+        params["from"] = from_date.replace("-", "")
+    if to_date:
+        params["to"] = to_date.replace("-", "")
+    # CDX accepts repeated filter params; doseq=True below emits one per list item.
+    filters = [f"statuscode:{status}"] if status else []
+    if mime:
+        filters.append(f"mimetype:{mime}")
+    if filters:
+        params["filter"] = filters
+    if collapse:
+        params["collapse"] = collapse
+
+    url = f"{BASE}?{urllib.parse.urlencode(params, doseq=True)}"
+    try:
+        payload = get_json(url)
+    except Exception as e:  # noqa: BLE001
+        print(f"Wayback CDX error: {e}", file=sys.stderr)
+        payload = []
+
+    rows: list[dict[str, str]] = []
+    if isinstance(payload, list) and len(payload) > 1:
+        header = payload[0]
+        idx = {h: i for i, h in enumerate(header)}
+        for entry in payload[1:]:
+            ts = entry[idx["timestamp"]] if "timestamp" in idx else ""
+            orig = entry[idx["original"]] if "original" in idx else ""
+            rows.append(
+                {
+                    "url": orig,
+                    "timestamp": ts,
+                    "wayback_url": f"https://web.archive.org/web/{ts}/{orig}" if ts and orig else "",
+                    "mimetype": entry[idx["mimetype"]] if "mimetype" in idx else "",
+                    "status": entry[idx["statuscode"]] if "statuscode" in idx else "",
+                    "digest": entry[idx["digest"]] if "digest" in idx else "",
+                    "length": entry[idx["length"]] if "length" in idx else "",
+                }
+            )
+
+    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
+    with open(out_path, "w", newline="", encoding="utf-8") as fh:
+        w = csv.DictWriter(fh, fieldnames=COLUMNS)
+        w.writeheader()
+        w.writerows(rows)
+    if not rows:
+        print(
+            f"Wayback Machine: 0 captures for {url_or_host!r} matchType={match_type}.",
+            file=sys.stderr,
+        )
+    return len(rows)
+
+
+def main() -> int:
+    p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    p.add_argument("--url", required=True, help="URL or host to look up in the archive")
+    p.add_argument(
+        "--match",
+        default="exact",
+        choices=["exact", "prefix", "host", "domain"],
+        help=(
+            "exact: this URL only. "
+            "prefix: this URL's path-prefix. "
+            "host: any URL on this host. "
+            "domain: any URL on this domain or subdomains."
+        ),
+    )
+    p.add_argument("--from-date", help="Earliest capture YYYY-MM-DD")
+    p.add_argument("--to-date", help="Latest capture YYYY-MM-DD")
+    p.add_argument("--status", help="HTTP status filter (e.g. 200)")
+    p.add_argument("--mime", help="MIME type filter (e.g. text/html)")
+    p.add_argument(
+        "--collapse",
+        help="Collapse adjacent identical entries (e.g. 'digest' for unique-content captures)",
+    )
+    p.add_argument("--limit", type=int, default=200)
+    p.add_argument("--out", required=True)
+    a = p.parse_args()
+    n = fetch(
+        url_or_host=a.url,
+        match_type=a.match,
+        from_date=a.from_date,
+        to_date=a.to_date,
+        status=a.status,
+        mime=a.mime,
+        collapse=a.collapse,
+        limit=a.limit,
+        out_path=a.out,
+    )
+    print(f"Wrote {n} Wayback capture rows to {a.out}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
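
`fetch_wayback.py` exposes both `--status` and `--mime`, and the CDX server takes each as its own `filter` parameter. A minimal sketch of how repeated query parameters can be produced with the stdlib, assuming a hypothetical helper named `build_cdx_query` (not part of the script):

```python
import urllib.parse


def build_cdx_query(url: str, status: str = "", mime: str = "") -> str:
    """Encode a CDX query string, emitting one `filter=` pair per active filter."""
    params: dict = {"url": url, "output": "json"}
    filters = []
    if status:
        filters.append(f"statuscode:{status}")
    if mime:
        filters.append(f"mimetype:{mime}")
    if filters:
        params["filter"] = filters
    # doseq=True expands a list value into repeated key=value pairs.
    return urllib.parse.urlencode(params, doseq=True)


query = build_cdx_query("example.com", status="200", mime="text/html")
print(query)
print(urllib.parse.parse_qs(query)["filter"])
```

Storing the filters in a list avoids the situation where a second assignment to `params["filter"]` silently replaces the first.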
diff --git a/optional-skills/research/osint-investigation/scripts/fetch_wikipedia.py b/optional-skills/research/osint-investigation/scripts/fetch_wikipedia.py
new file mode 100644
index 00000000000..4ce5c93813c
--- /dev/null
+++ b/optional-skills/research/osint-investigation/scripts/fetch_wikipedia.py
@@ -0,0 +1,267 @@
+#!/usr/bin/env python3
+"""Search Wikipedia + Wikidata for an entity (person, company, place, concept).
+
+Two free APIs:
+  - Wikipedia OpenSearch + REST summary endpoint for narrative bio
+  - Wikidata Action API (wbgetentities) for structured facts (birth, employer, etc.)
+
+Both are anonymous-access. Useful for resolving who-is-this-entity questions
+and surfacing cross-references that other sources can join against.
+"""
+from __future__ import annotations
+
+import argparse
+import csv
+import json
+import re
+import sys
+import urllib.parse
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).parent))
+from _http import get_json  # noqa: E402
+
+WP_OPENSEARCH = "https://en.wikipedia.org/w/api.php"
+WP_SUMMARY = "https://en.wikipedia.org/api/rest_v1/page/summary/"
+WD_ACTION = "https://www.wikidata.org/w/api.php"
+
+COLUMNS = [
+    "source",
+    "label",
+    "description",
+    "qid",
+    "wikipedia_title",
+    "wikipedia_url",
+    "wikidata_url",
+    "instance_of",
+    "country",
+    "occupation",
+    "employer",
+    "date_of_birth",
+    "place_of_birth",
+    "summary",
+]
+
+
+def _wp_search(query: str, limit: int) -> list[dict]:
+    params = {
+        "action": "opensearch",
+        "search": query,
+        "limit": str(min(limit, 20)),
+        "format": "json",
+    }
+    url = f"{WP_OPENSEARCH}?{urllib.parse.urlencode(params)}"
+    data = get_json(url)
+    if not isinstance(data, list) or len(data) < 4:
+        return []
+    titles, descs, urls = data[1], data[2], data[3]
+    out = []
+    for i, title in enumerate(titles):
+        out.append(
+            {
+                "title": title,
+                "description": descs[i] if i < len(descs) else "",
+                "url": urls[i] if i < len(urls) else "",
+            }
+        )
+    return out
+
+
+def _wp_summary(title: str) -> dict:
+    """Pull the REST summary for a title — short bio, image, type."""
+    url = f"{WP_SUMMARY}{urllib.parse.quote(title.replace(' ', '_'), safe='')}"
+    try:
+        return get_json(url)  # type: ignore[return-value]
+    except Exception as e:  # noqa: BLE001
+        print(f"Wikipedia summary lookup for {title!r} failed: {e}", file=sys.stderr)
+        return {}
+
+
+def _wd_lookup_by_qid(qid: str) -> dict:
+    """Pull common facts for a QID via Wikidata's Action API (no SPARQL).
+
+    The Action API is far more lenient on rate-limits than the SPARQL Query
+    Service. We get claims as QIDs and then resolve labels in one batch call.
+    """
+    # Properties of interest, mapped to the output column ("slot") each one
+    # fills. Claim values come back as QIDs or typed literals.
+    interesting = {
+        "P31": "instance_of",
+        "P17": "country",          # for orgs / places
+        "P27": "country",          # for individuals (country of citizenship)
+        "P106": "occupation",
+        "P108": "employer",
+        "P569": "date_of_birth",
+        "P19": "place_of_birth",
+    }
+    params = {
+        "action": "wbgetentities",
+        "ids": qid,
+        "props": "claims",
+        "format": "json",
+    }
+    url = f"{WD_ACTION}?{urllib.parse.urlencode(params)}"
+    try:
+        data = get_json(url)
+    except Exception as e:  # noqa: BLE001
+        print(f"Wikidata wbgetentities for {qid} failed: {e}", file=sys.stderr)
+        return {}
+    if not isinstance(data, dict):
+        return {}
+    claims = (data.get("entities", {}).get(qid, {}) or {}).get("claims", {}) or {}
+
+    # Collect raw values (QIDs or literals) and remember which slot each
+    # came from. Date literals come back as ISO strings; QIDs need a label
+    # resolution pass.
+    qid_to_slots: dict[str, list[str]] = {}
+    facts: dict[str, list[str]] = {}
+    for prop_id, slot in interesting.items():
+        for claim in claims.get(prop_id, []) or []:
+            v = (claim.get("mainsnak", {}) or {}).get("datavalue", {}) or {}
+            vtype = v.get("type")
+            value = v.get("value")
+            if vtype == "wikibase-entityid" and isinstance(value, dict):
+                vqid = value.get("id", "")
+                if vqid:
+                    qid_to_slots.setdefault(vqid, [])
+                    if slot not in qid_to_slots[vqid]:
+                        qid_to_slots[vqid].append(slot)
+            elif vtype == "time" and isinstance(value, dict):
+                raw = value.get("time", "") or ""
+                # +1955-10-28T00:00:00Z → 1955-10-28
+                m = re.search(r"[+-]?(\d{4})-(\d{2})-(\d{2})", raw)
+                if m:
+                    facts.setdefault(slot, []).append(
+                        f"{m.group(1)}-{m.group(2)}-{m.group(3)}"
+                    )
+            elif vtype == "string":
+                facts.setdefault(slot, []).append(str(value))
+
+    # Resolve labels for all referenced QIDs in batches of up to 50 per request.
+    qids = list(qid_to_slots)
+    for i in range(0, len(qids), 50):
+        batch = qids[i : i + 50]
+        params = {
+            "action": "wbgetentities",
+            "ids": "|".join(batch),
+            "props": "labels",
+            "languages": "en",
+            "format": "json",
+        }
+        url = f"{WD_ACTION}?{urllib.parse.urlencode(params)}"
+        try:
+            data = get_json(url)
+        except Exception as e:  # noqa: BLE001
+            print(f"Wikidata label batch failed: {e}", file=sys.stderr)
+            continue
+        if not isinstance(data, dict):
+            continue
+        ents = data.get("entities", {}) or {}
+        for vqid, ent in ents.items():
+            label = (ent.get("labels", {}).get("en", {}) or {}).get("value", "") or vqid
+            for slot in qid_to_slots.get(vqid, []):
+                facts.setdefault(slot, []).append(label)
+
+    # Deduplicate per slot, preserving order.
+    deduped: dict[str, list[str]] = {}
+    for slot, vals in facts.items():
+        seen = set()
+        out = []
+        for v in vals:
+            if v in seen:
+                continue
+            seen.add(v)
+            out.append(v)
+        deduped[slot] = out
+    return deduped
+
+
+def _wd_qid_for_title(title: str) -> str:
+    """Get the Wikidata QID associated with a Wikipedia article title."""
+    params = {
+        "action": "query",
+        "format": "json",
+        "prop": "pageprops",
+        "ppprop": "wikibase_item",
+        "titles": title,
+        "redirects": 1,
+    }
+    url = f"{WP_OPENSEARCH}?{urllib.parse.urlencode(params)}"
+    try:
+        data = get_json(url)
+    except Exception:  # noqa: BLE001
+        return ""
+    if not isinstance(data, dict):
+        return ""
+    pages = data.get("query", {}).get("pages", {}) or {}
+    for page in pages.values():
+        qid = (page.get("pageprops") or {}).get("wikibase_item", "")
+        if qid:
+            return qid
+    return ""
+
+
+def fetch(query: str, limit: int, no_wikidata: bool, out_path: str) -> int:
+    hits = _wp_search(query, limit)
+    rows: list[dict[str, str]] = []
+    for hit in hits[:limit]:
+        title = hit.get("title", "")
+        if not title:
+            continue
+        summary = _wp_summary(title)
+        qid = _wd_qid_for_title(title) if not no_wikidata else ""
+        facts: dict = {}
+        if qid:
+            facts = _wd_lookup_by_qid(qid)
+        rows.append(
+            {
+                "source": "wikipedia+wikidata" if qid else "wikipedia",
+                "label": title,
+                "description": (summary.get("description") or hit.get("description") or "").strip(),
+                "qid": qid,
+                "wikipedia_title": title,
+                "wikipedia_url": hit.get("url", ""),
+                "wikidata_url": f"https://www.wikidata.org/wiki/{qid}" if qid else "",
+                "instance_of": "; ".join(facts.get("instance_of", [])),
+                "country": "; ".join(facts.get("country", [])),
+                "occupation": "; ".join(facts.get("occupation", [])),
+                "employer": "; ".join(facts.get("employer", [])),
+                "date_of_birth": "; ".join(facts.get("date_of_birth", []))[:10] if facts.get("date_of_birth") else "",
+                "place_of_birth": "; ".join(facts.get("place_of_birth", [])),
+                "summary": (summary.get("extract") or "").replace("\n", " ")[:1000],
+            }
+        )
+
+    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
+    with open(out_path, "w", newline="", encoding="utf-8") as fh:
+        w = csv.DictWriter(fh, fieldnames=COLUMNS)
+        w.writeheader()
+        w.writerows(rows)
+    if not rows:
+        print(
+            f"Wikipedia: 0 articles for query={query!r}. "
+            "Private individuals not notable enough for a Wikipedia article "
+            "won't appear here (the bar is real).",
+            file=sys.stderr,
+        )
+    return len(rows)
+
+
+def main() -> int:
+    p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    p.add_argument("--query", required=True, help="Entity name (person, company, place, concept)")
+    p.add_argument("--limit", type=int, default=5)
+    p.add_argument(
+        "--no-wikidata",
+        action="store_true",
+        help="Skip the Wikidata enrichment (faster, less detail)",
+    )
+    p.add_argument("--out", required=True)
+    a = p.parse_args()
+    n = fetch(query=a.query, limit=a.limit, no_wikidata=a.no_wikidata, out_path=a.out)
+    print(f"Wrote {n} Wikipedia/Wikidata rows to {a.out}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
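
`_wd_lookup_by_qid` turns Wikidata time literals such as `+1955-10-28T00:00:00Z` into `YYYY-MM-DD` strings. Wikidata time values also carry a `precision` field (11 = day, 10 = month, 9 = year), which the script does not read. A hedged sketch of a precision-aware variant, where `wikidata_time_to_date` is a hypothetical helper and not the script's behavior:

```python
import re

_WD_TIME = re.compile(r"[+-]?(\d{4})-(\d{2})-(\d{2})")


def wikidata_time_to_date(value: dict) -> str:
    """Format a Wikidata time datavalue, trimming parts below its precision."""
    match = _WD_TIME.match(value.get("time", "") or "")
    if not match:
        return ""
    year, month, day = match.groups()
    precision = value.get("precision", 11)  # assume day precision when absent
    if precision >= 11:
        return f"{year}-{month}-{day}"
    if precision == 10:
        return f"{year}-{month}"
    return year  # year precision or coarser: keep only the year


print(wikidata_time_to_date({"time": "+1955-10-28T00:00:00Z", "precision": 11}))
print(wikidata_time_to_date({"time": "+1955-00-00T00:00:00Z", "precision": 9}))
```

Trimming by precision keeps placeholder components like `-00-00` out of the CSV, where they could be mistaken for a real month and day.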
diff --git a/optional-skills/research/osint-investigation/scripts/timing_analysis.py b/optional-skills/research/osint-investigation/scripts/timing_analysis.py
new file mode 100644
index 00000000000..4e0ece227b4
--- /dev/null
+++ b/optional-skills/research/osint-investigation/scripts/timing_analysis.py
@@ -0,0 +1,253 @@
+#!/usr/bin/env python3
+"""Permutation-style (Monte Carlo) test for donation/contract timing (stdlib-only).
+
+For each (donor, recipient) pair, compute the mean number of days between each
+donation and the nearest contract award. Then, N times, redraw the same number
+of award dates uniformly at random within the observation window and compute
+the same statistic. The one-tailed p-value is the fraction of draws whose mean
+is <= the observed mean (smaller distance = tighter clustering).
+
+Adapted from ShinMegamiBoson/OpenPlanter (MIT). Differences:
+  - Pure stdlib (no pandas / numpy)
+  - Domain-agnostic (no snow-vendor / CRITICAL-politician filter)
+  - Configurable column names via flags
+  - Optional --seed for reproducibility
+"""
+from __future__ import annotations
+
+import argparse
+import csv
+import datetime as dt
+import json
+import math
+import random
+import statistics
+from collections import defaultdict
+from pathlib import Path
+
+_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y/%m/%d", "%m-%d-%Y", "%Y%m%d")
+
+
+def parse_date(raw: str) -> dt.date | None:
+    if not raw:
+        return None
+    raw = raw.strip()
+    for fmt in _DATE_FORMATS:
+        try:
+            return dt.datetime.strptime(raw, fmt).date()
+        except ValueError:
+            continue
+    return None
+
+
+def _read(path: str) -> list[dict[str, str]]:
+    with open(path, newline="", encoding="utf-8") as fh:
+        return list(csv.DictReader(fh))
+
+
+def _nearest_distance(donation_date: dt.date, awards: list[dt.date]) -> int:
+    """Absolute days to nearest award date."""
+    return min(abs((donation_date - a).days) for a in awards)
+
+
+def _permute(
+    awards_count: int,
+    donations: list[dt.date],
+    date_min: dt.date,
+    date_max: dt.date,
+    rng: random.Random,
+) -> float:
+    """One permutation: draw uniform random award dates, compute mean nearest-distance."""
+    span_days = (date_max - date_min).days or 1
+    rand_awards = [
+        date_min + dt.timedelta(days=rng.randint(0, span_days))
+        for _ in range(awards_count)
+    ]
+    distances = [_nearest_distance(d, rand_awards) for d in donations]
+    return statistics.mean(distances)
+
+
+def analyze(
+    donations_path: str,
+    donation_date_col: str,
+    donation_amount_col: str,
+    donation_donor_col: str,
+    donation_recipient_col: str,
+    contracts_path: str,
+    contract_date_col: str,
+    contract_vendor_col: str,
+    cross_links_path: str | None,
+    n_permutations: int = 1000,
+    min_donations: int = 3,
+    p_threshold: float = 0.05,
+    seed: int | None = None,
+    out_path: str = "timing.json",
+) -> dict:
+    rng = random.Random(seed)
+
+    donations = _read(donations_path)
+    contracts = _read(contracts_path)
+
+    # Allow optional join through cross_links — donor (left) ↔ vendor (right).
+    # When present, donor strings get mapped to matched vendor names so the
+    # vendor-date index lookup actually finds the contracts.
+    matched_pairs: set[tuple[str, str]] | None = None
+    donor_to_vendors: dict[str, set[str]] = defaultdict(set)
+    if cross_links_path:
+        matched_pairs = set()
+        for row in _read(cross_links_path):
+            left = row.get("left_name", "")
+            right = row.get("right_name", "")
+            matched_pairs.add((left, right))
+            donor_to_vendors[left].add(right)
+
+    # Index contract dates by vendor name.
+    vendor_to_award_dates: dict[str, list[dt.date]] = defaultdict(list)
+    all_award_dates: list[dt.date] = []
+    for row in contracts:
+        d = parse_date(row.get(contract_date_col, ""))
+        if not d:
+            continue
+        vendor_to_award_dates[row.get(contract_vendor_col, "").strip()].append(d)
+        all_award_dates.append(d)
+
+    if not all_award_dates:
+        raise SystemExit(f"No parseable dates in {contracts_path}/{contract_date_col}")
+    global_min = min(all_award_dates)
+    global_max = max(all_award_dates)
+
+    # Group donations by (donor, recipient).
+    grouped: dict[tuple[str, str], list[tuple[dt.date, float]]] = defaultdict(list)
+    for row in donations:
+        donor = row.get(donation_donor_col, "").strip()
+        recip = row.get(donation_recipient_col, "").strip()
+        d = parse_date(row.get(donation_date_col, ""))
+        try:
+            amt = float(row.get(donation_amount_col, "0") or 0)
+        except ValueError:
+            amt = 0.0
+        if not (donor and recip and d):
+            continue
+        grouped[(donor, recip)].append((d, amt))
+
+    results = []
+    skipped = 0
+    for (donor, recip), records in grouped.items():
+        if len(records) < min_donations:
+            skipped += 1
+            continue
+        # Only test if donor appears in cross-links (when provided). The
+        # (donor, candidate) tuple itself is NOT what's in matched_pairs —
+        # cross_links pairs are (donor, vendor). We use the cross-link to
+        # map donor → vendor name(s) so the vendor-date index resolves.
+        if matched_pairs is not None and donor not in donor_to_vendors:
+            skipped += 1
+            continue
+        # Try donor→awards, then recipient→awards, then cross-linked vendor names.
+        award_dates = list(vendor_to_award_dates.get(donor, []))
+        if not award_dates:
+            award_dates = list(vendor_to_award_dates.get(recip, []))
+        if not award_dates and donor_to_vendors.get(donor):
+            for vendor_name in donor_to_vendors[donor]:
+                award_dates.extend(vendor_to_award_dates.get(vendor_name, []))
+        if not award_dates:
+            skipped += 1
+            continue
+
+        donation_dates = [d for (d, _) in records]
+        observed = statistics.mean(
+            _nearest_distance(d, award_dates) for d in donation_dates
+        )
+
+        permuted_means = [
+            _permute(len(award_dates), donation_dates, global_min, global_max, rng)
+            for _ in range(n_permutations)
+        ]
+        p_value = sum(1 for m in permuted_means if m <= observed) / n_permutations
+        null_mean = statistics.mean(permuted_means)
+        null_std = statistics.pstdev(permuted_means) or 1.0
+        effect_size = (null_mean - observed) / null_std
+
+        results.append(
+            {
+                "donor": donor,
+                "recipient": recip,
+                "n_donations": len(records),
+                "n_award_dates": len(award_dates),
+                "observed_mean_days": round(observed, 2),
+                "null_mean_days": round(null_mean, 2),
+                "p_value": round(p_value, 4),
+                "effect_size_sd": round(effect_size, 2),
+                "significant": p_value < p_threshold,
+                "total_donation_amount": round(sum(a for (_, a) in records), 2),
+            }
+        )
+
+    results.sort(key=lambda r: r["p_value"])
+
+    payload = {
+        "metadata": {
+            "n_permutations": n_permutations,
+            "min_donations": min_donations,
+            "p_threshold": p_threshold,
+            "seed": seed,
+            "n_pairs_tested": len(results),
+            "n_pairs_skipped": skipped,
+            "n_significant": sum(1 for r in results if r["significant"]),
+            "observation_window": [global_min.isoformat(), global_max.isoformat()],
+        },
+        "results": results,
+    }
+
+    Path(out_path).write_text(json.dumps(payload, indent=2))
+    return payload
+
+
+def main() -> int:
+    p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    p.add_argument("--donations", required=True)
+    p.add_argument("--donation-date-col", required=True)
+    p.add_argument("--donation-amount-col", required=True)
+    p.add_argument("--donation-donor-col", required=True)
+    p.add_argument("--donation-recipient-col", required=True)
+    p.add_argument("--contracts", required=True)
+    p.add_argument("--contract-date-col", required=True)
+    p.add_argument("--contract-vendor-col", required=True)
+    p.add_argument(
+        "--cross-links",
+        help="Optional cross_links.csv to restrict (donor, vendor) pairs",
+    )
+    p.add_argument("--permutations", type=int, default=1000)
+    p.add_argument("--min-donations", type=int, default=3)
+    p.add_argument("--p-threshold", type=float, default=0.05)
+    p.add_argument("--seed", type=int)
+    p.add_argument("--out", default="timing.json")
+    a = p.parse_args()
+
+    payload = analyze(
+        donations_path=a.donations,
+        donation_date_col=a.donation_date_col,
+        donation_amount_col=a.donation_amount_col,
+        donation_donor_col=a.donation_donor_col,
+        donation_recipient_col=a.donation_recipient_col,
+        contracts_path=a.contracts,
+        contract_date_col=a.contract_date_col,
+        contract_vendor_col=a.contract_vendor_col,
+        cross_links_path=a.cross_links,
+        n_permutations=a.permutations,
+        min_donations=a.min_donations,
+        p_threshold=a.p_threshold,
+        seed=a.seed,
+        out_path=a.out,
+    )
+    meta = payload["metadata"]
+    print(
+        f"Tested {meta['n_pairs_tested']} pairs ({meta['n_pairs_skipped']} skipped). "
+        f"Significant (p<{meta['p_threshold']}): {meta['n_significant']}. "
+        f"Wrote {a.out}"
+    )
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
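The core of `analyze` above is a one-tailed permutation test. A minimal standalone sketch follows, assuming that `_nearest_distance` returns the gap in days to the closest award and that `_permute` redraws the same number of award dates uniformly inside the observation window. Neither helper's body appears in this part of the diff, so the function names and the uniform-redraw null below are illustrative, not the script's actual code.

```python
import datetime as dt
import random
import statistics


def nearest_distance(day: dt.date, award_dates: list[dt.date]) -> int:
    """Absolute gap in days between `day` and the closest award date."""
    return min(abs((day - a).days) for a in award_dates)


def permuted_mean(n_awards, event_dates, window_start, window_end, rng):
    """Mean nearest-award distance after redrawing award dates uniformly (assumed null)."""
    span = (window_end - window_start).days
    fake_awards = [
        window_start + dt.timedelta(days=rng.randint(0, span))
        for _ in range(n_awards)
    ]
    return statistics.mean(nearest_distance(d, fake_awards) for d in event_dates)


def timing_p_value(event_dates, award_dates, window_start, window_end,
                   n_permutations=1000, seed=None):
    """Return (observed mean distance, one-tailed p-value)."""
    rng = random.Random(seed)  # a fixed seed makes the run reproducible
    observed = statistics.mean(nearest_distance(d, award_dates) for d in event_dates)
    null = [
        permuted_mean(len(award_dates), event_dates, window_start, window_end, rng)
        for _ in range(n_permutations)
    ]
    # Fraction of permutations at least as tightly clustered as the real data.
    p = sum(1 for m in null if m <= observed) / n_permutations
    return observed, p


awards = [dt.date(2024, 2, 1), dt.date(2024, 6, 15), dt.date(2024, 11, 3)]
events = list(awards)  # events landing exactly on award days: the extreme case
observed, p = timing_p_value(events, awards, dt.date(2024, 1, 1), dt.date(2024, 12, 31), seed=7)
```

With events that coincide with awards the observed mean is zero, so almost no random redraw can match it and the p-value comes out very small. A small p-value says only that the clustering is unlikely under this null.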
diff --git a/optional-skills/research/osint-investigation/templates/source-template.md b/optional-skills/research/osint-investigation/templates/source-template.md
new file mode 100644
index 00000000000..b023cc26888
--- /dev/null
+++ b/optional-skills/research/osint-investigation/templates/source-template.md
@@ -0,0 +1,59 @@
+# <Source Name>
+
+## 1. Summary
+
+What this data source is, who publishes it, why it matters for investigations.
+
+## 2. Access Methods
+
+- API endpoint(s)
+- Bulk download URLs
+- Auth requirements (none / API key / OAuth)
+- Rate limits
+
+## 3. Data Schema
+
+Key fields, record types, table relationships. List the columns the fetch
+script emits.
+
+## 4. Coverage
+
+- Jurisdiction
+- Time range
+- Update frequency
+- Data volume (rows / GB)
+
+## 5. Cross-Reference Potential
+
+Which other sources can be joined and on what keys. Be explicit:
+
+- `<source>` ↔ `<column>` (join key: <normalized entity name / EIN / CIK / etc.>)
+
+## 6. Data Quality
+
+Known issues — formatting inconsistencies, missing fields, duplicates,
+historical gaps, redaction.
+
+## 7. Acquisition Script
+
+Path: `scripts/fetch_<source>.py`
+
+Example:
+
+```bash
+python3 SKILL_DIR/scripts/fetch_<source>.py --<filter> <value> --out data/<source>.csv
+```
+
+Output CSV columns: `<col1>, <col2>, ...`
+
+## 8. Legal & Licensing
+
+- Public records law / FOIA basis
+- Terms of use / acceptable use
+- Attribution requirements (if any)
+
+## 9. References
+
+- Official docs: <url>
+- Data dictionary: <url>
+- Related coverage / journalism: <url>
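Section 7 of the template asks for a fetch script that emits a fixed set of CSV columns. One possible shape for the output half of such a script is sketched below using only the stdlib. The column names and `write_normalized_csv` are hypothetical placeholders, and the network half is omitted entirely.

```python
import csv
from pathlib import Path

# Hypothetical column set for an imaginary source; a real script defines its own.
COLUMNS = ["entity_name", "record_id", "record_date", "source_url"]


def write_normalized_csv(records, out_path, columns=COLUMNS):
    """Write records with a fixed header; unknown keys are dropped, missing ones left blank."""
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    # Opening in "w" mode overwrites any earlier output, so re-runs give the same file.
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=columns, extrasaction="ignore", restval="")
        writer.writeheader()
        for rec in records:
            cleaned = {k: str(v).strip() for k, v in rec.items() if v is not None}
            writer.writerow(cleaned)
            written += 1
    return written
```

Fixing the header up front keeps the output schema stable even when upstream records gain or lose fields, which is what downstream joins rely on.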
diff --git a/website/docs/reference/optional-skills-catalog.md b/website/docs/reference/optional-skills-catalog.md
index d1544ce89b9..ce1861431a6 100644
--- a/website/docs/reference/optional-skills-catalog.md
+++ b/website/docs/reference/optional-skills-catalog.md
@@ -167,6 +167,7 @@ hermes skills uninstall <skill-name>
 | [**drug-discovery**](/docs/user-guide/skills/optional/research/research-drug-discovery) | Pharmaceutical research assistant for drug discovery workflows. Search bioactive compounds on ChEMBL, calculate drug-likeness (Lipinski Ro5, QED, TPSA, synthetic accessibility), look up drug-drug interactions via OpenFDA, interpret ADMET... |
 | [**duckduckgo-search**](/docs/user-guide/skills/optional/research/research-duckduckgo-search) | Free web search via DuckDuckGo — text, news, images, videos. No API key needed. Prefer the `ddgs` CLI when installed; use the Python DDGS library only after verifying that `ddgs` is available in the current runtime. |
 | [**gitnexus-explorer**](/docs/user-guide/skills/optional/research/research-gitnexus-explorer) | Index a codebase with GitNexus and serve an interactive knowledge graph via web UI + Cloudflare tunnel. |
+| [**osint-investigation**](/docs/user-guide/skills/optional/research/research-osint-investigation) | Public-records OSINT investigation framework — SEC EDGAR filings, USAspending contracts, Senate lobbying, OFAC sanctions, ICIJ offshore leaks, NYC property records (ACRIS), OpenCorporates registries, CourtListener court records, Wayback... |
 | [**parallel-cli**](/docs/user-guide/skills/optional/research/research-parallel-cli) | Optional vendor skill for Parallel CLI — agent-native web search, extraction, deep research, enrichment, FindAll, and monitoring. Prefer JSON output and non-interactive flows. |
 | [**qmd**](/docs/user-guide/skills/optional/research/research-qmd) | Search personal knowledge bases, notes, docs, and meeting transcripts locally using qmd — a hybrid retrieval engine with BM25, vector search, and LLM reranking. Supports CLI and MCP integration. |
 | [**scrapling**](/docs/user-guide/skills/optional/research/research-scrapling) | Web scraping with Scrapling - HTTP fetching, stealth browser automation, Cloudflare bypass, and spider crawling via CLI and Python. |
diff --git a/website/docs/user-guide/skills/optional/research/research-osint-investigation.md b/website/docs/user-guide/skills/optional/research/research-osint-investigation.md
new file mode 100644
index 00000000000..7428c3022b2
--- /dev/null
+++ b/website/docs/user-guide/skills/optional/research/research-osint-investigation.md
@@ -0,0 +1,294 @@
+---
+title: "OSINT Investigation"
+sidebar_label: "OSINT Investigation"
+description: "Public-records OSINT investigation framework — SEC EDGAR filings, USAspending contracts, Senate lobbying, OFAC sanctions, ICIJ offshore leaks, NYC property r..."
+---
+
+{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}
+
+# OSINT Investigation
+
+Public-records OSINT investigation framework — SEC EDGAR filings, USAspending contracts, Senate lobbying, OFAC sanctions, ICIJ offshore leaks, NYC property records (ACRIS), OpenCorporates registries, CourtListener court records, Wayback Machine archives, Wikipedia + Wikidata, GDELT news monitoring. Entity resolution across sources, cross-link analysis, timing correlation, evidence chains. Python stdlib only.
+
+## Skill metadata
+
+| | |
+|---|---|
+| Source | Optional — install with `hermes skills install official/research/osint-investigation` |
+| Path | `optional-skills/research/osint-investigation` |
+| Version | `0.1.0` |
+| Author | Hermes Agent (adapted from ShinMegamiBoson/OpenPlanter, MIT) |
+| Platforms | linux, macos, windows |
+| Tags | `osint`, `investigation`, `public-records`, `sec`, `sanctions`, `corporate-registry`, `property`, `courts`, `due-diligence`, `journalism` |
+| Related skills | [`domain-intel`](/docs/user-guide/skills/optional/research/research-domain-intel), [`arxiv`](/docs/user-guide/skills/bundled/research/research-arxiv) |
+
+## Reference: full SKILL.md
+
+:::info
+The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
+:::
+
+# OSINT Investigation — Public Records Cross-Reference
+
+Investigative framework for public-records OSINT: government contracts,
+corporate filings, lobbying, sanctions, offshore leaks, property records,
+court records, web archives, knowledge bases, and global news. Resolve
+entities across heterogeneous sources, build cross-links with explicit
+confidence, run statistical timing tests, and produce structured evidence
+chains.
+
+**Python stdlib only.** Zero install. Works on Linux, macOS, Windows. Most
+sources work with no API key (OpenCorporates has an optional free token
+that raises rate limits).
+
+Adapted from the MIT-licensed ShinMegamiBoson/OpenPlanter project; expanded
+to cover identity / property / litigation / archives / news sources that
+the original didn't address.
+
+## When to use this skill
+
+Use when the user asks for:
+
+- "follow the money" — government contracts, lobbying → legislation, sanctions
+- corporate due diligence — who controls company X, where are they
+  incorporated, who serves on their boards, what filings have they made
+- sanctions screening — is entity X on OFAC SDN, ICIJ offshore leaks
+- pay-to-play investigation — contractors with offshore ties, lobbying
+  clients winning awards
+- property ownership — find recorded deeds/mortgages by name or address
+  (NYC; for other counties point users at the relevant recorder)
+- litigation history — find federal + state court opinions and PACER dockets
+- multi-source entity resolution where naming varies (LLC suffixes, abbreviations)
+- evidence-chain construction with explicit confidence levels
+- "what's been said about X" — international news (GDELT) + Wikipedia
+  narrative + Wayback Machine to recover dead URLs
+
+Do NOT use this skill for:
+
+- general web research → `web_search` / `web_extract`
+- domain/infrastructure OSINT → `domain-intel` skill
+- academic literature → `arxiv` skill
+- social-media profile discovery → `sherlock` skill (optional)
+- US **federal** campaign finance — FEC is intentionally NOT covered here
+  (the API is unreliable for ad-hoc contributor-name queries on the free
+  DEMO_KEY tier). For federal donations, point users at
+  https://www.fec.gov/data/ directly.
+
+## Workflow
+
+The agent runs scripts via the `terminal` tool. `SKILL_DIR` is the directory
+holding this SKILL.md.
+
+### 1. Identify which sources apply
+
+Read the data-source wiki entries to plan the investigation:
+
+```
+ls SKILL_DIR/references/sources/
+
+# Federal financial / regulatory
+cat SKILL_DIR/references/sources/sec-edgar.md       # corporate filings
+cat SKILL_DIR/references/sources/usaspending.md     # federal contracts
+cat SKILL_DIR/references/sources/senate-ld.md       # lobbying
+cat SKILL_DIR/references/sources/ofac-sdn.md        # sanctions
+cat SKILL_DIR/references/sources/icij-offshore.md   # offshore leaks
+
+# Identity / property / litigation / archives / news
+cat SKILL_DIR/references/sources/nyc-acris.md       # NYC property records
+cat SKILL_DIR/references/sources/opencorporates.md  # global corporate registry
+cat SKILL_DIR/references/sources/courtlistener.md   # court records (federal + state)
+cat SKILL_DIR/references/sources/wayback.md         # Wayback Machine archives
+cat SKILL_DIR/references/sources/wikipedia.md       # Wikipedia + Wikidata
+cat SKILL_DIR/references/sources/gdelt.md           # global news monitoring
+```
+
+Each entry follows a 9-section template: summary, access, schema, coverage,
+cross-reference keys, data quality, acquisition, legal, references.
+
+The **cross-reference potential** section maps join keys between sources — read
+those first to pick the right pair.
+
+### 2. Acquire data
+
+Each source has a stdlib-only fetch script in `SKILL_DIR/scripts/`:
+
+**Federal financial / regulatory**
+
+```bash
+# SEC EDGAR filings (corporate disclosures)
+python3 SKILL_DIR/scripts/fetch_sec_edgar.py --cik 0000320193 \
+    --types 10-K,10-Q --out data/edgar_filings.csv
+
+# USAspending federal contracts
+python3 SKILL_DIR/scripts/fetch_usaspending.py --recipient "EXAMPLE CORP" \
+    --fy 2024 --out data/contracts.csv
+
+# Senate LD-1 / LD-2 lobbying disclosures
+python3 SKILL_DIR/scripts/fetch_senate_ld.py --client "EXAMPLE CORP" \
+    --year 2024 --out data/lobbying.csv
+
+# OFAC SDN sanctions list (full snapshot)
+python3 SKILL_DIR/scripts/fetch_ofac_sdn.py --out data/ofac_sdn.csv
+
+# ICIJ Offshore Leaks — downloads ~70 MB bulk CSV on first use,
+# then searches it locally. Cached for 30 days under
+# $HERMES_OSINT_CACHE/icij/ (default: ~/.cache/hermes-osint/icij/).
+python3 SKILL_DIR/scripts/fetch_icij_offshore.py --entity "EXAMPLE CORP" \
+    --out data/icij.csv
+```
+
+**Identity / property / litigation / archives / news**
+
+```bash
+# NYC property records (deeds, mortgages, liens) — ACRIS via Socrata
+python3 SKILL_DIR/scripts/fetch_nyc_acris.py --name "SMITH, JOHN" \
+    --out data/acris.csv
+python3 SKILL_DIR/scripts/fetch_nyc_acris.py --address "571 HUDSON" \
+    --out data/acris_addr.csv
+
+# OpenCorporates — 130+ jurisdiction corporate registry
+# (free token required; set OPENCORPORATES_API_TOKEN or pass --token)
+python3 SKILL_DIR/scripts/fetch_opencorporates.py --query "Example Corp" \
+    --jurisdiction us_ny --out data/opencorporates.csv
+
+# CourtListener — federal + state court opinions, PACER dockets
+python3 SKILL_DIR/scripts/fetch_courtlistener.py --query "Smith v. Example Corp" \
+    --type opinions --out data/courts.csv
+
+# Wayback Machine — historical web captures
+python3 SKILL_DIR/scripts/fetch_wayback.py --url "example.com" \
+    --match host --collapse digest --out data/wayback.csv
+
+# Wikipedia + Wikidata — narrative bio + structured facts
+# Set HERMES_OSINT_UA=your-app/1.0 (your@email) to identify yourself
+python3 SKILL_DIR/scripts/fetch_wikipedia.py --query "Bill Gates" \
+    --out data/wp.csv
+
+# GDELT — global news in 100+ languages, ~2015→present
+python3 SKILL_DIR/scripts/fetch_gdelt.py --query '"Example Corp"' \
+    --timespan 1y --out data/gdelt.csv
+```
+
+All outputs are normalized CSV with a header row. Scripts are idempotent, so re-running them is safe.
+
+When a subject isn't in a source (e.g. SEC EDGAR for someone with no
+public-company ties, USAspending for someone who isn't a federal contractor,
+Senate LDA for someone who isn't a lobbying client), the script returns 0 rows
+with a clear warning rather than silently writing an empty CSV. EDGAR
+specifically flags when the company-name resolver matched an individual
+Form 3/4/5 filer rather than a corporate registrant.
+
+Rate-limit notes are in each source's wiki entry. Default fetchers sleep
+politely between paginated requests. **API keys raise rate limits** for
+sources that support them (`SEC_USER_AGENT`, `SENATE_LDA_TOKEN`,
+`OPENCORPORATES_API_TOKEN`, `COURTLISTENER_TOKEN`). All scripts surface
+429 responses immediately with the upstream's quota message so the user
+knows to slow down or supply a key.
+
+### 3. Resolve entities across sources
+
+Normalize names and find matches between two CSV files:
+
+```bash
+# Match lobbying clients (Senate LDA) against contract recipients (USAspending)
+python3 SKILL_DIR/scripts/entity_resolution.py \
+    --left  data/lobbying.csv   --left-name-col  client_name \
+    --right data/contracts.csv  --right-name-col recipient_name \
+    --out data/cross_links.csv
+```
+
+Three matching tiers with explicit confidence:
+
+| Tier | Method | Confidence |
+|------|--------|------------|
+| `exact` | Normalized strings equal after suffix/punctuation strip | high |
+| `fuzzy` | Sorted-token equality (word-bag match) | medium |
+| `token_overlap` | ≥60% token overlap, ≥2 shared tokens, tokens ≥4 chars | low |
+
+Output `cross_links.csv` columns: `match_type, confidence, left_name,
+right_name, left_normalized, right_normalized, left_row, right_row`.
+
+### 4. Statistical timing correlation (optional)
+
+Test whether two time series cluster suspiciously close together — e.g.
+lobbying filings near contract awards — using a permutation test:
+
+```bash
+python3 SKILL_DIR/scripts/timing_analysis.py \
+    --donations data/lobbying.csv --donation-date-col filing_date \
+        --donation-amount-col income --donation-donor-col client_name \
+        --donation-recipient-col registrant_name \
+    --contracts data/contracts.csv --contract-date-col award_date \
+        --contract-vendor-col recipient_name \
+    --cross-links data/cross_links.csv \
+    --permutations 1000 \
+    --out data/timing.json
+```
+
+The script's column flags are intentionally generic: the original tool was
+written for donations vs awards, but it works for any dated event series
+that cross-links can tie to a vendor. Null hypothesis: event timing is
+independent of award dates. One-tailed p-value = fraction of permutations
+with mean nearest-award distance ≤ observed. By default a (donor,
+recipient) pair needs at least 3 events to be tested (`--min-donations`).
+
+### 5. Build the findings JSON (evidence chain)
+
+```bash
+python3 SKILL_DIR/scripts/build_findings.py \
+    --cross-links data/cross_links.csv \
+    --timing data/timing.json \
+    --out data/findings.json
+```
+
+Every finding has `id, title, severity, confidence, summary, evidence[], sources[]`.
+Each evidence item points back to a specific row in a source CSV. The user (or a
+follow-up agent) can verify every claim against its source.
+
+## Confidence and evidence discipline
+
+This is the load-bearing rule of the skill. Tell the user:
+
+- Every claim must trace to a record. No naked assertions.
+- Confidence tier travels with the claim. `match_type=fuzzy` is "probable",
+  not "confirmed."
+- Entity resolution produces candidates, NOT conclusions. A `fuzzy` match
+  between "ACME LLC" and "Acme Holdings Group" is a lead, not a fact.
+- Statistical significance ≠ wrongdoing. p &lt; 0.05 means the timing pattern
+  is unlikely under the null. It does not establish corruption.
+- All data sources here are public records. They may still contain
+  inaccuracies, stale info, or redactions (GDPR, sealed records).
+
+## Adding a new data source
+
+Use the template:
+
+```bash
+cp SKILL_DIR/templates/source-template.md \
+    SKILL_DIR/references/sources/<your-source>.md
+```
+
+Fill in all 9 sections. Write a `fetch_<source>.py` script in `scripts/` that
+uses stdlib only and writes a normalized CSV. Update the source lists in the
+"When to use" and "Workflow" sections above.
+
+## Tools and their limits
+
+- `entity_resolution.py` does NOT use external fuzzy libraries (no rapidfuzz,
+  no jellyfish). Token-bag matching is the upper bound here. If you need
+  Levenshtein, transliteration, or phonetic matching, pip-install separately.
+- `timing_analysis.py` uses Python's `random` for permutations. For
+  reproducibility, pass `--seed N`.
+- `fetch_*.py` scripts use `urllib.request` and respect `Retry-After`. Heavy
+  bulk usage may still violate ToS — read each source's legal section first.
+
+## Legal note
+
+All Phase-1 sources are public records. Bulk acquisition is permitted under
+their respective access terms (FOIA, public records law, ICIJ explicit
+publication, OFAC public data). However:
+
+- Some sources rate-limit aggressively. Respect their headers.
+- Some redact registrant info (GDPR on WHOIS, sealed filings).
+- Cross-referencing public records to identify private individuals can have
+  ethical implications. The skill produces evidence chains, not accusations.
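The three matching tiers in the table under "Resolve entities across sources" can be sketched like this. The suffix list, the helper names, and the choice to measure overlap against the smaller token set are assumptions made for illustration; `entity_resolution.py` may differ on all three.

```python
import re

# Hypothetical suffix list; the real script's list is not shown in this diff section.
SUFFIXES = {"llc", "inc", "corp", "corporation", "co", "ltd", "lp", "llp"}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and drop corporate suffix tokens."""
    tokens = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)


def match_tier(left: str, right: str):
    """Return (match_type, confidence), or None when no tier applies."""
    a, b = normalize(left), normalize(right)
    if not a or not b:
        return None
    if a == b:
        return ("exact", "high")
    if sorted(a.split()) == sorted(b.split()):
        return ("fuzzy", "medium")  # same word bag, different order
    # token_overlap: only tokens of 4+ characters count toward the overlap.
    ta = {t for t in a.split() if len(t) >= 4}
    tb = {t for t in b.split() if len(t) >= 4}
    shared = ta & tb
    if ta and tb and len(shared) >= 2 and len(shared) / min(len(ta), len(tb)) >= 0.6:
        return ("token_overlap", "low")
    return None


exact = match_tier("ACME Widgets, LLC", "Acme Widgets Inc.")
fuzzy = match_tier("Widgets Acme", "ACME Widgets")
overlap = match_tier("Acme Global Widgets Holdings", "Acme Widgets Group")
miss = match_tier("Acme LLC", "Beta Corp")
```

Checking the tiers in order from strictest to loosest means each pair is reported at the highest confidence it qualifies for, which matches the rule that the confidence tier travels with the claim.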
diff --git a/website/sidebars.ts b/website/sidebars.ts
index f619f2318c9..1a0aa6fb0bb 100644
--- a/website/sidebars.ts
+++ b/website/sidebars.ts
@@ -554,6 +554,7 @@ const sidebars: SidebarsConfig = {
                     'user-guide/skills/optional/research/research-drug-discovery',
                     'user-guide/skills/optional/research/research-duckduckgo-search',
                     'user-guide/skills/optional/research/research-gitnexus-explorer',
+                    'user-guide/skills/optional/research/research-osint-investigation',
                     'user-guide/skills/optional/research/research-parallel-cli',
                     'user-guide/skills/optional/research/research-qmd',
                     'user-guide/skills/optional/research/research-scrapling',