diff --git a/optional-skills/research/osint-investigation/SKILL.md b/optional-skills/research/osint-investigation/SKILL.md new file mode 100644 index 00000000000..b2da82fbd00 --- /dev/null +++ b/optional-skills/research/osint-investigation/SKILL.md @@ -0,0 +1,277 @@ +--- +name: osint-investigation +description: Public-records OSINT investigation framework — SEC EDGAR filings, USAspending contracts, Senate lobbying, OFAC sanctions, ICIJ offshore leaks, NYC property records (ACRIS), OpenCorporates registries, CourtListener court records, Wayback Machine archives, Wikipedia + Wikidata, GDELT news monitoring. Entity resolution across sources, cross-link analysis, timing correlation, evidence chains. Python stdlib only. +version: 0.1.0 +platforms: [linux, macos, windows] +author: Hermes Agent (adapted from ShinMegamiBoson/OpenPlanter, MIT) +metadata: + hermes: + tags: [osint, investigation, public-records, sec, sanctions, corporate-registry, property, courts, due-diligence, journalism] + category: research + related_skills: [domain-intel, arxiv] +--- + +# OSINT Investigation — Public Records Cross-Reference + +Investigative framework for public-records OSINT: government contracts, +corporate filings, lobbying, sanctions, offshore leaks, property records, +court records, web archives, knowledge bases, and global news. Resolve +entities across heterogeneous sources, build cross-links with explicit +confidence, run statistical timing tests, and produce structured evidence +chains. + +**Python stdlib only.** Zero install. Works on Linux, macOS, Windows. Most +sources work with no API key (OpenCorporates has an optional free token +that raises rate limits). + +Adapted from the MIT-licensed ShinMegamiBoson/OpenPlanter project; expanded +to cover identity / property / litigation / archives / news sources that +the original didn't address. 
+ +## When to use this skill + +Use when the user asks for: + +- "follow the money" — government contracts, lobbying → legislation, sanctions +- corporate due diligence — who controls company X, where are they + incorporated, who serves on their boards, what filings have they made +- sanctions screening — is entity X on OFAC SDN, ICIJ offshore leaks +- pay-to-play investigation — contractors with offshore ties, lobbying + clients winning awards +- property ownership — find recorded deeds/mortgages by name or address + (NYC; for other counties point users at the relevant recorder) +- litigation history — find federal + state court opinions and PACER dockets +- multi-source entity resolution where naming varies (LLC suffixes, abbreviations) +- evidence-chain construction with explicit confidence levels +- "what's been said about X" — international news (GDELT) + Wikipedia + narrative + Wayback Machine to recover dead URLs + +Do NOT use this skill for: + +- general web research → `web_search` / `web_extract` +- domain/infrastructure OSINT → `domain-intel` skill +- academic literature → `arxiv` skill +- social-media profile discovery → `sherlock` skill (optional) +- US **federal** campaign finance — FEC is intentionally NOT covered here + (the API is unreliable for ad-hoc contributor-name queries on the free + DEMO_KEY tier). For federal donations, point users at + https://www.fec.gov/data/ directly. + +## Workflow + +The agent runs scripts via the `terminal` tool. `SKILL_DIR` is the directory +holding this SKILL.md. + +### 1. 
Identify which sources apply + +Read the data-source wiki entries to plan the investigation: + +``` +ls SKILL_DIR/references/sources/ + +# Federal financial / regulatory +cat SKILL_DIR/references/sources/sec-edgar.md # corporate filings +cat SKILL_DIR/references/sources/usaspending.md # federal contracts +cat SKILL_DIR/references/sources/senate-ld.md # lobbying +cat SKILL_DIR/references/sources/ofac-sdn.md # sanctions +cat SKILL_DIR/references/sources/icij-offshore.md # offshore leaks + +# Identity / property / litigation / archives / news +cat SKILL_DIR/references/sources/nyc-acris.md # NYC property records +cat SKILL_DIR/references/sources/opencorporates.md # global corporate registry +cat SKILL_DIR/references/sources/courtlistener.md # court records (federal + state) +cat SKILL_DIR/references/sources/wayback.md # Wayback Machine archives +cat SKILL_DIR/references/sources/wikipedia.md # Wikipedia + Wikidata +cat SKILL_DIR/references/sources/gdelt.md # global news monitoring +``` + +Each entry follows a 9-section template: summary, access, schema, coverage, +cross-reference potential, data quality, acquisition, legal, references. + +The **Cross-Reference Potential** section of each entry maps join keys between +sources; read those sections first to pick the right pair of sources. + +### 2. 
Acquire data + +Each source has a stdlib-only fetch script in `SKILL_DIR/scripts/`: + +**Federal financial / regulatory** + +```bash +# SEC EDGAR filings (corporate disclosures) +python3 SKILL_DIR/scripts/fetch_sec_edgar.py --cik 0000320193 \ + --types 10-K,10-Q --out data/edgar_filings.csv + +# USAspending federal contracts +python3 SKILL_DIR/scripts/fetch_usaspending.py --recipient "EXAMPLE CORP" \ + --fy 2024 --out data/contracts.csv + +# Senate LD-1 / LD-2 lobbying disclosures +python3 SKILL_DIR/scripts/fetch_senate_ld.py --client "EXAMPLE CORP" \ + --year 2024 --out data/lobbying.csv + +# OFAC SDN sanctions list (full snapshot) +python3 SKILL_DIR/scripts/fetch_ofac_sdn.py --out data/ofac_sdn.csv + +# ICIJ Offshore Leaks — downloads ~70 MB bulk CSV on first use, +# then searches it locally. Cached for 30 days under +# $HERMES_OSINT_CACHE/icij/ (default: ~/.cache/hermes-osint/icij/). +python3 SKILL_DIR/scripts/fetch_icij_offshore.py --entity "EXAMPLE CORP" \ + --out data/icij.csv +``` + +**Identity / property / litigation / archives / news** + +```bash +# NYC property records (deeds, mortgages, liens) — ACRIS via Socrata +python3 SKILL_DIR/scripts/fetch_nyc_acris.py --name "SMITH, JOHN" \ + --out data/acris.csv +python3 SKILL_DIR/scripts/fetch_nyc_acris.py --address "571 HUDSON" \ + --out data/acris_addr.csv + +# OpenCorporates — 130+ jurisdiction corporate registry +# (free token required; set OPENCORPORATES_API_TOKEN or pass --token) +python3 SKILL_DIR/scripts/fetch_opencorporates.py --query "Example Corp" \ + --jurisdiction us_ny --out data/opencorporates.csv + +# CourtListener — federal + state court opinions, PACER dockets +python3 SKILL_DIR/scripts/fetch_courtlistener.py --query "Smith v. 
Example Corp" \ + --type opinions --out data/courts.csv + +# Wayback Machine — historical web captures +python3 SKILL_DIR/scripts/fetch_wayback.py --url "example.com" \ + --match host --collapse digest --out data/wayback.csv + +# Wikipedia + Wikidata — narrative bio + structured facts +# Set HERMES_OSINT_UA=your-app/1.0 (your@email) to identify yourself +python3 SKILL_DIR/scripts/fetch_wikipedia.py --query "Bill Gates" \ + --out data/wp.csv + +# GDELT — global news in 100+ languages, ~2015→present +python3 SKILL_DIR/scripts/fetch_gdelt.py --query '"Example Corp"' \ + --timespan 1y --out data/gdelt.csv +``` + +All outputs are normalized CSV with a header row. Re-run scripts idempotently. + +When a private individual won't be in a source (e.g. SEC EDGAR for a non-public- +company person, USAspending for someone who isn't a federal contractor, Senate +LDA for someone who isn't a lobbying client), the script returns 0 rows with a +clear warning rather than silently writing an empty CSV. EDGAR specifically +flags when the company-name resolver matched an individual Form 3/4/5 filer +rather than a corporate registrant. + +Rate-limit notes are in each source's wiki entry. Default fetchers sleep +politely between paginated requests. **API keys raise rate limits** for +sources that support them (`SEC_USER_AGENT`, `SENATE_LDA_TOKEN`, +`OPENCORPORATES_API_TOKEN`, `COURTLISTENER_TOKEN`). All scripts surface +429 responses immediately with the upstream's quota message so the user +knows to slow down or supply a key. + +### 3. 
Resolve entities across sources + +Normalize names and find matches between two CSV files: + +```bash +# Match lobbying clients (Senate LDA) against contract recipients (USAspending) +python3 SKILL_DIR/scripts/entity_resolution.py \ + --left data/lobbying.csv --left-name-col client_name \ + --right data/contracts.csv --right-name-col recipient_name \ + --out data/cross_links.csv +``` + +Three matching tiers with explicit confidence: + +| Tier | Method | Confidence | +|------|--------|------------| +| `exact` | Normalized strings equal after suffix/punctuation strip | high | +| `fuzzy` | Sorted-token equality (word-bag match) | medium | +| `token_overlap` | ≥60% token overlap, ≥2 shared tokens, tokens ≥4 chars | low | + +Output `cross_links.csv` columns: `match_type, confidence, left_name, +right_name, left_normalized, right_normalized, left_row, right_row`. + +### 4. Statistical timing correlation (optional) + +Test whether two time series cluster suspiciously close together — e.g. +lobbying filings near contract awards — using a permutation test: + +```bash +python3 SKILL_DIR/scripts/timing_analysis.py \ + --donations data/lobbying.csv --donation-date-col filing_date \ + --donation-amount-col income --donation-donor-col client_name \ + --donation-recipient-col registrant_name \ + --contracts data/contracts.csv --contract-date-col award_date \ + --contract-vendor-col recipient_name \ + --cross-links data/cross_links.csv \ + --permutations 1000 \ + --out data/timing.json +``` + +The script's column flags are intentionally generic — the original tool was +written for donations vs awards, but it works for any (event, payee) time +series joined through cross-links. Null hypothesis: event timing is +independent of award dates. One-tailed p-value = fraction of permutations +with mean nearest-award distance ≤ observed. Minimum 3 events per (payer, +vendor) pair to run the test. + +### 5. 
Build the findings JSON (evidence chain) + +```bash +python3 SKILL_DIR/scripts/build_findings.py \ + --cross-links data/cross_links.csv \ + --timing data/timing.json \ + --out data/findings.json +``` + +Every finding has `id, title, severity, confidence, summary, evidence[], sources[]`. +Each evidence item points back to a specific row in a source CSV. The user (or a +follow-up agent) can verify every claim against its source. + +## Confidence and evidence discipline + +This is the load-bearing rule of the skill. Tell the user: + +- Every claim must trace to a record. No naked assertions. +- Confidence tier travels with the claim. `match_type=fuzzy` is "probable", + not "confirmed." +- Entity resolution produces candidates, NOT conclusions. A `fuzzy` match + between "ACME LLC" and "Acme Holdings Group" is a lead, not a fact. +- Statistical significance ≠ wrongdoing. p < 0.05 means the timing pattern + is unlikely under the null. It does not establish corruption. +- All data sources here are public records. They may still contain + inaccuracies, stale info, or redactions (GDPR, sealed records). + +## Adding a new data source + +Use the template: + +```bash +cp SKILL_DIR/templates/source-template.md \ + SKILL_DIR/references/sources/.md +``` + +Fill in all 9 sections. Write a `fetch_.py` script in `scripts/` that +uses stdlib only and writes a normalized CSV. Update the source list in the +"When to use" section above. + +## Tools and their limits + +- `entity_resolution.py` does NOT use external fuzzy libraries (no rapidfuzz, + no jellyfish). Token-bag matching is the upper bound here. If you need + Levenshtein, transliteration, or phonetic matching, pip-install separately. +- `timing_analysis.py` uses Python's `random` for permutations. For + reproducibility, pass `--seed N`. +- `fetch_*.py` scripts use `urllib.request` and respect `Retry-After`. Heavy + bulk usage may still violate ToS — read each source's legal section first. 
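The token-bag tiers that bound `entity_resolution.py` can be sketched in a few lines of stdlib Python. This is an illustrative sketch under stated assumptions, not the script's actual code: the suffix list, the helper names `normalize` and `match_tier`, and the use of the smaller token set as the overlap denominator are all hypothetical choices made for the example.

```python
import re

# Hypothetical suffix list; the real script's list is not shown in this document.
SUFFIXES = {"LLC", "INC", "CORP", "CORPORATION", "CO", "LTD", "LP", "LLP", "COMPANY"}


def normalize(name: str) -> str:
    """Uppercase, turn punctuation into spaces, drop trailing legal suffixes."""
    tokens = re.sub(r"[^A-Z0-9 ]+", " ", name.upper()).split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def match_tier(left: str, right: str):
    """Return (match_type, confidence) for a pair of names, or None if no tier applies."""
    a, b = normalize(left), normalize(right)
    if not a or not b:
        return None
    # Tier 1: normalized strings are identical.
    if a == b:
        return ("exact", "high")
    # Tier 2: same words in any order (word-bag equality).
    if sorted(a.split()) == sorted(b.split()):
        return ("fuzzy", "medium")
    # Tier 3: only tokens of 4+ characters count toward overlap.
    ta = {t for t in a.split() if len(t) >= 4}
    tb = {t for t in b.split() if len(t) >= 4}
    shared = ta & tb
    # Assumption: overlap ratio is measured against the smaller token set.
    if len(shared) >= 2 and len(shared) / min(len(ta), len(tb)) >= 0.6:
        return ("token_overlap", "low")
    return None
```

Under these assumptions, `match_tier("Acme Widgets, LLC", "ACME WIDGETS INC")` lands in the `exact` tier, `match_tier("Widgets Acme", "Acme Widgets Inc.")` in `fuzzy`, and `match_tier("Acme Global Holdings Group", "Acme Global Partners")` in `token_overlap`. Every tier still yields a candidate to verify against the source rows, not a conclusion.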
+ +## Legal note + +All Phase-1 sources are public records. Bulk acquisition is permitted under +their respective access terms (FOIA, public records law, ICIJ explicit +publication, OFAC public data). However: + +- Some sources rate-limit aggressively. Respect their headers. +- Some redact registrant info (GDPR on WHOIS, sealed filings). +- Cross-referencing public records to identify private individuals can have + ethical implications. The skill produces evidence chains, not accusations. diff --git a/optional-skills/research/osint-investigation/references/sources/courtlistener.md b/optional-skills/research/osint-investigation/references/sources/courtlistener.md new file mode 100644 index 00000000000..0365b2ba0b1 --- /dev/null +++ b/optional-skills/research/osint-investigation/references/sources/courtlistener.md @@ -0,0 +1,98 @@ +# CourtListener — Free Law Project + +## 1. Summary + +CourtListener (Free Law Project) aggregates court opinions, dockets, oral +arguments, and judge data. Covers ~10M federal and state court opinions +back to colonial America, plus PACER docket data from RECAP submissions. + +## 2. Access Methods + +- **REST API v4:** `https://www.courtlistener.com/api/rest/v4/` +- **Auth:** Anonymous reads allowed on most endpoints; token raises rate + limits and unlocks bulk export +- **Rate limit:** ~5,000 req/hour unauthenticated for search; higher with token + +Set `COURTLISTENER_TOKEN` env var. Get a free token at +https://www.courtlistener.com/sign-in/ then create an API key. + +## 3. Data Schema + +Key fields emitted by `fetch_courtlistener.py`: + +| Column | Type | Description | +|--------|------|-------------| +| `case_name` | str | Case name | +| `court` | str | Court name | +| `court_id` | str | Court ID (e.g. 
`nysd`, `scotus`, `ca9`) | +| `date_filed` | str | YYYY-MM-DD | +| `docket_number` | str | Court docket number | +| `judge` | str | Judge name(s) | +| `citation` | str | Reporter citation(s) | +| `result_type` | str | opinions / dockets / oral / people | +| `snippet` | str | Search-match snippet (up to 500 chars) | +| `absolute_url` | str | Direct CourtListener URL | + +## 4. Coverage + +- Federal: all circuit and district courts, SCOTUS +- State: all 50 state supreme/appellate courts, many trial courts +- Opinions: ~10M back to 1600s (colonial), full coverage 1950 → present +- Dockets via RECAP: ~3M+ from user-submitted PACER PDFs +- Updated continuously + +## 5. Cross-Reference Potential + +- **OpenCorporates** ↔ `case_name` (corporate litigation) +- **SEC EDGAR** ↔ `case_name` (securities class actions) +- **OFAC SDN** ↔ `case_name` (sanctions-related civil/criminal cases) + +Join key: party name from `case_name`. Note: `case_name` often abbreviates +("Smith v. Jones" rather than full party names) — use the full case URL +to get all parties. + +## 6. Data Quality + +- Older opinions (pre-1990) often lack docket numbers and judges +- State coverage is more uneven than federal +- PACER docket coverage depends on RECAP user submissions — not exhaustive +- Sealed documents are excluded +- Party names in case captions don't always match filing names exactly + +## 7. 
Acquisition Script + +Path: `scripts/fetch_courtlistener.py` + +```bash +# Search opinions for a party / keyword +python3 SKILL_DIR/scripts/fetch_courtlistener.py --query "Example Corp" \ + --out data/cl.csv + +# PACER dockets (best for recent litigation) +python3 SKILL_DIR/scripts/fetch_courtlistener.py --query "Example Corp" \ + --type dockets --out data/cl_dockets.csv + +# Restrict to a court +python3 SKILL_DIR/scripts/fetch_courtlistener.py --query "Microsoft" \ + --court ca9 --out data/cl_9th.csv + +# Date range +python3 SKILL_DIR/scripts/fetch_courtlistener.py --query "Example Corp" \ + --date-from 2020-01-01 --date-to 2024-12-31 --out data/cl.csv +``` + +Pass `--token` or set `COURTLISTENER_TOKEN`. + +## 8. Legal & Licensing + +- Court opinions are public domain +- Free Law Project provides the data under CC0 / public domain dedication +- No commercial use restrictions on opinion text or metadata +- Some PACER PDFs have copyright on layout (not text) — fair use applies + +## 9. References + +- API docs: https://www.courtlistener.com/help/api/rest/ +- Court IDs: https://www.courtlistener.com/api/jurisdictions/ +- RECAP archive: https://www.courtlistener.com/recap/ +- Bulk data: https://www.courtlistener.com/help/api/bulk-data/ diff --git a/optional-skills/research/osint-investigation/references/sources/gdelt.md b/optional-skills/research/osint-investigation/references/sources/gdelt.md new file mode 100644 index 00000000000..785c171a0c9 --- /dev/null +++ b/optional-skills/research/osint-investigation/references/sources/gdelt.md @@ -0,0 +1,104 @@ +# GDELT — Global News Monitoring + +## 1. Summary + +GDELT (Global Database of Events, Language, and Tone) monitors world news +in 100+ languages with full-text indexing. Updated every 15 minutes. +~2015 → present, ~1B+ articles indexed. Free anonymous access. 
+ +GDELT covers a wider source set than Google News (more international, more +long-tail sources) and indexes articles by tone/sentiment, themes (GKG theme +codes such as `ECON_BANKRUPTCY`), people, and organizations. + +## 2. Access Methods + +- **DOC 2.0 API:** `https://api.gdeltproject.org/api/v2/doc/doc` +- **Events / GKG 2.0:** `https://api.gdeltproject.org/api/v2/events/events` +- **Auth:** None +- **Rate limit:** **1 request per 5 seconds** for the DOC API — strict + +The fetch script automatically retries after a 6-second sleep when a +429 is received. + +## 3. Data Schema + +Key fields emitted by `fetch_gdelt.py`: + +| Column | Type | Description | +|--------|------|-------------| +| `title` | str | Article title | +| `url` | str | Article URL | +| `seen_date` | str | When GDELT first saw the article (UTC) | +| `domain` | str | Publisher domain | +| `language` | str | Source language | +| `source_country` | str | 2-letter country code | +| `tone` | str | GDELT-computed tone score (negative values = negative coverage) | +| `social_image` | str | Open Graph image URL when available | + +## 4. Coverage + +- Worldwide news in 100+ languages +- ~2015 → present (Events back to 1979 via a separate stream) +- Update frequency: 15 minutes +- Bias: heavily Anglophone in volume but very wide source list overall + +## 5. Cross-Reference Potential + +- **All sources** ↔ `title` / `url` (news context for any subject) +- **Wikipedia** ↔ event timeline for notable entities +- **Wayback Machine** ↔ recover articles whose URLs have died +- **OFAC SDN** ↔ news context for sanctions designations +- **SEC EDGAR** ↔ news context for 8-K material events + +Join key: entity name appearing in article title or full-text. GDELT also +extracts named entities into a separate stream (GKG) not exposed by this +fetcher — query GDELT directly for entity-level filtering. + +## 6. 
Data Quality + +- Title extraction is automated and can be wrong (sometimes captures the + site name + delimiter + article title; sometimes a generic page title) +- Sentiment / tone is computed by GDELT, not source-supplied +- Some domains are oversampled (newswires, aggregators) +- Source country is inferred from domain registration / TLD — can be + wrong for international news sites with country-neutral domains +- Article URLs can rot — pair with Wayback Machine to preserve content + +## 7. Acquisition Script + +Path: `scripts/fetch_gdelt.py` + +```bash +# Recent news mentioning an entity +python3 SKILL_DIR/scripts/fetch_gdelt.py --query "Nous Research" \ + --timespan 6m --out data/gdelt.csv + +# Phrase-exact (use double quotes inside single quotes for the shell) +python3 SKILL_DIR/scripts/fetch_gdelt.py --query '"Dillon Rolnick"' \ + --timespan 1y --out data/gdelt.csv + +# Filter to a country / language +python3 SKILL_DIR/scripts/fetch_gdelt.py --query "Microsoft" \ + --source-country US --source-lang English --out data/gdelt.csv + +# Date range +python3 SKILL_DIR/scripts/fetch_gdelt.py --query "Microsoft" \ + --start 2024-01-01 --end 2024-12-31 --out data/gdelt.csv +``` + +GDELT supports its own query operators: phrase quoting, AND/OR/NOT, +`sourcecountry:US`, `theme:ECON_BANKRUPTCY`, `tone<-5`, etc. +See https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/ for syntax. + +## 8. Legal & Licensing + +- GDELT data is provided free for academic and journalistic use +- Article URLs link out to original publishers — copyright remains with + the publisher +- GDELT is NOT a content archive; it's a metadata index + +## 9. 
References + +- DOC 2.0 API: https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/ +- Themes & query syntax: https://blog.gdeltproject.org/gkg-2-0-our-global-knowledge-graph-2-0-amazing-data-at-your-fingertips/ +- Project home: https://www.gdeltproject.org/ diff --git a/optional-skills/research/osint-investigation/references/sources/icij-offshore.md b/optional-skills/research/osint-investigation/references/sources/icij-offshore.md new file mode 100644 index 00000000000..99e2abcb24b --- /dev/null +++ b/optional-skills/research/osint-investigation/references/sources/icij-offshore.md @@ -0,0 +1,104 @@ +# ICIJ Offshore Leaks Database + +## 1. Summary + +The International Consortium of Investigative Journalists (ICIJ) publishes a +combined database of offshore entities from the Panama Papers, Paradise Papers, +Pandora Papers, Bahamas Leaks, and Offshore Leaks. ~800,000+ offshore entities +with their officers, intermediaries, and addresses. + +## 2. Access Methods + +- **Bulk download (primary):** `https://offshoreleaks-data.icij.org/offshoreleaks/csv/full-oldb.LATEST.zip` (~70 MB ZIP, refreshed periodically) +- **Search UI (human):** `https://offshoreleaks.icij.org/` +- **Auth:** None +- **Note:** The previous Open Refine reconciliation endpoint at + `/reconcile` now returns 404. ICIJ has removed it. The bulk ZIP is the + remaining stable access path. The skill's `fetch_icij_offshore.py` caches + the ZIP locally (default `~/.cache/hermes-osint/icij/`, refreshes after + 30 days) and searches it offline. + +## 3. 
Data Schema + +Key fields emitted by `fetch_icij_offshore.py`: + +| Column | Type | Description | +|--------|------|-------------| +| `node_id` | int | ICIJ canonical node ID | +| `name` | str | Entity / officer / intermediary name | +| `node_type` | str | entity / officer / intermediary / address | +| `country_codes` | str | Semicolon-separated ISO codes | +| `countries` | str | Country names | +| `jurisdiction` | str | Offshore jurisdiction (BVI, Panama, etc.) | +| `incorporation_date` | str | YYYY-MM-DD | +| `inactivation_date` | str | YYYY-MM-DD (if struck) | +| `source` | str | Panama Papers / Paradise Papers / Pandora Papers / etc. | +| `entity_url` | str | Link to ICIJ page | +| `connections` | str | Semicolon-separated node IDs of related entities | + +## 4. Coverage + +- Worldwide offshore entity records +- Earliest records: 1970s (Bahamas Leaks). Most data 1990–2018. +- NOT updated in real-time — new leaks added when ICIJ publishes them +- ~810,000 offshore entities + ~750,000 officers + ~150,000 intermediaries + +## 5. Cross-Reference Potential + +- **SEC EDGAR** ↔ `name` (public companies with offshore arms) +- **USAspending** ↔ `name` (federal contractors with offshore structure) +- **OFAC SDN** ↔ `name` (sanctioned entities using offshore vehicles) + +Join key: normalized entity/officer name. `node_id` is canonical for cross- +referencing within ICIJ. Connections graph traversal is in-script (BFS over +`connections`). + +## 6. Data Quality + +- Offshore entity names sometimes appear in multiple leaks with slight variations +- Officers may be nominees (front persons), not beneficial owners +- Some entries have minimal info (just a name + jurisdiction) +- The connections graph is incomplete — some relationships are documented in + source materials but not in the structured database +- Inactive/struck-off entities are still included with `inactivation_date` + +## 7. 
Acquisition Script + +Path: `scripts/fetch_icij_offshore.py` + +```bash +# Search by entity name (case-insensitive substring across the bulk DB) +python3 SKILL_DIR/scripts/fetch_icij_offshore.py --entity "EXAMPLE CORP" \ + --out data/icij.csv + +# Search by officer (individual person) +python3 SKILL_DIR/scripts/fetch_icij_offshore.py --officer "SMITH JOHN" \ + --out data/icij.csv + +# Filter officer matches by jurisdiction (applied to the cached bulk data) +python3 SKILL_DIR/scripts/fetch_icij_offshore.py --officer "SMITH" \ + --jurisdiction "BRITISH VIRGIN ISLANDS" --out data/icij_bvi.csv + +# Force a fresh download (default refresh window is 30 days) +python3 SKILL_DIR/scripts/fetch_icij_offshore.py --entity "EXAMPLE CORP" \ + --force-refresh --out data/icij.csv +``` + +The first call downloads the ~70 MB ZIP to `~/.cache/hermes-osint/icij/` +(or `$HERMES_OSINT_CACHE/icij/`). Subsequent calls reuse the cache for 30 days. + +## 8. Legal & Licensing + +- Public record, explicitly published by ICIJ +- No copyright on the underlying facts (entity names, jurisdictions) +- ICIJ asks for attribution if used in derivative reporting +- **Ethical note**: Presence in this database does NOT imply wrongdoing. Many + offshore structures are legal. The database is a research tool, not a list of + criminals. + +## 9. References + +- Database: https://offshoreleaks.icij.org/ +- About the data: https://offshoreleaks.icij.org/pages/about +- Methodology: https://www.icij.org/investigations/panama-papers/ +- Removed API: the Open Refine reconciliation endpoint at `https://offshoreleaks.icij.org/reconcile` now returns 404 (see section 2); use the bulk download instead diff --git a/optional-skills/research/osint-investigation/references/sources/nyc-acris.md b/optional-skills/research/osint-investigation/references/sources/nyc-acris.md new file mode 100644 index 00000000000..4b20169bf3e --- /dev/null +++ b/optional-skills/research/osint-investigation/references/sources/nyc-acris.md @@ -0,0 +1,90 @@ +# NYC ACRIS — NYC Real Property Records + +## 1. 
Summary + +The Automated City Register Information System (ACRIS) is NYC's index of +recorded property documents: deeds, mortgages, satisfactions, liens, UCC +filings. Covers Manhattan, Bronx, Brooklyn, Queens, Staten Island. +Published as 4 linked Socrata datasets on the NYC Open Data portal. + +## 2. Access Methods + +- **Socrata API:** `https://data.cityofnewyork.us/resource/636b-3b5g.json` (Parties) +- **Other datasets:** `bnx9-e6tj` (Master), `8h5j-fqxa` (Legal), `uqqa-hym2` (References) +- **Auth:** None for read access (Socrata `$app_token` raises rate limits if needed) +- **Rate limit:** Generous (~1000 req/hour unauthenticated) + +## 3. Data Schema + +Key fields emitted by `fetch_nyc_acris.py` (Parties joined to Master): + +| Column | Type | Description | +|--------|------|-------------| +| `document_id` | str | ACRIS document ID | +| `name` | str | Party name as recorded (often "LAST, FIRST" but varies) | +| `party_type` | str | 1=grantor, 2=grantee, 3=other | +| `party_role` | str | Human-readable role label | +| `address_1` | str | Property or party address line 1 | +| `city`, `state`, `zip`, `country` | str | Address parts | +| `doc_type` | str | DEED, MTGE (mortgage), SAT (satisfaction), AGMT, etc. | +| `doc_date`, `recorded_date` | str | YYYY-MM-DD | +| `borough` | str | Manhattan / Bronx / Brooklyn / Queens / Staten Island | +| `amount` | str | Document amount (USD, when applicable) | +| `filing_url` | str | Direct ACRIS DocumentImageView link | + +## 4. Coverage + +- NYC 5 boroughs only — other counties have their own recorders +- 1966 → present (older filings exist on microfilm at the County Clerk) +- Updated nightly +- ~70M+ party records cumulative + +## 5. 
Cross-Reference Potential + +- **SEC EDGAR** ↔ `name` (insider filers with NYC property) +- **USAspending** ↔ `name` (federal contractors with NYC property) +- **Senate LDA** ↔ `name` (lobbyists / clients with NYC property) +- **ICIJ Offshore** ↔ `name` (NYC properties owned via offshore vehicles) + +Join key: normalized party name. NYC property records typically store names +as "LAST, FIRST" or full LLC names — use `entity_resolution.py`. + +## 6. Data Quality + +- Same person appears with multiple name formats over time +- LLC and trust ownership obscures beneficial owners +- Recording lag can be 2-4 weeks after closing +- Older documents have spottier address data +- Sealed records (e.g. domestic violence shelters) are excluded by law + +## 7. Acquisition Script + +Path: `scripts/fetch_nyc_acris.py` + +```bash +# By party name +python3 SKILL_DIR/scripts/fetch_nyc_acris.py --name "ROLNICK" --out data/acris.csv + +# By address (useful when you know the property but not the names) +python3 SKILL_DIR/scripts/fetch_nyc_acris.py --address "571 HUDSON" --out data/acris.csv + +# Restrict to grantees (buyers / mortgagees) +python3 SKILL_DIR/scripts/fetch_nyc_acris.py --name "ROLNICK" --party-type 2 \ + --out data/acris_buyers.csv +``` + +The script joins Parties → Master to populate doc_type, dates, borough, and +amount. Pass `--no-enrich` to skip the join (faster, fewer columns). + +## 8. Legal & Licensing + +- Public record under NYS Real Property Law and NYC Charter +- No commercial use restrictions on the data +- All ACRIS data is public information by statute + +## 9. 
References + +- ACRIS portal: https://a836-acris.nyc.gov/CP/ +- NYC Open Data: https://data.cityofnewyork.us/ +- Parties dataset: https://data.cityofnewyork.us/City-Government/ACRIS-Real-Property-Parties/636b-3b5g +- Document type codes: https://www1.nyc.gov/site/finance/taxes/acris.page diff --git a/optional-skills/research/osint-investigation/references/sources/ofac-sdn.md b/optional-skills/research/osint-investigation/references/sources/ofac-sdn.md new file mode 100644 index 00000000000..ab3602031f1 --- /dev/null +++ b/optional-skills/research/osint-investigation/references/sources/ofac-sdn.md @@ -0,0 +1,92 @@ +# OFAC SDN — Specially Designated Nationals List + +## 1. Summary + +The Office of Foreign Assets Control (OFAC) publishes the Specially Designated +Nationals and Blocked Persons List (SDN). US persons are generally prohibited +from dealing with individuals and entities on this list. Also published: +non-SDN consolidated lists (BIS Denied Persons, FSE, etc.). + +## 2. Access Methods + +- **Full XML:** `https://www.treasury.gov/ofac/downloads/sdn.xml` +- **Delimited:** `https://www.treasury.gov/ofac/downloads/sdn.csv` +- **Consolidated:** `https://www.treasury.gov/ofac/downloads/consolidated/consolidated.xml` +- **Auth:** None +- **Rate limit:** None (static file downloads). Updated continuously. + +## 3. Data Schema + +Key fields emitted by `fetch_ofac_sdn.py`: + +| Column | Type | Description | +|--------|------|-------------| +| `entity_id` | int | OFAC unique ID | +| `name` | str | Primary name | +| `entity_type` | str | individual / entity / vessel / aircraft | +| `program_list` | str | Semicolon-separated sanctions programs (e.g. 
SDGT;IRAN) | +| `title` | str | For individuals: title/role | +| `nationalities` | str | Semicolon-separated country codes | +| `aka_list` | str | Semicolon-separated "also known as" names | +| `addresses` | str | Semicolon-separated known addresses | +| `dob` | str | Date of birth (individuals) | +| `pob` | str | Place of birth (individuals) | +| `remarks` | str | OFAC's free-text remarks | +| `last_updated` | str | YYYY-MM-DD (publication date) | + +## 4. Coverage + +- Worldwide — all entities sanctioned by US Treasury +- ~10,000 entries on SDN, ~15,000 on consolidated lists +- Updated continuously (sometimes daily during active enforcement) +- Includes AKAs (very common, can be 10+ per entity) + +## 5. Cross-Reference Potential + +- **SEC EDGAR** ↔ `name` (public companies sanctioned) +- **USAspending** ↔ `name` (sanctioned entity as federal contractor — should + be impossible but verify) +- **ICIJ Offshore** ↔ `name` (offshore entities also sanctioned) + +Join key: normalized name. **CRITICAL**: must match against `aka_list` too. +Many sanctioned entities are caught only via aliases. + +## 6. Data Quality + +- Names are transliterated from many scripts — multiple romanizations possible +- AKAs often differ wildly from primary name +- Some entries have minimal info (no DOB, no address) for individuals +- Free-text `remarks` contain critical context — read them +- "Specially Designated Global Terrorists" (SDGT) and "Cyber-related" (CYBER2) + programs add and remove entries frequently + +## 7. Acquisition Script + +Path: `scripts/fetch_ofac_sdn.py` + +```bash +# Full snapshot +python3 SKILL_DIR/scripts/fetch_ofac_sdn.py --out data/ofac_sdn.csv + +# Filter to specific program +python3 SKILL_DIR/scripts/fetch_ofac_sdn.py --program SDGT --out data/sdn_sdgt.csv + +# Entities only (skip individuals, vessels, aircraft) +python3 SKILL_DIR/scripts/fetch_ofac_sdn.py --entity-type entity --out data/sdn_entities.csv +``` + +## 8. 
Legal & Licensing + +- Public record under Executive Order authority and statutory sanctions programs +- US persons MUST screen against this list — it is enforced +- No restrictions on the data itself; restrictions are on transactions with + the listed entities +- ZERO penalty for "over-matching" — false positives must be cleared but are not + prohibited + +## 9. References + +- OFAC home: https://ofac.treasury.gov/ +- SDN list: https://ofac.treasury.gov/specially-designated-nationals-and-blocked-persons-list-sdn-human-readable-lists +- Data formats: https://ofac.treasury.gov/sdn-list/sanctions-list-search-tool +- Compliance guidance: https://ofac.treasury.gov/recent-actions diff --git a/optional-skills/research/osint-investigation/references/sources/opencorporates.md b/optional-skills/research/osint-investigation/references/sources/opencorporates.md new file mode 100644 index 00000000000..0bd190a2f49 --- /dev/null +++ b/optional-skills/research/osint-investigation/references/sources/opencorporates.md @@ -0,0 +1,103 @@ +# OpenCorporates — Global Corporate Registry + +## 1. Summary + +OpenCorporates aggregates corporate registry data from 130+ jurisdictions +worldwide (~200M companies). Covers US state-level filings (NY DOS, Delaware +DOC, California SOS, etc.), UK Companies House, EU registries, and most +common-law jurisdictions. + +## 2. Access Methods + +- **REST API:** `https://api.opencorporates.com/v0.4/` +- **HTML fallback:** `https://opencorporates.com/companies?q=...` +- **Auth:** API token required (free tier 500 calls/month, paid plans available) +- **Rate limit:** Token-bound; un-tokened requests return 401 + +Set `OPENCORPORATES_API_TOKEN` env var. Get a free token at +https://opencorporates.com/api_accounts/new. + +## 3. 
Data Schema + +Key fields emitted by `fetch_opencorporates.py`: + +| Column | Type | Description | +|--------|------|-------------| +| `name` | str | Company legal name | +| `company_number` | str | Registry-assigned number | +| `jurisdiction_code` | str | e.g. `us_ny`, `us_de`, `gb` | +| `jurisdiction_name` | str | Human-readable jurisdiction | +| `incorporation_date` | str | YYYY-MM-DD | +| `dissolution_date` | str | YYYY-MM-DD (empty if active) | +| `company_type` | str | Domestic LLC / Foreign Corp / etc. | +| `status` | str | Active / Inactive / Dissolved | +| `registered_address` | str | Registered office address | +| `opencorporates_url` | str | Link to OpenCorporates entity page | +| `officers_count` | str | Total officers on record | +| `source` | str | `api`, `html`, or `html-fallback` | + +## 4. Coverage + +- US: all 50 states + DC at state level (LLCs, corps, LPs) +- International: UK, EU, Canada, Australia, NZ, many APAC + LATAM jurisdictions +- ~200M company records cumulative +- Update frequency varies by jurisdiction (UK CH is near-realtime; some + state registries lag months) + +## 5. Cross-Reference Potential + +- **NYC ACRIS** ↔ `name` (LLC/corp owners of NYC property) +- **USAspending** ↔ `name` (corporate federal contractors) +- **SEC EDGAR** ↔ `name` (public companies + their subsidiaries) +- **ICIJ Offshore** ↔ `name` (international corporate structures) + +Join key: normalized company name. Some entries have `previous_names` arrays +which are not currently exported by the fetch script — query OC directly +for that. + +## 6. Data Quality + +- Company-name spellings vary across re-incorporations and renames +- Officer records are spottier than company records (many jurisdictions + don't require officer disclosure) +- Beneficial-ownership data is generally NOT here — most jurisdictions + don't require it. UK Companies House has PSC (people with significant + control) but that's not universal. 
+- Cross-jurisdictional links (parent / subsidiary) are based on registry + filings only; corporate trees are often incomplete + +## 7. Acquisition Script + +Path: `scripts/fetch_opencorporates.py` + +```bash +# Search globally by name +python3 SKILL_DIR/scripts/fetch_opencorporates.py --query "Example Corp" \ + --out data/oc.csv + +# Restrict to a jurisdiction +python3 SKILL_DIR/scripts/fetch_opencorporates.py --query "Example Corp" \ + --jurisdiction us_ny --out data/oc_ny.csv + +# Set token via env var +OPENCORPORATES_API_TOKEN=xxx python3 SKILL_DIR/scripts/fetch_opencorporates.py \ + --query "Microsoft" --out data/oc.csv +``` + +Without a token the script falls back to scraping the HTML search page. +The fallback is brittle and only fills in `name`, `jurisdiction_code`, +`opencorporates_url` — set the token for serious work. + +## 8. Legal & Licensing + +- OpenCorporates aggregates public records — the underlying facts are + public domain +- OpenCorporates' own database is licensed under the Open Database License (ODbL); attribution required +- API ToS prohibits redistributing the full dataset; per-record reference + is fine + +## 9. References + +- API docs: https://api.opencorporates.com/documentation/API-Reference +- Jurisdiction codes: https://api.opencorporates.com/v0.4/jurisdictions.json +- Schema: https://opencorporates.com/info/our_data diff --git a/optional-skills/research/osint-investigation/references/sources/sec-edgar.md b/optional-skills/research/osint-investigation/references/sources/sec-edgar.md new file mode 100644 index 00000000000..55a33d70258 --- /dev/null +++ b/optional-skills/research/osint-investigation/references/sources/sec-edgar.md @@ -0,0 +1,83 @@ +# SEC EDGAR — Corporate Filings + +## 1. Summary + +EDGAR (Electronic Data Gathering, Analysis, and Retrieval) is the SEC's system +for corporate disclosure filings: 10-K (annual), 10-Q (quarterly), 8-K (current +events), DEF 14A (proxy), Form 4 (insider transactions), 13F (institutional holdings). + +## 2.
Access Methods + +- **API:** `https://data.sec.gov/submissions/CIK<10-digit-padded>.json` (no auth) +- **Filing index:** `https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=...` +- **Full-text search:** `https://efts.sec.gov/LATEST/search-index?q=...` +- **Auth:** None — requires `User-Agent` header with contact info per SEC policy +- **Rate limit:** 10 requests/second per IP (enforced) + +## 3. Data Schema + +Key fields emitted by `fetch_sec_edgar.py` (filings index): + +| Column | Type | Description | +|--------|------|-------------| +| `cik` | str | Central Index Key (10-digit padded) | +| `company_name` | str | Registrant name | +| `form_type` | str | 10-K, 10-Q, 8-K, etc. | +| `filing_date` | str | YYYY-MM-DD | +| `accession_number` | str | Filing accession (e.g. 0000320193-24-000123) | +| `primary_document` | str | Filename of main document | +| `filing_url` | str | Direct URL to filing index | +| `reporting_period` | str | Period of report (where applicable) | + +## 4. Coverage + +- All public US registrants from 1993 → present +- 1993-2000 has spotty coverage of older filings (paper-to-electronic migration) +- ~12M filings cumulative +- Updated within minutes of filing acceptance + +## 5. Cross-Reference Potential + +- **USAspending** ↔ `company_name` (public companies as federal contractors) +- **Senate LD** ↔ `company_name` (public companies hire lobbyists) +- **OFAC SDN** ↔ `company_name` (sanctions screening of public registrants) + +Join key: company name OR CIK if you have it. CIK is canonical and stable. + +## 6. Data Quality + +- Subsidiaries often filed under parent CIK — be careful with name matches +- Name changes over time (rebrands, acquisitions) — CIK remains constant +- 10-K Item 1A Risk Factors are free-form text — useful for `web_extract`-style + parsing, not structured queries +- Foreign private issuers file 20-F instead of 10-K + +## 7. 
Acquisition Script + +Path: `scripts/fetch_sec_edgar.py` + +```bash +# By CIK +python3 SKILL_DIR/scripts/fetch_sec_edgar.py --cik 0000320193 \ + --types 10-K,10-Q --out data/edgar_filings.csv + +# By company name (resolves to CIK first via name search) +python3 SKILL_DIR/scripts/fetch_sec_edgar.py --company "APPLE INC" \ + --types 8-K --since 2024-01-01 --out data/edgar_filings.csv +``` + +Set `SEC_USER_AGENT` env var with your contact email (SEC requirement). +Example: `SEC_USER_AGENT="Research example@example.com"`. + +## 8. Legal & Licensing + +- Public record under SEC Rule 24b-2 / 17 CFR § 230.401 +- No commercial use restrictions on filing content +- SEC asks all bulk users to include a `User-Agent` with contact info and to + respect 10 req/s — failure to do so can result in IP blocking + +## 9. References + +- Developer docs: https://www.sec.gov/edgar/sec-api-documentation +- EDGAR full-text search: https://efts.sec.gov/LATEST/search-index +- Fair access policy: https://www.sec.gov/os/accessing-edgar-data diff --git a/optional-skills/research/osint-investigation/references/sources/senate-ld.md b/optional-skills/research/osint-investigation/references/sources/senate-ld.md new file mode 100644 index 00000000000..5142dc6ea41 --- /dev/null +++ b/optional-skills/research/osint-investigation/references/sources/senate-ld.md @@ -0,0 +1,89 @@ +# Senate LD — Lobbying Disclosure (LD-1 / LD-2) + +## 1. Summary + +The Senate Office of Public Records publishes lobbying disclosures under the +Lobbying Disclosure Act of 1995 (LDA, as amended by HLOGA 2007). LD-1 is +registration of a new client-lobbyist relationship; LD-2 is the quarterly +activity report. + +## 2. 
Access Methods + +- **API:** `https://lda.senate.gov/api/v1/` (no auth required for read-only) +- **Bulk download:** `https://lda.senate.gov/api/v1/filings/?format=csv` (paginated) +- **Auth:** Token required for >120 req/hour — register at https://lda.senate.gov/api/auth/register/ +- **Rate limit:** 120 req/hour unauthenticated, 1,200 req/hour authenticated + +## 3. Data Schema + +Key fields emitted by `fetch_senate_ld.py`: + +| Column | Type | Description | +|--------|------|-------------| +| `filing_uuid` | str | Unique filing ID | +| `filing_type` | str | LD-1, LD-2, LD-203, etc. | +| `filing_year` | int | Year | +| `filing_period` | str | Q1/Q2/Q3/Q4 or annual | +| `registrant_name` | str | Lobbying firm or organization | +| `registrant_id` | str | Senate-assigned registrant ID | +| `client_name` | str | Client being represented | +| `client_id` | str | Senate-assigned client ID | +| `client_general_description` | str | Client industry / business | +| `income` | float | LD-2 income from client this quarter (USD) | +| `expenses` | float | LD-2 expenses (in-house lobbying) | +| `lobbyists` | str | Semicolon-separated lobbyist names | +| `issues` | str | Semicolon-separated issue areas | +| `government_entities` | str | Agencies/chambers contacted | +| `filing_date` | str | YYYY-MM-DD | + +## 4. Coverage + +- US federal lobbying only (state lobbying handled by individual state ethics offices) +- 1999 → present (full electronic coverage from 2008) +- Quarterly reporting cycle (LD-2) +- ~1M+ filings cumulative + +## 5. Cross-Reference Potential + +- **USAspending** ↔ `client_name` (clients lobbying for contracts) +- **SEC EDGAR** ↔ `client_name` (public companies as lobbying clients) +- **OFAC SDN** ↔ `client_name` (sanctions screening of lobbying clients) + +Join key: normalized client_name. registrant_id and client_id are canonical +when joining Senate-internal records. + +## 6. 
Data Quality + +- Many lobbyist names appear under multiple registrants over time (job changes) +- `issues` and `government_entities` are free-text — expect inconsistent capitalization +- Foreign agents register under FARA (Department of Justice), NOT here +- Income/expenses are reported in $10,000 brackets in some older filings + +## 7. Acquisition Script + +Path: `scripts/fetch_senate_ld.py` + +```bash +# By client +python3 SKILL_DIR/scripts/fetch_senate_ld.py --client "EXAMPLE CORP" \ + --year 2024 --out data/lobbying.csv + +# By registrant (lobbying firm) +python3 SKILL_DIR/scripts/fetch_senate_ld.py --registrant "BIG K STREET LLP" \ + --year 2024 --out data/lobbying.csv +``` + +Set `SENATE_LDA_TOKEN` env var if you have one (or pass `--token`). +Defaults to anonymous (120 req/hour). + +## 8. Legal & Licensing + +- Public record under 2 U.S.C. § 1604 (LDA) +- No commercial use restrictions +- Reuse is unconditional — see Senate Public Records Office disclaimer + +## 9. References + +- API docs: https://lda.senate.gov/api/redoc/v1/ +- LDA guidance: https://lobbyingdisclosure.house.gov/ld_guidance.pdf +- Senate Public Records: https://lda.senate.gov/ diff --git a/optional-skills/research/osint-investigation/references/sources/usaspending.md b/optional-skills/research/osint-investigation/references/sources/usaspending.md new file mode 100644 index 00000000000..6477272293b --- /dev/null +++ b/optional-skills/research/osint-investigation/references/sources/usaspending.md @@ -0,0 +1,97 @@ +# USAspending — Federal Government Contracts and Grants + +## 1. Summary + +USAspending.gov is the official source of federal spending data. Coverage: +contracts, grants, loans, direct payments, sub-awards. Reporting is required by the DATA Act +of 2014: all federal agencies must report to a single schema. + +## 2.
Access Methods + +- **API v2:** `https://api.usaspending.gov/api/v2/` (no auth, no key) +- **Bulk:** `https://files.usaspending.gov/` (CSV / Parquet by award type) +- **Auth:** None +- **Rate limit:** Not strictly enforced, but be polite — keep to <10 req/s + +## 3. Data Schema + +Key fields emitted by `fetch_usaspending.py` (prime awards): + +| Column | Type | Description | +|--------|------|-------------| +| `award_id` | str | Federal award ID (PIID for contracts, FAIN for grants) | +| `recipient_name` | str | Awardee legal name | +| `recipient_uei` | str | Unique Entity Identifier (replaced DUNS in 2022) | +| `recipient_duns` | str | Legacy DUNS number (historical only) | +| `recipient_parent_name` | str | Ultimate parent organization | +| `recipient_state` | str | Recipient state | +| `awarding_agency` | str | Department / agency name | +| `awarding_sub_agency` | str | Sub-tier (e.g. DoD → Army) | +| `award_type` | str | Contract / Grant / Loan / Direct Payment | +| `award_amount` | float | Current total obligation in USD | +| `award_date` | str | Action / signed date YYYY-MM-DD | +| `period_of_performance_start` | str | YYYY-MM-DD | +| `period_of_performance_end` | str | YYYY-MM-DD | +| `naics_code` | str | Industry classification | +| `psc_code` | str | Product / Service Code | +| `competition_extent` | str | Full / limited / sole-source | +| `description` | str | Award description (free-text) | + +## 4. Coverage + +- US federal awards only (state/local not included) +- FY 2008 → present (full coverage from FY 2017) +- Updated bi-weekly from agency reporting +- ~100M+ transaction records cumulative + +## 5. 
Cross-Reference Potential + +- **SEC EDGAR** ↔ `recipient_name` (public companies as contractors) +- **Senate LD** ↔ `recipient_name` (lobbying clients winning contracts) +- **OFAC SDN** ↔ `recipient_name` (sanctions screening of contractors — should be + filtered out by SAM.gov, but verify) +- **ICIJ Offshore** ↔ `recipient_name` (offshore-linked contractors) + +Join key: normalized recipient name. UEI is canonical when present. + +## 6. Data Quality + +- DUNS → UEI transition (April 2022) — old records have DUNS, new records have UEI +- Some sub-awards aren't reported (the FFATA reporting threshold is $30k) +- Award amounts change over time (modification actions) — fetch script reports current total +- `competition_extent` field is free-text in older records — `fetch_usaspending.py` + normalizes to canonical values +- Recipient name variations are extensive — "ACME LLC", "Acme L.L.C.", "ACME, INC" + all appear. Use `entity_resolution.py`. + +## 7. Acquisition Script + +Path: `scripts/fetch_usaspending.py` + +```bash +# By recipient name +python3 SKILL_DIR/scripts/fetch_usaspending.py --recipient "EXAMPLE CORP" \ + --fy 2024 --out data/contracts.csv + +# By awarding agency +python3 SKILL_DIR/scripts/fetch_usaspending.py --agency "Department of Defense" \ + --fy 2024 --out data/contracts.csv + +# Filter to sole-source only +python3 SKILL_DIR/scripts/fetch_usaspending.py --recipient "EXAMPLE CORP" \ + --fy 2024 --sole-source-only --out data/contracts.csv +``` + +## 8. Legal & Licensing + +- Public record under the Federal Funding Accountability and Transparency Act + (FFATA, 2006) and DATA Act (2014) +- No commercial use restrictions on the data +- Personal information of award recipients (e.g. small business owners' addresses + in some grants) should be handled per the source agency's privacy notice + +## 9.
References + +- API docs: https://api.usaspending.gov/ +- Data dictionary: https://www.usaspending.gov/data-dictionary +- Award schema: https://files.usaspending.gov/docs/Data_Dictionary_Crosswalk.xlsx diff --git a/optional-skills/research/osint-investigation/references/sources/wayback.md b/optional-skills/research/osint-investigation/references/sources/wayback.md new file mode 100644 index 00000000000..f397c093a23 --- /dev/null +++ b/optional-skills/research/osint-investigation/references/sources/wayback.md @@ -0,0 +1,93 @@ +# Wayback Machine — Internet Archive CDX + +## 1. Summary + +The Internet Archive's Wayback Machine has captured ~900B+ web pages since +1996. The CDX server API indexes those captures by URL, timestamp, and +content hash. Free, anonymous, no auth. + +## 2. Access Methods + +- **CDX server:** `https://web.archive.org/cdx/search/cdx` +- **Wayback URL:** `https://web.archive.org/web//` +- **Save Page Now (write):** `https://web.archive.org/save/` (different API) +- **Auth:** None +- **Rate limit:** Generous; be polite (~1 req/s) + +## 3. Data Schema + +Key fields emitted by `fetch_wayback.py`: + +| Column | Type | Description | +|--------|------|-------------| +| `url` | str | Original URL captured | +| `timestamp` | str | YYYYMMDDHHMMSS (CDX format) | +| `wayback_url` | str | Direct replay URL | +| `mimetype` | str | Content-type at capture | +| `status` | str | HTTP status (typically 200) | +| `digest` | str | SHA1 of capture content (collapse-friendly) | +| `length` | str | Byte length of capture | + +## 4. Coverage + +- 1996 → present +- ~900B+ captures across ~700M domains +- Updated continuously by automated crawls + manual saves +- Some domains have aggressive coverage (news), others sparse (private) + +## 5. 
Cross-Reference Potential + +- **Wikipedia** ↔ Reverse-lookup pages cited as references that have since + disappeared +- **News URLs** ↔ Original article content when present-day URLs 404 +- **Corporate websites** ↔ Historical "About" pages, executive bios that + have been scrubbed + +The Wayback CDX is most useful as a **content-recovery** layer when other +sources point to URLs that no longer exist. + +## 6. Data Quality + +- robots.txt-blocked domains may have spotty or no coverage +- Captures vary in completeness (HTML may be saved without CSS/JS) +- Some content is excluded by domain owner request (DMCA, etc.) +- Coverage of "deep links" (URLs with query strings) is uneven +- Time resolution is per-capture, not continuous — gaps are common + +## 7. Acquisition Script + +Path: `scripts/fetch_wayback.py` + +```bash +# All captures of a specific URL +python3 SKILL_DIR/scripts/fetch_wayback.py --url "https://example.com/page" \ + --out data/wb.csv + +# All captures of a host +python3 SKILL_DIR/scripts/fetch_wayback.py --url "example.com" \ + --match host --out data/wb.csv + +# All captures of a domain + subdomains +python3 SKILL_DIR/scripts/fetch_wayback.py --url "example.com" \ + --match domain --out data/wb.csv + +# Only unique-content captures within a date window +python3 SKILL_DIR/scripts/fetch_wayback.py --url "example.com" \ + --match host --collapse digest \ + --from-date 2020-01-01 --to-date 2023-12-31 \ + --out data/wb.csv +``` + +## 8. Legal & Licensing + +- Internet Archive captures are made under fair-use research provisions +- Replay URLs are stable references — citing them is encouraged +- Internet Archive non-profit terms of use govern content +- Some content is rights-restricted; replay may be blocked even if the + CDX entry shows it as captured + +## 9. 
References + +- CDX server docs: https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md +- Wayback API: https://archive.org/help/wayback_api.php +- Internet Archive: https://archive.org/ diff --git a/optional-skills/research/osint-investigation/references/sources/wikipedia.md b/optional-skills/research/osint-investigation/references/sources/wikipedia.md new file mode 100644 index 00000000000..1a004bf2e8d --- /dev/null +++ b/optional-skills/research/osint-investigation/references/sources/wikipedia.md @@ -0,0 +1,107 @@ +# Wikipedia + Wikidata + +## 1. Summary + +Wikipedia is the canonical narrative-bio source for notable people, places, +and organizations. Wikidata is its structured-data counterpart: ~110M +items, each with claims, dates, identifiers, and cross-references to +external authorities (VIAF, ISNI, ORCID, GRID, etc.). + +Together they're a high-precision entity-resolution layer — the bar for +inclusion is real, but anything past that bar is well-cross-referenced. + +## 2. Access Methods + +- **Wikipedia OpenSearch:** `https://en.wikipedia.org/w/api.php?action=opensearch` +- **Wikipedia REST summary:** `https://en.wikipedia.org/api/rest_v1/page/summary/` +- **Wikidata Action API:** `https://www.wikidata.org/w/api.php?action=wbgetentities` +- **Wikidata SPARQL:** `https://query.wikidata.org/sparql` (more powerful but aggressively rate-limited) +- **Auth:** None, but **a meaningful User-Agent is required** + +Set `HERMES_OSINT_UA` to something identifying (e.g. `your-app/1.0 (you@example.com)`). +Wikimedia returns HTTP 429 to generic UAs. + +## 3. Data Schema + +Key fields emitted by `fetch_wikipedia.py`: + +| Column | Type | Description | +|--------|------|-------------| +| `source` | str | `wikipedia` or `wikipedia+wikidata` | +| `label` | str | Wikipedia article title | +| `description` | str | Short Wikidata description | +| `qid` | str | Wikidata QID (e.g. 
Q2283 for Microsoft) | +| `wikipedia_title`, `wikipedia_url` | str | Article identifier + URL | +| `wikidata_url` | str | Wikidata entity URL | +| `instance_of` | str | What kind of thing it is (P31) | +| `country` | str | Country (P17 for orgs/places, P27 for people) | +| `occupation` | str | P106 | +| `employer` | str | P108 | +| `date_of_birth` | str | P569, YYYY-MM-DD | +| `place_of_birth` | str | P19 | +| `summary` | str | Wikipedia REST extract (~1000 chars) | + +The fetch script uses Wikidata's Action API (NOT SPARQL) for structured +facts — far more lenient on rate limits. + +## 4. Coverage + +- Wikipedia EN: ~7M articles +- Wikidata: ~110M items, ~1.5B statements +- Updated continuously; abuse filters and bots run constantly +- High notability bar — most private individuals are not in Wikipedia + +## 5. Cross-Reference Potential + +- **All sources** ↔ `label` (entity identity resolution) +- **SEC EDGAR** ↔ `label` (public companies) +- **CourtListener** ↔ `label` (parties to notable litigation) +- **Wikidata external identifiers** (not currently in this fetcher's output) + link to VIAF, ISNI, ORCID, GRID, GitHub, Twitter, IMDb, ... + +Join key: Wikidata QID is canonical. Wikipedia titles are stable for +most articles but can be renamed. + +## 6. Data Quality + +- Notability filter — only notable entities (criteria vary by topic) +- Recency lag — current events take days to weeks to be reflected +- POV / vandalism — moderated, but edits between sweeps can be bad +- Living-persons biographies have stricter sourcing requirements +- Wikidata claims have qualifiers and references — the fetch script + doesn't currently export them + +## 7. 
Acquisition Script + +Path: `scripts/fetch_wikipedia.py` + +```bash +# Look up a notable entity +python3 SKILL_DIR/scripts/fetch_wikipedia.py --query "Microsoft" --out data/wp.csv + +# A specific person +python3 SKILL_DIR/scripts/fetch_wikipedia.py --query "Bill Gates" --out data/wp_bg.csv + +# Skip the Wikidata enrichment for speed +python3 SKILL_DIR/scripts/fetch_wikipedia.py --query "Microsoft" --no-wikidata \ + --limit 5 --out data/wp.csv +``` + +OpenSearch matching is fuzzy — `--limit 5` returns the top 5 Wikipedia article +matches. Each is enriched with the QID + structured facts unless +`--no-wikidata` is passed. + +## 8. Legal & Licensing + +- Wikipedia text: CC-BY-SA-4.0 / GFDL +- Wikidata claims: CC0 (public domain) +- API ToS: respect rate limits, identify your agent +- Commercial use allowed with attribution + +## 9. References + +- Wikipedia OpenSearch: https://www.mediawiki.org/wiki/API:Opensearch +- Wikipedia REST: https://en.wikipedia.org/api/rest_v1/ +- Wikidata Action API: https://www.wikidata.org/wiki/Wikidata:Data_access +- Wikidata SPARQL: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service +- User-Agent policy: https://meta.wikimedia.org/wiki/User-Agent_policy diff --git a/optional-skills/research/osint-investigation/scripts/_http.py b/optional-skills/research/osint-investigation/scripts/_http.py new file mode 100644 index 00000000000..5da62310b9f --- /dev/null +++ b/optional-skills/research/osint-investigation/scripts/_http.py @@ -0,0 +1,82 @@ +"""Tiny stdlib HTTP helper used by fetch_*.py scripts. + +Provides polite retry + JSON convenience + User-Agent enforcement.
+""" +from __future__ import annotations + +import json +import os +import time +import urllib.error +import urllib.parse +import urllib.request + +DEFAULT_UA = ( + "hermes-osint-investigation/0.2 " + "(+https://github.com/NousResearch/hermes-agent; " + "set HERMES_OSINT_UA env var to identify yourself per " + "Wikimedia / SEC fair-use guidance)" +) + + +def get( + url: str, + *, + params: dict | None = None, + headers: dict | None = None, + user_agent: str | None = None, + max_retries: int = 3, + backoff: float = 1.5, + timeout: float = 30.0, +) -> bytes: + """GET with retry on 5xx and Retry-After honoring. + + 429 (rate-limit) is raised IMMEDIATELY with a clear message — retrying + when the upstream says "you're over quota" just wastes time. The caller + should slow down or supply real credentials. + """ + if params: + sep = "&" if "?" in url else "?" + url = f"{url}{sep}{urllib.parse.urlencode(params)}" + h = {"User-Agent": user_agent or os.environ.get("HERMES_OSINT_UA", DEFAULT_UA)} + if headers: + h.update(headers) + + last_err: Exception | None = None + for attempt in range(max_retries + 1): + req = urllib.request.Request(url, headers=h) + try: + with urllib.request.urlopen(req, timeout=timeout) as resp: + return resp.read() + except urllib.error.HTTPError as e: + if e.code == 429: + # Surface immediately. Read the body so the caller sees the + # provider's actual message ("OVER_RATE_LIMIT" etc.). + try: + body = e.read(2048).decode("utf-8", errors="replace") + except Exception: # noqa: BLE001 + body = "" + raise RuntimeError( + f"HTTP 429 rate-limited by {urllib.parse.urlsplit(url).netloc}. " + f"Slow down or supply a real API key. 
Body: {body[:300]}" + ) from e + if e.code in (500, 502, 503, 504) and attempt < max_retries: + retry_after = e.headers.get("Retry-After") if e.headers else None + wait = float(retry_after) if (retry_after and retry_after.isdigit()) else backoff ** (attempt + 1) + time.sleep(wait) + last_err = e + continue + raise + except urllib.error.URLError as e: + if attempt < max_retries: + time.sleep(backoff ** (attempt + 1)) + last_err = e + continue + raise + if last_err: + raise last_err + raise RuntimeError("unreachable") + + +def get_json(url: str, **kwargs) -> dict | list: + return json.loads(get(url, **kwargs).decode("utf-8")) diff --git a/optional-skills/research/osint-investigation/scripts/_normalize.py b/optional-skills/research/osint-investigation/scripts/_normalize.py new file mode 100644 index 00000000000..3c9a197af8b --- /dev/null +++ b/optional-skills/research/osint-investigation/scripts/_normalize.py @@ -0,0 +1,67 @@ +"""Shared entity-name normalization helpers (stdlib-only). + +Used by entity_resolution.py and timing_analysis.py. +""" +from __future__ import annotations + +import re + +# Legal suffixes / corporate boilerplate to strip during normalization. 
+_SUFFIX_TOKENS = { + "INC", "INCORPORATED", "LLC", "LLP", "LP", "LTD", "LIMITED", + "CORP", "CORPORATION", "CO", "COMPANY", + "GROUP", "GRP", "HOLDINGS", "HOLDING", + "PARTNERS", "ASSOCIATES", + "INTERNATIONAL", "INTL", + "ENTERPRISES", "ENTERPRISE", + "SERVICES", "SERVICE", "SVCS", + "SOLUTIONS", "MANAGEMENT", "MGMT", "CONSULTING", + "TECHNOLOGY", "TECHNOLOGIES", "TECH", + "INDUSTRIES", "INDUSTRY", + "AMERICA", "AMERICAN", + "USA", "US", + "PLLC", "PC", + "TRUST", "FOUNDATION", +} + +_PUNCT_RE = re.compile(r"[^\w\s]") +_WS_RE = re.compile(r"\s+") + + +def normalize_name(name: str | None) -> str: + """Standard normalization: uppercase, strip suffixes, drop punctuation.""" + if not name: + return "" + s = _PUNCT_RE.sub(" ", name.upper()) + s = _WS_RE.sub(" ", s).strip() + tokens = [t for t in s.split() if t and t not in _SUFFIX_TOKENS] + return " ".join(tokens) + + +def normalize_aggressive(name: str | None) -> str: + """Aggressive normalization: sorted unique tokens (word-bag).""" + base = normalize_name(name) + if not base: + return "" + return " ".join(sorted(set(base.split()))) + + +def name_tokens(name: str | None, min_len: int = 4) -> set[str]: + """Token set used for overlap matching.""" + base = normalize_name(name) + if not base: + return set() + return {t for t in base.split() if len(t) >= min_len} + + +def token_overlap_ratio(left: str | None, right: str | None) -> tuple[float, int]: + """Return (jaccard-like ratio, shared token count) over min-len tokens.""" + a = name_tokens(left) + b = name_tokens(right) + if not a or not b: + return 0.0, 0 + shared = a & b + if not shared: + return 0.0, 0 + union = a | b + return len(shared) / len(union), len(shared) diff --git a/optional-skills/research/osint-investigation/scripts/build_findings.py b/optional-skills/research/osint-investigation/scripts/build_findings.py new file mode 100644 index 00000000000..15021eb0878 --- /dev/null +++ b/optional-skills/research/osint-investigation/scripts/build_findings.py @@ 
-0,0 +1,221 @@ +#!/usr/bin/env python3 +"""Build a structured findings.json with evidence chains (stdlib-only). + +Aggregates cross_links.csv (entity_resolution output) and an optional +timing.json (timing_analysis output) into a single evidence-chain document. + +Output structure: + { + "metadata": {...}, + "findings": [ + { + "id": "F0001", + "title": "...", + "severity": "HIGH|MEDIUM|LOW", + "confidence": "high|medium|low", + "summary": "...", + "evidence": [ + {"source": "cross_links.csv", "row": 12, "fields": {...}}, + ... + ], + "sources": ["cross_links.csv", "timing.json"] + } + ] + } + +Every finding traces to specific source rows. No naked claims. +""" +from __future__ import annotations + +import argparse +import csv +import json +from collections import defaultdict +from pathlib import Path + +CONFIDENCE_ORDER = {"high": 0, "medium": 1, "low": 2} +SEVERITY_ORDER = {"HIGH": 0, "MEDIUM": 1, "LOW": 2} + + +def _read_cross_links(path: str) -> list[dict[str, str]]: + with open(path, newline="", encoding="utf-8") as fh: + return list(csv.DictReader(fh)) + + +def build_findings( + cross_links_path: str, + timing_path: str | None = None, + out_path: str = "findings.json", + bundled_threshold: int = 3, +) -> dict: + findings: list[dict] = [] + next_id = 1 + + # 1. Match-based findings, grouped by (left_normalized, right_normalized). + matches = _read_cross_links(cross_links_path) + grouped: dict[tuple[str, str], list[dict[str, str]]] = defaultdict(list) + for i, row in enumerate(matches): + row["__row__"] = str(i) + grouped[(row.get("left_normalized", ""), row.get("right_normalized", ""))].append(row) + + for (left_norm, right_norm), rows in grouped.items(): + if not left_norm or not right_norm: + continue + # Use the highest-confidence match for the finding's overall confidence. 
+ best = min(rows, key=lambda r: CONFIDENCE_ORDER.get(r.get("confidence", "low"), 2)) + finding_id = f"F{next_id:04d}" + next_id += 1 + evidence = [ + { + "source": "cross_links.csv", + "row": int(r["__row__"]), + "fields": { + "match_type": r.get("match_type", ""), + "confidence": r.get("confidence", ""), + "left_name": r.get("left_name", ""), + "right_name": r.get("right_name", ""), + "overlap_ratio": r.get("overlap_ratio", ""), + "shared_tokens": r.get("shared_tokens", ""), + }, + } + for r in rows + ] + findings.append( + { + "id": finding_id, + "title": f"Entity match: {best.get('left_name', '')} ↔ {best.get('right_name', '')}", + "severity": "MEDIUM" if best.get("confidence") == "high" else "LOW", + "confidence": best.get("confidence", "low"), + "summary": ( + f"{len(rows)} cross-link record(s) tie " + f"'{best.get('left_name', '')}' to " + f"'{best.get('right_name', '')}' " + f"(best tier: {best.get('match_type', '')})." + ), + "evidence": evidence, + "sources": ["cross_links.csv"], + } + ) + + # 2. Bundled-donations findings (if cross_links carries donor↔candidate pattern). + # Heuristic: many distinct left names sharing the same right name. + by_right: dict[str, set[str]] = defaultdict(set) + by_right_rows: dict[str, list[dict[str, str]]] = defaultdict(list) + for r in matches: + right = r.get("right_normalized", "") + left_raw = r.get("left_name", "").strip() + if right and left_raw: + by_right[right].add(left_raw) + by_right_rows[right].append(r) + for right_norm, lefts in by_right.items(): + if len(lefts) < bundled_threshold: + continue + rows = by_right_rows[right_norm] + right_raw = rows[0].get("right_name", "") + findings.append( + { + "id": f"F{next_id:04d}", + "title": f"Bundled cross-links: {len(lefts)} distinct left entities ↔ '{right_raw}'", + "severity": "HIGH", + "confidence": "medium", + "summary": ( + f"{len(lefts)} distinct left-side entities link to " + f"'{right_raw}'. Pattern suggests coordinated relationship " + f"(e.g. 
bundled donations, multi-vendor employer)." + ), + "evidence": [ + { + "source": "cross_links.csv", + "row": int(r.get("__row__", "0")), + "fields": { + "left_name": r.get("left_name", ""), + "match_type": r.get("match_type", ""), + }, + } + for r in rows + ], + "sources": ["cross_links.csv"], + } + ) + next_id += 1 + + # 3. Timing-based findings. + if timing_path and Path(timing_path).exists(): + timing = json.loads(Path(timing_path).read_text(encoding="utf-8")) + for r in timing.get("results", []): + if not r.get("significant"): + continue + findings.append( + { + "id": f"F{next_id:04d}", + "title": ( + f"Donation timing significantly clusters near awards: " + f"{r['donor']} ↔ {r['recipient']}" + ), + "severity": "HIGH" if r["p_value"] < 0.01 else "MEDIUM", + "confidence": "medium", + "summary": ( + f"Mean nearest-award distance {r['observed_mean_days']} days " + f"(null {r['null_mean_days']} days). p={r['p_value']}, " + f"effect size {r['effect_size_sd']} SD. " + f"{r['n_donations']} donations, {r['n_award_dates']} awards." + ), + "evidence": [ + { + "source": "timing.json", + "row": None, + "fields": r, + } + ], + "sources": ["timing.json"], + } + ) + next_id += 1 + + # Sort: severity → confidence → id.
+ findings.sort( + key=lambda f: ( + SEVERITY_ORDER.get(f["severity"], 3), + CONFIDENCE_ORDER.get(f["confidence"], 3), + f["id"], + ) + ) + + payload = { + "metadata": { + "n_findings": len(findings), + "cross_links_path": cross_links_path, + "timing_path": timing_path, + "bundled_threshold": bundled_threshold, + }, + "findings": findings, + } + Path(out_path).write_text(json.dumps(payload, indent=2)) + return payload + + +def main() -> int: + p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + p.add_argument("--cross-links", required=True) + p.add_argument("--timing", help="Optional timing.json from timing_analysis.py") + p.add_argument("--out", default="findings.json") + p.add_argument( + "--bundled-threshold", + type=int, + default=3, + help="Minimum distinct left entities to flag as bundled (default 3)", + ) + a = p.parse_args() + + payload = build_findings( + cross_links_path=a.cross_links, + timing_path=a.timing, + out_path=a.out, + bundled_threshold=a.bundled_threshold, + ) + print(f"Wrote {payload['metadata']['n_findings']} findings to {a.out}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/optional-skills/research/osint-investigation/scripts/entity_resolution.py b/optional-skills/research/osint-investigation/scripts/entity_resolution.py new file mode 100644 index 00000000000..26d60d433d4 --- /dev/null +++ b/optional-skills/research/osint-investigation/scripts/entity_resolution.py @@ -0,0 +1,228 @@ +#!/usr/bin/env python3 +"""Cross-source entity resolution (stdlib-only). + +Given two CSV files with name columns, find candidate matches using three +tiers of normalization: + + 1. exact — normalized strings equal + 2. fuzzy — sorted-token (word-bag) match + 3. 
token_overlap — >=60% Jaccard overlap on >=4-char tokens, >=2 shared + +Adapted from ShinMegamiBoson/OpenPlanter (MIT) but generalized: no Boston- +specific record types, no contribution-code filters, no fixed schemas. + +Output CSV columns: + match_type, confidence, left_name, right_name, + left_normalized, right_normalized, left_row, right_row, + overlap_ratio, shared_tokens +""" +from __future__ import annotations + +import argparse +import csv +import sys +from pathlib import Path + +# Allow running directly or as a module. +sys.path.insert(0, str(Path(__file__).parent)) +from _normalize import ( # noqa: E402 + normalize_name, + normalize_aggressive, + token_overlap_ratio, +) + +CONFIDENCE = { + "exact": "high", + "fuzzy": "medium", + "token_overlap": "low", +} + + +def _read_csv(path: str, name_col: str) -> list[dict[str, str]]: + rows = [] + with open(path, newline="", encoding="utf-8") as fh: + reader = csv.DictReader(fh) + if name_col not in (reader.fieldnames or []): + raise SystemExit( + f"Column {name_col!r} not in {path}. 
" + f"Available: {reader.fieldnames}" + ) + for i, row in enumerate(reader): + row["__row__"] = str(i) + rows.append(row) + return rows + + +def _build_index(rows: list[dict[str, str]], name_col: str): + """Index by exact-normalized and aggressive (sorted-token) form.""" + exact: dict[str, list[dict[str, str]]] = {} + aggressive: dict[str, list[dict[str, str]]] = {} + for row in rows: + raw = row.get(name_col, "") + n = normalize_name(raw) + if n: + exact.setdefault(n, []).append(row) + a = normalize_aggressive(raw) + if a: + aggressive.setdefault(a, []).append(row) + return exact, aggressive + + +def _emit( + out_rows: list[dict[str, str]], + seen: set[tuple], + match_type: str, + left_row: dict[str, str], + right_row: dict[str, str], + left_col: str, + right_col: str, + ratio: float = 0.0, + shared: int = 0, +): + left_raw = left_row.get(left_col, "") + right_raw = right_row.get(right_col, "") + key = ( + left_row["__row__"], + right_row["__row__"], + match_type, + ) + if key in seen: + return + seen.add(key) + out_rows.append( + { + "match_type": match_type, + "confidence": CONFIDENCE[match_type], + "left_name": left_raw, + "right_name": right_raw, + "left_normalized": normalize_name(left_raw), + "right_normalized": normalize_name(right_raw), + "left_row": left_row["__row__"], + "right_row": right_row["__row__"], + "overlap_ratio": f"{ratio:.3f}" if ratio else "", + "shared_tokens": str(shared) if shared else "", + } + ) + + +def resolve( + left_path: str, + left_col: str, + right_path: str, + right_col: str, + out_path: str, + overlap_threshold: float = 0.60, + min_shared: int = 2, + skip_overlap: bool = False, +) -> int: + left_rows = _read_csv(left_path, left_col) + right_rows = _read_csv(right_path, right_col) + + right_exact, right_aggressive = _build_index(right_rows, right_col) + + out_rows: list[dict[str, str]] = [] + seen: set[tuple] = set() + + # Pass 1+2: exact / fuzzy via index lookup. 
+ for lrow in left_rows: + raw = lrow.get(left_col, "") + n = normalize_name(raw) + if not n: + continue + for rrow in right_exact.get(n, []): + _emit(out_rows, seen, "exact", lrow, rrow, left_col, right_col) + a = normalize_aggressive(raw) + if a: + for rrow in right_aggressive.get(a, []): + _emit(out_rows, seen, "fuzzy", lrow, rrow, left_col, right_col) + + if not skip_overlap: + # Pass 3: token overlap (O(N*M) — expensive; allow opt-out). + for lrow in left_rows: + l_raw = lrow.get(left_col, "") + if not normalize_name(l_raw): + continue + for rrow in right_rows: + ratio, shared = token_overlap_ratio( + l_raw, rrow.get(right_col, "") + ) + if ratio >= overlap_threshold and shared >= min_shared: + _emit( + out_rows, + seen, + "token_overlap", + lrow, + rrow, + left_col, + right_col, + ratio=ratio, + shared=shared, + ) + + fieldnames = [ + "match_type", + "confidence", + "left_name", + "right_name", + "left_normalized", + "right_normalized", + "left_row", + "right_row", + "overlap_ratio", + "shared_tokens", + ] + with open(out_path, "w", newline="", encoding="utf-8") as fh: + writer = csv.DictWriter(fh, fieldnames=fieldnames) + writer.writeheader() + writer.writerows(out_rows) + return len(out_rows) + + +def main() -> int: + p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + p.add_argument("--left", required=True, help="Left CSV path") + p.add_argument( + "--left-name-col", required=True, help="Name column in left CSV" + ) + p.add_argument("--right", required=True, help="Right CSV path") + p.add_argument( + "--right-name-col", + required=True, + help="Name column in right CSV", + ) + p.add_argument("--out", required=True, help="Output CSV path") + p.add_argument( + "--overlap-threshold", + type=float, + default=0.60, + help="Jaccard overlap threshold for token_overlap tier (default 0.60)", + ) + p.add_argument( + "--min-shared", + type=int, + default=2, + help="Minimum shared tokens for token_overlap tier 
(default 2)", + ) + p.add_argument( + "--skip-overlap", + action="store_true", + help="Skip the O(N*M) token_overlap pass (much faster on large CSVs)", + ) + args = p.parse_args() + + count = resolve( + left_path=args.left, + left_col=args.left_name_col, + right_path=args.right, + right_col=args.right_name_col, + out_path=args.out, + overlap_threshold=args.overlap_threshold, + min_shared=args.min_shared, + skip_overlap=args.skip_overlap, + ) + print(f"Wrote {count} match rows to {args.out}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/optional-skills/research/osint-investigation/scripts/fetch_courtlistener.py b/optional-skills/research/osint-investigation/scripts/fetch_courtlistener.py new file mode 100644 index 00000000000..db5e715bf57 --- /dev/null +++ b/optional-skills/research/osint-investigation/scripts/fetch_courtlistener.py @@ -0,0 +1,149 @@ +#!/usr/bin/env python3 +"""Search court records via CourtListener (Free Law Project). + +Covers ~10M federal and state court opinions, plus PACER docket data +where available. Public REST API v4 supports anonymous read access for +search; some endpoints require a token (free at courtlistener.com). + +Set COURTLISTENER_TOKEN to authenticate (raises rate limits). 
+""" +from __future__ import annotations + +import argparse +import csv +import os +import sys +import urllib.parse +from pathlib import Path + +sys.path.insert(0, str(Path(__file__).parent)) +from _http import get_json # noqa: E402 + +BASE = "https://www.courtlistener.com/api/rest/v4/search/" + +COLUMNS = [ + "case_name", + "court", + "court_id", + "date_filed", + "docket_number", + "judge", + "citation", + "result_type", + "snippet", + "absolute_url", +] + +SEARCH_TYPES = { + "opinions": "o", # Court opinions + "dockets": "r", # PACER dockets (may require auth depending on coverage) + "oral": "oa", # Oral arguments + "people": "p", # Judges / people + "recap": "r", # Same as dockets in v4 +} + + +def fetch( + query: str, + search_type: str, + court: str | None, + date_from: str | None, + date_to: str | None, + token: str | None, + limit: int, + out_path: str, +) -> int: + type_code = SEARCH_TYPES.get(search_type, search_type) + params = { + "q": query, + "type": type_code, + } + if court: + params["court"] = court + if date_from: + params["filed_after"] = date_from + if date_to: + params["filed_before"] = date_to + headers = {"Authorization": f"Token {token}"} if token else None + + rows: list[dict[str, str]] = [] + next_url: str | None = f"{BASE}?{urllib.parse.urlencode(params)}" + while next_url and len(rows) < limit: + try: + payload = get_json(next_url, headers=headers) + except Exception as e: # noqa: BLE001 + print(f"CourtListener error: {e}", file=sys.stderr) + break + if not isinstance(payload, dict): + break + results = payload.get("results", []) + for r in results: + if len(rows) >= limit: + break + rows.append( + { + "case_name": r.get("caseName", "") or r.get("case_name", "") or "", + "court": r.get("court", "") or "", + "court_id": r.get("court_id", "") or "", + "date_filed": (r.get("dateFiled", "") or r.get("date_filed", "") or "")[:10], + "docket_number": r.get("docketNumber", "") or r.get("docket_number", "") or "", + "judge": r.get("judge", "") 
or "", + "citation": "; ".join(r.get("citation", []) or []) if isinstance(r.get("citation"), list) else (r.get("citation") or ""), + "result_type": search_type, + "snippet": (r.get("snippet", "") or "").replace("\n", " ")[:500], + "absolute_url": ( + f"https://www.courtlistener.com{r.get('absolute_url') or ''}" + if (r.get("absolute_url") or "").startswith("/") + else (r.get("absolute_url") or "") + ), + } + ) + next_url = payload.get("next") + + Path(out_path).parent.mkdir(parents=True, exist_ok=True) + with open(out_path, "w", newline="", encoding="utf-8") as fh: + w = csv.DictWriter(fh, fieldnames=COLUMNS) + w.writeheader() + w.writerows(rows) + if not rows: + print( + f"CourtListener: 0 results for type={search_type!r} q={query!r}. " + "Most private individuals don't appear in published court records " + "unless they were party to a federal or state appellate case.", + file=sys.stderr, + ) + return len(rows) + + +def main() -> int: + p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + p.add_argument("--query", required=True, help="Search query (party name, case name, keyword)") + p.add_argument( + "--type", + default="opinions", + choices=list(SEARCH_TYPES.keys()), + help="Search type (default: opinions)", + ) + p.add_argument("--court", help="Court ID filter (e.g. 
'nysd' = SDNY, 'scotus' = Supreme Court)") + p.add_argument("--date-from", help="Filed-after date YYYY-MM-DD") + p.add_argument("--date-to", help="Filed-before date YYYY-MM-DD") + p.add_argument("--token", default=os.environ.get("COURTLISTENER_TOKEN")) + p.add_argument("--limit", type=int, default=100) + p.add_argument("--out", required=True) + a = p.parse_args() + n = fetch( + query=a.query, + search_type=a.type, + court=a.court, + date_from=a.date_from, + date_to=a.date_to, + token=a.token, + limit=a.limit, + out_path=a.out, + ) + print(f"Wrote {n} CourtListener rows to {a.out}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/optional-skills/research/osint-investigation/scripts/fetch_gdelt.py b/optional-skills/research/osint-investigation/scripts/fetch_gdelt.py new file mode 100644 index 00000000000..fa98dabc9bb --- /dev/null +++ b/optional-skills/research/osint-investigation/scripts/fetch_gdelt.py @@ -0,0 +1,162 @@ +#!/usr/bin/env python3 +"""Search the GDELT 2.0 DOC API for news mentions. + +GDELT monitors world news in 100+ languages and indexes the full text. +Free, anonymous, ~15-minute update frequency. Covers ~2015→present. + +Useful for surfacing news mentions of a person, company, or topic across +international media — much wider net than Google News. 
+""" +from __future__ import annotations + +import argparse +import csv +import json +import sys +import time +import urllib.parse +from pathlib import Path + +sys.path.insert(0, str(Path(__file__).parent)) +from _http import get_json # noqa: E402 + +BASE = "https://api.gdeltproject.org/api/v2/doc/doc" + +COLUMNS = [ + "title", + "url", + "seen_date", + "domain", + "language", + "source_country", + "tone", + "social_image", +] + + +def fetch( + query: str, + mode: str, + timespan: str | None, + start_datetime: str | None, + end_datetime: str | None, + source_country: str | None, + source_lang: str | None, + limit: int, + out_path: str, +) -> int: + params: dict[str, str] = { + "query": query, + "mode": mode, + "format": "json", + "maxrecords": str(min(limit, 250)), + "sort": "datedesc", + } + if timespan: + params["timespan"] = timespan + if start_datetime: + params["startdatetime"] = start_datetime.replace("-", "").replace(":", "").replace(" ", "") + if end_datetime: + params["enddatetime"] = end_datetime.replace("-", "").replace(":", "").replace(" ", "") + if source_country: + params["sourcecountry"] = source_country + if source_lang: + params["sourcelang"] = source_lang + + url = f"{BASE}?{urllib.parse.urlencode(params)}" + payload: dict | list = {} + for attempt in range(3): + try: + payload = get_json(url) + break + except RuntimeError as e: + # GDELT requires 1 request per 5 seconds; back off and retry. 
+ if "429" in str(e) and attempt < 2: + print( + f"GDELT throttle hit; sleeping 6s before retry " + f"(attempt {attempt + 1}/3)", + file=sys.stderr, + ) + time.sleep(6) + continue + print(f"GDELT error: {e}", file=sys.stderr) + payload = {} + break + except Exception as e: # noqa: BLE001 + print(f"GDELT error: {e}", file=sys.stderr) + payload = {} + break + + rows: list[dict[str, str]] = [] + if isinstance(payload, dict): + articles = payload.get("articles", []) or [] + for a in articles[:limit]: + seen = (a.get("seendate") or "") + # GDELT format: 20260319T083000Z → 2026-03-19 08:30:00Z + if len(seen) == 16 and "T" in seen: + seen = f"{seen[0:4]}-{seen[4:6]}-{seen[6:8]} {seen[9:11]}:{seen[11:13]}:{seen[13:15]}Z" + rows.append( + { + "title": (a.get("title") or "").replace("\n", " ").strip(), + "url": a.get("url") or "", + "seen_date": seen, + "domain": a.get("domain") or "", + "language": a.get("language") or "", + "source_country": a.get("sourcecountry") or "", + "tone": str(a.get("tone") or ""), + "social_image": a.get("socialimage") or "", + } + ) + + Path(out_path).parent.mkdir(parents=True, exist_ok=True) + with open(out_path, "w", newline="", encoding="utf-8") as fh: + w = csv.DictWriter(fh, fieldnames=COLUMNS) + w.writeheader() + w.writerows(rows) + if not rows: + print( + f"GDELT: 0 articles for query={query!r}. " + "GDELT indexes ~2015→present. 
Try widening the timespan or " + "checking the query syntax (https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/).", + file=sys.stderr, + ) + return len(rows) + + +def main() -> int: + p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + p.add_argument("--query", required=True, help='Search query (supports GDELT operators: quoted phrases, AND/OR/NOT, sourcecountry:, theme:)') + p.add_argument( + "--mode", + default="ArtList", + choices=["ArtList", "ImageCollage", "TimelineVol", "TimelineTone", "ToneChart"], + help="GDELT mode (default ArtList for article list)", + ) + p.add_argument( + "--timespan", + help="Relative window: e.g. '1d', '1w', '1m', '3m', '1y' (overrides start/end)", + ) + p.add_argument("--start", help="Absolute start YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS") + p.add_argument("--end", help="Absolute end YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS") + p.add_argument("--source-country", help="2-letter source country (e.g. US, UK)") + p.add_argument("--source-lang", help="Source language (e.g. English, Spanish)") + p.add_argument("--limit", type=int, default=100) + p.add_argument("--out", required=True) + a = p.parse_args() + n = fetch( + query=a.query, + mode=a.mode, + timespan=a.timespan, + start_datetime=a.start, + end_datetime=a.end, + source_country=a.source_country, + source_lang=a.source_lang, + limit=a.limit, + out_path=a.out, + ) + print(f"Wrote {n} GDELT article rows to {a.out}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/optional-skills/research/osint-investigation/scripts/fetch_icij_offshore.py b/optional-skills/research/osint-investigation/scripts/fetch_icij_offshore.py new file mode 100644 index 00000000000..8d050b62bf1 --- /dev/null +++ b/optional-skills/research/osint-investigation/scripts/fetch_icij_offshore.py @@ -0,0 +1,234 @@ +#!/usr/bin/env python3 +"""Search ICIJ Offshore Leaks via the bulk CSV database. 
+ +The old reconcile endpoint (https://offshoreleaks.icij.org/reconcile) returns +404 — ICIJ has removed it. The remaining stable access path is the public +bulk download: + + https://offshoreleaks-data.icij.org/offshoreleaks/csv/full-oldb.LATEST.zip + +~70 MB, ~6 CSVs inside (nodes-entities, nodes-officers, nodes-intermediaries, +nodes-addresses, relationships, ...). We cache it under +$HERMES_OSINT_CACHE/icij/ (default: ~/.cache/hermes-osint/icij/) and search +locally so the agent doesn't re-download for every query. + +Output CSV columns match the original `fetch_icij_offshore.py` contract. +""" +from __future__ import annotations + +import argparse +import csv +import io +import os +import re +import sys +import time +import urllib.request +import zipfile +from pathlib import Path + +BULK_URL = "https://offshoreleaks-data.icij.org/offshoreleaks/csv/full-oldb.LATEST.zip" + +COLUMNS = [ + "node_id", + "name", + "node_type", + "country_codes", + "countries", + "jurisdiction", + "incorporation_date", + "inactivation_date", + "source", + "entity_url", + "connections", +] + + +def _cache_dir() -> Path: + base = os.environ.get("HERMES_OSINT_CACHE") + if base: + return Path(base) / "icij" + return Path.home() / ".cache" / "hermes-osint" / "icij" + + +def _download(dest: Path, force: bool = False) -> Path: + """Download (or reuse cached) ICIJ bulk ZIP.""" + dest.mkdir(parents=True, exist_ok=True) + zip_path = dest / "full-oldb.zip" + if zip_path.exists() and not force: + # Re-check age: refetch if older than 30 days. 
+ age_days = (time.time() - zip_path.stat().st_mtime) / 86400 + if age_days < 30: + return zip_path + print(f"Downloading ICIJ bulk database (~70 MB) to {zip_path}", file=sys.stderr) + req = urllib.request.Request( + BULK_URL, + headers={"User-Agent": "hermes-agent osint-investigation skill"}, + ) + with urllib.request.urlopen(req, timeout=120) as resp: # noqa: S310 + tmp = zip_path.with_suffix(".zip.tmp") + with open(tmp, "wb") as fh: + while True: + chunk = resp.read(1 << 16) + if not chunk: + break + fh.write(chunk) + tmp.replace(zip_path) + return zip_path + + +def _open_csv(zf: zipfile.ZipFile, name_pattern: str): + """Open the first CSV matching name_pattern (case-insensitive substring).""" + for info in zf.infolist(): + if name_pattern.lower() in info.filename.lower() and info.filename.lower().endswith(".csv"): + return zf.open(info), info.filename + return None, None + + +def _match(needle_norm: str, hay: str) -> bool: + return needle_norm in (hay or "").upper() + + +def _normalize_query(s: str) -> str: + s = s.upper() + s = re.sub(r"[^\w\s]", " ", s) + s = re.sub(r"\s+", " ", s).strip() + return s + + +def fetch( + entity: str | None, + officer: str | None, + jurisdiction: str | None, + out_path: str, + cache_dir: Path, + force_refresh: bool = False, + limit: int = 500, +) -> int: + zip_path = _download(cache_dir, force=force_refresh) + rows: list[dict[str, str]] = [] + needles: list[tuple[str, str]] = [] # (kind, normalized needle) + if entity: + needles.append(("Entity", _normalize_query(entity))) + if officer: + needles.append(("Officer", _normalize_query(officer))) + jur_norm = _normalize_query(jurisdiction) if jurisdiction else None + + targets = [ + ("Entity", "nodes-entities"), + ("Officer", "nodes-officers"), + ("Intermediary", "nodes-intermediaries"), + ] + + with zipfile.ZipFile(zip_path) as zf: + for node_type, csv_substring in targets: + relevant_needles = [n for (k, n) in needles if k in (node_type, "Entity", "Officer")] or [] + # Only scan a 
CSV if we have a needle that could plausibly match it, + # or if we have ONLY a jurisdiction filter. + applicable_needles = [n for (k, n) in needles if k == node_type] + if needles and not applicable_needles and not jur_norm: + continue + stream, fname = _open_csv(zf, csv_substring) + if not stream: + continue + with stream: + text = io.TextIOWrapper(stream, encoding="utf-8", errors="replace") + reader = csv.DictReader(text) + for row in reader: + name = (row.get("name") or "").strip() + if not name: + continue + name_u = name.upper() + matched = False + for n in applicable_needles or relevant_needles: + if _match(n, name_u): + matched = True + break + if not needles: + matched = True # jurisdiction-only sweep + if not matched: + continue + jur = (row.get("jurisdiction_description") or row.get("country_codes") or "").strip() + if jur_norm and jur_norm not in jur.upper() and jur_norm not in (row.get("countries") or "").upper(): + continue + node_id = (row.get("node_id") or "").strip() + rows.append( + { + "node_id": node_id, + "name": name, + "node_type": node_type, + "country_codes": row.get("country_codes", "") or "", + "countries": row.get("countries", "") or "", + "jurisdiction": jur, + "incorporation_date": row.get("incorporation_date", "") or "", + "inactivation_date": row.get("inactivation_date", "") or "", + "source": row.get("sourceID", "") or row.get("source", "") or "", + "entity_url": ( + f"https://offshoreleaks.icij.org/nodes/{node_id}" if node_id else "" + ), + "connections": "", + } + ) + if len(rows) >= limit: + break + if len(rows) >= limit: + break + + Path(out_path).parent.mkdir(parents=True, exist_ok=True) + with open(out_path, "w", newline="", encoding="utf-8") as fh: + w = csv.DictWriter(fh, fieldnames=COLUMNS) + w.writeheader() + w.writerows(rows) + if not rows: + bits = [] + if entity: + bits.append(f"entity={entity!r}") + if officer: + bits.append(f"officer={officer!r}") + if jurisdiction: + bits.append(f"jurisdiction={jurisdiction!r}") + 
print( + f"ICIJ: 0 matches for {', '.join(bits)}. " + "The bulk database covers offshore leaks (Panama, Paradise, Pandora, " + "Bahamas, Offshore Leaks). Most private US individuals are NOT in it.", + file=sys.stderr, + ) + return len(rows) + + +def main() -> int: + p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + p.add_argument("--entity", help="Search by entity name (substring, case-insensitive)") + p.add_argument("--officer", help="Search by officer / individual name (substring, case-insensitive)") + p.add_argument("--jurisdiction", help="Filter results by jurisdiction substring") + p.add_argument("--limit", type=int, default=500) + p.add_argument("--out", required=True) + p.add_argument( + "--cache-dir", + type=Path, + default=None, + help="Override cache directory (default: $HERMES_OSINT_CACHE/icij or ~/.cache/hermes-osint/icij)", + ) + p.add_argument( + "--force-refresh", + action="store_true", + help="Re-download the bulk ZIP even if a recent cached copy exists.", + ) + a = p.parse_args() + if not (a.entity or a.officer or a.jurisdiction): + p.error("must supply at least one of --entity / --officer / --jurisdiction") + n = fetch( + entity=a.entity, + officer=a.officer, + jurisdiction=a.jurisdiction, + out_path=a.out, + cache_dir=a.cache_dir or _cache_dir(), + force_refresh=a.force_refresh, + limit=a.limit, + ) + print(f"Wrote {n} ICIJ Offshore Leaks rows to {a.out}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/optional-skills/research/osint-investigation/scripts/fetch_nyc_acris.py b/optional-skills/research/osint-investigation/scripts/fetch_nyc_acris.py new file mode 100644 index 00000000000..6ec448f0f53 --- /dev/null +++ b/optional-skills/research/osint-investigation/scripts/fetch_nyc_acris.py @@ -0,0 +1,203 @@ +#!/usr/bin/env python3 +"""Search NYC property records via ACRIS (Automated City Register Information System). 
+ +Uses the city's Socrata-backed open data API. No auth required for read access. + +Datasets: + bnx9-e6tj — Real Property Master (one row per recorded document) + 636b-3b5g — Real Property Parties (names — grantor, grantee, etc.) + 8h5j-fqxa — Real Property Legal (lot / property identifiers) + uqqa-hym2 — Real Property References + +The Parties dataset has the names. We search by name and optionally join to +Master to get the doc type and date. +""" +from __future__ import annotations + +import argparse +import csv +import sys +import urllib.parse +from pathlib import Path + +sys.path.insert(0, str(Path(__file__).parent)) +from _http import get_json # noqa: E402 + +PARTIES_URL = "https://data.cityofnewyork.us/resource/636b-3b5g.json" +MASTER_URL = "https://data.cityofnewyork.us/resource/bnx9-e6tj.json" + +PARTY_TYPE = { + "1": "grantor (seller / mortgagor / debtor)", + "2": "grantee (buyer / mortgagee / creditor)", + "3": "other party", +} + +BOROUGH = { + "1": "Manhattan", + "2": "Bronx", + "3": "Brooklyn", + "4": "Queens", + "5": "Staten Island", +} + +COLUMNS = [ + "document_id", + "name", + "party_type", + "party_role", + "address_1", + "address_2", + "city", + "state", + "zip", + "country", + "doc_type", + "doc_date", + "recorded_date", + "borough", + "amount", + "filing_url", +] + + +def _filing_url(document_id: str) -> str: + if not document_id: + return "" + return ( + f"https://a836-acris.nyc.gov/DS/DocumentSearch/DocumentImageView?doc_id={document_id}" + ) + + +def fetch( + name: str | None, + address: str | None, + party_type: str | None, + limit: int, + out_path: str, + enrich: bool = True, +) -> int: + if not (name or address): + raise SystemExit("must supply --name or --address") + + where_clauses: list[str] = [] + if name: + safe = name.upper().replace("'", "''") + where_clauses.append(f"upper(name) like '%{safe}%'") + if address: + safe_addr = address.upper().replace("'", "''") + where_clauses.append(f"upper(address_1) like '%{safe_addr}%'") + if 
party_type and party_type in {"1", "2", "3"}: + where_clauses.append(f"party_type='{party_type}'") + + params = { + "$where": " AND ".join(where_clauses), + "$limit": str(limit), + } + url = f"{PARTIES_URL}?{urllib.parse.urlencode(params)}" + parties = get_json(url) + if not isinstance(parties, list): + raise SystemExit(f"Unexpected ACRIS response: {parties!r}") + + # Enrich with master record (doc_type, dates, borough, amount). + doc_ids: list[str] = sorted({ + d for d in (p.get("document_id") for p in parties) if d + }) + masters: dict[str, dict] = {} + if enrich and doc_ids: + # Batch up to 100 doc_ids per request (Socrata IN-list is fine for this). + for i in range(0, len(doc_ids), 100): + chunk = doc_ids[i : i + 100] + id_list = ",".join(f"'{d}'" for d in chunk) + master_params = { + "$where": f"document_id in ({id_list})", + "$limit": "100", + } + url = f"{MASTER_URL}?{urllib.parse.urlencode(master_params)}" + try: + rows = get_json(url) + except Exception as e: # noqa: BLE001 + print(f"ACRIS master lookup failed for chunk: {e}", file=sys.stderr) + continue + if isinstance(rows, list): + for r in rows: + did = r.get("document_id", "") + if did: + masters[did] = r + + out_rows: list[dict[str, str]] = [] + for p in parties: + did = p.get("document_id", "") or "" + m = masters.get(did, {}) + out_rows.append( + { + "document_id": did, + "name": p.get("name", "") or "", + "party_type": p.get("party_type", "") or "", + "party_role": PARTY_TYPE.get(p.get("party_type", ""), ""), + "address_1": p.get("address_1", "") or "", + "address_2": p.get("address_2", "") or "", + "city": p.get("city", "") or "", + "state": p.get("state", "") or "", + "zip": p.get("zip", "") or "", + "country": p.get("country", "") or "", + "doc_type": m.get("doc_type", "") or "", + "doc_date": (m.get("document_date", "") or "")[:10], + "recorded_date": (m.get("recorded_datetime", "") or "")[:10], + "borough": BOROUGH.get(m.get("recorded_borough", ""), m.get("recorded_borough", "")), + "amount": 
m.get("document_amt", "") or "", + "filing_url": _filing_url(did), + } + ) + + Path(out_path).parent.mkdir(parents=True, exist_ok=True) + with open(out_path, "w", newline="", encoding="utf-8") as fh: + w = csv.DictWriter(fh, fieldnames=COLUMNS) + w.writeheader() + w.writerows(out_rows) + + if not out_rows: + filters = [] + if name: + filters.append(f"name={name!r}") + if address: + filters.append(f"address={address!r}") + print( + f"NYC ACRIS: 0 records for {', '.join(filters)}. " + "ACRIS covers ONLY NYC (5 boroughs). For property records elsewhere, " + "search the relevant county recorder directly.", + file=sys.stderr, + ) + return len(out_rows) + + +def main() -> int: + p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + p.add_argument("--name", help="Party name substring (case-insensitive)") + p.add_argument("--address", help="Address line 1 substring") + p.add_argument( + "--party-type", + choices=["1", "2", "3"], + help="Filter party type: 1=grantor (seller/mortgagor), 2=grantee (buyer/mortgagee), 3=other", + ) + p.add_argument("--limit", type=int, default=200) + p.add_argument( + "--no-enrich", + action="store_true", + help="Skip the master-document lookup that adds doc_type/date/amount", + ) + p.add_argument("--out", required=True) + a = p.parse_args() + n = fetch( + name=a.name, + address=a.address, + party_type=a.party_type, + limit=a.limit, + out_path=a.out, + enrich=not a.no_enrich, + ) + print(f"Wrote {n} NYC ACRIS rows to {a.out}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/optional-skills/research/osint-investigation/scripts/fetch_ofac_sdn.py b/optional-skills/research/osint-investigation/scripts/fetch_ofac_sdn.py new file mode 100644 index 00000000000..5233fa09ab8 --- /dev/null +++ b/optional-skills/research/osint-investigation/scripts/fetch_ofac_sdn.py @@ -0,0 +1,175 @@ +#!/usr/bin/env python3 +"""Fetch OFAC SDN list (CSV format) and normalize. 
+ +Public endpoint: https://www.treasury.gov/ofac/downloads/sdn.csv +Format reference: https://ofac.treasury.gov/specially-designated-nationals-and-blocked-persons-list-sdn-human-readable-lists + +The SDN CSV uses a specific 12-column format with no header row: + ent_num, sdn_name, sdn_type, program, title, call_sign, vess_type, + tonnage, grt, vess_flag, vess_owner, remarks +Address and AKA records live in separate files. We fetch all three and join. +""" +from __future__ import annotations + +import argparse +import csv +import io +import sys +from collections import defaultdict +from pathlib import Path + +sys.path.insert(0, str(Path(__file__).parent)) +from _http import get # noqa: E402 + +SDN_URL = "https://www.treasury.gov/ofac/downloads/sdn.csv" +ADD_URL = "https://www.treasury.gov/ofac/downloads/add.csv" +ALT_URL = "https://www.treasury.gov/ofac/downloads/alt.csv" + +SDN_COLS = [ + "ent_num", "sdn_name", "sdn_type", "program", "title", + "call_sign", "vess_type", "tonnage", "grt", "vess_flag", + "vess_owner", "remarks", +] +ADD_COLS = [ + "ent_num", "add_num", "address", "city_state_zip", "country", "add_remarks", +] +ALT_COLS = [ + "ent_num", "alt_num", "alt_type", "alt_name", "alt_remarks", +] + +COLUMNS = [ + "entity_id", + "name", + "entity_type", + "program_list", + "title", + "nationalities", + "aka_list", + "addresses", + "dob", + "pob", + "remarks", + "last_updated", +] + +_TYPE_MAP = { + "individual": "individual", + "entity": "entity", + # The CSV marks entities (companies/organizations) with the null token "-0-", + # which _strip_quotes() reduces to "". Map it so --entity-type entity matches. + "": "entity", + "vessel": "vessel", + "aircraft": "aircraft", +} + + +def _read_csv(url: str, columns: list[str]) -> list[dict[str, str]]: + body = get(url, timeout=60).decode("latin-1", errors="replace") + reader = csv.reader(io.StringIO(body)) + out = [] + for row in reader: + if not row: + continue + # Pad/truncate to expected width.
+ row = row[: len(columns)] + [""] * (len(columns) - len(row)) + out.append(dict(zip(columns, row))) + return out + + +def _strip_quotes(s: str) -> str: + s = s.strip() + if s.startswith('"') and s.endswith('"'): + s = s[1:-1] + if s == "-0-": + return "" + return s + + +def fetch( + program: str | None, + entity_type: str | None, + out_path: str, +) -> int: + sdn = _read_csv(SDN_URL, SDN_COLS) + addresses = _read_csv(ADD_URL, ADD_COLS) + akas = _read_csv(ALT_URL, ALT_COLS) + + addr_by_ent: dict[str, list[str]] = defaultdict(list) + for a in addresses: + ent = _strip_quotes(a["ent_num"]) + parts = [ + _strip_quotes(a[c]) + for c in ("address", "city_state_zip", "country") + if _strip_quotes(a[c]) + ] + if parts: + addr_by_ent[ent].append(", ".join(parts)) + + aka_by_ent: dict[str, list[str]] = defaultdict(list) + for k in akas: + ent = _strip_quotes(k["ent_num"]) + name = _strip_quotes(k["alt_name"]) + if name: + aka_by_ent[ent].append(name) + + rows: list[dict[str, str]] = [] + for r in sdn: + ent_num = _strip_quotes(r["ent_num"]) + if not ent_num: + continue + sdn_type = _TYPE_MAP.get(_strip_quotes(r["sdn_type"]).lower(), _strip_quotes(r["sdn_type"])) + if entity_type and sdn_type != entity_type: + continue + progs = _strip_quotes(r["program"]) + if program and program.upper() not in progs.upper().split(";"): + continue + remarks = _strip_quotes(r["remarks"]) + # DOB / POB are commonly embedded in remarks for individuals. 
+ dob = "" + pob = "" + if sdn_type == "individual" and remarks: + for chunk in remarks.split(";"): + ch = chunk.strip() + if ch.upper().startswith("DOB"): + dob = ch.split(maxsplit=1)[1] if " " in ch else "" + elif ch.upper().startswith("POB"): + pob = ch.split(maxsplit=1)[1] if " " in ch else "" + rows.append( + { + "entity_id": ent_num, + "name": _strip_quotes(r["sdn_name"]), + "entity_type": sdn_type, + "program_list": "; ".join(p.strip() for p in progs.split(";") if p.strip()), + "title": _strip_quotes(r["title"]), + "nationalities": "", # not in this CSV; available in XML format + "aka_list": "; ".join(aka_by_ent.get(ent_num, [])), + "addresses": "; ".join(addr_by_ent.get(ent_num, [])), + "dob": dob, + "pob": pob, + "remarks": remarks, + "last_updated": "", + } + ) + + Path(out_path).parent.mkdir(parents=True, exist_ok=True) + with open(out_path, "w", newline="", encoding="utf-8") as fh: + w = csv.DictWriter(fh, fieldnames=COLUMNS) + w.writeheader() + w.writerows(rows) + return len(rows) + + +def main() -> int: + p = argparse.ArgumentParser(description=__doc__) + p.add_argument("--program", help="Filter to specific sanctions program (e.g. SDGT, IRAN)") + p.add_argument( + "--entity-type", + choices=["individual", "entity", "vessel", "aircraft"], + help="Filter to a specific entity type", + ) + p.add_argument("--out", required=True) + a = p.parse_args() + n = fetch(program=a.program, entity_type=a.entity_type, out_path=a.out) + print(f"Wrote {n} OFAC SDN rows to {a.out}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/optional-skills/research/osint-investigation/scripts/fetch_opencorporates.py b/optional-skills/research/osint-investigation/scripts/fetch_opencorporates.py new file mode 100644 index 00000000000..6924a8056a6 --- /dev/null +++ b/optional-skills/research/osint-investigation/scripts/fetch_opencorporates.py @@ -0,0 +1,192 @@ +#!/usr/bin/env python3 +"""Search OpenCorporates company registry data. 
+ +OpenCorporates aggregates ~200M companies from 130+ jurisdictions. The +public API requires an API token (free tier: 500 calls/month). Set +OPENCORPORATES_API_TOKEN in env or pass --token. + +Without a token, this script falls back to scraping the public HTML +search page (limited fields, more brittle, no jurisdiction filter). +""" +from __future__ import annotations + +import argparse +import csv +import json +import os +import re +import sys +import urllib.parse +from pathlib import Path + +sys.path.insert(0, str(Path(__file__).parent)) +from _http import get, get_json # noqa: E402 + +API_URL = "https://api.opencorporates.com/v0.4/companies/search" +HTML_URL = "https://opencorporates.com/companies" + +COLUMNS = [ + "name", + "company_number", + "jurisdiction_code", + "jurisdiction_name", + "incorporation_date", + "dissolution_date", + "company_type", + "status", + "registered_address", + "opencorporates_url", + "officers_count", + "source", +] + + +def _via_api(query: str, jurisdiction: str | None, token: str, limit: int) -> list[dict]: + params = { + "q": query, + "api_token": token, + "per_page": str(min(limit, 100)), + } + if jurisdiction: + params["jurisdiction_code"] = jurisdiction + url = f"{API_URL}?{urllib.parse.urlencode(params)}" + payload = get_json(url) + if not isinstance(payload, dict): + return [] + results = payload.get("results", {}).get("companies", []) or [] + return [r.get("company", {}) for r in results if isinstance(r, dict)] + + +def _via_html(query: str, limit: int) -> list[dict]: + """Best-effort HTML fallback when no API token is available.""" + params = {"q": query, "utf8": "✓"} + url = f"{HTML_URL}?{urllib.parse.urlencode(params)}" + body = get(url, user_agent="Mozilla/5.0 hermes-osint").decode("utf-8", errors="replace") + # Each result is in <li class="company"> ... </li> with name, url, status + pattern = re.compile( + r'<li[^>]*class="[^"]*company[^"]*"[^>]*>.*?' 
+ r'<a[^>]+href="(?P<url>/companies/[^"]+)"[^>]*>(?P<name>[^<]+)</a>' + r'(?:.*?<span[^>]*class="[^"]*jurisdiction[^"]*"[^>]*>(?P<jur>[^<]+)</span>)?' + r"(?:.*?<dt[^>]*>(?:Company\s+Number|Number)</dt>\s*<dd[^>]*>(?P<num>[^<]+)</dd>)?", + re.DOTALL | re.IGNORECASE, + ) + out = [] + for m in pattern.finditer(body): + if len(out) >= limit: + break + url_path = m.group("url").strip() + out.append( + { + "name": (m.group("name") or "").strip(), + "opencorporates_url": f"https://opencorporates.com{url_path}", + "jurisdiction_code": (m.group("jur") or "").strip(), + "company_number": (m.group("num") or "").strip(), + "_via": "html", + } + ) + return out + + +def fetch( + query: str, + jurisdiction: str | None, + token: str | None, + limit: int, + out_path: str, +) -> int: + if token: + try: + companies = _via_api(query, jurisdiction, token, limit) + source_tag = "api" + except Exception as e: # noqa: BLE001 + print( + f"OpenCorporates API call failed ({e}); falling back to HTML.", + file=sys.stderr, + ) + companies = _via_html(query, limit) + source_tag = "html-fallback" + else: + print( + "OPENCORPORATES_API_TOKEN not set — using HTML fallback (limited fields). 
" + "Get a free token at https://opencorporates.com/api_accounts/new", + file=sys.stderr, + ) + companies = _via_html(query, limit) + source_tag = "html" + + rows: list[dict[str, str]] = [] + for c in companies[:limit]: + if c.get("_via") == "html": + rows.append( + { + "name": c.get("name", ""), + "company_number": c.get("company_number", ""), + "jurisdiction_code": c.get("jurisdiction_code", ""), + "jurisdiction_name": "", + "incorporation_date": "", + "dissolution_date": "", + "company_type": "", + "status": "", + "registered_address": "", + "opencorporates_url": c.get("opencorporates_url", ""), + "officers_count": "", + "source": source_tag, + } + ) + continue + addr = c.get("registered_address_in_full") or "" + rows.append( + { + "name": c.get("name", "") or "", + "company_number": c.get("company_number", "") or "", + "jurisdiction_code": c.get("jurisdiction_code", "") or "", + "jurisdiction_name": "", + "incorporation_date": c.get("incorporation_date", "") or "", + "dissolution_date": c.get("dissolution_date", "") or "", + "company_type": c.get("company_type", "") or "", + # "inactive" is a boolean flag, so emit a label rather than the raw True. + "status": c.get("current_status", "") or ("inactive" if c.get("inactive") else ""), + "registered_address": addr, + "opencorporates_url": c.get("opencorporates_url", "") or "", + # Only read total_count when "officers" is a mapping; any other shape yields "". + "officers_count": str(c["officers"].get("total_count", "")) if isinstance(c.get("officers"), dict) else "", + "source": source_tag, + } + ) + + Path(out_path).parent.mkdir(parents=True, exist_ok=True) + with open(out_path, "w", newline="", encoding="utf-8") as fh: + w = csv.DictWriter(fh, fieldnames=COLUMNS) + w.writeheader() + w.writerows(rows) + if not rows: + print( + f"OpenCorporates: 0 matches for query={query!r}" + f"{f' jurisdiction={jurisdiction!r}' if jurisdiction else ''}.", + file=sys.stderr, + ) + return len(rows) + + +def main() -> int: + p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + p.add_argument("--query", required=True, help="Company name search") +
p.add_argument( + "--jurisdiction", + help="Jurisdiction code, e.g. 'us_ny', 'us_de', 'gb', 'sg' (lowercased OpenCorporates style)", + ) + p.add_argument("--limit", type=int, default=50) + p.add_argument("--token", default=os.environ.get("OPENCORPORATES_API_TOKEN")) + p.add_argument("--out", required=True) + a = p.parse_args() + n = fetch( + query=a.query, + jurisdiction=a.jurisdiction, + token=a.token, + limit=a.limit, + out_path=a.out, + ) + print(f"Wrote {n} OpenCorporates rows to {a.out}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/optional-skills/research/osint-investigation/scripts/fetch_sec_edgar.py b/optional-skills/research/osint-investigation/scripts/fetch_sec_edgar.py new file mode 100644 index 00000000000..bd2fda8feb9 --- /dev/null +++ b/optional-skills/research/osint-investigation/scripts/fetch_sec_edgar.py @@ -0,0 +1,184 @@ +#!/usr/bin/env python3 +"""Fetch SEC EDGAR filings index for a given CIK or company name. + +SEC requires a User-Agent header with contact info. Set SEC_USER_AGENT, +e.g. SEC_USER_AGENT="Research example@example.com". + +Filings JSON is published at: + https://data.sec.gov/submissions/CIK<10-digit-padded>.json + +Company lookup uses: + https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&company=<name>&output=atom +""" +from __future__ import annotations + +import argparse +import csv +import os +import re +import sys +from pathlib import Path + +sys.path.insert(0, str(Path(__file__).parent)) +from _http import get, get_json # noqa: E402 + +SUBMISSIONS_URL = "https://data.sec.gov/submissions/CIK{cik}.json" +COLUMNS = [ + "cik", + "company_name", + "form_type", + "filing_date", + "accession_number", + "primary_document", + "filing_url", + "reporting_period", +] + + +def _ua() -> str: + ua = os.environ.get("SEC_USER_AGENT", "").strip() + if not ua: + raise SystemExit( + "SEC requires a User-Agent with contact info. " + "Set SEC_USER_AGENT='Your Name your@email'." 
+ ) + return ua + + +def _resolve_cik(company: str) -> tuple[str, str]: + """Resolve a company name to a CIK via EDGAR's atom feed. + + Returns (cik, resolved_company_name). The first match may be an individual + filer (Form 3/4/5 only) rather than a company; that is not flagged here. + fetch() warns about that case when a type filter returns no rows. + """ + url = "https://www.sec.gov/cgi-bin/browse-edgar" + params = {"action": "getcompany", "company": company, "output": "atom", "owner": "include"} + body = get(url, params=params, user_agent=_ua()).decode("utf-8", errors="replace") + m = re.search(r"CIK=(\d{10})", body) + if not m: + raise SystemExit(f"Could not resolve CIK for company={company!r}") + cik = m.group(1) + name_m = re.search(r"<title>([^<]+)\s*\((\d{10})\)", body) + resolved = name_m.group(1).strip() if name_m else "" + return cik, resolved + + +def fetch( + cik: str | None, + company: str | None, + types: list[str], + since: str | None, + out_path: str, +) -> int: + resolved_name = "" + if not cik and company: + try: + cik, resolved_name = _resolve_cik(company) # type: ignore[assignment] + except SystemExit as e: + # Write empty CSV with header so downstream tools still work, + # and tell the user clearly.
+ print(f"SEC EDGAR: {e}", file=sys.stderr) + Path(out_path).parent.mkdir(parents=True, exist_ok=True) + with open(out_path, "w", newline="", encoding="utf-8") as fh: + csv.DictWriter(fh, fieldnames=COLUMNS).writeheader() + return 0 + if resolved_name: + print( + f"Resolved company={company!r} → CIK {cik} ({resolved_name})", + file=sys.stderr, + ) + if not cik: + raise SystemExit("must supply --cik or --company") + cik = cik.zfill(10) + url = SUBMISSIONS_URL.format(cik=cik) + payload = get_json(url, user_agent=_ua()) + if not isinstance(payload, dict): + raise SystemExit(f"Unexpected EDGAR response shape for CIK {cik}") + name = payload.get("name", "") + recent = (payload.get("filings", {}) or {}).get("recent", {}) or {} + form = recent.get("form", []) + date = recent.get("filingDate", []) + accession = recent.get("accessionNumber", []) + primary_doc = recent.get("primaryDocument", []) + period = recent.get("reportDate", []) + + # Histogram of available filing types — useful for surfacing why a filter + # returned 0 (e.g. user asked for 10-K on an individual Form 4 filer). 
+ type_hist: dict[str, int] = {} + for ftype in form: + type_hist[ftype] = type_hist.get(ftype, 0) + 1 + + type_set = {t.strip().upper() for t in types} if types else None + rows: list[dict[str, str]] = [] + for i, ftype in enumerate(form): + if type_set and ftype.upper() not in type_set: + continue + fdate = date[i] if i < len(date) else "" + if since and fdate and fdate < since: + continue + acc = accession[i] if i < len(accession) else "" + pdoc = primary_doc[i] if i < len(primary_doc) else "" + acc_nodash = acc.replace("-", "") + filing_url = ( + f"https://www.sec.gov/Archives/edgar/data/{int(cik)}/{acc_nodash}/{pdoc}" + if acc and pdoc + else "" + ) + rows.append( + { + "cik": cik, + "company_name": name, + "form_type": ftype, + "filing_date": fdate, + "accession_number": acc, + "primary_document": pdoc, + "filing_url": filing_url, + "reporting_period": period[i] if i < len(period) else "", + } + ) + + Path(out_path).parent.mkdir(parents=True, exist_ok=True) + with open(out_path, "w", newline="", encoding="utf-8") as fh: + w = csv.DictWriter(fh, fieldnames=COLUMNS) + w.writeheader() + w.writerows(rows) + + if not rows and type_hist: + top = sorted(type_hist.items(), key=lambda kv: -kv[1])[:8] + hist_str = ", ".join(f"{t}={n}" for t, n in top) + print( + f"Warning: SEC EDGAR CIK {cik} ({name}) has {sum(type_hist.values())} " + f"recent filings but NONE match types={types}. " + f"Available form types: {hist_str}.", + file=sys.stderr, + ) + # Insider-filer heuristic: only Form 3/4/5 → individual person, not a company. + company_types = {"10-K", "10-Q", "8-K", "20-F", "DEF 14A", "S-1"} + if not (set(type_hist.keys()) & company_types): + print( + f"Note: CIK {cik} appears to be an INDIVIDUAL filer " + f"(insider Form 3/4/5 only), not a corporate registrant. 
" + f"The resolver may have matched an officer/director named " + f"{company!r} rather than a company.", + file=sys.stderr, + ) + return len(rows) + + +def main() -> int: + p = argparse.ArgumentParser(description=__doc__) + p.add_argument("--cik", help="Central Index Key (will be 10-digit zero-padded)") + p.add_argument("--company", help="Resolve to CIK by company name") + p.add_argument("--types", default="", help="Comma-separated form types (e.g. 10-K,10-Q,8-K)") + p.add_argument("--since", help="Skip filings before YYYY-MM-DD") + p.add_argument("--out", required=True) + a = p.parse_args() + types = [t for t in (a.types or "").split(",") if t.strip()] + n = fetch(cik=a.cik, company=a.company, types=types, since=a.since, out_path=a.out) + print(f"Wrote {n} EDGAR filing rows to {a.out}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/optional-skills/research/osint-investigation/scripts/fetch_senate_ld.py b/optional-skills/research/osint-investigation/scripts/fetch_senate_ld.py new file mode 100644 index 00000000000..3119ff8a9a5 --- /dev/null +++ b/optional-skills/research/osint-investigation/scripts/fetch_senate_ld.py @@ -0,0 +1,146 @@ +#!/usr/bin/env python3 +"""Fetch Senate Lobbying Disclosure (LD-1 / LD-2) filings. + +Anonymous: 120 req/hour. Token (SENATE_LDA_TOKEN): 1200 req/hour. 
+""" +from __future__ import annotations + +import argparse +import csv +import os +import sys +import time +from pathlib import Path + +sys.path.insert(0, str(Path(__file__).parent)) +from _http import get_json # noqa: E402 + +ENDPOINT = "https://lda.senate.gov/api/v1/filings/" +COLUMNS = [ + "filing_uuid", + "filing_type", + "filing_year", + "filing_period", + "registrant_name", + "registrant_id", + "client_name", + "client_id", + "client_general_description", + "income", + "expenses", + "lobbyists", + "issues", + "government_entities", + "filing_date", +] + + +def fetch( + client: str | None, + registrant: str | None, + year: int, + token: str | None, + out_path: str, + page_size: int = 100, + max_pages: int = 25, +) -> int: + params: dict = {"filing_year": year, "page_size": page_size} + if client: + params["client_name"] = client + if registrant: + params["registrant_name"] = registrant + + headers = {"Authorization": f"Token {token}"} if token else None + rows: list[dict[str, str]] = [] + url = ENDPOINT + page = 0 + while page < max_pages: + try: + payload = get_json(url, params=params if page == 0 else None, headers=headers) + except Exception as e: # noqa: BLE001 + print(f"Senate LDA error on page {page + 1}: {e}", file=sys.stderr) + break + if not isinstance(payload, dict): + break + results = payload.get("results", []) + for r in results: + client_obj = r.get("client") or {} + registrant_obj = r.get("registrant") or {} + lobbying_activities = r.get("lobbying_activities") or [] + lobbyists = [] + issues = [] + entities = [] + for la in lobbying_activities: + for lob in la.get("lobbyists") or []: + lob_obj = lob.get("lobbyist") or {} + name = " ".join( + x for x in (lob_obj.get("first_name", ""), lob_obj.get("last_name", "")) if x + ) + if name: + lobbyists.append(name) + desc = la.get("description") or "" + if desc: + issues.append(desc) + for ge in la.get("government_entities") or []: + nm = ge.get("name") or "" + if nm: + entities.append(nm) + 
rows.append( + { + "filing_uuid": r.get("filing_uuid", "") or "", + "filing_type": r.get("filing_type", "") or "", + "filing_year": str(r.get("filing_year", "") or year), + "filing_period": r.get("filing_period", "") or "", + "registrant_name": registrant_obj.get("name", "") or "", + "registrant_id": str(registrant_obj.get("id", "") or ""), + "client_name": client_obj.get("name", "") or "", + "client_id": str(client_obj.get("id", "") or ""), + "client_general_description": client_obj.get("general_description", "") or "", + "income": str(r.get("income", "") or ""), + "expenses": str(r.get("expenses", "") or ""), + "lobbyists": "; ".join(sorted(set(lobbyists))), + "issues": "; ".join(issues), + "government_entities": "; ".join(sorted(set(entities))), + "filing_date": (r.get("dt_posted") or "")[:10], + } + ) + next_url = payload.get("next") + if not next_url: + break + url = next_url + page += 1 + time.sleep(1.0 if not token else 0.3) + + Path(out_path).parent.mkdir(parents=True, exist_ok=True) + with open(out_path, "w", newline="", encoding="utf-8") as fh: + w = csv.DictWriter(fh, fieldnames=COLUMNS) + w.writeheader() + w.writerows(rows) + return len(rows) + + +def main() -> int: + p = argparse.ArgumentParser(description=__doc__) + p.add_argument("--client", help="Client name filter") + p.add_argument("--registrant", help="Registrant (lobbying firm) name filter") + p.add_argument("--year", type=int, default=2024) + p.add_argument("--token", default=os.environ.get("SENATE_LDA_TOKEN")) + p.add_argument("--max-pages", type=int, default=25) + p.add_argument("--out", required=True) + a = p.parse_args() + if not (a.client or a.registrant): + p.error("must supply at least one of --client / --registrant") + n = fetch( + client=a.client, + registrant=a.registrant, + year=a.year, + token=a.token, + out_path=a.out, + max_pages=a.max_pages, + ) + print(f"Wrote {n} Senate LDA rows to {a.out}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git 
a/optional-skills/research/osint-investigation/scripts/fetch_usaspending.py b/optional-skills/research/osint-investigation/scripts/fetch_usaspending.py new file mode 100644 index 00000000000..a59c5f17276 --- /dev/null +++ b/optional-skills/research/osint-investigation/scripts/fetch_usaspending.py @@ -0,0 +1,170 @@ +#!/usr/bin/env python3 +"""Fetch federal contracts/awards from USAspending.gov API v2. + +No auth required. POST to /api/v2/search/spending_by_award/ with filters. +""" +from __future__ import annotations + +import argparse +import csv +import json +import sys +import time +import urllib.request +from pathlib import Path + +ENDPOINT = "https://api.usaspending.gov/api/v2/search/spending_by_award/" +COLUMNS = [ + "award_id", + "recipient_name", + "recipient_uei", + "recipient_duns", + "recipient_parent_name", + "recipient_state", + "awarding_agency", + "awarding_sub_agency", + "award_type", + "award_amount", + "award_date", + "period_of_performance_start", + "period_of_performance_end", + "naics_code", + "psc_code", + "competition_extent", + "description", +] + +# Field labels requested from the API; fetch() reads each result by these labels and maps them onto COLUMNS.
+_FIELDS = [ + "Award ID", + "Recipient Name", + "Recipient UEI", + "Recipient DUNS Number", + "Recipient Parent Name", + "Recipient State Code", + "Awarding Agency", + "Awarding Sub Agency", + "Award Type", + "Award Amount", + "Start Date", + "End Date", + "NAICS Code", + "PSC Code", + "Type of Set Aside", + "Description", +] + + +def _post(body: dict) -> dict: + req = urllib.request.Request( + ENDPOINT, + data=json.dumps(body).encode("utf-8"), + headers={"Content-Type": "application/json", "User-Agent": "hermes-agent osint-investigation"}, + method="POST", + ) + with urllib.request.urlopen(req, timeout=60) as resp: + return json.loads(resp.read().decode("utf-8")) + + +def fetch( + recipient: str | None, + agency: str | None, + fy: int, + sole_source_only: bool, + out_path: str, + page_size: int = 100, + max_pages: int = 20, +) -> int: + filters: dict = { + "time_period": [{"start_date": f"{fy - 1}-10-01", "end_date": f"{fy}-09-30"}], + # Contracts only by default; adjust award_type_codes for grants/loans. 
+ "award_type_codes": ["A", "B", "C", "D"], + } + if recipient: + filters["recipient_search_text"] = [recipient] + if agency: + filters["agencies"] = [{"type": "awarding", "tier": "toptier", "name": agency}] + + rows: list[dict[str, str]] = [] + page = 1 + while page <= max_pages: + body = { + "filters": filters, + "fields": _FIELDS, + "page": page, + "limit": page_size, + "sort": "Award Amount", + "order": "desc", + } + try: + payload = _post(body) + except Exception as e: # noqa: BLE001 + print(f"USAspending error on page {page}: {e}", file=sys.stderr) + break + results = payload.get("results", []) + if not results: + break + for r in results: + # Heuristic: "Type of Set Aside" stands in for competition extent here, so + # --sole-source-only keeps only rows whose set-aside label mentions "sole". + # The filter runs client-side, after each page has been fetched. + set_aside = r.get("Type of Set Aside", "") or "" + if sole_source_only and "sole" not in set_aside.lower(): + continue + rows.append( + { + "award_id": r.get("Award ID", "") or "", + "recipient_name": r.get("Recipient Name", "") or "", + "recipient_uei": r.get("Recipient UEI", "") or "", + "recipient_duns": r.get("Recipient DUNS Number", "") or "", + "recipient_parent_name": r.get("Recipient Parent Name", "") or "", + "recipient_state": r.get("Recipient State Code", "") or "", + "awarding_agency": r.get("Awarding Agency", "") or "", + "awarding_sub_agency": r.get("Awarding Sub Agency", "") or "", + "award_type": r.get("Award Type", "") or "", + "award_amount": str(r.get("Award Amount", "") or ""), + "award_date": r.get("Start Date", "") or "", # no separate award date is requested; reuses Start Date + "period_of_performance_start": r.get("Start Date", "") or "", + "period_of_performance_end": r.get("End Date", "") or "", + "naics_code": str(r.get("NAICS Code", "") or ""), + "psc_code": str(r.get("PSC Code", "") or ""), + "competition_extent": set_aside, + "description": r.get("Description", "") or "", + } + ) + meta = payload.get("page_metadata", {}) + if not meta.get("hasNext"): + break + page += 1 + time.sleep(0.5) + + Path(out_path).parent.mkdir(parents=True, exist_ok=True) + with open(out_path, "w", newline="", encoding="utf-8") as fh: + w = csv.DictWriter(fh,
fieldnames=COLUMNS) + w.writeheader() + w.writerows(rows) + return len(rows) + + +def main() -> int: + p = argparse.ArgumentParser(description=__doc__) + p.add_argument("--recipient", help="Recipient name search") + p.add_argument("--agency", help="Awarding agency (top-tier)") + p.add_argument("--fy", type=int, default=2024, help="Federal fiscal year") + p.add_argument("--sole-source-only", action="store_true") + p.add_argument("--max-pages", type=int, default=20) + p.add_argument("--out", required=True) + a = p.parse_args() + if not (a.recipient or a.agency): + p.error("must supply at least one of --recipient / --agency") + n = fetch( + recipient=a.recipient, + agency=a.agency, + fy=a.fy, + sole_source_only=a.sole_source_only, + out_path=a.out, + max_pages=a.max_pages, + ) + print(f"Wrote {n} USAspending rows to {a.out}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/optional-skills/research/osint-investigation/scripts/fetch_wayback.py b/optional-skills/research/osint-investigation/scripts/fetch_wayback.py new file mode 100644 index 00000000000..fb9147f22c2 --- /dev/null +++ b/optional-skills/research/osint-investigation/scripts/fetch_wayback.py @@ -0,0 +1,142 @@ +#!/usr/bin/env python3 +"""Search the Internet Archive Wayback Machine via the CDX server. + +The CDX API indexes ~900B+ archived web pages. Anonymous read access, +no auth required. Useful for finding deleted / changed pages by URL, +domain, or substring match. 
+""" +from __future__ import annotations + +import argparse +import csv +import sys +import urllib.parse +from pathlib import Path + +sys.path.insert(0, str(Path(__file__).parent)) +from _http import get_json # noqa: E402 + +BASE = "https://web.archive.org/cdx/search/cdx" + +COLUMNS = [ + "url", + "timestamp", + "wayback_url", + "mimetype", + "status", + "digest", + "length", +] + + +def fetch( + url_or_host: str, + match_type: str, + from_date: str | None, + to_date: str | None, + status: str | None, + mime: str | None, + collapse: str | None, + limit: int, + out_path: str, +) -> int: + params: dict[str, str | list[str]] = { + "url": url_or_host, + "matchType": match_type, + "output": "json", + "limit": str(limit), + } + if from_date: + params["from"] = from_date.replace("-", "") + if to_date: + params["to"] = to_date.replace("-", "") + # CDX accepts repeated filter params. Collect them in a list and let + # urlencode(doseq=True) emit one filter=... pair per entry, so that + # --status and --mime can be combined instead of overwriting each other. + filters: list[str] = [] + if status: + filters.append(f"statuscode:{status}") + if mime: + filters.append(f"mimetype:{mime}") + if filters: + params["filter"] = filters + if collapse: + params["collapse"] = collapse + + url = f"{BASE}?{urllib.parse.urlencode(params, doseq=True)}" + try: + payload = get_json(url) + except Exception as e: # noqa: BLE001 + print(f"Wayback CDX error: {e}", file=sys.stderr) + payload = [] + + rows: list[dict[str, str]] = [] + if isinstance(payload, list) and len(payload) > 1: + header = payload[0] + idx = {h: i for i, h in enumerate(header)} + for entry in payload[1:]: + ts = entry[idx["timestamp"]] if "timestamp" in idx else "" + orig = entry[idx["original"]] if "original" in idx else "" + rows.append( + { + "url": orig, + "timestamp": ts, + "wayback_url": f"https://web.archive.org/web/{ts}/{orig}" if ts and orig else "", + "mimetype": entry[idx["mimetype"]] if "mimetype" in idx else "", + "status": entry[idx["statuscode"]] if "statuscode" in idx else "", + "digest": entry[idx["digest"]] if "digest" in idx else "", + "length": entry[idx["length"]] if "length" in idx else "", + }
+ ) + + Path(out_path).parent.mkdir(parents=True, exist_ok=True) + with open(out_path, "w", newline="", encoding="utf-8") as fh: + w = csv.DictWriter(fh, fieldnames=COLUMNS) + w.writeheader() + w.writerows(rows) + if not rows: + print( + f"Wayback Machine: 0 captures for {url_or_host!r} matchType={match_type}.", + file=sys.stderr, + ) + return len(rows) + + +def main() -> int: + p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + p.add_argument("--url", required=True, help="URL or host to look up in the archive") + p.add_argument( + "--match", + default="exact", + choices=["exact", "prefix", "host", "domain"], + help=( + "exact: this URL only. " + "prefix: this URL's path-prefix. " + "host: any URL on this host. " + "domain: any URL on this domain or subdomains." + ), + ) + p.add_argument("--from-date", help="Earliest capture YYYY-MM-DD") + p.add_argument("--to-date", help="Latest capture YYYY-MM-DD") + p.add_argument("--status", help="HTTP status filter (e.g. 200)") + p.add_argument("--mime", help="MIME type filter (e.g. text/html)") + p.add_argument( + "--collapse", + help="Collapse adjacent identical entries (e.g. 
'digest' for unique-content captures)", + ) + p.add_argument("--limit", type=int, default=200) + p.add_argument("--out", required=True) + a = p.parse_args() + n = fetch( + url_or_host=a.url, + match_type=a.match, + from_date=a.from_date, + to_date=a.to_date, + status=a.status, + mime=a.mime, + collapse=a.collapse, + limit=a.limit, + out_path=a.out, + ) + print(f"Wrote {n} Wayback capture rows to {a.out}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/optional-skills/research/osint-investigation/scripts/fetch_wikipedia.py b/optional-skills/research/osint-investigation/scripts/fetch_wikipedia.py new file mode 100644 index 00000000000..4ce5c93813c --- /dev/null +++ b/optional-skills/research/osint-investigation/scripts/fetch_wikipedia.py @@ -0,0 +1,267 @@ +#!/usr/bin/env python3 +"""Search Wikipedia + Wikidata for an entity (person, company, place, concept). + +Two free APIs: + - Wikipedia OpenSearch + REST summary endpoint for narrative bio + - Wikidata Action API (wbgetentities) for structured facts (birth, employer, occupation, etc.) + +Both are anonymous-access. Useful for resolving who-is-this-entity questions +and surfacing cross-references that other sources can join against.
+""" +from __future__ import annotations + +import argparse +import csv +import json +import re +import sys +import urllib.parse +from pathlib import Path + +sys.path.insert(0, str(Path(__file__).parent)) +from _http import get_json # noqa: E402 + +WP_OPENSEARCH = "https://en.wikipedia.org/w/api.php" +WP_SUMMARY = "https://en.wikipedia.org/api/rest_v1/page/summary/" +WD_ACTION = "https://www.wikidata.org/w/api.php" + +COLUMNS = [ + "source", + "label", + "description", + "qid", + "wikipedia_title", + "wikipedia_url", + "wikidata_url", + "instance_of", + "country", + "occupation", + "employer", + "date_of_birth", + "place_of_birth", + "summary", +] + + +def _wp_search(query: str, limit: int) -> list[dict]: + params = { + "action": "opensearch", + "search": query, + "limit": str(min(limit, 20)), + "format": "json", + } + url = f"{WP_OPENSEARCH}?{urllib.parse.urlencode(params)}" + data = get_json(url) + if not isinstance(data, list) or len(data) < 4: + return [] + titles, descs, urls = data[1], data[2], data[3] + out = [] + for i, title in enumerate(titles): + out.append( + { + "title": title, + "description": descs[i] if i < len(descs) else "", + "url": urls[i] if i < len(urls) else "", + } + ) + return out + + +def _wp_summary(title: str) -> dict: + """Pull the REST summary for a title — short bio, image, type.""" + url = f"{WP_SUMMARY}{urllib.parse.quote(title.replace(' ', '_'))}" + try: + return get_json(url) # type: ignore[return-value] + except Exception as e: # noqa: BLE001 + print(f"Wikipedia summary lookup for {title!r} failed: {e}", file=sys.stderr) + return {} + + +def _wd_lookup_by_qid(qid: str) -> dict: + """Pull common facts for a QID via Wikidata's Action API (no SPARQL). + + The Action API is far more lenient on rate-limits than the SPARQL Query + Service. We get claims as QIDs and then resolve labels in one batch call. + """ + # Properties of interest. The Action API returns claims as QIDs or + # typed literals, so the slot mapping is local-only. 
+ interesting = { + "P31": "instance_of", + "P17": "country", # for orgs / places + "P27": "country", # for individuals (country of citizenship) + "P106": "occupation", + "P108": "employer", + "P569": "date_of_birth", + "P19": "place_of_birth", + } + params = { + "action": "wbgetentities", + "ids": qid, + "props": "claims", + "format": "json", + } + url = f"{WD_ACTION}?{urllib.parse.urlencode(params)}" + try: + data = get_json(url) + except Exception as e: # noqa: BLE001 + print(f"Wikidata wbgetentities for {qid} failed: {e}", file=sys.stderr) + return {} + if not isinstance(data, dict): + return {} + claims = (data.get("entities", {}).get(qid, {}) or {}).get("claims", {}) or {} + + # Collect raw values (QIDs or literals) and remember which slot each + # came from. Date literals come back as ISO strings; QIDs need a label + # resolution pass. + qid_to_slots: dict[str, list[str]] = {} + facts: dict[str, list[str]] = {} + for prop_id, slot in interesting.items(): + for claim in claims.get(prop_id, []) or []: + v = (claim.get("mainsnak", {}) or {}).get("datavalue", {}) or {} + vtype = v.get("type") + value = v.get("value") + if vtype == "wikibase-entityid" and isinstance(value, dict): + vqid = value.get("id", "") + if vqid: + qid_to_slots.setdefault(vqid, []) + if slot not in qid_to_slots[vqid]: + qid_to_slots[vqid].append(slot) + elif vtype == "time" and isinstance(value, dict): + raw = value.get("time", "") or "" + # +1955-10-28T00:00:00Z → 1955-10-28 + m = re.search(r"[+-]?(\d{4})-(\d{2})-(\d{2})", raw) + if m: + facts.setdefault(slot, []).append( + f"{m.group(1)}-{m.group(2)}-{m.group(3)}" + ) + elif vtype == "string": + facts.setdefault(slot, []).append(str(value)) + + # Resolve labels for all referenced QIDs in one batch (up to 50 at a time). 
+ qids = list(qid_to_slots) + for i in range(0, len(qids), 50): + batch = qids[i : i + 50] + params = { + "action": "wbgetentities", + "ids": "|".join(batch), + "props": "labels", + "languages": "en", + "format": "json", + } + url = f"{WD_ACTION}?{urllib.parse.urlencode(params)}" + try: + data = get_json(url) + except Exception as e: # noqa: BLE001 + print(f"Wikidata label batch failed: {e}", file=sys.stderr) + continue + if not isinstance(data, dict): + continue + ents = data.get("entities", {}) or {} + for vqid, ent in ents.items(): + label = (ent.get("labels", {}).get("en", {}) or {}).get("value", "") or vqid + for slot in qid_to_slots.get(vqid, []): + facts.setdefault(slot, []).append(label) + + # Deduplicate per slot, preserving order. + deduped: dict[str, list[str]] = {} + for slot, vals in facts.items(): + seen = set() + out = [] + for v in vals: + if v in seen: + continue + seen.add(v) + out.append(v) + deduped[slot] = out + return deduped + + +def _wd_qid_for_title(title: str) -> str: + """Get the Wikidata QID associated with a Wikipedia article title.""" + params = { + "action": "query", + "format": "json", + "prop": "pageprops", + "ppprop": "wikibase_item", + "titles": title, + "redirects": 1, + } + url = f"{WP_OPENSEARCH}?{urllib.parse.urlencode(params)}" + try: + data = get_json(url) + except Exception: # noqa: BLE001 + return "" + if not isinstance(data, dict): + return "" + pages = data.get("query", {}).get("pages", {}) or {} + for page in pages.values(): + qid = (page.get("pageprops") or {}).get("wikibase_item", "") + if qid: + return qid + return "" + + +def fetch(query: str, limit: int, no_wikidata: bool, out_path: str) -> int: + hits = _wp_search(query, limit) + rows: list[dict[str, str]] = [] + for hit in hits[:limit]: + title = hit.get("title", "") + if not title: + continue + summary = _wp_summary(title) + qid = _wd_qid_for_title(title) if not no_wikidata else "" + facts: dict = {} + if qid: + facts = _wd_lookup_by_qid(qid) + rows.append( + { 
+ "source": "wikipedia+wikidata" if qid else "wikipedia", + "label": title, + "description": (summary.get("description") or hit.get("description") or "").strip(), + "qid": qid, + "wikipedia_title": title, + "wikipedia_url": hit.get("url", ""), + "wikidata_url": f"https://www.wikidata.org/wiki/{qid}" if qid else "", + "instance_of": "; ".join(facts.get("instance_of", [])), + "country": "; ".join(facts.get("country", [])), + "occupation": "; ".join(facts.get("occupation", [])), + "employer": "; ".join(facts.get("employer", [])), + "date_of_birth": "; ".join(facts.get("date_of_birth", []))[:10] if facts.get("date_of_birth") else "", + "place_of_birth": "; ".join(facts.get("place_of_birth", [])), + "summary": (summary.get("extract") or "").replace("\n", " ")[:1000], + } + ) + + Path(out_path).parent.mkdir(parents=True, exist_ok=True) + with open(out_path, "w", newline="", encoding="utf-8") as fh: + w = csv.DictWriter(fh, fieldnames=COLUMNS) + w.writeheader() + w.writerows(rows) + if not rows: + print( + f"Wikipedia: 0 articles for query={query!r}. 
" + "Private individuals not notable enough for a Wikipedia article " + "won't appear here (the bar is real).", + file=sys.stderr, + ) + return len(rows) + + +def main() -> int: + p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + p.add_argument("--query", required=True, help="Entity name (person, company, place, concept)") + p.add_argument("--limit", type=int, default=5) + p.add_argument( + "--no-wikidata", + action="store_true", + help="Skip the Wikidata Action API enrichment (faster, less detail)", + ) + p.add_argument("--out", required=True) + a = p.parse_args() + n = fetch(query=a.query, limit=a.limit, no_wikidata=a.no_wikidata, out_path=a.out) + print(f"Wrote {n} Wikipedia/Wikidata rows to {a.out}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/optional-skills/research/osint-investigation/scripts/timing_analysis.py b/optional-skills/research/osint-investigation/scripts/timing_analysis.py new file mode 100644 index 00000000000..4e0ece227b4 --- /dev/null +++ b/optional-skills/research/osint-investigation/scripts/timing_analysis.py @@ -0,0 +1,253 @@ +#!/usr/bin/env python3 +"""Permutation test for donation/contract timing correlation (stdlib-only). + +For each (donor, vendor) pair, compute the mean number of days between each +donation and the nearest contract award. Then, N times, redraw the award dates +uniformly at random within the observation window and recompute the statistic. +The one-tailed p-value is the fraction of permutations whose mean is <= the +observed mean (smaller distance = tighter clustering). + +Adapted from ShinMegamiBoson/OpenPlanter (MIT). 
Differences: + - Pure stdlib (no pandas / numpy) + - Domain-agnostic (no snow-vendor / CRITICAL-politician filter) + - Configurable column names via flags + - Optional --seed for reproducibility +""" +from __future__ import annotations + +import argparse +import csv +import datetime as dt +import json +import math +import random +import statistics +from collections import defaultdict +from pathlib import Path + +_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y/%m/%d", "%m-%d-%Y", "%Y%m%d") + + +def parse_date(raw: str) -> dt.date | None: + if not raw: + return None + raw = raw.strip() + for fmt in _DATE_FORMATS: + try: + return dt.datetime.strptime(raw, fmt).date() + except ValueError: + continue + return None + + +def _read(path: str) -> list[dict[str, str]]: + with open(path, newline="", encoding="utf-8") as fh: + return list(csv.DictReader(fh)) + + +def _nearest_distance(donation_date: dt.date, awards: list[dt.date]) -> int: + """Absolute days to nearest award date.""" + return min(abs((donation_date - a).days) for a in awards) + + +def _permute( + awards_count: int, + donations: list[dt.date], + date_min: dt.date, + date_max: dt.date, + rng: random.Random, +) -> float: + """One permutation: draw uniform random award dates, compute mean nearest-distance.""" + span_days = (date_max - date_min).days or 1 + rand_awards = [ + date_min + dt.timedelta(days=rng.randint(0, span_days)) + for _ in range(awards_count) + ] + distances = [_nearest_distance(d, rand_awards) for d in donations] + return statistics.mean(distances) + + +def analyze( + donations_path: str, + donation_date_col: str, + donation_amount_col: str, + donation_donor_col: str, + donation_recipient_col: str, + contracts_path: str, + contract_date_col: str, + contract_vendor_col: str, + cross_links_path: str | None, + n_permutations: int = 1000, + min_donations: int = 3, + p_threshold: float = 0.05, + seed: int | None = None, + out_path: str = "timing.json", +) -> dict: + rng = random.Random(seed) + + donations 
= _read(donations_path) + contracts = _read(contracts_path) + + # Allow optional join through cross_links — donor (left) ↔ vendor (right). + # When present, donor strings get mapped to matched vendor names so the + # vendor-date index lookup actually finds the contracts. + matched_pairs: set[tuple[str, str]] | None = None + donor_to_vendors: dict[str, set[str]] = defaultdict(set) + if cross_links_path: + matched_pairs = set() + for row in _read(cross_links_path): + left = row.get("left_name", "") + right = row.get("right_name", "") + matched_pairs.add((left, right)) + donor_to_vendors[left].add(right) + + # Index contract dates by vendor name. + vendor_to_award_dates: dict[str, list[dt.date]] = defaultdict(list) + all_award_dates: list[dt.date] = [] + for row in contracts: + d = parse_date(row.get(contract_date_col, "")) + if not d: + continue + vendor_to_award_dates[row.get(contract_vendor_col, "").strip()].append(d) + all_award_dates.append(d) + + if not all_award_dates: + raise SystemExit(f"No parseable dates in {contracts_path}/{contract_date_col}") + global_min = min(all_award_dates) + global_max = max(all_award_dates) + + # Group donations by (donor, recipient). + grouped: dict[tuple[str, str], list[tuple[dt.date, float]]] = defaultdict(list) + for row in donations: + donor = row.get(donation_donor_col, "").strip() + recip = row.get(donation_recipient_col, "").strip() + d = parse_date(row.get(donation_date_col, "")) + try: + amt = float(row.get(donation_amount_col, "0") or 0) + except ValueError: + amt = 0.0 + if not (donor and recip and d): + continue + grouped[(donor, recip)].append((d, amt)) + + results = [] + skipped = 0 + for (donor, recip), records in grouped.items(): + if len(records) < min_donations: + skipped += 1 + continue + # Only test if donor appears in cross-links (when provided). The + # (donor, candidate) tuple itself is NOT what's in matched_pairs — + # cross_links pairs are (donor, vendor). 
We use the cross-link to + # map donor → vendor name(s) so the vendor-date index resolves. + if matched_pairs is not None and donor not in donor_to_vendors: + skipped += 1 + continue + # Try direct donor→awards first, then go through cross-link vendor names. + award_dates = list(vendor_to_award_dates.get(donor, [])) + if not award_dates: + award_dates = list(vendor_to_award_dates.get(recip, [])) + if not award_dates and donor_to_vendors.get(donor): + for vendor_name in donor_to_vendors[donor]: + award_dates.extend(vendor_to_award_dates.get(vendor_name, [])) + if not award_dates: + skipped += 1 + continue + + donation_dates = [d for (d, _) in records] + observed = statistics.mean( + _nearest_distance(d, award_dates) for d in donation_dates + ) + + permuted_means = [ + _permute(len(award_dates), donation_dates, global_min, global_max, rng) + for _ in range(n_permutations) + ] + p_value = sum(1 for m in permuted_means if m <= observed) / n_permutations + null_mean = statistics.mean(permuted_means) + null_std = statistics.pstdev(permuted_means) or 1.0 + effect_size = (null_mean - observed) / null_std + + results.append( + { + "donor": donor, + "recipient": recip, + "n_donations": len(records), + "n_award_dates": len(award_dates), + "observed_mean_days": round(observed, 2), + "null_mean_days": round(null_mean, 2), + "p_value": round(p_value, 4), + "effect_size_sd": round(effect_size, 2), + "significant": p_value < p_threshold, + "total_donation_amount": round(sum(a for (_, a) in records), 2), + } + ) + + results.sort(key=lambda r: r["p_value"]) + + payload = { + "metadata": { + "n_permutations": n_permutations, + "min_donations": min_donations, + "p_threshold": p_threshold, + "seed": seed, + "n_pairs_tested": len(results), + "n_pairs_skipped": skipped, + "n_significant": sum(1 for r in results if r["significant"]), + "observation_window": [global_min.isoformat(), global_max.isoformat()], + }, + "results": results, + } + + Path(out_path).write_text(json.dumps(payload, 
indent=2)) + return payload + + +def main() -> int: + p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter) + p.add_argument("--donations", required=True) + p.add_argument("--donation-date-col", required=True) + p.add_argument("--donation-amount-col", required=True) + p.add_argument("--donation-donor-col", required=True) + p.add_argument("--donation-recipient-col", required=True) + p.add_argument("--contracts", required=True) + p.add_argument("--contract-date-col", required=True) + p.add_argument("--contract-vendor-col", required=True) + p.add_argument( + "--cross-links", + help="Optional cross_links.csv to restrict (donor, vendor) pairs", + ) + p.add_argument("--permutations", type=int, default=1000) + p.add_argument("--min-donations", type=int, default=3) + p.add_argument("--p-threshold", type=float, default=0.05) + p.add_argument("--seed", type=int) + p.add_argument("--out", default="timing.json") + a = p.parse_args() + + payload = analyze( + donations_path=a.donations, + donation_date_col=a.donation_date_col, + donation_amount_col=a.donation_amount_col, + donation_donor_col=a.donation_donor_col, + donation_recipient_col=a.donation_recipient_col, + contracts_path=a.contracts, + contract_date_col=a.contract_date_col, + contract_vendor_col=a.contract_vendor_col, + cross_links_path=a.cross_links, + n_permutations=a.permutations, + min_donations=a.min_donations, + p_threshold=a.p_threshold, + seed=a.seed, + out_path=a.out, + ) + meta = payload["metadata"] + print( + f"Tested {meta['n_pairs_tested']} pairs ({meta['n_pairs_skipped']} skipped). " + f"Significant (p<{meta['p_threshold']}): {meta['n_significant']}. 
" + f"Wrote {a.out}" + ) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/optional-skills/research/osint-investigation/templates/source-template.md b/optional-skills/research/osint-investigation/templates/source-template.md new file mode 100644 index 00000000000..b023cc26888 --- /dev/null +++ b/optional-skills/research/osint-investigation/templates/source-template.md @@ -0,0 +1,59 @@ +# + +## 1. Summary + +What this data source is, who publishes it, why it matters for investigations. + +## 2. Access Methods + +- API endpoint(s) +- Bulk download URLs +- Auth requirements (none / API key / OAuth) +- Rate limits + +## 3. Data Schema + +Key fields, record types, table relationships. List the columns the fetch +script emits. + +## 4. Coverage + +- Jurisdiction +- Time range +- Update frequency +- Data volume (rows / GB) + +## 5. Cross-Reference Potential + +Which other sources can be joined and on what keys. Be explicit: + +- `` ↔ `` (join key: ) + +## 6. Data Quality + +Known issues — formatting inconsistencies, missing fields, duplicates, +historical gaps, redaction. + +## 7. Acquisition Script + +Path: `scripts/fetch_.py` + +Example: + +```bash +python3 SKILL_DIR/scripts/fetch_.py -- --out data/.csv +``` + +Output CSV columns: `, , ...` + +## 8. Legal & Licensing + +- Public records law / FOIA basis +- Terms of use / acceptable use +- Attribution requirements (if any) + +## 9. References + +- Official docs: +- Data dictionary: +- Related coverage / journalism: diff --git a/website/docs/reference/optional-skills-catalog.md b/website/docs/reference/optional-skills-catalog.md index d1544ce89b9..ce1861431a6 100644 --- a/website/docs/reference/optional-skills-catalog.md +++ b/website/docs/reference/optional-skills-catalog.md @@ -167,6 +167,7 @@ hermes skills uninstall | [**drug-discovery**](/docs/user-guide/skills/optional/research/research-drug-discovery) | Pharmaceutical research assistant for drug discovery workflows. 
Search bioactive compounds on ChEMBL, calculate drug-likeness (Lipinski Ro5, QED, TPSA, synthetic accessibility), look up drug-drug interactions via OpenFDA, interpret ADMET... | | [**duckduckgo-search**](/docs/user-guide/skills/optional/research/research-duckduckgo-search) | Free web search via DuckDuckGo — text, news, images, videos. No API key needed. Prefer the `ddgs` CLI when installed; use the Python DDGS library only after verifying that `ddgs` is available in the current runtime. | | [**gitnexus-explorer**](/docs/user-guide/skills/optional/research/research-gitnexus-explorer) | Index a codebase with GitNexus and serve an interactive knowledge graph via web UI + Cloudflare tunnel. | +| [**osint-investigation**](/docs/user-guide/skills/optional/research/research-osint-investigation) | Public-records OSINT investigation framework — SEC EDGAR filings, USAspending contracts, Senate lobbying, OFAC sanctions, ICIJ offshore leaks, NYC property records (ACRIS), OpenCorporates registries, CourtListener court records, Wayback... | | [**parallel-cli**](/docs/user-guide/skills/optional/research/research-parallel-cli) | Optional vendor skill for Parallel CLI — agent-native web search, extraction, deep research, enrichment, FindAll, and monitoring. Prefer JSON output and non-interactive flows. | | [**qmd**](/docs/user-guide/skills/optional/research/research-qmd) | Search personal knowledge bases, notes, docs, and meeting transcripts locally using qmd — a hybrid retrieval engine with BM25, vector search, and LLM reranking. Supports CLI and MCP integration. | | [**scrapling**](/docs/user-guide/skills/optional/research/research-scrapling) | Web scraping with Scrapling - HTTP fetching, stealth browser automation, Cloudflare bypass, and spider crawling via CLI and Python. 
| diff --git a/website/docs/user-guide/skills/optional/research/research-osint-investigation.md b/website/docs/user-guide/skills/optional/research/research-osint-investigation.md new file mode 100644 index 00000000000..7428c3022b2 --- /dev/null +++ b/website/docs/user-guide/skills/optional/research/research-osint-investigation.md @@ -0,0 +1,294 @@ +--- +title: "Osint Investigation" +sidebar_label: "Osint Investigation" +description: "Public-records OSINT investigation framework — SEC EDGAR filings, USAspending contracts, Senate lobbying, OFAC sanctions, ICIJ offshore leaks, NYC property r..." +--- + +{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */} + +# Osint Investigation + +Public-records OSINT investigation framework — SEC EDGAR filings, USAspending contracts, Senate lobbying, OFAC sanctions, ICIJ offshore leaks, NYC property records (ACRIS), OpenCorporates registries, CourtListener court records, Wayback Machine archives, Wikipedia + Wikidata, GDELT news monitoring. Entity resolution across sources, cross-link analysis, timing correlation, evidence chains. Python stdlib only. + +## Skill metadata + +| | | +|---|---| +| Source | Optional — install with `hermes skills install official/research/osint-investigation` | +| Path | `optional-skills/research/osint-investigation` | +| Version | `0.1.0` | +| Author | Hermes Agent (adapted from ShinMegamiBoson/OpenPlanter, MIT) | +| Platforms | linux, macos, windows | +| Tags | `osint`, `investigation`, `public-records`, `sec`, `sanctions`, `corporate-registry`, `property`, `courts`, `due-diligence`, `journalism` | +| Related skills | [`domain-intel`](/docs/user-guide/skills/optional/research/research-domain-intel), [`arxiv`](/docs/user-guide/skills/bundled/research/research-arxiv) | + +## Reference: full SKILL.md + +:::info +The following is the complete skill definition that Hermes loads when this skill is triggered. 
This is what the agent sees as instructions when the skill is active. +::: + +# OSINT Investigation — Public Records Cross-Reference + +Investigative framework for public-records OSINT: government contracts, +corporate filings, lobbying, sanctions, offshore leaks, property records, +court records, web archives, knowledge bases, and global news. Resolve +entities across heterogeneous sources, build cross-links with explicit +confidence, run statistical timing tests, and produce structured evidence +chains. + +**Python stdlib only.** Zero install. Works on Linux, macOS, Windows. Most +sources work with no API key (OpenCorporates has an optional free token +that raises rate limits). + +Adapted from the MIT-licensed ShinMegamiBoson/OpenPlanter project; expanded +to cover identity / property / litigation / archives / news sources that +the original didn't address. + +## When to use this skill + +Use when the user asks for: + +- "follow the money" — government contracts, lobbying → legislation, sanctions +- corporate due diligence — who controls company X, where are they + incorporated, who serves on their boards, what filings have they made +- sanctions screening — is entity X on OFAC SDN, ICIJ offshore leaks +- pay-to-play investigation — contractors with offshore ties, lobbying + clients winning awards +- property ownership — find recorded deeds/mortgages by name or address + (NYC; for other counties point users at the relevant recorder) +- litigation history — find federal + state court opinions and PACER dockets +- multi-source entity resolution where naming varies (LLC suffixes, abbreviations) +- evidence-chain construction with explicit confidence levels +- "what's been said about X" — international news (GDELT) + Wikipedia + narrative + Wayback Machine to recover dead URLs + +Do NOT use this skill for: + +- general web research → `web_search` / `web_extract` +- domain/infrastructure OSINT → `domain-intel` skill +- academic literature → `arxiv` skill +- 
social-media profile discovery → `sherlock` skill (optional) +- US **federal** campaign finance — FEC is intentionally NOT covered here + (the API is unreliable for ad-hoc contributor-name queries on the free + DEMO_KEY tier). For federal donations, point users at + https://www.fec.gov/data/ directly. + +## Workflow + +The agent runs scripts via the `terminal` tool. `SKILL_DIR` is the directory +holding this SKILL.md. + +### 1. Identify which sources apply + +Read the data-source wiki entries to plan the investigation: + +``` +ls SKILL_DIR/references/sources/ + +# Federal financial / regulatory +cat SKILL_DIR/references/sources/sec-edgar.md # corporate filings +cat SKILL_DIR/references/sources/usaspending.md # federal contracts +cat SKILL_DIR/references/sources/senate-ld.md # lobbying +cat SKILL_DIR/references/sources/ofac-sdn.md # sanctions +cat SKILL_DIR/references/sources/icij-offshore.md # offshore leaks + +# Identity / property / litigation / archives / news +cat SKILL_DIR/references/sources/nyc-acris.md # NYC property records +cat SKILL_DIR/references/sources/opencorporates.md # global corporate registry +cat SKILL_DIR/references/sources/courtlistener.md # court records (federal + state) +cat SKILL_DIR/references/sources/wayback.md # Wayback Machine archives +cat SKILL_DIR/references/sources/wikipedia.md # Wikipedia + Wikidata +cat SKILL_DIR/references/sources/gdelt.md # global news monitoring +``` + +Each entry follows a 9-section template: summary, access, schema, coverage, +cross-reference keys, data quality, acquisition, legal, references. + +The **cross-reference potential** section maps join keys between sources — read +those first to pick the right pair. + +### 2. 
Acquire data + +Each source has a stdlib-only fetch script in `SKILL_DIR/scripts/`: + +**Federal financial / regulatory** + +```bash +# SEC EDGAR filings (corporate disclosures) +python3 SKILL_DIR/scripts/fetch_sec_edgar.py --cik 0000320193 \ + --types 10-K,10-Q --out data/edgar_filings.csv + +# USAspending federal contracts +python3 SKILL_DIR/scripts/fetch_usaspending.py --recipient "EXAMPLE CORP" \ + --fy 2024 --out data/contracts.csv + +# Senate LD-1 / LD-2 lobbying disclosures +python3 SKILL_DIR/scripts/fetch_senate_ld.py --client "EXAMPLE CORP" \ + --year 2024 --out data/lobbying.csv + +# OFAC SDN sanctions list (full snapshot) +python3 SKILL_DIR/scripts/fetch_ofac_sdn.py --out data/ofac_sdn.csv + +# ICIJ Offshore Leaks — downloads ~70 MB bulk CSV on first use, +# then searches it locally. Cached for 30 days under +# $HERMES_OSINT_CACHE/icij/ (default: ~/.cache/hermes-osint/icij/). +python3 SKILL_DIR/scripts/fetch_icij_offshore.py --entity "EXAMPLE CORP" \ + --out data/icij.csv +``` + +**Identity / property / litigation / archives / news** + +```bash +# NYC property records (deeds, mortgages, liens) — ACRIS via Socrata +python3 SKILL_DIR/scripts/fetch_nyc_acris.py --name "SMITH, JOHN" \ + --out data/acris.csv +python3 SKILL_DIR/scripts/fetch_nyc_acris.py --address "571 HUDSON" \ + --out data/acris_addr.csv + +# OpenCorporates — 130+ jurisdiction corporate registry +# (free token required; set OPENCORPORATES_API_TOKEN or pass --token) +python3 SKILL_DIR/scripts/fetch_opencorporates.py --query "Example Corp" \ + --jurisdiction us_ny --out data/opencorporates.csv + +# CourtListener — federal + state court opinions, PACER dockets +python3 SKILL_DIR/scripts/fetch_courtlistener.py --query "Smith v. 
Example Corp" \ + --type opinions --out data/courts.csv + +# Wayback Machine — historical web captures +python3 SKILL_DIR/scripts/fetch_wayback.py --url "example.com" \ + --match host --collapse digest --out data/wayback.csv + +# Wikipedia + Wikidata — narrative bio + structured facts +# Set HERMES_OSINT_UA=your-app/1.0 (your@email) to identify yourself +python3 SKILL_DIR/scripts/fetch_wikipedia.py --query "Bill Gates" \ + --out data/wp.csv + +# GDELT — global news in 100+ languages, ~2015→present +python3 SKILL_DIR/scripts/fetch_gdelt.py --query '"Example Corp"' \ + --timespan 1y --out data/gdelt.csv +``` + +All outputs are normalized CSV with a header row. Re-run scripts idempotently. + +When a private individual won't be in a source (e.g. SEC EDGAR for a non-public- +company person, USAspending for someone who isn't a federal contractor, Senate +LDA for someone who isn't a lobbying client), the script returns 0 rows with a +clear warning rather than silently writing an empty CSV. EDGAR specifically +flags when the company-name resolver matched an individual Form 3/4/5 filer +rather than a corporate registrant. + +Rate-limit notes are in each source's wiki entry. Default fetchers sleep +politely between paginated requests. **API keys raise rate limits** for +sources that support them (`SEC_USER_AGENT`, `SENATE_LDA_TOKEN`, +`OPENCORPORATES_API_TOKEN`, `COURTLISTENER_TOKEN`). All scripts surface +429 responses immediately with the upstream's quota message so the user +knows to slow down or supply a key. + +### 3. 
Resolve entities across sources + +Normalize names and find matches between two CSV files: + +```bash +# Match lobbying clients (Senate LDA) against contract recipients (USAspending) +python3 SKILL_DIR/scripts/entity_resolution.py \ + --left data/lobbying.csv --left-name-col client_name \ + --right data/contracts.csv --right-name-col recipient_name \ + --out data/cross_links.csv +``` + +Three matching tiers with explicit confidence: + +| Tier | Method | Confidence | +|------|--------|------------| +| `exact` | Normalized strings equal after suffix/punctuation strip | high | +| `fuzzy` | Sorted-token equality (word-bag match) | medium | +| `token_overlap` | ≥60% token overlap, ≥2 shared tokens, tokens ≥4 chars | low | + +Output `cross_links.csv` columns: `match_type, confidence, left_name, +right_name, left_normalized, right_normalized, left_row, right_row`. + +### 4. Statistical timing correlation (optional) + +Test whether two time series cluster suspiciously close together — e.g. +lobbying filings near contract awards — using a permutation test: + +```bash +python3 SKILL_DIR/scripts/timing_analysis.py \ + --donations data/lobbying.csv --donation-date-col filing_date \ + --donation-amount-col income --donation-donor-col client_name \ + --donation-recipient-col registrant_name \ + --contracts data/contracts.csv --contract-date-col award_date \ + --contract-vendor-col recipient_name \ + --cross-links data/cross_links.csv \ + --permutations 1000 \ + --out data/timing.json +``` + +The script's column flags are intentionally generic — the original tool was +written for donations vs awards, but it works for any (event, payee) time +series joined through cross-links. Null hypothesis: event timing is +independent of award dates. One-tailed p-value = fraction of permutations +with mean nearest-award distance ≤ observed. Minimum 3 events per (payer, +vendor) pair to run the test. + +### 5. 
Build the findings JSON (evidence chain) + +```bash +python3 SKILL_DIR/scripts/build_findings.py \ + --cross-links data/cross_links.csv \ + --timing data/timing.json \ + --out data/findings.json +``` + +Every finding has `id, title, severity, confidence, summary, evidence[], sources[]`. +Each evidence item points back to a specific row in a source CSV. The user (or a +follow-up agent) can verify every claim against its source. + +## Confidence and evidence discipline + +This is the load-bearing rule of the skill. Tell the user: + +- Every claim must trace to a record. No naked assertions. +- Confidence tier travels with the claim. `match_type=fuzzy` is "probable", + not "confirmed." +- Entity resolution produces candidates, NOT conclusions. A `fuzzy` match + between "ACME LLC" and "Acme Holdings Group" is a lead, not a fact. +- Statistical significance ≠ wrongdoing. p < 0.05 means the timing pattern + is unlikely under the null. It does not establish corruption. +- All data sources here are public records. They may still contain + inaccuracies, stale info, or redactions (GDPR, sealed records). + +## Adding a new data source + +Use the template: + +```bash +cp SKILL_DIR/templates/source-template.md \ + SKILL_DIR/references/sources/.md +``` + +Fill in all 9 sections. Write a `fetch_.py` script in `scripts/` that +uses stdlib only and writes a normalized CSV. Update the source list in the +"When to use" section above. + +## Tools and their limits + +- `entity_resolution.py` does NOT use external fuzzy libraries (no rapidfuzz, + no jellyfish). Token-bag matching is the upper bound here. If you need + Levenshtein, transliteration, or phonetic matching, pip-install separately. +- `timing_analysis.py` uses Python's `random` for permutations. For + reproducibility, pass `--seed N`. +- `fetch_*.py` scripts use `urllib.request` and respect `Retry-After`. Heavy + bulk usage may still violate ToS — read each source's legal section first. 
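
The three matching tiers described for `entity_resolution.py` can be sketched with the standard library alone. This is an illustrative sketch, not the script's actual code: the `SUFFIXES` set, the `normalize` and `match_tier` helpers, and the choice to measure the 60% overlap against the smaller token set are all assumptions made for the example.

```python
import re
from typing import Optional

# Hypothetical suffix list for illustration; the real script's list may differ.
SUFFIXES = {"llc", "inc", "corp", "co", "ltd", "lp", "llp"}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, drop trailing corporate suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def match_tier(left: str, right: str) -> Optional[str]:
    """Return 'exact', 'fuzzy', 'token_overlap', or None when nothing matches."""
    a, b = normalize(left), normalize(right)
    if not a or not b:
        return None
    if a == b:
        return "exact"  # high confidence
    if sorted(a.split()) == sorted(b.split()):
        return "fuzzy"  # medium confidence: same words, different order
    # Only tokens of 4+ characters count toward the overlap tier.
    ta = {t for t in a.split() if len(t) >= 4}
    tb = {t for t in b.split() if len(t) >= 4}
    shared = ta & tb
    # Assumption: overlap ratio is taken against the smaller token set.
    if len(shared) >= 2 and len(shared) / min(len(ta), len(tb)) >= 0.6:
        return "token_overlap"  # low confidence: a lead, not a fact
    return None
```

Because every tier relies on whole-token comparison, a misspelling such as "Acmee" would not match "Acme" under this sketch, which is the limit the bullet on external fuzzy libraries describes.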
+ +## Legal note + +All Phase-1 sources are public records. Bulk acquisition is permitted under +their respective access terms (FOIA, public records law, ICIJ explicit +publication, OFAC public data). However: + +- Some sources rate-limit aggressively. Respect their headers. +- Some redact registrant info (GDPR on WHOIS, sealed filings). +- Cross-referencing public records to identify private individuals can have + ethical implications. The skill produces evidence chains, not accusations. diff --git a/website/sidebars.ts b/website/sidebars.ts index f619f2318c9..1a0aa6fb0bb 100644 --- a/website/sidebars.ts +++ b/website/sidebars.ts @@ -554,6 +554,7 @@ const sidebars: SidebarsConfig = { 'user-guide/skills/optional/research/research-drug-discovery', 'user-guide/skills/optional/research/research-duckduckgo-search', 'user-guide/skills/optional/research/research-gitnexus-explorer', + 'user-guide/skills/optional/research/research-osint-investigation', 'user-guide/skills/optional/research/research-parallel-cli', 'user-guide/skills/optional/research/research-qmd', 'user-guide/skills/optional/research/research-scrapling',